Earlier this year, Scotrail published a roughly two-hour MP3 file containing many of the little snippets of sound used in automatic station announcements. It looks like it was published in June, but it only really got attention in August. Each snippet was conveniently separated by two seconds of silence, which made splitting the file automatically a practical proposition. That was duly done, and a great effort was then made on a Google Docs sheet to label each snippet of audio.
You can read a bit more about the story behind this here. In particular, I quite like Ambient Scotrail Beats.
Several people sent this to me, thinking I'd be interested, and they were right! It's great fun to put together a realistic-sounding message which apologises for the inconvenience caused by Scotrail, or says Leeds is cancelled, though there were definitely some much funnier creations.
I decided to try to do something a bit more unexpected (and difficult) with this - to try to use it for its original purpose, automatically generating announcements for real trains in Scotland.
The rail data
National Rail operates a system called Darwin, which acts as the single source of truth for passenger rail information in the UK; virtually all station departure boards and announcements are fed by it. Happily, the general public are able to use these feeds. The data is available in three main forms: one is a message queue with quite significant overheads for storing and parsing the information, and the other two are SOAP APIs with a query/response model. Since I'm not trying to announce every station at once, and polling is fine, I opted for one of the SOAP APIs, specifically LDBSVWS.
Darwin has standardised delay/cancellation reason codes, covering everything from wartime bombs being discovered near the railway to supermarket trolleys on the tracks. You can see a list I dumped earlier this year from another project here.
Darwin also has standardised location codes, in two forms - there's the three-letter "CRS" code you may already be familiar with (e.g. EDB for Edinburgh Waverley), with one code per station, but there's also another sort of location code, the Timing Point Location, or TIPLOC, four to seven characters long. TIPLOCs are used for scheduling by Network Rail, and they encompass a lot more than just stations - junctions, depots, and a variety of other points of interest - everything necessary to define the route of a train over the network.
Importantly, TIPLOCs allow us to make a distinction that's also made in announcements. Some stations have multiple TIPLOCs, where trains take multiple routes through them. The best example is probably St Pancras London, which has three (STPXBOX - Low Level/Thameslink, STPX - Midland Main Line, STPANCI - High Speed 1).
Of specific interest to us are Glasgow's Queen St and Central stations, since these have high and low level platforms, which have different announcements.
- GLGQLL (Glasgow Queen St Low Level - announced as "Glasgow Queen Street Low Level")
- GLGQHL (Glasgow Queen St High Level - announced as "Glasgow Queen Street")
- GLGC (Glasgow Central High Level - "Glasgow Central")
- GLGCLL (Glasgow Central Low Level - "Glasgow Central Low Level")
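To make the distinction concrete, here is a minimal Python sketch of a TIPLOC-keyed lookup. The four entries come straight from the list above; the `announced_name` helper and its fallback to a station-level name are assumptions for illustration, not how the real code is organised.

```python
# TIPLOC-specific announcement wording, taken from the list above.
ANNOUNCED_NAMES = {
    "GLGQLL": "Glasgow Queen Street Low Level",
    "GLGQHL": "Glasgow Queen Street",
    "GLGC": "Glasgow Central",
    "GLGCLL": "Glasgow Central Low Level",
}


def announced_name(tiploc: str, station_name: str) -> str:
    """Return the TIPLOC-specific wording if there is one.

    Hypothetical helper: falls back to the station-level name for any
    TIPLOC that has no special-cased announcement.
    """
    return ANNOUNCED_NAMES.get(tiploc, station_name)
```

Keying on the TIPLOC rather than the CRS is what lets the two levels of the same station be announced differently.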
Since all of the files have already been conveniently split, and people have helpfully transcribed them, it's quite easy to get a sense of how announcements are constructed.
For a practical example, let's imagine a nonsense single carriage train departing at 0127 for Euston and then Milngavie.
"Platform 1 for the delayed 0127 Scotrail Service to Milngavie, calling at London Euston, and Milngavie. This train is formed of 1 coach only. A trolley service of drinks and light refreshments is available on this train"
This could be formed by thirteen files:
- Platform 1 for the delayed
- Scotrail Service to
- Milngavie (mid inflection)
- Calling at
- London Euston (mid inflection)
- Milngavie (end inflection)
- This train is formed of 1 coach only
- A trolley service
- of drinks and light refreshments
- is available on this train
Some of these could be further broken up:
"Platform 1 for the delayed"
- for the
"Scotrail service to"
- service to
"This train is formed of 1 coach"
- This train is formed of
This isn't preferable, though, since longer fragments sound more natural where they're available.
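One way to express "prefer the longest fragment" is a greedy longest-match over the words of the announcement. The sketch below is an assumption about how this could be done, not a description of the real code: `available` is a hypothetical set of transcribed phrases, and a greedy choice can occasionally paint itself into a corner where a different split would have worked.

```python
def split_into_fragments(words: list[str], available: set[str]) -> list[str]:
    """Cover `words` with recorded phrases, always trying the longest first."""
    fragments = []
    i = 0
    while i < len(words):
        # Try the longest candidate starting at word i, then shrink it.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in available:
                fragments.append(candidate)
                i = j
                break
        else:
            # No recording starts at this word, so give up loudly.
            raise ValueError(f"no recording covers {words[i]!r}")
    return fragments
```

With both "scotrail service to" and "service to" available, this picks the longer recording and only falls back to the shorter pieces when it has to.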
Most stations have two versions, a mid and an end version, with the first sounding more natural when in the middle of a list, and the latter sounding natural at the end.
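As a rough illustration of the mid/end rule, the Python sketch below builds the calling-point part of an announcement. The labels mirror the list earlier in this post; the separate "and" snippet is an assumption on my part, since the example announcement says "and" but I haven't shown a file for it.

```python
def calling_points(stations: list[str]) -> list[str]:
    """Pick an inflection for each station in a calling-point list."""
    if not stations:
        return []
    *middle, last = stations
    # Every station except the final one uses the mid inflection.
    fragments = [f"{name} (mid inflection)" for name in middle]
    if middle:
        # Assumption: a standalone "and" snippet joins the final station on.
        fragments.append("and")
    # The final station uses the end inflection.
    fragments.append(f"{last} (end inflection)")
    return fragments
```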
Station coverage is not necessarily great. We've got a lot (but not everything) in Scotland, but coverage in England and Wales is limited to stations which trains going to or from Scotland call at.
I found this previous release from Network Rail quite helpful in offering some inspiration for the strategy I took - standardised filenames.
So, the filenames have to be relatively short, and they have to either match Darwin data (stations, reason codes), or they should have descriptive names.
I wrote a few crude Python scripts to do this, and I'll explain what they do in order.
This takes a table dump from another project of mine, with one row per TIPLOC containing the matching CRS (if any), the Darwin name, and the TIPLOC-specific name derived from CORPUS/BPLAN, and transforms it in preparation for use later on.
This script uses the transcriptions table to try to match sound files to TIPLOCs.
The vast majority of transcriptions were fairly straightforward to match, just by matching both the CRS and Darwin station name at once. In this case, the script starts with the assumption that the announcement covers every TIPLOC represented by the station's CRS code.
In a few cases, the transcriptions matched the normalised CORPUS/BPLAN names, but not the Darwin name. There were a few reasons for this: one is station names changing, another is the separate announcements for the low level platforms at Glasgow Queen St and Central. Since BPLAN names are TIPLOC-specific, these were used to override the one-size-fits-all assumption made for Darwin station names.
I opted to require the CRS and name to both match at once because I expected that people were likely to make mistakes, either typos or because they thought they knew the code and got it wrong. I found a few cases like this, and various errors or edge cases were handled with manually specified remappings at the top of the script.
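The matching rule described above can be sketched in Python roughly as follows. This is an illustration only: the row layout (`tiploc`, `crs`, `darwin_name`, `bplan_name`) and the `normalise` helper are assumptions, and the manual remappings are left out.

```python
import re


def normalise(name: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace (assumed rules)."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def match_transcription(crs: str, name: str, rows: list[dict]) -> list[str]:
    """Return the TIPLOCs that a transcribed (CRS, name) pair covers.

    `rows` is a hypothetical table with one dict per TIPLOC, holding the
    keys "tiploc", "crs", "darwin_name" and "bplan_name".
    """
    wanted = normalise(name)
    # The CRS must match as well as the name, to catch mistyped codes.
    same_crs = [row for row in rows if row["crs"] == crs.upper()]
    # A TIPLOC-specific (BPLAN) name match overrides the station-wide rule.
    specific = [r["tiploc"] for r in same_crs if normalise(r["bplan_name"]) == wanted]
    if specific:
        return specific
    # Otherwise a Darwin name match covers every TIPLOC under that CRS.
    if any(normalise(r["darwin_name"]) == wanted for r in same_crs):
        return [r["tiploc"] for r in same_crs]
    return []
```

An empty result is the useful part: anything that fails to match on both fields is a candidate for a manual remapping rather than a silent guess.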
Remappings are dumped in maps/stations.txt in the format
This script matches most other things. Most Darwin delay/cancellation codes are easily matched, since the announcements specifically cover them and usually do so word-for-word.
Some words/phrases are manually specified, a few are matched with a list of regexes/substitutions.
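For a flavour of what word-for-word matching with a list of substitutions might look like, here is a small Python sketch. The substitution list, the function names and the reason code in the usage note are all made up for illustration; the real list and the real Darwin codes are different.

```python
import re
from typing import Optional

# Hypothetical substitutions, applied in order to lowercased text.
SUBSTITUTIONS = [
    (re.compile(r"[^a-z0-9 ]+"), " "),            # punctuation to spaces
    (re.compile(r"\bscot rail\b"), "scotrail"),   # a common transcription split
    (re.compile(r"\s+"), " "),                    # collapse runs of whitespace
]


def normalise_phrase(text: str) -> str:
    """Apply every substitution in turn, then trim the ends."""
    text = text.lower()
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text.strip()


def match_reason(transcription: str, reasons: dict[str, str]) -> Optional[str]:
    """Return the code whose text matches the transcription exactly."""
    wanted = normalise_phrase(transcription)
    for code, text in reasons.items():
        if normalise_phrase(text) == wanted:
            return code
    return None
```

Given a made-up table such as `{"X1": "a supermarket trolley on the track"}`, a transcription with stray punctuation or capitals still lands on `X1`, and anything else returns `None` for manual attention.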
Virtually all stations have two sound files, one each for middle and end. The transcription sheet does note this in a few cases, but not in anywhere near enough, so I wrote a script which played all of the sound files for a station and then prompted me on whether they were in the right order or not.
This was fairly quick for all the stations with two sound files. Stations with a different number of sound files were all manually specified.
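The review loop is simple enough to sketch in Python. This is only an approximation of the idea: `play` and `ask` are hypothetical callables passed in by the caller (an audio player wrapper and `input`, say), which also keeps the loop testable without speakers.

```python
from typing import Callable


def review_inflections(
    stations: dict[str, list[str]],
    play: Callable[[str], None],
    ask: Callable[[str], str],
) -> dict[str, dict[str, str]]:
    """Play each station's two files and record which is mid and which is end."""
    ordered = {}
    for name, files in stations.items():
        if len(files) != 2:
            # Any other number of files is left for manual specification.
            continue
        for sound_file in files:
            play(sound_file)
        answer = ask(f"{name}: was that mid then end? [y/n] ")
        if answer.strip().lower().startswith("y"):
            mid, end = files
        else:
            end, mid = files
        ordered[name] = {"mid": mid, "end": end}
    return ordered
```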
This script clears the output directory, then copies sound files with the new names defined in the maps, taking input from the inflection list to rename station calls as needed.
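In outline, that step could look like the Python sketch below. The function name, the shape of the two mappings and the `_mid`/`_end` suffix scheme are assumptions for illustration, not the naming the real script uses.

```python
import shutil
from pathlib import Path


def build_output(source_dir, output_dir, name_map, inflections):
    """Recreate `output_dir` and copy sound files into it under new names.

    `name_map` maps an original filename to a new stem; `inflections` maps
    an original filename to "mid" or "end" for station recordings only.
    Both layouts are hypothetical.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    # Clear anything left over from a previous run.
    if output_dir.exists():
        shutil.rmtree(output_dir)
    output_dir.mkdir(parents=True)
    for old_name, new_stem in name_map.items():
        suffix = inflections.get(old_name)
        # Station calls get an inflection suffix; everything else does not.
        new_name = f"{new_stem}_{suffix}.mp3" if suffix else f"{new_stem}.mp3"
        shutil.copyfile(source_dir / old_name, output_dir / new_name)
```

Wiping the directory first means a renamed or dropped mapping never leaves a stale file behind.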
No, you don't want to see these scripts
And isn't it the results that matter, anyway?
Actually announcing trains
There's honestly not much to say about this: it's here, it's open source, it only really works for stations in Scotland, some of the time, and it probably has a lot of exciting bugs. At the moment, it announces trains when they're about to depart, and when they change platforms.
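Since the data is polled, spotting a platform change comes down to comparing one poll with the next. The Python sketch below shows that idea only; the snapshot shape (a dict from a service identifier to a platform string) is an assumption and doesn't reflect the real response format.

```python
def platform_changes(previous: dict, current: dict) -> list[tuple]:
    """Compare two polls and list services whose platform has changed.

    Each snapshot is a hypothetical dict of service identifier to platform,
    with None where no platform is known yet.
    """
    changes = []
    for service_id, platform in current.items():
        old = previous.get(service_id)
        # Only report a change between two known, different platforms.
        if old is not None and platform is not None and old != platform:
            changes.append((service_id, old, platform))
    return changes
```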
I think it's interesting how much of this I was able to extract just by judicious use of regular expressions and simple comparison. The transcription spreadsheet wasn't made by people familiar with open rail data, which presented some difficulties, but it still saved an enormous amount of time, and made this project feasible.
Very cool, anonymous volunteers, thank you.
I'd also like to thank Peter Hicks for encouraging me to do this, and the members of the Railfurs group for their input, insight, and memes.