Station announcements, from a spreadsheet and some files back to announcements

Earlier this year, Scotrail published a roughly two-hour MP3 file containing many of the little snippets of sound used in automatic station announcements. It looks like it was published in June, but it seems that it only really got attention in August. The snippets were conveniently separated by two seconds of silence, which made splitting the file automatically a practical proposition. That was duly done, and then a great effort was made on a Google Docs sheet to label each snippet of audio.

You can read a bit more about the story behind this here. In particular, I quite like Ambient Scotrail Beats.

Several people sent this to me, thinking I'd be interested, and they were right! It's great fun to put together a realistic sounding message which apologises for the inconvenience caused by Scotrail, or says Leeds is cancelled, though there were definitely some much funnier creations.

I decided to try to do something a bit more unexpected (and difficult) with this - to try to use it for its original purpose, automatically generating announcements for real trains in Scotland.

The rail data

National Rail operates a system called Darwin, which acts as the single source of truth for passenger rail information in the UK. Virtually all station departure boards and announcements are fed by it. Happily, the general public are able to use these feeds. They're available in three main forms - one is a message queue with quite significant overheads for storing/parsing the information, and the other two are SOAP APIs with a query/response model. Since I'm not trying to announce every station at once, and polling is fine, I opted for one of the SOAP APIs, specifically LDBSVWS.

Darwin has standardised delay/cancellation reason codes, covering everything from wartime bombs being discovered near the railway to supermarket trolleys on the tracks. You can see a list I dumped earlier this year from another project here.

Darwin also has standardised location codes, in two forms - there's the three-letter "CRS" code you may already be familiar with (e.g. EDB for Edinburgh Waverley), with one code per station, but there's also another sort of location code, the Timing Point Location, or TIPLOC, four to seven characters long. TIPLOCs are used for scheduling by Network Rail, and they encompass a lot more than just stations - junctions, depots, and a variety of other points of interest - everything necessary to define the route of a train over the network.

Importantly, TIPLOCs allow us to make a distinction that's also made in announcements. Some stations have multiple TIPLOCs, where trains take multiple routes through them. The best example is probably St Pancras London, which has three (STPXBOX - Low Level/Thameslink, STPX - Midland Main Line, STPANCI - High Speed 1).

Of specific interest to us are Glasgow's Queen St and Central stations, since these have high and low level platforms, which have different announcements.
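As a minimal sketch of that relationship, one CRS code can be modelled as covering several TIPLOCs. The three St Pancras TIPLOCs are the ones listed above; the dictionary, the helper name, and the use of STP as the shared CRS code are my own assumptions for illustration.

```python
# Illustrative only: one CRS code can cover several TIPLOCs.
# The TIPLOCs come from the text above; "STP" as the shared CRS code
# is an assumption, not something taken from Darwin here.
CRS_TO_TIPLOCS = {
    "STP": {
        "STPXBOX": "Low Level/Thameslink",
        "STPX": "Midland Main Line",
        "STPANCI": "High Speed 1",
    },
}


def tiplocs_for_crs(crs):
    """Return the sorted TIPLOCs known for a CRS code (empty list if unknown)."""
    return sorted(CRS_TO_TIPLOCS.get(crs.upper(), {}))


print(tiplocs_for_crs("stp"))
```

An announcement keyed only on the CRS code couldn't tell these apart, which is exactly the problem with Glasgow's high and low level platforms.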

The announcements

Since all of the files have already been conveniently split, and people have helpfully transcribed them, it's quite easy to get a sense of how announcements are constructed.

For a practical example, let's imagine a nonsense single carriage train departing at 0127 for Euston and then Milngavie.

"Platform 1 for the delayed 0127 Scotrail Service to Milngavie, calling at London Euston, and Milngavie. This train is formed of 1 coach only. A trolley service of drinks and light refreshments is available on this train"

This could be formed by thirteen files:

Some of these could be further broken up:

"Platform 1 for the delayed"

"Scotrail service to"

"This train is formed of 1 coach"

This isn't preferable, though, since longer fragments sound more natural when they're available.

Most stations have two versions, a mid and an end version, with the first sounding more natural when in the middle of a list, and the latter sounding natural at the end.
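The mid/end rule is simple enough to sketch: every calling point except the last gets its mid recording, and the last gets its end recording. The function name is hypothetical, and the STN_<TIPLOC>_MID / STN_<TIPLOC>_END naming is borrowed from the renamed files shown later in this post. Any joining word such as "and" is left out here.

```python
def calling_point_files(tiplocs):
    """Pick the MID recording for every stop except the last, which gets END."""
    if not tiplocs:
        return []
    files = [f"STN_{tiploc}_MID" for tiploc in tiplocs[:-1]]
    files.append(f"STN_{tiplocs[-1]}_END")
    return files


# A nonsense route, in keeping with the example above.
print(calling_point_files(["PADTON", "ALXANDR"]))
```

A train with a single calling point only ever needs the end version.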

Station coverage is not necessarily great. We've got a lot (but not everything) in Scotland, but coverage in England and Wales is limited to stations which trains going to or from Scotland call at.

Renaming everything

I found this previous release from Network Rail quite helpful in offering some inspiration for the strategy I took - standardised filenames.

So, the filenames have to be relatively short, and they have to either match Darwin data (stations, reason codes), or they should have descriptive names.

I wrote a few crude Python scripts to do this. I'll explain what they do, in order.

generate_locations.py

This takes a table dump from another project of mine, with one row per TIPLOC, giving the matching CRS (if any), the Darwin name, and the TIPLOC-specific name derived from CORPUS/BPLAN, and it transforms it in preparation for use later on.

filter_ann_locations.py

This script uses the transcriptions table to try to match sound files to TIPLOCs.

The vast majority of transcriptions were fairly straightforward to match, just by matching both the CRS and Darwin station name at once. In this case, the script starts with the assumption that the announcement covers every TIPLOC represented by the station's CRS code.

In a few cases, the transcriptions were matched with normalised CORPUS/BPLAN names, but not Darwin names. There were a few reasons for this: one is the names of stations changing, another is the separate announcements for Glasgow Queen St and Central stations' low level platforms. Since BPLAN names are TIPLOC-specific, these were used to override the one-size-fits-all assumption made for Darwin station names.

I opted to require the CRS and name to both match at once because I expected that people were likely to make mistakes, either typos or because they thought they knew the code and got it wrong. I found a few cases like this, and various errors or edge cases were handled with manually specified remappings at the top of the script.
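The "both must match" rule can be sketched roughly like this. It is not the actual script: the function names, the normalisation, and the sample table (including the EDINBUR TIPLOC) are assumptions made for illustration.

```python
def normalise(name):
    """Lower-case and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_station(transcribed_crs, transcribed_name, stations):
    """Return the station's TIPLOCs only if the CRS and the name both agree."""
    station = stations.get(transcribed_crs.strip().upper())
    if station is None:
        return None  # unknown CRS code, possibly a typo
    if normalise(station["name"]) != normalise(transcribed_name):
        return None  # CRS exists but belongs to a different station
    return list(station["tiplocs"])


# Hypothetical sample data, keyed by CRS code.
STATIONS = {
    "EDB": {"name": "Edinburgh Waverley", "tiplocs": ["EDINBUR"]},
}

print(match_station("edb", "Edinburgh  Waverley", STATIONS))
print(match_station("EDB", "Glasgow Central", STATIONS))
```

Anything that returns None falls through to the BPLAN name matching or to the manual remappings.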

Remappings are dumped in maps/stations.txt in the format NEW_FILE_NAME 0001

Representative output:

LOCATION_BRODICK 1878
STN_HAMLTNW_MID_B 1879
LOCATION_NEWARK_0 1881
STN_NWTL_2 1882
LOCATION_WORCESTER_1 1883
STN_ALXANDR_END 1886
STN_PADTON_MID 1924
STN_STPX_MID_B 1925
STN_MNCROXR_MID_B 1937
STN_TYNDRMU_END 1939
LOCATION_BRODICK 1944
LOCATION_MANCHESTER_1 1946
STN_HAMLTNW_END_B 1947
STN_NWTL_3 1950
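Reading a map in this format back in takes only a few lines. This is a sketch rather than the real code, and the function name is made up. Note that a name such as LOCATION_BRODICK can appear against more than one file number, so the lookup by name keeps a list.

```python
from collections import defaultdict


def parse_map(text):
    """Parse 'NEW_FILE_NAME 0001' lines into lookups in both directions."""
    name_by_id = {}
    ids_by_name = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The file number is always the last whitespace-separated field.
        name, file_id = line.rsplit(None, 1)
        name_by_id[file_id] = name
        ids_by_name[name].append(file_id)
    return name_by_id, dict(ids_by_name)


SAMPLE = """LOCATION_BRODICK 1878
STN_HAMLTNW_MID_B 1879
LOCATION_BRODICK 1944
"""

name_by_id, ids_by_name = parse_map(SAMPLE)
print(name_by_id["1879"])
print(ids_by_name["LOCATION_BRODICK"])
```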

filter_ann_general.py

This script matches most other things. Most Darwin delay/cancellation codes are easily matched, since the announcements specifically cover them and usually do so word-for-word.

Some words/phrases are manually specified, and a few are matched with a list of regexes/substitutions.

Representative output:

NUMBER_18_END 1403
NUMBER_19_END 1404
WORD_TO 1406
NUMBER_20_END 1407
WE_APOLOGISE_FOR_INCONVENIENCE_CAUSED 1488
MAY_NOT_BE_AVAILABLE_THIS_TR 1501
DUE_TO 1528
IS_BEING_DELAYED 1530
REASON_896 1536
REASON_184 1593
SERVICE_TO 1644
CHIMES 1646
TR_NOW_APP_PLAT 1679
THIS_SERVICE_FULL 1691
HISTORICAL_OPERATOR_VIRGIN_TRAINS 1757
CONSIDER_ALTERNATIVE_TRANSPORT 1765
TO_REACH_YOUR_DESTINATION 1766
SERVICE_FROM 1778
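To give a flavour of the regex/substitution approach, here's a sketch of a rule list where the first pattern to match the whole transcription wins. The rules and the transcription text are invented for illustration; only the output names are taken from the listing above.

```python
import re

# Hypothetical rules: (pattern, function building the new name from the match).
RULES = [
    (re.compile(r"(\d+)"), lambda m: f"NUMBER_{m.group(1)}_END"),
    (
        re.compile(r"service (to|from)", re.IGNORECASE),
        lambda m: f"SERVICE_{m.group(1).upper()}",
    ),
]


def name_for(transcription):
    """Return a new file name for a transcription, or None if no rule fits."""
    text = transcription.strip()
    for pattern, build in RULES:
        match = pattern.fullmatch(text)
        if match:
            return build(match)
    return None


print(name_for("18"))
print(name_for("Service to"))
```

Using fullmatch rather than search keeps a rule for "18" from accidentally claiming a longer phrase that happens to contain a number.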

do_inflection.py

Virtually all stations have two sound files, one for the middle of a list and one for the end. The transcription sheet does note this in a few cases, but nowhere near often enough, so I wrote a script which played all of the sound files for a station and then prompted me on whether they were in the right order or not.

This was fairly quick for all the stations with two sound files. Stations with a different number of sound files were all manually specified.
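The core of that prompt loop might look something like the sketch below, for the two-file case only. Audio playback is passed in as a function, since how the files actually get played is not covered here; all the names are hypothetical.

```python
def order_pair(files, play, ask=input):
    """Play two recordings, then return them as (mid, end)."""
    first, second = files
    play(first)
    play(second)
    answer = ask("Was that mid, then end? [y/n] ").strip().lower()
    # Swap the pair if they were played in the wrong order.
    return (first, second) if answer.startswith("y") else (second, first)
```

Called with a real player and the default input prompt, this asks one question per station.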

reform.py

This script clears the output directory, then copies sound files with the new names defined in the maps, taking input from the inflection list to rename station calls as needed.
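Stripped of the inflection handling, the clear-then-copy step could be sketched as below. I'm assuming here that the split files are named by number with an .mp3 extension, and the function name is made up. If two numbers map to the same new name, the later copy overwrites the earlier one, which is part of why the inflection list is needed.

```python
import shutil
from pathlib import Path


def reform(source_dir, output_dir, name_by_id, extension=".mp3"):
    """Clear output_dir, then copy each numbered file under its new name."""
    source, output = Path(source_dir), Path(output_dir)
    if output.exists():
        shutil.rmtree(output)  # start from a clean output directory
    output.mkdir(parents=True)
    copied = []
    for file_id, new_name in sorted(name_by_id.items()):
        src = source / f"{file_id}{extension}"
        if not src.is_file():
            continue  # skip map entries with no matching sound file
        dest = output / f"{new_name}{extension}"
        shutil.copyfile(src, dest)
        copied.append(dest.name)
    return copied
```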

No, you don't want to see these scripts

And isn't it the results that matter, anyway?

Actually announcing trains

There's honestly not much to say about this: it's here, it's open source, it only really works for stations in Scotland, some of the time, and it probably has a lot of exciting bugs. At the moment, it announces trains when they're about to depart, and when they change platforms.

Conclusion

I think it's interesting how much of this I was able to extract just by judicious use of regular expressions and simple comparison. The transcription spreadsheet wasn't made by people familiar with open rail data, which presented some difficulties, but it still saved an enormous amount of time, and made this project feasible.

Very cool, anonymous volunteers, thank you.

I'd also like to thank Peter Hicks for encouraging me to do this, and the members of the Railfurs group for their input, insight, and memes.