Import: Bus stops from GTFS

NeatNit · August 24, 2024, 6:40pm

As those of you following the Telegram chat will probably know, I have been working on an import script to import bus stops from GTFS on a regular basis. You can find the code repository on Codeberg.

I’ve been developing the script and testing it against the OSM development/sandbox server (Sandbox for editing - OpenStreetMap Wiki) and I think it’s ready for some live testing! So I want to select an area of the map, a small bounding box, and run the script there.

Please ask any questions you want. If you want to test something specific in the script’s behavior before the live test, now is the opportunity - contact me and we can test together.

Edit: if no one has any objection, I will perform the import in Nesher some time in the next week.

Anyone following on Telegram probably already knows that I'm working on a script to import bus stops from GTFS on a regular basis. You can see the code here.

I developed the script and tested it against the OSM development environment, and I think it's ready for a live trial. So I want to pick a small rectangular area and run the script there.

Please ask any questions you like. If you want to check something specific in the import's behavior before trying it for real, now is the time - contact me and we'll test it together.

NeatNit · August 24, 2024, 7:54pm

To preemptively answer some questions:

This is similar to the old import script, but the results will have some differences:

The script only works on nodes, not ways. We’ve already fixed the only 2 instances of bus stops mapped as ways. overpass turbo (the platforms in Tel Aviv are not relevant)
Stop names are fixed. For example, the name ד’'ר סטופ/הנרייטה סולד contains '' (two apostrophes), and my script corrects it to ״ (that’s gershayim, גרשיים).
Addresses are applied as object:* tags instead of addr:* tags, because bus stops don’t really have addresses. The addr:* tags should be deleted in bulk, but this is outside of the responsibilities of the script.
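
The name fix described above can be sketched roughly like this. It is an illustration only, not the script's actual code: the function name `fix_stop_name` is hypothetical, and it assumes the only correction is replacing two consecutive ASCII apostrophes with the gershayim character (U+05F4).

```python
# Hypothetical sketch of the stop-name fix; not taken from the import script.
GERSHAYIM = "\u05f4"  # Hebrew punctuation gershayim (״)


def fix_stop_name(name: str) -> str:
    """Replace each pair of ASCII apostrophes ('') with a single gershayim."""
    return name.replace("''", GERSHAYIM)


if __name__ == "__main__":
    # Example name from the post, written here with two ASCII apostrophes.
    print(fix_stop_name("ד''ר סטופ/הנרייטה סולד"))
```

Names that contain no doubled apostrophe pass through unchanged.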

There are a number of technical differences as well. The Readme in the Codeberg page has some more details.

In terms of safety and correctness:

The script will only create new stops if they actually have scheduled buses. However, if a stop is already mapped, it will keep it updated regardless.
The script will never move or delete a stop if it’s part of a way. These situations will need to be resolved manually. Stops should never be part of a way, but there are some occurrences of this: overpass turbo
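
As a rough sketch, the two safety rules above amount to a pair of guards. The class and function names here are hypothetical and the real script may structure this differently; the sketch assumes the set of stop IDs with scheduled buses and the way-membership flag have already been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class OsmStop:
    """Minimal stand-in for a mapped bus stop node (hypothetical type)."""
    node_id: int
    part_of_way: bool  # True if any way references this node


def should_create(stop_id: str, scheduled_stop_ids: set[str]) -> bool:
    """Only create a new stop if it actually has scheduled buses."""
    return stop_id in scheduled_stop_ids


def may_move_or_delete(stop: OsmStop) -> bool:
    """Never move or delete a stop that is part of a way; leave it for manual review."""
    return not stop.part_of_way
```

Stops that are already mapped are not subject to `should_create`; per the post, they are kept updated regardless of whether they have scheduled buses.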

NeatNit · August 27, 2024, 9:48am

Since there are no replies, I will go ahead with the first import in Nesher.

NeatNit · August 27, 2024, 9:53am

on OSMCha:

NeatNit · August 27, 2024, 12:36pm

I would love to hear any feedback, even if it’s just “looks good” or criticism.

I noticed that the GTFS data includes רכבלית (cable car) stops. One was imported as Node: ‪טכניון מרכז‬ (‪12139588201‬) | OpenStreetMap, but I’ve fixed the code and I will delete this node now.

This also made me find that כרמלית (funicular) is also in the data, and I’m now filtering that as well.
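
The post does not say how the filtering is done. One possible approach, shown here purely as an assumption, is to keep only stops served by routes whose `route_type` in routes.txt is 3 (bus in the GTFS reference, where 6 is aerial lift and 7 is funicular). Whether the Israeli feed encodes the cable car and the Carmelit with those values is not confirmed here.

```python
from typing import Iterable, Mapping

BUS_ROUTE_TYPE = "3"  # GTFS route_type for bus; CSV values are read as strings

Row = Mapping[str, str]  # one CSV row, e.g. as produced by csv.DictReader


def bus_stop_ids(routes: Iterable[Row], trips: Iterable[Row],
                 stop_times: Iterable[Row]) -> set[str]:
    """Return the stop_ids visited by at least one trip on a bus route."""
    bus_routes = {r["route_id"] for r in routes
                  if r["route_type"] == BUS_ROUTE_TYPE}
    bus_trips = {t["trip_id"] for t in trips if t["route_id"] in bus_routes}
    return {st["stop_id"] for st in stop_times if st["trip_id"] in bus_trips}
```

A stop served only by a non-bus route never enters the result, so it would not be created as a bus stop.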

skyper · August 29, 2024, 11:27pm

Hi NeatNit,
I only had a quick glance but in general it looks good.
I still find addr:street=*. The information is already present in object:street=*, I guess, and you have probably just forgotten to delete the former tag. The same is probably true for gtfs:addr:housenumber=*, which adds no value if object:housenumber=* is present. Both of these, along with source=israel_gtfs, look like leftovers from the past and could be deleted together with your updates.

Regarding cable cars and funiculars, these could be included but I understand it if you first want to add/update bus stops and leave the rest for the future.

NeatNit · August 30, 2024, 8:05am

Yes, this was silly of me - I thought it would be more appropriate to delete the old address tags in a separate script. On second thought it’s really not, so I’ve now added that in and run the script again: OSMCha

I’m still using source=israel_gtfs to identify which stops are imported and which are manually added. The script will only delete stops that have the source=israel_gtfs tag.
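
A minimal sketch of these two behaviors, with hypothetical function names. Which keys exactly count as "old address tags" is an assumption here (every addr:* key); the real script may use a different list.

```python
def may_delete(tags: dict[str, str]) -> bool:
    """Only stops created by the import (source=israel_gtfs) may be deleted."""
    return tags.get("source") == "israel_gtfs"


def strip_old_address_tags(tags: dict[str, str]) -> dict[str, str]:
    """Return a copy of the tags without addr:* keys (assumed cleanup set)."""
    return {k: v for k, v in tags.items() if not k.startswith("addr:")}
```

For example, a manually added stop without the source tag would fail `may_delete` and be left alone even if it is missing from the GTFS data.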

NeatNit · August 30, 2024, 9:37am

I added Israel’s GTFS feed to the wiki: List of GTFS feeds - OpenStreetMap Wiki
I included a note about their misconfigured server that I had to work around, to save the trouble for any future users of this info.

I’m thinking of deleting source=israel_gtfs like you suggested, and instead adding gtfs:stop_id:IL-MOT=* to identify imported stops. Could also add other data too: GTFS - OpenStreetMap Wiki

Thoughts?

skyper · August 30, 2024, 11:00am

Oh, sorry, I should have mentioned the gtfs:*[:*] tags and PTNA.

The gtfs tags recently went through a refinement and I am not sure if gtfs:stop_id=* + gtfs:feed=IL-MOT or gtfs:stop_id:IL-MOT=* is the better approach. For sure you can use all other appropriate GTFS tags.
Using these tags instead of source=* is definitely an improvement.

Regarding PTNA you might be interested to add Israel to it. I am sure @ToniE will happily add both, the analysis and the GTFS feed.

NeatNit · August 30, 2024, 11:25am

I have known about PTNA but never looked too closely into it. I get the impression that it’s designed for much smaller GTFS feeds like a single city, whereas Israel has a single giant GTFS feed for the entire country. I hope PTNA will be able to handle that gracefully. I guess the maintainer of PTNA will be able to answer that.

Great, I am already working on the code to make the change. Since it might need refinement I’ll run it on the dev instance first, hope you can take a look when I do.

The GTFS wiki page is definitive about preferring the latter, see Interpretation of deprecated gtfs:feed=* and other “Interpretation of deprecated” subsections on the page. It seems this info was added in March 2024 with the comment “Incorporate approved proposal GTFS tagging standard” so I think it’s up to date.

NeatNit · August 30, 2024, 11:47am

Here’s the test changeset: Changeset: 380542 | OpenStreetMap. Let me know if you spot any obvious problems.

skyper · August 30, 2024, 11:55am

Switzerland and the Netherlands have also only one single GTFS feed. So this should not be a problem.
The analysis is broken down into tariff zones but there are pages for complete countries like all railway routes in Germany.

Well, the proposal tried to solve the situation of several GTFS feeds and stop IDs for one object. It had only a few votes, and even the author backpedaled a bit, as can be read from [RFC] Feature Proposal - GTFS Tagging standard - #32 by ToniE onward.

spaanse · August 30, 2024, 11:55am

For the proposal I mostly wanted to get a consistent standard through.
The main benefit of gtfs:stop_id:IL-MOT is that if a script or feed adds information to the same object, it is clear which feed it came from.

The deprecation of gtfs:feed got the most pushback on the vote, which narrowly passed. Since we now have a good standard, making small changes is easier.
One of them is [RFC] Feature Proposal - GTFS Tagging standard - #40 by spaanse .

I want to suggest that if you are going to use the gtfs:feed approach, please fall back to gtfs:stop_id:IL-MOT if tags for other GTFS feeds are already present (or downgrade the other tags to that scheme and have yours be the authoritative one).
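
The fallback suggested here could look roughly like the following. This is a sketch under assumptions: the function name `stop_id_tags` is hypothetical, and "tags for other GTFS feeds are already present" is interpreted as either a gtfs:feed value other than IL-MOT or a gtfs:stop_id:<feed> key with a different suffix.

```python
FEED = "IL-MOT"


def stop_id_tags(existing: dict[str, str], stop_id: str,
                 feed: str = FEED) -> dict[str, str]:
    """Choose between gtfs:feed + gtfs:stop_id and the suffixed gtfs:stop_id:<feed>."""
    own_key = f"gtfs:stop_id:{feed}"
    other_feed_value = existing.get("gtfs:feed") not in (None, feed)
    other_suffixed_key = any(
        k.startswith("gtfs:stop_id:") and k != own_key for k in existing
    )
    if other_feed_value or other_suffixed_key:
        # Another feed already tagged this object: use the unambiguous form.
        return {own_key: stop_id}
    return {"gtfs:feed": feed, "gtfs:stop_id": stop_id}
```

The returned dictionary contains only the tags to add; merging them into the object and cleaning up any superseded tags is left out of the sketch.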

NeatNit · August 30, 2024, 12:31pm

I see. AFAIK there is no usage of gtfs:* tags in Israel yet, but the previous import (from 4 years ago) used gtfs:addr:housenumber which makes it impossible to check with a simple Overpass query. I’ll try to run a query and filter it with code.

Edit: pretty sure there are no results.
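
Filtering the query result in code could be done along these lines. This is a sketch, not the code that was actually run: it assumes the elements come from an Overpass API JSON response (each with `type`, `id`, and optionally `tags`), and the function names are hypothetical.

```python
IGNORED_KEYS = frozenset({"gtfs:addr:housenumber"})  # leftover of the old import


def other_gtfs_keys(tags: dict[str, str]) -> list[str]:
    """Return gtfs:* keys on an object, ignoring the old import's leftover key."""
    return sorted(k for k in tags if k.startswith("gtfs:") and k not in IGNORED_KEYS)


def elements_with_gtfs_tags(elements: list[dict]) -> list[tuple[str, int, list[str]]]:
    """List (type, id, keys) for every element that carries a real gtfs:* tag."""
    found = []
    for el in elements:
        keys = other_gtfs_keys(el.get("tags", {}))
        if keys:
            found.append((el["type"], el["id"], keys))
    return found
```

An empty result would match the "no results" conclusion above.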

From what you’re saying, I’m getting the sense that gtfs:stop_id:IL-MOT is still the best option, so why would I do anything different?

Edit (again): by the way, I cleaned up some broken usage of the Tag template you seem to have added later. See: GTFS: Difference between revisions - OpenStreetMap Wiki

Previously it was rendered as e.g. "gtfs:stop_id:[[Key:|]]=*".

skyper · August 30, 2024, 2:14pm

There is not much difference if you only look at stops. For route relations the story is a bit different and the question is if it is smart to repeat the feed suffix in every gtfs tag or if it is easier to use gtfs:feed=* if only a small number of them are actually covered by more than one GTFS feed and then usually only one of the feeds is the correct source.

ToniE · August 30, 2024, 3:02pm

Yeah, PTNA does not have a problem with larger GTFS feeds. However, “small is beautiful”, “kiss - keep it simple, stupid”, “make small things well”, … still apply. The drawback of larger feeds, such as CH-Alle, is the time it needs to import, aggregate and analyze (!) the feed data before publishing in PTNA (CH-Alle: 3 hours) - but that’s mostly semi-automatic.

For “Stops” I vote for “gtfs:stop_id:*”, it is even the only way to do that if stops are “used” by more than one transport association/authority or appear in more than one feed with different values.

Sure, but allow some time. I’m currently away from my dev environment, have access to Windows laptop though, but not all tools are available (even with WSL). I’ll be back on September 12.

NeatNit · August 30, 2024, 5:10pm

No rush from me, please do it whenever is convenient

Here’s some line counts for comparison (only the files I currently use):

$ wc -l *.txt
     8130 routes.txt
    34326 stops.txt
 12493014 stop_times.txt
    98736 translations.txt
   335553 trips.txt

I also have a sneaking suspicion (read: concrete knowledge) that a lot of it is junk data. Rant warning… Many stops are never referenced in stop_times.txt and don’t exist in real life (I’m told). Translations are also inconsistent: many are missing, sometimes there’s duplicates with minor differences (e.g. '' vs ", which is used as an acronym mark in Hebrew (should be ״ but nobody uses this unicode codepoint IRL)) and in some cases the translation to other languages is different between the two “equivalent” source strings.
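
Finding the stops that are never referenced in stop_times.txt can be sketched with the standard csv module. The function name is hypothetical and this is not the script's code. Note that a station (location_type = 1) whose children are referenced instead will also show up as unreferenced in this simple version.

```python
import csv
from typing import TextIO


def unreferenced_stop_ids(stops_file: TextIO, stop_times_file: TextIO) -> set[str]:
    """Return stop_ids present in stops.txt but absent from stop_times.txt."""
    # Stream stop_times.txt row by row; only the set of distinct IDs is kept.
    referenced = {row["stop_id"] for row in csv.DictReader(stop_times_file)}
    all_stops = {row["stop_id"] for row in csv.DictReader(stops_file)}
    return all_stops - referenced


if __name__ == "__main__":
    # "utf-8-sig" tolerates a byte order mark at the start of the file, if any.
    with open("stops.txt", encoding="utf-8-sig", newline="") as stops, \
         open("stop_times.txt", encoding="utf-8-sig", newline="") as times:
        print(len(unreferenced_stop_ids(stops, times)))
```

Because only distinct stop IDs are held in memory, the 12 million lines of stop_times.txt are read once without loading the whole file.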

Sometimes stop_times.txt refers to stations (location_type = 1), other times to their children. One place has two parent_stations in the data with the same name and coordinates, I’ve reported this to them and it’s not fixed. They don’t use parent stations correctly anyway, they never have entrances and all child stops have the same coordinates as the parent station so it’s impossible for me to import them, haven’t decided yet how to handle that (issue #5 in my repository).

All of this is to say: I agree, it’s too big for its own good. I should perhaps try harder to establish a communication channel with them and help clean up the data, but that’s not where my efforts have been focused so far. I’m pretty introverted, it’s hard to send an email.

ToniE · August 30, 2024, 5:18pm

CH-Alle is even bigger w.r.t. trips, stops and stop_times.
So, I don’t see a problem for PTNA.

NeatNit · August 31, 2024, 9:12am

I now ran the script in Haifa. Before uploading it I did a “dry run” (run all the code except upload) and it alerted me to issues that I cleaned up before running for real.

Manual clean up: Changeset: 156003899 | OpenStreetMap / OSMCha

First upload attempt tried to delete a stop that was a member of a relation and crashed (now fixed so it doesn’t crash): Changeset: 156004633 | OpenStreetMap / OSMCha (OSMCha fails to show it because the changeset is not closed)

Second upload attempt - still ongoing at time of writing: Changeset: 156004855 | OpenStreetMap / OSMCha

NeatNit · September 1, 2024, 11:56pm

I am wondering about the best behavior for handling difference in position between GTFS data and OSM contributions. Right now I’ve pretty much adopted the rules of the old import:

If the difference is less than 3 meters, revert to GTFS position (the idea being that such a tiny distance is an accidental move)
If the GTFS data has an updated position since the last import, update position
Otherwise, allow editors to move the stop to a more accurate position
(my rule) if the distance is more than 200 meters, alert me and don’t make any changes to the stop. This is likely to be a mistake e.g. a typo in ref, or a huge move made by vandalism that might require more care.
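
The four rules above can be written down as a small decision function. This is a sketch, not the script's code: the names are hypothetical, distance is computed with the haversine formula on a spherical Earth, and it assumes the 200 m alert takes precedence over the other rules, which the list does not state explicitly.

```python
import math
from enum import Enum


class Action(Enum):
    USE_GTFS = "use_gtfs"   # set the node to the GTFS position
    KEEP_OSM = "keep_osm"   # keep the position chosen by an editor
    ALERT = "alert"         # report and do not touch the stop


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters, assuming a spherical Earth (R = 6371 km)."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def decide_position(osm: tuple[float, float], gtfs: tuple[float, float],
                    gtfs_changed_since_last_import: bool,
                    min_m: float = 3.0, max_m: float = 200.0) -> Action:
    """Apply the position rules to one stop; positions are (lat, lon) pairs."""
    distance = haversine_m(*osm, *gtfs)
    if distance > max_m:
        return Action.ALERT      # likely a ref typo or vandalism
    if distance < min_m:
        return Action.USE_GTFS   # treated as an accidental nudge
    if gtfs_changed_since_last_import:
        return Action.USE_GTFS   # the feed has a newer position
    return Action.KEEP_OSM       # respect the editor's refinement
```

Keeping the thresholds as parameters makes it easy to experiment with a different cutoff than 3 m, or to drop it, if the rules are revised.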

I am now rethinking this because there are a lot of cases where stops are moved for no apparent reason. There’s also very little sense in that 3m cutoff. I guess the big question is: is the position from GTFS totally trustworthy? Do we want to allow editors to tweak the positioning of bus stops on the map?

I brought up this question on the Telegram, thought it’s worth mentioning here.

I also just implemented something: if a bus stop is tagged with restore:gtfs=position, then the script will restore its position to the GTFS data no matter what. This is useful if an editor sees that a stop is in the wrong position but can’t determine the right position.
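
In code, the override could be a check that runs before any other position logic. This is a hypothetical sketch: the function name and the `otherwise` parameter are made up for illustration, and whether the script removes the restore:gtfs tag afterwards is not stated in the post.

```python
Position = tuple[float, float]  # (lat, lon)


def resolve_position(tags: dict[str, str], gtfs_pos: Position,
                     otherwise: Position) -> Position:
    """Return the GTFS position if the editor requested a restore, else `otherwise`.

    `otherwise` is whatever position the normal rules would have chosen.
    """
    if tags.get("restore:gtfs") == "position":
        return gtfs_pos
    return otherwise
```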