[RFC] Feature Proposal - GTFS Tagging standard

Just brainstorming: wouldn’t this work better, and completely generically, as an application with a GTFS database that can be queried through an API, using specific IDs, names/refs or simply the geolocation as an argument?
The total workload would be of a different kind (programming and hosting) but smaller (no mapping at all), and any application could use it without altering its data structure.
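
To make the idea a bit more concrete, here is a minimal sketch of the geolocation lookup such a service might do. It assumes the feed's stops.txt has already been read into memory; the function names, the 50 m search radius and the sample rows are all made up for illustration, and only the column names come from GTFS.

```python
import csv
import io
import math

# Made-up sample rows; only the column names are real GTFS columns.
SAMPLE_STOPS_TXT = (
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "demo-1,Demo Station,52.0000,5.0000\n"
    "demo-2,Demo Square,52.0100,5.0000\n"
)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def load_stops(stops_txt):
    """Parse the text of a stops.txt file into a list of dicts."""
    return list(csv.DictReader(io.StringIO(stops_txt)))


def nearest_stop(stops, lat, lon, max_distance_m=50.0):
    """Return the closest stop within max_distance_m, or None if there is none."""
    best, best_distance = None, max_distance_m
    for stop in stops:
        distance = haversine_m(lat, lon, float(stop["stop_lat"]), float(stop["stop_lon"]))
        if distance <= best_distance:
            best, best_distance = stop, distance
    return best


STOPS = load_stops(SAMPLE_STOPS_TXT)
match = nearest_stop(STOPS, 52.0001, 5.0)
print(match["stop_id"] if match else "no stop nearby")
```

A real service would need a spatial index and some way to tell apart nearby platforms of the same station, which is exactly where a pure distance match gets ambiguous.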

transit.land already exists.
In my experience, the site is slow and gives incomplete information (it does not always show all route variants). I don’t know how performant their API is.
Furthermore, the onestop IDs - designed to unify duplicate stops across feeds - do not work.
See s-u1hpwr8cp5-arnhemcentraal and s-u1hpwpxstf-arnhemcentraal.
Or see r-u1h-stoptreinre19 and r-u1h-e19.
These should be the same, but are not.
Note: I cannot find a onestop ID for individual platforms. The Dutch feed actually refers to platform 6b for route RE19, the German feed to the whole station.

Another website is Transitfeeds, which is purely a repository of feeds with a good interface for exploring them. Unfortunately it has been discontinued and is no longer updated. The repository of feeds can be found at database.mobilitydata.org.

I think linking OSM and GTFS is useful from the GTFS side as well. OSM could be a solution for linking the same object across different feeds - it already has a single feature for each platform/station/….
The positioning of stops in OSM is better, and routes contain the actual roads instead of a GPS trace (if a GTFS trip even has one).


Maybe we should take a step back and talk about what we are trying to solve and what the purpose of the feed relation should be.

  • If I have a stop or route and want to know the URL of the GTFS feed tagged on the object, the URL should also be on the object. Climbing up the ladder to find the URL in the feed relation is probably more complicated than linking to some external source (wiki).
  • If I want to know the area covered by a GTFS feed, a boundary relation could be the right choice.

I can understand that adjusting the URL every month (or even weekly for Switzerland) is not appropriate, but that is a problem of GTFS in general. It is not solved by changing the URL on every object, on the feed relation or in the external source, as the OSM data needs to be checked and adjusted, too.

The goal is finding the URL from a platform, stop, station, route or route_master.
Users would not care about details of the feed (like area of operation), only about the timetables in that feed.

On mapping area of operation to find the feed
With an is_in query we could find all surrounding areas of a stop, including the feed relation.
This does not work for the PT relations since they do not have coordinates.
Thus for those we need to climb down the relations to a stop and use its coordinates.
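
A rough sketch of that climbing down, assuming the OSM elements are already loaded into a plain dict keyed by "type/id". The dict layout, the function name and the sample data are hypothetical and not any particular library's format; the roles follow the usual PTv2 "stop"/"platform" convention.

```python
def find_stop_coordinates(ref, elements, seen=None):
    """Descend from a route_master/route/way to the first stop node with coordinates.

    `elements` is a hypothetical in-memory structure: nodes carry lat/lon,
    ways carry a list of node refs, relations carry members with a role.
    """
    seen = set() if seen is None else seen
    if ref in seen or ref not in elements:
        return None  # guards against cyclic or incomplete relations
    seen.add(ref)
    element = elements[ref]
    if element["type"] == "node":
        return (element["lat"], element["lon"])
    if element["type"] == "way":
        children = element["nodes"]  # e.g. a platform mapped as a way
    else:
        # Follow child relations (route_master -> route) and stop/platform
        # members only, so that the road ways of the route are skipped.
        children = [
            member["ref"]
            for member in element["members"]
            if member["ref"].startswith("relation/")
            or member["role"].startswith(("stop", "platform"))
        ]
    for child in children:
        coordinates = find_stop_coordinates(child, elements, seen)
        if coordinates is not None:
            return coordinates
    return None


# Made-up sample: a route_master containing one route with a road and a stop.
ELEMENTS = {
    "relation/1": {"type": "relation", "members": [{"ref": "relation/2", "role": ""}]},
    "relation/2": {
        "type": "relation",
        "members": [
            {"ref": "way/10", "role": ""},
            {"ref": "node/100", "role": "stop"},
        ],
    },
    "way/10": {"type": "way", "nodes": ["node/101"]},
    "node/100": {"type": "node", "lat": 52.0, "lon": 5.0},
    "node/101": {"type": "node", "lat": 1.0, "lon": 1.0},
}

print(find_stop_coordinates("relation/1", ELEMENTS))
```

The coordinates found this way could then be fed into the is_in query, at the cost of one extra lookup per relation.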

Another problem is that the area of operation does not cover all stops/routes of the feed.
It is normal for routes that cross the border to be included as well.
Extending the area with tentacles around those routes is stupid.
Should we then make the international stops a member of the relation?
What about platforms mapped as ways, how do we not confuse the MP algorithm with them?


Switzerland has a permalink for each yearly timetable:
2023: https://opentransportdata.swiss/de/dataset/timetable-2023-gtfs2020/permalink
2024: https://opentransportdata.swiss/de/dataset/timetable-2024-gtfs2020/permalink
Still, it would be a hassle to ask permission for a mechanical edit every year.

Look at the following quote from the GTFS standards website:

Getting Started - General Transit Feed Specification (emphasis mine)
Datasets should be published at a public, permanent URL, including the zip file name.


I think it is helpful to separate the feed into two parts: permanent and temporary routes/stops.
Temporary routes and stops should not be mapped. Most weekly changes in the feed will concern these. Therefore these do not need to be checked or adjusted.
Permanent routes and stops will rarely change during the year.
These will probably only change significantly once a year - when the new timetables are rolled out.

I think we only need to check OSM data when the yearly timetable change occurs.
In that period a lot of changes will be made to the public transport objects.
During that period you don’t want a mechanical edit that changes URLs on all objects.
Such an edit would make it more difficult to revert earlier changesets that contain mistakes.
A single change on a feed relation or wiki is a lot easier and less intrusive.


I have updated the proposal.
The feed relation has now been removed, in favour of listing the feeds on a wiki page.
I think this makes the proposal far simpler and more likely to succeed.

Any feedback on this new version is welcome.

I just added a request “Section for Best Practice” - this section is not part of the “proposal” though?

I’ve seen so many different GTFS feeds and how they organize their data and how “useful” it can be (or not) for OSM.

The aim is just to avoid gtfs:* tags being added here and there to routes and stops when the data is no longer valid the next day, after the next update.

Thanks for the update.

Could you please list all new tags in the proposal that do not have a wiki page yet and describe them in a few words, e.g. gtfs:location_type or gtfs:platform_code? Thanks a lot in advance.

I see that you use gtfs:route_long_name and similar. Do you propose to deprecate gtfs:name, gtfs:long_name and gtfs:short_name?

For all of these: they correspond to GTFS columns for which precise documentation can be found on Reference - General Transit Feed Specification
I included the more important columns as a collapsed table in the background section.
The definition of these will be: “The exact value of the corresponding column in the GTFS feed”

  • gtfs:stop_code - “short text/number that identifies the location for riders”
    Whether it is actually public-facing can differ. OVApi has the station code (railway:ref) for stations (public facing) but the last part of the IFOPT (ref:IFOPT) for bus stops (not public facing).
  • gtfs:stop_name - The stop name according to the feed. May differ from name in abbreviation, capitalisation, ….
  • gtfs:location_type - stops.txt lists all sorts of locations; location_type distinguishes between these. In theory the value can be deduced from the type of OSM object. I included it so that a data consumer does not have to and can just use this tag to distinguish between a bus stop and its platform in the GTFS feed.
  • gtfs:platform_code - The letter/number that identifies a platform of a bus or train station. Will likely match ref but can again differ in capitalisation.
  • gtfs:route_long_name - Full name of route often with destinations
  • gtfs:route_short_name - Short identifier for route - e.g. bus number

There are others (like gtfs:wheelchair_boarding) that are also allowed under the rule that any column name can be tagged. I don’t think it is useful to include these in the list since they are not relevant for the main purpose of the proposal - specifying a way to find timetables from an OSM object.
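
To illustrate the “exact value of the corresponding column” rule, here is a small sketch that turns one stops.txt row into tags. The function name, the feed code XX-DEMO and the row values are invented for the example, and the optional feed suffix follows my reading of the proposal rather than a fixed rule.

```python
def row_to_tags(row, columns, feed_code=None):
    """Build gtfs:<column> tags from a GTFS row, copying the values verbatim.

    Empty columns are skipped. If feed_code is given it is appended as a
    subkey, e.g. gtfs:stop_id:<feed_code>.
    """
    suffix = f":{feed_code}" if feed_code else ""
    return {f"gtfs:{column}{suffix}": row[column] for column in columns if row.get(column)}


# Made-up row; only the column names come from the GTFS reference.
ROW = {
    "stop_id": "demo-6b",
    "stop_code": "",
    "stop_name": "Demo Centraal",
    "location_type": "0",
    "platform_code": "6b",
}
COLUMNS = ["stop_id", "stop_code", "stop_name", "location_type", "platform_code"]

for key, value in row_to_tags(ROW, COLUMNS, feed_code="XX-DEMO").items():
    print(f"{key}={value}")
```

Because the values are copied as-is, a consumer can match them against the feed with a plain string comparison, without guessing about abbreviation or capitalisation.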

Well, I always find it confusing when a proposal mentions non-established keys which are not part of the proposal itself in the examples.
I am not sure if we want to import the complete GTFS data into OSM. Do we really need e.g. gtfs:location_type? The names of the fields in the GTFS specification can change and, even worse, many GTFS providers do not follow the specification strictly enough. Therefore I would rather use common OSM tags with gtfs: as prefix, like gtfs:short_name or gtfs:long_name.

You did not tell me how to handle gtfs:name and I do not think that we need different keys for the name of the stops and the name of the routes.

I agree that it is not useful to import all columns into OSM - and if done they should mostly end up in regular tags. (like wheelchair_accessible, route color, …)
The aim of this proposal is to specify how to reference objects in a GTFS feed.

While making the examples I found out that they could be useful for identification as well. In the stops.txt table different sorts of objects are put in the same table. location_type may be the only way to distinguish platforms and the (bus) station (if platforms do not have a platform_code). Note that GTFS also uses the term ‘station’ for regular bus stops.

I doubt this, as it would require massive effort from thousands of transit agencies and apps that consume the data. Unless it would solve a big problem with the specification, it is unlikely to change.

By using the column names we can handle these cases.

The main goal is to eliminate guesswork for the data consumer. Currently gtfs:name has the meaning “the name of this stop according to a GTFS feed”. My proposal changes this to “the precise value of the name column in the GTFS feed (of the feed suffix)”. Having gtfs:name as an alias for gtfs:route_long_name, gtfs:route_short_name and gtfs:stop_name introduces more guesswork for the consumer.


Hi, and sorry for the late comment on the deprecation of gtfs:feed.

I added some comments to Talk:Key:gtfs:feed and a link to that on Talk:GTFS

Informed: @Mxdanger as the last editor of Key:gtfs:feed having added

However, in practice gtfs:feed=* is still widely used as the preferred method.

@mcliquid @Patchi @skyper @miche101 @wolfy1339 @spaanse


I’m not sure what to say other than that this is a very, very technical discussion (I’m not an IT guy and therefore it is not easy to follow everything - this is also the main reason why I haven’t reacted until now). In my humble opinion the tag gtfs:feed=* is the very first link between the GTFS world and the OSM world, and I personally do not think it is a good idea to deprecate this tag, as I don’t see a better alternative for the moment.

The value syntax of this tag may be a problem:

The Syntax is the same as operator:guid=* and network:guid=* and follows the ISO 3166-1 alpha-2 on Wikipedia codes, with some exceptions.

but these exceptions are unfortunately necessary. One example: there are at least 2 different public transport networks in France with the name STAR, and the standard gtfs:feed value syntax cannot distinguish between these 2 networks. Either we have another tag to make this distinction, or the value syntax has an exception. The latter is usually used on PTNA, as a region code is often put after the (ISO 3166-1) country code. I doubt that region codes are internationally standardized. The same problem would probably occur if we added another tag to ensure the needed distinction. I know that the (gtfs:)operator can help to make this distinction, but here as well it can be challenging to differentiate both STAR networks. In this particular case it will probably work with the operator combination, but in general you have no guarantee (France has major PT companies, and the chance that the same major operator ends up running both networks is not an illusion).

If I want to link an existing OSM PT network to a GTFS feed, I will probably start by putting gtfs:feed=* on the master relation(s) (and probably add a gtfs:route_id=* as well, even if the ID from the GTFS feed and the “human” ID may differ depending on the GTFS quality).

Then, depending on how deeply you want to link these 2 worlds, the other tags will be added in several steps. A question at this stage: is there one best practice to use? I doubt it. There are probably several best practices, depending on the details you want to map. Some GTFS feeds deliver dozens of routes for one line (with lots of subroutes). Some users will choose to create each variant, others will map only the main routes and ignore the subroutes. Some users will add the time schedule, others not. Some users will link the stops / platforms to the GTFS world, others not. And this depends as well on what the GTFS feed provides.

That’s probably why I continue to use the tag gtfs:feed for the moment, as the least bad solution.

Note: this proposal has already been accepted.

Maybe I am mistaken, but I do not see much value in only tagging gtfs:feed.
It becomes useful when you also tag values to identify the corresponding element in the feed.
So gtfs:feed=feed_code will always be used in conjunction with gtfs:stop_id and similar.
However, with this method it is unclear how to tag multiple stops.
That is why I proposed gtfs:stop_id:feed_code=*
With the proposal I deprecated gtfs:feed because the proposed method resolves this ambiguity better.
In my opinion, usage should be discouraged in favour of the subkey on other GTFS tags.

@Patchi The key has not been deprecated because it had issues with its syntax. The new method essentially uses the same syntax.
Values used are documented on List of GTFS feeds - OpenStreetMap Wiki

In the worst case, when tagging a route relation in OSM (example)

gtfs:feed=DE-BY-MVV
gtfs:release_date=2024-06-02
gtfs:route_id=19-210-s24-4
gtfs:trip_id:sample=1.T0.19-210-s24-5.1.R

has to be tagged with

gtfs:release_date:DE-BY-MVV=2024-06-02
gtfs:route_id:DE-BY-MVV=19-210-s24-4
gtfs:trip_id:sample:DE-BY-MVV=1.T0.19-210-s24-5.1.R

that’s all and that’s the difference. It needs to be tagged like this, even if there is no other GTFS feed for this GTFS route.
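
For anyone who wants to see the difference mechanically, here is a small sketch that rewrites the first tagging style into the second. The function name is made up, and the handling of non-gtfs tags (left untouched) is my assumption; the tag values are the MVV ones from the example above, plus an illustrative type=route.

```python
def add_feed_suffix(tags):
    """Rewrite gtfs:feed=X plus unsuffixed gtfs:* tags into gtfs:*:X tags.

    Sketch only: tags outside the gtfs: namespace are copied unchanged, and
    keys that already end in the feed code are not suffixed a second time.
    """
    feed = tags.get("gtfs:feed")
    if feed is None:
        return dict(tags)  # nothing to convert
    converted = {}
    for key, value in tags.items():
        if key == "gtfs:feed":
            continue  # the feed code moves into the other keys
        if key.startswith("gtfs:") and not key.endswith(":" + feed):
            converted[f"{key}:{feed}"] = value
        else:
            converted[key] = value
    return converted


OLD_STYLE = {
    "type": "route",
    "gtfs:feed": "DE-BY-MVV",
    "gtfs:release_date": "2024-06-02",
    "gtfs:route_id": "19-210-s24-4",
    "gtfs:trip_id:sample": "1.T0.19-210-s24-5.1.R",
}

for key, value in add_feed_suffix(OLD_STYLE).items():
    print(f"{key}={value}")
```

Note that the sketch cannot tell a key that already carries a different feed code from a plain subkey like gtfs:trip_id:sample, so it would need a list of known feed codes to be safe on mixed data.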

That is my concern, not the values of the keys.


I’m with you for that part because a stop / a platform can indeed have multiple routes and therefore multiple networks. I think your solution can resolve the issues I faced. I mean it is a good idea for this part.

But I don’t see a benefit in using this scheme on route relations (bus, train, tram, metro, etc.), because one will usually create one relation for each variant and group all variants into a master relation according to PTv2. Maybe I’m wrong, but how can we have multiple networks on one PT route relation? As ToniE wrote in his previous message, you will have to put the feed_code on each gtfs:* tag (gtfs:*:<feed_code>=*) for each PT route relation even though you have only one feed. It is not wrong, but it is not the easy way either.

What I wanted to explain about the value syntax of the gtfs:feed tag is that it will probably stay quite unclear, whereas it is pretty clear for most other gtfs:* tags, as their values appear in the GTFS feed (and therefore you copy exactly what appears in the GTFS feed). The OSM tag gtfs:feed=* is an artificial construct to link the 2 worlds, as I explained in my previous message. It has the known syntax, but as I explained before there are still some issues with this syntax, which leads to a quite arbitrary syntax. And you will have to use this arbitrary syntax on each gtfs:*:<feed_code>=* tag as well, which is not exactly the perfect solution in my humble opinion.


I did not like the deprecation of some of the tags in general and therefore voted against the proposal.

As already mentioned, apart from stops (stop_position, platform and stop_area), multiple feeds are rather rare, and most of the time one source is just a copy of the original one. Regarding route relations, you need to be very careful that the different feeds really include the identical route, especially if opening_hours, interval and duration come into play. My common rule was/is to stick with the IDs from the operator of the route.

A GTFS feed does not contain information about a single line; it contains information about all public transport of a region and/or mode of transport and/or operator.
For example, you might have a GTFS feed for all ICE trains in Germany and feeds for local transport in each municipality/Bundesland.

A route can be part of multiple GTFS feeds, for example when it crosses a border between regions.
It can happen that one is not more authoritative than another.
For example, the train between Hamburg and Copenhagen is jointly operated by DB and DSB.

When the whole feed is a copy/mirror, I agree that you do not need to tag both.

I think tagging both if the versions are slightly different is really important, as there is no way to match them otherwise. Of course we need to prevent programs from changing it to one version and then back to the other repeatedly.

Proposal:

  • Reinstate gtfs:feed
  • Define gtfs:feed + non-subkeyed GTFS tags as the authoritative feed for this feature
  • Use gtfs:[tag]:[feed_code] for all other GTFS objects that correspond to this object, even if slightly different.

Automated tools may only modify geometry/tags outside the gtfs namespace if they source their data from the gtfs:feed feed (and care should still be taken).
Automated tools may always modify GTFS tags for the feed they source from.

In the Hamburg - Copenhagen example, one feed should be chosen as authoritative (probably the one from Denmark/DSB, as they have the longer section). The other (Germany/DB) should be tagged as a secondary feed.
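
As a sketch of how a data consumer might read such tagging: non-subkeyed tags go to the feed named in gtfs:feed, subkeyed tags go to their own feed. The function name, the feed codes DK-DSB and DE-DB and the route IDs are all invented for the example. The set of known feed codes is assumed to come from the wiki list and is needed because GTFS keys can themselves contain colons (e.g. gtfs:trip_id:sample).

```python
def split_by_feed(tags, known_feeds):
    """Group gtfs:* tags per feed according to the proposal above.

    Returns (authoritative_feed, {feed_code: {column: value}}).
    Non-subkeyed tags without a gtfs:feed tag are ignored in this sketch.
    """
    authoritative = tags.get("gtfs:feed")
    per_feed = {}
    for key, value in tags.items():
        if not key.startswith("gtfs:") or key == "gtfs:feed":
            continue
        name = key[len("gtfs:"):]
        head, _, last = name.rpartition(":")
        if head and last in known_feeds:
            feed, column = last, head  # gtfs:[tag]:[feed_code]
        elif authoritative is not None:
            feed, column = authoritative, name  # non-subkeyed tag
        else:
            continue
        per_feed.setdefault(feed, {})[column] = value
    return authoritative, per_feed


# Hypothetical feed codes and IDs for the Hamburg - Copenhagen train.
TAGS = {
    "route": "train",
    "gtfs:feed": "DK-DSB",
    "gtfs:route_id": "demo-dk-1",
    "gtfs:trip_id:sample": "demo-trip",
    "gtfs:route_id:DE-DB": "demo-de-1",
}

authoritative, feeds = split_by_feed(TAGS, known_feeds={"DK-DSB", "DE-DB"})
print(authoritative, feeds)
```

An automated tool could then check that the feed it sources from equals the authoritative one before touching anything outside the gtfs namespace.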


Thanks Jelmer, this is actually much more than I hoped for.

If there is no authoritative feed, mappers can still ignore bullets 1 and 2 and use subkeyed GTFS data.

For stops however, the operator (=maintainer of the hardware at the stop position) can sometimes be seen as the provider of the authoritative data.
