GTFS in the Netherlands

spaanse · October 26, 2023, 9:27am

I would like OSM applications to be able to offer public transport navigation as well, including departure times. From earlier discussions it is clear to me that departure times are too large and too volatile a data set ever to be stored in the OSM database.

So I think a link to GTFS feeds is necessary to make this work. Conversely, such a link also helps tools like PTNA keep bus routes up to date.

One obstacle to a good link is, in my view, the lack of an unambiguous tagging scheme. The issues the Wiki lists are: two namespaces in use, gtfs_* and gtfs:*, and how to deal with overlapping feeds.

I do have an idea of what such a scheme could look like, but I don't know how the process for proposing something like this works.

Summary of what I would propose

Use the gtfs:* namespace, as PTNA does now.
Retag all gtfs_* tags automatically. I don't consider this problematic, because I see no difference in interpretation between the tags. It would need to be checked whether any data consumers use the gtfs_* tags.
One relation per feed, containing all stops, route_masters and routes with a corresponding role.
This relation contains at least gtfs:feed_url and a ref.
A gtfs:license=CC0 or similar also seems useful to me; I don't know what is practical here.
Stops, route_masters and routes carry the appropriate gtfs:stop_id / gtfs:route_id / gtfs:shape_id / … as currently described in the Wiki.
With multiple feeds, the ids from the different feeds are separated by ;. The number of ids must equal the number of feed relations the object is a member of.
The order of the ids follows the order of the refs of the feed relations (lexicographic by ASCII or UTF-8 or similar?).
If an object lacks a certain kind of id in one feed but has it in another, that slot is left empty; for example gtfs:shape_id=;503 or gtfs:trip_id:sample=124;;
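To make the multi-feed rule concrete, here is a minimal sketch of how a data consumer might read such a tag value. The function name and the feed refs in the example are made up for illustration; the only rules taken from the proposal are the ; separator, the lexicographic ordering by feed ref, and empty slots for missing ids.

```python
def split_feed_ids(tag_value, feed_refs):
    """Map each feed ref to its id, or None where the slot is left empty.

    Assumes the ids are ordered by the lexicographically sorted refs of the
    feed relations the object belongs to, as proposed above.
    """
    ids = tag_value.split(";")
    refs = sorted(feed_refs)
    if len(ids) != len(refs):
        raise ValueError(f"expected {len(refs)} ids, got {len(ids)}")
    return {ref: (value or None) for ref, value in zip(refs, ids)}


# Hypothetical feed refs, for illustration only.
print(split_feed_ids(";503", ["NL-Nationaal", "DE-Example"]))
```

A mismatch between the number of ids and the number of feed relations raises an error, which is the consistency check the proposal implies.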

Another problem is that in the Netherlands we have no GTFS ids in OSM. The Netherlands has a single GTFS feed covering all public transport, see gtfs.ovapi.nl
As far as I can tell this is open data

https://gtfs.ovapi.nl/README
LICENSE
You are free to use this data, but there is no servicelevelagreement (best-effort) nor are you allowed to say you represent or impersonate any of the
transit agencies listed here.

Based on NDOV data, which states:

NDOV Loket
In the Netherlands, travel information is available as open data under a CC0 waiver. Governments and operators have waived copyright and database rights.

(In my view this answers @ToniE's question in Public_transport:version=2 - #33 by ToniE)

What are your thoughts on this?

ToniE · October 26, 2023, 9:55am

I’ll have a look at the data. It seems to be similar to Switzerland or Norway, which provide a single GTFS feed for the whole country. No problem for PTNA.

Yeah, agreed.

In addition, in Winnipeg, CA we see gtfs:agency_id, gtfs:agency_url

From my personal experiences as power mapper for PT: some thoughts about using gtfs:* in route relations.

don’t expect GTFS data to be correct
- I’ve seen so many errors: not really suitable for QA?
- implementing QA based on GTFS might show many “false positives”
don’t expect GTFS data (ids) to be the same in the next version of the data
- GTFS data must be consistent inside a *.zip file but not from one *.zip to the next
- route_ids, trip_ids, shape_ids may differ or even point to a completely different data set (seen that in Winnipeg, CA)
- some authorities provide new GTFS data on a daily basis (DE-BY-MVG), others 4 times a year?
  - PTNA downloads and aggregates those only once a month

I’d suggest using gtfs:* in OSM relations using a minimum set of entries

gtfs:feed_url - points to the *.zip file
gtfs:route_short_name - identifies the bus/train/… number
- set this if it differs from OSM’s ‘ref’ for route_master/route relations
- otherwise we may assume that our ‘ref’ is the ‘route_short_name’ in GTFS

This data will not change often, less often than route_id, trip_id or shape_id
This data should be sufficient for data consumers to map OSM to GTFS
This minimum set has least impact on maintenance of OSM data.
Keeping OSM gtfs:route_id, gtfs:*trip_id data up-to-date with GTFS data can become a full-time job - for certain feeds.

ToniE · October 26, 2023, 11:11am

Well, as we will see with NL-Nationaal GTFS analysis, there are tons of trains with route_short_name=Intercity.
Obviously, route_short_name isn’t enough for nationwide GTFS data.
We might have to add gtfs:route_id then in those cases.
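A quick way to see where route_short_name alone is ambiguous is to count how often each value occurs in routes.txt. This is only a sketch using the standard library; the sample rows are invented, and only the route_id and route_short_name column names come from the GTFS specification.

```python
import csv
import io
from collections import Counter


def ambiguous_short_names(routes_csv_text):
    """Return the route_short_name values shared by more than one route."""
    rows = csv.DictReader(io.StringIO(routes_csv_text))
    counts = Counter(row["route_short_name"] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Invented sample rows, not taken from the real feed.
SAMPLE_ROUTES = """route_id,route_short_name,route_type
1,Intercity,2
2,Intercity,2
3,12,3
"""

print(ambiguous_short_names(SAMPLE_ROUTES))
```

Routes that show up here are the ones where an extra gtfs:route_id (or another distinguishing tag) would be needed.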

ToniE · October 26, 2023, 11:26am

I just added a GTFS analysis for the Netherlands, NL-Nationaal, covering the whole country, to PTNA.

spaanse · October 26, 2023, 5:10pm

Is this just shapes or also stop order in trips?
I think it would be useful as a guide to what may have changed.
I suspect the feed also includes reroutings for planned obstructions, those might lead to false positives.

Comparing the zip from 20 September with the latest one (both can be found in the archive):
routes.txt: +203, -212, 18 changed lines (total rows = 2423)
stops.txt: +675, -888, 298 changed lines (total rows = 63383)
These statistics exclude any lines mentioning “i.p.v. trein”, since I don’t think those should be mapped.
The files were also sorted so that similar ids are next to each other (reordered lines do not affect the statistics).
In conclusion, these files don’t change much and the ids are relatively constant.
However, 10% of routes changing in a month is still a lot, and this does not even include the yearly timetable change.

trips.txt has mostly changed; even most trip_ids changed
shapes.txt -2878 shapes, +2422 shapes (total shapes = 6716)

The most stable reference for routes is to use shape_id, but not the most consistent.
I don’t know how many of the changed shapes are for a train replacement bus.
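For anyone who wants to repeat this kind of comparison, here is a rough sketch that compares two versions of a GTFS file by id instead of by sorted lines. That is an assumption on my side, not the exact method used for the numbers above, so the counts may differ slightly. The function names and the sample rows are made up.

```python
import csv
import io


def load_by_id(csv_text, id_field, skip_phrase=None):
    """Index the rows of a GTFS file by id, dropping rows that mention skip_phrase."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if skip_phrase and any(skip_phrase in (v or "") for v in row.values()):
            continue
        rows[row[id_field]] = row
    return rows


def compare_feeds(old_text, new_text, id_field, skip_phrase=None):
    """Return (added, removed, changed) row counts between two versions."""
    old = load_by_id(old_text, id_field, skip_phrase)
    new = load_by_id(new_text, id_field, skip_phrase)
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    changed = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    return len(added), len(removed), len(changed)


# Invented sample data, for illustration only.
OLD = "route_id,route_short_name,route_long_name\n1,10,A - B\n2,11,C - D\n3,12,E - F\n"
NEW = ("route_id,route_short_name,route_long_name\n1,10,A - B\n2,11,C - D via X\n"
       "4,13,G - H\n5,NS,Bus i.p.v. trein\n")

print(compare_feeds(OLD, NEW, "route_id", skip_phrase="i.p.v. trein"))
```

Keying on the id means a row whose id changed counts as one removal plus one addition, which is exactly the instability that matters for gtfs:route_id tags in OSM.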

spaanse · October 27, 2023, 8:38am

I quickly loaded stops.txt into QGIS.
There are two kinds: <id> with location_type 0 and stoparea:<id> with location_type 1. According to the GTFS docs the first is therefore a platform/stop and the second a (bus) station. The second kind is occasionally also used to combine the platforms in both directions of an ordinary bus stop.
I don't know whether a gtfs:stop_id belongs on a public_transport=platform or a public_transport=stop_position; I suspect the former.
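The split between the two kinds can also be checked without QGIS. A small sketch, assuming only the location_type column from the GTFS specification (where an empty value means the same as 0); the sample rows are invented.

```python
import csv
import io
from collections import Counter


def count_location_types(stops_csv_text):
    """Count stops per location_type, treating an empty value as 0."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(stops_csv_text)):
        counts[row.get("location_type") or "0"] += 1
    return dict(counts)


# Invented sample rows that mimic the two id patterns described above.
SAMPLE_STOPS = """stop_id,stop_name,location_type,parent_station
1001,Example Centraal,0,stoparea:200
1002,Example Centraal,,stoparea:200
stoparea:200,Example Centraal,1,
"""

print(count_location_types(SAMPLE_STOPS))
```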

Below are the stops from the GTFS (blue = platform, yellow = (bus) station)

Platforms of the railway station are roughly in the right place; platforms of the bus station are chaotic

Some sets of stops have a station, others do not

Coverage of the Netherlands. There are large areas where the station does not exist or is misplaced

Coverage of Europe

And a whole lot of stations (no platforms as far as I can see) in Africa
I think these are the missing stations of the Netherlands; for example, I came across the Wadden Islands among them

spaanse · October 27, 2023, 9:34am

I see a lot of Trips have same Stop-Names but different Stop-Ids
This seems to be because the train uses a different platform on a station.
Can you include the platform_code in the stops table in PTNA?
Currently it gives the impression that this is an issue in the GTFS, but I think this is the correct way to represent trips which stop at different platforms but otherwise take the same route.

Example (edited in inspector)

for PTNA - GTFS Analysis

Edit:
Also: Lat/Lon doesn’t need to be that precise

ToniE · October 27, 2023, 12:38pm

For the NL-Nationaal, I can’t say that precisely. But during a short look into the data I found a long trip/shape/route for a bus which had only 2 stops in the middle - missing a lot more stops?
PTNA lists some (not all) suspicious errors here.

spaanse:

ToniE:

don’t expect GTFS data (ids) to be the same in the next version of the data

Every GTFS data set is different. I’ve seen similar GTFS data set structures only for those derived from MentzDV-SW (the people who work for Transport for London, my local (DE-BY-)MVV, and many other authorities). So it’s up to the local guys with local knowledge to find out how the GTFS data is structured.

ToniE · October 27, 2023, 12:52pm

True, mostly seen for trains. Some trains (route_id) then show 300 or more trip variants, which is tedious to map in full. I’d suggest mapping a representative one, the one with the highest “Number of rides” (3rd column on PTNA - GTFS Analysis)

Adding the “stop_id” into the list (columns 6-8) would blow up the table?

I also see the need for a better presentation of the differences but do not really have a solution in mind.

I take the data as is from the GTFS; I would like to avoid truncating the numbers.
Same for ‘distance’ values in the “shape” section. Does GTFS say a word about miles, meters, kilometers?

ToniE · October 27, 2023, 12:58pm

Yeah, location_type=1 is referred to as “parent stop”, in OSM terms “stop_area”? Whatever the authority (creator of GTFS) thinks about that.

That’s also my impression. The points are nearly always beside a street, so ‘platform’. Passengers should be routed to the appropriate platform when starting a trip.

spaanse · October 28, 2023, 6:59pm

I mean on the trip page. The table with all the stops along the trip.
I agree that within the trip table on the route page it would make it a lot bigger.

https://gtfs.org/schedule/reference/#shapestxt
Here it mentions to use 5.25 for 5.25 kilometers, but OVApi seems to use meters.

I have looked at one of them. It seems to be a part of the full route of a train. Maybe shortened because of scheduled works? It seems that they still use the full shape in that case.

ToniE · October 29, 2023, 1:45pm

I see, the “stop_id” is already shown in column 7 of the table in the section “Stops”

The word ‘kilometer’ appears in a comment to stop_times.txt there. Maybe some people (including me) did not notice that.

Yeah, see also the train Amersfoort Centraal => Amsterdam Zuid: the stops are OK, the shape is far too long on either side.

ToniE · October 29, 2023, 1:50pm

Someone asked for support of “Dutch (Netherlands) (nl_NL)” for PTNA @ Transifex.

Are you interested also in PTNA’s analysis of existing OSM-Data for some areas of the Netherlands?

nl_NL @ Transifex supports the OSM-Analysis only. The Web-Site of PTNA is stored @ GitHub. Translations for HTML and PHP pages (incl. GTFS) are done there.

spaanse · October 29, 2023, 9:13pm

I want a column for platform code. The third column in my screenshot I added manually, so it does not exist on the site

I am currently focusing on the GTFS side, but maybe others are interested in the PTv2 analysis. Maybe ask in a new thread as this one is mostly about the Dutch GTFS feed. Does it incorporate GTFS into the PTv2 analysis?

IIVQ · October 30, 2023, 6:26pm

The “official” data format for the Netherlands - and Europe - is now NETEX, which provides both pre-planned timetable and realtime data, and should come from the transport operators (or authorities, where so delegated) directly to the two “dataloketten”.

spaanse · October 30, 2023, 8:16pm

I see a couple issues with NETEX:

There already exists tooling for GTFS and OSM, I don’t know how much exists for NETEX and OSM.
OSM is global, and GTFS seems to be a more global standard than NETEX - due to the efforts of Google.
I believe one of the “dataloketten” is NDOV Loket. However they require a signup and signing of a user agreement. This makes it more difficult for any program that uses OSM and a PT feed to use this, since they need to do something like that for each country (or worse each operator).
So the data may be CC0, but it is not as trivial to access as the GTFS feed.

Note: I looked at the example user agreement, and they have a remain-open clause in there which is nice.

ToniE · October 31, 2023, 6:48pm

I agree and NeTex might be seen as an “island solution”, although Europe is a big island. The NeTex covers far more details and is far more complex than GTFS. I’m afraid providing data in NeTex form is not affordable for most authorities. Much effort to be spent w/o real benefit?

spaanse · November 6, 2023, 9:43am

You could display them in a table.
On the rows the different trips
On the columns the different stops
In each cell, either a mark showing that the trip stops there or the platform code
To describe the trips also include a column with the trip_id and - if you combine the forward and backward trips into one table - a column with the direction (← or →)
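As a rough illustration of that layout, the sketch below renders such a table as plain text. The function name, the input structure, the sample trips and the choice of "x" and "-" as cell markers are all my own assumptions; the direction column is left out to keep it short.

```python
def trip_table(stop_order, trips):
    """Render trips as rows and stops as columns.

    trips maps a trip_id to {stop_name: platform_code}, where an empty
    platform code means the trip stops there but no code is known.
    """
    lines = ["trip_id | " + " | ".join(stop_order)]
    for trip_id, calls in trips.items():
        cells = []
        for stop in stop_order:
            if stop not in calls:
                cells.append("-")  # trip does not call at this stop
            else:
                cells.append(calls[stop] or "x")  # platform code, or a plain mark
        lines.append(f"{trip_id} | " + " | ".join(cells))
    return "\n".join(lines)


# Invented trips, for illustration only.
print(trip_table(
    ["A", "B", "C"],
    {"t1": {"A": "1", "B": "", "C": "4a"}, "t2": {"A": "2", "C": "4"}},
))
```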

For the columns you want the stops in such an order that each trip is a subsequence.
This is the shortest common supersequence problem. There exists a DP solution for this problem but that is exponential in the number of (unique) trips.

Luckily in public transport, routes are mostly linear. If there exists a stop order that only uses each stop at most once, then that order can be found by a topological sort. But for many (210 / 2450 routes in gtfs-nl) this is not the case.
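The linear case can be sketched with Kahn's algorithm for topological sorting. This is a simplified stand-alone version written for this post, not the code from my gist: it builds an edge for every pair of consecutive stops and returns None when the stops cannot be ordered with each stop used only once.

```python
from collections import defaultdict, deque


def stop_order(trips):
    """Order stops so that every trip is a subsequence, or return None.

    trips is a list of stop-name lists. None means there is a cycle, so
    at least one stop would have to appear more than once.
    """
    successors = defaultdict(set)
    indegree = {}
    for trip in trips:
        for stop in trip:
            indegree.setdefault(stop, 0)
        for a, b in zip(trip, trip[1:]):
            if b not in successors[a]:
                successors[a].add(b)
                indegree[b] += 1

    queue = deque(stop for stop, degree in indegree.items() if degree == 0)
    order = []
    while queue:
        stop = queue.popleft()
        order.append(stop)
        for nxt in sorted(successors[stop]):  # sorted only for repeatable output
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order if len(order) == len(indegree) else None


print(stop_order([["A", "B", "C", "D"], ["A", "C", "D", "E"]]))
print(stop_order([["A", "B", "C", "B", "D"]]))  # circle-like trip: no such order
```

The second call returns None because B is both before and after C, which is the kind of route that needs the extra handling described next.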

Some common scenarios where this does not hold:

These can be solved by looking at the troubling group (strongly connected component) of stops and their part in the trips. If a stop X only occurs at the starts and ends of these segments, then we have that SCS(segments) = [X] + SCS(segments without X) + [X] is still optimal.

With this we only have 10/2450 routes that cannot be resolved without the DP algorithm. Some of these are flex-busses where you call beforehand and the bus only visits the stops necessary for all passengers.
route_id’s that don’t work: 73439, 73440, 75628, 81304, 81479, 83255, 85819, 86654, 86663, 86688

You can find my code in this gist:


https://gist.github.com/spaanse/9089b4a177449974b969f4027b566911

getlines.py

import pandas as pd
import sys

class union_find:
    def __init__(self, n):
        # a negative entry marks the element as the root of its set
        self.parent = [-1 for i in range(n)]

    def find(self, x):
        if self.parent[x] < 0:
            return x
        else:

(The file preview is truncated here; the full source is in the gist linked above.)

Couple of notes:

1. My code assumes that a trip does not visit the same stop twice in a row. This is not always the case (see your suspicious start/end analysis). While it will not crash, it also does not duplicate the stop in the output. I saw a trip that went from Roosendaal (4a) to Roosendaal (4), which is presumably a merge with another carriage on 4b. For the table to work the output would need to include ... - Roosendaal - Roosendaal - ... but my program currently outputs just ... - Roosendaal - ...
2. My code handles branching routes. However, it does not necessarily keep a branch together. So for the following situation:

A - B - C - D
 \E - F - G - H

The cleanest output is A - B - C - D - E - F - G - H since it keeps the branches B - C - D and E - F - G - H together. But my program could also output A - E - F - B - G - C - D - H or any other mix of the two branches.
3. When stops are split (in the cases of circle routes or offshoots), you sometimes have a choice which of the split stops you include for a trip. The best way to decide this is to look at the segment in the group that caused the split and whether the stop is at the start or the end. If that does not distinguish, look at whether it is at the start or end of the trip. If that does not distinguish it does not matter.
As an example, you could have:

A - B - C - B - D
A - B
            B - D

This would look weird if formatted as

A - B - C - B - D
A --------- B
    B --------- D

spaanse · November 27, 2023, 12:16pm

@ToniE PTNA is not using the latest version of NL-Nationaal.
The newer versions contain the timetables for next year, so they would be useful.
Is it possible to filter on trips that run after 10 December, so that only the new timetables are considered?

ToniE · November 27, 2023, 1:02pm

I downloaded the newest version as of 2023-11-22.

No, but I can filter out all non-relevant routes according to the download date - those where the end_date/“valid until” is before “today” (download day). The earliest start_date in the current feed is 2023-11-21, and all entries have “valid from” = 2023-11-21 and “valid until” = 2024-08-04 though.

On NL-Nationaal, you can sort the table by columns 4 = “valid from” or 5 = “valid until”.
If “valid from” is in the future, the cell is marked with green background.
If “valid until” is in the past, the cell is marked with orange background.
Downloading this table as CSV includes the “valid from” and “valid until” in the 3rd column of the CSV. Maybe this helps?
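For filtering on the GTFS side rather than in PTNA, a minimal sketch could look like the following. It assumes the feed provides a calendar.txt with service_id and end_date (YYYYMMDD) as defined in the GTFS specification; a feed that only uses calendar_dates.txt would need a different approach. The function name and sample rows are invented.

```python
import csv
import io
from datetime import date, datetime


def services_running_from(calendar_csv_text, cutoff):
    """Return the service_ids whose end_date is on or after the cutoff date."""
    keep = []
    for row in csv.DictReader(io.StringIO(calendar_csv_text)):
        end = datetime.strptime(row["end_date"], "%Y%m%d").date()
        if end >= cutoff:
            keep.append(row["service_id"])
    return keep


# Invented sample rows, for illustration only.
SAMPLE_CALENDAR = """service_id,start_date,end_date
s1,20231121,20231209
s2,20231121,20240804
"""

print(services_running_from(SAMPLE_CALENDAR, date(2023, 12, 10)))
```

Trips can then be limited to the remaining service_ids via the service_id column in trips.txt.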