PTNA does not use the Base-Geometry but rather a simplified one, which is larger: see the difference. Will this cause problems?
It is larger so that all potential platform relations of the route relations are inside the search area.
So the data is available as .poly; it’s no issue to also have it in GeoJSON.
PTNA’s nightly run is based on planet dump and extracts rather than overpass queries
The plan would be to start the script in the post-processing of the IL-MOT GTFS handling, separately for each sub-district, and push the results into the OSM-Wiki. Let’s see about the details and where to store the templates (also in the OSM-Wiki?).
I noticed that, but I didn’t find any library that accepts .poly shapes, and I didn’t want to reinvent anything. So it would be easiest to use geojson, so long as that’s easy enough for you. These borders really shouldn’t be changing much (or ever) so it’s probably fine to download them once only.
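For what it’s worth, reading such a GeoJSON border in Python takes only a few lines. A minimal sketch, assuming shapely and a single-feature file (the file name is made up, and this is not the exact code from my script):

import json
from shapely.geometry import shape

# Hypothetical border file for one sub-district; name and layout are assumptions.
with open("tel-aviv-district.geojson", encoding="utf-8") as f:
    gj = json.load(f)

# Turn the first feature's geometry into a shapely shape for point-in-area tests.
border = shape(gj["features"][0]["geometry"])
print(border.geom_type, border.is_valid)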
I really can’t see any problems with the bigger/simplified geometry. The most important thing is that the shape is the same as the one you use to analyze.
It’s great that PTNA took the burden off of overpass but this script doesn’t care at all about OSM data, it only looks at GTFS files.
I have some thoughts on this. We have four options, but I can rule out one of them:
1. What you suggested, with both template and result hosted on the wiki.
2. Only the template would be hosted on the wiki. The script is run before analysis (not GTFS post-processing) and the result is used immediately to analyze. The result CSV doesn’t need to be published anywhere.
3. Only the template would be hosted on the wiki. The script is run as a post-processing step for GTFS analysis, and the result is stored somewhere internally for PTNA analysis.
4. Merge the template and result into one (see below) and use just one wiki page. The script runs as GTFS post-processing.
We can rule out option 3, because changes on the wiki would then not be visible in PTNA until the next GTFS analysis, which is outside the users’ control. If I edit the template, I don’t want to wait a week to see the result.
Option 1 is okay, but if someone wants to update the template (e.g. update the train routes), they would have to update both the template and the result - the template for long-term use, and the result for immediate visibility in PTNA on the next analysis, which they may want to queue right away. If they don’t know about this, they would either lose data or get confused about why their changes aren’t showing up.
Option 2 is the one I was imagining as I wrote the script. The script doesn’t take that long to run - by far the biggest chunk of time is spent iterating over stop_times.txt. It should be feasible to run it as pre-processing for every PTNA analysis. And the run time could probably be massively reduced if we switch to SQL queries as you initially suggested.
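To sketch what I mean (this assumes a plain SQLite import of the feed with the standard GTFS table names, not necessarily PTNA’s actual DB layout), the stop_times step could shrink to a single indexed query:

import sqlite3

# Hypothetical SQLite import of the GTFS feed; table/column names follow the GTFS spec.
conn = sqlite3.connect("gtfs.sqlite")

stop_ids_from_shape = ["12345", "67890"]  # stops found inside the border polygon
conn.execute("CREATE TEMP TABLE area_stops (stop_id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO area_stops VALUES (?)", [(s,) for s in stop_ids_from_shape])

# One query instead of iterating over every line of stop_times.txt.
route_ids = {row[0] for row in conn.execute("""
    SELECT DISTINCT trips.route_id
    FROM stop_times
    JOIN trips USING (trip_id)
    JOIN area_stops ON area_stops.stop_id = stop_times.stop_id
""")}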
Option 4 might look something like this:
== Train
Blue;train;;Jerusalem Malha;Tel Aviv Center Savidor;Israel Railways
Darkblue;train;;Hod HaSharon Sokolov;Rishon LeTsiyyon HaRishonim;Israel Railways
Darkgreen;train;;Beer Sheva;Nahariyya;Israel Railways
Lightgreen;train;;Modiin;Nahariyya;Israel Railways
-
== Light Rail
@light_rail
1;light_rail;;;;תבל;IL-MOT;"34447;34448"
2;light_rail;;;;תבל;IL-MOT;"34445;34446"
3;light_rail;;;;תבל;IL-MOT;"34449;34450"
@@
== Bus
@bus
1;bus;;אור יהודה - מסוף אור יהודה;רמת גן - מגדל אשפוז שיבא;מטרופולין;IL-MOT;"951;952;953;954"
1;bus;;;מסוף קדמה;מטרופולין;IL-MOT;7698
1;bus;;;;דן;IL-MOT;"13428;13429;39167;39168"
# ...
@@
Where the script would replace everything from @bus to @@ with its imported data, and leave everything else intact. This way the CSV is still stored on the wiki, and only in one place. The only thing is you have to make PTNA’s parser aware of @ as a special character and ignore lines that start with it. This is probably my favorite option.
One minor note: I think I’ll change it so that the tags are @bus and @light_rail matching the OSM route tag, instead of @BUS and @LIGHTRAIL. It would be much more consistent. Edit: I’ve done this now. Also changed the example above to match.
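For clarity, the replacement step I have in mind is roughly the following. It is only a sketch; fill_section and its arguments are placeholders, not the real code:

import re

def fill_section(wiki_text, tag, new_lines):
    # Replace everything between a line starting with @<tag> and the next line
    # starting with @@, keeping both marker lines intact.
    pattern = re.compile(rf"^(@{tag}[^\n]*\n).*?^(@@)", re.MULTILINE | re.DOTALL)
    return pattern.sub(lambda m: m.group(1) + "\n".join(new_lines) + "\n" + m.group(2),
                       wiki_text, count=1)

# e.g. page = fill_section(page, "bus", bus_rows), called once per route type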
I got rid of most of the spam. If you’re interested, this is now the output log (on stderr) when run on the Tel Aviv area:
get_shape returned geometry with 1 parts and 749 coordinates
3032 stop_ids out of stop_ids_from_shape
48520 trip_ids out of trip_ids_from_stop_ids
Route has trips with different headsigns: route_id = 23603, destination = חזון איש or גבעת סוקולוב
Route has trips with different headsigns: route_id = 37657, destination = שער מזרחי שיבא or שער דרומי שיבא
2076 route_ids out of route_ids_from_trip_ids
2076 routes out of routes_from_route_ids
Operators breakdown:
647 רכבת ישראל
358 מטרופולין
311 דן
306 אגד
198 קווים
108 אלקטרה אפיקים
46 תנופה
23 נתיב אקספרס
22 דרך אגד עוטף ירושלים
18 בית שמש אקספרס
8 גלים
6 תבל
6 יונייטד טורס
4 סופרבוס
4 נסיעות ותיירות
4 מוניות רב קווית 4-5
3 דן בדרום
2 אודליה מוניות בעמ
2 מוניות מטרו קו
Catalog entry 23006 with route_ids ('2267', '2268', '28384') needs to disambiguate, but has missing destination or multiple destinations for the same direction:
[{'אזור התעשייה', 'בית עלמין'}, {'אזור התעשייה'}]
Catalog entry 21008 with route_ids ('2287',) needs to disambiguate, but has missing destination or multiple destinations for the same direction:
[{'קהילות יעקב'}, set()]
Catalog entry 56001 with route_ids ('7698',) needs to disambiguate, but has missing destination or multiple destinations for the same direction:
[{'מסוף קדמה'}, set()]
About half of those lines are statistics that you could probably do without, the other half is warnings about the data that we can’t do anything about anyway.
It will also point out if there are routes that could not be disambiguated. These are the only cases in the whole country:
ID is still ambiguous: ('3', 'bus', 'אגד', '', 'תחנה מרכזית') (catalog number 12003)
ID is still ambiguous: ('1', 'bus', 'אגד', '', 'מרכז') (catalog number 16001)
ID is still ambiguous: ('1', 'bus', 'אגד', '', 'מרכז') (catalog number 17001)
ID is still ambiguous: ('3', 'bus', 'אגד', '', 'תחנה מרכזית') (catalog number 71003)
Again, not much we can do about that.
Edit: by the way, if I skip processing stop_times.txt the whole script runs in about 2 seconds, so stop_times.txt really is the only bottleneck. With stop_times.txt it takes about 16 seconds.
ptna-network.sh -g … reads the OSM-Wiki (get routes)
ptna-network.sh -p … writes to OSM-Wiki (push routes)
ptna-network.sh -m … modifies the read OSM-Wiki data
-m does not apply here though; it is intended for mass manipulation via sed -i -e …
We will not introduce any new OSM-Wiki pages … only extend existing ones
your script will “simulate” the manual changes to the pages that are required after a GTFS update
We could extend your code to allow more structure - I’m just dreaming …
The regex examples below are for ASCII but could apply to any UTF-8 character,
removing each match from the list of “numbers” still to be printed.
But: many people are not familiar with regex (PCRE)
@bus /^[1-9][0-9]{0,1}[^0-9]*$/
# 1 or 2 digit bus numbers go here (with optionally following letters)
@@
@bus /^1[0-9][0-9][^0-9]*$/
# 3 digit bus numbers 1.. go here for city of Munich (with optionally following letters)
@@
@bus /^2[0-9][0-9][^0-9]*$/
# 3 digit bus numbers 2.. go here for Munich county (with optionally following letters)
@@
@bus /^5[0-9][0-9][^0-9]*$/
# 3 digit bus numbers 5.. go here for Erding county (with optionally following letters)
@@
@bus /^X[1-9][0-9]$/
# Express bus in Munich
@@
@bus /^X[1-9][0-9][0-9]$/
# Express bus outside Munich
@@
@bus LHX
# exact match for a single bus (Lufthansa Express Bus), only if seen in GTFS
@@
@bus /^[A-Za-z]/
# bus "numbers" starting with ASCII letters (except those which matched in rules before)
@@
@bus
# the default, all others which have not been printed yet, like '+900', ...
@@
Yes! I was already thinking of placing bus lines within the region separately from bus lines connecting to other regions. It’s totally possible to make it more flexible and allow regex or other conditions.
Logging is quite useful at any time; you never know in advance what will happen to/with the data.
The stderr output will become part of the logging of the GTFS imports: see the example for FR-PAC-Palmbus, so you can check the logging output.
The import of IL-MOT takes ~40 minutes, so adding 15 × 16 seconds is no issue at all. Post-processing will be started afterwards, e.g. after the “2025-02-03 22:16:06 start normalization” step.
After giving it some thought, I think filters would be something like this:
@bus
@inside=yes
@ref~^1[0-9][0-9][^0-9]*$
# data goes here...
@@
This allows any arbitrary filter, using any properties that are available in the catalog that was previously built. Filters of @field=value require an exact match, while filters of @field~regex would use Python’s re for regex. By putting each filter on a separate line, I avoid having to deal with delimiting, quoting, escaping, and all that nonsense.
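A rough sketch of how I picture the parsing and matching (entry is one catalog item; none of this is final):

import re

def parse_filter(line):
    # "@inside=yes" -> ("inside", "=", "yes"); "@ref~^1[0-9]" -> ("ref", "~", "^1[0-9]")
    m = re.match(r"^@(\w+)([=~])(.*)$", line.strip())
    return m.groups() if m else None

def entry_matches(entry, filters):
    # All filters must hold (AND); only string-valued catalog fields can be filtered.
    for field, op, value in filters:
        actual = entry.get(field)
        if not isinstance(actual, str):
            return False
        if op == "=" and actual != value:
            return False
        if op == "~" and not re.search(value, actual):
            return False
    return True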
Later, it would be pretty easy to separate the “building the catalog from GTFS data” part from the “filling the template with the data from the catalog” part. This would make it possible to adapt to other GTFS feeds, without needing to maintain separate versions of the template filling script.
The structure of the catalog is currently a bit too specially crafted for Israel’s feed, but that will be easy to change so that it’s better suited for adaptation to other feeds.
Unrelatedly, the script name “ptnaGenerate.py” is very much a placeholder; I’m open to suggestions for a better name.
Absolutely great idea. Which properties would I see in the catalogue? Are those properties the field names of the GTFS?
route_short_name aka OSM’s ref
route_id
route_name aka OSM’s name
…
Do I understand this right? Is this an AND filter and what does “inside=yes” mean?
@bus
@inside=yes
@route_short_name~^2[0-9][0-9][^0-9]*$
@route_id~^mvv
# data goes here...
@@
to exclude any 2xx bus which is not maintained by “MVV”, our local ‘network’ here. Their route_id starts with ‘mvv’.
Background: our OSM-Wiki-CSV will be “filled” from 2 GTFS sources, DE-BY-MVG (city) and DE-BY-MVV (region), for the same ‘network’=‘MVV’. There are no conflicting bus numbers at the moment, but: you never know.
Mmh: “gtfs2ptna-routes.py” or “gtfs-to-ptna-csv.py”
It won’t repeat lines that were already printed. I made it repeat the filters in the closing @@ line, but this is purely for the convenience of anyone reading the CSV. Any line starting with @@ still marks the end.
For now this “catalog” is an internal format for this script, where each item contains these keys (only string values can be filtered); a mock entry is sketched after this list:
‘route_type’: ‘bus’ or ‘light_rail’
‘names’: set() of potential ref tags
‘ref’: final ref, deduced from ‘names’
‘id’: tuple of (ref, type, operator, from, to), the minimal set required to uniquely identify the route
‘catalog_number’: number used to identify this route in various Ministry of Transportation systems (Israel-specific)
‘routes’: array of GTFS route objects (dicts parsed from GTFS) that belong to this entry
‘operator’: operator, as it should appear in OSM
‘to_from’: array of the form [set(), set()], used when disambiguating
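To make that concrete, a single entry currently looks roughly like this (all values below are invented for illustration):

entry = {
    "route_type": "bus",
    "names": {"1"},                           # potential ref values seen in the feed
    "ref": "1",
    "id": ("1", "bus", "דן", "", ""),         # (ref, type, operator, from, to)
    "catalog_number": 10001,                  # Israel-specific MoT number (made up here)
    "routes": [{"route_id": "13428", "route_short_name": "1"}],  # raw GTFS route dicts
    "operator": "דן",
    "to_from": [set(), set()],                # destinations per direction, for disambiguation
}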
This is subject to change to be more useful or standardized; this is what I meant when I said:
It would totally be possible to make it include GTFS fields as well. It can also have different fields for different GTFS feeds, such as Israel’s ‘catalog_number’.
Yes, this is an AND filter. “inside=yes” was me musing about adding that field to the catalog format that I defined above, to identify routes that are wholly contained within the region of interest, as opposed to routes that connect this region to other regions of the country. Not currently coded up, but should be easy to add.
The filter you suggested should work just fine, as long as the catalog has a ‘route_id’ key. But I think it would include only routes maintained by MVV, instead of excluding them. Should I add negation? I.e.:
@bus
@inside=yes
@route_short_name~^2[0-9][0-9][^0-9]*$
@route_id!~^mvv
# data goes here...
@@
To exclude MVV routes.
Sure. Neither great nor terrible, these options will certainly be adequate.
Some other very strange things can be seen also, but that would probably require SQL based on PTNA’s adapted, aggregated and analyzed GTFS-DB:
DE-BY-MVG has 3 entries for the subway U1 with nearly the same route_ids 1-U1-G-015-2, 1-U1-G-015-3, 1-U1-G-015-4, differing only in the last part (2, 3, 4): that’s the ‘version’, and the three entries have different valid-from and valid-to periods.
As a mapper, I would like to see only the one route_id with the longest validity period, 1-U1-G-015-2, not those short-term variants reflecting construction works (having higher version numbers?).
I think those 3 entries are not in line with what “GTFS best practice” suggests, but yes: we can still find many such feeds (I can tell by the structure of the route_id which company’s software created them).
DE-BY-MVV GTFS did the same in the past but changed towards “the good”: single entries.
Maybe DE-BY-MVG will follow their example?
Another example is DE-SN-VMS, which has the route_short_name/ref ‘A’ 10 times, where the CSV ‘operator’ = ‘agency’ does not help to distinguish them and the ‘route_name’ does not help to derive valid CSV ‘from’ and ‘to’ values to disambiguate the 10 lines.
Yeah: you’ll always find tricky edge-cases, but hey: we still have manual maintenance for the CSV
I wanted to include only MVV routes, so route_id~^mvv was just OK. All others would be part of the default and final ‘@bus’ statement.
Having ‘route_id’ in the ‘catalog’ would be great. Those 10 occurrences of ‘A’ for DE-SN-VMS can easily be addressed by @bus and route_id~^43-LOA, route_id~^44-MAA and so on.
Once the gtfs->catalog step is separated from the catalog->csv step, you could customize the gtfs->catalog code for each individual feed. Maybe IL-MOT would keep using raw GTFS files with the code I already wrote, while DE-SN-VMS would use SQL queries to build the catalog. And then DE-SN-VMS can add the route_id field, while IL-MOT won’t (because there’s no 1:1 correlation in IL-MOT).
Right, I misread that, but I’m adding negation operators anyway.
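Picking up the filter sketch from earlier, the operator handling would then cover four cases (again, just a sketch; the parser would also have to accept the two extra operators, e.g. with r"^@(\w+)(!?[=~])(.*)$"):

import re

def op_matches(op, actual, value):
    # '=' exact match, '~' regex match; '!=' and '!~' are the negated forms.
    if op == "=":
        return actual == value
    if op == "~":
        return re.search(value, actual) is not None
    if op == "!=":
        return actual != value
    if op == "!~":
        return re.search(value, actual) is None
    raise ValueError(f"unknown filter operator: {op}")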
Well, the good news is, everything seems to be working as it should. The bad news is, Codeberg is down (it seems it’s being DDoSed), so I can’t push my code. This is awful timing for me personally.
Edit: it seems to be back up, though it’s spotty.
Oh well. I’ll start work on giving the catalog a more standard format.
I’ve been thinking about how best to separate the two halves. There are two options:
1. The gtfs->catalog code is written as a Python module, which would be provided as an argument to the catalog->ptna script. The latter script would import it and call a function there. I don’t like this: I looked up how to import a Python module dynamically (based on arguments), and it’s not pretty and might be a pitfall.
2. Two entirely separate scripts: gtfs->catalog creates a .json file with the entire catalog, and catalog->ptna reads this JSON file. This has a few advantages, with the minor downside of having to create the intermediate JSON file.
Overall, though, I think 2 is the better option. It would mean other gtfs->catalog scripts can use any language they want, as long as they can save a JSON file. So this is what I’m going with.
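The intermediate file would just be the catalog dumped as JSON. A sketch of the save/load pair (function names are placeholders; sets need converting since JSON has no set type, while tuples simply become arrays):

import json

def save_catalog(catalog, path):
    def encode(obj):
        if isinstance(obj, set):
            return sorted(obj)  # JSON has no sets; store them as sorted arrays
        raise TypeError(f"not JSON serializable: {type(obj)}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(catalog, f, ensure_ascii=False, indent=2, default=encode)

def load_catalog(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)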