Best way to merge external data into an OSM pbf

tordans · May 16, 2025, 9:11am

What are your recommendations on how to merge external data into an OSM pbf file?

My use case:

I have some data like a CSV with osm_id, osm_type, additional_tags
I want to add those additional_tags to an OSM PBF
I would use this file with osm2pgsql to make decisions in Lua based on the OSM tags and the additional tags

I found that BRouter has a process for this that adds “pseudo tags” before the PBF is imported into BRouter: Die Berechung der Welt, Generierung von Pseudo-Tags für den BRouter - media.ccc.de (Is that what they do? How?)

I found I could run a full osm2pgsql run, then add tags in SQL, then export this into a PBF again from Postgres following How can we export data from PostgreSQL database into *.osm.pbf? - #2 by Richard. I was hoping for a process that is a bit more direct and maybe faster?

I could probably create artificial changesets and apply those with osmium: Osmium manual pages – osmium-apply-changes (1). That would probably work. (Any examples of this in action?) I assume those artificial changesets would then become the last change for those objects (changing the last edit user, last edit timestamp, version), which would not be ideal (but maybe OK…)

Jochen_Topf · May 16, 2025, 11:36am

Creating artificial changes will probably not work (easily) because you presumably want to keep the existing tags. But there are easier options anyway: One is to use PyOsmium: write a small script that reads the CSV file into memory and then goes through all the OSM data and adds the relevant tags. Another would be to hack something with the OPL file format, which all osmium tools can read and write. But you have to make sure you don’t get any duplicate keys that way.
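As a rough illustration of the OPL idea, here is a minimal sketch that merges extra tags into the tag field of one OPL line. The function name `merge_opl_tags` is made up for this example. It assumes that the tag field is the space-separated field starting with `T`, that tags inside it look like `key=value` joined by commas, and that the new keys and values contain no characters that OPL would need to escape (spaces, commas, `=`, `%`). Going through a dict is what prevents duplicate keys.

```python
def merge_opl_tags(line: str, extra: dict[str, str]) -> str:
    """Return the OPL line with `extra` merged into its T (tags) field.

    Hypothetical helper: assumes keys and values need no OPL escaping.
    """
    fields = line.rstrip("\n").split(" ")
    for i, field in enumerate(fields):
        if field.startswith("T"):  # uppercase T = tags, lowercase t = timestamp
            # Parse the existing tags; an empty T field yields an empty dict.
            tags = dict(kv.split("=", 1) for kv in field[1:].split(",") if kv)
            # Values from the CSV win, and a dict cannot hold a key twice.
            tags.update(extra)
            fields[i] = "T" + ",".join(f"{k}={v}" for k, v in tags.items())
            break
    return " ".join(fields)


line = "n1 v1 dV c0 t2020-01-01T00:00:00Z i0 u Thighway=crossing,name=A x1.0 y2.0"
print(merge_opl_tags(line, {"name": "B", "surface": "asphalt"}))
```

A real script would look up `extra` by the type and id at the start of each line, and would have to handle escaping properly before this is safe for arbitrary tag values.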

lonvia · May 16, 2025, 11:43am

You can do this with pyosmium. It works similarly to what is described in the cookbook Adding Relation Information to Member Ways - Pyosmium 4.0.2 :

Load your CSV into three dictionaries, one for each OSM type with osm_id as key and your additional tags as value. Then follow the second part of the cookbook: create a writer for your output file, open a reader for your input file, add three id filters (one for each OSM type) and a `handler_for_filtered(writer)`. Inside the loop, merge the tags and write the new object.

Without having tried it, the code should roughly look like this:

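import csv
import osmium

# Extra tags per OSM type, keyed by a lowercase one-letter type code (n, w, r).
extra = {'n': {}, 'w': {}, 'r': {}}

with open('extra.csv', newline='') as fd:
    for row in csv.DictReader(fd):
        # Assumption: additional_tags holds 'key=value' pairs separated by ';'.
        # Adapt this to however your CSV encodes the tags.
        tags = dict(kv.split('=', 1) for kv in row['additional_tags'].split(';') if kv)
        # Object ids are integers in pyosmium, so convert the CSV string.
        # Taking the first letter accepts both 'n' and 'node' style values.
        extra[row['osm_type'][0].lower()][int(row['osm_id'])] = tags

with osmium.SimpleWriter('myoutput.pbf', overwrite=True) as writer:
    # Objects whose id is not in the CSV go straight to the writer, unchanged.
    fp = osmium.FileProcessor('myinput.pbf')\
               .with_filter(osmium.filter.IdFilter(extra['n'].keys()).enable_for(osmium.osm.NODE))\
               .with_filter(osmium.filter.IdFilter(extra['w'].keys()).enable_for(osmium.osm.WAY))\
               .with_filter(osmium.filter.IdFilter(extra['r'].keys()).enable_for(osmium.osm.RELATION))\
               .handler_for_filtered(writer)

    # Only objects listed in the CSV reach this loop (nodes, ways and relations).
    for obj in fp:
        tags = dict(obj.tags)
        # Extra tags override existing ones with the same key.
        tags.update(extra[obj.type_str().lower()][obj.id])
        writer.add(obj.replace(tags=tags))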

Jochen_Topf · May 16, 2025, 11:51am

Or, I forgot a solution that might make more sense in your case: Read the CSV from the Lua file first thing. Then merge the data in on the fly.

tordans · May 22, 2025, 7:51pm

Arndt (BRouter) also replied to me and explained how this is solved in BRouter. I'm copying it here for documentation:

We don't do this at the level of the PBF file, but one step further along the processing chain, once the tags for a way ID are already available as a key-value map.

The following Java class reads the pseudo tags from a CSV into an in-memory map and then adds them to the existing tags in a key-value map:

brouter/brouter-map-creator/src/main/java/btools/mapcreator/DatabasePseudoTagProvider.java at master · abrensch/brouter · GitHub

tordans · June 24, 2025, 7:07pm

@Jochen_Topf That’s the approach we’re using right now — reading the CSV from the Lua file and merging on the fly. (I’m planning to write more about that soon …)

One question we have: Once the CSV becomes large, we’re thinking about splitting it into multiple files. For that to work efficiently, we’d need to know whether osm2pgsql processes osm objects in a deterministic order for a given PBF file.

Maybe this is something you know off the top of your head: Is the processing order of objects (e.g. nodes) deterministic for each run on the same input file?

Jochen_Topf · June 25, 2025, 7:19am

Is the processing order of objects (e.g. nodes) deterministic for each run on the same input file?

Yes it is. Objects are always processed in the order they are in the input file(s).
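Given that guarantee, one way to prepare the split is to sort the CSV rows into the same order as the objects in the input file and cut them into chunks, so that each chunk covers one contiguous stretch of the import. The sketch below is only an illustration: `split_rows` and `TYPE_ORDER` are made-up names, and it assumes the PBF is sorted in the usual way (nodes, then ways, then relations, each by ascending id), which is common but should be checked for your file.

```python
import csv
import io

# Assumed file order: all nodes, then all ways, then all relations.
TYPE_ORDER = {"n": 0, "w": 1, "r": 2}


def split_rows(rows, chunk_size):
    """Sort CSV rows by (type, id) and cut them into chunks of chunk_size."""
    ordered = sorted(
        rows,
        key=lambda r: (TYPE_ORDER[r["osm_type"][0].lower()], int(r["osm_id"])),
    )
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]


# Tiny in-memory CSV standing in for the real file.
data = "osm_id,osm_type,additional_tags\n5,w,a=1\n7,n,b=2\n2,n,c=3\n"
rows = list(csv.DictReader(io.StringIO(data)))
for number, chunk in enumerate(split_rows(rows, 2)):
    print(number, [(r["osm_type"], r["osm_id"]) for r in chunk])
```

Each chunk could then be written to its own file, and the Lua side would only need to load the next file once an object's type and id pass the last entry of the current one.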