I would like the planet.pbf file published at https://planet.openstreetmap.org/ to include an index of blob offsets inside the .pbf itself.
Backstory
I made this tool, https://nightwatch.karlas.si/, which tracks coastlines and administrative boundaries worldwide and updates every minute. Because I wanted it to run on a single machine with minimal hardware requirements, I decided to keep the .pbf in place, extract the coastlines and boundaries, and then keep fetching .osc files while storing only the relevant data in an LMDB key-value lookup database. That database takes around 20GB, so about 100GB of storage is needed in total, including the .pbf. Extracting the whole .pbf into LMDB would take 500GB+.
A problem appears when a way or relation references a node that is outside the LMDB database: I then need to look it up in the .pbf, and that takes a long time if you have to parse blob by blob to reach the blob containing that node. So I came up with a file called planet.pbf.index.
planet.pbf.index
This is a simple file with the following structure:
INT32 size_of_nodes_table
ARRAY<(INT64 FirstNodeId, INT64 FileOffset)> nodes_table
INT32 size_of_ways_table
ARRAY<(INT64 FirstWayId, INT64 FileOffset)> ways_table
INT32 size_of_relations_table
ARRAY<(INT64 FirstRelationId, INT64 FileOffset)> relations_table
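To make the layout concrete, here is a minimal sketch of writing and reading such a file in Python. It is not the tool's actual code. It assumes little-endian byte order, that each size field holds the number of entries (not bytes), and that the tables appear in the order nodes, ways, relations; the function names are made up for illustration.

```python
import io
import struct

# Assumption: little-endian. "<i" is INT32, "<qq" is a pair of INT64 values.
COUNT = struct.Struct("<i")   # table size, assumed to be an entry count
ENTRY = struct.Struct("<qq")  # (first element ID in blob, file offset of blob)


def write_index(stream, tables):
    """Write the tables (nodes, ways, relations) one after another."""
    for table in tables:
        stream.write(COUNT.pack(len(table)))
        for first_id, file_offset in table:
            stream.write(ENTRY.pack(first_id, file_offset))


def read_index(stream):
    """Return [nodes_table, ways_table, relations_table] as lists of tuples."""
    tables = []
    for _ in range(3):
        (count,) = COUNT.unpack(stream.read(COUNT.size))
        data = stream.read(count * ENTRY.size)
        tables.append(list(ENTRY.iter_unpack(data)))
    return tables


# Round trip through an in-memory buffer with made-up IDs and offsets.
example_tables = [
    [(1, 200), (8000, 4_000_000)],  # nodes
    [(10, 9_000_000)],              # ways
    [(3, 12_000_000)],              # relations
]
buffer = io.BytesIO()
write_index(buffer, example_tables)
buffer.seek(0)
loaded_tables = read_index(buffer)
```

With fixed-size entries, the whole index can be loaded with a handful of reads and searched in memory.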
With planet-250303.osm.pbf this file is about 740KB, and it keeps growing slowly. Stats for planet-250303.osm.pbf:
Blobs are around 4MB each. There are 30,711 node blobs, 16,290 way blobs and 416 relation blobs.
It allows binary search lookups and parallel processing of .pbf files.
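As a quick sanity check on these numbers, the structure above implies 16 bytes per entry (two INT64 values) plus one 4-byte INT32 per table. The short calculation below assumes exactly one entry per blob, which the post does not state explicitly.

```python
# Blob counts for planet-250303.osm.pbf, as quoted above.
node_blobs = 30_711
way_blobs = 16_290
relation_blobs = 416

entry_bytes = 8 + 8        # INT64 first ID + INT64 file offset
size_field_bytes = 4       # one INT32 per table

total_entries = node_blobs + way_blobs + relation_blobs
total_bytes = 3 * size_field_bytes + total_entries * entry_bytes
total_kib = total_bytes / 1024

print(total_entries, total_bytes, round(total_kib, 1))
```

Under that assumption the result is 758,684 bytes, or roughly 741 KiB, which lines up with the file size quoted above.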
For example, I have a helper method, CalculateFileOffsets, that accepts an osmType and a list of elementIds and returns a list of file offsets together with the IDs contained in each blob. Multiple threads can then be kicked off, each focusing on just one blob and knowing exactly which IDs it needs to pull out. Since each blob is around 4MB, this is a pretty quick operation.
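A lookup like this could be sketched roughly as follows in Python. This is not the actual CalculateFileOffsets implementation, only an illustration of the idea. It assumes the table for the requested type is sorted by first ID, and the `worker` callback that would read and decode one blob is hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def calculate_file_offsets(table, element_ids):
    """Group element IDs by the file offset of the blob that may contain them.

    table: list of (first_id, file_offset), assumed sorted by first_id.
    Returns {file_offset: [element IDs to look for in that blob]}.
    """
    first_ids = [first_id for first_id, _ in table]
    wanted = defaultdict(list)
    for element_id in sorted(element_ids):
        # Last blob whose first ID is <= element_id.
        index = bisect_right(first_ids, element_id) - 1
        if index < 0:
            continue  # smaller than the first ID in the file, cannot be present
        wanted[table[index][1]].append(element_id)
    return dict(wanted)


def process_in_parallel(groups, worker, max_workers=4):
    """Call worker(file_offset, ids) for each blob on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda item: worker(*item), groups.items()))


# Made-up table and IDs for illustration.
nodes_table = [(1, 100), (1000, 5000), (2000, 9000)]
groups = calculate_file_offsets(nodes_table, [5, 1500, 999, 2500, 0])
# Stand-in worker: a real one would seek to the offset and decode the blob.
results = process_in_parallel(groups, lambda offset, ids: (offset, len(ids)))
```

An ID larger than anything in the file still maps to the last blob; the worker simply will not find it there.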
Other uses
While working with planet.pbf I also noticed that this would be super useful for things like Apache Spark, where it would allow processing to start immediately on multiple machines: the driver in a Spark cluster would parse the index and could right away dispatch executors, sending each one the offsets it needs to parse. That would make PBF a “cloud ready” format.
What is next
I have been thinking about doing something about this for some time but didn’t know where to start, and I wanted community feedback, hence I’m finally opening this topic to see what others think. I also welcome proposals on how this could be implemented in a backward-compatible manner. Probably a new blob type, OSMOffsets, will need to be added alongside the existing OSMHeader and OSMData.