Adding blobs binary search index to PBF

I would like the https://planet.openstreetmap.org/ planet.pbf file to include a blob offsets index inside the .pbf.

Backstory

I made this tool https://nightwatch.karlas.si/ that tracks coastlines and administrative boundaries worldwide, updating minutely. Because I wanted it to run on a single machine with minimal hardware requirements, my idea was to keep the .pbf in place, extract coastlines and boundaries, keep fetching .osc files, and keep the relevant data in an LMDB key-value lookup database. That takes around 20GB, so in total 100GB of storage is needed including the .pbf. Extracting the whole .pbf into LMDB would take 500GB+.
But a problem appears if some way or relation references a node outside the LMDB database, which means I need to look back into the .pbf. That takes a long time if you need to parse blob by blob to get to the blob containing that node. So I came up with a file called planet.pbf.index.

planet.pbf.index

This is a simple file with the following structure:

INT32 size_of_nodes_table
ARRAY<(INT64 FirstNodeId, INT64 FileOffset)> nodes_table
INT32 size_of_ways_table
ARRAY<(INT64 FirstWayId, INT64 FileOffset)> ways_table
INT32 size_of_relations_table
ARRAY<(INT64 FirstRelationId, INT64 FileOffset)> relations_table

This file is 740KB with planet-250303.osm.pbf and keeps growing slowly. planet-250303.osm.pbf stats:
Blobs are around 4MB each. There are 30,711 node blobs, 16,290 way blobs and 416 relation blobs.

What it allows is binary search lookup and parallel processing of .pbf files.

For example, I have a helper method CalculateFileOffsets that accepts an osmType and elementIds and returns a list of file offsets together with the IDs contained in each blob. Multiple threads can then be kicked off, each focusing on just one blob and knowing exactly which IDs it needs to pull out. Since each blob is 4MB, this is a pretty quick operation.
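The lookup itself is just a binary search over the sorted first-ID column. A sketch of what a method like CalculateFileOffsets could do (Python for brevity; the name and return shape follow the post, the code itself is illustrative):

```python
import bisect
from collections import defaultdict

def calculate_file_offsets(table, element_ids):
    """table: sorted list of (first_id, file_offset), one entry per blob.
    Returns {file_offset: [ids expected in that blob]}."""
    first_ids = [first_id for first_id, _ in table]
    groups = defaultdict(list)
    for eid in sorted(element_ids):
        # index of the last blob whose first_id <= eid
        i = bisect.bisect_right(first_ids, eid) - 1
        if i >= 0:
            groups[table[i][1]].append(eid)
    return dict(groups)
```

Each key of the result can then be handed to its own thread, which seeks to that offset and scans a single blob for its ID list.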

Other uses

While working with planet.pbf I also noticed that this would be super useful in things like Apache Spark, allowing instant processing on multiple machines: the driver in a Spark cluster would parse the index and immediately be able to dispatch executors, sending each the offsets it needs to parse, making PBF a “cloud ready” format.
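The dispatch step is cheap because the index already lists every blob boundary. A driver could, for instance, split the table into contiguous slices, one per executor (plain Python sketch, not actual Spark API):

```python
def partition_blobs(table, num_workers):
    """Split a list of (first_id, file_offset) entries into contiguous,
    roughly equal chunks so each worker reads a sequential file region."""
    size, rem = divmod(len(table), num_workers)
    chunks, start = [], 0
    for w in range(num_workers):
        end = start + size + (1 if w < rem else 0)
        chunks.append(table[start:end])
        start = end
    return chunks
```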

What is next

I kept thinking about doing something about this for some time, didn’t know where to start, and wanted to see community feedback, hence I’m finally opening this to see what others think. I also welcome proposals on how this could be implemented in a backward-compatible manner. Probably a new blob type, OSMOffsets, would need to be added alongside the existing OSMHeader and OSMData.


I might be getting this wrong, but if this is supposed to be a permanent addition to OSM-data-containing PBF files, it needs to be able to hold the indices of all elements; in other words, those tables need to be able to hold INT64 elements. But more importantly, for larger extracts nodes_table will be roughly the same size as all the node data itself.

Hi David,

Years ago, we ran into a similar struggle, and as a result ended up creating GeoDesk. Have a look, it may address your use case as well. In brief, GeoDesk is a spatial database engine designed specifically for OSM data. It stores the entire planet in about 100 GB (as a GOL – a Geographic Object Library), and provides a wide range of ultra-fast queries. Retrieval of relation members takes under a microsecond.

Currently, GOLs cannot be updated (you’ll have to rebuild from a fresh .osm.pbf, which for the planet takes under an hour on a modern machine). Updating incrementally from .osc files will be possible in V2, out later this year.

We don’t have a C# toolkit yet, but the C++ API is straightforward; you should be able to wrap it as unmanaged code. If there’s enough demand, we’ll create a C API, which would make foreign-function integration easier. There are also Python and Java APIs if you’re considering porting Nightwatch.

Going back to your proposal for PBF indexing: I like the compactness of your index; it imposes only a minuscule amount of overhead. However, you’ll still have to uncompress the blob and scan it for the desired feature (PBF parsing is relatively slow because of poor branch prediction). You could mitigate this by caching recently accessed blobs (each along with an index of its features, built on demand as the blob is scanned for the first time). The members of a relation often have IDs in the same range, so there’s a good chance they’re located in the same blob.
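That caching idea can be sketched as a small LRU cache keyed by blob offset (Python for illustration; load_blob here is a hypothetical function that decompresses one blob and builds its id-to-element index):

```python
from collections import OrderedDict

class BlobCache:
    """Tiny LRU cache: keeps the last `capacity` decompressed blobs,
    each stored as an {element_id: element} index built on first access."""

    def __init__(self, load_blob, capacity=16):
        self.load_blob = load_blob   # offset -> {element_id: element}
        self.capacity = capacity
        self.cache = OrderedDict()   # offset -> index dict, LRU order

    def get(self, offset, element_id):
        if offset not in self.cache:
            # decompress and index the blob once; reused afterwards
            self.cache[offset] = self.load_blob(offset)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        else:
            self.cache.move_to_end(offset)
        return self.cache[offset].get(element_id)
```

Resolving all members of one relation then decompresses each touched blob once instead of once per member.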

In contrast, a GOL is about 30% to 50% larger than the PBF from which it was built, but lookups are essentially instantaneous – a classic time/space tradeoff.

There have been discussions in the past to add spatial indexing to the PBF format, but nothing concrete has been proposed (GOLs incorporate spatial indexing, as well).

It’d be great to hear more about your ideas; I’m excited about anything that makes OSM data easier to consume, analyze and improve.


My understanding: He’s indexing the blobs, not each node individually. Since the nodes in a PBF are sorted by ID, he can retrieve the blob in which a particular node is stored via binary search on this relatively tiny index.

Yes, that makes sense.

Two random things come to mind here:

  1. In tilemaker when reading PBF files @cldellow is building a kind of index on the fly. Have a look at
    Alternative protobuf library? · systemed/tilemaker · Discussion #621 · GitHub.
  2. Look at this proposal which would create larger blocks making your index less efficient: Pbf-reblob: Reduce PBF file sizes (without loosing data)

INT32 size_of_nodes_table is just the number of blobs, i.e. how many (INT64 FirstNodeId, INT64 FileOffset) pairs you should read, which is currently 30,711 for nodes.

On the topic of GOL: I saw it before, I even played with it, but the lack of updating with .osc was a blocker for adoption.

But even GOL conversion from .pbf would probably benefit from this, since it would allow immediately starting multithreaded extraction of the PBF.

Regarding performance vs. a tool like GOL: it will never reach such performance, but for example executing code like this:

var sw = Stopwatch.StartNew();
var index = PbfIndexBuilder.LoadIndex(path);
foreach (var node in NodesParser.LoadNodes([
        2389124702,
        5398559205,
        6453678807,
        6578673562,
        7004290281,
        7294954845,
        7856041912,
        7964450430,
        8983947116,
        9376228339
], index))
{
    Console.WriteLine(node.Tags);
}
Console.WriteLine(sw.Elapsed);

Takes 100 milliseconds, which is acceptable for most use cases…

@Jochen_Topf I build this index the first time I see a .pbf without a .pbf.index next to it. That takes around 80 seconds because of the 1GB/s limit of the hard drive, so it’s not the end of the world, but it would be nice if running tools like osmium on a freshly downloaded planet.pbf could show random node content in under 1 second or so.
Regarding increasing the size of blobs: yes, it would make this a bit less effective, but I believe it would still be much faster than without an index.

Some possible improvements:

  • Use INT64 instead of INT32 for the index sizes – that way, if you mmap your index file, the IDs and offsets (which are both INT64) are properly aligned

  • Consider splitting the IDs and offsets into separate arrays (May improve cache locality for the binary search)

  • If you store the index as a PBF blob, encode IDs and offsets as varints representing positive deltas
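The last point can be made concrete with a toy encoder (Python; unsigned LEB128 varints as used by protobuf, assuming each column of the table is non-decreasing so the deltas stay non-negative):

```python
def encode_varint(n, out):
    # unsigned LEB128 / protobuf-style varint for a non-negative integer
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return

def encode_deltas(values):
    """Delta-encode a non-decreasing int sequence, then varint-pack it."""
    out = bytearray()
    prev = 0
    for v in values:
        encode_varint(v - prev, out)
        prev = v
    return bytes(out)
```

Since consecutive blobs have nearby first IDs, and offsets grow by roughly one blob size per entry, the deltas are small and most 8-byte fields should shrink to a few bytes each.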