Pbf-reblob: Reduce PBF file sizes (without losing data)

While reading through the PBF format specification, I noticed that the uncompressed blob size limits of 16 MiB (soft) / 32 MiB (hard) are not utilized by PBF files generated with the osmium tool. These files simply contain ~8k OSM entities per blob, which usually results in uncompressed blob sizes of around 100 kiB.
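For reference, the blob sizes are easy to inspect: each file block consists of a 4-byte big-endian length, a BlobHeader, and the Blob itself, whose raw_size field reports the uncompressed payload size. Here is a minimal sketch in Go, assuming types generated from the official fileformat.proto (the import path below is a placeholder, not a real package):

```go
// Sketch: walk the blob framing of a PBF file and print each blob's
// uncompressed (raw_size) payload size.
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"

	"google.golang.org/protobuf/proto"

	"example.com/fileformat" // hypothetical: protoc-gen-go output for fileformat.proto
)

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var lenBuf [4]byte
	for {
		// Each file block starts with a 4-byte big-endian length of the BlobHeader.
		if _, err := io.ReadFull(f, lenBuf[:]); err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		headerBytes := make([]byte, binary.BigEndian.Uint32(lenBuf[:]))
		if _, err := io.ReadFull(f, headerBytes); err != nil {
			log.Fatal(err)
		}
		var header fileformat.BlobHeader
		if err := proto.Unmarshal(headerBytes, &header); err != nil {
			log.Fatal(err)
		}

		// The Blob follows; its raw_size field is the uncompressed size.
		blobBytes := make([]byte, header.GetDatasize())
		if _, err := io.ReadFull(f, blobBytes); err != nil {
			log.Fatal(err)
		}
		var blob fileformat.Blob
		if err := proto.Unmarshal(blobBytes, &blob); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-15s raw_size=%d bytes\n", header.GetType(), blob.GetRawSize())
	}
}
```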

Because each blob contains its own string table, smaller blobs mean more duplicated strings across a PBF file. To see how much space can be saved by increasing blob sizes, I wrote pbf-reblob. Although the results were not as great as I had hoped, a few percent can usually still be saved on small extracts (larger extracts don’t fare as well). Here is a small sample of my test results:

Although the savings are not massive, I think the demo still shows that there is room for improvement in the common tools when it comes to blob sizes. After all, the savings come (almost) for free.
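To make the string-table point concrete: conceptually, combining blocks boils down to merging their string tables and remapping the old indices, which is where the duplicate strings disappear. A minimal sketch (not the actual pbf-reblob code):

```go
package reblob

// mergeStringTables merges the string table of block B into that of block A
// and returns, for every old index in B, its new index in the merged table.
// Strings that already exist in A are not stored again.
func mergeStringTables(a, b []string) (merged []string, remapB []int) {
	index := make(map[string]int, len(a)+len(b))
	for i, s := range a {
		index[s] = i
	}
	merged = append(merged, a...)

	remapB = make([]int, len(b))
	for i, s := range b {
		j, ok := index[s]
		if !ok {
			j = len(merged)
			merged = append(merged, s)
			index[s] = j
		}
		remapB[i] = j // every key/value index from block B becomes remapB[old]
	}
	return merged, remapB
}
```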

Additionally, I was happy to see that zstd compression seems to work better with larger blob sizes and was able to further reduce the file size without hurting parsing performance.

Interesting analysis. I don’t remember where the 8k OSM objects thing came from; I believe Osmosis did it that way and I just did the same in Osmium.

This looks like an easy change we could make to save some space. But, as you mention, there are other issues to keep in mind here. Memory use when encoding and decoding is one; another is the effectiveness of multithreading, which might actually improve with larger blocks. Backwards compatibility might also be an issue: theoretically a change like this should not trip up readers, but experience shows that not every implementation actually follows the spec; some just look at what’s out there and can only read that.

There is one issue that concerns me the most: when creating PBF files, it is much easier to just create blocks with small numbers of objects and not max out the theoretically available space, because you must never go beyond the limit. If you would, you’d have to start a new block, and so on. This logic has to be implemented, which is not straightforward, especially when done in multiple threads, because another thread may already be working on the next block.
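Roughly, the bookkeeping being described is: fill a block until the next entity would push it past the limit, then flush and start a new one. A sketch, where Entity and encodedSize are illustrative placeholders rather than a real API:

```go
package reblob

// Entity and encodedSize are placeholders for illustration only.
type Entity struct {
	Keys, Vals []string
}

func encodedSize(e Entity) int {
	n := 16 // rough fixed per-entity overhead (illustrative only)
	for i := range e.Keys {
		n += len(e.Keys[i]) + len(e.Vals[i])
	}
	return n
}

const softLimit = 16 << 20 // 16 MiB uncompressed (the spec's soft limit)

// fillBlocks groups entities greedily: when adding the next entity would push
// the running size estimate past the limit, the current block is flushed
// (e.g. handed to a compression worker) and a new block is started.
func fillBlocks(entities []Entity, flush func(block []Entity)) {
	var block []Entity
	size := 0
	for _, e := range entities {
		sz := encodedSize(e)
		if size+sz > softLimit && len(block) > 0 {
			flush(block)
			block, size = nil, 0
		}
		block = append(block, e)
		size += sz
	}
	if len(block) > 0 {
		flush(block)
	}
}
```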

What would actually help us much more, and is much easier to implement, is moving from gzip blocks to zstd blocks: not only in size but also in compression/decompression time. But that would be incompatible with existing installations.

I understand your concerns about the implementation. In my pbf-reblob tool I chose to decide in a single thread how many OSM entities go into one block, and only encode and compress the resulting blocks in parallel. Unfortunately, determining the size of a block is somewhat computationally expensive, so I think doing this in a single thread is indeed a slight performance bottleneck.
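Schematically, that pipeline shape looks something like the sketch below (Block and compress are placeholders, not the pbf-reblob API). Note that this version does not preserve block order; a real tool has to reorder the blobs before writing them out.

```go
package pipeline

import "sync"

// Block and compress are placeholders for the real encode+compress step.
type Block struct{ Raw []byte }

func compress(b Block) []byte { return b.Raw }

// run starts `workers` goroutines that compress blocks handed to them over a
// channel; the block boundaries are decided by whoever feeds the input channel
// (a single goroutine in this design).
func run(blocks <-chan Block, workers int) <-chan []byte {
	out := make(chan []byte)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range blocks { // each worker pulls ready-made blocks
				out <- compress(b)
			}
		}()
	}
	go func() { wg.Wait(); close(out) }()
	return out
}
```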

However, maybe a simpler solution could be implemented as a first step: I have noticed that the block sizes depend strongly on the entity types. In a quick search I found blocks containing ways with up to 2832 kiB (raw) size and blocks containing relations with up to 2824 kiB, but blocks containing dense nodes reached at most 501 kiB. Maybe we could simply say that blocks with dense nodes are allowed to contain 6x as many entities as other blocks, or something similar. This should have a big effect, since most entities in PBF files are usually stored as dense nodes. If we also increase the number of entities per block overall, the numbers could look something like:

  • 16k entities for blocks containing nodes, ways or relations.
  • 96k entities for blocks containing dense nodes.

If further analysis shows that blocks are generally no larger than 4 MiB raw, even 32k/192k entities could be reasonable.

Of course this might result in invalid PBF files in custom use cases where entities have an unusually large number of tags, but if Osmium had a flag for configuring the number of entities per block, those problems could be circumvented.
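As a sketch, the proposed heuristic plus the suggested override flag would amount to something like this (the caps are the hypothetical numbers from above):

```go
package reblob

// entitiesPerBlock returns the maximum number of entities to pack into a block
// of the given type. A positive override (e.g. from a command-line flag) wins.
func entitiesPerBlock(blockType string, override int) int {
	if override > 0 {
		return override
	}
	switch blockType {
	case "dense": // dense nodes are small per entity, so allow many more of them
		return 96_000
	default: // plain nodes, ways, relations
		return 16_000
	}
}
```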

PS: I had previously written a tool for converting the compression inside PBF files to zstd (zstd-pbf), but noticed that (at least with the implementation I used) files actually got larger than with zlib. With larger blocks, however, zstd achieved better compression than zlib. Maybe zstd has a higher per-block overhead and thus only works well with larger blobs.
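For anyone who wants to reproduce that comparison on a single decompressed blob payload, here is a rough Go sketch using the standard compress/zlib and klauspost/compress/zstd (the exact numbers depend on compression levels and block contents):

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// os.Args[1] is assumed to be one decompressed blob payload dumped to disk.
	raw, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}

	// Compress with zlib (the codec used by existing PBF tools).
	var zbuf bytes.Buffer
	zw := zlib.NewWriter(&zbuf)
	zw.Write(raw)
	zw.Close()

	// Compress with zstd at its default level.
	enc, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedDefault))
	if err != nil {
		log.Fatal(err)
	}
	defer enc.Close()
	zstdOut := enc.EncodeAll(raw, nil)

	fmt.Printf("raw=%d  zlib=%d  zstd=%d\n", len(raw), zbuf.Len(), len(zstdOut))
}
```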

That’s very cool, I love when people dig into the nitty-gritty like this and share detailed numbers.

Another aspect: IMO, it’s useful for a PBF to have small blocks if having larger blocks would mean having relatively few blocks. This is because tools may use the blob as their unit of parallelism, so you want at least, say, 2-3x as many blocks as your computer has cores. In fact, tilemaker will warn you if you’re using a PBF with relatively few, large blocks, and hint that you may wish to use osmium cat to rejigger the blocks into more, smaller ones.
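That rule of thumb is simple to state in code (a sketch; tilemaker’s actual threshold and warning logic may differ):

```go
package check

import "runtime"

// fewLargeBlocks reports whether a PBF has fewer blobs than ~3x the core count,
// i.e. too few blocks to keep all cores busy when one blob goes to each worker.
func fewLargeBlocks(blobCount int) bool {
	return blobCount < 3*runtime.NumCPU()
}
```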

The motivation for this feature was that I used BBBike’s very handy export service to get PBFs of areas that I cared about. As of Nov 2023, BBBike was using osmconvert, which aims for 16-32MB per block. For most small regions, this results in a PBF with only a handful of blocks. London was 8 blocks, IIRC. Of those, 4 might be nodes, 3 might be ways, and 1 might be relations.

Due to tilemaker distributing 1 block per core, PBFs with only a few blocks meant slower processing, as not all of your computer’s cores could be used. Of course, you could also imagine letting tilemaker distribute different parts of the same block to multiple cores, but that adds significantly more complexity for a situation that can also be easily solved on the PBF generation side (and which doesn’t apply at all when doing country or planet level processing).

Good point. Maybe a simple compromise would be to aim for a blob size of 4 or 8 MiB. The gains beyond 8 MiB were not that large anymore anyway…

Ah, that’s your tool! Thank you for it. I also reproduced your zlib vs zstd results. The tool also let me quickly run an experiment I’d been wanting to do: export all blocks as “plain text” (well, as their underlying protobuf), train a zstd dictionary on a subset of them, then recompress all of them with that dictionary to see if the overall compressed size plus dictionary was better than zlib. It was not, unfortunately. I think this is to be expected: my understanding is that the zstd dictionary feature is best for very small inputs, like < 5 kB uncompressed. Still, I was happy to have been able to cheaply try it.
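For the record, the recompression half of that experiment can be sketched like this in Go, assuming a dictionary has already been trained (e.g. with the zstd CLI’s --train mode); the file names are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Dictionary trained on a subset of the exported raw blocks.
	dict, err := os.ReadFile("pbf.dict")
	if err != nil {
		log.Fatal(err)
	}
	enc, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		log.Fatal(err)
	}
	defer enc.Close()

	// The dictionary has to ship alongside the file, so count it too.
	total := len(dict)
	files, err := filepath.Glob("blocks/*.raw")
	if err != nil {
		log.Fatal(err)
	}
	for _, name := range files {
		raw, err := os.ReadFile(name)
		if err != nil {
			log.Fatal(err)
		}
		total += len(enc.EncodeAll(raw, nil))
	}
	fmt.Println("dictionary + compressed blocks:", total, "bytes")
}
```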

Cool, thanks for letting me know! I had also heard about the custom dictionaries, but hadn’t yet taken the time to understand what exactly they are. Good to know that I don’t need to investigate this any further in the context of PBF files :slight_smile:

In the back of my mind, I suspect there may be some benefit if you used only the string table as the piece to train on, and not the node latlngs or the offsets into the tables. When I poked about in the dictionary that was learned by default, my intuition was that it had learned a bunch of things that wouldn’t be generalizable, like node latlngs.

But I’m at bottom a lazy person looking for quick wins, so I stopped there. :slight_smile: In that vein, though: your zstd-pbf library introduced me to klauspost/compress, which introduced me to libdeflate, which tilemaker has now adopted for a small yet reliable ~2% decrease in total processing time. Not as good as being able to use zstd, but it doesn’t require boiling the ocean to convert all the other players in the ecosystem to zstd first.


If somebody wants to experiment with different block sizes in Osmium and how that affects various software etc., please go ahead. Unfortunately I don’t have the time to do that. But if you can make the case that this will “solve more problems than it creates”, I’d definitely consider changing the behaviour in Osmium.