Has zstd been considered as an alternative general-purpose compressor to zlib?
It is both faster and compresses better, and it works on the same principles (so it is easy to implement) - so switching should be a win-win.
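For context, the one-shot APIs really are close cousins. A minimal sketch (assuming zlib and libzstd are installed; the wrapper names are just illustrative) showing that swapping compressors is largely a one-call change:

```c
#include <stdio.h>
#include <zlib.h>
#include <zstd.h>

/* Compress src with zlib; returns compressed size, or 0 on failure. */
static size_t compress_zlib(const unsigned char *src, size_t srcLen,
                            unsigned char *dst, size_t dstCap, int level)
{
    uLongf dstLen = (uLongf)dstCap;
    if (compress2(dst, &dstLen, src, (uLong)srcLen, level) != Z_OK)
        return 0;
    return (size_t)dstLen;
}

/* Same operation with zstd; the call shape is nearly identical. */
static size_t compress_zstd(const unsigned char *src, size_t srcLen,
                            unsigned char *dst, size_t dstCap, int level)
{
    size_t n = ZSTD_compress(dst, dstCap, src, srcLen, level);
    return ZSTD_isError(n) ? 0 : n;
}

int main(void)
{
    const unsigned char sample[] =
        "highway=track tracktype=grade2 highway=track tracktype=grade2";
    unsigned char out[256];
    printf("zlib: %zu bytes, zstd: %zu bytes\n",
           compress_zlib(sample, sizeof sample, out, sizeof out, 6),
           compress_zstd(sample, sizeof sample, out, sizeof out, 6));
    return 0;
}
```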
Yes, for structured data, the newly developed OpenZL promises even faster speeds and even better compression ratios… but that one is still beta / being improved. More importantly, it is harder to implement, as it is not a plug-and-play replacement for zlib. But it would definitely be the best choice if those obstacles can be overcome.
If full polar coverage or workflows like post-processing untagged nodes are required, users can still side-load that data or continue with the PBF-based import flow. The proposed GOL-based import path is mainly about reducing processing time and resource use for the more common cases.
Thanks for outlining some of the internals of osm2pgsql. My goal isn’t to convince you to integrate GeoDesk; it may simply not align with your project’s needs, and that’s perfectly fine. If you ever see potential for collaboration or find certain additions useful, feel free to open a GitHub issue or tag @clarisma.
Thanks for the link. That’s a neat approach: generalizing a pre-compression step based on data schemas, then feeding the result to zstd. As for GOB, there are no record structures in the traditional sense, as the encoder already compacts the data using various techniques (e.g. deduplication with reference substitution, delta-encoded varints, etc.).
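To illustrate one of those techniques: here is a toy delta-encoded varint sketch (not GeoDesk’s actual wire format, which may differ in its details). Sorted 64-bit IDs are stored as gaps, each gap as a base-128 varint, so nearby IDs take 1-2 bytes instead of 8:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Append v to buf as an unsigned LEB128-style varint; returns bytes written. */
static size_t write_varint(uint8_t *buf, uint64_t v)
{
    size_t n = 0;
    while (v >= 0x80) {
        buf[n++] = (uint8_t)(v | 0x80); /* 7 payload bits + continuation bit */
        v >>= 7;
    }
    buf[n++] = (uint8_t)v;
    return n;
}

/* Encode a sorted ID list as varint deltas; returns encoded length. */
static size_t encode_deltas(const uint64_t *ids, size_t count, uint8_t *out)
{
    size_t len = 0;
    uint64_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        len += write_varint(out + len, ids[i] - prev); /* store the gap only */
        prev = ids[i];
    }
    return len;
}

int main(void)
{
    const uint64_t ids[] = {100000000, 100000007, 100000012, 100000020};
    uint8_t out[64];
    size_t n = encode_deltas(ids, 4, out);
    printf("4 ids (32 raw bytes) -> %zu encoded bytes\n", n);
    return 0;
}
```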
I had high hopes for zstd, but its ratios were comparable to zlib’s. The only tangible gain I observed was a 10% speedup of gol save, which is not enough to justify the added dependency (zlib is already included in any case, because gol build needs it to decompress PBFs).
zstd seems to work best on text-based formats (HTML or OSC files), often achieving 10:1 ratios. Currently, zlib at level 8 compresses raw GOB-encoded tiles to about 70% of their original size. I’m also considering LZMA (though it is on the slow side), but there may simply not be enough “air” left to squeeze out of the binary format.
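If anyone wants to probe the remaining “air” themselves, here is a rough harness, assuming zlib and liblzma are available (`tile.gob` is a hypothetical raw tile dumped to disk, not something the gol tool produces), that reports both ratios:

```c
#include <lzma.h>
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "tile.gob"; /* hypothetical input */
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    unsigned char *in = malloc((size_t)size);
    if (fread(in, 1, (size_t)size, f) != (size_t)size) return 1;
    fclose(f);

    /* zlib at level 8, matching the setting quoted above */
    uLongf zLen = compressBound((uLong)size);
    unsigned char *zOut = malloc(zLen);
    if (compress2(zOut, &zLen, in, (uLong)size, 8) == Z_OK)
        printf("zlib-8: %.1f%% of original\n", 100.0 * zLen / size);

    /* LZMA (xz) one-shot encode at preset 6 */
    size_t xPos = 0, xCap = lzma_stream_buffer_bound((size_t)size);
    unsigned char *xOut = malloc(xCap);
    if (lzma_easy_buffer_encode(6, LZMA_CHECK_CRC32, NULL,
                                in, (size_t)size, xOut, &xPos, xCap) == LZMA_OK)
        printf("lzma-6: %.1f%% of original\n", 100.0 * xPos / size);

    free(in); free(zOut); free(xOut);
    return 0;
}
```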
In specific cases, a general compressor can still pick up repeating sequences. For example, say a GOB tile contains 100 ways tagged highway=track,tracktype=grade2. Those 100 ways will share one common tag table. However, if they all have different names, each will get its own. In that case, a Huffman-type compressor can substitute a shorter symbol for the repeating key/value refs of the tags the ways have in common. Another example is common substrings, such as multiple street names ending in Avenue. (Those are the cases that zlib picks up.)
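A toy reproduction of that effect (synthetic bytes standing in for tag refs and names, not real GOB data) shows how well zlib exploits the repeats:

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* 100 "ways": identical 4-byte tag refs, names differing only in number */
    static const unsigned char tagrefs[4] = {0x12, 0x34, 0x56, 0x78};
    unsigned char buf[8192];
    size_t len = 0;
    for (int i = 0; i < 100; i++) {
        memcpy(buf + len, tagrefs, sizeof tagrefs);  /* shared tag-table ref */
        len += sizeof tagrefs;
        /* unique name with a common "Avenue" suffix */
        len += (size_t)sprintf((char *)buf + len, "%d Avenue;", i);
    }
    unsigned char out[8192];
    uLongf outLen = sizeof out;
    if (compress2(out, &outLen, buf, (uLong)len, 8) == Z_OK)
        printf("%zu -> %lu bytes (%.0f%% of original)\n",
               len, (unsigned long)outLen, 100.0 * outLen / len);
    return 0;
}
```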
Not exactly “beta”. The way OpenZL works is that the compression “recipe” gets included with the compressed files, and the universal OpenZL decompressor uses that recipe to reconstruct the original file. So as the compression engine improves, their decompression engine will still be able to open those files.
Their GitHub page states:
> This project is under active development. The API, the compressed format, and the set of codecs and graphs included in OpenZL are all subject to (and will!) change as the project matures.
>
> However, we intend to maintain some stability guarantees in the face of that evolution. In particular, payloads compressed with any release-tagged version of the library will remain decompressible by new releases of the library for at least the next several years. And new releases of the library will be able to generate frames compatible with at least the previous release.
>
> (Commits on the dev branch offer no guarantees whatsoever. Use only release-tagged commits for any non-experimental deployments.)
>
> Despite the big scary warnings above, we consider the core to have reached production-readiness, and OpenZL is used extensively in production at Meta.