New OSM file format: 30% smaller than PBF, 5x faster to import

The OSM dataset is huge, and keeps growing every day. Great news, of course, but sometimes the sheer volume can be overwhelming – there are just gobs and gobs of data!

Hence, we created GOB (“Geo-Object Bundle”), a new file format that makes tackling OSM data faster and easier. It’s a companion to our now-familiar Geo-Object Library: a GOB is essentially a tightly compressed GOL with its indexes stripped.

To support this new format, GOL Tool 2.1 has two new commands: save exports a GOL as a GOB, and load imports a GOB into a GOL. (Of course, like all of the GeoDesk Toolkit, the GOL Tool is free & open-source.)

Advantages of GOB

  • GOB files are on average half the size of a GOL, and 30% smaller than PBFs.

  • Importing a GOB is 5 times faster than building a GOL from a PBF. A modern system loads a planet-size GOB into a GOL in 3 minutes. The speed advantage grows more pronounced on memory-constrained machines: gol build starts paging heavily with less than 32 GB of RAM, whereas gol load requires minimal resources (even a decade-old laptop loads the whole planet in under an hour).

  • GOBs are organized into tiles, so it’s easy to extract regional subsets (basically at file-copy speed) and stitch them back together; that makes GOB a convenient format for archiving and distributing geodata (see the sketch below).
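
For instance, here’s a hedged sketch of that round trip, using the save and load commands described below (the file names are hypothetical, and france.wkt is assumed to contain a polygon outlining France):

# Extract the tiles covering France from planet.gol into france.gob
gol save planet france -a france.wkt
# Rebuild a standalone france.gol from france.gob
gol load france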

The image above shows some of the tiling structure, which mimics that of tile renderers. On the left, the smallest squares are zoom 6, while the right shows the most granular level (zoom 12). A typical planet GOB has about 60,000 tiles.

Below are some size statistics for the planet file and popular regional extracts (without metadata; the percentages show each format’s size relative to the PBF):

                PBF      GOL  vs.PBF       GOB  vs.PBF
Planet      65.4 GB  93.6 GB  +43.1%   46.0 GB  -29.7%
California  1.18 GB  1.59 GB  +35.0%    770 MB  -36.5%
France      4.54 GB  5.89 GB  +29.7%   2.84 GB  -36.3%
Germany     4.29 GB  5.92 GB  +38.0%   2.67 GB  -37.5%
Italy       1.96 GB  2.63 GB  +34.0%   1.34 GB  -31.6%
Japan       2.13 GB  2.91 GB  +36.1%   1.34 GB  -37.0%
Poland      1.84 GB  2.72 GB  +47.6%   1.29 GB  -29.7%
Switzerland  487 MB   634 MB  +30.1%    311 MB  -36.2%

Dense, well-mapped areas compress best as GOBs; less completely mapped regions see a below-average size advantage (the GOBs for Brazil and China are only 23% smaller than their PBFs).

Limitations

Just like GOLs, GOBs don’t store:

  • metadata (timestamp of last edit, changeset, username, etc.)

  • history (each GOB is a snapshot of the OSM dataset)

The format is therefore not intended for editing, but for archival and distribution.

How to work with GOBs

You will need GOL Tool 2.1 or above (download).

To export a GOL as a GOB:

gol save <gol-file> [<gob-file>]

If <gob-file> is omitted, it uses the same base name as the GOL. The .gol and .gob extensions are optional.
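
For example, assuming a (hypothetical) france.gol exists,

gol save france

reads france.gol and writes france.gob.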

To limit the export to a specific area, use the --area (-a) option. You can specify a (multi)polygon as WKT, GeoJSON or simple coordinates (lon,lat pairs, rings are closed automatically), either directly or as a file. If no file extension is given, .wkt is assumed.

For example:

gol save world bodensee -a 9.55,47.4,8.78,47.66,9.01,47.88,9.85,47.58,9.82,47.46 

exports the tiles covering the region around the Bodensee (Lake Constance).
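
The same polygon could instead be read from a file, e.g. a hypothetical bodensee.wkt:

gol save world bodensee -a bodensee

Since -a bodensee has no file extension, bodensee.wkt is assumed.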

To import tiles into a GOL:

gol load <gol-file> [<gob-file>]

As with save, if <gob-file> is omitted, the base name of the GOL is used. If the GOL does not exist, it is created. To load just a specific region, restrict it with the -a option.

gol load japan -a shikoku

loads tiles from japan.gob into japan.gol (creating it if it doesn’t yet exist), but only those intersecting the area defined in shikoku.wkt.

Available datasets

Planet-wide GOB downloads are currently offered by Open Planet Data, a third-party site (see the discussion further down this thread).

What’s next

This is still a work in progress, so the format may change. I’m experimenting with compression algorithms beyond zlib to make it even tighter and faster (zstd didn’t yield any significant gains).
I’m also in the process of enabling gol load to download a GOB directly from a URL and build the GOL in the background, which would bring the effective wall-clock import time down to zero. UPDATE: This feature shipped in GOL Tool 2.2.

As always, questions/feedback are welcome! Please stop by on GitHub or at @geodesk@en.osm.town.


What’s a GOL?


GOL = Geo-Object Library (a compact single-file database for OSM features)

So, it’s a given format (GOL), stripped of everything that can be regenerated from the rest of the data (the indexes), then compressed, and it imports into the original format (GOL) 5x faster than importing from another format [but only to that format] [probably because it just decompresses and rebuilds the indexes].

Yes, in a nutshell (plus the ability to load/extract specific areas).

This is especially useful for OSM-based applications hosted on a low-power virtual server (e.g. 2 cores, 8 GB RAM). Loading a GOL from a GOB requires far less memory than building from a PBF (for which the process needs to keep a node index in memory to assemble the geometries of ways, or else it will start paging furiously). On that kind of machine, the speed increase will be closer to 15x.

From a GOL, you can then export to other formats (GeoJSON, WKT, CSV, OSM-XML), or perform queries using a Python script or C++/Java application.
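
For instance, a hedged sketch (the file name and query are hypothetical; -f wkt appears in an example further down this thread, and the other format names are assumed to follow the same pattern):

gol query switzerland "a[building]" -f geojson -o buildings.geojson

exports every building in switzerland.gol as GeoJSON.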


It sounds like GOB is an intermediate format for your library which reduces data stored or transferred at the cost of some CPU time, is that correct?

What software outside of GOL can currently read or write GOB? Can any of the standard libraries people use read it and create geometries?


Could you go into more detail about how the format works?

I was unable to find a higher level overview in the docs or on GitHub.
What encodings do you use? How effective are they for you?

Here is our research, in case you are interested: https://arxiv.org/pdf/2508.10791
Here is our documentation of the encodings that we found to be relevant for styling, which may or may not work for you (you want lossless; for maps, low-loss is generally fine): Encoding Algorithms - MapLibre Tile Specification


Yes, GOB is a companion format to GOL:

  • GOL is a single-file database. It is uncompressed and indexed for fast queries (~100 GB for the planet)

  • GOB is for archival and distribution. Only the essential data is stored, in a tightly-packed layout, then compressed further with zlib (~50 GB for the planet)

The hardest part of working with OSM data as a data consumer is assembling OSM elements into geometries. Since ways store references to nodes rather than actual coordinates, this requires a lookup strategy that turns node IDs into longitude/latitude. The typical approach is a hashmap for smaller sets, or a dense array for the full planet.

(Old news to you, of course – just summarizing for other readers).

Keeping the node coordinates in a dense array takes up close to 100 GB. The GOL Tool uses a different indexing approach that brings this down to about 25 GB, but that’s still too heavy for most laptops or virtual servers (gol build won’t run out of memory, but it will take several hours due to paging).

So those users can now download a GOB (which contains the ways with their geometries fully resolved, and relations with optimized member references) and turn it into a GOL, with minimal RAM needed. Basically, gol load reads the required tiles from the GOB, unzips them, transforms them into a layout suitable for querying, indexes the features, and writes the tiles into the GOL.

At this point, only the GOL Tool supports GOBs. In theory, the tiles within a GOB could be exported directly into other formats, but it’s unlikely I’ll implement that capability, since it’s fast to turn a GOB back into a GOL (which already supports multiple export formats).

If there’s enough interest, I may refactor the GOB functionality from the GOL Tool codebase into a separate library (I’d prefer to keep it out of libgeodesk, which is meant to be lightweight).

As a side note, at some point I’d like to propose GOL as an alternative input format for osm2pgsql. This should shave at least a third off the import time (and make it feasible to run on low-end hardware), while only adding 200 KB to the executable size. Since you’re a key contributor, I’d love to know your thoughts.

Thanks for posting the research paper, I’ll have a more thorough read-through later. The SIMD and GPU-based accelerations sound fascinating.

I haven’t yet published a technical spec for the GOL/GOB file formats. I will do this eventually, since it will be essential for getting more developers involved in the project.

Here’s a high-level overview:

The basic file structure is broken into tiles (contiguous chunks of storage up to 1 GB in size, typically 500 KB to 4 MB). The tiling scheme is the same as the one commonly used by tile servers: a single square that covers the world in Mercator projection at zoom 0, recursively divided into quadrants. There’s one important difference: whereas tile servers produce MVTs/PNGs for every tile at a given zoom level, the tiling in a GOL is sparse. In low-density areas such as oceans and deserts, tiling stops at level 4 or possibly 6, whereas in dense urban areas, tile granularity can go up to level 12. That’s why a typical planet-size GOL stores about 50K tiles instead of millions.

A recurring design theme is locality of reference: cutting down not only I/O, but also accesses to main memory, by defining structures that maximize use of the CPU caches.

Tiles themselves are divided into “hot” and “cold” zones, attempting to keep frequently accessed data together, and also ensuring that features that are spatially close and/or related thematically are placed in contiguous storage locations. This reduces the number of pages that need to be loaded to perform queries. On the other hand, an SQL-based database that treats individual features as rows essentially stores them wherever there is space. Even though SSDs don’t have seek costs in the traditional sense, most still perform dramatically better on sequential reads, and a tighter layout means more data can be cached in memory.

For low-level storage, both GOL and GOB use techniques similar to MVT, e.g. delta-encoded LEB128 for coordinates. GOL/GOB also extensively de-duplicate structures. For example, a tile of a city might contain thousands of palm trees, or buildings of the same type; these can share a common tag-table. The same goes for strings. (A traditional database typically stores a separate set of tags for each feature, leading to needless bloat, especially if stored in a columnar layout.)

Here are some performance stats (10-core 2.3 GHz Haswell Xeon, 32 GB RAM, NVMe over PCIe 3):

  • Find all Italian restaurants (points and polygons) in the U.S. (based on a detailed admin-area polygon), using a planet-wide dataset (i.e. world("na[amenity=restaurant][cuisine=italian]").within(usa)): 52 milliseconds

  • Measure the length of all canals in a bounding-box covering 500 square-km: 47 microseconds

  • Find all features in a bounding-box spanning multiple city blocks: 3 microseconds

Results are fairly consistent across the 3 toolkits, with Java about 50% slower for polygon-constrained queries. Measurements are median timings based on a mixed-workload simulation (so large portions of at least the indexes will be in cache).

Further optimizations could possibly gain another 20% speedup (e.g. SIMD instructions for bbox checks). As for the data structures themselves, I typically don’t design for particular CPU architectures, because that field is evolving so rapidly. The general trend in CPUs is heavily multi-core, and this benefits the GOL design: a regional query can easily span 100+ tiles, which can be processed by multiple cores in parallel.

We support whatever libosmium supports and don’t do any parsing of OSM file formats within osm2pgsql itself.

I haven’t done any recent measurements but I can’t see any file format cutting a third of the import time. Far less than a third of the time is spent reading the file. Most of the time is spent in Postgres or building geometries.

I have found osm2pgsql feasible to run on any hardware that can handle the resulting database. osm2pgsql is designed to create databases for certain uses, and those uses tend to require more RAM/IOPS than osm2pgsql itself.


Do you have a technical specification of the file format? Note that I’m not interested in how the GOL tool works but how the data is practically stored in the file. Ideally you have a specification somewhere that is detailed enough that in theory one could write a parser for the format.

Not that I want to write another parser. What you propose sounds really interesting and I just want to understand how it works. From your description I take it that the format rearranges the data (the tile-based processing), and it very much sounds like the conversion from PBF is lossy in more ways than just “dropping metadata”. Both can be very reasonable design choices but might also limit the ways in which the data can be processed. So it might be really useful for some use-cases and less ideal for others. But that is really hard to judge without having more technical details.


Will you be releasing an official Torrent for this data?

Unfortunately, we don’t have a publishable GOB spec yet (just something that meets development needs). My first task would be to write a proper architectural guide that provides a sufficient intro to all the technical concepts.

But here’s a bird’s-eye view of a GOB file’s layout:

64-byte header
Catalog (8 bytes per tile: tile number and compressed size)
Metadata (global string table, tile index, etc.; zlib-compressed)
Tiles (in catalog order; zlib-compressed)

Each compressed Tile contains the elements needed to construct the GOL Tile:

  • local strings
  • tag tables (key/value string refs, similar to PBF)
  • reltables (the refs from members back to their parent relations)
  • nodes; ways and their geometries; relations and their member refs/roles

In addition to a snapshot, a GOB can also describe changes, so it may also contain the IDs of features that were deleted or moved to a different Tile.

Unlike ways, the geometries of relations are not pre-assembled; these are created on the fly in response to feature.shape / feature.toGeometry(). Instead of the type and ID of each member, each relation stores a table of references to its member features (either local or in a different tile), along with their role-string refs.

Apart from the omitted editing metadata, GOL (and GOB) is a lossless representation of the source OSM data, with the following exceptions:

  • Coordinates beyond +/- ~85 degrees latitude are clipped (to allow them to be represented in Mercator projection)

  • Ways with fewer than 2 nodes are discarded

  • Self-references in relations are removed

  • Circular relations are transformed into a relation hierarchy

  • By default, the IDs of untagged nodes are omitted, unless a) the node is a relation member, b) it has the same location as another node (a duplicate node), or c) it does not belong to any way or relation (an orphan node).

In GeoDesk parlance, those omitted nodes are called waynodes (they serve no purpose beyond defining the geometries of one or more ways). Based on user requests, I’ve added an option to preserve their IDs (option -w in build / load / save), to support external workflows that rely on node IDs to determine connected elements. However, I discourage using it unless absolutely necessary, as it increases the sizes of GOLs and GOBs (by 20% and 30% respectively). All the GeoDesk toolkits support topological queries, e.g. streets.connected_to(my_street), that don’t require waynode IDs.
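
As a hedged sketch (hypothetical file names), this builds a GOL from a PBF while keeping waynode IDs, then saves it as a GOB that keeps them too:

# Build planet.gol from the PBF, preserving waynode IDs (-w)
gol build planet planet.osm.pbf -w
# Export it as planet.gob, again preserving waynode IDs
gol save planet -w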

That said, I wouldn’t bother parsing a GOB. Instead, gol load it into a GOL (a 2-minute task on the systems commonly used for osm2pgsql). Quite likely, users will have the GOL already. In either case, just read the desired features via the libgeodesk API, which also allows precisely selecting them based on tags and polygonal bounds.

Admittedly, estimating the exact time savings of importing from GOL vs. PBF is a bit tough. But at minimum, we’d avoid those 10 billion lookups against a 100-GB node table (assuming about 1 billion ways in the planet, with 10 nodes each on average), especially if swapping would be required (see the relevant Aug 2024 blog post by @Jochen_Topf).

Also, once gol update ships, you’ll be able to restrict queries to features modified/deleted since the last update, so the results can in turn be used to keep a Postgres database in sync.


Open Planet Data is a third-party site (I only helped with some of the GeoDesk workflows). I know the datasets are hosted on Cloudflare R2, so torrents may not be needed, as downloads are already very fast and highly available. (But I’m not an expert on networking.)

I’ve pinged the site’s owner; I think he’ll give an intro here soon. He’s working on a lot of cool things related to OSM, and geodata in general.


The “building geometries” part is what osm2pgsql would avoid by importing from a GOL, which contains ways with their geometries pre-assembled (similar to PBF with locations-on-ways).

To give you some actual timings, I ran planet-wide queries for all linear ways (w) and all areas (a, which includes both area-ways and relations of type=multipolygon and type=boundary), dumping their geometries as Well-Known Text:

$ time gol query world w -f wkt -o nul
00:02:51 Found 335,483,088 features.

real 2m52.212s
user 46m16.080s
sys 2m25.622s
$ time gol query world a -f wkt -o nul
00:04:33 Found 758,642,775 features.

real 4m34.932s
user 80m50.599s
sys 3m8.450s

These were run on a 10-core Haswell Xeon, 32 GB, NVMe over PCIe 3. So basically 7m 30s wall-clock to retrieve (and format) the geometries of nearly 1.1 billion features, on lower-spec hardware than the typical setups that run osm2pgsql.

How do you handle the fact that you can’t fully create a geometry without style-specific logic?

A way could get turned into a polygon, linestring, multiple polygons, multiple linestrings, or some combination of them. The user specifies how to build the geometry in the tag transforms. This isn’t known at the time the GOB is built.

Are the untagged nodes present twice in the file? Not many users use the process_untagged_node callback but we allow them to do it. I’m not sure how that would work if nodes were only present when handling ways.

We also use libosmium for geometry building. If we used different code paths for building geometries depending on the file format then we’d get different results for the same data.

Another complication is slim mode. At time of update we might only have data for some nodes in the input file, e.g. only one node in a way changes.

Great tool - thank you so much for this! It’s a great way to make local backups of specific data at specific points in time.


When a GOL is built, an area flag is computed for each closed way based on (customizable) tag rules. However, you can simply ignore Way::isArea() and instead treat a way as an ordered collection of coordinates, from which you can then create the desired geometry.

No. They are either not stored at all (except duplicates, orphans and relation members), or optionally in a compressed table alongside each way’s coordinates.

You don’t have to use Feature::toGeometry(), you can iterate the relation’s member ways and feed them to Osmium’s polygonizer.

Not sure exactly what this means. My understanding is that “slim mode” refers to storing node coordinates in a separate flat file rather than the Postgres database. By importing from a GOL, you can avoid that file altogether. For updating, once gol update ships, you can query the source GOL for modified/deleted features and then upsert/delete just those.

This would require a two-pass strategy, as otherwise the nodes cannot be processed before the ways. We guarantee that on import the call order is process_(untagged_)node, after_nodes, process_(untagged_)way, after_ways, process_(untagged_)relation, after_relations.

Slim mode is where osm2pgsql stores data other than the output tables. This is used to either reduce memory usage or allow updates to take place.

With slim mode osm2pgsql requires, at a minimum, the ID and coordinates of every node; the ID, ordered node list, and tags of every way; and the ID, ordered member/role list, and tags of every relation.

It’s worth reading over the stages documentation for some of the information required for stage 2 processing.

If you aren’t concerned with updates (or performance) you can get to the point of doing crazy stuff in the Lua. You can handle circular relations, long dependency graphs, and other craziness. I know of someone who wrote Lua tag transforms where the Lua code fetched some other data from a different PostGIS database.

This is, of course, just about osm2pgsql. libosmium will have its own list of requirements.

I had missed this earlier on, but it’s a big problem for osm2pgsql. osm2pgsql supports any projection. It can even handle geometries that are invalid in EPSG:3857 but valid in EPSG:4326, not that those are common.


Have you considered utilizing OpenZL to minimize file size even more?
