GeoDesk: A fast and compact database engine for OSM features

GeoDeskTeam · November 9, 2022, 4:11pm

Hello World!

We’ve released GeoDesk, a new database engine designed specifically for OpenStreetMap data.

Its advantages:

Compact file format: A GeoDesk database is a single self-contained file, only 20% to 50% larger than an equivalent .osm.pbf file (that’s one-tenth the size of a traditional database).
Fast imports: Turning .osm.pbf data into a GeoDesk database is significantly faster than a traditional database import (minutes instead of hours).
Fast, intuitive queries: Queries run in microseconds instead of milliseconds and use a MapCSS-like syntax.
Full support of the OSM data model (including relations). Alternatively, users can treat OSM elements as simple features and just work with points, lines and polygons.
Seamless integration with the Java Topology Suite for advanced geometric operations (buffer/union/simplify, convex/concave hulls, Voronoi diagrams, and more)
Easy to share and distribute OSM data: Internally, a GeoDesk database is organized into tiles, which can be exported in compressed form (similar in size to .osm.pbf, and often smaller). Others can then download the tiles for only the regions they need.
Modest hardware requirements: Just about any 64-bit system is fine for working with existing databases or tile sets. For country-size imports, we recommend an SSD and 8 GB RAM (24 GB if you routinely import planet-size datasets).
Cross-platform (pure Java)
100% free & open-source

GeoDesk is primarily aimed at software developers, who can integrate the database engine into their own geospatial applications. There is also a stand-alone utility, which allows users to select and filter OSM elements and export them in various formats (e.g. GeoJSON).

Please visit GeoDesk.com for documentation, examples and setup instructions, or our source-code repository on GitHub.

As this is our initial release, the software isn’t perfectly polished yet. We expect to ship a maintenance release in about a month to address any issues that will inevitably crop up once the toolkit runs on a broader variety of platforms. Please open an issue on GitHub if you discover a bug or need help. If your question may be of interest to non-developers, you can also post it on this forum (just mention @GeoDeskTeam).

GeoDesk is a non-commercial project, maintained solely by volunteers (though we may offer paid support or consulting services in the future). We developed this database engine for one of our own projects, because we could not find any solution that met our needs. We figured that there are probably other developers like us who could benefit from it, so we decided to release it as open source.

Today marks the anniversary of the fall of the Berlin Wall. We’re not suggesting that these two events are comparable, but we hope that by knocking down technological barriers, we’ll help advance OpenStreetMap’s mission to make geospatial data freely accessible to anyone.

Thanks for checking out our work, and happy queries!

stephan75 · November 14, 2022, 10:10pm

Hello, I managed to create a first database on my Windows 10 system from the CLI by typing
gol build bremen bremen-latest.osm.pbf

I have the result for this little data extract of Germany:

Building bremen.gol from bremen-latest.osm.pbf using default settings...
Analyzed bremen-latest.osm.pbf in 1s 796ms
Sorted 1,577,635 nodes / 308,925 ways / 5,357 relations in 4s 626ms
Validated 14 tiles in 1s 515ms
Compiled 14 tiles in 6s 349ms
Linked 14 tiles in 263ms
Built bremen.gol in 16s 520ms

But when trying to do that with mecklenburg-vorpommern-latest.osm.pbf from geofabrik.de, that program freezes at

Analyzing... 9%

Maybe not enough memory for Java?
How can I check for errors?
Adding the -v parameter doesn’t change the screen output.
Where can I add the -Xmx parameter to give more RAM to the JRE?

GeoDeskTeam · November 15, 2022, 11:38am

Hi Stephan,

Memory should definitely not be a problem for a small file like that, and the program shouldn’t hang even in that case, so this may be a bug. I haven’t been able to reproduce it with today’s mecklenburg-vorpommern-latest.osm.pbf (from GeoFabrik, file size: 108785606 bytes); it all looks normal:

Building mv.gol from ..\mapdata\mv.osm.pbf using default settings...
Analyzed ..\mapdata\mv.osm.pbf in 2s 889ms
Sorted 10,569,013 nodes / 1,663,253 ways / 18,979 relations in 7s 956ms
Validated 30 tiles in 2s 118ms
Compiled 30 tiles in 14s 329ms
Linked 30 tiles in 107ms
Built mv.gol in 29s 411ms

(Windows 10, dual-core notebook, 8 GB RAM)

I also tried with the older versions (11/14 and 11/13), similar results.

Which exact version of mecklenburg-vorpommern-latest.osm.pbf did you use?
What is your system configuration? (cores, RAM, available storage)
Did the program hang intermittently during multiple attempts, or consistently on every run (and if so, always at the same percentage completed)?

Thanks!

P.S. If you do need to increase heap size in the future, you can add the -Xmx option in the call to java in bin\gol.bat – e.g. -Xmx8g will allow Java to use up to 8 GB (by default, it will use no more than 1/4 of maximum available RAM).

stephan75 · November 16, 2022, 4:01pm

I tried a new download of mecklenburg-vorpommern-latest.osm.pbf, and this time I succeeded in converting that dataset with the gol utility. Thanks for the further hints!

GeoDeskTeam · November 16, 2022, 4:38pm

Do you still have the file that caused the Analyzer to freeze?
Was the file damaged or truncated during download? (Even if the input .osm.pbf is damaged, the GOL Tool should report an error, but never hang, so I’d like to investigate this issue.)
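A damaged or truncated download can usually be ruled out before running a build. Below is a minimal Python sketch, assuming the download site publishes an MD5 checksum for the extract (and, optionally, that the expected file size is known). The function names are hypothetical and not part of the GOL Tool:

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_intact(path, expected_md5, expected_size=None):
    """Compare a local file against a published checksum (and size, if known)."""
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False  # a size mismatch already indicates truncation
    return file_md5(path) == expected_md5.strip().lower()
```

A mismatch only shows that the local copy differs from the published one; it says nothing about why the Analyzer would hang instead of reporting an error.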

Discostu36 · November 16, 2022, 4:44pm

Are you aware that most of the links in the footer of your website don’t work?

GeoDeskTeam · November 16, 2022, 4:53pm

Wow, thanks for letting us know!
(Just pushed a fix, links should be working now)

stephan75 · November 18, 2022, 6:44am

No, that old file mecklenburg-vorpommern-latest.osm.pbf has been overwritten, sorry.

Discostu36 · December 9, 2022, 9:41pm

Are there plans to extend the features of the query language? If I see correctly, I can only check for features near other features if I use GeoDesk in a Java program. I don’t want to write a Java program, but would like to use GeoDesk as a fast Overpass alternative.

GeoDeskTeam · December 10, 2022, 12:09pm

We plan to add scripting support to the GOL Tool, which will enable more sophisticated queries.

Scripts will be invoked via the gol query command and make use of the -a (--area), -b (--bbox), -t (--tags) and -f (--format) options. In addition, parameters can be supplied via the -p:<param> option:

gol query <gol-file> @<script-file> -p:<param>=<value>

Examples:

Find all supermarkets within 500 meters of a given location (Output is post-processed according to the --format command-line option):

myLocation = lonLat(13.38, 52.63)
supermarkets = select "na[shop=supermarket]"
out supermarkets maxMetersFrom(500, myLocation)

Count all soccer fields in a given state (the name parameter is supplied via -p:state_name=<name> on the command line):

state = single select "a[boundary=administrative][admin_level=4][name={state_name}]"
printf "There are %d soccer fields in %s",
    count select "a[leisure=pitch][sport=soccer]" within state, state.name

Print the names of all train routes found in the query bounding-box or area (options --bbox / --area), and print the name of each stop:

select "r[route=train]" each { route ->
    println route.name
    route members ".stop" each { stop -> println "- {stop.name}" }
}

We haven’t slated this for a specific release yet. The big question is whether to prioritize scripting over support for incremental updates (which has been the most-requested enhancement so far). We may introduce scripting as a Preview feature sometime after 0.2 (expected in January), so we can start gathering feedback.

woodpeck · December 10, 2022, 12:25pm

This is a bit of a tangential issue but is there a reason why nobody in your team seems to have a real name and reside in a real country? Not that it matters much but when I encounter something interesting I always try to “locate” it in my head - ah, it’s the guys from XY university who did this, or ah, it’s the French who did this, or whatever. Are you in a country where you’d be at risk for being seen as working with OSM data? Employed by a company who hates Open Source? Employed by a company that has done embarrassing things in the past to which you don’t want to be linked? Or why the secrecy?

GeoDeskTeam · December 10, 2022, 2:26pm

No, there aren’t any nefarious reasons for publishing pseudonymously. GeoDesk isn’t part of ESRI’s ploy to take over OSM (if there is one). The project is not affiliated with any company or organization. It’s not a Mafia front. It’s not a shield to hide anyone’s dark past.

I personally am very privacy-minded and simply don’t want my name on the Internet. I don’t care about credit or status, and would prefer to stay un-Facebooked and ungooglable. It’s just a distraction from the work product.

Anyone who wants to know personal details can send a DM (either here or on Twitter) and I’ll do my best to satisfy their curiosity. Also, if anyone needs a Certificate of Authorship, happy to provide that as well.

Hope that’s ok for now, and thanks for understanding. (The above is not a hard-line position; at some point I probably will post some bios on the site.)

SimonPoole · December 11, 2022, 6:35am

Just a recommendation: have a chat with a lawyer. While most here can probably sympathize with your desire for privacy, de facto you are acting as a legal entity, potentially in commerce, and you may run afoul of regulations on business contact information and similar wherever you are based.

GeoDeskTeam · December 12, 2022, 4:56pm

Thanks for the heads up! All clear for now, but as the project becomes more substantial, I’ll probably set up an entity (perhaps even a simple non-profit corp).

stephan75 · January 11, 2023, 4:22pm

There is a recent blog post from @ZeLonewolf about rendering global data (and filtering the raw OSM data beforehand). For details, see ZeLonewolf's diary entry “OpenMapTiles planet-scale vector tile debugging at low zoom” on OpenStreetMap.

I could not resist asking whether the gol tool from GeoDesk could be used for this; he originally used osmium to filter the global raw OSM data.

Due to a lack of capable hardware, I cannot reproduce the steps that ZeLonewolf describes there, or compare osmium and the gol tool when filtering the mentioned data for the whole planet.

Maybe someone is interested in such a filtering process and can report success or failure, perhaps with some benchmarks?
(Would it be better to open a new topic here instead of proceeding in this quite general topic?)

GeoDeskTeam · January 11, 2023, 5:15pm

It looks like @ZeLonewolf needs a PBF as input into the OpenMapTiles toolchain. There currently is no PBF output option for GeoDesk queries, so unfortunately the gol tool can’t help him there. For anything that requires writing PBFs, osmium is the best bet.

However, it would be very interesting to explore GeoDesk as a backend for map rendering. This would obviate the need to import OSM data into a PostGIS database (which takes a long time and beefy hardware). For comparison, gol build will create a GOL from a planet file in about 40 minutes on a fairly low-end workstation (10-core Xeon, 32 GB RAM, consumer-grade NVMe SSD). A renderer could then query the tile regions and receive the various features, with their geometries already assembled. Such a renderer wouldn’t use gol query, but interact directly with the GeoDesk library for best performance. Querying a GOL is significantly faster than anything involving an SQL database (and a GOL from a recent planet is only about 80 GB).

Right now, we’re working on support for incremental updating. Currently, GOLs need to be rebuilt from a fresh planet file, which is fast, but not fast enough if someone wishes to update more frequently than once a day (which is a typical requirement for tile rendering).

We’re looking to eventually integrate with the various map-rendering toolchains, so thanks for pointing out this post!

ZeLonewolf · January 11, 2023, 5:31pm

At present, I only need to use the openmaptiles-tools toolchain for development renders. This process is slow because openmaptiles-tools uses a postgres database as an intermediary rather than a direct PBF-to-mbtiles processing logic.

For production renders, I use planetiler, which renders the planet pbf to an mbtiles file in the OpenMapTiles schema in about an hour on a 64-core machine with SSDs. Planetiler is a Java implementation that skips the intermediate steps in the openmaptiles-tools toolchain. planet.osm.pbf → planet.mbtiles, 1 hour. This is done with a Java profile for OpenMapTiles that essentially hard-codes the logic for generating the planet mbtiles in Java code, using planetiler as a core dependency.

THAT process is so fast that there is likely nothing to be gained from pre-filtering the planet. However, updates to planetiler lag OpenMapTiles, so it’s not useful in a “testing new features in OpenMapTiles” context.

The biggest performance gap I currently have is that it takes close to an hour to patch the (hopefully) weekly planet PBF by applying hourly .osc diffs to it. This process is single-threaded using pyosmium-up-to-date. If you had a solution to THAT, I would be very excited.

GeoDeskTeam · January 12, 2023, 2:31pm

We don’t have anything in our arsenal for this use case, unfortunately. There’s a fast PBF reader in GeoDesk, but nothing for writing PBFs.

Have you tried using osmium apply-changes instead of pyosmium-up-to-date? libosmium (the underlying library of both) reads PBFs using multiple threads, and its PBF writer also has (at least limited) multi-thread support starting with version 2.17.0. However, the requirements of the CPython interpreter may force the Python version to run on a single thread.

The OSM objects in a PBF generally have to be sorted by ID, which means the blocks within the file have to be written in order — this makes parallelization more complicated. (@Jochen_Topf, the main author of Osmium, may be able to shed some light on this).

If the .osc processing could be spread across multiple cores (maybe with the help of a bounded priority queue to enforce the block writing order), processing should take about 5 minutes on your kind of system (with 64 cores, it will be mostly IO-bound in this scenario).
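To illustrate the idea of a bounded priority queue that enforces the writing order, here is a minimal Python sketch. It is not how Osmium or GeoDesk work internally: `encode_block` is a hypothetical stand-in (plain zlib compression) for the real per-block work, and `max_in_flight` is an assumed tuning parameter that caps how many blocks are held in memory at once.

```python
import heapq
import zlib
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def encode_block(seq, payload):
    """Stand-in for the expensive per-block work (here: zlib compression)."""
    return seq, zlib.compress(payload)


def write_in_order(payloads, sink, workers=4, max_in_flight=8):
    """Encode blocks in parallel, but write them to `sink` in input order."""
    source = enumerate(payloads)
    running = set()    # futures still being encoded
    ready = []         # min-heap of (seq, encoded) waiting for their turn
    next_seq = 0       # sequence number of the next block to write
    unwritten = 0      # blocks submitted but not yet written (the bound)
    exhausted = False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            # Top up the pool without exceeding the in-flight bound.
            while not exhausted and unwritten < max_in_flight:
                try:
                    seq, payload = next(source)
                except StopIteration:
                    exhausted = True
                    break
                running.add(pool.submit(encode_block, seq, payload))
                unwritten += 1
            if not running:
                break
            done, running = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                heapq.heappush(ready, future.result())
            # Release every block whose predecessors have all been written.
            while ready and ready[0][0] == next_seq:
                sink.write(heapq.heappop(ready)[1])
                next_seq += 1
                unwritten -= 1
```

Blocks finish encoding in any order, but the heap holds them back until every earlier block has been written, so the output stays sorted while memory use stays bounded by `max_in_flight` blocks.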

How frequently do you want/need to update your planet file?

Jochen_Topf · January 12, 2023, 3:38pm

Osmium apply-changes and pyosmium-up-to-date do essentially the same thing internally. And yes, as @GeoDeskTeam mentions, there is some multithreading involved.

Multithreading works well for reading PBFs, but writing in multiple threads is not straightforward. The reason is less the ordering requirement than the way PBFs are encoded in blocks. Ideally you want blocks to contain a “reasonable” number of objects and/or bytes. But objects have widely different sizes, so you basically have to write them into the blocks before you know when the blocks have a “good” size. And you can’t start with the next block until you know where the previous block ends, so you can’t really have a different thread start on the next block while you encode the current one. You can use a fixed number of objects per block to solve this; the blocks will then have different sizes, but that might not matter in every case. It might make reading somewhat less efficient, though, due to the extra per-block overhead. And you have to make sure that blocks stay below the maximum size defined by the PBF format, so you need to cover that case somehow.
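The fixed-count approach, including the size cap, can be sketched roughly as follows. This is a simplified Python illustration, not Osmium's implementation: it treats each object as an already-serialized byte string (glossing over the string tables and delta encoding of real PBF blocks), and both `objects_per_block` and `max_block_bytes` are assumed values chosen for illustration.

```python
def split_into_blocks(objects, objects_per_block=8000,
                      max_block_bytes=16 * 1024 * 1024):
    """Yield lists of serialized objects: fixed count per block, capped by size."""
    block, block_bytes = [], 0
    for obj in objects:
        if len(obj) > max_block_bytes:
            raise ValueError("single object exceeds the block size limit")
        # Close the block when it is full by count or would overflow by size.
        if block and (len(block) == objects_per_block
                      or block_bytes + len(obj) > max_block_bytes):
            yield block
            block, block_bytes = [], 0
        block.append(obj)
        block_bytes += len(obj)
    if block:
        yield block
```

Because block boundaries are then known without encoding anything, each block could be handed to a separate thread; the size check is the fallback for the rare block whose objects are unusually large.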

And the more stuff you are doing at the same time, the more memory you need for all the “in-flight” data that’s in the process of being assembled. This can amount to many GBs of data that you keep in memory while processing. That’s why the multithreading is limited on writing in Osmium. If anybody has an idea how to make this better, please implement it and tell me.

ZeLonewolf · January 12, 2023, 4:39pm

I think the ultimate solution in my case is for planetiler to support ingesting change files directly. It’s already holding massive in-memory and on-disk representations of the planet at various phases of its processing that are therefore independent of PBF file limitations. It’s something that’s been discussed with regard to planetiler, and they’d welcome it, but that’s certainly not trivial work.

Though I was struck by the comment:

And the more stuff you are doing at the same time, the more memory you need for all the “in-flight” data that’s in the process of being assembled. This can amount to many GBs of data that you keep in memory while processing. That’s why the multithreading is limited on writing in Osmium.

If “reducing memory use” is a limitation, I would note that machines with hundreds of GB of RAM are readily available these days and there are certainly use cases that would have no problem trading RAM for speed.