I have a small app that expects the output of imposm3. I’m only extracting a subset of the data, such as buildings, addresses and POIs. Right now I only use a country .pbf export.
Now I want to start my “production” app with planet.osm.pbf on the smallest Hetzner server, which has 4 GB RAM and currently 60 GB of storage. The first bottleneck seems to be the initial import of planet.osm.pbf. The imposm3 docs mention the following:
An import in diff-mode on a Hetzner AX102 server (AMD Ryzen 9 7950X3D, 256GB RAM and NVMe storage) of a 78GB planet PBF (2024-01-29) with generalized tables and spatial indices, etc. takes around 7:30h. This is for an import that is ready for minutely updates. The non-diff mode is even faster.
Now my question is: Do I really need all of that hardware if I don’t care about the duration too much?
What is the minimally required hardware given days for the initial import?
I tried seeding the database from my development machine (48 GB RAM, plenty of storage and CPU) via SSH port forwarding to my production database, but ran into an obscure "no space left on disk in query COPY" error after 8 hours. Yet when I checked the storage usage on my server, it still had 25 GB available.
I can attach a bigger disk to my server, but I don’t even know how much storage I realistically would need. Maybe it’s a RAM issue? I really don’t want to over provision too hard since I plan to let it run for a long time without any return on investment.
As usual, once I give up and write something, I find a possible solution.
The -appendcache flag of imposm3 could be the magic. I need to see whether all of the country files from https://download.geofabrik.de work.
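For anyone trying the same approach, here is a minimal sketch of how the per-extract reads could be scripted. It only builds the command lines and does not run them. The binary name `imposm`, the file names, `mapping.yml` and the `cache` directory are placeholders I made up, and the flag combination should be checked against the imposm3 documentation for your version.

```python
def imposm_read_commands(pbf_files, mapping="mapping.yml", cachedir="cache"):
    """Build one `imposm import -read` command per extract, all sharing one cache.

    The first extract starts a fresh cache (-overwritecache); every later
    extract is added to the same cache (-appendcache).
    """
    commands = []
    for index, pbf in enumerate(pbf_files):
        argv = [
            "imposm", "import",        # binary name is an assumption
            "-mapping", mapping,
            "-cachedir", cachedir,
            "-read", str(pbf),
        ]
        argv.append("-overwritecache" if index == 0 else "-appendcache")
        commands.append(argv)
    return commands


def imposm_write_command(connection, mapping="mapping.yml", cachedir="cache"):
    """Build the final command that writes the filled cache into PostGIS."""
    return [
        "imposm", "import",
        "-mapping", mapping,
        "-cachedir", cachedir,
        "-connection", connection,
        "-write",
    ]


if __name__ == "__main__":
    # Hypothetical extract names; print the commands instead of running them.
    for argv in imposm_read_commands(["germany.osm.pbf", "france.osm.pbf"]):
        print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the commands look right, so a failed extract stops the loop instead of silently leaving a gap in the cache.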
My appendcache setup seems to work, but the VPS is still too slow even after I added 500 GB of storage. 2 shared vCPUs and 4 GB RAM are too little: right now it will take approximately 5 days to import all of Europe, so roughly 20 days for the planet. So for whoever wants to use a small VPS: 2 CPU cores and 4 GB RAM are not enough for the world, but 8 GB RAM and 4 cores seem reasonable, at about 10 days for the initial import.
In the OpenStreetMap data model, ways contain a list of node IDs only. However, PostGIS (and almost all other GIS software) stores lines and polygons as a list of coordinates. Therefore, data consumers like Osm2pgsql or Imposm have to maintain a cache of all node locations (a mapping of uint64 → (int32, int32)) to build the geometry of the linestring/polygon.
You could store this cache in swap memory but it will make things painfully slow because accessing the cache is purely random IO. It will not take hours or days, it will take multiple weeks for the whole planet.
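To make the node-location cache described above concrete, here is a toy sketch in Python. It is not how Osm2pgsql or Imposm implement it; the class and function names are invented for illustration, and the 8-bytes-per-node figure just follows from two int32 values per node in a dense, ID-indexed layout.

```python
from array import array


class NodeCache:
    """Toy node-location cache: node ID -> (lon, lat) as scaled 32-bit ints."""

    SCALE = 10_000_000  # 7 decimal places, so 180 degrees still fits in int32

    def __init__(self):
        self._index = {}            # node ID -> slot number
        self._coords = array("i")   # flat [lon0, lat0, lon1, lat1, ...]

    def put(self, node_id, lon, lat):
        self._index[node_id] = len(self._coords) // 2
        self._coords.append(round(lon * self.SCALE))
        self._coords.append(round(lat * self.SCALE))

    def get(self, node_id):
        pos = self._index[node_id] * 2
        return (self._coords[pos] / self.SCALE,
                self._coords[pos + 1] / self.SCALE)


def way_to_linestring(node_ids, cache):
    """Resolve a way's node IDs into a coordinate list, one lookup per node."""
    return [cache.get(node_id) for node_id in node_ids]


def dense_cache_bytes(max_node_id):
    """Size of a dense, ID-indexed cache with two int32 values per node."""
    return max_node_id * 8


if __name__ == "__main__":
    cache = NodeCache()
    cache.put(1, 13.4, 52.5)
    cache.put(2, 13.5, 52.6)
    print(way_to_linestring([2, 1], cache))
    # 10 billion is a hypothetical round number, not a measured node count.
    print(dense_cache_bytes(10_000_000_000) / 1e9, "GB")
```

Every way triggers lookups at node IDs scattered across the whole cache, which is why the access pattern is random and why keeping the cache in swap hurts so much.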
Consider switching to Osm2pgsql. It is just as fast but needs only 128 GB of RAM (maybe even 96 GB). It is also much more configurable than it was a couple of years ago. Back then, Imposm was the better choice for some use cases, but Osm2pgsql has since caught up.
I recommend one of the dedicated servers by Hetzner, not their (little) cloud machines.