For some testing I am currently doing, I have set up an Ubuntu virtual machine on VirtualBox, allocating 12GB RAM of the 16 available on the Windows 10 host.
I would like to attempt to load Europe (there is 500GB disk space available on my SSD for the VM alone) with --slim and --drop, as I don’t need updates for this.
I have successfully loaded the Geofabrik Dach extract with my PostgreSQL shared_buffers set to 4000MB, work_mem at 400MB and maintenance_work_mem at 256MB.
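For reference, those settings correspond to the following lines in postgresql.conf (values exactly as listed above):

```
# postgresql.conf values used for the successful DACH import
shared_buffers = 4000MB
work_mem = 400MB
maintenance_work_mem = 256MB
```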
Will this setup also suffice for Europe, or are adjustments needed? Before I venture into the unknown and see the process being killed, it would be nice to hear some ideas. Based on what I have read so far, I think it likely that shared_buffers needs a significant increase, even if this means swap space is needed to compensate for the lacking RAM.
Well, I think I can now partially answer this question myself:
I have now made an attempt to load Europe without changing the PostgreSQL settings listed above. Instead of running into an issue with PostgreSQL itself, osm2pgsql nicely reported too little “cache” for storing the nodes in memory with the “-C 4000” (4000MB) cache setting I set on the command line (don’t confuse this 4000MB with the PostgreSQL shared_buffers listed above; this is a setting of osm2pgsql itself).
So as a second attempt I raised the cache to -C 20000, which exceeds the RAM. osm2pgsql again intervened and correctly reported that I didn’t have the RAM for this (besides the roughly 11GB RAM available to the virtual machine, there was only 8GB of swap, and the swap is probably ignored by osm2pgsql as well).
So the third attempt was with -C 8000, and that again bailed out, after loading about 650M nodes. Since the planet is already well over 4 billion nodes, and Europe is about half the PBF size of the planet, I guess Europe has at least 2 billion nodes. Assuming I managed to load about a third of the Europe nodes, I fall short on RAM available for cache by at least 16 GB on my machine.
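As a rough sanity check on that shortfall, here is the back-of-the-envelope arithmetic. All numbers are my own estimates derived from the failed runs, not official figures:

```shell
# ~650M nodes fit in the 8000MB cache before the run bailed out,
# which suggests roughly this many bytes of cache per node:
bytes_per_node=$(( 8000 * 1024 * 1024 / 650000000 ))
echo "${bytes_per_node} bytes/node"   # ~12

# Scaling up to an assumed ~2 billion nodes for Europe:
echo "$(( 2000000000 * bytes_per_node / 1024 / 1024 )) MB of cache needed"
```

That works out to roughly 22-23 GB of cache for Europe, consistent with falling about 16 GB short when only ~8 GB could be spared for the cache.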
So now, instead of attempting to load this entirely in-memory, I selected the --flat-nodes option for my fourth attempt, which stores the nodes on disk instead of keeping them in memory. This is still running; I will report back when it fails or finishes, but it looks promising, as it has already passed the 650M nodes where the previous attempt without --flat-nodes failed.
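For anyone sizing disk space for this: as far as I understand it, the --flat-nodes file stores a fixed 8-byte slot (packed lat/lon) for every node ID up to the highest ID in the input, whether that ID occurs in your extract or not, so its size depends on the highest node ID rather than on the node count. A sketch of the arithmetic; the 6 billion maximum node ID is an assumption on my part, check the current value yourself:

```shell
# Flat-nodes file size = highest node ID x 8 bytes, regardless of extract size.
max_node_id=6000000000   # assumed figure; look up the actual current maximum
echo "$(( max_node_id * 8 / 1024 / 1024 / 1024 )) GB"
```

So even a small country extract produces a flat-nodes file of tens of gigabytes, which is why fast (SSD) storage matters here.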
It also seems that loading the nodes is marginally faster than without --flat-nodes: something like 700k nodes/s versus 550-600k nodes/s.
I can now definitively answer my question: yes, you can load the whole of Europe - and thus likely the entire planet - on a minimalistic 12GB RAM VirtualBox VM.
In order to finally succeed though, I had to overcome two more crashes. Both were likely caused by the fact that I used two osm2pgsql settings that, unless you have the resources, are probably best left for osm2pgsql itself to decide: the “-C” cache option, and “--number-processes”. Once I removed both of these from my command line, osm2pgsql happily chugged along and imported the data without issue.
So, what did I learn from all of this?:
Yes, it is possible to load the Geofabrik Europe extract (and likely the planet as well, as it is only about 2-3 times bigger) on a low-RAM device.
It is vital to use the “--flat-nodes” option so you don’t run out of RAM; it uses disk space to store the nodes (cache) instead. You cannot process this amount of data (Europe) in memory. So ignore the warnings listed on some web pages that “--flat-nodes” should only be used “for the whole planet”: if you don’t have the RAM, use “--flat-nodes”; without enough RAM it is your only option anyway.
Don’t try to be smart like I did by setting the “-C” cache or “--number-processes” parameters; instead, leave it to osm2pgsql to figure out its optimal processing settings. It seems to do a reasonable job!
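To make that concrete, a command line along these lines matches the lessons above. This is a sketch only: the database name, file paths and style location are placeholders you must adapt, and -C / --number-processes are deliberately absent so osm2pgsql picks its own defaults:

```shell
osm2pgsql --slim --drop \
  --flat-nodes /path/to/flat-nodes.bin \
  --hstore-all \
  --style openstreetmap-carto/openstreetmap-carto.style \
  -d gis \
  europe-latest.osm.pbf
```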
You could also enable LuaJIT with the most recent osm2pgsql release, as it speeds up the tag processing part. Use ccmake .. in the build directory to activate it, then run cmake as usual; also, don’t forget to install the corresponding LuaJIT dev package.
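If you prefer a non-interactive build, the LuaJIT switch can also be set directly on the cmake command line. A sketch, assuming a Debian/Ubuntu system and the WITH_LUAJIT option from the osm2pgsql build documentation:

```shell
sudo apt install libluajit-5.1-dev   # the LuaJIT dev package mentioned above
cd build
cmake -D WITH_LUAJIT=ON ..
make
```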
Can’t say exactly. I installed it on stock Ubuntu. Based on the disk space left over, I estimate the database was roughly 175-200GB before indexing (except the spatial index, which is already included in this figure), and some 250-275GB after adding the other column indexes I chose myself. This is based on a default openstreetmap-carto guided import, so with columns as defined by that style.
I loaded with --hstore-all, as I want access to all tags via HSTORE commands.
I plan to define some views and materialize them, which will more than double the storage.
Thanks for the suggestion. Since I don’t plan updates though, and this is just a test setup, I was quite happy with the performance as-is. It took less than a day to import, close to 17 hours or so, on a 7th-gen Core i7 HQ laptop with a 2TB SATA SSD. The VM guest was Ubuntu 18.04.
I think I need to supplement the recommendations and figures given above: one thing I forgot to mention is that the virtual machine, besides the 12 GB RAM, also had 8 GB of Ubuntu swap space allocated on the same SSD. This is crucial for the actual import.
Slightly surprisingly, a test run importing the Geofabrik North America extract failed with the same setup as used for Europe, despite the North America PBF being considerably smaller in actual size (just 7.9 versus 18.5 GB in the latest state). I guess, though I don’t know for sure, that this may have to do with relation processing, and possibly some very large mega-relations in the North America extract. But this is just speculation at this point.
Anyway, raising the swap space to 16 GB resolved the issue and allowed import of the North America extract.
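For reference, one common way to grow swap on Ubuntu is with a swap file; the size here mirrors the 16 GB that worked for me, so adapt it as needed and run the commands as root:

```shell
fallocate -l 16G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# add "/swapfile none swap sw 0 0" to /etc/fstab to make it permanent
```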
I guess that for the planet, allocating 64 GB of swap space is likely necessary when running a similar low-RAM setup; maybe 48 GB will just about do.
Using swap hardly seems to affect import speed when everything runs from SSD though, so this is of little concern.
And as a last supplement, here’s the exact command line statement I used. Note though that it uses --slim --drop and additionally --flat-nodes, since I don’t need (minutely) updates on my machine, and flat nodes forces the intermediate node cache onto the SSD instead of loading that data into PostgreSQL. Despite warnings to only use this flat nodes option for global extracts, I have found it to work equally well for (small) country extracts. This may have to do with my SSD storage though, and may not be the case with a traditional hard drive. Additionally, I am using --hstore-all, since I need access to all keys.
You may need to make adjustments to suit your needs; especially take notice of the location and names of the (default) openstreetmap-carto style assumed here. If you use a different style or a different folder structure, that will definitely need changing: