I’m trying to load a number of OSM dumps from the past few months, to compare past maps. However, each dump takes a very long time to load; the most recent one has been running for more than a day on reasonably nice workstation hardware (Core i7, 16 GB RAM, a reasonable RAID array). It seems to be doing far more disk I/O than I would expect it to need. Is there any trick to make it faster? Maybe an intermediate index of some sort?
Hm, reading more: First, apparently this is a known problem; I hadn’t seen all the Wiki pages and comments about it.
Second, reading the code: it seems the problem is all the data correlation that osm2pgsql has to do. If the import could be just a few giant COPY statements, rather than those plus a truly massive number of prepared SELECTs, it would be hugely faster.
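To make the cost concrete, here is a minimal sketch of the two access patterns, using Python’s sqlite3 as a stand-in for PostgreSQL; the table, column, and function names are made up for illustration and are not osm2pgsql’s actual schema:

```python
import sqlite3

# In-memory stand-in for the node table (illustrative schema, not osm2pgsql's).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")
db.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
               [(i, 50.0 + i * 1e-6, 8.0 + i * 1e-6) for i in range(10_000)])

# Each way references three node ids.
ways = [(w, [w * 3, w * 3 + 1, w * 3 + 2]) for w in range(3_000)]

def resolve_with_selects():
    # Correlated style: one indexed SELECT per node reference --
    # the pattern that turns an import into random I/O.
    out = []
    for way_id, refs in ways:
        coords = [db.execute("SELECT lat, lon FROM nodes WHERE id = ?",
                             (ref,)).fetchone() for ref in refs]
        out.append((way_id, coords))
    return out

def resolve_with_one_scan():
    # Bulk style: read the node table sequentially once, then resolve
    # every reference from an in-memory map -- a COPY-like access pattern.
    cache = {nid: (lat, lon) for nid, lat, lon in
             db.execute("SELECT id, lat, lon FROM nodes")}
    return [(way_id, [cache[r] for r in refs]) for way_id, refs in ways]

assert resolve_with_selects() == resolve_with_one_scan()
```

Both produce the same result; the difference is nine thousand tiny indexed lookups versus one sequential scan, which is why the bulk shape matters so much at planet scale.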
Some thoughts on that: First, why is osm2pgsql single-pass? I feel like it’d be a great deal faster to scan the file once, filling the cache with nodes from one region and processing them, then repeat the scan for a second region, and so on. This would require stitching the regions together at the end, but I’d expect that to be considerably cheaper?
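The region-at-a-time idea could be sketched roughly as follows; this is purely hypothetical (region keying by a coarse longitude band, made-up data), not anything osm2pgsql actually does:

```python
# Hypothetical sketch: bucket nodes by a coarse spatial key, resolve each
# bucket with a small cache, and defer ways that span buckets to a final
# "stitch" pass. All names and data here are illustrative.

def region_of(lon, width=10.0):
    # Coarse longitude band as the region key.
    return int(lon // width)

nodes = {i: (50.0, -30.0 + i) for i in range(60)}          # id -> (lat, lon)
ways = {100: [0, 1, 2], 101: [25, 26], 102: [9, 10, 11]}   # id -> node refs

# Pass 1: group node ids by region so each region's cache stays small.
regions = {}
for nid, (lat, lon) in nodes.items():
    regions.setdefault(region_of(lon), set()).add(nid)

resolved, deferred = {}, {}
for wid, refs in ways.items():
    home = region_of(nodes[refs[0]][1])
    if all(r in regions[home] for r in refs):
        resolved[wid] = [nodes[r] for r in refs]   # fully inside one region
    else:
        deferred[wid] = refs                       # crosses a region boundary

# Pass 2 ("stitching"): only the boundary-crossing ways need global lookups.
for wid, refs in deferred.items():
    resolved[wid] = [nodes[r] for r in refs]
```

The bet here is that boundary-crossing ways are rare, so almost all work happens against a cache that fits in RAM; whether that bet pays off for real planet data is exactly the open question.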
Second, if osm2pgsql has to do all of this correlation to convert between formats, maybe this isn’t the best format for this data. Does anyone publish equivalent PostgreSQL dumps of this data? That seems like the simplest solution to me. Also, has anyone considered denormalizing the OSM format a bit, just enough that it can be read sequentially without needing to refer back to earlier parts of the file?
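By “denormalizing” I mean something like the following toy transformation; this is a hypothetical format of my own invention, not an existing OSM serialization:

```python
# Toy illustration of the denormalization idea: instead of ways referencing
# node ids (which forces a reader to look back into earlier parts of the
# file), the publisher inlines each way's coordinates once, up front.
# Purely hypothetical -- not an existing OSM format.

nodes = {1: (50.1, 8.6), 2: (50.2, 8.7), 3: (50.3, 8.8)}
normalized_ways = [("way/10", [1, 2]), ("way/11", [2, 3])]

# One-time denormalization, done by whoever publishes the dump and
# therefore already holds the full node table:
denormalized = [(wid, [nodes[r] for r in refs])
                for wid, refs in normalized_ways]

# A consumer can now stream the file strictly sequentially, with no
# node lookups at all -- every record is self-contained.
for wid, coords in denormalized:
    assert all(isinstance(c, tuple) for c in coords)
```

The trade-off is obvious: shared nodes get duplicated, so the file grows, but the import becomes a pure sequential read instead of a giant random-access join.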