Not realizing how many resources it would take to use the entire dataset, I tried running osm2pgsql on a server with 8 cores, 128 GB of RAM and 6 TB of disk space… as those familiar with such things will know, it wasn’t enough, or even close to it, it would seem.
Now, based on smaller datasets found at geofabrik, I believe the file contains data for zoom levels that I really don’t need.
If I wanted to drop all of the data for an entire zoom level using osmosis (or another tool?), how can I do that?
I figure / am hoping that dropping two or three of the most detailed zoom levels could bring the planet data to something I could manage.
(as a side question, just what would it take to process the entire dataset?)
The raw OSM data isn’t based on zoom levels or layers. It’s all just one big lump of data. What you’d need to do is determine which features you care about for your purposes, and then extract just those from the data based on their tags and/or object type (i.e. node, way, or relation). The filters you’ll most likely want to use are described here (--tag-filter would be the main one).
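To make the --tag-filter idea a bit more concrete, here is a minimal sketch in Python that only assembles an osmosis command line dropping every way that carries a given tag key. The file names and the helper name osmosis_reject_ways are placeholders of mine, not anything from this thread, so treat it as a starting point rather than a tested recipe.

```python
import shlex


def osmosis_reject_ways(src, dst, key):
    """Build an osmosis command that drops every way tagged with `key`.

    `src` and `dst` are hypothetical PBF file names chosen by the caller.
    """
    return [
        "osmosis",
        "--read-pbf", f"file={src}",
        # reject-ways removes ways matching key=*; other object types pass through.
        "--tag-filter", "reject-ways", f"{key}=*",
        "--write-pbf", f"file={dst}",
    ]


if __name__ == "__main__":
    cmd = osmosis_reject_ways(
        "planet-latest.osm.pbf",    # assumed input file name
        "planet-filtered.osm.pbf",  # assumed output file name
        "building",
    )
    # shlex.join quotes the '*' so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

As far as I know, this leaves behind the nodes that belonged only to the removed ways, and buildings mapped as relations are not touched, so a real extract would probably need more filter steps than this.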
As for the resources required to process a planet file, I haven’t done that myself. I’ll leave that to someone else to discuss.
Going back to my original question then, given my better understanding of how the data is stored and displayed, is there a tool that would, for example, take the CartoCSS and filter the data based upon the zoom level?
I am guessing the answer is no…which leads me to my next question…how difficult might it be for someone like me with little to no experience dealing with this data or the general mapping concepts involved to write such a tool? (I am an experienced software engineer) Would such a tool even be considered reasonable or practical?
It’d need to look at any transformations in lua (that basically just change one tag to another), data selection criteria in the .mml file and things like “if zoom > x” in the .mss files. That’d be quite complicated - it’s usually easier for people to say “I’m not interested in X so I’ll remove all X” from the files that I load.
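For a sense of what just the “if zoom > x” part might involve, here is a rough Python sketch that counts zoom filters in CartoCSS text. It assumes the filters are written in the usual [zoom >= 13] form, and the sample stylesheet is made up for illustration. It does not connect a filter back to the layer’s SQL in the .mml file or to any lua transformation, which is where the real complexity lies.

```python
import re
from collections import Counter

# Matches CartoCSS zoom filters such as [zoom >= 13] or [zoom=10].
ZOOM_FILTER = re.compile(r"\[\s*zoom\s*(>=|<=|=|>|<)\s*(\d+)\s*\]")


def zoom_filters(mss_text):
    """Count each (operator, zoom level) filter found in a stylesheet."""
    return Counter(
        (op, int(level)) for op, level in ZOOM_FILTER.findall(mss_text)
    )


if __name__ == "__main__":
    # Made-up sample, not taken from any real style.
    sample = """
    #buildings[zoom >= 14] { polygon-fill: #d9d0c9; }
    #roads[zoom >= 10][zoom < 13] { line-width: 1; }
    """
    for (op, level), count in sorted(zoom_filters(sample).items()):
        print(f"zoom {op} {level}: {count}")
```

Even a complete list of zoom thresholds would only say when a layer starts being drawn, not which tags feed that layer.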
Looking at https://taginfo.openstreetmap.org, there are a lot of tags (2503) to try to make a decision about “I’m not interested in X so I’ll remove all X.” Most, if not all, of those decisions, have already been made and captured in the .mss, etc. files used by the viewer at https://www.openstreetmap.org … or at least it would provide a good baseline. It would be nice if there was an easy way to determine what tags were included at the various zoom levels. But, I understand that such a tool does not exist - at least not yet.
So, perhaps my next question is, is there a viewer or editor that I can have pointed to my tile server which would allow me to place my cursor over an object and have it display what tags are associated with that object? It is not always clear (to me at least) what the tags are or should be associated with and it would be useful to be able to visualize it.
This information is not stored anywhere. What you see is a rendered picture (which might be some days old) and a layer with the outlines of the objects which are in the current database.
It seems you want to bridle the horse from behind. Looking at your original post, I think you want to load a database without the data that is only used in the most detailed layers? A good start might be to remove all buildings.
That reply (from 4 days ago!) is essentially exactly what you need to do. Only you know what features you are interested in so you’ll need to provide that information.
As an example, https://www.openstreetmap.org/user/SomeoneElse/diary/47007 is a diary entry I wrote a while ago explaining how to extract only certain data, which links to an example script that I used to extract boundaries. You can use a variation of that to extract the data that you want.
I agree, that would be a good start. Now, I just have to look at ~2500 other tags and make a similar decision - decisions that have already been made and would provide a good baseline, if not the answer outright. I like to avoid replicating work when possible, especially when the work has been done by people with far greater expertise than me.
Yes, I understand that it is not stored anywhere, but as the data moves through the rendering pipeline, the decisions are encoded in the pipeline itself as it processes the data, .mss files, etc.
It might be nice to be able to interact with the pipeline in different ways. For example, it might be nice for an output to be what tags made it through to the end.
That’s not true. You must have mis-configured some part of the import statement.
I managed to load the whole of Europe (about 19.5 GB PBF, so 2/5th of planet at 45 GB) on a measly 4 core Virtualbox Ubuntu instance with just 20GB RAM assigned on a laptop, and the resulting database is nowhere near 6TB (can’t say exactly now, I am actually doing a re-import right now, but if I remember well it was in the range of 175-250 GB max).
I estimate you currently need some 64 GB of RAM to be safe with your planet import. Your 128 should in any case really be enough for a successful import.
Anyway, using the --flat-nodes option is also recommended if in any doubt about RAM limitations. It will alleviate issues in the first node processing stage of the import. Even if you have enough RAM to process all nodes in memory, with current SSD technology the penalty of using an on-disk flat-nodes file versus processing all nodes in RAM should be limited (and yes, using only an HDD is strongly discouraged nowadays for an import. You can use an HDD to store other stuff or rendering output like tiles, but use an SSD for the import; it will make a huge difference in the required import time).
I also strongly recommend that you import with the --hstore option. This will retain any OpenStreetMap tag not already defined as a physical database column in the style, in a PostgreSQL “hstore” type field, so you can access all tags at all times after import in your SQL by using a tags -> 'YOUR_KEY_NAME' type SQL statement.
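As a small illustration of reading a tag back out of the hstore column, here is a Python sketch that only builds the SQL text. It assumes the default osm2pgsql table name planet_osm_point and a column called tags; the key wheelchair and the helper name hstore_lookup_sql are just examples of mine.

```python
def hstore_lookup_sql(table, key):
    """Return SQL that selects one tag value from the hstore column `tags`.

    `table` is trusted input here (it is not escaped); `key` is escaped
    as an SQL string literal by doubling single quotes.
    """
    escaped = key.replace("'", "''")
    return (
        f"SELECT osm_id, tags -> '{escaped}' AS value "
        f"FROM {table} WHERE tags ? '{escaped}';"
    )


if __name__ == "__main__":
    # planet_osm_point is the default point table name; adjust if you
    # imported with a different prefix.
    print(hstore_lookup_sql("planet_osm_point", "wheelchair"))
```

The tags ? 'key' condition restricts the result to rows that actually carry the key. Keys that already have their own column in the style will be in that column rather than in tags.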
Although, in that tutorial they also used -C 2500 --number-processes 1 which I left out as they did not seem necessary as I read somewhere the code has ways to automatically determine good numbers to use.
The issue, as best I could determine, wasn’t with the disk space or number of cores, but with the RAM required. Usage climbed until all available RAM was consumed and then the process was SIGKILLed. I am in the process of importing the North America region and it took ~90 GB of RAM during the import process. I used the same basic import statement as above and used htop to watch RAM usage.
I can certainly try using the --flat-nodes. If you see any other issues with the import statement, please let me know. I would love to learn how to do this better.
I must admit I always run with the --flat-nodes and --slim options combined, so my statements about needed RAM are based on that, and it might well be that loading all nodes in RAM needs more than your 128 GB for the planet. Nonetheless, using --flat-nodes and especially --slim should allow you to process the planet on the hardware you have (it is actually the --slim option that avoids holding all temporary data in RAM, while --flat-nodes stores the node data outside the database, the latter possibly reducing, although I am not entirely sure, the memory consumption of PostgreSQL). So, if you have an SSD for storing the flat nodes file, certainly try using both --flat-nodes and --slim.
Do not set your --number-processes to 1! It will disable parallel processing, the thing you actually want in order to speed up processing, provided you have configured your PostgreSQL instance properly to use parallel processing by adjusting the postgresql.conf configuration file. Be aware that, at least with the --flat-nodes option enabled (I can’t say for sure without this option), the node loading stage seems to be single core. Only the way and relation loading seem to be multi-core.
If you don’t intend to run (minutely/hourly/daily) updates, but only intend to do full re-imports now and then, then dropping the slim tables using the --drop option is also recommended. It will save disk space and processing time.
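Putting the options from this thread together, an import command could be assembled roughly as in the Python sketch below. The database name gis, the file paths and the helper name osm2pgsql_command are assumptions for illustration only, and --number-processes and the cache size are deliberately left at their defaults.

```python
import shlex


def osm2pgsql_command(pbf, flat_nodes_file, database="gis",
                      keep_for_updates=False):
    """Build an osm2pgsql import command using the options discussed above."""
    cmd = [
        "osm2pgsql",
        "--create",
        "--slim",                         # keep temporary data in the database, not RAM
        "--flat-nodes", flat_nodes_file,  # node store as a file, ideally on an SSD
        "--hstore",                       # keep tags without their own column
        "-d", database,
    ]
    if not keep_for_updates:
        # Only drop the slim tables when no later updates are planned.
        cmd.append("--drop")
    cmd.append(pbf)
    return cmd


if __name__ == "__main__":
    # Both paths are placeholders.
    print(shlex.join(osm2pgsql_command(
        "planet-latest.osm.pbf", "/ssd/flat-nodes.bin")))
```

Pass keep_for_updates=True if you do plan to apply updates later, since --drop removes the tables that updates depend on.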
Thank you for all of this good information. I can confirm that --flat-nodes does work as you describe. While not quite complete yet, processing the entire planet’s worth of data uses only a fraction of the amount of RAM and is a lot faster. It is clearly the way to go when processing large datasets.
I am curious…under what conditions would one not want to use --flat-nodes? If there really aren’t any, it may be worth it for someone to add an addendum to the tutorial specifically mentioning the flag and its benefits, considering it allows the processing of data that may be impossible to process otherwise - something someone new is not likely to realize.
Good to hear this seems to have solved your issues.
I think this is partly a historical artifact. If your only options were HDD versus RAM, and you had the RAM on a beefy enterprise type server, then loading all nodes in RAM clearly had huge benefits in terms of loading / processing speed, as the nodes table needs to be accessed pretty much randomly for building ways and multipolygon relations, and random access is especially slow on a mechanical hard drive.
Nowadays, with SSDs and both motherboard and external interfaces like USB becoming ever faster, the whole thing of loading everything in memory for speed starts to become moot, unless you are really pressed for time. I think I saw someone mentioning having been able to load a rather recent planet in just 5-6 hours or so on a high end machine with plenty of RAM… that is probably still out of reach for most casual OSM users.