NB: It seemed to me that one of the major selling points in the new data model discussion was to assume exactly the opposite. Maybe it’s time to make some reasonable assumptions about current (affordable) hardware, backed by actual data.
Just to put some numbers out there: when I’m rendering the planet, I normally rent a machine on AWS with 128 GB of RAM, a 64-core processor, and >1 TB of SSD storage. For this kind of compute power, I pay around $1.00 USD per hour. I spend the first hour downloading the planet and patching it with hourly diffs, and the second hour (closer to 40 minutes at this point) rendering my patched planet into an .mbtiles file. Then I copy the .mbtiles file to a network share, where my (way, way cheaper) tileserver ($30 USD/month) can access it to serve tiles.
I only render the planet occasionally (when we need an update for renderer development purposes, e.g. someone’s done a lot of mapping and we want to see the output), so a cost of $2 whenever I want an update is pretty reasonable.
Of course, for a “production” capability I’d want a permanent asset that can run builds continuously, but even that is no longer crazy-person pricing, and is quite financially accessible in a business setting.
There are many different use cases for OSM data, and Osmium tries to be as useful as possible to as many people as possible. Not everybody has access to Amazon machines or the money to spend on them. In fact, one of the complaints I hear most often about Osmium is that it uses too much memory for this or that task, so this is a big concern for me. I want OSM to be accessible to the student with their old hand-me-down notebook. Osmium is also a library, so I have to be conscious of other uses the user might have for that memory, over which I have no control.
If the 32 MB maximum uncompressed block size still holds, and we assume a working set of 10x that size (to account for all the temporary buffers), then even with 64 threads we’re only talking about 20 GB. That’s reasonable for a workstation/server; 8 threads on a notebook = 2.5 GB.
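As a quick sanity check of that arithmetic (a sketch only; the 10x working-set multiplier is an assumption from the discussion, not a measured value):

```python
# Back-of-the-envelope memory estimate for the block-parallel scheme.
BLOCK_MB = 32            # maximum uncompressed PBF block size
WORKING_SET_FACTOR = 10  # assumed overhead for temporary buffers

def peak_memory_gb(threads: int) -> float:
    return BLOCK_MB * WORKING_SET_FACTOR * threads / 1024

print(peak_memory_gb(64))  # 20.0 GB on a 64-thread server
print(peak_memory_gb(8))   # 2.5 GB on an 8-thread notebook
```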
What do you think of this:
- Main thread reads the pbf and hands compressed blocks to the worker threads
- Each worker thread unzips/decodes one block, applies the changes, encodes/zips it, and puts it into a priority queue.
- Each worker would need to know the type/ID of the first element in the block immediately following its assigned block, in order to determine whether it should incorporate newly created objects with ID > max ID into its block. (Workers would need to pass this info along and signal their peers.)
- If a block becomes too big, split it
- The priority queue needs to be bounded (so we don’t run out of memory if writing to disk is slow), but must always allow enqueuing the next block to be written.
- An output thread grabs the encoded blocks from the priority queue and writes them to disk
(There are probably some subtleties I’m missing)
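The steps above could be sketched roughly like this. Everything is illustrative: `run_pipeline` is a made-up name, `str.upper()` stands in for the unzip/apply-changes/zip work, and the peek-at-the-next-block and block-splitting subtleties are omitted:

```python
import queue
import threading

def run_pipeline(blocks, n_workers=4, out_bound=8):
    work_q = queue.Queue()                    # reader -> workers
    out_q = queue.Queue(maxsize=out_bound)    # workers -> writer (bounded)
    sink = []                                 # stand-in for the output file

    def worker():
        while True:
            item = work_q.get()
            if item is None:                  # sentinel: no more blocks
                break
            idx, compressed = item
            patched = compressed.upper()      # decode, apply changes, encode
            out_q.put((idx, patched))         # blocks here if the writer lags

    def writer():
        # Emit blocks strictly in input order, buffering the ones that
        # finished early; this keeps output ordered even though workers
        # complete out of order.
        pending, next_idx = {}, 0
        while next_idx < len(blocks):
            idx, data = out_q.get()
            pending[idx] = data
            while next_idx in pending:
                sink.append(pending.pop(next_idx))
                next_idx += 1

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    wt = threading.Thread(target=writer)
    for t in workers:
        t.start()
    wt.start()
    for idx, block in enumerate(blocks):      # main thread hands out blocks
        work_q.put((idx, block))
    for _ in workers:
        work_q.put(None)
    for t in workers:
        t.join()
    wt.join()
    return sink

print(run_pipeline(["b0x", "b1y", "b2z"]))    # ['B0X', 'B1Y', 'B2Z']
```

Note that because the writer drains the bounded output queue continuously and buffers out-of-order blocks itself, workers never deadlock waiting for an earlier block to be written.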
As of today, would you recommend going with Osmium directly vs. the Python version, or are they similar performance-wise?
Agree. I do like locations-on-ways (which is how GeoDesk stores ways internally), but getting OSM source data in this form would only cut the runtime of gol build by about 20% (and it is only used once in the lifecycle of a GOL). I wouldn’t trade a time saving of a few minutes (for the typical user) for upheaval across the entire OSM ecosystem.
I’d rather see the time/energy/resources spent on launching your Overpass fork we’ve discussed above, or improving the UX of JOSM — or enhancing
That’s certainly a worthy goal, and being resource-conscious is always a good thing (especially in a cloud environment, where CPU/storage consumption translates directly into billing).
Multi-threading support wouldn’t need to exclude users with low-end hardware, as fewer cores means fewer threads → less memory used.
Let me acknowledge (and thank!) the developers of the most widely used tools/libraries for their intentional design decisions favoring a low resource footprint over benchmark-style speed (which comes with far higher baseline requirements). This is not just about hardware, but about the entire ecosystem of users around OpenStreetMap.
Anyway, this approach is still easier to optimize for use on large machines (e.g. for recurring tasks, if not explicit RAM disks, then making better use of the Linux page cache for the “unused” RAM) than any attempt to hardcode a high minimum baseline for everyone. So it is somewhat unfair to conclude, from poorly configured benchmarks, that the tools which use fewer resources are slower.
I think one of the benchmark examples was Blazegraph (the Sophox implementation, with very low concurrent access) vs. Overpass (the same instance used in production). Setting aside that a different query language is an additional option, one of the advertised advantages of Blazegraph was being able to run world-level queries without Overpass-style timeouts. However, this was unfair: not only did the benchmarked Blazegraph lack the full data (not just missing attic data; even the live data had no geometries), but the Blazegraph server specs were also higher than those of Overpass.
So in general, I think it is better to take benchmarks with a grain of salt. It makes no sense to assume that a tool whose default configuration avoids using more resources could not benefit from fine-tuning, given time similar to that invested in other tools which claim to be faster under very specific circumstances.
The question here really is what you mean by “accessible”. What are the specific use cases you have in mind?
Besides, I believe this topic needs to be looked at with a bit of a broader view:
Assuming I’m a student in a region with fairly slow internet connection, and/or a data plan that’s prohibitively expensive / limited to xyz MB/GB per month, no amount of data model changes would get me anywhere close to processing OSM data on a global scale. Downloading a planet file would take days, and the occasional power outage doesn’t help either.
There are some alternative options already available: some providers, like Geofabrik, offer country extracts, or we point people to downloading specific data from Overpass or other online services. Also, OSMF provides free access to their development server; you only need to apply for an account via a GitHub issue. If you’re into vector tiles, people sometimes run osmium renumber before creating tiles, to reduce the overall memory consumption, etc.
What are the use cases that cannot be covered by these alternative options, and what could be done specifically to improve the situation?
I quickly wanted to comment on this one as well. I added an option in Overpass to process PBF files with locations-on-ways, to skip the node lookup part during way processing. Since Overpass is very unhappy with missing nodes, I had to use --keep-untagged-nodes.
Tests on a 2012 planet file showed a processing time of 56 minutes instead of 66 minutes. Estimating the effect of --keep-untagged-nodes, I’d say, 40 minutes should be possible.
This sounds like an impressive 40% improvement. However, I did those tests on my fork. Upstream would take around 4-5 times longer. When you put 1.1 hours or 0.6 hours in relation to 5 hours, it doesn’t seem like a huge step forward anymore.
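A trivial check of those percentages, using only the numbers quoted above (the 5-hour upstream figure is the rough estimate from the text):

```python
fork_baseline_min = 66   # fork, without locations-on-ways
estimated_min = 40       # estimate with --keep-untagged-nodes cost removed

gain = (fork_baseline_min - estimated_min) / fork_baseline_min
print(f"{gain:.0%}")     # ~39%, the "40% improvement" quoted above

# The same 26-minute absolute saving, measured against upstream's
# roughly 5-hour runtime, is less than a 10% improvement.
print(26 / (5 * 60))
```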
(Not to sound presumptuous or like a know-it-all, especially since these are not my circumstances, nor those of anyone I know personally, but) you’re making some assumptions you don’t know are true. Specifically, you seem to assume that anyone who wants to work with OSM data will download everything themselves. One of the most common requests (e.g. to Organic Maps) was the ability to share map files directly between users. So it’s plausible that in some places people have somewhere they can download files from the internet, but otherwise distribute them via sneakernet.
I didn’t know about this, thanks for sharing. Maybe this is a solution for some people. I guess the point I can make here is that this is not widely advertised.
About the dev server: it may still not be the solution. Someone with a consistent (even if slow) internet connection may be able to use it, while someone with an intermittent internet connection may not. But this is me talking out of my ass… everything depends on specific cases.
Either way, pushing everything onto the internet or “the cloud” just because we can is not the way to go with anything – small is better than big, local is better than remote. Why is one gonna depend on some remote server if one can do everything on one’s own computer/phone? If one depends on said remote server, what’s one gonna do if it goes away? Everyone must have heard something like this in the past, I’m just rehashing a rehash after all…
That’s a major accomplishment!
We’ve touched upon this in the other thread — I’m very surprised that there isn’t more of an effort on the part of the Overpass maintainers to pull your changes back into the mainline (Radical architecture change? Tradeoffs that clashed with other requirements?)
It seems silly to consider sweeping changes to the core of OSM that may yield performance gains of 20% to 40% (in select cases), while ignoring opportunities for order-of-magnitude improvements in existing OSM tools.
That’s the design approach behind GeoDesk: users build a local database (or download a tileset), then run their queries directly. Overpass (as a hosted service) works great for querying specific features across a large geographical area, but isn’t suitable for high rates of repetitive queries or large data downloads (for example, discovering the surrounding features for several million points of interest).
Use cases of OSM data are very diverse (and hopefully will continue to grow) and will likely be met by a mix of cloud-based and local tools.
I think this would be a rather large undertaking. Currently, I have listed about 70-80 topics in the README file where the fork deviates from upstream, including architectural and data persistence changes.
OSMF could probably fund someone for a few months to bring some of the changes back into mainline. Unfortunately, I don’t have the time to do that.
IIRC, I’ve managed >10’000 requests/s at one point, as well as downloading all motorways worldwide, or filtering by nearby phones for some 1.7 amenities in about 10 s (via overpass turbo). So in general, this isn’t completely impossible.
As a heavy user of Overpass (on a private server), I would be in favor of simply adopting your fork, with all of its awesome performance improvements, and instead paying someone to build up the supporting infrastructure to make it usable - which I think is just planet data downloads in the correct format?