Help with Osmium

Devetak · February 19, 2024, 4:44pm

Hello.

I wanted to process the entire world with a query like this one:

[out:json][timeout:300];  
area["ISO3166-1"="DE"][boundary=administrative]->.germany;  
(  
  node(area.germany)["man_made"="works"];  
  way(area.germany)["man_made"="works"];  
  relation(area.germany)["man_made"="works"];  
  node(area.germany)["industrial"];  
  way(area.germany)["industrial"];  
  relation(area.germany)["industrial"];  
  node(area.germany)["man_made"="works"]["product"];  
  way(area.germany)["man_made"="works"]["product"];  
  relation(area.germany)["man_made"="works"]["product"];  
);  
out center;  
(._;>;);  
out skel;

Of course, I did the responsible thing and downloaded the entire dataset from OSM. What I find odd is that the following Python code takes a very long time (now running for 4 hours and is roughly 20% done, assuming OSM has 8 billion nodes):

import osmium

class IndustrialHandler(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.elements = []

    def node(self, n):
        # Keep nodes tagged industrial=* or man_made=works.
        if 'industrial' in n.tags or n.tags.get('man_made') == 'works':
            self.elements.append({
                "id": n.id,
                "type": "node",
                "lat": n.location.lat,
                "lon": n.location.lon,
                # Copy the tags: the node object is only valid inside this callback.
                "tags": {tag.k: tag.v for tag in n.tags}
            })

Is there anything I am doing wrong? For reference, the above query with Overpass took 1 minute. I am not very experienced with osmium and would be grateful for tips.

RicoElectrico · February 19, 2024, 4:57pm

Pyosmium is not designed optimally. In fact, its performance is unsatisfactory for all but the smallest datasets (tens of MB maybe).
First, the handler function incurs unnecessary context switches between Python and C++ code, because every object in the file is handed to a Python callback. Most of the time you don’t need to evaluate all objects from within Python. Some sort of method chaining for passing filters would work better.
Second, the design is not Pythonic. They should have used e.g. generators.

I advise you to filter the data first with the osmium tags-filter command-line tool. It will reduce the data to a more manageable size. After that, pyosmium might not be that bad for quick prototyping if Python is all you know and you don’t have experience with GIS tools. A heads-up: area handling can sometimes be tricky, especially for relations; have a look at How to create geometry from relation? · Issue #80 · osmcode/pyosmium · GitHub
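
As a rough sketch of that pre-filtering step, driven from Python since that is what you are using: the helper names and file names below are placeholders I made up, and it assumes osmium-tool is installed and on your PATH. The filter expressions mirror the tags from your Overpass query. By default tags-filter also keeps the nodes and members referenced by matching ways and relations, which roughly corresponds to the `(._;>;)` part.

```python
import subprocess


def build_tags_filter_cmd(src, dst):
    """Build the argv for an `osmium tags-filter` run (src/dst are placeholder paths)."""
    return [
        "osmium", "tags-filter", src,
        "nwr/industrial",        # any node/way/relation carrying an industrial=* tag
        "nwr/man_made=works",    # any node/way/relation tagged man_made=works
        "-o", dst,               # output file
        "--overwrite",           # replace dst if it already exists
    ]


def run_tags_filter(src, dst):
    """Run the filter; raises CalledProcessError if osmium exits with an error."""
    subprocess.run(build_tags_filter_cmd(src, dst), check=True)


# Example call (not executed here; file names are hypothetical):
# run_tags_filter("planet-latest.osm.pbf", "industrial.osm.pbf")
```

The resulting file should be small enough that running your handler over it is no longer the bottleneck.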

CC @Jochen_Topf