Advice on getting global-coverage OSM XML data

Hello, I am after advice on how best to get the OSM data I want. I am after:

XML OSM format of relations, their ways and their nodes.
• Country border data. Admin level 2, type boundary, boundary administrative. Global coverage.
• Global coastline data. Ways with natural coastline.
This is my base level requirement. Future requirements include other admin levels and other types of relation like lakes, rivers and more.
I have tried:
• The overpass API but my requests are throttled as I am after the data on a global scale.
• Osmium command line on pbf downloads but I am unable to limit the data I extract to match my requirements as the filtering works on OR logic rather than AND. Plus the recursive argument needed for extracting ways and nodes gets all admin levels. As a result, the extracts for just a small area are large.
I have started looking at libOsmium but this looks like a complex and difficult approach.
I would welcome any advice on how I can obtain the data I’m after, in the format I want and with global coverage. Ideally I want to make the process as programmatic as possible. I am aware the data is available in shapefile format but this doesn’t meet my needs.
Best, Chris

Sometimes an example helps, so have a look here. That’s not exactly what you want but is an example of similar sorts of processing. It works with Geofabrik regions, so you can edit the script to match somewhere small near you. Depending on what you want to do with the data, you may not want to load an osm2pgsql database, but that is pretty common.

For large-scale queries, you could use GeoDesk. Using the GOL Tool, turn any .osm.pbf file into a Geo-Object Library (requires less than 100 GB of storage for the full planet) and run your queries locally (Toolkits for Python, Java and C++).

If you could share a bit more about your use case, I can show you how to write the queries.

Also, if you just want features as XML, you can use the GOL Tool itself:

gol query world a[boundary=administrative][admin_level=2] > countries.xml

The above extracts all countries as countries.xml from world.gol (which you can build from a planet file via gol build).

Osmium is super fast when reading PBF files so you can simply chain a couple of tag filter commands together - step 1, extract all relations with admin_level=2 into a file, step 2, extract all relations with boundary=administrative from that into a new file.

Thanks for the replies.

The script for downloading from Geofabrik is interesting but the other suggestions look simpler.

I’ve been trying out GeoDesk GOL command line and it does let me extract admin level 2 and boundary administrative which is good. It is also extracting the subarea children relations recursively so I am getting an output file much larger than I need. I do want the relation child nodes and ways, and the nodes making up the ways, but not all the subarea relations. In overpass I used a query like:

(rel(349044);nw(r);node(w););

For a specific relation id. This allowed me to get all the data to plot just the country outline.

The chaining of osmium commands I think I also have got to work but does the same thing and recursively gets the subareas.

Is there a way with either method to suppress the subarea recursion whilst keeping the extraction of nodes and ways?

Best, Chris

The XML output from gol query produces all descendant elements (so the set is always reference-complete). To avoid getting the extra subareas, you have two main options:

  1. You could output as GeoJSON: This assembles the geometries of the admin areas; however, it doesn’t give you “true” OSM data (i.e. the result set doesn’t include admin_centre nodes)

  2. Process the elements using one of the SDKs – this gives you the greatest amount of flexibility (example for extracting counties)

In Python, #2 looks like this:

from geodesk import *

world = Features("world.gol")
countries = world("a[boundary=administrative][admin_level=2][name]").relations

for country in countries:
    name = country['name:en']
    if name is None:
        name = country.name
    print(f"{country}: {name}:")
    for member in country.members:
        print(f"- {member}")

This prints

relation/2202162: France:
- way/1305406191
- way/362103016
- way/375582884
...

If you need the geometry as Well-Known Text, you can print(country.wkt). Or get a Shapely geometry using country.shape – the Shapely library is great for doing more advanced processing, such as buffering or simplifying.
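To keep only the outline members and skip subareas, the member loop above could filter on member type and role. This is a minimal sketch, not a confirmed recipe: it assumes each member exposes `is_relation` and `role` attributes (as GeoDesk for Python features are understood to do; check the current documentation), and `outline_members` is a hypothetical helper name. It is written against plain attributes so it can be tried with stand-in objects.

```python
def outline_members(relation, skip_roles=("subarea",)):
    """Return the members of a boundary relation, without sub-relations.

    Assumes every member has `is_relation` (bool) and `role` (str)
    attributes; both names are assumptions to verify against the toolkit.
    """
    kept = []
    for member in relation.members:
        if member.is_relation:
            # Child relations (typically role "subarea") are not part of the outline.
            continue
        if member.role in skip_roles:
            # Defensive: also skip non-relation members carrying an unwanted role.
            continue
        kept.append(member)
    return kept
```

With a real GOL, the result would be the ways (and nodes such as the admin_centre) of one country, which can then be walked for their geometry without descending into lower admin levels.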

Hello. Over the weekend I was able to compile the GOL C++ toolkit. The examples and class definitions suggest this GeoDesk approach can do exactly what I need. I have a very basic console application using the toolkit.

I’ve downloaded the planet-latest.osm.pbf file and converted this to GOL format. It is 104 GB, which sounds right, so I’m expecting comprehensive coverage of country boundaries.

However, I’m not getting the outputs I’d expect. I create the following Features:

Features planet("e:/osmdata/planet-latest.gol");

Features adminLevel2 = planet("r[admin_level=2][boundary=administrative]");

When I iterate through adminLevel2 I get only 20 results. If I remove the boundary=administrative part of the filter I get 129 results. I was expecting around 240, which is basically one result for each country.

Are you able to help me understand what I’m doing wrong?

And a second question. Once I correctly get the country relations I am after, I plan to iterate through all the way members and get the nodes for each. From this I can construct the way geometries in my Spatialite database and capture the first and last node IDs of these ways to help with polygonising and other post-processing.

However, I see there is this anonymous node concept where nodes that only contribute to a geometry don’t have their ID returned. Is this correct? Or can I get the ID of every node okay?

If I can’t it means I won’t be able to construct polygons myself and means this GeoDesk solution won’t work for me.

The GOL function to polygonise doesn’t help my situation.

Best, Chris

Hi Chris,
When you are querying administrative areas, you have to use a as the type code (for “areas”, which can be ways or relations). Type r is for all other kinds of relations (This is different from Overpass queries).
As for the IDs of nodes, gol build by default only keeps the IDs of nodes that have tags or are relation members. There is now an option to keep the IDs of all nodes (build option -w). However, the toolkits currently don’t return those IDs as part of queries (This should be implemented next month).
In the meantime, if a node’s ID is 0, you can use its coordinates as its identity, as anonymous nodes are always guaranteed to have a unique location. If you use Mercator coordinates (or lon/lat in 100-nanodegrees), you can store a coordinate location as a 64-bit integer.
But is there any reason why the GeoDesk polygonization won’t work in your case? Storing the component ways in a Sqlite database and polygonizing from there seems like a lot of effort.
If you can share more about your use case, I can suggest some possible queries (or functions from the GEOS library) that may make that task easier.
If you must polygonize yourself, I would strongly recommend using the coordinates instead of the node IDs to identify end points, so you avoid the problem of not being able to connect two different nodes in the same location.
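The idea of using a location as a node's identity can be sketched as follows. This is only an illustration under stated assumptions: it uses lon/lat in 100-nanodegree units (7 decimal places), and the function names and bit layout (longitude in the high 32 bits, latitude in the low 32 bits) are choices made for this example, not anything defined by GeoDesk.

```python
SCALE = 10_000_000  # 100-nanodegree units per degree (7 decimal places)

def pack_lonlat(lon, lat):
    """Pack a lon/lat pair into a single unsigned 64-bit integer key."""
    lon_i = round(lon * SCALE)  # within +/-1.8e9, fits a signed 32-bit int
    lat_i = round(lat * SCALE)  # within +/-0.9e9, fits a signed 32-bit int
    # Mask to 32 bits so negative values are stored in two's complement.
    return ((lon_i & 0xFFFFFFFF) << 32) | (lat_i & 0xFFFFFFFF)

def unpack_lonlat(key):
    """Recover the (lon, lat) pair from a key made by pack_lonlat."""
    def signed32(value):
        return value - (1 << 32) if value & 0x80000000 else value
    lon_i = signed32((key >> 32) & 0xFFFFFFFF)
    lat_i = signed32(key & 0xFFFFFFFF)
    return lon_i / SCALE, lat_i / SCALE

key = pack_lonlat(-70.6693, -33.4489)
print(key, unpack_lonlat(key))
```

Two way end points at the same location then produce the same key, which is what matters when joining ways into rings.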

+1 to that! Creating proper polygons from OSM boundary relations - with all the intricacies of enclaves and exclaves and potential eight-shaped polygons and whatnot - is a very complicated wheel that has been invented a couple of times already (just like Geodesk apparently does, osm2pgsql will also do this work for you while importing data into a PostGIS database). Unless you have a very good reason, I would strongly recommend that you make use of one of the existing polygon-building libraries/tools instead of thinking “how hard can it be, I’ll hack this myself”.

Perhaps you can be a little more forthcoming about what exactly your use case is and why you believe that doing your own polygonizing in Spatialite is necessary.

Hm. Thankfully getting rid of subareas altogether is currently discussed elsewhere :wink: but yes, that makes it a little more complicated. This is how I would solve it:

osmium tags-filter -R planet.osm.pbf r/admin_level=2 -o al2.osm.pbf

extracts the relations but not their members. Then

osmium tags-filter -R al2.osm.pbf r/boundary=administrative -o al2boundary.osm.pbf

to get rid of potential relations with admin_level=2 but not boundary=administrative (unsure if there are even any). Then

osmium cat al2boundary.osm.pbf -fopl | sed -e "s/.* M//" | sed -e "s/,/\\n/g" | cut -d@ -f1 | sort -u > want.txt

This converts the relations into a text representation which allows us to use standard Unix command line utilities to split out the member list, convert it into one member per line, strip off the role information, and create a text file with unique member IDs. After that, we can use

osmium getid -r planet.osm.pbf -i want.txt -o relation-members.osm.pbf

and finally, to bring these members together with their relations from an earlier step,

osmium cat relation-members.osm.pbf al2boundary.osm.pbf -o adminbounds-complete.osm.pbf
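For a more programmatic pipeline, the sed/cut step that builds want.txt could be done in Python instead. The sketch below assumes the OPL layout used above (space-separated fields, with the member list in the field that starts with `M`, and each member written as type letter, ID, `@`, role). The helper name `member_ids` and the `include_relations` switch are additions for this example; turning the switch off drops relation members such as subareas, which the shell version keeps.

```python
def member_ids(opl_lines, include_relations=True):
    """Collect the unique member references (e.g. 'w123') of OPL relation lines."""
    ids = set()
    for line in opl_lines:
        if not line.startswith("r"):
            continue  # only relation lines carry a member list
        for field in line.rstrip("\n").split(" "):
            if not field.startswith("M") or len(field) == 1:
                continue  # not the member field, or an empty member list
            for member in field[1:].split(","):
                ref = member.split("@", 1)[0]  # strip the role
                if ref.startswith("r") and not include_relations:
                    continue  # skip child relations such as subareas
                ids.add(ref)
    return sorted(ids)

sample = ["r349044 v1 dV c1 t2020-01-01T00:00:00Z i1 utest "
          "Tadmin_level=2,boundary=administrative "
          "Mw10@outer,w11@outer,n5@admin_centre,r99@subarea"]
print("\n".join(member_ids(sample, include_relations=False)))
```

Writing the returned list one entry per line gives a file in the same shape as want.txt.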

But, as I said in my previous message, I have a hunch that maybe you’re creating more trouble for yourself than strictly necessary…

The real fun starts with touching inner rings – the “lake-in-the-clearing” problem (invalid per OGC, but perfectly legal in OSM).

(Though for country borders, that particular case is unlikely.)

Hi GeoDeskTeam, Woodpeck,

Using ‘a’ and then checking country.IsRelation() has done the trick. I get 221 results which is the right number from previous experience. Thanks for helping me get to this point.

Here’s the background on what I’m trying to do.

I want to make mapping information available for blind people who depend on screen readers (like JAWS, NVDA and VoiceOver) to get info off the screen. All the info eventually needs to be either spoken or sounded in some way. I am blind myself and, though I would love to know I’m wrong about this, I’m not aware of a screen-reader-friendly mapping application at the moment.

So I’m trying to write a self-contained desktop application. A client-server architecture is beyond me at the moment plus I don’t think meets performance requirements. I’m currently just trying to get an atlas working with basic info for countries and the seas.

My application has two main parts. The graphical high-contrast map (for those with some vision) and then a side panel which supports navigation by keystrokes and the display of information in text which the user can focus on and have read out by the screen reader.

Basic information I display is the name of the country at the centre of the screen, what countries border it and the major cities it contains. There are a thousand other things I’d like to show too.

I need to maintain my polygons with a knowledge of the ways that make them up because:

1. I need to have different resolutions of data to make zoomed out views performant. I’m using basic vector drawing. If I simplify completed polygons for each country I get gaps between the countries as the shared lines are simplified in isolation. So I need to simplify the shared way geometries, not the polygons.

2. OSM admin level 2 data, and possibly level 4 also, includes territorial waters. I need to intersect my country outline data with natural=coastline data to trim off this territorial water. This means identifying nodes by id to ensure I have exact matches. Then making a country polygon from ways that are inland and the coastline ways.

3. I need to match the coastline ways that form islands with place=island or place=islet. Matching on the way IDs used seems a good option here. I haven’t done this yet though, and it may not be necessary – not sure.

I have got as far as having a working frontend for my atlas. My data preparation application got to the point of cutting off territorial waters. Or at least, it worked for Mozambique. It was when I wanted to test other countries that I discovered the throttling by Overpass, which seems to have recently become more limiting. Hence I’ve looked at downloading planet data and tools to extract the data I need this way.

On creating my own node IDs, it looks like using 5 or 6 decimal places should be enough to give unique locations and still fit in a uint64. But I would really like an option within the toolkit to return the ID even for currently anonymous nodes. How can I track when this becomes available?

Anyway, that’s the background and very happy for any help, suggestions and advice.

Best, Chris

That’s a fantastic project, but also an extremely challenging one. There are lots of apps for real-time, street-level navigation for the visually impaired, but I’m unaware of any tool that takes a globe-level approach. In general, screen readers ignore any kind of map content. So there is definitely a market for the “audio atlas” application you’re proposing.

My first advice: Don’t build a map viewer. Even if you limit it just to displaying countries, it is still a Herculean task. Instead, I would develop this as a web app, with MapLibre as the front end, and pmtiles or Shortbread-based vector tiles. This makes it easy to customize the visual style (high contrast, low-color, etc.) and allow users to choose different styles on the fly.

I would pre-generate the content that the screen reader will narrate. For this, I would use the familiar tiling raster employed by the map viewer: One single tile at zoom 0, a 2 by 2 grid at zoom 1, 4 by 4 at zoom 2, and so on. Basically, you’d be building a tile renderer, except instead of churning out vector tiles, it generates simple text files: That’s the content that the screen reader narrates each time a new tile comes into focus as the user pans and zooms.

Your application can then perform the spatial analysis offline. For this, you can query a worldwide GOL (contains and within queries are useful here). The tile-based approach works especially well here, since this is how GOLs organize their data internally. There is an internal class in libgeodesk that transforms tile coordinates (zoom/column/row) into bounding boxes, I can move that into the public API if that helps.

This would be a great use case for generative AI. You can take the raw spatial analysis results for each tile and feed them into the API of Claude or ChatGPT, and have it generate the narration text, so the screen reader won’t sound boring and robotic. You could even have the AI weave in tidbits about the various places, LLMs are great at that. This board teems with AI enthusiasts (who don’t always find an outlet here), I am sure they would love to help with crafting the appropriate prompts for this.

The end result is a collection of static files that the browser fetches and feeds to the screen reader. A few hundred lines of JavaScript for the basic parts of the front-end app.

So the advice boils down to this: Focus on the “gold,” the distinctive parts of your app: extracting the relevant features (in the context of each tile) for the narration. Reuse existing components for everything else.

Happy to help you flesh out the specifics regarding queries. I’ll also think about who else could help out with MapLibre integration and AI (if you want to venture down that path).

Hi, thanks hugely for your post though it blows my mind a bit. You’ve introduced about 6 technologies I know almost nothing about.

I have so many questions I’m not sure where to start.

Does your approach use server/browser based architecture or is it still all local to the user’s device? Performance is a key requirement. About the first thing a friend did with a prototype I gave him was to hit the right cursor key about 10 times in rapid succession and he moaned that he had to wait for the application to catch up. I’d be concerned that introducing the internet into the mix will reduce how nimbly users will be able to move around the globe.

I also know nothing about writing my application this way and I don’t have a server!

For the same reason of performance, all the vector tiles will need to be preprocessed. Querying a GOL file will be too slow, I believe, but my experience is limited on this.

I also know almost nothing about writing web applications both from my perspective as a blind programmer where I need screen reader friendly tools and making them accessible to screen readers. I do have people I can talk to though about this.

Getting text content from LLMs is a fun idea. At one level it shouldn’t be needed. An atlas is very much about facts. Rome is in Italy; Italy borders Switzerland. But what shape is Italy? That is something perhaps AI can help with.

All the info for the screen reader though needs to be displayed on the screen at some point using controls that screen readers can get their teeth into to extract the text. Self-voicing apps generally aren’t the best and users typically prefer their own screen reader voice set to the speed they like.

Not quite sure where to go from here. In one way I’m happy with the tools I’ve collected so far but I know I’m in the foothills of the mountain still to climb. Making use of the new ideas you suggest sounds sensible but there is so much I don’t know.

Yes, I know it can feel overwhelming, but what I had in mind is really a quite simple design. You won’t have an “application” in the traditional sense. Your core user interface consists of a single web page, which displays a web map and some basic controls (These are provided by a library like MapLibre). Your app consists of an event handler that gets called whenever the user zooms or scrolls the map. Whenever the tile at the center of the map changes, the handler fetches a corresponding text file that contains the narration. It then puts this text in a div element, maybe at the footer of the page, though the location should not matter. The screen reader should be able to notice this change automatically and read out the new text. You can also use aria tags to guide the screen reader, such as marking the div element with aria-live="assertive".
That’s it, that’s the heart of the front end!

The “meat” of your application is an offline process that pre-generates the narration text files for each tile. Once it’s done, you’ll upload these files to your server or to a CDN. Note that you don’t need to run an actual server process. You just have a bunch of static files. If you go as far as zoom level 10, you will have less than 2 million tiles total, and 95% of them will be empty (oceans and deserts). My rough estimate is that you will have about 25 megabytes of content. You can even host this on GitHub Pages for free.
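The tile count quoted above follows from the pyramid layout: zoom level z has 4^z tiles. A quick check in Python (the function name is just for this example, and the 5% share of non-empty tiles is the rough estimate from the post, not a measured figure):

```python
def tiles_up_to(max_zoom):
    """Total number of tiles from zoom 0 through max_zoom (4**z tiles per level)."""
    return sum(4 ** z for z in range(max_zoom + 1))

total = tiles_up_to(10)
print(total)               # 1398101, i.e. under 2 million
print(round(total * 0.05)) # tiles with content if 95% are empty (rough estimate)
```

Each additional zoom level roughly quadruples the total, so the choice of maximum zoom dominates how much narration content has to be generated.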

As far as performance, these map renderer stacks are highly optimized, with hardware acceleration, cache management and support for http 2 and 3. A custom desktop app might be able to beat this, but it would be very, very hard. The slowness your friend noticed is likely due to the poor keyboard defaults, which only scroll a small chunk of the map for each keystroke. You can increase the stride, maybe scroll one-third of the screen at a time. Also, disable animations, which purposely slow the transition. With these customizations, the map transitions should be instantaneous.

The key to this overall design is the tiling structure of map viewers. There is a great tool to help you understand this: the map viewer on the Geofabrik web site. With a sighted person as your “copilot,” you can get an overview of this concept. There is an option under the top-right button, at the bottom of the menu, called “tile grid”. When checked, the map is overlaid with the tile raster for the current zoom level. For each tile, it will show the tile number, which consists of zoom / column / row. These 3 numbers identify each tile, and also form the directory structure that holds the tile files. For example, 4/8/5 is the tile that covers Central Europe. That’s zoom level 4, the 9th column from the left, 6th row from the top (the numbers are zero-based). The map viewer fetches these images (or vector graphics files) from a tile server and stitches them together to show the actual map. Depending on the window size, there are typically about 20 tiles in view at any given time. So whenever 4/8/5 is at the center of the map, your front-end app downloads a text file named 4/8/5.txt and presents it to the screen reader, which may then say “Germany; bordered by France to the west, Poland to the east. Denmark and the North Sea lie to the North, etc.”

One thing to keep in mind: Unlike map coordinates, the row numbers of tiles increase as you move south, just like screen coordinates in graphics programming.
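The zoom / column / row numbering can be computed with the standard Web Mercator tile formulas. The sketch below is independent of any GeoDesk class (the function names are made up for this example) and shows how a lon/lat maps to a tile, what bounding box a tile covers, and where its narration file would live under the naming scheme described above.

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return (column, row) of the tile containing a lon/lat at a zoom level."""
    n = 2 ** zoom  # tiles per axis
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or extreme latitudes stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def tile_bounds(zoom, x, y):
    """Return (west, south, east, north) in degrees for a tile."""
    n = 2 ** zoom
    def lat_of(row):
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))
    west = x / n * 360.0 - 180.0
    east = (x + 1) / n * 360.0 - 180.0
    # Row numbers grow southward, so row y is the north edge, y + 1 the south edge.
    return west, lat_of(y + 1), east, lat_of(y)

def narration_path(zoom, x, y):
    """Relative path of the narration text file for a tile."""
    return f"{zoom}/{x}/{y}.txt"

col, row = lonlat_to_tile(10.0, 51.0, 4)   # a point in Germany
print(narration_path(4, col, row))         # 4/8/5.txt
print(tile_bounds(4, col, row))
```

The bounding box from `tile_bounds` is what an offline process would pass to its spatial queries when generating the text for each tile.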

Here is the link for the tile-grid map.

What are you trying to do? Do you really need OSM XML? Either way, you could try converting to it from other formats, unless you need the OSM metadata.
In general:

  1. https://osm-boundaries.com
  2. Do you need lines, or polygons?