Extracting cities list from a bounding polygon

Hello. I’m new to OSM. I need a multi-lingual world cities dataset with countries and regions. I searched on the Internet and found nothing suitable. However OSM seem to contain enough data.

I downloaded full OSM data from here in pbf format. Then I used osmium-tool to estract cities ($ osmium tags-filter -O -o test.xml planet-230206.osm.pbf n/place=city). It worked but the output seem to contain around 11000 nodes. Is that normal? Also the accepted answer to this question tells that cities may be modelled not only as nodes. Does that meen that I missed some cities with my method of extracting them?

Another question that I have is how to get the country and the region of a city? Some cities contain tag “is_in:country” or “is_in:country_code”, but some don’t. It seems that the only way to actually check the country and the region of any level of any city is to first extract bounding polygon and then check if the city is in that polygon or not. I found this program that extracts polygons pretty well. Then I have 2 options - to extract the map contained in each polygon from the map and then extract cities from it or to first extract all the cities and then for each city find the polygon that containt that city. First option seems to be very slow for a large amount of polygons. So I need to check if a polygon contains a city. What is the easiest way to do that? I can even make my own script that performs that check. But I need to know some details about bounding polygons. How does the connection of 2 nodes work? Is it a straight line on the globe as a sphere, or is it a straight line in flat 2d lat-lon coorginates? If the second case it true, then how are the connections from <180 to >-180 lon modelled?

2 Likes

Hi.

Not to refrain you from using OSM data but, did you look at WikiData ?

Its structure is usually easier to parse. OSM is geographical before all, and that can make your project a little complicated.

Maybe someone can come with an actual solution ! :wink:

3 Likes

If you’re proficient in Java, you could use the GeoDesk library for this task.

First, create a GOL file from your planet file (instructions).

Basic code below retrieves all cities in the world, then queries the administrative area in which each city is located. In OSM, there are different admin_levels: 2 = country, 4 = state/region, 6 = county (or regional equivalent), with other potential levels depending on the country’s administrative structure. The area of the city itself is usually level 8 (In cases where the city is also a county, like San Francisco, or a German city-state like Bremen, the city area uses level 6 or 4).

If you want the actual polygons, use toGeometry(), which gives you a Geometry object that you can then process further using the Java Topology Suite.

// Open the library (use the GOL utility to create it)
FeatureLibrary world = new FeatureLibrary("world.gol");

// retrieve all cities
for(Feature city: world.select("[place=city]")
{
    // retrieve the country area that contains this city
    Feature country = world
        .select("a[boundandary=administrative][admin_level=2]")
        .select(Filters.containsXY(city.x(), city.y()));

    // Display the name of the city and its country
    System.out.format("%s in %s\n", 
        city.stringValue("name"), 
        country.stringValue("name"));

    Geometry countryPolygon = country.toGeometry();
        // a JTS Geometry (Polygon)
}

See also the tutorial and examples on GitHub.

11,000 places identified as “city” worldwide seems correct.

This is true for all place features in general, but the vast majority of cities have a node where the label appears. When in doubt, you could use * in the city query above (instead of n for nodes), but you might get duplicates in some cases where a city is mapped as both node and area.

  • In OSM, lineal features are represented as ways, which connect two or more nodes. Coordinates are stored as degrees latitude/longitude, with no particular projection.

  • In a GeoDesk GOL file, OSM features are stored in Mercator projection (but you can extract WGS-84 lon/lat coordinates as well, or use a library like Proj4J to reproject into other coordinate systems).

  • Any feature that spans the Antimeridian (+/- 180 degrees longitude) is always cut into multiple individual elements (for example, the area that represents Fiji is split into an eastern and western half).

2 Likes

You can check taginfo. ~11000 nodes seems alright.

Indeed there are nearly as much mapped as relations (and some ways).

It’s hard to guess how much are duplicates, probably a lot of them.

1 Like

Thanks. Could you provide a link please? I heard of some data on wiki represented as tables, but they are often in differet formats and shold be hard to parse.

Indeed there are nearly as much mapped as relations (and some ways).

It’s hard to guess how much are duplicates, probably a lot of them.

Does that mean that there are some cities that are not in the list of nodes that I’ve got? How can I get them then? I tried using nwr/place=city instead of n/place=city but that just gave me a bunch of empty nodes (without any tags).

Thanks a lot for such a detailed answer! I’m more familiar with Python than Java. However I also used Java for a bit so probably I’ll try your code because it looks really cleen and promising.

When in doubt, you could use * in the city query above (instead of n for nodes), but you might get duplicates in some cases where a city is mapped as both node and area.

I tried not * but nwr instead of n with osmium. However it just gave me lots of nodes without tags in addition to the nodes of cities. They look like nodes of boundaries, but without any information on how should I connect them. Maybe xml is a wrong format for that?

In OSM, lineal features are represented as ways, which connect two or more nodes. Coordinates are stored as degrees latitude/longitude, with no particular projection.

Probably you have misunderstood my question. I know how the coords of nodes are stored. That has been described in the README of the program that I used to extract polygons. My question is more about how to connect those nodes. They should be connected with a straight line I guess. But it may be a straight line on the real spherical earth or a straight line on a projection. So my queston is what kind af straight line it actually is?

1 Like

I’m no search engine, but here is a related example of how to query wikidata.

Regards.

Most in OSM is edited using tools showing geographical data as pseudo-mercator. Thus I’d say first case in your question.

Natural Earth » 1:110m Cultural Vectors - Free vector and raster map data at 1:10m, 1:50m, and 1:110m scales may be also of interest if only larger settlements matter, quality is likely much greater than Wikidata.

And may be easier to use if not interested in details.

2 Likes

These untagged nodes form the geometry of places represented as ways. OSM is different from other geospatial formats in that it represents the sub-components of features as individual elements, which are then referenced. This allows two adjacent areas to share nodes. As a downside, more processing is required to resolve the references and create the actual geometries.

A more suitable output format for your use case may be GeoJSON, which contains the assembled geometry of each feature.

Ideally, the line between two nodes should be a geodetic curve that represents the shortest path on the globe. In the simplified Mercator projected commonly used by geospatial frameworks (including GeoDesk), this turns into a straight line on the Euclidean plane. Unless two nodes are far apart (tens of kilometers), the resulting inaccuracy is negligible.

(Here’s how you can observe this effect: On Google Maps – which also uses a Mercator projection – use the “measure distance” tool to draw a line between two points. If the distance is sufficiently long, you will notice a curve that extends towards the poles, instead of a simple straight line).