OSM data in GeoParquet format

There has been some discussion about OSM data in GeoParquet format, most recently in “The last hurrah for OSM” (post #44 by SimonPoole). Apparently there are many different ways to stuff OSM data into some form of Parquet; osm-parquetizer does it with raw data, ohsome-planet builds geometries even for historical data, and cadencemaps tries to emulate Overture’s POI schema.

As the person running the Geofabrik download server I have been wondering if I could add some simple form of GeoParquet files to the download options there and I wonder what people would find useful. The simplest approach for me would be to take the existing Geofabrik shape files and convert them to GeoParquet. This is something that can be done for example with the “geoparquet-io” tool or with recent versions of plain old GDAL (“ogr2ogr”).
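For reference, the ogr2ogr route might look roughly like the sketch below. It assumes a GDAL build that includes the Parquet driver (GDAL 3.5 or later with Arrow/Parquet support); the file names and the two helper functions are hypothetical, and the conversion only runs if ogr2ogr is found on the PATH.

```python
import shutil
import subprocess


def build_ogr2ogr_command(src_shapefile, dst_parquet, compression="ZSTD"):
    """Return the ogr2ogr argument list for a shapefile-to-GeoParquet conversion."""
    return [
        "ogr2ogr",
        "-f", "Parquet",                        # GDAL's (Geo)Parquet driver
        "-lco", f"COMPRESSION={compression}",   # layer creation option of that driver
        dst_parquet,                            # ogr2ogr takes the destination first
        src_shapefile,
    ]


def convert(src_shapefile, dst_parquet):
    """Run the conversion if ogr2ogr is installed; return True if it ran."""
    if shutil.which("ogr2ogr") is None:
        return False
    subprocess.run(build_ogr2ogr_command(src_shapefile, dst_parquet), check=True)
    return True


# Hypothetical file names, for illustration only.
cmd = build_ogr2ogr_command("gis_osm_roads_free_1.shp", "roads.parquet")
print(" ".join(cmd))
```

Each shapefile layer would become its own Parquet file this way, since the command converts one source layer at a time.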

I have tested this with a somewhat large shapefile for the South-East region of Brazil. The raw osm.pbf file is about 800 MB; the free Geofabrik shape file is a 2 GB zip file (4.8 GB uncompressed). I have converted that to GeoParquet with gpio and with ogr2ogr, both yielding about 2 GB of internally-compressed parquet data. (We’d probably still have to put them in a zip for downloading since parquet does not seem to support multiple layers.)
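A minimal sketch of that bundling step, assuming one Parquet file per layer. The function name, layer names, and archive name are made up for illustration, and the demo uses placeholder bytes rather than real GeoParquet data. The archive is written without compression because the Parquet files are already compressed internally.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_layers(parquet_files, zip_path):
    """Pack per-layer Parquet files into one zip archive without recompressing."""
    # ZIP_STORED: Parquet is already compressed internally, so deflating
    # it again would cost CPU time for little or no size reduction.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in map(Path, parquet_files):
            archive.write(path, arcname=path.name)
    return zip_path


# Demo with placeholder files standing in for real per-layer output.
with tempfile.TemporaryDirectory() as tmp:
    layers = []
    for name in ("roads.parquet", "buildings.parquet"):
        layer = Path(tmp) / name
        layer.write_bytes(b"PAR1 placeholder PAR1")
        layers.append(layer)
    archive_path = bundle_layers(layers, Path(tmp) / "sudeste-free.parquet.zip")
    with zipfile.ZipFile(archive_path) as archive:
        names = sorted(archive.namelist())

print(names)
```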

Perhaps someone can tell me if these would be considered useful - we could then make them available on the download server for all regions where we currently have shape files.

(Geofabrik’s download server has shape files only for smaller regions - smaller countries have a shape file for the full country but larger countries are split into regions, and where those regions become too large sometimes shape files are only there for even smaller entities like “Northern California” and “Southern California”. Would that be acceptable to people working with GeoParquet files?)


First thank you for Geofabrik and for considering another download format. That sounds awesome and might be useful for many data analytics groups as is, though I’d probably still use one of the other extracts you offer myself - for downloaded data, the tools I use handle them better.

Since I originally raised parquet in that thread, I’ll say I really like using parquet when it’s:

  1. Hive partitioned by geography so that I could point a piece of software at the whole dataset and have it be able to pull out only the regions of interest itself. So a conversion that keeps those extracts structurally related would be important for some users.
  2. Keeping the organization by theme you already have so that you can query on columns to extract subsets
  3. Hosted in-place rather than for download so that software tools can subset them live to download only the pieces they need. This isn’t a dealbreaker, but is really helpful and I think is a big reason for people using Overture - they can easily perform on the fly geographic, thematic, and filtered (e.g. only the rows where highway=tertiary) extracts with desktop tools that handle parquet data. I’ve used large parquet datasets downloaded to my device though, and they’re still useful, but a little less useful than if someone had the bandwidth to host the raw access live (which I know could be a big ask).
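To make the first two points concrete, here is a small sketch of a Hive-style layout, where partition values live in `key=value` directory names so that a reader can skip whole files without opening them. The `theme` and `region` keys, the paths, and the helper names are assumptions for illustration, not an existing Geofabrik layout.

```python
from pathlib import PurePosixPath


def partition_path(root, theme, region, filename="part-0.parquet"):
    """Build a Hive-style path such as root/theme=roads/region=sudeste/part-0.parquet."""
    return str(PurePosixPath(root) / f"theme={theme}" / f"region={region}" / filename)


def parse_partitions(path):
    """Extract the key=value directory components of a Hive-style path."""
    parts = PurePosixPath(path).parts[:-1]  # drop the file name
    return dict(p.split("=", 1) for p in parts if "=" in p)


def select_files(paths, **wanted):
    """Keep only files whose partition values match, without opening any file."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in wanted.items())
    ]


paths = [
    partition_path("osm", "roads", "sudeste"),
    partition_path("osm", "roads", "norte"),
    partition_path("osm", "buildings", "sudeste"),
]
selected = select_files(paths, theme="roads", region="sudeste")
print(selected)
```

Query engines that understand Hive partitioning apply the same kind of path-based filtering before reading any data.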

I don’t know if it has been mentioned yet, but there is Layercake from OSM US.


Nick, I’m kinda loath to make the Geofabrik download server into an operational component of people’s applications - I fear that offering the kind of access that you mention would lead to people building stuff like apps or docker containers that are then used by third parties (worst case, AI scrapers) without any consideration for the fact that someone has to pay for servers and bandwidth! But I recognize that “cloud native” is part of the appeal (where “cloud” means “I’d rather use someone else’s server than my own”).

Would the OSM-US “Layercake” thingie mentioned by Hector already satisfy your use cases? (I had read about that a while ago but then forgot again, hence thanks Hector for bringing it up.)


Totally understand and I didn’t think you would want to or should need to do that either. I was more noting that it’s a big part of what makes a geoparquet file useful in my book. So without that, I wasn’t sure I’d use it. Though even downloaded, they can still be nice, but I wanted to share that as one potential perspective on whether or not downloadable versions would help people.

Perhaps what I was trying to do last week can help decide whether this is worth implementing at Geofabrik:

I was trying to obtain all ways in a big country that are tagged with postal_code. Overpass Turbo wasn’t working for that. So, my idea was to test GeoParquet, since it’s supposed to be fairly easy. However, neither Overture nor OSM US’s Layercake has a postal_code column (they only have thematic extracts with the most “important” tags).

Anyway, I had to download the full country extract from Geofabrik and run a simple osmium script.

If I had to run this for the whole world, I couldn’t, because I don’t have enough disk space on my machine. GeoParquet would be very handy for that (all the processing happens somewhere else, not locally, and I download just what I need).

Fred wrote:

I have been wondering if I could add some simple form of GeoParquet files to the download options there and I wonder what people would find useful.

I very much support that.

With Cadence Maps, we have gained experience in how to create a places layer that is compatible with Overture Maps Places categories/taxonomy—but is “Pure-OSM,” “community-first,” open, and documented.

Nick, I’m kinda loath to make the Geofabrik download server into an operational component of people’s applications

I understand this.

Just FYI: I am currently working on implementing just such a web service under the working title “Cadence Maps Enhanced.” This may initially be a prototype with registration.

Fred, can you provide me/us with a sample GeoParquet file?

I think it’s important to separate two questions: whether OSM data in GeoParquet format is useful in itself, or whether it is only useful if someone hosts it over HTTP without any container format (e.g. zip files) and allows range requests.

The second is very different from what Frederik is asking about.


Have you heard the good news about Postpass?[1]

Which country do you want? What exactly do you want? Put this into Overpass Turbo and it’ll give you all lines with the postal_code tag. It took about 20 seconds for me for Italy.

{{data:sql,server=https://postpass.geofabrik.de/api/0.2/}}

SELECT tags, geom
FROM postpass_line
WHERE tags?'postal_code'
AND geom && {{bbox}}

  1. I’m a member of the Postpass Evangelism Strike Force!


pnorman wrote:

The second is very different from what Frederik is asking about.

Thanks for the clarification. I’m aware of that, but you made me think.

It’s still important to have a use case. And that use case is separate static file hosting (the pipeline behind such a service would take Geofabrik’s “raw” GeoParquet files as input and convert them to read-optimized GeoParquet files).

Now regarding downloadable, “raw” “zipped GeoParquets”, Fred asked:

Perhaps someone can tell me if these would be considered useful

Above all, I’d prefer having “zipped GeoParquets” (also) at the highest level - like the 8 continents/subcontinents of Geofabrik.

And looking at specific “raw” GeoParquet files, I assume that they contain “Parquet Column Statistics” (Zone Maps). It would probably also be convenient if bbox columns existed for each row, along with GeoParquet metadata (geometry_types, encoding declaration, CRS) - although a reader could derive these.
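A rough sketch of why those statistics matter: if each row carries bbox columns, the per-row-group min/max statistics give an extent for every row group, and a reader can skip row groups that cannot contain matching rows. The extents and helper names below are invented for illustration; a real reader would take the extents from the Parquet footer metadata.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a, b):
    """True if two bounding boxes overlap or touch."""
    return (a.xmin <= b.xmax and a.xmax >= b.xmin
            and a.ymin <= b.ymax and a.ymax >= b.ymin)


def prune_row_groups(row_group_extents, query):
    """Return the indexes of row groups whose extent overlaps the query bbox."""
    return [i for i, extent in enumerate(row_group_extents)
            if intersects(extent, query)]


# Invented extents: per row group, min(xmin), min(ymin), max(xmax), max(ymax).
extents = [
    BBox(-48.0, -25.0, -44.0, -22.0),
    BBox(-44.0, -24.0, -40.0, -20.0),
    BBox(-52.0, -20.0, -49.0, -17.0),
]
query = BBox(-43.5, -23.1, -43.0, -22.7)  # a small window to look up
to_read = prune_row_groups(extents, query)
print(to_read)
```

This only pays off when rows are spatially sorted before writing, so that each row group covers a compact area.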

Open Planet Data offers the entire planet as a GeoParquet download (~ 150 GB, updated daily).


I was not aware of that. But I can’t find any documentation: Which layered GIS schema? Which categories? If it’s “raw” OSM tags, then the advantage compared to .pbf seems rather small.

The Board discussed providing a GeoParquet export at the Milan meeting last year but hadn’t moved on it. This discussion seems like a good trigger for getting a project moving. I have written a brief project proposal which we will discuss at the Board meeting this Thursday. If they like it, I will post it here to be developed further.


I’m the author of Layercake, mentioned above by @Robot8A. Just wanted to chime in here with my perspective.

There are two different reasons that I think distributing OSM data in GeoParquet format is useful. Both were factors that motivated me to build Layercake, but they are somewhat separate.

“Cloud-native” access

As mentioned above by @Spatialia and others, an application can read just the data covering a given bbox from a remote GeoParquet file, just by using HTTP Range Requests (which are supported by most static file servers, and object storage providers like AWS S3). This means there’s less need to pre-partition the data. Geofabrik’s prebuilt OSM PBF and Shapefile extracts are an important (and generous) service to the OSM community, but for GeoParquet data, applications can essentially create their own extracts on demand. Someone still has to pay for the bandwidth of course, but the compute ends up happening on the user’s device, and they can download any region they want to rather than being constrained to the ones offered by the service provider.
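As a sketch of the mechanism: a Parquet file ends with its metadata, followed by a 4-byte little-endian metadata length and the magic bytes `PAR1`, so a client can start by requesting only the last 8 bytes. The helper names are hypothetical; `fetch_tail` assumes a server that honours Range requests and is not called here.

```python
import struct
import urllib.request

PARQUET_MAGIC = b"PAR1"


def footer_length(tail):
    """Given the last 8 bytes of a Parquet file, return the metadata length."""
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not the tail of a Parquet file")
    (length,) = struct.unpack("<I", tail[:4])
    return length


def fetch_tail(url, nbytes=8):
    """Fetch only the last nbytes of a remote file via a suffix Range request."""
    request = urllib.request.Request(url, headers={"Range": f"bytes=-{nbytes}"})
    with urllib.request.urlopen(request) as response:
        if response.status != 206:  # 206 Partial Content means the range was honoured
            raise RuntimeError("server ignored the Range header")
        return response.read()


# Offline demo with a synthetic tail instead of a real download.
fake_tail = struct.pack("<I", 1234) + PARQUET_MAGIC
metadata_size = footer_length(fake_tail)
print(metadata_size)
```

With the metadata length known, a second range request fetches the metadata itself, and further requests fetch only the row groups and columns that a query needs.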

When Overture first launched their data distribution, there was a lot of excitement online about being able to query and explore an enormous dataset with relative ease, and I think this was mainly driven by the choice of file format and the fact that it permits this type of on-demand access.

Familiar and interoperable data model

I spend a lot of time teaching people from traditional GIS backgrounds about OpenStreetMap and how they can use the data. A really common question is “sounds neat, where do I download the shapefiles?”.

Implicit in this question are several assumptions:

  • that geospatial data is distributed in tabular file formats with rigid schemas and a fixed set of known columns
  • that different kinds of data (buildings, roads, waterways, etc) will be stored in separate files or layers, each with their own schema
  • and that records in a geospatial dataset will each have a single OGC geometry (Point, LineString, Polygon) associated with them.

OSM’s data model surprises these potential users in a couple of ways: by having effectively unlimited columns (ATYL), and by using Nodes, Ways and Relations as the fundamental data types instead of Points, LineStrings, and Polygons.

Translating OSM data from its native form into a tabular data model is tricky and requires a pretty deep understanding of OSM (how to assemble multipolygon relations into Polygon or MultiPolygon geometries, how to determine whether a closed way represents a LineString or a Polygon based on its tags, etc). So publishing data that’s been preprocessed into this form and stratified into thematic layers for common use cases (buildings, highways, boundaries, etc) can help make OSM data more accessible to the broader GIS community.
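To give a flavour of the closed-way problem, here is a deliberately simplified sketch. The key list is a small illustrative assumption, not Layercake’s actual rule set; real converters use much longer, per-key rules.

```python
# Illustrative only: keys that usually imply an area when found on a closed way.
AREA_KEYS = {"building", "landuse", "leisure", "amenity"}


def geometry_type(node_refs, tags):
    """Guess whether a way should become a LineString or a Polygon."""
    # A closed way repeats its first node at the end and needs at least
    # three distinct nodes, hence four references.
    closed = len(node_refs) >= 4 and node_refs[0] == node_refs[-1]
    if not closed:
        return "LineString"
    if tags.get("area") == "yes":   # explicit override by the mapper
        return "Polygon"
    if tags.get("area") == "no":
        return "LineString"
    if AREA_KEYS & tags.keys():
        return "Polygon"
    return "LineString"             # e.g. a roundabout tagged highway=*


ring = [1, 2, 3, 4, 1]
print(geometry_type(ring, {"building": "yes"}))
print(geometry_type(ring, {"highway": "residential"}))
print(geometry_type(ring, {"highway": "pedestrian", "area": "yes"}))
```

The same closed ring becomes a Polygon, a LineString, and a Polygon again depending only on its tags.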

It’s worth noting that this conversion process is inherently opinionated. To convert OSM data into GeoParquet (or similar) format, you need to make decisions about:

  • which OSM elements to include in each layer
  • which tags are “important” for that kind of element and will get included as columns
  • what geometry types are allowed for each kind of feature, and how to disambiguate closed ways (LineString or Polygon?) for each kind
  • whether to harmonize synonymous tags, preprocess complex tag values (opening_hours, conditional syntax), parse numeric tags (population, ele, maxspeed) to ints or floats (and how to handle units), etc
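The last point can be sketched with maxspeed, assuming the common convention that a bare number means km/h and that mph values carry an explicit unit. The function name and the decision to return None for non-numeric values such as “walk” or “none” are choices made for this example, one of several reasonable interpretations.

```python
import re

MPH_TO_KMH = 1.609344
_SPEED = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(km/h|mph)?\s*$")


def parse_maxspeed_kmh(value):
    """Parse a maxspeed tag value to km/h, or return None if it is not numeric."""
    match = _SPEED.match(value)
    if match is None:
        return None                 # e.g. "walk", "none", "signals"
    number = float(match.group(1))
    unit = match.group(2) or "km/h"  # a bare number is taken as km/h
    return number * MPH_TO_KMH if unit == "mph" else number


for raw in ("50", "30 mph", "walk"):
    print(raw, "->", parse_maxspeed_kmh(raw))
```

A different converter might keep the original string alongside the parsed number, or store the unit in a separate column.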

Layercake tries to make sensible choices and create data that’s suitable for a broad range of uses, but ultimately it’s subjective, and there’s definitely room for other offerings for OSM data in tabular file formats that make different trade-offs or apply other interpretations of OSM’s tags.


TL;DR: Layercake’s goal is to preprocess OSM data into separate thematic layers in a tabular format and with normal OGC geometries for each feature. GeoParquet is an acceptable output format for this and as a bonus permits efficient remote access, but in my opinion the main value that Layercake offers is the (opinionated) conversion from one data model to another, not the specific output file format.


For anyone looking to convert OSM data into GeoParquet format, I wrote a DuckDB extension that supports reading OSM PBF files with libosmium:

It lets you do things like this:

COPY (
  SELECT id, tags['name'] as name, geometry
  FROM 'example.osm.pbf'
  WHERE kind = 'node'
    AND tags['place'] = 'city'
) TO 'cities.parquet' WITH (FORMAT PARQUET, COMPRESSION ZSTD);

It supports line and area geometries too (including multipolygon relations). Hopefully this will make it easier for people to transform OSM data into GeoParquet and other formats in whatever schema they want.
