I’m the author of Layercake, mentioned above by @Robot8A. Just wanted to chime in here with my perspective.
There are two different reasons that I think distributing OSM data in GeoParquet format is useful. Both were factors that motivated me to build Layercake, but they are somewhat separate.
“Cloud-native” access
As mentioned above by @Spatialia and others, an application can read only the data covering a given bbox from a remote GeoParquet file, using nothing more than HTTP Range Requests (which are supported by most static file servers and by object storage providers like AWS S3). This means there’s less need to pre-partition the data. Geofabrik’s prebuilt OSM PBF and Shapefile extracts are an important (and generous) service to the OSM community, but with GeoParquet data, applications can essentially create their own extracts on demand. Someone still has to pay for the bandwidth of course, but the compute ends up happening on the user’s device, and users can download any region they want rather than being constrained to the ones offered by the service provider.
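To make the mechanism a bit more concrete, here is a rough Python sketch of the selection step a reader performs after it has fetched the file's footer metadata. It is simplified on purpose: `RowGroup`, `intersects` and `range_headers` are hypothetical names of mine, and I'm treating each row group as one contiguous byte range with a single bbox, whereas a real Parquet reader works with per-column chunks and their statistics.

```python
from typing import NamedTuple


class RowGroup(NamedTuple):
    """Hypothetical, simplified summary of one row group from the file footer."""
    offset: int   # byte offset of the row group within the file
    length: int   # size of the row group in bytes
    bbox: tuple   # (xmin, ymin, xmax, ymax) covering the geometries it holds


def intersects(a, b):
    """Return True if two (xmin, ymin, xmax, ymax) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def range_headers(row_groups, query_bbox):
    """Build one HTTP Range header per row group that overlaps the query bbox."""
    return [
        # HTTP byte ranges are inclusive at both ends, hence the "- 1".
        {"Range": f"bytes={rg.offset}-{rg.offset + rg.length - 1}"}
        for rg in row_groups
        if intersects(rg.bbox, query_bbox)
    ]


# Made-up metadata for a file with two row groups.
groups = [
    RowGroup(offset=4, length=100, bbox=(0, 0, 10, 10)),
    RowGroup(offset=104, length=50, bbox=(20, 20, 30, 30)),
]
headers = range_headers(groups, query_bbox=(5, 5, 6, 6))
```

Only the first row group overlaps the query, so only its bytes would be requested. This is also why it helps when the rows in a file are spatially sorted: nearby features end up in the same row groups, and fewer groups match any given bbox.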
When Overture first launched their data distribution, there was a lot of excitement online about being able to query and explore an enormous dataset with relative ease, and I think this was mainly driven by the choice of file format and the fact that it permits this type of on-demand access.
Familiar and interoperable data model
I spend a lot of time teaching people from traditional GIS backgrounds about OpenStreetMap and how they can use the data. A really common question is “sounds neat, where do I download the shapefiles?”.
Implicit in this question are several assumptions:
- that geospatial data is distributed in tabular file formats with rigid schemas and a fixed set of known columns
- that different kinds of data (buildings, roads, waterways, etc) will be stored in separate files or layers, each with their own schema
- and that records in a geospatial dataset will each have a single OGC geometry (Point, LineString, Polygon) associated with them.
OSM’s data model surprises these potential users in a couple of ways: by having effectively unlimited columns (“Any Tags You Like”, or ATYL), and by using Nodes, Ways and Relations as the fundamental data types instead of Points, LineStrings, and Polygons.
Translating OSM data from its native form into a tabular data model is tricky and requires a pretty deep understanding of OSM (how to assemble multipolygon relations into Polygon or MultiPolygon geometries, how to determine whether a closed way represents a LineString or a Polygon based on its tags, etc). So publishing data that’s been preprocessed into this form and stratified into thematic layers for common use cases (buildings, highways, boundaries, etc) can help make OSM data more accessible to the broader GIS community.
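As an illustration of the closed-way problem, here is a minimal Python sketch of the kind of rule a converter has to apply. To be clear, this is not Layercake's actual logic: `AREA_KEYS` is a deliberately tiny, made-up list, and the function names are mine. Real rule sets are much longer and usually differ per layer.

```python
# Illustrative only: keys whose presence on a closed way is taken to imply an area.
AREA_KEYS = {"building", "landuse", "leisure", "amenity"}


def is_closed(node_refs):
    """A way is closed if it ends on the node it started from."""
    return len(node_refs) >= 4 and node_refs[0] == node_refs[-1]


def geometry_type(node_refs, tags):
    """Guess whether a way should become a LineString or a Polygon."""
    if not is_closed(node_refs):
        return "LineString"
    # An explicit area=* tag overrides whatever the other tags suggest.
    if tags.get("area") == "no":
        return "LineString"
    if tags.get("area") == "yes":
        return "Polygon"
    if any(key in tags for key in AREA_KEYS):
        return "Polygon"
    # Fallback: treat the way as linear (e.g. a circular road).
    return "LineString"


ring = [1, 2, 3, 1]  # node IDs of a closed way
building = geometry_type(ring, {"building": "yes"})
loop_road = geometry_type(ring, {"highway": "residential"})
plaza = geometry_type(ring, {"highway": "pedestrian", "area": "yes"})
```

The same ring of nodes comes out as a Polygon for the building and the plaza, but as a LineString for the residential loop road, which is exactly the kind of tag-dependent behaviour that surprises people coming from other GIS data.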
It’s worth noting that this conversion process is inherently opinionated. To convert OSM data into GeoParquet (or similar) format, you need to make decisions about:
- which OSM elements to include in each layer
- which tags are “important” for that kind of element and will get included as columns
- what geometry types are allowed for each kind of feature, and how to disambiguate closed ways (LineString or Polygon?) for each kind
- whether to harmonize synonymous tags, preprocess complex tag values (opening_hours, conditional syntax), parse numeric tags (population, ele, maxspeed) to ints or floats (and how to handle units), etc
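To give a sense of what the numeric-tag decision looks like in practice, here is a small Python sketch for `maxspeed`. It is one possible interpretation rather than a description of what Layercake does: the function name `parse_maxspeed_kmh` is hypothetical, I'm assuming a bare number means km/h, and anything that doesn't look like a number with an optional unit (`none`, `walk`, `DE:urban`, ...) simply becomes `None`.

```python
import re

# A number, optionally followed by a unit; a bare number is assumed to be km/h.
_MAXSPEED = re.compile(r"^(\d+(?:\.\d+)?)(?: ?(mph|knots|km/h))?$")

# Conversion factors to km/h for each unit accepted above.
KMH_PER_UNIT = {None: 1.0, "km/h": 1.0, "mph": 1.609344, "knots": 1.852}


def parse_maxspeed_kmh(value):
    """Return the speed in km/h as a float, or None if the value isn't numeric."""
    match = _MAXSPEED.match(value.strip())
    if match is None:
        return None  # e.g. "none", "walk", "signals", "DE:urban"
    number, unit = match.groups()
    return float(number) * KMH_PER_UNIT[unit]


metric = parse_maxspeed_kmh("50")
imperial = parse_maxspeed_kmh("30 mph")
symbolic = parse_maxspeed_kmh("DE:urban")
```

Even in this tiny example there are judgment calls: whether to normalise everything to one unit or keep the unit in a separate column, and whether to drop values that can't be parsed or preserve them somewhere as raw strings.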
Layercake tries to make sensible choices and create data that’s suitable for a broad range of uses, but ultimately it’s subjective, and there’s definitely room for other offerings for OSM data in tabular file formats that make different trade-offs or apply other interpretations of OSM’s tags.
TL;DR: Layercake’s goal is to preprocess OSM data into separate thematic layers in a tabular format and with normal OGC geometries for each feature. GeoParquet is an acceptable output format for this and as a bonus permits efficient remote access, but in my opinion the main value that Layercake offers is the (opinionated) conversion from one data model to another, not the specific output file format.