Producing a validated OSM dataset

To find out what was actually in it, I had a look. I downloaded https://daylight-map-distribution.s3.us-west-1.amazonaws.com/release/v1.21/planet-v1.21.osm.pbf and then used GeoFabrik’s “rutland.poly” to split out the equivalent of the small English county of Rutland, for which I also downloaded the OSM extract from GeoFabrik.
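For anyone wanting to repeat the split, it can be sketched roughly as below. This is only a sketch: the helper name `extract_region` and the output file name in the example call are made up for illustration, and it relies on osmium's default extract behaviour, which may well differ from whatever GeoFabrik does.

```shell
# Hypothetical helper: cut a region out of a larger file using a .poly boundary.
#   $1 = polygon file, $2 = input file, $3 = output file
extract_region() {
    osmium extract -p "$1" "$2" -o "$3"
}

# Example call (the output name is an assumption, not from the original post):
# extract_region rutland.poly planet-v1.21.osm.pbf rutland-daylight.osm.pbf
```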

The Facebook dataset is equivalent to GeoFabrik’s internal data (it includes changeset IDs and user IDs), so I had to log in to GeoFabrik to download the matching internal extract.

I can easily convert each to “opl” format with e.g.:

osmium cat rutland-221216-internal.osm.pbf -o rutland-221216-internal.osm.opl

which makes comparing files easier.
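As a rough illustration of why OPL helps: each object sits on one line, and the first space-separated field is the type letter plus ID (e.g. `n123`, `w456`). A minimal sketch for listing objects present in one file but not the other could look like this. The function names are hypothetical, and it compares IDs only, ignoring versions.

```shell
# Print the sorted, unique object IDs (first field) of an OPL file.
opl_ids() {
    cut -d' ' -f1 "$1" | LC_ALL=C sort -u
}

# Print IDs that occur in the first OPL file but not in the second.
opl_only_in() {
    a=$(mktemp)
    b=$(mktemp)
    opl_ids "$1" > "$a"
    opl_ids "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example call (the second file name is an assumption):
# opl_only_in rutland-221216-internal.osm.opl rutland-daylight.osm.opl
```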

Despite being dated 16th December, the Facebook data has nothing in it from after 11th November.
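One way to check the cut-off date is to pull out the newest object timestamp from the OPL file. In OPL the timestamp is the field starting with a lower-case `t`, in ISO format, so a plain text sort puts the latest one last. The function name below is hypothetical and this is only a sketch.

```shell
# Print the most recent object timestamp found in an OPL file.
opl_latest_timestamp() {
    awk '{
        for (i = 2; i <= NF; i++)
            if (substr($i, 1, 1) == "t" && length($i) > 1)
                print substr($i, 2)
    }' "$1" | LC_ALL=C sort | tail -n 1
}

# Example call (file name is an assumption):
# opl_latest_timestamp rutland-daylight.osm.opl
```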

Objects missing from the Facebook data seem to include:

No objects were obviously “added” to the Facebook data, although some data deleted from OSM after the Facebook cut-off date was still present - the Facebook data is old enough that there isn’t an exactly matching date for data from GeoFabrik**. In addition, one way differed between the two extracts: https://www.openstreetmap.org/way/311666071/history (which is not in Rutland, but is part of a route relation that partially is)***.

Object tags were written in a different order in the resultant .opl files and I’ve not done a strict comparison, but a quick glance suggested that all tags were intact.
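A stricter tag comparison would need the tag order normalised first. A minimal sketch, assuming the usual OPL layout where tags are the comma-separated field starting with an upper-case `T`: sort the tags within each line, then diff the results. The function name is hypothetical, and the simple insertion sort is only meant for small extracts.

```shell
# Rewrite an OPL file to stdout with the tags on each line sorted.
opl_sort_tags() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if (substr($i, 1, 1) == "T" && length($i) > 1) {
                n = split(substr($i, 2), tags, ",")
                # Insertion sort, forcing string comparison.
                for (a = 2; a <= n; a++) {
                    v = tags[a]
                    b = a - 1
                    while (b >= 1 && (tags[b] "") > (v "")) {
                        tags[b + 1] = tags[b]
                        b--
                    }
                    tags[b + 1] = v
                }
                s = "T" tags[1]
                for (a = 2; a <= n; a++) s = s "," tags[a]
                $i = s
            }
        }
        print
    }' "$1"
}

# Example call (file names are assumptions):
# opl_sort_tags rutland-daylight.osm.opl > rutland-daylight.sorted.opl
```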

To summarise, in this example, nothing was taken away apart from admin boundaries and things that really don’t matter, and nothing was added. Presumably if someone had added a fake city “(badword)sville” it would have been spotted and removed from OSM or the extract before release, but that wasn’t an issue here. Rutland is a very simple example with no complicated non-English names, so it clearly is only the simplest possible test.

** It would of course be possible to download an OSM planet file for the correct date and use that for the comparison, but I haven’t bothered doing that.
*** GeoFabrik’s data selection likely has more to it than my naive “osmium extract”, which might explain this too.
