Producing a validated OSM dataset

fititnt · December 18, 2022, 8:36am

Hmm. I just realized the following: they’re going for standardization of whatever file format they end up using (and will then push it back as better just because it is documented and has a proper name). Considering their size and corporate-like approach, I suppose they will likely aim for ISO (which is slooooow, but would eventually get done) over IETF/OGC/W3C/OASIS/etc. and a direct, to-the-point specification.

A few weeks ago I commented about schemas just to validate OpenStreetMap XML, but this made me think that merely validating XML is aiming a bit too low: the canonical format is so important that, at this level of use, it should have had at least some RFC (informational ones are less complicated than standards-track ones, but both exist). But now they’re going for it too.

So, as boring as it might seem, I think nudging existing OSM developers toward formalizing the exchange format as an industry standard (something that even less-used formats have) now makes sense.

SimonPoole · December 18, 2022, 8:54am

The problems with producing a validated dataset are many, but IMHO the two most important are

lag: this was nicely illustrated in the Jewtropolis incident, where the data consumers at the end of a long validation pipeline got vandalized data that had long since been fixed in OSM proper,
double effort: OM, Facebook and co can afford to spend money on staff in -choose your fav low pay economy- to fix issues both in the validated dataset and then go back and fix them in the source; it is unrealistic to expect the OSMF to be able to do something similar, we really only want to fix something once.

So in any case I don’t think it makes sense to go down the path of trying to build a validated distribution after the fact, but what might be worth trying is a planet dump (and maybe a weekly diff) that is conditional/delayed on some (lightweight) quality criteria being met in the original OSM data, similar to what Jochen has done with coastline generation. The coastline mechanism naturally also shows how painful that can be, so I don’t think there should be too many illusions about how well this might work, but I think it is definitely worth a try. Criteria could be no vandalized major place names and no major bodies of water missing for starters.
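A conditional release of this kind could be sketched roughly as below. This is only an illustration: the `Snapshot` summary, the check functions and `release_gate` are hypothetical names, and a real pipeline would have to derive the summaries from the planet file itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical summary of a planet file (not a real OSM data structure)."""
    major_place_names: Dict[int, str]   # OSM id -> current name tag
    water_body_ids: FrozenSet[int]      # ids of the major water bodies present


# A check compares the last published snapshot with the candidate and
# returns a list of human-readable problems (empty means "passed").
Check = Callable[[Snapshot, Snapshot], List[str]]


def place_names_changed(last_good: Snapshot, candidate: Snapshot) -> List[str]:
    problems = []
    for osm_id, name in last_good.major_place_names.items():
        current = candidate.major_place_names.get(osm_id)
        if current != name:
            problems.append(f"place {osm_id}: {name!r} -> {current!r}")
    return problems


def water_bodies_missing(last_good: Snapshot, candidate: Snapshot) -> List[str]:
    missing = last_good.water_body_ids - candidate.water_body_ids
    return [f"water body {i} missing" for i in sorted(missing)]


def release_gate(last_good: Snapshot, candidate: Snapshot,
                 checks: List[Check]) -> List[str]:
    """Collect problems from all checks; publish only if the list is empty."""
    problems: List[str] = []
    for check in checks:
        problems.extend(check(last_good, candidate))
    return problems
```

A non-empty result would delay the dump until the issue is fixed in OSM itself or a human confirms the change is legitimate (a major place can of course be renamed for real), which is where the pain seen with the coastline mechanism would show up.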

CjMalone · December 18, 2022, 1:20pm

I don’t think that’s how Daylight works. As far as I know, they fix it in OSM and cherry pick the changeset(s) into the release.

EDIT: sources to support:

And slide 4 “Fix: Submit fixes on live OSM, not in an internal database”

SomeoneElse · December 18, 2022, 2:53pm

(veering somewhat off-topic from the original discussion and perhaps worthy of a different thread)

Has anyone actually analyzed what does get included and what doesn’t and what gets changed? I’m sure that Facebook have said something, but I’m wondering if there has been any independent analysis?

SimonPoole · December 18, 2022, 3:04pm

There’s been a talk, “Keepin' it fresh (and good)!” - Continuous Ingestion of OSM Data at Facebook, on how they continuously integrate updates into their data, but I’m not aware of anything on Daylight.

mmd · December 18, 2022, 3:22pm

Here’s another blog post on the same topic with some more background info: MaRS: How we keep maps current and accurate - Engineering at Meta

SimonPoole · December 18, 2022, 6:39pm

I wrote a bit about the problem with POIs here too: Simon Poole: "@amapanda I've pointed this out before, the way i…" - En OSM Town | Mapstodon for OpenStreetMap

fititnt · December 20, 2022, 1:39am

Leaving aside how to make the distribution itself (file, API, etc.): on governance for specific projects where different corporate players decide what does or doesn’t go into the final release, one close model is… how Unicode decides the Unicode Common Locale Data Repository (CLDR) releases.

CLDR has far less data than OSM (but it still has names for places, regexes to parse dates, plural rules, etc., things that are ready to be used in software) and releases take months, but the number of small decisions is such that, in general, submissions from ordinary users tend to be accepted. The last word tends to go to corporate members, some sort of language representative (often possibly backed by a country concerned with what place names will default to) and others interested in the topic (which doesn’t require payment, but is often restricted by subject or region), via the CLDR Survey Tool.

So, yes, going down this path, the end result would tend to be more aligned with the version used by the UN than with “ground truth”. It might even naturally bring in massive amounts of data from places where OSM is currently blocked, because even if corporations could vote for ground truth, they could still align with economic interests and, in some regions, just agree with the votes of some other org.

So if the CLDR approach is taken for an OSM release backed by these votes, place names in particular would be frozen. The natural advantage for Overture Maps Foundation (“OMF”) players in going with OSMF would be that it acts as conflict resolution between the big corps themselves (plus the fact that open processes have advantages for them). The disadvantage for OSMF is that one release (maybe under a different name) becomes stricter at the cost of disputes being voted on, similar to how some existing standards organizations vote today (this happens with W3C, ISO, etc.; IETF is maybe the notable exception).

Another disadvantage of sorts for OSM in going down this path, even if it works very, very well, is that the corporations might decide to avoid releasing certain kinds of data they would profit from, even if OSM could clearly be expanded to cover them. I suspect this because TomTom seems to avoid releasing certain kinds of data. Again, I might be wrong, but they seem more of a threat to OSMF (as an organization that wants big players around) than to existing businesses that have always worked around OSM data. Leaving more than 50% of the vote to corporations will just make them focus on a subset of OSM data, even smaller than what today’s Geofabrik daily dumps offer, while favoring administrative boundaries suited to commercial use without friction with major governments.

So, at this moment, I’m not sure what the best approach is, but the way Unicode works, which lets corporations vote (most of the time they are just approving what others have already suggested), may be one way to do it. But, by analogy with Unicode, this would also mean:

For every country OSM data covers, this kind of release would allow, for free (no fees to pay), representatives of the local government to vote on the relevant feature (Unicode allows not only corporations, but also “experts”). So: for roads, any country’s department responsible for roads should have a vote; for rivers, any country’s department related to that should have a vote, and so on… That’s the logic. Again, despite having fewer votes, as long as there’s no problem conflating the data, this release would tend to give priority to “official” data (as long as the license allows it).
The agreements OMF would make behind closed doors with local governments would become known to the OSM community. While it is always faster for a government to deal directly with this group (again, no need to wait for community approval), governments would anyway become more likely to allow their data directly into OSM. Acceptance by the OSM community might vary by region (sometimes it might never happen, especially in disputed regions).
The tooling for data conflation is likely to improve. The version released by this initiative might actually serve as a beta testing ground in the eyes of the OSM community, without risk of damage to the reference data.
(Likely I’m missing more impacts)

Anyway, the decision on the approach here needs to take into account what is or isn’t in the medium- to long-term interest of OSMF. But if (and that’s a big if) OSMF does approach them, the way voters are allowed to decide, topic by topic, what goes into published versions might take into consideration how Unicode does it.

fititnt · December 20, 2022, 2:46am

I really liked your comment here!!! In fact, I was already concerned with needing as few humans in the loop as possible, preferably with rules that are reusable everywhere.

While looking into how to convert the tag metadata into RDF, ways to parse regexes, etc. (in other words, knowing how to validate data and convert it from raw strings), I was thinking about ways to reuse rules already used to validate data to enforce a fix (even if that means cutting a tag value that fails to validate against a strict schema).

I took a quick look at how very, very efficient the code used for this is. While there might be other, much more efficient implementations, the way I’m thinking of, RDF inference (mostly SHACL-based), would be very, very slow, to the point that I’m trying to make it use less memory (easier if running client-side, loading only what is nearby); but it would still be general purpose (and make it easier for humans to write new general-purpose rules, including ones that make OSM data interoperable with other data). For example, for something such as maxspeed, the world would have a default, which might be overridden by country and then optionally enforced if some feature is above it for the context. General purpose also means humans would be able to create rules based on POIs (“IF within 100 m of a school entrance THEN maxspeed <= 30 km/h”). It’s slower, but reusable because it uses standards (though it sometimes requires conventions, so people can reuse the same rules).
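To make the layering concrete, here is a minimal sketch in plain Python rather than RDF/SHACL, so it shows the rule logic only, not the inference approach itself. The default values, the country code `AA` and all function names are made up for illustration, and only plain integer km/h values are handled.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, Optional, Tuple

LatLon = Tuple[float, float]

# Hypothetical values, for illustration only.
WORLD_DEFAULT_KMH = 50
COUNTRY_DEFAULT_KMH = {"AA": 60}


def haversine_m(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))


def effective_limit(country: str, position: LatLon,
                    school_entrances: Iterable[LatLon],
                    radius_m: float = 100.0, school_cap_kmh: int = 30) -> int:
    """World default, overridden by country, then capped near a school entrance."""
    limit = COUNTRY_DEFAULT_KMH.get(country, WORLD_DEFAULT_KMH)
    if any(haversine_m(position, s) <= radius_m for s in school_entrances):
        limit = min(limit, school_cap_kmh)
    return limit


def enforce_maxspeed(tag_value: str, limit: int) -> Optional[str]:
    """Keep the tag value if it validates, otherwise cut it (return None)."""
    try:
        value = int(tag_value)
    except ValueError:
        return None  # not a plain km/h integer: fails the strict schema
    return tag_value if value <= limit else None
```

Cutting a value that fails the rule is the strictest option; a softer variant would only flag it for review.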

So, concerning converting the existing Wiki documentation to RDF: with semantic inferencing, yes, a schema-enforced OSM release could proudly say it applies state-of-the-art artificial intelligence to ensure consistency. Actually, we could even go further and release schemas explaining how to compute variables depending on the vehicle someone is using, and software could be coded against a documented strict OSM schema. I don’t brag a lot about it because making things work memory-efficiently is a bit boring, but pretty much every feature announced by TomTom (at least the ones that don’t depend on even more data that OSM doesn’t have) is viable.

stevea · December 20, 2022, 7:07am

I actually followed that, or most of it, and it seems you wish to take an 18-year-old data project (OSM) and turn it into a rules-based system that maybe looks at OSM data, maybe (in some cases) doesn’t even do that. That might be putting it a bit extreme, but I do see a path for that in what you say here.

“…make OSM data interoperable with other data” sounds a lot like “tagging for the renderer.” You want to MAKE OSM data into something? I’m all for improvements, but…maybe we need to talk about this more.

SK53 · December 20, 2022, 4:57pm

Zkir did something like this for routing. He gave a talk on his approach at SotM-Baltics back in 2013. A list of failed validations was produced as feedback to the community. IIRC the routing was for a limited set of inter-city routes (basically E-roads)

SimonPoole · December 21, 2022, 7:08am

Had time to quickly watch the video now; it doesn’t go into any detail of how their process works technically, but it seems the “cherry-picking” is wrt known good object versions, not changesets*. How they get to those known good versions isn’t discussed at all.

Simon

*that would have been very surprising to start with as OSM changesets have none of the properties that you would probably want in such a context.

Minh_Nguyen · December 21, 2022, 8:30am

Each release announcement includes a link to a CSV of “flagged and fixed features”.

SimonPoole · December 21, 2022, 8:37am

I know, and it wasn’t in doubt that FB fixes issues that it has found (just as others do). The question is whether these are applied in parallel or really just in OSM, and if the latter (as they seem to claim), how do they retrieve the “good” version and ensure consistency with other potentially conflicting changes that may have been made in the meantime (aka how they avoid starting over again and again).

PS: to the moderators / @nukeador maybe the whole sub-thread on providing validated data could be moved to a separate topic?

RobJN · December 21, 2022, 1:21pm

Doesn’t the article @mmd linked to answer it?

The local copy can accept only changes made upstream and can’t be written to directly.

SimonPoole · December 21, 2022, 1:41pm

How MaRS and the Daylight distribution relate to each other is not documented anywhere I could find. The cherry-picked versions they mention for the latter might or might not correspond in some fashion to the LoChas, but at least from the two talks there is no reason to believe that these use the same mechanisms (it is obviously likely that the wheel was not totally reinvented).

mmd · December 21, 2022, 1:44pm

See https://daylightmap.org/ → How To Reach The Team

Learn more about the technology behind our process from our engineering team:

MaRS: How Facebook keeps maps current and accurate post on Facebook’s Engineering blog, Sep 30, 2019

“Keepin’ it fresh (and good)!” - Continuous Ingestion of OSM Data at Facebook presentation at OSM State Of The Map US conference, Sep 8, 2019

SimonPoole · December 21, 2022, 1:48pm

And Producing a validated OSM dataset - #3 by CjMalone in that case, wondering how many times we need to re-reference the same stuff in one thread.

mmd · December 21, 2022, 1:51pm

I think they delay small independent edits before publishing their distribution, like having some sort of super fine-grained time machine. What isn’t documented anywhere is the algorithm that decides which of those edits to delay. So what we read about is more how the mechanics of identifying independent pieces work, but not much more.

Somehow this is step one to vendor lock-in in an Overture world. You can get the data, but we control the complete tool chain. I’m wondering why nobody is bringing this up as the major disadvantage compared to OSM.

SomeoneElse · December 22, 2022, 8:32pm

In order to find out what was in it, I had a look. I downloaded https://daylight-map-distribution.s3.us-west-1.amazonaws.com/release/v1.21/planet-v1.21.osm.pbf and then using GeoFabrik’s “rutland.poly” I split out the equivalent to the small English county Rutland, which I downloaded the OSM version of from GeoFabrik.

The Facebook dataset is equivalent to GeoFabrik’s internal data (it has changesets and userids in it), so I needed to download internal data from GeoFabrik by logging in there.

I can easily convert each to “opl” format with e.g.:

osmium cat rutland-221216-internal.osm.pbf -o rutland-221216-internal.osm.opl

which makes comparing files easier.

Despite being dated 16th December, the Facebook data has nothing in it from after 11th November.

Objects missing from the Facebook data seem to include:

Admin boundary nodes, ways and relations, such as https://www.openstreetmap.org/way/256898607 .
Things in OSM with no “main tags”, such as https://www.openstreetmap.org/node/18300649 , which has a name and a source only.

No objects were obviously “added” to the Facebook data, although some data deleted from OSM after the Facebook cut-off date was still present - the Facebook data is old enough that there isn’t an exactly matching date for data from GeoFabrik**. In addition, a way https://www.openstreetmap.org/way/311666071/history (which is not in Rutland, but is part of a route relation which partially is)***.

Object tags were written in a different order in the resultant .opl files and I’ve not done a strict comparison, but a quick glance suggested that all tags were intact.
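For a stricter comparison that ignores tag order, something along these lines might work. It is a rough sketch that assumes the usual OPL layout: space-separated fields, the first being the type letter plus id, a `v` field for the version and a `T` field holding comma-separated `key=value` tags. The function names are made up, and the field layout should be checked against the OPL documentation before relying on it.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Obj = Tuple[str, FrozenSet[str]]  # (version, set of "key=value" tag strings)


def parse_opl_line(line: str) -> Tuple[str, str, FrozenSet[str]]:
    """Split one OPL line into (object id, version, tags as an unordered set)."""
    fields = line.rstrip("\n").split(" ")
    obj_id = fields[0]                                # e.g. "n18300649"
    attrs = {f[0]: f[1:] for f in fields[1:] if f}    # first letter names the field
    tags = frozenset(attrs.get("T", "").split(",")) - {""}
    return obj_id, attrs.get("v", ""), tags


def load_opl(lines: Iterable[str]) -> Dict[str, Obj]:
    """Index objects by id, skipping blank lines."""
    objects = {}
    for line in lines:
        if line.strip():
            obj_id, version, tags = parse_opl_line(line)
            objects[obj_id] = (version, tags)
    return objects


def compare(a: Dict[str, Obj], b: Dict[str, Obj]) -> Tuple[List[str], List[str], List[str]]:
    """Return ids only in a, ids only in b, and ids whose version or tags differ."""
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return only_a, only_b, changed
```

With the two `.opl` files opened as text, `compare(load_opl(osm_file), load_opl(daylight_file))` would list the objects missing from either side plus any whose version or tag set differs, regardless of the order the tags were written in.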

To summarise, in this example, nothing was taken away apart from admin boundaries and things that really don’t matter, and nothing was added. Presumably if someone had added a fake city “(badword)sville” it would have been spotted and removed from OSM or the extract before release, but that wasn’t an issue here. Rutland is a very simple example with no complicated non-English names, so it clearly is only the simplest possible test.

** It would of course be possible to download an OSM planet file for the correct date and use that for the comparison, but I haven’t bothered doing that.
*** GeoFabrik’s data selection likely has more to it than my naive “osmium extract”, which might explain this too.