Schema-normalized additional API endpoint

I find this an interesting idea. Additional controls worth considering are:

  • (probably the most important thing of all) removing unintentionally broken or intentionally vandalized elements, identified based on
    • name/description text
    • geometry shape and correctness
    • last editor track record
  • remove broken links in website=*, wikipedia=*, wikidata=* and wikimedia_commons=*
  • remove wikidata=* links that are clearly wrong because they point to a person (the user likely used wikidata=* instead of subject:wikidata=* or something similar), a tree species, … (a sketch of these two link checks follows this list)
  • fix coastlines to prevent the “flooding” effect when a coastline gets broken
  • remove elements where the content of the source=* tag suggests the use of invalid sources with incompatible licenses
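
Both link checks lend themselves well to automation. As a minimal sketch (Python with the requests library; the function names are mine and the checks are deliberately simplified), one could flag website=* values that no longer resolve and wikidata=* values whose entity is an instance of (P31) human (Q5):

```python
import requests

WIKIDATA_ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{}.json"

def website_is_broken(url: str) -> bool:
    """Flag website=* values that no longer resolve (4xx/5xx or no response)."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        return response.status_code >= 400
    except requests.RequestException:
        return True

def wikidata_points_to_human(qid: str) -> bool:
    """Flag wikidata=* entities that are an instance of (P31) human (Q5):
    on a map feature this usually means subject:wikidata=* was intended."""
    entity = requests.get(WIKIDATA_ENTITY_URL.format(qid), timeout=10).json()
    claims = entity["entities"][qid]["claims"].get("P31", [])
    return any(
        claim["mainsnak"].get("datavalue", {}).get("value", {}).get("id") == "Q5"
        for claim in claims
    )

# Q42 (Douglas Adams) is a person, so a feature tagged wikidata=Q42
# would be flagged here.
print(wikidata_points_to_human("Q42"))  # True
```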

Most of the controls cited above have already been addressed in some shape or form by existing QA tools, so we would not need to start from scratch. For example, wikidata=* tags that point to nonexistent IDs or clearly wrong entities are covered by OSM-wikipedia-tag-validator (which, if I understand correctly, uses this library; problem reports are here).
Inspiration for other clearly-wrong-element signs could come from other existing QA tools such as Osmose and OSMCha’s flagging; Osmose, for one, already exposes its issue reports through a public API (see the sketch below).
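
As a concrete example of such reuse, a small sketch of pulling Osmose’s flags as JSON (Python with requests; the item number is illustrative, and I’m going from memory on the exact response shape):

```python
import requests

OSMOSE_API = "https://osmose.openstreetmap.fr/api/0.3/issues"

def fetch_osmose_issues(item: int, limit: int = 50) -> list:
    """Fetch up to `limit` open issues for one Osmose check ("item")."""
    response = requests.get(
        OSMOSE_API, params={"item": item, "limit": limit}, timeout=30
    )
    response.raise_for_status()
    return response.json().get("issues", [])

# Print the id and position of a few flagged elements.
for issue in fetch_osmose_issues(item=7040, limit=10):
    print(issue.get("id"), issue.get("lat"), issue.get("lon"))
```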

Of course, the list of removed elements should always be available and public, so that broken/vandalized items can be found and fixed.

That said, I have one doubt:
Is there some particular reason you suggest creating a new version of the API rather than “simply” creating a new Planet.osm distribution alongside the standard one? In my opinion this would be more useful for downstream data users (and I suspect it would be easier to implement, but I’m not sure).
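
For a sense of what producing such a distribution could involve, here is a minimal sketch using the pyosmium library, assuming the QA checks emit a blocklist of element ids (the ids and file names below are placeholders, not part of any existing tool):

```python
# Minimal sketch: copy the planet, dropping nodes flagged by the QA
# pipeline. Uses pyosmium; the blocklist is a placeholder.
import osmium

FLAGGED_NODE_IDS = {123}  # would come from the QA checks (placeholder)

class SafePlanetFilter(osmium.SimpleHandler):
    """Copies every element to the writer except flagged ones."""

    def __init__(self, writer):
        super().__init__()
        self.writer = writer

    def node(self, n):
        if n.id not in FLAGGED_NODE_IDS:
            self.writer.add_node(n)

    def way(self, w):
        # Ways and relations could be filtered against their own blocklists.
        self.writer.add_way(w)

    def relation(self, r):
        self.writer.add_relation(r)

writer = osmium.SimpleWriter("planet-safe.osm.pbf")
SafePlanetFilter(writer).apply_file("planet.osm.pbf")
writer.close()
```

A real implementation would also have to drop ways and relations that reference removed nodes; that kind of plumbing is exactly why doing this once, centrally, seems more efficient than every data consumer doing it themselves.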

This would be similar to what Meta has done with its Daylight distribution’s Planet file. It is like a standard planet file, except that OSM’s data has been fed through a pipeline that:

  • Tries to find and fix broken coastlines with the open-source OSMCoastline tool
  • Removes items identified as vandalism based on NLP analysis and user embeddings. I haven’t been able to find any open-source code for this task, but a corpus of manually verified vandalism text has been released at this link, and the vandalism-detection machine-learning model is partly described in this paper (a toy sketch of this kind of text check follows this list)
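
Since neither Meta’s model nor its code is public, the following is only a toy stand-in (plain Python, with invented patterns) to show the general shape of a text-based check on name=* values; a real implementation would train a classifier on the released corpus:

```python
import re

# Invented patterns, for illustration only; a real blocklist/classifier
# would be built from the released corpus.
SUSPICIOUS_PATTERNS = [
    re.compile(r"(.)\1{5,}"),        # long runs of one character ("aaaaaaa")
    re.compile(r"^[A-Z\s!]{12,}$"),  # long all-caps shouting
    re.compile(r"https?://"),        # URLs pasted into name=*
]

def name_looks_vandalized(name: str) -> bool:
    """Very rough heuristic flagging obviously suspicious name=* values."""
    return any(pattern.search(name) for pattern in SUSPICIOUS_PATTERNS)

print(name_looks_vandalized("Main Street"))        # False
print(name_looks_vandalized("AAAAAAAAAAAAAAA!!"))  # True
```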

A similar OSM in-house “safe” distribution would be really useful for downstream OSM data users.

PS:

While I understand your point, I don’t think we should adopt this “competition” attitude, nor should it be the motivation behind this feature. After all, the strength of OSM lies in other areas; if there really were a competition over this feature, the corporate backers of OM would have a level of firepower (time, resources, know-how in AI models) that would probably guarantee a win for OM. Having a clean and safe dataset derived from OSM is in the best interest of both the OSM community and OM, and having a common base toolset (possibly extensible to a consumer’s needs) to generate this dataset from the original OSM dataset would be the most efficient solution for both.