Schema-normalized additional API endpoint

fititnt · December 16, 2022, 11:41pm

Recommended reading: Overturemaps.org - big-businesses OSMF alternative

This may be a bit out of place (I’m a new contributor, so this should really be a matter for others here), but I would like to put up for open criticism whether it makes sense to have some sort of read-only version of the existing API data (the one that contains the de facto data for users) that could be passed through automated modifications and would stay under the OSMF umbrella (not outside the project).

The actual modifications listed here are mostly examples, and how this would be done is a matter for future discussion. The real request for feedback is whether having such additional versions makes sense.

Some examples of such automated conversions from the reference OSM data

Disclaimer: again, remember that I’m a new user. What can and cannot be done would mostly be decided by others. However, such a version could ideally be rebuilt from scratch from time to time. It would see less use than the existing APIs, and might not even require rebuilding the map, just the API data.

At minimum, this would mean opinionated changes to tagging schemas that are clearly aliases, and removal of tags from very old times that have no impact today.
Values that are clearly impossible (for example, if the maximum highway speed in a country is 120 km/h, a 300 km/h limit on a highway=residential is likely to have an extra 0) could be cleaned from the output. No questions asked. Just make it happen and live with that for each rebuild.
For some types of field that accept a list of items, if more than one separator is in use (like ; vs ; ), regex-based cleanup could be used to enforce a single separator convention. Obviously there are other patterns to clean, but the point here is mostly data cleaning. Without this it is impossible to write documentation telling end users what to expect, and we really want pristine documentation.
(Not sure about applicability; needs feedback) Changesets with a higher likelihood of being suspect (like mass deletions) might intentionally not be applied if their submission is too recent. We assume they don’t exist.
(Not sure about applicability; needs feedback) Same as the previous point, but even after the time that would allow that changeset to be merged has passed, if it receives changeset comments from someone, we add even more delay before merging. Maybe we suggest explicit keywords in the first or second comment to hint whether the changeset is good or not.
… Suggestions? Please let’s keep it simple: something we know could be done in just a few months. In this topic let’s focus on output only, and avoid proposing things that change how data is inserted into OSM (or at least not ones that would be hard to implement).
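
To make the examples above a bit more concrete, here is a minimal Python sketch of what such output-only rules might look like. Everything in it is an assumption for illustration: the alias table, the list of multi-value keys, the per-country speed limit, and the changeset thresholds are made-up placeholders, and the function names are hypothetical. The real rules would have to be decided by others.

```python
import re
from datetime import datetime, timedelta, timezone

# All tables and thresholds below are illustrative assumptions, not agreed rules.
KEY_ALIASES = {"phone_number": "phone"}   # hypothetical alias -> canonical key
LIST_KEYS = {"cuisine", "sport"}          # keys assumed to hold ';'-separated lists
COUNTRY_MAX_KMH = {"BR": 120}             # hypothetical national maximum speed
MASS_DELETE_THRESHOLD = 500               # deletions that make a changeset suspect
BASE_HOLD = timedelta(days=2)             # quarantine for a suspect changeset
HOLD_PER_COMMENT = timedelta(days=1)      # extra delay per changeset comment

_SEPARATOR = re.compile(r"\s*;\s*")
_PLAIN_NUMBER = re.compile(r"[0-9]+")


def normalize_tags(tags, country):
    """Return a cleaned copy of one element's tags; the input is left untouched."""
    out = {}
    for key, value in tags.items():
        key = KEY_ALIASES.get(key, key)
        if key in LIST_KEYS:
            # Collapse every spacing variant around ';' into a bare ';'.
            value = ";".join(p for p in _SEPARATOR.split(value.strip()) if p)
        out[key] = value
    limit = COUNTRY_MAX_KMH.get(country)
    speed = out.get("maxspeed", "")
    # Drop a plain km/h value above the national maximum (likely an extra 0).
    if limit is not None and _PLAIN_NUMBER.fullmatch(speed) and int(speed) > limit:
        del out["maxspeed"]
    return out


def is_held_back(closed_at, deletions, comments, now):
    """True while a suspect changeset should still be left out of the output."""
    if deletions < MASS_DELETE_THRESHOLD:
        return False
    return now - closed_at < BASE_HOLD + comments * HOLD_PER_COMMENT


t0 = datetime(2022, 12, 16, tzinfo=timezone.utc)
cleaned = normalize_tags(
    {"highway": "residential", "maxspeed": "300", "cuisine": "pizza ; pasta"}, "BR"
)
held = is_held_back(t0, deletions=1000, comments=2, now=t0 + timedelta(days=3))
```

The rules only ever produce a new copy of the tags, so the reference OSM data is never touched, and the whole output can be thrown away and rebuilt whenever the rules change.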

But why?

I think the changes that the Overture Maps Foundation (“OMF”) would claim to make to “clean/validate/remove vandalism” from what they call “several sources” could be done by OSMF itself, which is free to make top-down, opinionated decisions. And not just this: we should initially focus on the features that most need to be well cared for. For example, they will focus on their first iteration (in this case highways, and it seems some amenities too, so I would suppose gas stations), so it makes sense for us to do the same and stay focused instead of trying to improve everything. People might be upset with some false positives (though this doesn’t change the original data, just a version derived from it), or even with the very fact of having a normalized version, but I think this approach could allow a fast response by mid-2023 (the release date of the OMF project).

By next year, the organizations behind OMF (even if they add some extra features, like AI-generated roads where none exist) will likely be able to provide even OSMF data in some sort of normalized form. So it really makes sense to, at minimum, not let that kind of argument be valid in their press releases to the point where OSMF cannot defend itself. Both organizations produce open data (but OSMF is the more traditional one, so even those who like the Linux Foundation would try to side with OSMF), so this kind of version of OSMF’s own data, likely with the validation/cleaning rules open for them to contribute to, would remove their incentive not to cooperate on the automated rules we could use to transform OSM data. Add to this that even organizations that have worked directly with OSMF for a long time might start to move to OMF if some kinds of data cleaning cannot be done directly under the OSMF umbrella. This is unlikely to happen very fast (at least not for organizations that are not already OMF members), but it could happen over the years.

So I think it is still technologically viable for OSMF both to keep its community side (without even requiring changes to the data it has, just automated generation from it) and to offer the things that OMF would do.

Potential Timeline to decision

I don’t think the decision about this is urgent, but unless there’s heavy rejection, even if nothing is implemented or approved yet, people thinking about how to make it viable could already prepare scripts and/or work out how to design the infrastructure.

But if there is some sort of ideal date, it should be before OMF starts to deploy services (as of this date, their FAQ says: “Overture will release its first datasets in the first half of 2023”). That would make it easier for OSMF to deal with press releases comparing both organizations and with the benchmarks comparing vandalism/harmful errors on both (which will obviously happen), since this automated approach could allow fast fixes. As much as OMF will try in press releases to get developers who use Google services to move to them, I think we, either as a community or as a suggestion to OSMF, should push aggressively for developers and news media to know about the existence of OpenStreetMap.

About conflicts of interest
I, Emerson Rocha, declare no conflicts of interest on this proposal. Neither my day job nor any of my contracts are related in any way to OpenStreetMap, OSM-related services, or the companies related to either OSMF or OMF. At the time of this writing, from a very utilitarian point of view, I do think that even if done for free (no sponsorship or anything else), it could be better for both OSMF and the OpenStreetMap community to allow this type of version of OSM data to be under the OSMF umbrella; but if agreements with OMF and/or individuals working on OMF happen, they should be made in the open, and that’s it. I’m proposing this (again, as a new collaborator to OSM, so feel free to offer very harsh criticism) because part of their issues, especially those related to making data conflation easier, seem to be real (so I don’t think it is only about avoiding showing “OpenStreetMap collaborators” or even having someone from these companies on the OSMF board, but about how to fill gaps where there is no data without having to wait too long). So this approach might solve 40-70% of the problem that leads OMF to simply not use OSMF data. But even without any agreement with OMF (both explicitly not cooperating with each other), the idea of a similar data endpoint for easy hotfixing seems worth keeping online.

Danysan95 · December 26, 2022, 12:17am

I find this an interesting idea. Additional controls that would be interesting are:

(probably the most important thing of all) removing unintentionally broken or intentionally vandalized elements, identified based on
- name/description text
- geometry shape and correctness
- last editor track record
remove broken links in website=*, wikipedia=*, wikidata=* and wikimedia_commons=*
remove wikidata links that are clearly wrong because they point to a person (the user likely used wikidata=* instead of subject:wikidata=* or something similar), a tree species, …
fix coastlines to prevent the “flooding” effect when it gets broken
remove elements where the content of the source tag suggests the use of invalid sources with incompatible licenses

Most of the controls cited above have been addressed in some shape or form in existing QA tools, so we would not need to start from scratch. For example, wikidata tags that point to nonexistent IDs or clearly wrong entities are covered by OSM-wikipedia-tag-validator (which if I understand correctly uses this library; problem reports are here).
Inspiration for other signs of clearly wrong elements could come from other existing QA tools such as Osmose and OSMCha’s flagging.

Of course the list of removed elements should always be available and public, to find and fix broken/vandalized items.
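
As a rough illustration of the wikidata check together with a public removal log, here is a minimal Python sketch. It is not how OSM-wikipedia-tag-validator works; the function name `audit_wikidata` and the `instance_of` callback are hypothetical, and the callback is assumed to return the set of "instance of" classes for an item (or `None` if the item does not exist), however that lookup ends up being implemented. It also assumes a single value in the tag. Q5 (human) and Q16521 (taxon) are the Wikidata items used as the "clearly wrong" classes.

```python
import re

HUMAN = "Q5"        # Wikidata item for "human"
TAXON = "Q16521"    # Wikidata item for "taxon" (covers species)
SUSPECT_CLASSES = {HUMAN, TAXON}

_QID = re.compile(r"Q[1-9][0-9]*")


def audit_wikidata(element_id, tags, instance_of):
    """Return (cleaned_tags, removals) for one element.

    instance_of is an injected lookup: QID -> set of class QIDs, or None
    when the item does not exist. Each removal carries a human-readable
    reason so the full list can be published for mappers to review.
    """
    kept = dict(tags)
    removals = []
    qid = kept.get("wikidata")
    if qid is None:
        return kept, removals

    reason = None
    if not _QID.fullmatch(qid):
        reason = "malformed item ID"
    else:
        classes = instance_of(qid)
        if classes is None:
            reason = "item does not exist"
        elif classes & SUSPECT_CLASSES:
            reason = "item is a person or taxon"

    if reason is not None:
        removals.append(
            {"element": element_id, "key": "wikidata", "value": qid, "reason": reason}
        )
        del kept["wikidata"]
    return kept, removals


# Offline stand-in for a real lookup: Q42 is a person, Q90 is a city.
fake_lookup = {"Q42": {"Q5"}, "Q90": {"Q515"}}.get
cleaned, log = audit_wikidata("node/1", {"name": "Statue", "wikidata": "Q42"}, fake_lookup)
```

Keeping the lookup as a parameter means the same rule can be run against a local Wikidata dump or a cached table, and the `removals` list is exactly what would be published.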

That said, I have a doubt:
Is there some particular reason you suggest to create a new version of the API rather than “simply” creating a new Planet.osm distribution beside the standard one? In my opinion this would be more useful for downstream data users (and I suspect would be easier to implement, but I’m not sure).

This would be similar to what Meta has done in its Daylight distribution’s Planet file. It’s similar to a standard planet file but OSM’s data has been fed through a pipeline that:

Tries to find and fix broken coastlines with the open source OSMCoastline tool
Removes items identified as vandalism based on NLP analysis and user embeddings. I haven’t been able to find any open source code for this task, but a corpus of manually verified vandalism text has been released at this link and the vandalism detection machine learning model is in part described in this paper

A similar in-house “safe” OSM distribution would be really useful for downstream users of OSM data.

PS:

While I understand your point, I don’t think we should adopt this “competition” attitude nor should it be the motivation behind this feature. After all, the strength of OSM is in other areas; if there really was a competition in this feature, the corporate backers of OM would have a level of firepower (time, resources, know-how in AI models) that would probably guarantee a win for OM. Having a clean and safe dataset out of OSM is in the best interest of both the OSM community and OM, and having a common base toolset (possibly expandable for a consumer’s needs) to generate this dataset from the original OSM dataset would be the most efficient solution for both.

Danysan95 · December 26, 2022, 3:06pm

I have created a new sandbox wiki page to sum up the proposals we are talking about in this thread: User:Danysan/Sandbox/Opinionated Planet.osm - OpenStreetMap Wiki

Everyone is welcome to integrate it with other proposals we can think of.

(@fititnt I haven’t included the 3rd point of your original post because I haven’t understood what that proposal is about)

SomeoneElse · December 26, 2022, 4:04pm

Whilst I don’t think that everything on that page is a good idea (and disagree with the premise of some of the rest), I’d definitely encourage you to experiment by “just doing it” for a smaller area, once you’ve got an idea of what transformations you want to make.

Make sure that there are no external dependencies (so you can do everything locally, or perhaps collaborate on a shared cloud server) so you’re not waiting for anyone else. Put the code on GitHub or some other shared code platform.

Small area downloads are available from several places (see the OSM wiki, but obviously GeoFabrik is one that people often use).
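
As a possible starting point for that kind of local experiment, here is a minimal sketch using only the Python standard library. It assumes a small extract in OSM XML format that fits in memory; the function name `rewrite_extract` and the example rule are hypothetical, and a real planet-scale run would need a streaming tool instead.

```python
import xml.etree.ElementTree as ET


def rewrite_extract(xml_text, rule):
    """Apply rule(tags) -> tags to every node, way and relation in an OSM XML string."""
    root = ET.fromstring(xml_text)
    for element in root:
        if element.tag not in ("node", "way", "relation"):
            continue  # leave <bounds> and similar children alone
        old_tags = element.findall("tag")
        new = rule({t.get("k"): t.get("v") for t in old_tags})
        for t in old_tags:
            element.remove(t)
        # Re-add the transformed tags after any <nd>/<member> children.
        for key, value in new.items():
            ET.SubElement(element, "tag", k=key, v=value)
    return ET.tostring(root, encoding="unicode")


def drop_created_by(tags):
    """Example rule: drop created_by, used here as a stand-in for an obsolete tag."""
    return {k: v for k, v in tags.items() if k != "created_by"}


sample = (
    '<osm version="0.6">'
    '<node id="1" lat="0" lon="0">'
    '<tag k="created_by" v="JOSM"/><tag k="name" v="Cafe"/>'
    "</node>"
    '<way id="2"><nd ref="1"/><tag k="highway" v="residential"/></way>'
    "</osm>"
)
result = rewrite_extract(sample, drop_created_by)
```

Because each transformation is just a function from tags to tags, individual rules can be tried, compared and thrown away on a small area without touching anything upstream.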