I noticed in the talk about Overture that their data carries an extra “quality index” attribute, i.e. the likelihood that the data is correct. In my local area, lots of POIs were imported by a user who has since deleted their account, and the data is of low quality, i.e. it contains many falsehoods.
Should I revert the whole import, even though it contains good entries, or instead add a plausibility score, e.g. 84%, to all of them? Or rather wait some years until the community has sorted out all the errors?
But we don’t have the ability to do that, do we?
Probably the best option, sadly.
It’s not a usable measure, from what I saw in my country. At 0.95, there are still clearly inappropriate ones. At 0.99, the vast majority of valid features are excluded, and a few of the remaining ones still have inaccurate positions. I like to think of the database as a measure of the geographical relevance of Facebook pages.
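To make that trade-off concrete, here is a minimal sketch of what a single cut-off does to a scored dataset. The records are invented toy values, not taken from any real dataset, and the name `summarize` is just for illustration.

```python
from typing import List, Tuple

# Toy records: (score, whether a human reviewer judged the feature valid).
# These values are invented for illustration only.
SAMPLE: List[Tuple[float, bool]] = [
    (0.999, True),
    (0.995, False),
    (0.97, True),
    (0.96, True),
    (0.96, False),
    (0.90, True),
    (0.80, False),
]


def summarize(records: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int, int]:
    """Return (kept_valid, kept_invalid, dropped_valid) for a given cut-off."""
    kept = [valid for score, valid in records if score >= threshold]
    dropped = [valid for score, valid in records if score < threshold]
    # sum() over booleans counts the True entries.
    return sum(kept), len(kept) - sum(kept), sum(dropped)


for t in (0.95, 0.99):
    print(t, summarize(SAMPLE, t))
```

With this toy sample, raising the cut-off from 0.95 to 0.99 drops most of the valid entries while one invalid entry still gets through, which is the pattern described above.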
Speaking as a decades-long software and data quality engineer at companies like Apple, Adobe and more, as well as a contributor to OSM for ~15 years using structured, crowdsourced methodologies that are well-documented (e.g. USBRS, USA Railroads) where I estimate (in wiki sometimes and in passing conversation rarely) that those have 99+% completeness-and-accuracy and ~75% completeness-and-TIGER-Review-and-accuracy (respectively), I would NEVER deign to add a “quality index property” DIRECTLY to OSM data. Such a value would quickly become obsolete; who would update it? How?
I would revert it as an undiscussed import (for a start, the license of the data needs to be checked before import).
This is a reliable indicator of deliberate vandalism or of a very confused user messing up data despite good intentions. In either case, an import done by such people would ideally be reverted.
I have now deleted close to 600 POIs after again pulling random samples and finding bad data in droves.
There is no use in adding something like “Fixme: bad quality import” to highlight them.
Unfortunately, lots of the added stuff has already received quality assurance (e.g. non-canonical tags fixed in batches). Perhaps a time-based Overpass query can find them?
There were 312 changesets by this user:
changesets=> select count(*) from osm_changeset where user_id = '18497449'; count
Here is an overpass query for things last edited by them. There are 250-odd items there - do they look OK? An example is here. That has a changeset tag of
source=knowledge, which is never a good sign
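A query of that kind can be sketched as follows. This is an assumption about its general shape, not the query actually linked above; the helper name `last_edited_by_query` and the timeout value are made up. Note that the Overpass `uid` filter only matches objects whose latest version is by that user, so anything edited by someone else afterwards will not show up.

```python
def last_edited_by_query(uid: int, timeout: int = 180) -> str:
    """Build an Overpass QL query for objects whose latest version is by `uid`."""
    return (
        f"[out:json][timeout:{timeout}];\n"
        # nwr = nodes, ways and relations; (uid:...) filters on the last editor.
        f"nwr(uid:{uid});\n"
        # "meta" includes version, changeset, timestamp and user on each object.
        "out meta center;"
    )


# uid as quoted in the changeset count above
print(last_edited_by_query(18497449))
```

The printed text can be pasted into Overpass Turbo. Without a bounding box the query runs worldwide, so adding one for the affected area may be kinder to the server.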
The DWG had a report about this account, but it had already been deleted by the time we dealt with it. Given what you’ve said (and other information on the ticket), I’d suggest undoing every edit they made. If you’d like help with that let us know.
Edit: To email the DWG about this ticket (not one that I was dealing with), mail firstname.lastname@example.org with a subject line of “[Ticket#2023071410000131] Quality index property”.
I used this Overpass Turbo query to look at the pre-QA state. Loading the result in JOSM and updating the data (Ctrl+U) leaves 1000 POIs that are still in the data.
A discussion in the AT forum is in progress. Somebody there is also interested in the issue.
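For anyone wanting to reproduce a look at an earlier state, a rough sketch using the Overpass `[date:...]` setting follows. This is not the linked query; the helper name `state_at_query` is made up, and the example timestamp is a placeholder that would have to be replaced by a moment after the import but before the batch fixes.

```python
from datetime import datetime, timezone


def state_at_query(uid: int, when: datetime, timeout: int = 180) -> str:
    """Build an Overpass QL query for the database state at `when`.

    `when` should be timezone-aware; it is converted to UTC for the query.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        # The [date:...] setting asks Overpass to answer from historical data.
        f'[out:json][timeout:{timeout}][date:"{stamp}"];\n'
        f"nwr(uid:{uid});\n"
        "out meta center;"
    )


# Placeholder date, chosen only to show the format.
example = datetime(2023, 7, 1, tzinfo=timezone.utc)
print(state_at_query(18497449, example))
```

The result describes old versions, so objects deleted or changed since then have to be reconciled against current data, which is what updating in JOSM does.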
@woodpeck’s complex_revert.pl script got an option
--override recently to overwrite newer versions if a conflict occurs. If most subsequent edits are such humanized bot edits (adding stuff from the Name Suggestion Index without verification on the ground) and you feel brave enough to override a few real on-the-ground surveys, you can use his tool.
If you prefer a safer way, you could write your own reverting software based on Machina Reparanda (note: I did it only once to revert tagging edits).
I had a go and removed part of the stuff. But it does not feel right. There is actually good data too. It is not about vandalism, just somebody with a very old dataset. I’d give it a correctness of say 80%.
Sadly, I have no means to separate the wheat from the chaff, as the entries are spread all over the province, where I have no local knowledge. So it is all or nothing: I can only kill the lot or leave it. Therefore I will stop now. Eventually StreetComplete users will fix it. From what I have seen on Google Street View, they might even have to ring doorbells and ask.
What is the license of this dataset?
Personally, I’d say that “80% correct” was significantly below the threshold for accepting an import into OSM.
Well, I have now removed another 450 on the rationale that the import is missing a license. The user cannot be contacted, so a license cannot be provided. I left in place what has been vetted by locals in the meantime. That is a small percentage; at this pace it will take many years to vet it all. The number above was wrong, as the query counted too much. Thank you all for your patience.
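The split between untouched import objects and ones that other mappers have since edited can be sketched from object histories. Everything here is an assumption for illustration: the function name, the labels, and the idea of passing in a set of accounts known for batch tag fixes. A later edit by another mapper is also only a hint that someone looked at the object, not proof that it was verified on the ground.

```python
from typing import List, Set

IMPORTER_UID = 18497449  # uid quoted earlier in the thread


def classify(version_uids: List[int], qa_uids: Set[int]) -> str:
    """Classify one object from the editor uids of its versions, oldest first.

    `qa_uids` is a hypothetical set of accounts whose edits are batch tag
    fixes and should not count as local vetting.
    """
    if IMPORTER_UID not in version_uids:
        return "unrelated"
    # Look only at versions after the importer's first edit.
    later = version_uids[version_uids.index(IMPORTER_UID) + 1:]
    others = [u for u in later if u != IMPORTER_UID and u not in qa_uids]
    return "touched_by_others" if others else "untouched_import"


# uids 42 and 7 are made up; 42 stands in for a batch-fix account.
print(classify([IMPORTER_UID, 42], {42}))
print(classify([IMPORTER_UID, 42, 7], {42}))
```

Objects in the "untouched_import" group would be the revert candidates, while the rest would need a manual look.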