My current plan is to run a server with an API allowing entries to be marked as bogus or added, similar to what Osmose has. I admit that I am not entirely sure how to do this yet.
And then get reports - for example, if some spider is more often marked as bogus than used, then it is fishy.
(and also to look at StreetComplete edits in an attempt to spot and stop people adding suggested POIs without verification, for example if they seem to never adjust POI locations or never get changeset comments)
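A minimal sketch of what I have in mind for such an API (the framework choice, endpoint paths, storage schema and "fishy" heuristic are all placeholder decisions, not a commitment):

```python
# Hypothetical sketch of an Osmose-like feedback API: clients mark an ATP
# entry as "added" or "bogus", and a report aggregates per-spider counts
# to spot spiders that are marked bogus more often than used.
import sqlite3
from fastapi import FastAPI, HTTPException

app = FastAPI()
db = sqlite3.connect("feedback.db", check_same_thread=False)
db.execute("""CREATE TABLE IF NOT EXISTS feedback
              (entry_id TEXT, spider TEXT, verdict TEXT)""")

@app.post("/entries/{entry_id}/mark")
def mark_entry(entry_id: str, spider: str, verdict: str):
    # Record one user verdict about one ATP entry.
    if verdict not in ("added", "bogus"):
        raise HTTPException(400, "verdict must be 'added' or 'bogus'")
    db.execute("INSERT INTO feedback VALUES (?, ?, ?)",
               (entry_id, spider, verdict))
    db.commit()
    return {"ok": True}

@app.get("/reports/spiders")
def spider_report():
    # A spider marked bogus more often than used is fishy.
    rows = db.execute("""SELECT spider,
                                SUM(verdict = 'bogus') AS bogus,
                                SUM(verdict = 'added') AS added
                         FROM feedback GROUP BY spider""").fetchall()
    return [{"spider": s, "bogus": b, "added": a, "fishy": b > a}
            for s, b, a in rows]
```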
For the UK, you might be better off using my Chain Reaction tool. This doesn't include everything from AllThePlaces (some have too poor data quality or are otherwise unsuitable, and others are still to be added) but I've done some additional checks and processing to make sure the matching is working well for what's there. It also includes some store data that's not from ATP. There's no overall map, but the discrepancies between the data and OSM are shown (along with other things) in Survey Me.
(The use-case I was trying to cater for with Chain Reaction is to have locations of places that would benefit from a ground survey. I want to avoid false positives wherever possible, to avoid mappers going out of their way for unnecessary checks. I therefore do quite a bit of pre-processing and checking of the ATP data before adding it to the tool. Some of the datasets are much better than others.)
Not as such, no, but it would be good to have something. Most of the ones I can't work with are just because the spider is completely broken and not returning any results.
There are a number of brands with caveats / warnings - these have red boxes on the chain's page, e.g. https://osm.mathmos.net/chains/Q5204927/ . Mostly these are the greyed out ones in the list at https://osm.mathmos.net/chains/ which means they're not feeding through to Survey Me.
I've also got a block of code that does special tweaks to individual datasets before they're imported. These are mostly to fix minor problems with the spiders in ATP, which I haven't got round to feeding back there yet. I want to look for a way to open that up and work through the issues.
How important these issues are depends on what you want to do with the data. (You may not care how the store names or addresses are formatted, or you might not be flagging unmatched stores in OSM, so don't mind if some of these are missing.)
The question is why (yes, I know, it is the StreetComplete way) you find it necessary to reinvent the wheel, when you could just provide the data to Osmose and have false positive handling etc. already done, plus a public API for everybody instead of only SC, and you could simply add Osmose tasks as a source of SC challenges for the fanbois.
The "challenge" with Osmose is that a lot of what it flags aren't problems at all, and OSM already has lots of problems with people "fixing" those. Osmose is very clear about the level of QA information it is providing - it already says on its front screen "In no case Osmose-QA should provide you the absolute right way to map, always keep a critical eye."
However, in something like StreetComplete "tagfiddling from afar" really shouldn't be an issue, so hopefully using Osmose there will be less of an issue than using Osmose almost anywhere else.
Also, there are some negative consequences of having it visible on Osmose.
BTW, I am curious what "yes I know it is the StreetComplete way" is referencing here. I am not really aware of reinventing the wheel where it was not needed. And I am definitely not a fan of starting yet another project doing the same thing and then abandoning it.
If anyone is aware of where it can be put - please let me know! I really would prefer not to reimplement an existing thing. And I would really prefer to avoid sysadmining one more server. I neither like nor enjoy it.
I think where it could be put depends on what the best way to add it to OSM is, and that depends on whether it can be confirmed that any individual place should be added.
In general it is really difficult to add a POI (business or operating venue) to OSM without having direct first-hand knowledge (like StreetComplete asks for). Doing it via JOSM, Rapid or the iD editor is not ideal unless you just happen to know the POIs well from personal experience; it's just not easily confirmed by comparing to a satellite image. I would guess this is the concern of Osmose, because it is the obvious problem I also see.
If it did make sense to make it an overlay for a desktop editing environment (and again, I am not sure it does), it could work as a layer in the Rapid catalog, where various datasets live, often reformatted by Esri to the OSM schema so you can verify and add them to the map (currently buildings, addresses, some roads and sidewalks). So Esri open data/community maps could host that and deploy a URL (consumable via Rapid or anywhere).
Osmose itself is actually a similar layer, along with the other Austrian-hosted one about turns, I forget the name. These are all ephemeral layers that users can take action on and overlay in an editor. Individual and knowledgeable user action is required for each item.
I think we are arriving at the fundamental issue: ATP as a data provider has scraped data (as @SimonPoole says, it will always be at the mercy of the scraping - definitely true). ATP is not doing robust "signal" gathering to see if the data is accurate or providing strong evidence, although the existence of a website with the address is helpful. In the end, ATP leaves it to an end user to risk using the data without a quality guarantee. It's the exact criticism made of Overture POIs, or various other external datasets.
You'd have 1000x more ease using ATP if ATP was not just an aggregator, but was curating and carefully vetting the dataset, the way that perhaps a local government agency does with defibrillators, or perhaps even a bike share company with docks, and so on. ATP's quality will vary with every individual upstream source, and while they do have their own attempt to not ingest junk, it's not going to produce the results that a POI company with paid employees (Safegraph, Dataplor, etc.) or an organized local OSM community will achieve, where someone does the hard part - vetting it - rather than handing over a raw dataset. Again, exactly what people might criticize Overture POIs for, because it leaves the OSM users needing to do the work to find the gems and fix the errors. I also agree with @SimonPoole for ATP, and I say for Overture too, that the best value would be map tiles with almost no existing POIs, but only if there is an OSM user present with strong local knowledge to verify things.
(just to chuck another potential issue in here) ATP data isn't necessarily "correct" or "wrong" - it can be a mixture. For example, here there's a misplaced entry for this pub. The actual pub is in a different location here in OSM, and some of the ATP data is more up to date than OSM, even though the ATP location is far enough away that, on the ground, you can't see the one from the other (obviously a challenge for SC).
I have no idea whether ATP can store information such as "this item has been definitively matched with XYZ object in OSM, and A, B and C data from ATP are correct, but D, E and F aren't". I'd suggest that within ATP would be the right place, as "metadata about 3rd-party datasets" probably doesn't belong in OSM.
Just to add another idea into the mix, you could look into putting it on MapRoulette. It has a flag for challenges that require local knowledge, which I think hides them from armchair mappers. But I'm not sure if there are any mature solutions yet for solving challenges directly from within mobile apps. I've found a master ticket where they track integration progress here, in case you want to have a look yourself.
One of the most annoying failure modes I've seen is a soap shop (yes, really) that has 3 actual shops (all in OSM) and then roughly 100 other locations (i.e. other companies' shops) in which they sell their products … net result: OSM is 100 locations short according to ATP.
Would be quite nifty if we could create a standardized import/sync process for selected ATP spiders that have a "gold standard" approval from the local communities. Would be great to keep opening_hours up to date.
I am currently preparing the next thread, similar to this one, that will discuss opening hours specifically - I want to confirm whether it is potentially viable before spending $BIGNUM hours on this.
For the last one (Five Guys) it's possibly the distance that's the issue. But for the other two I wouldn't have thought that's the problem. FWIW, in my Chains tool I consider precisely the OSM objects with the right brand:wikidata=* value for the matching, and then match based on distance within a threshold, or a larger threshold if there's also a matching (UK) postcode. For this to work well it requires some manual checking and adding of missing brand:wikidata=* tags - but the payoff is that you have much better control over the matching.
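A rough sketch of that matching strategy in Python (the threshold values, field names and the postcode normalisation here are illustrative placeholders, not the tool's actual values):

```python
# Sketch: filter OSM candidates by brand:wikidata, then accept a match
# within a small distance, or a larger distance if UK postcodes agree.
from math import radians, sin, cos, asin, sqrt

NEAR_M = 150            # plain distance threshold (hypothetical value)
POSTCODE_NEAR_M = 1000  # wider threshold when postcodes agree (hypothetical)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * asin(sqrt(a))

def normalise_postcode(pc):
    return pc.replace(" ", "").upper() if pc else None

def match(atp_poi, osm_objects):
    """Return the first OSM object considered a match for an ATP POI."""
    for osm in osm_objects:
        # Only consider objects already carrying the right brand:wikidata.
        if osm["tags"].get("brand:wikidata") != atp_poi["brand_wikidata"]:
            continue
        d = haversine_m(atp_poi["lat"], atp_poi["lon"],
                        osm["lat"], osm["lon"])
        same_postcode = (
            atp_poi.get("postcode") is not None
            and normalise_postcode(osm["tags"].get("addr:postcode"))
                == normalise_postcode(atp_poi["postcode"]))
        if d <= NEAR_M or (same_postcode and d <= POSTCODE_NEAR_M):
            return osm
    return None
```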
I'm confused by what the "% of ATP matched to OSM" is supposed to be. E.g. holland_and_barrett says it's 10%, but there are no "missing in OSM" entries. If only 10% are matched, I'd expect the other 90% to show up in the "missing in OSM" column.
I think the maps would be more useful if they also showed matches as well as the missing stores. (Sometimes a match fails because there's a duplicate in ATP, and this would allow you to spot this.)
As well as stores missing from OSM, the other data quality issue is stores on OSM that shouldn't be there. Do you also check for OSM objects of each store type that haven't been matched to an ATP record?
Should I consider shop=builders_merchant and shop=doityourself as matching? If yes, what else may be worth matching?
shop=interior_decoration vs shop=houseware
Is ATP badly classifying Dunelm? Should it be shop=houseware in ATP? Or maybe OSM is classifying it badly?
shop=interior_decoration and shop=houseware should be considered as matching… What else belongs in this group?
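One lightweight way to encode such groups, as a sketch (the two groups are the ones from this thread; treating them as symmetric equivalence sets is just one possible design):

```python
# "These shop values should count as matching" encoded as equivalence sets.
EQUIVALENT_SHOP_GROUPS = [
    {"builders_merchant", "doityourself"},
    {"interior_decoration", "houseware"},
]

def shops_match(value_a: str, value_b: str) -> bool:
    """True if two shop=* values are identical or in the same group."""
    if value_a == value_b:
        return True
    return any(value_a in group and value_b in group
               for group in EQUIVALENT_SHOP_GROUPS)

assert shops_match("interior_decoration", "houseware")
assert not shops_match("houseware", "doityourself")
```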
Yes, matching on distance alone will consider them separate, not matching.
Some of the problems here are that the same brand:wikidata value may be used for both, say, a supermarket and a fuel station (Tesco).
Or the same value for a fuel station, a convenience shop and a parcel locker (Orlen).
Also, I tried to use Wikidata codes once - you end up in a rabbit hole of some brands having separate Wikidata entries for sub-brands, some not. Sometimes brand:wikidata links a defunct company entry, sometimes a dedicated brand entry. And often multiple at once.
More importantly, I have not yet found a case where brand:wikidata matching would improve the matching.
For the UK, using postcodes is likely helpful, but it does not work well with global coverage, where postcodes are tagged fairly rarely or in various formats.
And for global processing, adding missing brand:wikidata tags as part of making this tool is just not feasible at all.
(some of the prices paid for having a global tool - dedicated localized ones are likely to remain better)
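To illustrate the Tesco/Orlen problem: a matcher keyed on brand:wikidata alone would happily pair a fuel station with a supermarket of the same brand. A sketch of one possible mitigation - also requiring the coarse category to agree (whether this actually helps in practice is, as said above, unproven):

```python
# Matching on brand:wikidata alone pairs a fuel station with a supermarket
# of the same brand. Hypothetical fix: require the coarse category too.
def category(tags: dict) -> str | None:
    # Collapse a tag set to a coarse category (sketch; real code would need
    # to handle far more keys and combinations such as fuel + convenience).
    for key in ("shop", "amenity"):
        if key in tags:
            return f"{key}={tags[key]}"
    return None

def same_brand_and_category(atp_tags: dict, osm_tags: dict) -> bool:
    return (atp_tags.get("brand:wikidata") == osm_tags.get("brand:wikidata")
            and category(atp_tags) == category(osm_tags))

# Q487494 is (I believe) Tesco's Wikidata item; the supermarket and the
# fuel station share it, so brand alone cannot tell them apart.
tesco_shop = {"brand:wikidata": "Q487494", "shop": "supermarket"}
tesco_fuel = {"brand:wikidata": "Q487494", "amenity": "fuel"}
assert not same_brand_and_category(tesco_shop, tesco_fuel)
```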
I have a major TODO for matching: also consider the website tag.
I will try to reduce this confusion; I am not yet sure how.
But it will likely include customized distance thresholds for when a POI is considered gone (for some spiders it should be 15 m, for some 1500 m).
The next regeneration of data will be published with a lower distance threshold for when an object is considered missing. If this goes well I will reduce it further, maybe eliminating the gray area completely.
Or maybe it should be listed as a separate category?
I just drafted something that should report it as a separate category, let's see how well it works.
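For the curious, the shape of what I am describing, as a sketch (all numbers and spider names here are made up for illustration):

```python
# Per-spider distance thresholds plus a separate "gray area" category
# instead of a binary matched/missing split.
DEFAULT_NEAR_M = 100
SPIDER_NEAR_M = {
    "some_precise_spider": 15,   # hypothetical spider names
    "some_sloppy_spider": 1500,
}
GRAY_FACTOR = 5  # "away, but not very far away" multiplier (hypothetical)

def classify(spider: str, distance_to_nearest_m: float | None) -> str:
    """Classify an ATP POI by distance to the nearest OSM candidate."""
    near = SPIDER_NEAR_M.get(spider, DEFAULT_NEAR_M)
    if distance_to_nearest_m is None:
        return "missing in OSM"
    if distance_to_nearest_m <= near:
        return "matched"
    if distance_to_nearest_m <= near * GRAY_FACTOR:
        return "gray area"  # listed separately rather than as missing
    return "missing in OSM"
```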
for now it is kind of blocked by me misusing several tools in my tech stack
one of the funniest parts is that committing the multitude of generated static files for publication takes multiple hours
I will try listing the gray area of objects that are away, but not very far away; if that does not explode processing time, I will also try listing successes.
In such a case both will match the same object without the problem being spotted.
I considered doing this, but ATP data in general is not good enough for that on a global scale. Too many shops are on the wrong continent or misplaced by several kilometres.
It is still in my vague plans, but I am not planning to do it any time soon. Maybe for a few spiders with known excellent quality?
This would allow computing an interesting new metric: the average (or 90th percentile?) distance between ATP and OSM coordinates. This would give you some measure of the location accuracy per brand, and that in turn might be used to decide how much to trust ATP locations when proposing to add them.
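A sketch of how that metric could be computed from already-matched pairs (the input format and the plain sorted-index percentile are assumptions of this sketch):

```python
# Per-brand location-accuracy metric: mean and ~90th percentile distance
# between matched ATP and OSM coordinates.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * asin(sqrt(a))

def location_accuracy(matched_pairs):
    """matched_pairs: list of ((atp_lat, atp_lon), (osm_lat, osm_lon))."""
    distances = sorted(haversine_m(a[0], a[1], o[0], o[1])
                       for a, o in matched_pairs)
    if not distances:
        return None
    # Crude percentile: index into the sorted list (good enough for a report).
    p90 = distances[min(len(distances) - 1, int(0.9 * len(distances)))]
    return {"mean_m": sum(distances) / len(distances),
            "p90_m": p90,
            "count": len(distances)}
```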