Improving OpenStreetMap shop coverage with AllThePlaces

My current plan is to run a server with an API that allows marking an entry as bogus or added, similar to what Osmose has. I admit that I am not entirely sure how to do this yet.

And then generate reports - for example, if some spider's entries are marked as bogus more often than they are used, then that spider is fishy.
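A minimal sketch of what such a feedback service could look like, assuming Flask and SQLite (the endpoint paths, table layout and verdict values below are all invented, not an existing service):

```python
# Sketch of a feedback API for ATP-suggested entries; everything here
# (endpoints, schema, verdict names) is hypothetical.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)

def open_db():
    conn = sqlite3.connect("feedback.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feedback ("
        " atp_id TEXT, spider TEXT,"
        " verdict TEXT CHECK (verdict IN ('bogus', 'added')),"
        " reported_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

@app.post("/feedback")
def record_feedback():
    # Editors report back: was this ATP suggestion bogus, or actually added?
    data = request.get_json()
    with open_db() as conn:
        conn.execute(
            "INSERT INTO feedback (atp_id, spider, verdict) VALUES (?, ?, ?)",
            (data["atp_id"], data["spider"], data["verdict"]),
        )
    return jsonify({"status": "ok"})

@app.get("/reports/fishy-spiders")
def fishy_spiders():
    # A spider marked bogus more often than used deserves a closer look.
    with open_db() as conn:
        rows = conn.execute(
            "SELECT spider,"
            " SUM(verdict = 'bogus') AS bogus,"
            " SUM(verdict = 'added') AS added"
            " FROM feedback GROUP BY spider HAVING bogus > added"
        ).fetchall()
    return jsonify([{"spider": s, "bogus": b, "added": a} for s, b, a in rows])
```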

(and also to look at StreetComplete edits in an attempt to spot and stop people adding suggested POIs without verification, for example if they never seem to adjust POI locations or if their edits attract changeset comments)

Worked for me now. Probably something transient from Codeberg.

For the UK, you might be better off using my Chain Reaction tool. This doesn't include everything from AllThePlaces (some datasets have too poor data quality or are otherwise unsuitable, and others are still to be added), but I've done some additional checks and processing to make sure the matching is working well for what's there. It also includes some store data that's not from ATP. There's no overall map, but the discrepancies between the data and OSM are shown (along with other things) in Survey Me.

(The use-case I was trying to cater for with Chain Reaction is to have locations of places that would benefit from a ground survey. I want to avoid false positives wherever possible, to avoid mappers going out of their way for unnecessary checks. I therefore do quite a bit of pre-processing and checking of the ATP data before adding it to the tool. Some of the datasets are much better than others.)

Do you maybe have a list of bad/rejected spiders? Maybe I should skip more (I already drop some).

Not as such, no, but it would be good to have something. Most of the ones I can't work with are just because the spider is completely broken and not returning any results.

There are a number of brands with caveats/warnings - these have red boxes on the chain's page, e.g. https://osm.mathmos.net/chains/Q5204927/ . Mostly these are the greyed-out ones in the list at https://osm.mathmos.net/chains/ , which means they're not feeding through to Survey Me.

I've also got a block of code that does special tweaks to individual datasets before they're imported. These are mostly to fix minor problems with the spiders in ATP, which I haven't got round to feeding back there yet. I want to look for a way to open that up and work through the issues.
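A simplified sketch of how such per-dataset tweaks can be organized (the spider name and the example fix below are invented, and the real code may look quite different):

```python
# Registry of per-spider fix-up functions, applied before import.
FIXUPS = {}

def fixup(spider_name):
    """Register a tweak function that runs on every item from one spider."""
    def register(func):
        FIXUPS[spider_name] = func
        return func
    return register

@fixup("example_spider")
def fix_example(item):
    # Hypothetical tweak: strip a redundant brand suffix from names.
    item["name"] = item["name"].removesuffix(" - Example Brand").strip()
    return item

def preprocess(spider_name, items):
    tweak = FIXUPS.get(spider_name)
    return [tweak(item) for item in items] if tweak else items
```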

How important these issues are depends on what you want to do with the data. (You may not care how the store names or addresses are formatted, or you might not be flagging unmatched stores in OSM, so you won't mind if some of these are missing.)

The question is why (yes, I know it is the StreetComplete way) you find it necessary to reinvent the wheel, when you could just provide the data to OSMOSE and have false-positive handling etc. already done, get a public API for everybody instead of only SC, and simply add OSMOSE tasks as a source of SC challenges for the fanbois.

The "challenge" with Osmose is that a lot of what it flags aren't problems at all, and OSM already has lots of problems with people "fixing" those. Osmose is very clear about the level of QA information it is providing - it already says on its front screen "In no case Osmose-QA should provide you the absolute right way to map, always keep a critical eye."

However, in something like StreetComplete "tagfiddling from afar" really shouldn't be an issue, so hopefully using Osmose there will be less of an issue than using Osmose almost anywhere else.

For a start, the people running Osmose rejected adding this kind of dataset. I asked in How process for adding suggestion from external dataset looks like? · Issue #2293 · osm-fr/osmose-backend · GitHub about what kind of datasets are welcome, and they clearly rejected it.

Also, there are some negative consequences of having it visible on Osmose.

BTW, I am curious what "yes I know it is the StreetComplete way" is referencing here. I am not really aware of reinventing the wheel where it was not needed. And I am definitely not a fan of starting yet another project doing the same thing and then abandoning it.

If anyone is aware of where it can be put - please let me know! I really would prefer not to reimplement an existing thing. And I would really prefer to avoid sysadmining one more server. I neither like nor enjoy it.

I think where it could be put depends on what the best way to add it to OSM is, and that depends on whether it can be confirmed that any individual place should be added.

In general it is really difficult to add a POI (business or operating venue) to OSM without having direct first-hand knowledge (like StreetComplete asks for). Doing it via JOSM, Rapid or the iD editor is not ideal unless you just happen to know the POIs well from personal experience; it's not easily confirmed just by comparing to a satellite image. I would guess this is the concern of Osmose, because it is the obvious problem I also see.

If it did make sense to make it an overlay for a desktop editing environment (and again I am not sure it does), it could work as a layer in the Rapid catalog, where various datasets live, often reformatted by Esri to the OSM schema so you can verify and add them to the map (currently buildings, addresses, some roads and sidewalks). So Esri open data/community maps could host that and deploy a URL (consumable via Rapid or anywhere).

Osmose itself is actually a similar layer, along with the other Austrian-hosted one about turns, I forget the name. These are all ephemeral layers that users can take action on and overlay in an editor. Individual and knowledgeable user action is required for each item.

I think we are arriving at the fundamental issue: ATP as a data provider has scraped data (as @SimonPoole says, it will always be at the mercy of the scraping, definitely true). ATP is not doing robust "signal" gathering to see if the data is accurate, nor providing strong evidence, although the existence of a website with the address is helpful. In the end, ATP leaves it to the end user to risk using the data without a quality guarantee. It's the exact criticism of Overture POIs, or various other external datasets.

You'd have 1000x more ease using ATP if ATP was not just an aggregator, but was curating and carefully vetting the dataset, the way that perhaps a local government agency does with defibrillators, or perhaps even a bike-share company with docks, and so on. ATP's quality will vary with every individual upstream source, and while they do have their own attempt to not ingest junk, it's not going to produce the results that a POI company with paid employees (Safegraph, Dataplor, etc.) or an organized local OSM community will achieve, where someone does the hard part, which is vetting it, rather than handing over a raw dataset. Again, exactly what people might criticize Overture POIs for, because it leaves OSM users needing to do the work of finding the gems and fixing the errors. I also agree with @SimonPoole for ATP, and I say for Overture too, that the best value would be map tiles with almost no existing POIs, but only if there is an OSM user present with strong local knowledge to verify things.

(just to chuck another potential issue in here) ATP data isn't necessarily "correct" or "wrong" - it can be a mixture. For example, here there's a misplaced entry for this pub. The actual pub is in a different location here in OSM, and some of the ATP data is more up to date than OSM, even though the ATP location is far enough away that, on the ground, you can't see the one from the other (obviously a challenge for SC).

I have no idea whether ATP can store information such as "this item has been definitively matched with XYZ object in OSM" or "A, B and C data from ATP are correct, but D, E and F aren't". I'd suggest that within ATP would be the right place, as "metadata about 3rd-party datasets" probably doesn't belong in OSM.

Just to add another idea into the mix, you could look into putting it on MapRoulette. It has a flag for challenges that require local knowledge, which I think hides them from armchair mappers. But I'm not sure if there are any mature solutions yet for solving challenges directly from within mobile apps. I've found a master ticket where they track integration progress here, in case you want to have a look yourself.

Not planned right now.

see https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset/issues/8

Currently I am stuck on Tasks created with API cannot be rebuild (or at least error message fails to mention how) · Issue #2392 · maproulette/maproulette3 · GitHub, so I am not planning to work on any MR integrations until I unbreak the existing one.

And anyway, the GET /projects returns ERROR: levenshtein argument exceeds maximum length of 255 characters · Issue #1153 · maproulette/maproulette-backend · GitHub bug blocks me from handling MR projects automatically.

(Though I am not sure how much could be done with it - technical issues block it anyway for now.)

One of the most annoying failure modes I've seen is a soap shop (yes, really) that has 3 actual shops (all in OSM) and then roughly 100 other locations (aka other companies' shops) in which they sell their products… net result: OSM is 100 locations short according to ATP.

I could argue ATP is not even wrong in this case.

Do you remember its name? I tracked down similar cases with an ice cream shop and a shoe shop, but I do not remember a soap shop doing this.

Wouldn't it be quite nifty if we could create a standardized import/sync process for selected ATP spiders that have a "gold standard" approval from the local communities? It would be great for keeping opening_hours up to date.

I'm sure there must be some of those :slight_smile:

This one sadly requires far more special processing; ATP data cannot be imported as-is - alltheplaces/DATA_FORMAT.md at master · alltheplaces/alltheplaces · GitHub mentions some of the problems.

I am currently preparing the next thread, similar to this one, that will discuss opening hours specifically - I want to confirm whether it is potentially viable before spending $BIGNUM hours on it.

Hi @Mateusz_Konieczny, nice work here!

I've just had a look at your matching table for my local area: https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/52%201_index.html and noticed a few issues where it's failing to match an ATP location to a store that's mapped in OSM. You might want to have a look at the following:

For the last one (Five Guys) it's possibly the distance that's the issue, but for the other two I wouldn't have thought that's the problem. FWIW, in my Chains tool I consider only the OSM objects with the right brand:wikidata=* value for the matching, and then match based on distance within a threshold, or a larger threshold if there's also a matching (UK) postcode. For this to work well it requires some manual checking and adding of missing brand:wikidata=* tags - but the payoff is that you have much better control over the matching.
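Roughly, the rule looks like this (the thresholds and the postcode normalization below are illustrative guesses, not the tool's exact values):

```python
from math import asin, cos, radians, sin, sqrt

NEAR_M = 150                 # plain distance threshold (assumed value)
FAR_WITH_POSTCODE_M = 1000   # looser threshold when postcodes agree (assumed)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6371000 * 2 * asin(sqrt(a))

def norm_postcode(pc):
    return pc.replace(" ", "").upper() if pc else None

def matches(atp, osm):
    # Only consider OSM objects carrying the matching brand:wikidata value.
    if osm.get("brand:wikidata") != atp["brand_wikidata"]:
        return False
    d = haversine_m(atp["lat"], atp["lon"], osm["lat"], osm["lon"])
    if d <= NEAR_M:
        return True
    # Allow a larger distance when the (normalized) postcodes agree.
    pc_atp = norm_postcode(atp.get("postcode"))
    pc_osm = norm_postcode(osm.get("addr:postcode"))
    return pc_atp is not None and pc_atp == pc_osm and d <= FAR_WITH_POSTCODE_M
```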

In your table at https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/52%201_index.html I think it would be good to also have columns for the total number of branches of each chain in ATP, and the number that you've successfully matched.

I'm confused by what the "% of ATP matched to OSM" is supposed to be. E.g. holland_and_barrett says it's 10%, but there are no "missing in OSM" entries. If only 10% are matched, I'd expect the other 90% to show up in the "missing in OSM" column.

I think the maps would be more useful if they also showed matches as well as the missing stores. (Sometimes a match fails because there's a duplicate in ATP, and this would allow you to spot this.)

As well as stores missing from OSM, the other data quality issue is stores in OSM that shouldn't be there. Do you also check for OSM objects of each store type that haven't been matched to an ATP record?

Related post here:

Thanks!

Mismatch may be on distance, and is definitely on

shop=builders_merchant vs shop=doityourself

  1. Should we document shop=builders_merchant | Tags | OpenStreetMap Taginfo or retag shop=builders_merchant to one of the existing tags?

  2. Should I consider shop=builders_merchant and shop=doityourself as matching? If yes, what else may be worth matching?

shop=interior_decoration vs shop=houseware

  1. Is ATP badly classifying Dunelm? Should it be shop=houseware in ATP? Or maybe OSM is classifying it badly?

  2. shop=interior_decoration and shop=houseware should be considered as matching… What else belongs in this group? (See the sketch below.)
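One way to express this during comparison is to treat certain shop=* values as one group; a minimal sketch (which values actually belong together is exactly the open question above, and the hardware entry is only a guess):

```python
# Treat some shop=* values as equivalent when matching ATP to OSM.
# The grouping below only illustrates the mechanism.
EQUIVALENT_SHOP_GROUPS = [
    {"doityourself", "builders_merchant", "hardware"},
    {"interior_decoration", "houseware"},
]

def shop_values_match(a, b):
    """True if two shop=* values are identical or in the same group."""
    return a == b or any(a in group and b in group for group in EQUIVALENT_SHOP_GROUPS)
```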

Yes, distance alone will consider them as separate and not matching.

Some of the problems here are that the same brand:wikidata may be used for both, say, a supermarket and a fuel station (Tesco).

Or the same for a fuel station, a convenience shop and a parcel locker (Orlen).

Also, I tried to use wikidata codes once - you end up in a rabbit hole of some brands having separate wikidata entries for subbrands, some not. Sometimes brand:wikidata links a dead company entry, sometimes a dedicated brand entry. And often multiple at once.

More importantly, I have not yet found a case where brand:wikidata matching would improve the matching.

For the UK, using postcodes is likely helpful, but it does not work well for global coverage, where postcodes are tagged fairly rarely or in various formats.

And for global processing, adding missing brand:wikidata tags as part of making this tool is just not feasible at all.

(some prices are paid for having a global tool - dedicated localized ones are likely to remain better)

I have a major TODO for matching: also consider the website tag.
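If that happens, the website values would likely need normalizing before comparison; a minimal sketch (the exact normalization rules here are my assumption, not an implemented scheme):

```python
from urllib.parse import urlsplit

def norm_website(url):
    """Normalize a website value so trivially different URLs compare equal."""
    if not url:
        return None
    parts = urlsplit(url if "://" in url else "https://" + url)
    host = parts.netloc.lower().removeprefix("www.")
    # Ignore scheme, query string, fragment and any trailing slash.
    return host + parts.path.rstrip("/")

# Both spellings normalize to "example.com/shop":
assert norm_website("https://www.example.com/shop/") == norm_website("example.com/shop")
```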

I will try to reduce this confusion; I am not yet sure how.

But it will likely include customized distance thresholds for when a POI is considered as gone (for some spiders it should be 15 m, for some 1500 m).
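As a sketch of what that could look like (the spider names, values and default below are invented examples):

```python
# Per-spider distance thresholds for deciding an ATP POI is gone from OSM.
MISSING_THRESHOLD_M = {
    "precise_chain_spider": 15,
    "sloppy_chain_spider": 1500,
}
DEFAULT_THRESHOLD_M = 300  # fallback, would need tuning

def considered_gone(spider, distance_to_nearest_osm_match_m):
    limit = MISSING_THRESHOLD_M.get(spider, DEFAULT_THRESHOLD_M)
    return distance_to_nearest_osm_match_m > limit
```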

The next regeneration of data will be published with a lower distance threshold for when an object is considered as missing. If this goes well, I will reduce it further, maybe eliminating the gray area completely.

Or maybe it should be listed as a separate category?


I just drafted something that should report it as a separate category; let's see how well it works.

For now it is kind of blocked by me misusing several tools in my tech stack :slight_smile:

One of the funniest parts is that committing the multitude of generated static files for publication takes multiple hours.

I will try listing the gray area ("away, but not very far away"); if that does not explode processing, I will also try listing successes.

In such a case, both will match the same object without spotting the problem.

I considered doing this, but ATP data in general is not good enough for that on a global scale. Too many shops are on the wrong continent or misplaced by several kilometres.

It is still in my vague plans, but I am not planning to do it any time soon. Maybe for a few spiders with known excellent quality?

This would allow computing an interesting new metric: the average (or 90th percentile?) distance between ATP and OSM coordinates. This would give you some measure of the location accuracy per brand. And that in turn might be used to decide how much to trust ATP locations when proposing to add them.
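A sketch of how that metric could be computed, assuming one brand's matched pairs are available as ((atp_lat, atp_lon), (osm_lat, osm_lon)) tuples:

```python
import statistics
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6371000 * 2 * asin(sqrt(a))

def location_accuracy_m(matched_pairs, percentile=90):
    """Percentile of ATP-to-OSM distances over one brand's matched pairs."""
    distances = [
        haversine_m(alat, alon, olat, olon)
        for (alat, alon), (olat, olon) in matched_pairs
    ]
    if len(distances) < 2:
        return None  # statistics.quantiles needs at least two data points
    # n=100 yields the 1st..99th percentile cut points
    return statistics.quantiles(distances, n=100)[percentile - 1]
```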
