Improving OpenStreetMap shop coverage with AllThePlaces

In my UK tool, I’ve only come across a few brands that I need to treat specially, e.g. to ignore certain types of objects that I’m not interested in. (My tool also treats multiple OSM objects that have the same website as being a single branch for the matching. So that allows you to not see e.g. a petrol station attached to a Tesco store as an unmatched OSM object.) There are a few ATP sets where I need to do some processing to change some of the brand:wikidata values to more specific sub-brands, but that’s not too bad.
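The "same website means same branch" grouping described above could be sketched like this (my own illustration, not the actual tool's code; `group_by_website` is a hypothetical helper):

```python
from collections import defaultdict

def group_by_website(osm_objects):
    """Collapse OSM objects sharing a website=* value into one logical branch,
    so e.g. a petrol station attached to a Tesco store does not show up as a
    separate unmatched object."""
    groups = defaultdict(list)
    for i, obj in enumerate(osm_objects):
        # Objects without a website each get their own unique group key.
        key = obj.get("website") or ("no-website", i)
        groups[key].append(obj)
    return list(groups.values())
```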

There are lots of cases where the branches in OSM do not have consistent amenity=* or shop=* tags and name=* values though - so I’m a bit surprised that matching on them is working so well for you. Perhaps if you had a single big table of sets of “similar” shop/amenity values then that would work with the way you’re doing it. I still think matching with the brand:wikidata values (possibly in addition to other criteria) is the best way to go. If you’re not worried about unmatched OSM objects, then I’d have thought you could simply consider anything in OSM with a matching brand:wikidata in addition to the shop/amenity tags you’re already searching on to improve things. (Surely that would allow you to pick up some of the examples I gave of your tool missing matches to OSM objects.)
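A rough sketch of what that could look like, assuming a hand-maintained table of "similar" tag groups (the group contents and function names here are invented for illustration):

```python
# Groups of shop/amenity values that should be treated as interchangeable
# when matching. These example groups are assumptions, not a real table.
SIMILAR_TAG_GROUPS = [
    {"shop=convenience", "shop=supermarket", "shop=grocery"},
    {"amenity=fast_food", "amenity=restaurant", "amenity=cafe"},
]

def main_tag(tags):
    """Return the primary shop=*/amenity=* tag of an object, if any."""
    for key in ("shop", "amenity"):
        if key in tags:
            return f"{key}={tags[key]}"
    return None

def tags_compatible(a, b):
    """Two tags are compatible if equal or in the same similarity group."""
    return a == b or any(a in g and b in g for g in SIMILAR_TAG_GROUPS)

def is_candidate(atp_feature, osm_object):
    """An OSM object is a match candidate if its brand:wikidata agrees,
    or failing that, if its primary tag is compatible with the ATP one."""
    brand = atp_feature.get("brand:wikidata")
    if brand and brand == osm_object.get("brand:wikidata"):
        return True
    a, b = main_tag(atp_feature), main_tag(osm_object)
    return a is not None and b is not None and tags_compatible(a, b)
```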

In terms of possible shop/amenity values for each ATP set, you could try looking at the tagging of existing OSM objects with the right brand:wikidata values. Anything that appears more than once or twice would probably be a good candidate to match on. See the different values for Dunelm at https://osm.mathmos.net/chains/Q5315020/
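That suggestion could be implemented by counting tag values over OSM objects that already carry the right brand:wikidata, keeping anything seen more than once or twice (a sketch, with an arbitrary threshold):

```python
from collections import Counter

def candidate_tags(osm_objects, wikidata_id, min_count=3):
    """Count shop=*/amenity=* values on OSM objects with the given
    brand:wikidata, and return the values common enough to match on."""
    counts = Counter()
    for tags in osm_objects:
        if tags.get("brand:wikidata") != wikidata_id:
            continue
        for key in ("shop", "amenity"):
            if key in tags:
                counts[f"{key}={tags[key]}"] += 1
    return [tag for tag, n in counts.items() if n >= min_count]
```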

What I do with the postcodes is use them as a secondary signal for the matching. If there’s a matching postcode then I allow a match over a greater distance (currently infinite). This only gets me into trouble very occasionally if there’s more than one branch of a chain with the same postcode. You could do something similar with website=* matches. It might be dangerous to allow matches on the website over too large a distance though, since someone might have added the wrong website to OSM. At least with UK postcodes, there’s open data on their locations, and I have another tool that flags any addr:postcode=* values that seem to be mis-located.

I will definitely look into suggestions - I plan to experiment further with possible matchers!

For things that can be quickly answered…

Note that I am also matching by name part - name=Foobar pizza would match with, say, name=Foobar pizzeria.
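One possible way to realise that kind of name-part matching (my invention, not necessarily what the tool actually does): treat two names as matching when they share a non-generic token, so "Foobar pizza" matches "Foobar pizzeria" via the shared "foobar" part.

```python
# Tokens too generic to count as evidence of a match; an assumed list.
GENERIC_TOKENS = {"pizza", "pizzeria", "restaurant", "shop", "store"}

def name_tokens(name):
    """Lowercase tokens of a name, with generic words removed."""
    return {t for t in name.lower().split() if t not in GENERIC_TOKENS}

def names_match(a, b):
    """Names match if they share at least one non-generic token."""
    return bool(name_tokens(a) & name_tokens(b))
```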

Doesn’t the Daylight distribution have a schema that could be used here, where POIs are grouped into subcategories? I believe it’s “subclass” in that table; this could be nicely reused to make groups of the OSM tags like tourism, food, etc.

NSI/iD has

On a related note, SCEE (the StreetComplete “Expert Edition” fork) already includes OSMOSE support.

But as noted elsewhere, including ATP/OSM mismatches in OSMOSE is likely more problematic than useful.

You’d also likely need 1000× more manpower. 🙂 Perhaps ATP could allow (logged-in - OSM OAuth2? GitHub?) users to rate the quality of individual scrapers (i.e. good / usable / bad) and display that to potential users. Maybe someone should suggest something like that.

Not necessarily more - it can also be done with extensive but expensive tooling. POI companies have this, and AI helps. It doesn’t have to be correct, it just has to trim things down to the highest-confidence stuff, the goal being to reduce the hours needed.

But effort does have to go in somewhere. My point is that it’s raw data, and therefore has the same problems as machine-learning-derived buildings from satellite imagery and so on, except that POIs also go outdated faster than buildings or roads.

Another wrinkle/issue that I was thinking about recently.

Let’s take for example alltheplaces/locations/spiders/zahir_kebab_pl.py (at commit 55799b3ebfd720f0ad297b191f430e841fe24291) in the alltheplaces/alltheplaces repository on GitHub.

It pulls the lat/lon position of where the map marker is displayed on their POI website.

Is it a problem for us that this marker info is there to be shown on Google Maps?

See Bydgoszcz Skłodowskiej – Zahir Kebab.

I assume that it is not a problem, as we are taking their lat/lon info, not copying Google Maps. And anyway, just because I overlay a location marker (or a set of them) on an OSM background, that does not make my data become ODbL-licensed?

Though is that some kind of a blocker? Or potentially worth asking the LWG about? I guess it is in theory possible that they copied the opening hours and location info of their own shops from Google Maps?

Correct. You are getting data directly from the first party, not some third party like Google Maps, thus the LWG interpretation allows you to operate on that data (including adding it to OSM).

Correct. Those extra marker locations are part of a “Collective Database” (as defined in the Open Data Commons Open Database License (ODbL) v1.0, section 1.0 “Definitions of Capitalised Words”) and are explicitly not considered part of a “Derivative Database”, and thus do not need to be republished under ODbL (unless you also publish your geospatial database containing both them and ODbL parts of the Collective Database [i.e. OSM data], in which case section 4.2 applies).

But in your case, that separate layer of markers is not “tainted” by the ODbL of the other layer (i.e. the OSM basemap).

That being said, this is just my understanding - IANAL.

That’s fine if the first party has worked out the location themselves. But what if ATP is taking the location from an embedded Google Maps widget, and that widget worked out the lat/lon by geocoding a postal address that the first party supplied?

I will do further checks, but the cases I have seen are where the site has lat/lon and instructs the underlying map (Google Maps, OpenStreetMap, etc.) to show a marker at that specific location.

I have found a few store finders (Sweaty Betty, for example) where there is no latitude or longitude and a map is produced on their site by feeding the address to Google Maps - but in that case the ATP spider won’t produce a geolocation, just an address, so I don’t think this is an issue.

Website:

"https://www.google.com/maps/@{},{},19z".format(lat, lon)

Us:
“Who owns these coordinates? Is it Google? We actually don’t use them for mapping; some people use them for matching. But who owns these coords? Let’s spend the next decade deciding. FYI, IANAL.”

My AllThePlaces pull request for StreetComplete is still 10% TODO comments by volume, but it works!

In changeset 167519295 the first POI got added based on ATP hints.

Pictured: an Odido convenience shop that was missing from OSM but was listed in ATP. It is no longer missing in OSM.
