Improving OpenStreetMap shop coverage with AllThePlaces

One of the things that can be improved in OpenStreetMap is POI coverage. The AllThePlaces project can be helpful here.

This dataset seems very promising, with real potential to help in mapping. But I want to take care to avoid damage: use of external datasets has a long history of causing problems, despite best intentions.

tl;dr

You can go to https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/

Then click on the area of interest to you.

Then review the data.

Then complain, ideally in the way described there, being specific about what kind of data is horrifically broken. This will be very helpful and will result in a specific spider being silenced or the data being fixed. If you see no horrifically bogus data, you should probably spend a few more minutes looking for it.

(if you are confused about what a “spider” is here, or want to learn more, please continue reading)

ATP

ATP has more than three million points of interest, many of them missing from OSM. The dataset is also under active development, with problematic data being fixed and more scanning spiders being added over time.

ATP can be used to make adding POIs more efficient, to detect missing shops, to add missing properties (for example website tags) to existing objects, and so on.

I have already tested it in my local area and on some trips; it allowed me to spot some missing shops and was also helpful in other ways.

ATP is a project collecting POI data from various sources, and we can potentially use a large part of that data for OSM mapping. The collection process has captured about three million POIs so far, and more data-gathering spiders are being added.

I have made some contributions to ATP. They are really great at processing PRs quickly. Also, due to the structure of the project, it is much less complex to contribute to than many other large projects. Even a large part of the issues that I reported got fixed!

Spiders

The ATP project can be described as having:

  • single shared organizational layer
  • thousands of small spiders, each of them obtaining data from a specific source (for example, by parsing a list of shops published by a given brand). The important part is that these sources are distinct, coming from completely different websites (see the sketch below).
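
To make the “small spider” part concrete, here is a rough sketch of what a typical spider looks like. The brand, endpoint and JSON field names are invented for illustration; real spiders live in locations/spiders/ in the ATP repository and are often about this small:

```python
from scrapy import Spider

from locations.items import Feature  # ATP's shared POI item


class ExampleBrandPLSpider(Spider):
    name = "example_brand_pl"  # hypothetical spider; one spider per data source
    item_attributes = {"brand": "Example Brand"}  # tags shared by all its POIs
    start_urls = ["https://example.com/api/stores.json"]  # assumed endpoint

    def parse(self, response):
        # The spider only converts one source's format into Features;
        # the shared organizational layer handles export and publishing.
        for store in response.json()["stores"]:
            yield Feature(
                ref=str(store["id"]),
                name=store["name"],
                lat=store["latitude"],
                lon=store["longitude"],
                website=store.get("url"),
            )
```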

Data quality

Data comes from many different sources. As a result, quality varies wildly:

paczkomat_inpost_pl, allegro_one_box_pl and orlen_paczka_pl have superb accuracy - they need it, so that people will not be confused while hunting for their parcel locker.

poczta_polska_pl has dubious accuracy - some claimed positions of post offices are in the middle of a motorway.

chorten_pl data quality is so bad that I disabled this spider completely, as it was more annoying than helpful.

You cannot really guess this without looking in detail at the specific ATP data, the related OSM data and preferably also doing a local survey.

As another example: initially I was quite happy to find the Orlen spider listing many fuel stations not yet mapped in OSM. On deeper investigation it turned out that ATP was listing amenity=fuel brand=Orlen in places where Orlen had just a hot dog stand or vending machines, for example at bus stations. After I fixed this, all such missing fuel stations disappeared - it turns out that OpenStreetMap had all of them already. If I had been just a bit less careful, I would have added a bunch of fake petrol stations.

And similar problems were present in other spiders; some of them are now fixed and some remain broken.

So you need to be careful when using this data. Some types of bad data can be discarded with smarter preprocessing (say, phone = +undefinedundefinedundefined), as sketched below, but some are much trickier.
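
A minimal sketch of what such preprocessing can look like - the specific rules here are my illustration, not code from ATP or from my project:

```python
# Drop tag values that are obvious artifacts of a broken scrape.
def drop_obviously_broken_values(tags: dict) -> dict:
    cleaned = {}
    for key, value in tags.items():
        if not isinstance(value, str) or not value.strip():
            continue  # skip empty or non-string values
        # e.g. phone = "+undefinedundefinedundefined", scraped verbatim
        # from a buggy store-locator page
        if "undefined" in value or value.lower() in ("null", "none", "n/a"):
            continue
        cleaned[key] = value.strip()
    return cleaned
```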

Note that while the data is published with a tagging schema mirroring OpenStreetMap, it is not an exact match - for example, if someone wants to use opening_hours data, it comes in a format with some subtle incompatibilities.

And name keys in ATP very, very often do not match at all what should be in name keys in OSM (cleanup is ongoing, but far from complete).

Using this data in OSM

As often happens, imports may run into trouble because quality expectations are higher in OSM than elsewhere.

It is common that the accuracy of POI locations in OpenStreetMap is higher than in listings of objects published by their owners. I ran into this with bicycle parking data I obtained a few years ago - it was unusable for import, but it pointed me to locations worth surveying. And for many spiders there are various problems, as mentioned in the previous section.

But I think there is significant potential here:

  • if OSM and ATP data are compared, we can get a list of places where OSM is likely missing a POI - in general this cannot be just imported, but it may be treated like an anonymous note about a missing shop.
    • So it may be worth walking there and checking whether the POI is actually there
    • It could be helpful to show this to mappers as a suggestion in mapping tools, and to allow them to quickly apply it with a single click or two - after the user has checked where the POI is actually located (and whether it is there at all!)
  • the website tag looks importable for cases where ATP and OSM objects clearly match (see the matching sketch after this list)
    • a link specific to a given POI can be useful both for data users and for mappers
    • the website property is not impacted by the various subtle issues plaguing opening_hours, phone and other properties that could be imported from ATP
    • though it still requires some advanced filtering to ensure high-quality matching
  • in some (very limited!) cases it may be possible to add POIs without a local survey
    • In some cases a shop may have its own website, with the link present as part of the object. Such a website may include photos allowing one to locate exactly where the POI should be mapped
    • some types of objects (some parcel lockers, fuel stations) can be verified as existing at a given position with aerial imagery
    • some spiders may have data accurate enough to make outright imports possible (paczkomat_inpost_pl for example is promising)
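
As an illustration of the matching mentioned above, here is a rough sketch of a deliberately conservative ATP-OSM comparison. It assumes both datasets were already loaded as dicts with “lat”, “lon” and a “tags” dict; the 100 m threshold is an illustrative guess, not a tested value:

```python
import math


def distance_in_meters(lat1, lon1, lat2, lon2):
    # Equirectangular approximation - accurate enough at POI distances.
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return math.hypot(x, y) * 6_371_000  # mean Earth radius


def clearly_matching(atp_poi, osm_poi, max_distance=100):
    """Accept a pair only when both location and identity agree."""
    if distance_in_meters(atp_poi["lat"], atp_poi["lon"],
                          osm_poi["lat"], osm_poi["lon"]) > max_distance:
        return False
    # Require agreement on brand and on object type, so that a website
    # tag does not land on an unrelated shop that happens to be nearby.
    atp_tags, osm_tags = atp_poi["tags"], osm_poi["tags"]
    return (atp_tags.get("brand") is not None
            and atp_tags.get("brand") == osm_tags.get("brand")
            and atp_tags.get("shop") == osm_tags.get("shop"))
```

Rejecting a pair here is cheap (the place just stays on the “needs survey” list), while a wrong match could put a website tag on the wrong object - hence the bias toward false negatives.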

Various communities have different approaches to the use of such resources, especially for use with limited human review. See for example the differing acceptance of adding extra tags to match objects with external databases: in some communities it is considered invalid with rare exceptions, while in others it has some acceptance. The same goes for the question of how rare errors need to be for an edit to be acceptable. Maybe fewer errors than typical human mapping should be required? Fewer errors than an expert mapper would make? Some other criteria?

So if you are planning to start importing data, please read and follow Import/Guidelines - OpenStreetMap Wiki

Note: I am working toward all of the ideas listed here. I started this work with the intention of using such info in StreetComplete. I have thought about adding website tags in Poland and the USA - and if things go great, then maybe suggesting it elsewhere too. I have already tried mapping fuel stations remotely and I want to try again; maybe it will go better than with Orlen.

See actual data

You can see the result of experimental processing at https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/ - it processes and compares ATP and OSM data, throws away known dubious or incorrect parts and lists what seems likely to be useful for an OSM mapper (a rough sketch of the core check follows below).

Click on the area of interest to you to get a listing.
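
For the curious, here is a rough sketch of the core “is OSM likely missing this POI?” check. The actual processing works on whole datasets rather than per-POI queries, so treat this as an illustration; querying the public Overpass API like this is only reasonable for spot-checking a handful of locations:

```python
import requests

OVERPASS_API = "https://overpass-api.de/api/interpreter"


def osm_has_shop_nearby(lat, lon, radius_m=150):
    # The radius is an illustrative guess - ATP positions are often offset.
    query = f"""
    [out:json][timeout:25];
    nwr["shop"](around:{radius_m},{lat},{lon});
    out center 1;
    """
    response = requests.post(OVERPASS_API, data={"data": query})
    return len(response.json()["elements"]) > 0
```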

I want to complain about something

Note: data issues such as “a specific POI is offset by 10m / 100m / 400m” are sadly normal and not worth reporting. That is why this data is in general not importable.

But if you see something like “all locations are massively offset” rather than “one of the POIs is massively offset”, please report it.

Even better: if ATMs are reported as banks, hot dog stands as fuel stations, or fake shops are listed as existing - please report it.

If you have confirmed that it is ATP’s fault and it can be fixed by ATP, you can report issues directly at GitHub - alltheplaces/alltheplaces: A set of spiders and scrapers to extract location information from places that post their location on the internet. ( https://www.alltheplaces.xyz/ has a map with the raw ATP data.)

But if you are unsure or confused, please report them at Issues - matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data - Codeberg.org (requires a Codeberg account) or send me a PM - via the forum or via my OSM profile.

Legal “fun”

Legal/license/copyright(-like) issues are not forgotten, but they are discussed in a late section so as not to scare away normal people.

As I understand it, we can use in OSM mapping:

  • data from opening hour signs and shop signposts
  • opening hours info posted on own shop website
  • a bunch of opening hours info posted on a chain’s website (for example, a listing of all shops of a major brand)

See the LWG minutes that discussed the last case.

AllThePlaces is a project that collects (among other things) data such as “here is what chain XYZ claims about their shops”.

This may be worrying too much - but I am not 1000% sure whether this decision also covers locations; they discussed specifically opening hours. (LWG was discussing it partially because I bothered project maintainers about the legal status of ATP data, so I suspect that I am worrying too much.)

There is also another unfun part - ATP collects not only the first-party data that was discussed there. It republishes, for example, https://ourairports.com/ (which describes itself as public domain). In principle, ATP could start to include another dataset that comes with other restrictions. See the case of OpenAddresses.io, which contains some data described by the project itself as “open” but with restrictions that make it impossible to use, for example, in OpenStreetMap. See also Maybe mention in readme that ATP takes data from first part websites? · Issue #8790 · alltheplaces/alltheplaces · GitHub.

So it seems to me that any use of ATP data in OSM would need a review of new spiders, checking whether each one uses first-party data or is otherwise usable.

I do not feel comfortable with some spiders - see Maybe mention in readme that ATP takes data from first part websites? · Issue #8790 · alltheplaces/alltheplaces · GitHub. For example, I have disabled for now ‘cbrfree_au’, ‘james_retail_gb’, ‘queensland_government_road_safety_cameras_au’, ‘terrible_herbst’, ‘thales_fr’ as slightly suspect and pending further investigation.

Why am I posting this?

If some (or all) of the plans listed here seem to be harmful/bad/hopeless/doomed ideas, especially if the problem was not mentioned - feedback is welcome. I have spent quite a lot of time on this project and it seems promising, but maybe I missed some major problem that should be addressed. Or maybe the review mentioned here is inadequate.

Review of the content and data listed at https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/ is welcome. “Some POIs in spider XYZ are offset from the given location” is not reportable - it is a common, unfixable issue - but reports about all other data issues are welcome (it is best to report as described there, but you can also comment here in this thread).

GitHub - alltheplaces/alltheplaces: A set of spiders and scrapers to extract location information from places that post their location on the internet. seems a nice project to contribute to, especially as the bottleneck is not at PR review and merging - see Pull requests · alltheplaces/alltheplaces · GitHub - this happens quite quickly (with more than 6,000 PRs merged so far), and nearly all the work is strongly modularized into specific spiders, so it is much easier to change or add something. I have had a good experience there.

If you managed to read this far - congratulations! Please comment if you have any concerns/feedback or if I missed something worth considering.

Feedback about the processed data I prepared is also highly welcome.

And from my experience, the ATP project also accepts pull requests and acts on them well.

Disclaimers

This post is not an official statement of any organization.

I received external funding (from the European Union, via NGI0 / NLnet) for development of software to compare OSM and ATP data and to create editing support in StreetComplete.

Note that I will be paid for implementing this, not for releasing this editing feature to the general community, so the conflict of interest is reduced here: I will be paid even if this feature turns out to be misused/not usable/too dangerous and in the end is not included in StreetComplete - or is removed due to some problems.

I explicitly excluded from the grant payment any time not spent on making this comparison software and the StreetComplete quest. For example, time spent on a potential import (and on writing this post) comes from my hobby time.

21 Likes

Sorry if this has already been addressed, but what is the license of the data ATP offers?
One of the businesses I see in my area is boxnow (a Greek parcel service), and on the map it shows the locations of some of the lockers. The issue is that the business states on their site (3rd section, “USE LICENSE”) that their data should be used only for private usage (which of course includes their map of the locker locations).

I know the data you mention above are about the opening hours only, which means the ATP data for boxnow isn’t usable anyway, as they don’t have hours (24/7, without access restriction). But I’m just wondering about the license compatibility of ATP towards OSM if the data they have aren’t license compatible with the original sources (wherever this applies).

I haven’t followed the whole ATP discussion, which I know has been going on for a long time now, so sorry if this has already been discussed. Thank you for your work on this anyway.

1 Like

That’s awesome! So the idea is that StreetComplete could have quests like “Is <business> located here?” or “Is <OSM node> the same as <ATP listing>?” that would import data from ATP into OSM if you answer “yes”?

That sounds really promising!

6 Likes

more of “this is supposed to be somewhere here, please select the exact location”, with an option to report that it is not here at all or is mapped already.

3 Likes

As I understand https://osmfoundation.org/wiki/Licensing_Working_Group/Minutes/2023-08-14#Ticket#2023081110000064_—_First_party_websites_as_sources (I am not a lawyer, this is not an official statement of any organization), this can be at least in part ignored - in the same way as a supermarket putting up a sign saying “you are not allowed to record opening hours and prices of our products without written permission of the Director” can be ignored.

And various organisations, including government institutions, have a long history of claiming that something cannot be used despite not being entitled to forbid it.

On the other hand, OSM is not a good place to experiment with vague and unclear parts of copyright, so maybe I should skip them in my listing?

1 Like

LWG was commenting specifically on the opening hours question, and I am wondering whether I should try to get more specific advice on also taking location data and other tags at the same time.

I do not want to overinterpret:

Copying the opening hours of a business from its own website is fine. There are no copyright rights in factual information like opening hours. There’s no investment in the database for a business for its own opening hours, because that is something that the business has to have for its purpose of operating. A business does not have additional investment in a database, so there are no database rights to protect.

From a legal risk perspective, we do not consider accepting this information to be a legal risk to OSMF and therefore DWG is not going to revert these edits.

but it seems to also apply a bit more generally, not to opening hours only.

(still not a lawyer, still not an official comment of any organisation)

1 Like

That makes sense - in the UK, I can certainly think of websites on which some information has been sourced from sources that would be incompatible with OSM, but the narrow focus of the LWG statement would exclude those and not be problematic. While I can think of examples where a business is publishing opening hours from a third party (malls etc.), I’d struggle to engineer an example where that could be from an incompatible source.

(at the risk of stating the bleeding obvious, I’m not a lawyer either)

1 Like

In the meaning of “you should ask LWG again to confirm that this tool is fine”, or as in “it looks fine to use location and other extracted tags”?

Neither of those - what the LWG said, including what data it applied to, looks fine to me

Maybe this is a dumb question, but how do brands get into AllThePlaces to start with? To add a brand, do we have to write spiders ourselves? I realise this has nothing to do with your tool. But I got a bit lost in the ATP documentation; maybe a quick summary here would be useful.

I ask because so many brands seem to be missing, at least where I live (Spain). There appear to be no banks at all, for example. I can only see one of the major supermarket chains. Other major retailers like Zara, Mango, Primor don’t seem to be present. All of these are in the Name Suggestion Index.

Is this typical, or does Spain have worse coverage than other countries?

(There are a lot of Belgian public transport stops on the map of Spain, for some reason - I’ve raised that directly at ATP).

3 Likes

yes

See, say, [toy_kingdom_za] Add spider by arrival-spring · Pull Request #10991 · alltheplaces/alltheplaces · GitHub or [TGI Fridays GB] add spider by TheUKHighStreet · Pull Request #10992 · alltheplaces/alltheplaces · GitHub, which add new spiders.

Brands can make it easier (by making correct data easily available in GeoJSON or another standard format) or harder (by having bogus data, using a bespoke format, or by attempting geoblocking/bot blocking) - see the sketch below.
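
A sketch of why a standard format helps so much: when a brand publishes its store locator as GeoJSON (the URL below is invented for illustration), the whole “parsing” step collapses to walking over ready-made features:

```python
import requests

data = requests.get("https://example.com/stores.geojson").json()
for feature in data["features"]:
    lon, lat = feature["geometry"]["coordinates"]  # GeoJSON is lon-first
    properties = feature["properties"]
    print(properties.get("name"), lat, lon)
```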

Fairly typical - only a small part of chains have spiders written. But at least for the purpose of “let’s list potential survey locations” it is not a blocker.

Though obviously adding more spiders would increase the usefulness of such a tool.

(but for me at least, it is more important to ensure that existing spiders have no bad data, or that the bad ones were found and marked as ignored)

1 Like

gbfs - non-existent public transport positions with tags from Belgium appearing on map of Spain · Issue #11008 · alltheplaces/alltheplaces · GitHub - thanks!

It would be helpful to also add a link to the map view (you may still have it in your browser history), so the location does not need to be reconstructed.

1 Like

The main issue I have seen with this is the large number of false positives (as in, outweighing the actual issues in OSM by multiple orders of magnitude) that essentially leads to the effort being more about fixing ATP than improving OSM.

And as ATP will always be at the mercy of the scraped websites, that effort essentially does not go into anything that will retain its value.

Now, in areas with low POI coverage this might still be an acceptable tradeoff; in other places …

5 Likes

https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/42%20-72_index.html is down for me.

Other graticules work, so I suspect the page is just too large.

Thanks, is there a map on that page that shows the pins for all the chains? Just so I can check a specific area.

It is interesting data; in the UK it shows some categories of POIs that we haven’t mapped at all - such as self-serve coffee machines, launderettes and photo booths (e.g. inside supermarkets and shopping centres).

2 Likes

The map of all the chains is at All The Places | Map

Thanks, but I meant the output of the matching that @Mateusz_Konieczny did - e.g. all the stores (that may be) missing in OSM.

1 Like

I think for lower quality data sources, the stated approach (using ATP as a tool for mappers to add POIs to OSM via StreetComplete and other editors, rather than directly importing) is good.

Once a scraped place has been verified by people on the ground to be accurate and properly geolocated, it should be a lot easier to continuously pull accurate data from the spiders back into OSM (e.g. for things that change semi-frequently, like opening hours) with a much lower risk of having bad data mixed in.

1 Like

I worried about this, but for example in Poland I recently ran out of false positives and problems to fix in ATP, and got more optimistic.

Note: that is after adding some filtering and throwing out entire classes of shops, and without attempting to use opening_hours, where more work would be needed, etc.

Basically, I am only using “somewhere around here a shop with this name and type exists”.

Still, I am especially enthusiastic about the potential for finally having up-to-date opening hours data, at least for some major chains.

It works for me - can you try again?

Not yet - I opened #18 - Allow to view all locations reported as missing in a given area - matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data - Codeberg.org (either I can produce merged HTMLs or publish it with some more standard solution, like ATP’s map of their full data).

(or maybe some brave person will make a PR - if someone tried and failed, please let me know which part of the code was terrible/confusing)

1 Like

On a semi-related note, is there a planned feedback mechanism for what StreetComplete will do when a user finds bad data in ATP (like, a store that doesn’t exist or has incorrect tags)? Obviously it needs some way to mark the quest as complete. Will that mechanism notify ATP somehow?

2 Likes