One of the things that can be improved in OpenStreetMap is POI coverage. The AllThePlaces (ATP) project can be helpful here.
This dataset seems very promising, with potential to help in mapping. But I want to take care to avoid damage. Use of external datasets has a long history of causing problems, despite best intentions.
tl;dr
You can go to https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/
Then click on an area of interest to you.
Then review the data.
Then complain, ideally in the way described there, being specific about what kind of data is horrifically broken. This will be very helpful and will result in silencing a specific spider or fixing the data. If you see no horrifically bogus data, you should probably spend a few more minutes looking for it.
(if you are confused about what a "spider" is here, or want to learn more, please continue reading)
ATP
ATP has more than three million points of interest, many of them missing from OSM. This dataset is also in active development, with problematic data being fixed and more scanning spiders being added over time.
ATP can be used to make POI adding more efficient, detect missing shops, add missing properties (for example website tags) to existing objects, and so on.
I already tested it in my local area and on some trips; it allowed me to spot some missing shops and was helpful in other ways too.
ATP is a project collecting POI data from various sources, and we can potentially use a large part of that data for OSM mapping. The collection process has captured about three million POIs so far, and more data-gathering spiders are being added.
I have made some contributions to ATP. They are really good at processing PRs quickly. Also, due to the structure of the project, contributing is much less complex than in many other large projects. Even a large part of the issues that I reported were fixed!
Spiders
The ATP project can be described as having
- a single shared organizational layer
- thousands of small spiders, each of them obtaining data from a specific source (for example, by parsing a list of shops published by a given brand). An important point is that these sources are distinct, coming from completely different websites.
Data quality
Data comes from different sources. As a result, quality varies wildly:
- paczkomat_inpost_pl, allegro_one_box_pl and orlen_paczka_pl have superb accuracy - they need it so people will not get confused while hunting for their parcel locker.
- poczta_polska_pl has dubious accuracy; some claimed positions of post offices are in the middle of a motorway.
- chorten_pl data quality is so bad that I disabled this spider completely, as it was more annoying than helpful.
You cannot really guess that without looking in detail at the specific ATP data, the related OSM data, and preferably also a local survey.
As another example: initially I was quite happy to find the Orlen spider listing many fuel stations not yet mapped in OSM. On deeper investigation it turned out that ATP was listing amenity=fuel brand=Orlen where Orlen had just a hot dog stand or vending machines, for example at bus stations. After I fixed this, all such missing fuel stations disappeared. It turns out that OpenStreetMap had all of them already. If I had been just a bit less careful, I would have added a bunch of fake petrol stations.
Similar problems were present in other spiders; some of them are now fixed and some remain broken.
So you need to be careful when using this data. Some types of bad data can be discarded with smarter preprocessing (say, phone = +undefinedundefinedundefined) but some are much trickier.
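A minimal sketch of such preprocessing is shown below. The helper names and the specific garbage patterns are my illustrative assumptions, not ATP's actual cleanup code:

```python
import re

# Placeholder markers that signal the value is garbage rather than real
# data (illustrative assumptions, not an exhaustive list).
BOGUS_MARKERS = ("undefined", "null", "n/a")

def is_bogus_value(value: str) -> bool:
    """Return True for values like '+undefinedundefinedundefined'."""
    lowered = value.strip().lower()
    if any(marker in lowered for marker in BOGUS_MARKERS):
        return True
    return lowered in ("", "-")

def looks_like_phone(value: str) -> bool:
    """Very rough sanity check: digits with optional +, spaces, dashes, parens."""
    return bool(re.fullmatch(r"\+?[\d\s\-()]{5,20}", value.strip()))

def clean_poi(tags: dict) -> dict:
    """Drop obviously bogus values; drop phone tags that fail a sanity check."""
    cleaned = {k: v for k, v in tags.items() if not is_bogus_value(v)}
    if "phone" in cleaned and not looks_like_phone(cleaned["phone"]):
        del cleaned["phone"]
    return cleaned
```

This kind of filter only catches mechanical garbage; it cannot detect data that is well-formed but factually wrong, like the Orlen case above.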
Note that while the data is published with a tagging schema mirroring OpenStreetMap, it is not an exact match - for example, opening_hours data is in a format with some subtle incompatibilities.
And name keys in ATP very often do not match at all what should be in OSM name keys (cleanup is ongoing, but far from complete).
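One conservative way to handle this is to only trust an ATP name when it does not look like a "brand plus branch location" string. This is a sketch under my own assumptions; the heuristics below are illustrative, not ATP's actual cleanup logic:

```python
def usable_name(atp_tags: dict) -> "str | None":
    """Guess whether ATP's name value is plausible as an OSM name tag.

    ATP names often embed branch/city info ("Brand Townname") that does
    not belong in OSM's name key. Heuristics here are illustrative.
    """
    name = atp_tags.get("name", "").strip()
    brand = atp_tags.get("brand", "").strip()
    if not name:
        return None
    if name == brand:
        return name  # plain brand name is at least plausible
    if brand and name.startswith(brand + " "):
        # likely "Brand Branchlocation" - suspicious, needs human review
        return None
    return name
```

Anything rejected here still needs human review rather than silent dropping, since some shops genuinely have longer names.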
Using this data in OSM
As often happens, imports may run into trouble, as quality expectations are higher in OSM than elsewhere.
It is common that the accuracy of POI locations in OpenStreetMap is higher than in listings of objects published by their owners. I ran into this with bicycle parking data I obtained a few years ago: it was unusable for import, but it pointed me to locations worth surveying. And for many spiders there are various problems, as mentioned in the previous section.
But I think there is significant potential here:
- if OSM and ATP data are compared, we can get a list of places where OSM is likely missing a POI - in general it cannot just be imported, but it may be treated like an anonymous note about a missing shop
- so it may be worth walking there and checking whether the POI is actually there
- it could be helpful to show this as a suggestion in mapping tools, and to allow applying it quickly, with a single click or two - after the user has checked where the POI is actually located (and whether it is there at all!)
- website tags look importable for cases where ATP and OSM objects clearly match - a link specific to a given POI can be useful both for data users and for mappers
- the website property is not affected by the various subtle issues plaguing opening_hours and phone tags and other properties that could be imported from ATP - though it still requires some advanced filtering to ensure high-quality matching
- in some (very limited!) cases it may be possible to add POIs without a local survey
- in some cases a shop may have its own website, with a link present as part of the object; such a website may include a photo allowing one to locate exactly where it should be mapped
- some types of objects (some parcel lockers, fuel stations) can be verified as existing at a given position with aerial imagery
- some spiders may have data accurate enough to make outright imports possible (paczkomat_inpost_pl, for example, is promising)
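The comparison step above can be sketched roughly as follows. The field names, the brand:wikidata matching key and the 150 m threshold are my illustrative assumptions, not the actual implementation:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def likely_missing(atp_pois, osm_pois, max_dist_m=150):
    """Return ATP POIs with no same-brand OSM object nearby.

    Each POI is assumed to be a dict with 'lat', 'lon' and
    'brand:wikidata'. The threshold is an arbitrary illustrative choice.
    """
    missing = []
    for atp in atp_pois:
        matched = any(
            osm.get("brand:wikidata") == atp.get("brand:wikidata")
            and haversine_m(atp["lat"], atp["lon"], osm["lat"], osm["lon"]) <= max_dist_m
            for osm in osm_pois
        )
        if not matched:
            missing.append(atp)
    return missing
```

The output of something like this is only a list of candidates for checking, not something to upload directly - as the Orlen example shows, an unmatched ATP POI may simply not exist.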
Various communities have different approaches to the use of such resources. Especially for use with limited human review, I expect communities may take wildly different approaches. See for example the differing acceptance of adding extra tags to match objects with external databases: it is considered invalid with rare exceptions in some communities, while having some acceptance in others. The same goes for the question of how rare errors need to be for an edit to be acceptable. Maybe fewer errors than typical human mapping should be required? Fewer errors than an expert mapper would make? Some other criteria?
So if you are planning to start importing data, please read and follow the Import/Guidelines page on the OpenStreetMap Wiki.
Note: I am working toward all of the ideas listed here. I started this work with the intention of using such info in StreetComplete. I have thought about adding website tags in Poland and the USA - and if things go well, maybe suggesting it elsewhere too. I have already tried mapping fuel stations remotely and I want to try again; maybe it will go better than with Orlen.
See actual data
You can see the result of experimental processing at https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/ which processes and compares ATP and OSM data, throws away known dubious or incorrect parts, and lists what seems likely to be useful for an OSM mapper.
Click on an area of interest to you to get a listing.
I want to complain about something
Note: data issues such as "specific POI is offset by 10m / 100m / 400m" are sadly normal and not worth mentioning. That is why this data is in general not importable.
But if you see things like "all locations are massively offset" rather than "one of the POIs is massively offset", please report it.
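The difference between "one bad POI" and "everything is shifted" can even be detected automatically once ATP and OSM objects are matched. A minimal sketch, assuming per-spider distances between matched pairs are already computed (the 100 m threshold is an illustrative assumption):

```python
from statistics import median

def spider_offset_report(matched_pair_distances_m, threshold_m=100):
    """Flag a spider whose matched POIs look systematically offset.

    Input: distances (metres) between each ATP POI and its matched OSM
    object, for a single spider. A high *median* suggests the whole
    dataset is shifted; a high maximum alone is just one bad POI.
    """
    if not matched_pair_distances_m:
        return "no matches to judge"
    med = median(matched_pair_distances_m)
    if med > threshold_m:
        return f"likely systematic offset (median {med:.0f} m) - worth reporting"
    return f"offsets look like normal noise (median {med:.0f} m)"
```

The median is used rather than the mean so that a single wildly misplaced POI does not trigger a false alarm.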
Even better: if ATMs are reported as banks, or hot dog stands as fuel stations, or fake shops are listed as existing - please report it.
You can report issues directly at the alltheplaces GitHub repository if you have confirmed that it is an ATP fault and can be fixed by ATP (https://www.alltheplaces.xyz/ has a map with raw ATP data).
But if you are unsure or confused, please report them at the issue tracker of matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data on Codeberg (requires a Codeberg account), or write a PM to me - via the forum or my OSM profile.
Legal "fun"
Legal/license/copyright and copyright-like issues are not forgotten, but are discussed in a late section so as not to scare away normal people.
As I understand it, we can use in OSM mapping
- data from opening hour signs and shop signposts
- opening hours info posted on a shop's own website
- a bunch of opening hours info posted on a chain's website (for example, a listing of all shops of a major brand)
See the LWG minutes that discussed the last case.
AllThePlaces is a project that collects (among other things) data such as "here is what chain XYZ claims about their shops".
This may be worrying too much - but I am not 1000% sure whether this decision also covers locations; LWG discussed specifically opening hours. (LWG was discussing it partially because I bothered project maintainers about the legal status of ATP data, so I suspect that I am worrying too much.)
There is also another unfun part: ATP collects not only the first-party data discussed there. It republishes, for example, https://ourairports.com/ (which describes itself as public domain). In principle, ATP could start to include another dataset with other restrictions. See the case of OpenAddresses.io, which contains some data described by the project itself as "open" but with restrictions that make it impossible to use, for example, in OpenStreetMap. See also the issue "Maybe mention in readme that ATP takes data from first party websites?" (alltheplaces/alltheplaces #8790 on GitHub).
So it seems to me that any use of ATP data in OSM would need a review of new spiders, checking whether the data is first-party or otherwise usable.
I do not feel comfortable with some spiders (see the alltheplaces issue #8790); for example, I disabled for now "cbrfree_au", "james_retail_gb", "queensland_government_road_safety_cameras_au", "terrible_herbst", "thales_fr" as slightly suspect and pending further investigation.
Why am I posting this?
If some (or all) of the plans listed here seem to be harmful/bad/hopeless/doomed ideas, especially if the problem was not mentioned here, feedback is welcome. I have spent quite a lot of time on this project and it seems promising, but maybe I missed some major problem that should be addressed. Or maybe the review mentioned here is inadequate.
Review of the content and data listed at https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/ is welcome. "Some POIs in spider XYZ are offset from the given location" is a common, unfixable issue and not reportable, but reports about all other data issues are welcome (best to report as described there, though you can also comment here in this thread).
The alltheplaces GitHub repository seems a nice project to contribute to, especially as the bottleneck is not at PR review and merging - see its pull requests page - merging happens quite quickly (with more than 6 000 PRs merged so far), and nearly all the work is strongly modularized into specific spiders, so it is much easier to change or add something. I had a good experience there.
If you managed to read this far - congratulations! Please comment if you have any concerns/feedback or if I missed something worth considering.
Feedback about the processed data I prepared is also highly welcome.
And from my experience - the ATP project also accepts pull requests and acts well on them.
Disclaimers
This post is not an official statement of any organization.
I received external funding (from the European Union via NGI0, via NLnet) for development of software to compare OSM and ATP data and to create editing support in StreetComplete.
Note that I will be paid for implementing this, not for releasing this editing feature to the general community, so the conflict of interest is reduced here: I will be paid even if this feature turns out to be misused/not usable/too dangerous and in the end is not included in StreetComplete - or is removed due to some problems.
I explicitly excluded from grant payment the time not spent on making this comparison software and the StreetComplete quest. For example, time spent on a potential import (and on writing this post) comes from my hobby time.