Initial proposal for ongoing maintenance update of POIs from All The Places data

There have been many successful one-off imports of All The Places (scraped information about points of interest from brand websites) to add websites and contact details, including [1], [2], [3], [4], [5], [6] and more.

These are very valuable but, as we know, data like opening hours, phone numbers and even websites change over time. I would therefore like to start a proposal for continuous maintenance, using ATP’s first-party data to keep OSM up to date.

Update Logic

To ensure that no mapper’s work is overwritten with bad data, I propose a strict stability rule where an update would only be made to OSM if all of the following hold:

  • ATP has an identical value for two weeks in a row
  • The ATP value changes and remains stable for a further two weeks
  • The current OSM value matches the older ATP value
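The three conditions above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `proposed_update`, the idea of one ATP value per weekly run, and the exact string comparison are all hypothetical, and real values such as opening hours would probably need normalising before comparison.

```python
from typing import Optional, Sequence


def proposed_update(weekly_atp: Sequence[str], osm_value: str,
                    stable_weeks: int = 2) -> Optional[str]:
    """Return the value to write to OSM, or None if no update should be made.

    weekly_atp holds one ATP value per weekly run, oldest first (assumed format).
    """
    if not weekly_atp:
        return None

    # Length of the trailing run of the newest ATP value.
    new_value = weekly_atp[-1]
    i = len(weekly_atp)
    while i > 0 and weekly_atp[i - 1] == new_value:
        i -= 1
    if i == 0 or len(weekly_atp) - i < stable_weeks:
        return None  # no change seen, or the new value is not stable yet

    # Length of the run of the older value directly before the change.
    old_value = weekly_atp[i - 1]
    j = i
    while j > 0 and weekly_atp[j - 1] == old_value:
        j -= 1
    if i - j < stable_weeks:
        return None  # the older value was not stable before the change

    # Only update when OSM still carries the older ATP value.
    if osm_value != old_value:
        return None
    return new_value
```

With this sketch, a history of `["A", "A", "B", "B"]` and an OSM value of `"A"` would propose `"B"`, while an OSM value that a mapper has set to anything else would be left alone.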

Implementation

First stage: Updates are displayed on a dashboard webpage to be checked and reviewed manually. All spiders start here.

Second stage: Once reviewed, certain spiders could be enabled to have changes made for a limited set of tags by a bot account. Mismatching tags would be displayed on the dashboard.

Notes

  • No new nodes are being created or geometry being changed

  • Matches are made using brand, brand:wikidata and either website or ref

  • Stable data such as addresses and NSI tags would not be touched

Licensing and Copyright

Based on the LWG decision and previous discussions, only spiders which take data directly from the brand’s own domain will be included. Any spiders which use another service where rights are ambiguous will not be included.

Approval Process

Rather than overwhelming the forum with a new thread for every spider being added, I am suggesting the following:

  • Start by getting consensus on the bot logic and safeguards here
  • Do a gentle start, beginning with a few brands in South Africa where I can survey the opening hours of some locations to verify the spider outputs
  • Create a single forum thread to announce new spiders, linking to the previews of their matches. If no objections are raised after a suitable waiting period then the bot would be enabled to make updates

Looking forward to your feedback!

Sounds like a good idea! Would it be the same as your phone report? So, showing the difference between the two to manually accept them (in the beginning).

If you would like, I could check some brands from the Netherlands?

Similar idea, but probably a different display which does not emphasise that this is a safe fix in the way that the phone report does for any mismatching items.

I see the process something like this:

  1. Check spider for suitability and data quality
  2. Perform matching with OSM data, adding or updating website or adding a ref tag if the source appears to have stable IDs available (with community approval where appropriate).
  3. Add the spider to the tool and review matches on the dashboard
  4. After at least two weeks, enable the bot for the spider.

Steps 1 and 2 can happen now and are useful anyway (having correct and up to date websites in OSM).

I don’t love this idea but I don’t hate it either.

Could the spider DM the mapper who first set the opening hours, as a further guardrail? “SomeBot just changed the opening hours on [description of element] because [narrative of update logic]. You set those hours on [date]. You can [call to action] if you think the bot has made a bad edit.”

Sometimes places don’t keep to their published opening times, after all, especially where the times are published by head office.

I would vote against using ATP’s “ref” as OSM’s ref.

unless they are actually matching ref definition

In general, I would not add ref:atp or anything similar to OSM, and I would remove any that have already been added.

My reasoning for the ref was that I have seen several brands that have totally changed their store website URL in the past couple of years, sometimes without a redirect. If there is something like a store code that appears to be stable, then adding that gives an alternative identifier to track the item and could allow the websites to be updated in such cases.

Hence my logic of checking that the originally mapped value matches the original value in ATP. If the brand publishes value A for opening hours and OSM also has value A (from whatever source) and then the brand intentionally changes their website to show value B then I would take that as evidence that the hours have changed. Additionally, this is why the plan is not “update all opening hours on all OSM objects” but only using spiders which appear to have good data.

Contacting the original mapper (bearing in mind limitations of OSM history of objects) would be technically possible, although it would require a bit more processing.

As I understand it, the limitation is that the current version shows only the most recent mapper to touch it. But the full version history is easily available.

if it is actually store code: fine

if it is just hash generated by ATP or similar: no

The limitations here include cases where someone moved the shop object from a node to a way, deleted and recreated it, reverted vandalism, deduplicated it, etc. In those cases it can be tricky to automatically find the original author.

I make extensive use of ATP data in my Chains tool for the UK. This is more about existence of stores than attributes.

Based on my experience, here are some things to watch out for with your plans:

  • ATP can mis-parse things, and there can also be errors on the store’s own web pages. In terms of attributes, I only really look at websites and postcodes, but both can contain errors for individual stores. If you do automated comparisons and flag discrepancies for manual review, you may want to maintain a correction file as a way to suppress false positives.
  • There can be systematic errors in the ATP data, e.g. websites constructed by a spider may not match the actual store website URLs. Sometimes I need to make bulk corrections during input into my tool. It would be good if you reviewed any apparent mass changes to attributes before any automated updates.
  • I only consider correctly tagged brand:wikidata for matches and then use geographic location and postcodes to match OSM objects to the ATP list. This mostly works, though sometimes neighbouring stores that are close together can get matched the wrong way round. Care would be needed if you are going to update attributes based on matches like this.
  • Sometimes OSM mapping leads to multiple objects with the same brand:wikidata tags for a single ATP entry (e.g. a supermarket that also has a pharmacy counter, a petrol station, a petrol station shop, a car wash and numerous amenity=trolley_bays). You may need some way to filter out the unwanted OSM objects to ensure your matching is to the correct one.

Thank you very much for this, that is all very helpful.

A corrections file would be quite possible so I will look to include that.

I would plan that spiders be reviewed in terms of website accuracy etc. before being included, but of course these can change. I will put in a safeguard stop if it is suggesting ‘too many’ changes.
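As a rough sketch of how the corrections file and the ‘too many’ safeguard might fit together: everything here is an assumption for illustration, including the function name `filter_changes`, the dict layout of a proposed change, the idea of corrections as (element, tag) pairs, and the 20% threshold.

```python
def filter_changes(changes, corrections, matched_count, max_fraction=0.2):
    """Drop suppressed changes and decide whether the spider should be halted.

    changes: list of dicts with 'osm_id', 'tag' and 'new' keys (assumed layout).
    corrections: set of (osm_id, tag) pairs loaded from a corrections file.
    matched_count: number of OSM elements matched for this spider.
    Returns (kept_changes, halt).
    """
    # Known false positives are suppressed before anything else is checked.
    kept = [c for c in changes if (c["osm_id"], c["tag"]) not in corrections]

    # A large share of matched elements changing at once suggests a spider
    # or parsing problem rather than real-world changes, so stop for review.
    halt = matched_count > 0 and len(kept) / matched_count > max_fraction
    return kept, halt
```

When `halt` is true, no edits would be made for that spider and the proposed changes would only be shown on the dashboard for manual review.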

I wasn’t going to use proximity matching, due to complexity and some poor location data. Instead, I was planning to use the website or ref, and in case of any duplicates being found, no edits would be made but the item would be reported. I think it’s unlikely that someone would add a website to a car wash, but I suppose a supermarket and a pharmacy counter could have the same website tagged. At the moment, these would be skipped due to duplication, or I could add category checking (shop=supermarket etc.)
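A minimal sketch of that matching, assuming ATP items and OSM elements are available as simple dicts (the field names and the function `match_items` are hypothetical, and only brand:wikidata plus website or ref are compared here for brevity):

```python
from collections import defaultdict


def match_items(atp_items, osm_elements):
    """Match ATP items to OSM elements by brand:wikidata plus website or ref.

    atp_items: dicts with 'id', 'brand:wikidata' and optional 'website'/'ref'.
    osm_elements: dicts with 'id' and a 'tags' dict.
    Returns (matches, duplicates); duplicates are reported, never edited.
    """
    # Index OSM elements by (brand:wikidata, key, value) for website and ref.
    index = defaultdict(list)
    for element in osm_elements:
        tags = element["tags"]
        wikidata = tags.get("brand:wikidata")
        if not wikidata:
            continue
        for key in ("website", "ref"):
            if tags.get(key):
                index[(wikidata, key, tags[key])].append(element["id"])

    matches, duplicates = {}, {}
    for item in atp_items:
        candidates = set()
        for key in ("website", "ref"):
            value = item.get(key)
            if value:
                lookup = (item.get("brand:wikidata"), key, value)
                candidates.update(index.get(lookup, []))
        if len(candidates) == 1:
            matches[item["id"]] = next(iter(candidates))
        elif len(candidates) > 1:
            # e.g. a supermarket and its pharmacy counter sharing one website
            duplicates[item["id"]] = sorted(candidates)
    return matches, duplicates
```

Category checking (shop=supermarket etc.) could be added as an extra filter on the candidates before the duplicate test.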

I think category checking makes sense, since this data should already be in ATP because of the NSI matching.

Why not also check addr:* for matches as an alternative to ref and website? Or is that what you meant by proximity matching?

Matching the address data from ATP and OSM would be too messy: they are not really equivalent, and there are too many cases of missing address data in OSM anyway.

Other people have done work to do proximity matching and manual conflation of ATP data to add websites to OSM elements, so now we can use that to keep opening hours up to date.

I am thinking about check_date:opening_hours and so I wanted to do a quick poll to get some thoughts from others.

  • Always add check_date:opening_hours
  • Update check_date:opening_hours if it was previously tagged
  • Don’t touch check_date:opening_hours
  • Remove check_date:opening_hours

In all cases, I think the value should be the date that the change was detected in the ATP data.

Most (all?) chain stores in Norway keep their websites updated with opening hours, so you could also argue that it should be today’s date, refreshed whenever the bot rechecks the data.

(I suspect this automated edit will be perceived differently in different communities, so you may wish to roll it out country by country)

Hello all!

I’m new to this forum, but I am already working on a similar project for the French community. I have created a tool for conflation and user validation between ATP spiders and OSM points of interest (points, polygons, multipolygons). The tool is named atp2osm and is currently designed for French territory (conflation is based on French departments, proximity and a matching data item such as phone number, email, website or name), but it could easily be adapted for other countries or areas.

To validate a spider, the user validates a sample of the dataset. The first feedback we have on ATP data quality is not great: as you can see on the history page, there are a lot of “canceled” (annulé in French) imports due to data issues in the dataset. So human checking is required for every spider I would say.

Thank you, and it looks like the two tools can work in parallel.

I see that you are not changing any tags, only adding. However, I would mainly be looking at changing tags.

Certainly, the data quality is variable, so there will need to be suitable review of spiders before they can be included.

By the way, “Voir plus” means “see more”, and there you can find the reason why the spider shouldn’t be used for OSM.

I checked some results, here Intrend, Bas-Rhin.

  • From a wrong number on the website Intrend | Roppenheim The Style Outlets, ATP removed the erroneous “leading 0” (no longer leading once the international prefix +33 was added), so OSM now has a correct phone number.

  • From the detailed opening hours information at Horaires d’ouverture | The Style Outlets Roppenheim, ATP made a simplified version (omitting the bank holidays and, more problematically, the days with different opening hours). So OSM now has incorrect opening hours for some days, but overall they are better than before.

So overall an improvement, but checking can be tedious.

So human checking is required for every spider I would say.

Absolutely.