Proposed imports of website tags for Kaufland, based on ATP first-party data

This is basically the same as Proposed imports of website tags for NKD, based on ATP first-party data - just for Kaufland rather than NKD.

The main difference is that the links now go to Mechanical Edits/Mateusz Konieczny - bot account/Kaufland list - OpenStreetMap Wiki (object list) and to alltheplaces/locations/spiders/kaufland.py at master · alltheplaces/alltheplaces · GitHub (ATP code).


I matched the OpenStreetMap and All the Places datasets.

Several matching rules were used to get a set of cases where matches are very confident.
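The matching rules themselves are not spelled out in this thread. Purely as an illustration (my own sketch, not the actual matcher), a "very confident" match might require the same brand tag plus close proximity:

```python
from math import asin, cos, radians, sin, sqrt

def distance_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in metres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def confident_match(osm, atp, max_dist_m=100):
    # Illustrative rule only: same brand tag and locations within 100 m.
    return (osm.get("brand") == atp.get("brand")
            and distance_m(osm["lat"], osm["lon"], atp["lat"], atp["lon"]) <= max_dist_m)
```

The real matcher presumably uses more signals (names, addresses, opening hours); this only shows the general shape of such a rule.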

In some cases OSM has no website tag, or one leading to the main page, while ATP has a POI-specific one.

I propose to import them for the Kaufland brand, and to repeat the import if more matches appear.

The tricky part is that, due to my mistake, the import has mostly already happened. I am sorry for that; it is obviously the wrong order.

If the import is rejected or not accepted, I will revert all such edits.

If the import is accepted, I will post changeset comments on the relevant changesets explaining that it was reviewed and accepted in the end, and make future edits based on new matches once they are found. This is especially likely for newly opened shops.

OSM objects were edited / will be edited only if:

  • there are no website or contact:website tags
  • or these tags link to the main page rather than a specific store page
  • or these links redirect to the found website address
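Those conditions can be sketched as a small predicate (my own illustration; only website and contact:website are real OSM keys, the helper name and the generic-URL pattern are mine, and the redirect check from the third rule is omitted since it needs a live HTTP request):

```python
import re

# Assumed pattern for "main page rather than a specific store page".
GENERIC_URL = re.compile(r"^(https?://)?(www\.)?kaufland\.(de|com)/?$")

def may_overwrite_website(tags):
    current = tags.get("website") or tags.get("contact:website")
    if current is None:
        return True   # no website/contact:website tag at all
    if GENERIC_URL.match(current):
        return True   # links the main page, not a specific store page
    return False      # some other link is present: leave the object alone
```

Any object failing this check keeps its existing tag untouched.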

Data is taken by the ATP project from the Kaufland website, see alltheplaces/locations/spiders/kaufland.py at master · alltheplaces/alltheplaces · GitHub

You can see the list of edits performed at Mechanical Edits/Mateusz Konieczny - bot account/Kaufland list - OpenStreetMap Wiki

Mateusz Konieczny - bot account - ATP import | OpenStreetMap already imports website tags for Poland in general and has done some proper imports in the USA. Editing is currently suspended as I want to clean up these border-crossing imports one way or another. It took me a few months to find the free time to write the code necessary for the cleanup, but I can now identify the impacted areas. This thread is the next step of the cleanup; some other brands in Germany and in some other countries were also affected and will be handled as well.

I have decent experience with bot edits, see Mechanical Edits/Mateusz Konieczny - bot account - OpenStreetMap Wiki


Can you briefly explain to me what “ATP” means?

Edit: Ah! ATP = All The Places


ATP is GitHub - alltheplaces/alltheplaces: A set of spiders and scrapers to extract location information from places that post their location on the internet. It has a bunch of spiders scraping data from websites across the internet and processing it.

Part of this data is first-party data (in this case, data about Kaufland stores published by Kaufland) that, if I understood https://osmfoundation.org/wiki/Licensing_Working_Group/Minutes/2023-08-14#Ticket#2023081110000064_—_First_party_websites_as_sources right, can be used for OSM mapping.

Once a week, for example, they automatically open the Kaufland websites and record their info about shop locations, addresses, opening hours, websites and so on, and publish it at https://alltheplaces.xyz/
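For illustration, pulling the per-store websites out of one spider's published GeoJSON output might look like this (a sketch; the feature/properties shape follows the GeoJSON convention the ATP output appears to use, and the sample data is mine):

```python
import json

def store_websites(geojson):
    """Map (lat, lon) -> website for every Point feature that has one."""
    out = {}
    for feature in geojson.get("features", []):
        props = feature.get("properties", {})
        website = props.get("website")
        geometry = feature.get("geometry") or {}
        if website and geometry.get("type") == "Point":
            lon, lat = geometry["coordinates"]  # GeoJSON order is lon, lat
            out[(lat, lon)] = website
    return out

# A one-feature sample in the assumed shape:
sample = json.loads("""{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "geometry": {"type": "Point", "coordinates": [8.6, 49.9]},
   "properties": {"website":
     "https://filiale.kaufland.de/service/filiale/pfungstadt-2283.html"}}]}""")
```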

I am processing it further using an experimental ATP-OSM matcher (graticule map)

You can see how they describe themselves at GitHub - alltheplaces/alltheplaces: A set of spiders and scrapers to extract location information from places that post their location on the internet.

(if you want a simpler or more in-depth description, please let me know)


Sounds good to me.


This is now the second case with the wrong order of bot editing. How many more of these mistakes has the bot made that you would like to present to us bit by bit?


I didn’t keep count, but @Mateusz_Konieczny has listed which other ones there were:


It is an imposition… :frowning:

For the time being, I reject any further bot operations in this context. For me it is not guaranteed that such a situation will not occur again. As a mapper I can check 1-2 or maybe 3 entries; more than that is not manageable. I cannot judge whether all the bot data is really valid, or whether…

streckenkundler

@Mateusz_Konieczny, I would suggest not starting 5 more discussions on the same topic. It would be more efficient to have one discussion on whether we (the German community) want to keep the accidental changes your bot made or not. The general feedback in all the 4 (?) threads is mainly positive anyway, and I personally don’t think it’s worth the effort discussing whether we want to keep the 3 IKEA website changes or not…

If you want to run your bot in the future in Germany, better start a fresh discussion.


I would be entirely fine with such a solution!

If anyone is interested in a more detailed listing of the affected objects, the added links etc., I would be happy to provide such info.

I would not expect checking every single value. If I were reviewing an edit that was not my own, I would do some of the following:

  • spot checks of some randomly selected entries
  • checking entries in my area
  • checking the provided documentation
  • an audit of the source code making the changes
  • checking whether the person making the edit seems trustworthy

Hello Mateusz,
I looked at your wiki page for Kaufland sites around my hometown (Darmstadt) and found only a minority of them.

OK:
https://www.openstreetmap.org/node/5941367266     https://filiale.kaufland.de/service/filiale/pfungstadt-2283.html
https://www.openstreetmap.org/way/201685962       https://filiale.kaufland.de/service/filiale/dieburg-4780.html
https://www.openstreetmap.org/node/5861506975     https://filiale.kaufland.de/service/filiale/oppenheim-2523.html

missing:
https://www.openstreetmap.org/relation/17477445   https://filiale.kaufland.de/service/filiale/weiterstadt-3980.html  # no website tag
https://www.openstreetmap.org/node/36742690       https://filiale.kaufland.de/service/filiale/roedermark-6470.html   # no website tag
https://www.openstreetmap.org/node/2465457118     https://filiale.kaufland.de/service/filiale/dreieich-4163.html     website=https://www.kaufland.de/
https://www.openstreetmap.org/way/61348724        https://filiale.kaufland.de/service/filiale/ruesselsheim-6440.html website=https://www.kaufland.de/
https://www.openstreetmap.org/node/4406517249     https://filiale.kaufland.de/service/filiale/raunheim-2253.html     website=https://www.kaufland.de/?cid=F3183B02C1200K01000W01030000D1000E1000F1000G1000H1000

The missing sites are all available via the “source_uri” .klstorefinder.json that I found on the alltheplaces.xyz map.
As far as I see, they would all benefit from the update of the website tag and should match your target pattern.
Why are they not on the list?


From what I see, my matcher found the same links as you did!

Some of these links were not added by that bot run 4 months ago but got matched later by improved code or improved data. Or maybe I stopped the bot before it added all of them?

Some still await automated verification of whether the link works.

One of them is blocked by an already present website tag and will not be edited.

That restriction is in place out of caution, to avoid breaking existing data.

If there is interest, I can list such mismatches in a more readable form, in addition to kaufland website import candidates / Import possibilities

From looking at kaufland website import candidates, the likely match is https://filiale.kaufland.de/service/filiale/roedermark-6470.html but whether the link works has not yet been verified.

I am doing this as ATP occasionally features broken links (for one reason or another).

(that reminded me - I have restarted the script checking URL validity)
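A minimal liveness check could look like the following (my own sketch, not the author's actual script; the classification logic is separated from the network fetch so it can be tested offline, and redirected links are treated as suspect rather than valid):

```python
from urllib.request import Request, urlopen

def verdict(status, final_url, requested_url):
    """Classify a fetch result for a candidate website tag value."""
    if status != 200:
        return "broken"
    if final_url != requested_url:
        return "redirected"  # works only via redirect: flag for review
    return "ok"

def check_link(url, timeout=10.0):
    """Fetch url and classify it; any network failure counts as broken."""
    try:
        req = Request(url, headers={"User-Agent": "osm-link-check"})
        with urlopen(req, timeout=timeout) as resp:
            return verdict(resp.status, resp.geturl(), url)
    except OSError:
        return "broken"
```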


Relation: ‪Kaufland‬ (‪17477445‬) | OpenStreetMap - I see it is now matched to Dein Kaufland Weiterstadt | Kaufland

(No idea why it was not matched earlier. Maybe the OSM data was improved in the last 4 months? Maybe the Kaufland website was fixed? Maybe my matching code was improved? Maybe I noticed the bot escaping Poland while it was adding Kaufland links, so it only added part of them?)


Node: ‪Kaufland‬ (‪2465457118‬) | OpenStreetMap - link not checked yet, would get https://filiale.kaufland.de/service/filiale/dreieich-4163.html


Way: ‪Kaufland‬ (‪61348724‬) | OpenStreetMap - now matched to https://filiale.kaufland.de/service/filiale/ruesselsheim-6440.html


Node: ‪Kaufland‬ (‪4406517249‬) | OpenStreetMap - matched to Dein Kaufland Raunheim | Kaufland - but the import is blocked as a different website tag is present


PS: I also updated the code that generates this listing website while investigating these matches, though it is not pushed immediately. The next update should make kaufland website import candidates slightly nicer (newlines in HTML, elimination of None links, “WILL BE WRONG” changed to “MAY BE WRONG”)


The wiki page is up, see Mechanical Edits/Mateusz Konieczny - bot account/import Kaufland websites in Germany from ATP - OpenStreetMap Wiki


Hi Mateusz,
to analyze the current situation regarding website tags on Kaufland supermarkets, I created the following Overpass Turbo query:

The bright green (lime) circles represent the objects last touched by your bot (where uid=22544915)[1].

The dark green circles are the branches which already have a website in the current format, matching the regexp[2]:
^https://filiale\.kaufland\.de/service/filiale/[a-z-]+[0-9]+\.html$
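As a quick sanity check, that pattern can be exercised against URLs quoted earlier in this thread:

```python
import re

# The "current format" pattern quoted above.
CURRENT_FORMAT = re.compile(
    r"^https://filiale\.kaufland\.de/service/filiale/[a-z-]+[0-9]+\.html$")

# A branch page in the current scheme matches; the brand homepage does not.
assert CURRENT_FORMAT.match(
    "https://filiale.kaufland.de/service/filiale/pfungstadt-2283.html")
assert not CURRENT_FORMAT.match("https://www.kaufland.de/")
```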

The bright red circles represent objects with no website tag. So it seems that you probably interrupted the bot while it was running; otherwise I would not have expected so many of them to survive :wink: (BTW: a green fill color indicates objects where the current value is tagged as contact:website=*).

This is an interesting point, where I would like to propose making the bot stronger by weakening this rule for the sake of better tag-value quality.
There are basically two different issues where the bot can detect a bad value by using regular expressions as a quality check:

  1. website tag values should be specific, linking to a page for the branch and not for the brand. Such unspecific values may be tolerated in brand:website, but should not be used in website (see the OSM wiki, Key:website: “Choose specific URLs over homepages.”).
  2. website tag values should use the current (up-to-date) scheme of the referenced website. Old links (even if the website still supports them via redirects) are not worth preserving. Such out-of-date web links are regarded as legacy by the website provider, and there is no guarantee of how long they may still be supported.

@1. Here is the pattern that I would recommend for finding such unspecific (replaceable) values:
^(https?://)?((www|filiale)\.)?kaufland\.(de|com)/?$
It currently finds (dark red in the picture above):

      # Value
     73 https://www.kaufland.de/
     15 https://www.kaufland.de
      6 www.kaufland.de
      4 http://www.kaufland.de
      3 https://filiale.kaufland.de/
      2 https://kaufland.de/
      1 kaufland.de
      1 https://www.kaufland.com/
      1 https://filiale.kaufland.de
      1 http://kaufland.de

@2.
2.1. The pattern ^https://www.kaufland.de/[?]cid=[0-9A-Z]+$ finds two objects (black in the picture) with the same value
https://www.kaufland.de/?cid=F3183B02C1200K01000W01030000D1000E1000F1000G1000H1000 which no longer seems to do what it may have done at the time it was stored.

2.2. With ^https://(www|filiale).kaufland.de\/service\/filiale.storeName=DE[0-9]+\.html$ we can find 28 links which are specific and still work, but if you search for the very same branch (Filial-Suche) you get the same page under a new (current-scheme) URL. These examples are shown in yellow in the above picture.

2.3. The pattern ^https://www.kaufland.de/Home/index\.jsp$ marks 4 unspecific and outdated examples (colored purple).

2.4. ^https://www\.kaufland\.de/Home/index\.jsp[?].*FilialID= finds 8 links (color: magenta) that look specific, but do not work as intended any more:

https://www.kaufland.de/Home/index.jsp?PageId=3&FilialID=6410
https://www.kaufland.de/Home/index.jsp?PageId=3&FilialID=5780
https://www.kaufland.de/Home/index.jsp?FilialID=4300
https://www.kaufland.de/Home/index.jsp?PageId=3&FilialID=2540
https://www.kaufland.de/Home/index.jsp?PageId=3&FilialID=3050
https://www.kaufland.de/Home/index.jsp?PageId=3&FilialID=1340
https://www.kaufland.de/Home/index.jsp?FilialID=5740
https://www.kaufland.de/Home/index.jsp?FilialID=8860
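A quick check that this pattern behaves as described, matching the FilialID links but not the parameterless index.jsp covered by pattern 2.3:

```python
import re

# The legacy FilialID pattern quoted above.
LEGACY_FILIALID = re.compile(
    r"^https://www\.kaufland\.de/Home/index\.jsp[?].*FilialID=")

# Both listed variants (with and without PageId) are caught:
assert LEGACY_FILIALID.match(
    "https://www.kaufland.de/Home/index.jsp?PageId=3&FilialID=6410")
assert LEGACY_FILIALID.match(
    "https://www.kaufland.de/Home/index.jsp?FilialID=4300")

# The bare index.jsp (pattern 2.3) is not matched here:
assert not LEGACY_FILIALID.match("https://www.kaufland.de/Home/index.jsp")
```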

Finally, after eliminating all the above, there are four exceptional cases remaining:

(1) https://www.kaufda.de/Filialen/Norden/Kaufland-Bahnhofstrasse/v-f424320387
(2) https://www.schuhcenter.de/

(3) https://www.kaufland.de/Home/index.jsp?et_sub=BR_kombis&et_lid=457427&et_cid=17
(4) https://www.kaufland.de/Home/04_Kundenservice/004_Filialsuche/index.jsp

(1) really seems to be something special :smiley:
(2) looks like a mistake to be corrected.
(3) and (4) are probably even older (kind of rare vintage) legacy links. :nerd_face:

You may not want to implement all the varieties of the @2. legacy patterns, but I think the @1. category of unspecific (brand:website-like) values is worth replacing automatically.


  1. I did not find a way to access the uid in the CSS selectors. That is why I created a dummy tag xuid for it ↩︎

  2. I had to use an ugly workaround, because I did not find a way to use character classes with square brackets in the CSS selector ↩︎
