wikidata tag added to thousands of place nodes. Automated mass edit?

aseerel4c26 · November 17, 2016, 9:41pm

Please see the previous discussion at https://www.openstreetmap.org/changeset/43742323 (the edit series has so far been most active in Germany, but there are edits in America, too). The user kindly stopped uploading further edits at my request so that we could have a discussion. I would like to ask others in the community for comments and advice on how to proceed in this case. Thank you!

I suggest continuing the discussion here in this forum (it is easier to read with several people participating).

DenisCarriere · November 17, 2016, 10:43pm

A lot of validation goes into finding the best possible Wikidata outcome.

Using a semi-automated approach can go very fast.

The tool that was used to find the Wikidata code can be found here:
https://github.com/DenisCarriere/geocoder-geojson

Essentially, you only need to input the name and lat/lng of your search, and the Wikidata code will appear.

In the background it will create a SPARQL query to Wikidata that will look something like this.

http://tinyurl.com/zhs93zu


SELECT DISTINCT ?place ?location ?distance ?placeDescription ?name_en ?name_fr ?name_es ?name_de ?name_it ?name_ru WHERE { 
  # Search Instance of & Subclasses
  ?place wdt:P31/wdt:P279* ?subclass
  FILTER (?subclass in (wd:Q486972))
  
  # Search by Nearest
  SERVICE wikibase:around { 
    ?place wdt:P625 ?location . 
    bd:serviceParam wikibase:center "Point(13.397 52.514)"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "15" . 
    bd:serviceParam wikibase:distance ?distance .
  }
    
  # Filter by Exact Name
  OPTIONAL {?place rdfs:label ?name_en FILTER (lang(?name_en) = "en") . }
  OPTIONAL {?place rdfs:label ?name_fr FILTER (lang(?name_fr) = "fr") . }
  OPTIONAL {?place rdfs:label ?name_es FILTER (lang(?name_es) = "es") . }
  OPTIONAL {?place rdfs:label ?name_de FILTER (lang(?name_de) = "de") . }
  OPTIONAL {?place rdfs:label ?name_it FILTER (lang(?name_it) = "it") . }
  OPTIONAL {?place rdfs:label ?name_ru FILTER (lang(?name_ru) = "ru") . }

  FILTER (regex(?name_en, "^Berlin$") || regex(?name_fr, "^Berlin$") || regex(?name_es, "^Berlin$") || regex(?name_de, "^Berlin$") || regex(?name_it, "^Berlin$") || regex(?name_ru, "^Berlin$")) .

  # Get Descriptions
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en,fr,es,de,it,ru"
  }

} ORDER BY ASC(?distance)
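For readers who want to reproduce the lookup, here is a minimal Python sketch of how a query of this shape could be assembled from a name and a lat/lng. It is not the code of geocoder-geojson: the function names `build_place_query` and `sparql_string` are made up for illustration, and the sketch swaps the anchored regex for plain string equality so that the name needs no regex escaping.

```python
# Hypothetical helper names; this only builds the query text and sends nothing.
LANGS = ("en", "fr", "es", "de", "it", "ru")


def sparql_string(value):
    """Quote a Python string as a SPARQL string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_place_query(name, lat, lng, radius_km=15, langs=LANGS):
    """Build a nearest-place query for one name and one coordinate."""
    name_vars = " ".join(f"?name_{code}" for code in langs)
    # One OPTIONAL label lookup per language, as in the posted query.
    optionals = "\n".join(
        f'  OPTIONAL {{ ?place rdfs:label ?name_{code} '
        f'FILTER (lang(?name_{code}) = "{code}") . }}'
        for code in langs
    )
    # Assumption: exact string equality instead of regex("^name$").
    literal = sparql_string(name)
    name_filter = " || ".join(f"str(?name_{code}) = {literal}" for code in langs)
    lang_list = ",".join(langs)
    # WKT points are written as "Point(longitude latitude)".
    return f"""SELECT DISTINCT ?place ?location ?distance ?placeDescription {name_vars} WHERE {{
  ?place wdt:P31/wdt:P279* ?subclass
  FILTER (?subclass in (wd:Q486972))
  SERVICE wikibase:around {{
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:center "Point({lng} {lat})"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "{radius_km}" .
    bd:serviceParam wikibase:distance ?distance .
  }}
{optionals}
  FILTER ({name_filter}) .
  SERVICE wikibase:label {{
    bd:serviceParam wikibase:language "{lang_list}"
  }}
}} ORDER BY ASC(?distance)"""
```

Calling `build_place_query("Berlin", 52.514, 13.397)` gives a query with the same centre point and 15 km radius as the one posted. The result is only a string; whether and how it is then sent to Wikidata is left open here.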

You also complained it would take too long to validate the user’s edits? So you have the nerve to raise an issue with the OSM community without even looking at the edit? You’re creating a very hostile environment for users who want to contribute good data to OSM.

Feel free to contact me directly if you do find any flaws in the Wikidata codes. If you can’t find issues… please stop accusing users of poor edits when they are valid edits.

n76 · November 17, 2016, 11:47pm

I don’t have a strong opinion on this, but I will say that I noticed these edits occurring in the areas I am interested in and monitor. I looked at the edits in my area, saw what they did, and spot-checked that they were good. Everything looked okay, so I did not raise a ruckus.

My general feeling is that as long as an edit improves the map and, if it is an import, the copyright allows it, then it is a good edit/commit. In an area where the map is blank or filled with data from a bad import from years ago, an edit is acceptable even if it has some issues, as long as the resulting OSM data is better than before the edit.

muralito · November 18, 2016, 3:41am

Please keep calm, we all are working to improve the data.

First, I didn’t see any accusation of poor edits, just a collaborator asking for the details of how the changes were done. There is genuine interest in these details, for example to check for mistakes, or to reuse the code, algorithm, or strategy in other applicable cases.

Anyway, in the areas I am interested in, during the last week I also saw changes adding wikidata (the same “Adding Wikidata tag to Places” from several users). I checked a few of them and all were good, but they were smaller changesets, with only one or a few nodes in each changeset.

aseerel4c26 · November 18, 2016, 7:23am

@DenisCarriere: thanks for joining the discussion here!

Okay, and how are duplicate names handled? E.g. the common village names “Weiler” or “Neustadt”? How do you find the right one in wikidata if you just input the name? If it is visible in your posted code, please excuse me, I have not dug into that too deeply yet.

I spent quite a good amount of time looking at your edit and discussing it (see the changeset link in my first post). I also pointed out issues (loss of position precision).

That I did not. But we have a good practice for discussing imports and automated edits (however you want to coin this one here). And we have that practice for good reason. Please let’s work together to find the best outcome for this wikidata addition effort.

From my POV, I currently see these issues that need to be clarified/documented:

loss of position precision
verifying that you found the right wikidata object
1a. place name exists several times in the real world, how do you choose from all the “Weiler” and “Neustadt” entries?
1b. the right place is missing in wikidata, but another one (exactly one) with the same name exists in wikidata
what about other OSM objects which have already been tagged with the same wikidata ID? Should two OSM objects get the same wikidata ID? E.g. one on the boundary and one on a place node?
edit conflict resolution? No problem, I guess.
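Two of the checks listed above (ambiguous names and shared wikidata IDs) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Candidate` class, the function names, and the 15 km threshold are hypothetical, and it assumes each candidate already carries a distance from the OSM node, as a proximity query like the one posted earlier in this thread would return.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    qid: str            # Wikidata item ID (placeholder values in the examples)
    label: str          # label that matched the OSM name
    distance_km: float  # assumed: distance from the OSM node to the item


def pick_unambiguous(candidates, max_km=15.0):
    """Return the only candidate within max_km, or None if zero or several.

    Several nearby hits (the "Weiler"/"Neustadt" case) and a lone hit that is
    far away (the right place is missing from Wikidata) are both rejected, so
    they can be reviewed by hand instead of being tagged automatically.
    """
    near = [c for c in candidates if c.distance_km <= max_km]
    return near[0] if len(near) == 1 else None


def shared_qids(osm_tags):
    """Map each wikidata ID used on more than one OSM object to those objects.

    osm_tags maps an OSM object reference (e.g. "node/1") to its tag dict.
    """
    by_qid = defaultdict(list)
    for osm_ref, tags in osm_tags.items():
        qid = tags.get("wikidata")
        if qid:
            by_qid[qid].append(osm_ref)
    return {qid: refs for qid, refs in by_qid.items() if len(refs) > 1}
```

Whether a place node and a boundary relation should share one ID is a tagging question the code cannot answer. `shared_qids` only lists such cases so they can be discussed.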

hadw · November 18, 2016, 9:41am

If geographical coordinates are being used from Wikipedia, that is a potential breach of copyright. Wikipedia has a more relaxed position on database copyrights and it is quite likely that a lot of geocoding in Wikipedia is based on Google Maps and therefore must not be transferred to OSM.

SomeoneElse · November 18, 2016, 9:43am

For info, I’m in discussion with another wikidata adder who has added a couple of questionable links local to me.

The same problem crops up in all sorts of areas in OSM - imagine if a number of people carefully add roads to an inaccessible area from all available sources including aerial imagery, and then someone comes along and adds lots more “roads” along every straight line they can see (fences, hedges, etc.) - it devalues the work of all the people who were careful adding data in the first place.

Personally, I’d only add wikidata tags of things that I was familiar with or had some understanding of. Anything else is guesswork.

SomeoneElse · November 18, 2016, 9:59am

Indeed - it’s reasonable to assume that a lot of wikidata positional information comes from Google. In the problematical examples near me it doesn’t look like that’s a factor (people are just searching for something that “might match” by name) but I’m concerned that processes such as https://github.com/mapbox/mapping/issues/242#issuecomment-261457939 explicitly do use wikidata positional information.

It’d be a shame if all wikidata information had to be redacted from OSM simply because a few people were careless.

LogicalViolinist · November 18, 2016, 11:08am

OK, just to confirm: in no way are geographical coordinates being transferred to OSM.

wycbtma · November 18, 2016, 12:34pm

Hi Denis,
please don’t take offense so easily just because the sheer number and speed of your changes prompted someone to ask for a break so that more people could take a closer look first. I’m not someone who could judge that stuff myself, but the difference between your more-or-less manual edits and “normal” manual edits is still the sheer number! Numbers do make a difference to the level of caution, regardless of the methods, because the faster and more numerous the mass changes, the higher the chance of mass accidents too, and then it would be too late. For this reason alone, I sure am glad if “mass-number” edits of any sort are treated with extra caution. So please don’t take a request for a short break in high-speed mass editing as an offense or a negative judgement; it’s just caution. And it may very well be confirmed that your edits are great and fantastic and everyone is glad about them.

LogicalViolinist · November 18, 2016, 1:36pm

The only thing being added to OSM is the Wikidata ID, which is available under a public-domain dedication: https://creativecommons.org/publicdomain/zero/1.0/ No new information or locations are being added to OSM based on wikidata/wikipedia; as stated above, that would be a risky move to take.

tl;dr :
The foreign key to wikidata is the only thing being added to osm, which in itself is NOT copyrighted and is in the public domain.

SomeoneElse · November 18, 2016, 2:43pm

Ignoring the fact that “public domain” isn’t really a legal status in England and Wales (where OSM is based) and ignores database rights, the argument that’s been put forward is that the use of wikidata location in matching clearly is using that data to contribute to OSM.

I have to say where I have looked at local wikidata contributions near me (not from you) there’s no evidence of proximity-based matching (indeed some of the matching errors would suggest that the local matching I’m looking at is based on name only). However you seem to be using a different mechanism (see https://github.com/mapbox/mapping/issues/242#issuecomment-261457939 ).

Whether it’s OK to use Google-derived data in this way is something that as a community we’d need to discuss. In the context of the UK it has been discussed on the “imports” list (see https://lists.openstreetmap.org/pipermail/imports/2016-March/thread.html#4342 ), and other communities may have had similar discussions that I’m unaware of. I’m not a lawyer and am unaware of any English and Welsh case law that could argue either for or against these concerns being valid, but do think that additions such as this should be discussed within the wider community - in the UK some communities were in favour of importing wikidata (which occurred) and some were against.

Of course, many people have added wikidata links implicitly (via iD) or explicitly with local knowledge - no-one’s complaining about that process because the wikidata location isn’t being used in the match; local knowledge is.

LogicalViolinist · November 18, 2016, 3:57pm

I’m going to play devil’s advocate:

Let’s say someone knows Berlin’s name and general location
So what we know is:
Name: Berlin
Location: Somewhere in east Germany, near Poland, kind of

If that user looked up “Berlin”, well, guess what: he might be using Google-derived data. Someone could have created the wiki page and the wikidata item based entirely on Google’s location and name service. So with your logic applied to the name as well, you could never match any data from wikidata or wikipedia, as it might have been created with Google Maps/Bing Maps (lol) data, and visually confirming on the screen is still using said derived data.

There’s a difference between using knowledge to look something up versus blatantly copying data to create something new. I will now take OSM’s example of looking stuff up:
What we know:
Name: Berlin
Location: 52.5170365, 13.3888599
So if we take the same steps as looking up the city, it comes down to the same thing as “using local knowledge”. OSM data has been created and shared by local mappers who are sharing said knowledge that Berlin is indeed located in East Germany.

To my knowledge Google cannot copyright gps locations of locations when the source of the gps location is local knowledge(OSM Editors). Again no data is being derived from wikidata, only the link or ID to the offsite database is being added.

It would be as if you had to look up the website of, say, Berlin Cathedral Church. So you launch Google, look up “Berlin Cathedral Church”, find the website http://www.berlinerdom.de/index.php, and decide to add it to OSM. At that point, are you deriving data from Google? No, because it’s factual and verifiable information.

p.s. WHAT I DO NOT SUPPORT: is using wikidata to create places/nodes in OSM as those may be derived from copyrighted materials.

SomeoneElse · November 18, 2016, 5:16pm

So just to confirm - you are just using “name and local knowledge” searching and you are not using “location and proximity” searching (which was described in https://github.com/mapbox/mapping/issues/242#issuecomment-261457939 )?

With regard to statements such as “Google cannot copyright gps locations” I suspect you need to read up a little bit about England and Wales law (as opposed to others). There’s lots of prior discussion from around the time of the licence change; wikipedia’s starter page on the EU Database Directive is https://en.wikipedia.org/wiki/Database_Directive .

LogicalViolinist · November 18, 2016, 5:48pm

I was not using proximity (geolocation) to look up IDs for this. Most of the information (I say most, as some was from my geography lessons. Would that count as copyrighted material?) came from the name and the OSM node location (i.e. Berlin is in East Germany, in its own city-state, but most importantly in the middle-ish of Brandenburg state, the same way I know London is in the South-East of the UK and not a city near Toronto, Ontario, Canada).

I’ve been to Germany many times during my employment with my previous employer, so I know Germany well.

SomeoneElse · November 18, 2016, 6:13pm

Thanks - good to know.

maxerickson · November 18, 2016, 7:00pm

LogicalViolinist or DenisCarriere could you clarify how this tool is being used? Looking at the changesets, there are hundreds of changes being made per second, more than could possibly be done entirely manually in that time or even just reviewed. Are files somehow prepared with more thorough review and then just uploaded in rapid sequence?

There’s a real problem with uploading thousands and thousands of changes without explaining much about how you are doing it (I think I’ve read most of the explanations and am too thick to have actually figured out how I would do it so quickly myself) and then just saying that someone who wants to question the changes has to review them all. By the time they have done any review, you will have uploaded thousands more changes. This is part of the reason for the policy on automated edits. Not to make life difficult for people with good ideas, but to make sure that other people have a chance to understand what is being done prior to the work being carried out, to make sure that poor edits can be improved and so on.

wambacher · November 18, 2016, 7:14pm

The same is going on for thousands of relations:

e.g. https://www.openstreetmap.org/changeset/43751676

regards
walter

woodpeck · November 18, 2016, 9:00pm

I read a post on talk-ca in which someone from Canada suggested adding wikidata tags in Canada, via a task manager, and I thought well that sounds like a measured approach. Seeing people adding automatically matched wikidata tags world-wide without prior discussion certainly is something else. Apparently I misread “Join for a more data rich Canada”… I’ll revert these mechanical edits but archive them so that if in the future a decision about mass-adding these tags should be reached, the work was not for nothing.

DenisCarriere · November 18, 2016, 9:42pm

@woodpeck on what basis are you reverting this?

Last time you reverted Ottawa, you deleted buildings, places, and POIs. Your workflows are very destructive to OSM, and someone should be monitoring your reverting process because it’s very poor.

OSM Community, please validate @woodpeck’s reverts, since he will actually be deleting data instead of reverting.

We have many examples of @woodpeck’s poor revert workflow.