Persistent and stable identifiers

SomeoneElse · December 27, 2022, 5:27pm

It depends - sometimes that makes sense, sometimes not. If multiple entities share the same physical space there are plenty of ways in OSM to split them.

Minh_Nguyen · December 27, 2022, 5:35pm

The purist view of Wikidata is that it’s just a collection of claims made by others. So if there are two competing ideas about the area covered by “Serbia”, however defined, then those claims can coexist with qualifiers saying who supports or rejects each claim, or in what context each claim is valid.

If the conflicting statements get at the fundamental question of what Serbia is, then ideally there would be separate items. I spend as much time conflating and deconflating items in Wikidata as I conflate and deconflate dual-tagged features in OSM. Both projects are a work in progress.

stevea · December 28, 2022, 5:12am

To be fair, I think it is also correct to say that “wide understanding of ontologies like Wikidata” (and other authority controls) is also “a work in progress.” Speaking personally, even for someone somewhat-well-versed in the use and evolving science of libraries, classification, ontologies, linguistics, computer science, cartography and related digital technologies, these aren’t necessarily new concepts, but they are evolving and emerging. Especially as we are to consider them to be “widely understood.”

They are “somewhat understood,” by a certain segment of “classification geeks” (no offense to geeks, I consider myself a member of geek communities). They are getting to be better understood by a wider segment of people who use such systems, but I would not say that they are “widely understood.” Yet.

salgo60 · December 28, 2022, 11:38am

Agreed, but don’t we have a nice use case with Wikidata and OpenStreetMap: two dedicated open communities where we could maybe get SKOS working and learn more…

As said, I feel Wikipedia and Wikidata have not adopted SKOS as much as I feel we should.

Even when we have Wikipedia articles in many languages that differ between languages, we don’t use etc…
With more than 7,500 external identifiers, I feel Wikidata is a good place to start learning more about SKOS and also persistent identifiers… I see it as a work in progress.

A somewhat related blog post that Denny Vrandečić, one of the architects of WD, wrote: A categorical imperative?

fititnt · December 29, 2022, 4:36am

@salgo60 would you be interested in creating a dedicated thread here to discuss what a SKOS version of the tagging schema we have today on the Wiki would look like? I think it would be worth it.

Context

Some frontend applications also work with SKOS, such as Skosmos, TemaTres, iQVoc, and SkoHub, which are used by organizations like FAO (https://agrovoc.fao.org/), UNESCO (https://vocabularies.unesco.org/), etc. Not sure which one the European Union uses, but EuroVoc (Browse by EuroVoc - EUR-Lex) seems to be a custom app that exports whatever they use for humans to input the data.

Compared to formats that allow strict semantic inference (OWL, at some level RDFS, SHACL, …), SKOS is good enough to make it viable to encode (and likely even somewhat automate) data mining from the Wiki, e.g. generating daily data dumps whose encoding people are unlikely to disagree with (they may disagree with the data, but not with how the data is encoded). We may never fully agree on an “upper ontology”, but SKOS would be plausible. However, this would require some conventions on the output, because obviously this would allow some extra reusability outside TagInfo (for example, if people try to create “pocket dictionaries/thesauri”).
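A minimal sketch of what such an export might look like, assuming the tag documentation has already been mined into a plain dictionary. The `BASE` namespace, the input shape, and the choice to model each value as a concept with its key as `skos:broader` are all assumptions made for illustration, not an agreed convention or an existing TagInfo format.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "https://example.org/osm-tags/"  # hypothetical namespace, not a real service


def literal(text):
    """Quote a string as an English-language Turtle literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@en'


def tags_to_skos(tags):
    """Render {key: {value: description}} as SKOS Turtle text.

    Each key becomes a skos:Concept; each value becomes a concept
    whose broader concept is its key (an assumed modelling choice).
    """
    out = [f"@prefix skos: <{SKOS}> .", ""]
    for key in sorted(tags):
        key_iri = f"<{BASE}{key}>"
        out.append(f"{key_iri} a skos:Concept ; skos:prefLabel {literal(key)} .")
        for value in sorted(tags[key]):
            value_iri = f"<{BASE}{key}/{value}>"
            out.append(
                f"{value_iri} a skos:Concept ; "
                f"skos:prefLabel {literal(key + '=' + value)} ; "
                f"skos:definition {literal(tags[key][value])} ; "
                f"skos:broader {key_iri} ."
            )
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(tags_to_skos({"amenity": {"bench": "A place for people to sit."}}))
```

Sorting the keys and values keeps the output stable between runs, which matters if the dumps are meant to be compared from day to day.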

Post edit: I’m commenting on this because I would be willing to write the software implementation (public domain license) of the data mining for it, so I’m not just saying this expecting someone else to do it.

salgo60 · December 29, 2022, 6:41am

You are more than welcome… I have spent 6 years on Wikidata and created > 25 external identifiers (user salgo60), and my feeling is that persistent identifiers are a must, but it’s also important to define how two knowledge domains are connected, using e.g. SKOS.

WD and OSM can never be “Same As” in the sense that if x is not identical to y,
then there must be some property that they do not share. Maybe we need better semantics to say that this WD object is the same as the one defined on a map in OSM, or that it’s a narrower term…

stevea · December 29, 2022, 7:08am

I would think that “better” ontology description (systems) allow exactly this: how to more-precisely describe that two things are not the same, and/or that one object has “a more narrowly-defined scope.”

Some potentially good reading: Ontology - an overview | ScienceDirect Topics. You’ll note that the very first thing that needs doing in “Ontology Engineering” is identification of purpose: at the outset it is important to be clear about why the ontology is being built and what its intended uses are. If we don’t start with (or have already) at the very least THIS for both OSM and Wikidata (or whatever…TemaTres, SkoHub…), you’ve not only likely lost the audience, you may be somewhat lost yourself. I don’t say this to insult, I merely wish to build on strong foundations. (And by the way, I don’t know what SKOS is, and maybe many others don’t, either, even as we do our best to follow this thread, including self-education and following the trail to read up on SKOS. Similarly, Wikidata, at least in my experience, “arrived suddenly” into OSM and I felt very much like “hey, fellow OSMers, figure out why this ontology has crashed in here on your own, as, I’m not going to explicitly tell you”).

I know it can be tedious to give everyone a primer on what one is talking about (all the time? no,…) but it can be helpful in a forum like this and on topics like these to do a bit of that. Not spoon-feeding, but a few breadcrumbs on the trail can and does help.

woodpeck · December 29, 2022, 8:57am

While linking with Wikidata makes sense for some things, I have more than once encountered problems where someone on Wikidata abused OSM as a geometry storage for Wikidata items, creating relations in OSM that have no place here just to “have something to link to”.

I think that forcing stable identifiers on our mappers would create an extra burden that would detract from our purpose. Every single object would become a potential link target and you’d never know who links to it and with what purpose. I like the current situation better, where we can explicitly link to Wikidata where we think it makes sense, and when we decide to split or merge objects we can determine if and where to keep the link.

Obviously, any method that would make it harder to add stuff to OSM would be an unacceptable burden for mappers, like having to obtain an ID from some ID authority, or a requirement to add distinguishing information. OSM is a project for everyone to participate in. What enables you to make good contributions to OSM is local knowledge, not subject matter knowledge. You do not have to be an expert in the domain of the thing you’re adding to the map - you can add a tree without knowing what kind it is, and you can add a transformer without knowing how many secondary coils it has.

Anyone can take OSM data and do with it what they please on the “output side”, but manipulating the “input side” so that some non-OSM purpose is easier to reach will always be a problem.

To be honest, I am quite happy with Wikidata and OSM being different worlds with different approaches, and would prefer them to be kept at arm’s length. Or maybe the length of a bargepole: while there are some people invested in both projects, the Wikidata mindset is often very different from that in OSM. Wikidata folks in OSM are more likely to run imports and mass edits from the comfort of their office chair than to go out and map, and that’s not a good influence on OSM.

stevea · December 29, 2022, 10:13am

I’ve said this in other places and in other ways, and I don’t want to detract from @woodpeck’s clearly-heartfelt enthusiasm for OSM more-or-less “as it is now.” And while I share that myself, too, I also see an all-or-nothing kind of bifurcation, when it doesn’t have to be that way.

Part of what I find so fascinating about this new(er) Discourse instance is discovering new-for-me and deeply intellectual and technically-advanced topics (like this, again, for me). And I don’t want to sound curmudgeonly (brittle, bad-tempered like an old person who wants to see no change) so I’m very open to “testing” (as ideas, in my mind first) this exciting, new stuff that intersects with OSM. I do so not looking forward to eschewing it completely, saying, “oh, no, too far different from the OSM I’ve known for so many years.” Rather, I want to find a sweet spot: let’s say I change 1% of my efforts to “something new” (an approach, a tool, a new structure for data…) and keep 99% of what I do the same, and I get a great deal more value for changing that 1%. Maybe it’s 4% or even 10%, but if I “double my value” by using a different tool 5% of my time (with no more time invested in my OSM efforts, except the time it takes to invest in learning something new), I very well might do that.

The sweet spot would certainly include “OSM being there, largely for the people who expect it to be there as they know it, AS they know it.” I don’t want to “go too far” in a wild, radical new direction, but I’d be willing to stick my toe in the water if it is exciting, positive and highly leveraged. I can always “pull back” and there is that sense of “balance” and “feedback loops work” (when a human and some technology are mixed up together). I suspect I am not alone in this desire to find this harmonious balance, while exploring new technologies, new paradigms and new methods for “how I OSM.” Maybe I’m dreaming, but really, I think I’m simply looking towards our future, but with both excitement and caution. We can adapt to change, in fact, we should embrace it, knowing we have both a gas pedal and a brake pedal.

pangoSE · December 29, 2022, 12:46pm

I’m an old mapper who started with Wikidata a few years back and learned about the tools, imports, constraints on the infrastructure etc. Through it all I have had a focus on hiking trails, campsites and related amenities.

I have followed a number of discussions about imports into OSM and Wikidata. I’m quite careful when doing edits in OSM en masse; it’s very different from Wikidata, where automated jobs are easily reversible on a large scale. In OSM that is not the case, so seeking consensus before making changes at scale is very important.

I like to craft-map and improve the map by moving about and collecting data myself. I have found that it helps me keep in shape and gives me something meaningful to do.

With Wikidata everything is done from an armchair.

When I made my hiking trail matcher I would have really benefited from a good stable identifier for every single segment of hiking trails in the world, but no such identifier exists. It is thus quite difficult at times to match hiking trails and their segments between OSM and WD.
I have had a very hard time finding good datasets for trails in Sweden, which pretty much makes it impossible to have good coverage of Swedish hiking trails in Wikidata unless I manually investigate a host of websites and scrape or manually extract the information and put it into Wikidata, which I would like to avoid.

The advantage of having official data in Wikidata about hiking trails is that we could potentially find trails that are currently completely missing in OSM and create notes or similar so we can improve the coverage.

pangoSE · December 29, 2022, 1:06pm

Could you link to the osm elements? I don’t understand the problem, have you raised the issue on the talk page in Wikidata?

pangoSE · December 29, 2022, 1:13pm

Hi. You can read more about SKOS here: Simple Knowledge Organization System - Wikipedia.

In short it is a better, more advanced way to link between two heterogeneous datasets like OSM elements and Wikidata items because most of the time the element in OSM is not exactly the same as in Wikidata.

Magnus gave an example above. Here is another: how are campsites mapped in OSM? Is a firepit and a bench element a campsite?

How do we link between the campsite item in Wikidata and OSM? Would it be ok for other mappers if I create a new relation with the firepit and bench and grass and give it a name and link it to the corresponding wikidata item?

Is the campsite in OSM broader because it even includes the trashcan but the Wikidata item does not?
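To make the campsite question concrete, here is a small sketch of how such a link could be written down with the SKOS mapping properties (`exactMatch`, `closeMatch`, `broadMatch`, `narrowMatch`, `relatedMatch`). The helper name, the relation ID `123`, and the item ID `Q0` are placeholders invented for this example, and using the openstreetmap.org browse URL as the subject IRI is an assumption, not an established convention.

```python
# The five mapping properties defined by SKOS for linking concepts
# across different schemes.
SKOS_MAPPINGS = {
    "exactMatch",
    "closeMatch",
    "broadMatch",
    "narrowMatch",
    "relatedMatch",
}


def mapping_triple(element_type, element_id, relation, qid):
    """Return one Turtle triple linking an OSM element to a Wikidata item.

    Assumes a `skos:` prefix is declared elsewhere in the output file.
    """
    if relation not in SKOS_MAPPINGS:
        raise ValueError(f"not a SKOS mapping property: {relation}")
    osm = f"https://www.openstreetmap.org/{element_type}/{element_id}"
    wikidata = f"http://www.wikidata.org/entity/{qid}"
    return f"<{osm}> skos:{relation} <{wikidata}> ."


if __name__ == "__main__":
    # "A skos:narrowMatch B" says that B is the narrower concept, so this
    # reads: the OSM campsite (which also includes the trashcan) is broader
    # than the Wikidata item. Both IDs are placeholders.
    print(mapping_triple("relation", 123, "narrowMatch", "Q0"))
```

The point of the vocabulary is that the link itself carries the "not exactly the same" information, instead of a bare `wikidata=*` tag that implies identity.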

stevea · December 29, 2022, 1:36pm

Thank you greatly for that link; I’m devouring the article like I’m hungry!

SomeoneElse · December 29, 2022, 1:43pm

See the claimed administrative boundary of Serbia and the administrative boundary that Serbia controls. See here for a bit of background, but I’d suggest trying to read a bit more widely around the topic, because different points of view are a great help here.

No. I’d have a more productive conversation with next door’s cat**.

** I have previously attempted to draw attention to geographical errors in wikipedia (“XYZ place that is listed as a village isn’t actually a village”), but it essentially fell on deaf ears. Wiki* cares less about accuracy than the fact that there is something that can be cited (ob XKCD), even if the thing being cited is completely out of context.

Minh_Nguyen · December 29, 2022, 3:30pm

When I started seeing this kind of misuse, especially with boundary types that aren’t very suitable for OSM, I was very tempted to riff on your well-written essay on the difference between relations and categories but couldn’t find the words.

Fortunately, the Wikidata project recognized the problem too and introduced an alternative to OSM linking, “geoshape” statements that link to GeoJSON files hosted on Wikimedia Commons. Some of these files are ODbL-licensed Overpass query results. (This is also an alternative to the English Wikipedia’s previous practice of scraping Google Maps directions into a KML file and dumping it into a wiki template.) Geoshapes could still use better documentation and awareness among Wikipedians.

Minh_Nguyen · December 29, 2022, 4:06pm

In principle, Wikipedia and Wikidata are more appropriate projects than OSM for accommodating different points of view. To the extent that OSM includes both sides of a dispute, it’s to keep the peace within our project or because the “ground truth” is seriously contested. But Wikipedia and Wikidata are also capable of including points of view that are notable despite having fewer facts on the ground.

Case in point: when the Afghan government fell last year, there was very intense edit warring over every Wikidata item related to the country’s government, especially this item about the national flag. Aside from applying semi-protected status (preventing new or anonymous users from editing the item), the project defused the situation by creating an item about each historical variant of the flag and indicating which group accepts or rejects each one.

This nuanced approach directly benefited OSM because some Afghan flags had been mapped as part of flag displays (UN headquarters, Afghan embassies, world-class hotels, mosques, community centers, etc.), but these establishments didn’t suddenly start flying the Taliban flag! The nuance on Wikidata ensured a measure of stability for projects that use Wikidata in conjunction with OSM. The name suggestion index made sure mappers tagged flagpoles with flag:wikidata set to the more specific item about the 2013 flag design, so that people wouldn’t see an out of place, offensive Taliban flag abroad based on country=AF alone.
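A rough sketch of the data consumer side of this, assuming a renderer that picks a flag image from a flagpole’s tags. The function name, the lookup table, and the item IDs are hypothetical placeholders; only the `flag:wikidata` and `country` keys come from the description above.

```python
def flag_item(tags, default_flag_by_country):
    """Pick the Wikidata item describing the flag to draw for a flagpole.

    An explicit flag:wikidata tag wins over anything inferred from the
    country code, so a specific historical design stays stable even when
    the country's default flag item changes.
    """
    explicit = tags.get("flag:wikidata")
    if explicit:
        return explicit
    # Fall back to a per-country default; None if the country is unknown.
    return default_flag_by_country.get(tags.get("country"))


if __name__ == "__main__":
    # Placeholder identifiers, not real Wikidata QIDs.
    defaults = {"AF": "Q_DEFAULT_FLAG"}
    embassy_pole = {"country": "AF", "flag:wikidata": "Q_2013_DESIGN"}
    untagged_pole = {"country": "AF"}
    print(flag_item(embassy_pole, defaults))
    print(flag_item(untagged_pole, defaults))
```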

OSM’s mechanisms for describing geopolitical disputes are comparatively underdeveloped. When keys like disputed_by were first proposed, the proposal relied on very strict criteria about who can dispute a boundary, in an effort to distill these disputes down to a set of simple two-letter codes. But what is to be done about the many disputes between subnational entities? To illustrate this point ad absurdum, I invoked “any tags you like” and tagged the boundary disputes between neighborhood councils within Cincinnati, Ohio, using ad-hoc identifiers. Wikidata identifiers would’ve been more usable and self-documenting, even for humans.

ZeLonewolf · December 30, 2022, 2:13am

I operate a site that uses OSM data (boundaries, and streets). I maintain links to the OSM ID numbers of the original objects. I periodically query the database to see if those objects have changed, been deleted, or if new objects (of the type I care about) suddenly appear. So I’m very familiar with this concern from the data consumer side.
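The periodic check described here could be sketched roughly as follows. This is not the site’s actual code: it assumes each snapshot has already been fetched and reduced to a mapping from (element type, ID) to the element’s version number, and the function and variable names are made up for illustration.

```python
def diff_snapshots(tracked, current):
    """Compare two {(element_type, element_id): version} snapshots.

    Returns the tracked objects that vanished, the ones whose version
    changed, and objects in the current data that were not tracked before.
    """
    deleted = sorted(key for key in tracked if key not in current)
    changed = sorted(
        key for key in tracked if key in current and current[key] != tracked[key]
    )
    added = sorted(key for key in current if key not in tracked)
    return {"deleted": deleted, "changed": changed, "added": added}


if __name__ == "__main__":
    last_week = {("relation", 1): 4, ("way", 10): 2, ("way", 11): 1}
    today = {("relation", 1): 5, ("way", 10): 2, ("way", 12): 1}
    for kind, keys in diff_snapshots(last_week, today).items():
        print(kind, keys)
```

What to do with each bucket is exactly the opinionated decision discussed in this post: the diff only flags candidates for review, it does not say whether a changed object is still "the same thing".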

My view on this is that it’s not a real problem, for the following reasons:

When an object changes, the data consumer software has to make an opinionated decision about what to do. Let’s say your favorite local restaurant closes its doors, and then re-opens as a different restaurant in the same location. Is it the same object or a different object? That answer will depend on the data consumer. Someone making software for restaurant reviews would consider it a different object, and reviews for the old restaurant shouldn’t carry over to the new one. However, another data consumer concerned about tracking user sightings of architectural styles of different buildings in a city might consider them the same object and want to link that history together, and will only care if a building is knocked down and rebuilt.
The world is messy. When one object becomes multiple objects, or multiple objects become one object, the data consumer software has to make an opinionated decision about what to do.
OSM IDs only change if a user deletes and recreates a new object for a feature. This rarely happens. Therefore, data consumers can rely on changes being a relatively rare event that can be handled as an exceptional case.

So my opinion is that a persistent ID is not a useful concept, and someone stating that it’s an issue bears the burden of proof on that, especially if the “solution” results in any kind of burden being added to mappers, as Frederik points out.

I agree with this, and I think wikidata is a fine solution to the problem of linking an object to a real-world concept. It also has the benefit of providing a place to store information that isn’t appropriate for OSM, but we put it there anyways. I look forward to the day when we might even stop storing wikipedia, wikipedia:xx, and name:xx tags for objects with a wikidata tag, as these are all directly accessible in wikidata and pulled in by major data consumers today. Every so often, someone goes through and adds this or that language name:xx tag to many (usually place=) objects, likely in an automated fashion to transliterate them to a different script.

While there is initial support for wikidata in iD and JOSM (via plugin), it’s not as robust as it needs to be. In my ideal world, a user could specify what information they would like to add, and then the editor sorts out which database the information needs to live in without the user having to muck around in the inner details of either (unless they wanted to).

Even with wikidata, this doesn’t solve the fact that sometimes things change in real life, and data consumers must devise appropriate strategies to deal with that. That’s a data consumer problem, not a database problem.

I don’t think it’s helpful in a community project to trot out stereotypes, suggest guilt-by-association, and then use that to label people and then dismiss them as bad for the project. For example, it is a tired stereotype that German mappers only do high-quality craft mapping and meet regularly at pubs to do so, while Americans only sit in their armchairs and dump large quantities of low-quality data into the map. Yet, I have seen multitudes of examples of American craft mapping and surveying, and the most recent instance of a (now blocked) mapper repeatedly importing low-quality data against community wishes in the US was a mapper from Germany!

I would rather promote the idea that people are complex, communities are complex, and sometimes people will agree, and sometimes they’ll disagree. Even though I disagree with @pangoSE’s view that stable identifiers are a problem that needs solving, I think it was perfectly valid for them to raise the discussion topic, and I certainly don’t think they’re a bad influence on OSM merely because of their activity on the wikidata project.

Minh_Nguyen · December 30, 2022, 2:48am

This seems like an argument for limiting a stable identifier scheme to a particular kind of feature, or a particular “layer”, if you will. To the extent that one can generalize about POIs, they turn over much more frequently than other kinds of features. Boundaries are nearly at the other end of the spectrum (except perhaps in active military conflicts, or this city boundary that changed weekly throughout the ’50s and ’60s).

There’s actually a cottage industry of stable identifier schemes for POIs. If you’ve ever searched the Internet for a POI’s website as you map it, you’ve likely encountered some of them in the URL slugs of various online ordering systems and business listing aggregators like Yelp, Foursquare, and Facebook. I suspect that, so far in this thread, we’ve already thought much harder about the problem of persistence and accuracy than some of these systems have.

SK53 · December 30, 2022, 2:22pm

This is way easier, but still non-trivial. I implemented such a thing for business customers at a bank in the late '90s, slightly complicated by the original idea being part of an experimental project to apply data mining & big data to customer segmentation, not supporting relationship managers. Interestingly, the existing persistent customer identifier was not suitable for this part of the business.

In Great Britain we now have open data access to several persistent identifiers created by national authorities:

UPRN (Unique Property Reference Number): includes all sorts of properties, not just houses, but pillar post-boxes, substations, mobile phone masts, and for some odd reason ponds. See wiki.
USRN (Unique Street Reference Number): wiki.
TOID (TOpographic IDentifier): these are the Ordnance Survey’s persistent identifiers and a few have surfaced in various open data sets released by them. So far the most useful I have found are the TOIDs for street names, but for use outside OSM. The technical spec. for use of TOIDs mentions some of the aspects of persistence (e.g., when a building changes size/area over a certain amount it gets a new one).

For many features it is relatively easy to create a relationship between these and OSM elements (one example of OSM addresses, buildings and related to UPRN here), without having to add the external identifier to OSM.
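One simple way such a relationship might be built outside OSM is a spatial match between UPRN points and building outlines. The sketch below is an assumption for illustration and not necessarily how the linked example works: it uses bounding boxes instead of real polygons, and the input shapes and names are invented.

```python
def match_uprns(uprns, buildings):
    """Match each UPRN point to the one building whose bounding box holds it.

    uprns:     {uprn: (lat, lon)}
    buildings: {osm_id: (south, west, north, east)}
    Returns {uprn: osm_id or None}. A UPRN that falls in no box, or in
    more than one, is left unmatched (None) for manual review.
    """
    matches = {}
    for uprn, (lat, lon) in uprns.items():
        hits = [
            osm_id
            for osm_id, (south, west, north, east) in buildings.items()
            if south <= lat <= north and west <= lon <= east
        ]
        matches[uprn] = hits[0] if len(hits) == 1 else None
    return matches


if __name__ == "__main__":
    # Made-up coordinates and identifiers.
    buildings = {"way/1": (52.000, -1.000, 52.001, -0.999)}
    uprns = {"100000000001": (52.0005, -0.9995), "100000000002": (53.0, -1.0)}
    print(match_uprns(uprns, buildings))
```

Keeping the resulting table on the consumer side means nothing has to be added to OSM, and the match can simply be recomputed when either dataset changes.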

The main deficiency of the open data sources of these identifiers is that there is next to no metadata, so UPRNs include ones for defunct properties without even a flag indicating that. I’m raising them here because they are well-documented cases of persistent identifiers of geospatial features.

salgo60 · December 30, 2022, 4:22pm

That is the weakness of an open platform… it depends on having dedicated people. In Sweden we have some “superusers” who care about things like that…