Persistent and stable identifiers

Jez_Nicholson · December 24, 2022, 8:07pm

Okay, I’ll throw out an unpopular opinion: this is all academic. The majority of nodes do not get deleted and replaced, and the ones that can be logically linked to Wikidata even less so. You all are creating clever new schemes to solve something that doesn’t need solving.

Shoot me.

SomeoneElse · December 24, 2022, 8:32pm

My experience is that that is exactly correct. I do various QA on bits of OSM (making sure certain sorts of relations are still valid, filling in gaps in route relations where they have been introduced by accident, that sort of thing). I’ve been doing this for a couple of years now and all I’ve seen is that occasionally a cycle route will be split into a few pieces when it gets unwieldy, and perhaps a super-relation will get created. If you look here you can see that there’s maybe been 1 change per month of the currently 1600+ trails checked (UK+IE).

Elsewhere, perhaps nodes for POIs might get replaced by polygons, but that’s easy to spot too.

Minh_Nguyen · December 24, 2022, 9:50pm

I think I share the view that what we’re able to model in Wikidata probably suffices for most of what we ourselves intend to do with stable identifiers. However, the subtext behind this thread is that, apparently, some OSM data consumers have found it so important to fashion an identifier-based conflation system for OSM data that Overture Maps is touting it as one of their selling points. I guess this is an exercise in seeing if we could bring some of that functionality in-house.

Stable identifiers aren’t really about tracking the history of a real-world feature over time. It’s more about being able to hang some extra metadata off of an OSM feature with some confidence that it refers to the same real-world feature. Change tracking only becomes relevant to the extent that either OSM or the external data source can become outdated.

I don’t have much experience with authority control on POIs, but I’ve seen that linear referencing schemes naturally arise from trying to tie static road data to historic or real-time traffic and incident data. Even so, at some point, the best stable identifier scheme compares poorly to obtaining fresh data and purging stale data. Users don’t care whether you managed to match the traffic jam to exactly the right spot as much as they want the colors on the map to be current. Likewise, restaurant reviews from a decade ago aren’t necessarily trustworthy anymore, even if the cuisine and owner stay the same.

dieterdreist · December 25, 2022, 12:51am

Since several people commented about the challenge of knowing when something changed or not, a good safe approach is to intentionally make it hard to give identifiers for the first wave of amenities, and I mean not “Wikidata level notoriety”, but “Wikipedia level notoriety” (aka already be famous).

We already have identifiers for things with Wikidata-level notoriety: the “wikidata” tag. These also cover everything with Wikipedia-level notoriety.

IanH · December 26, 2022, 3:24am

I see nothing wrong if we or another party use one or more entity tracking IDs. The OSM key needs to be well-formed according to some established naming rules. The key name shouldn’t have to change just because the agency or source behind the index changes its name. That should prevent updates that aren’t caused by changes to the object the node(s) represent.

o_andras · December 26, 2022, 11:34am

I don’t know, I was just throwing alternatives at the wall

That post is about Placekey.

That’s not what I suggested. In that post, X@Y is supposed to be the Placekey ID derived from X and Y. From what I understood from their site, the IDs depend on both geospatial location and something else that isn’t well specified.

That is the problem of having to maintain an extra DB. If some address isn’t in OSM, no big deal, maybe you can still find the street, just not the housenumber. But if you can’t assume high-quality IDs, then might as well have no IDs at all and save yourself some (lots of) work.

I think the point of this thread is to find a solution, but that doesn’t lead to a solution, unless on OSM’s side we can guarantee some things.

Of course, people may be OK with less than ideal (e.g. ID123 from the external DB is now mapped on OSM by w987 instead of n654), and that’s fine. But in that case we don’t need to have this discussion at all on OSM.

“The majority of cars on the roads don’t have accidents. Telling people to drive slowly, to not drink before driving, etc are just solutions to a problem that doesn’t exist.”

This exaggerated tongue-in-cheek “analogy” is just to point out, again, that if the ID system isn’t trustworthy, then it’s basically worthless IMO (see right above).

BTW I’m neither for nor against stable IDs, I’m for OSM. I’m just here discussing possible ways to marry both, without sacrificing anything in OSM.

OP disagrees?

@Minh_Nguyen you can’t add everything and anything you map on OSM to Wikidata, like the mom&pops greengrocer shop at the corner, or the less than popular cafe that’s nonetheless been there for decades.

See above.

Minh_Nguyen · December 26, 2022, 11:48am

I was responding to @Jez_Nicholson, not the OP. (When you reply to a whole post, Discourse counterintuitively indicates that post in a little icon in the top-right corner.)

Sure, there’s a long tail of things in OSM that may not ever make it into Wikidata. However, my point is that the use cases for identifiers that normally occupy our attention as mappers – largely solvable with the help of Wikidata – are not the end-all-and-be-all of use cases for identifiers. In other words, I don’t think a (kind of) stable identifier system based on OSM would be a total fool’s errand, even if we ourselves may not enjoy any tangible benefits from such a system. Whether it’s strategically important for this project to work on this system is another matter.

SomeoneElse · December 26, 2022, 11:59am

It isn’t - as I said above, I use OSM IDs as “permanent IDs” for external checks and they are more than good enough for that.

o_andras · December 26, 2022, 2:40pm

I know, but who’s “we”? “We” seems to suggest that everyone is looking into stable IDs for the same purposes, hence “OP disagrees”, because you can’t use Wikidata for OP’s purpose. And I’m not sure what even are the purposes that “we ourselves” have. I don’t have a use for it, for example.

And here it seems you’re contradicting yourself. Yes, there’s no fixed set of purposes common to everyone, and not even common to “us mappers”, so saying that “our” purposes (as mappers) are largely solvable by Wikidata doesn’t make sense. OP certainly doesn’t think the problem is solvable by Wikidata, otherwise I don’t think they would have created this thread. (It’s implicit that OP is also a mapper.)

Your examples of routes can be explained simply by “relations”. Relations are a mystery to the majority of mappers, so the fraction of mappers that touch relations is for certain minuscule. And how often is it that some bicycle route, or hiking trail, or whatever else changes in reality? There’s not much reason to keep modifying routes on OSM if they don’t change in the physical world. Plus, I think that a relation does not show up in changesets only because its constituent objects have been modified, right? Given these points, it’s only natural that the IDs of those relations are kind of stable, but you can’t say the same for nodes or ways in general.

In some dense areas I’ve seen the same restaurant/etc mapped twice or thrice (as a node or way), right next to each other, for no good reason. If you have to pick one of them, which do you expect to be the most stable? The one closest to its physical position? The one with the best details? The oldest? All of them are good candidates to be the final representative, but all of them are good candidates to be removed too.

Adamant1 · December 26, 2022, 3:03pm

Essentially zero since the project started, according to this chart. Understandably, since they are extremely obtuse and way too easy to mess up. Also, in a lot of cases they probably aren’t used properly anyway. So stable identifiers would be a major improvement in the places where the use cases overlap. Although, obviously, not a 1:1 replacement.

SomeoneElse · December 26, 2022, 3:06pm

No - ways as well, and very generic ways at that. Occasional updates are needed as I mentioned earlier, but it’s not a major issue.

Minh_Nguyen · December 26, 2022, 6:24pm

I know not every mapper has the same goals or ideas. There’s no contradiction; I’m just trying to point out why this discussion may seem academic to some here but still have value. Do you disagree?

dieterdreist · December 26, 2022, 7:06pm

And how often is it that some bicycle route, or hiking trail, or whatever else changes in reality

It is not exceptional for a hiking trail to change in reality, and changes to hiking relations happen even more often, because every split of a member way creates a new version.

fititnt · December 27, 2022, 5:01am

So, from the discussion, and from what seems to be common when looking at very old IDs and relations that still exist: because of their full history, plus the care taken to avoid any kind of deletion (which some complain prevents fast imports), the relations/ways/nodes on OpenStreetMap are, by definition, already close to persistent/external identifiers. What’s actually less desirable (compared to most authority control) is when one concept requires more than one unique identifier. This can happen, for example, when a way is split because a smaller part needs different metadata, so the initial information is copied onto all its new parts.

Potential counterargument: what about newly created features that duplicate something and then get deleted? What about old things (like entire ways without any metadata, likely the result of bad imports) that someone eventually deletes? My reply: even persistent identifiers such as DOIs can be aliased to new ones (see Changing or deleting DOIs - Crossref), and when truly in error (the OSM equivalent being either an early duplicate not used outside, or something very old but without any metadata anyone could use) they could point to a truly 410-gone-forever page without even an explanation (like https://www.crossref.org/_defunct-doi/).

So, I’m genuinely open to criticism or comments here. But my argument is that inside OpenStreetMap, the data is far more often than not already consistent (even before anyone tries any schema/validation on top of it), likely even allowing a decent level of the full retroactive research anyone would expect from most well-crafted library heading systems.

Potential counterargument: but if we compare Wikidata (SPARQL) to OpenStreetMap (Overpass QL), the Wikidata approach seems more organized! My reply: the way OpenStreetMap is organized (explicitly a geodatabase, and also far less scary to contribute to than Wikidata directly), even with its free tagging approach, means that using the most popular conventions, such as the ones used for rendering, would make OpenStreetMap data in RDF form much better interlinked than Wikidata is for places. But it needs inference turned on and pre-cached (which could easily expand the data to the point of making Blazegraph collapse). Contrary to naive opinions, Wikidata is far less complete for places than OpenStreetMap is, and it is no surprise that OSM data is often used to argue other datasets.

So, does saying node/way/relation IDs are okay mean the discussion of “persistent IDs” is not necessary? It is. But the actual downside is not quality; it is that duplicates are not as easy to track from outside. The expectation of a library catalog (or, in DOI terms, a Registration Agency) would be that if users ask for an old ID that has evolved into something else, we expand this new collection.

Hypothesis: have a dedicated metadata search for IDs that “evolved”

The early idea from @SimonPoole and other comments from @SomeoneElse and @Jez_Nicholson (that nodes/ways may not be trivial to use as external references, but are unlikely to be removed) are already reasonable.

In addition to the idea of eventually having explicitly persistent identifiers (because even relations have limitations), my hypothesis is to have some strategy (even if not in the main API, but in something that looks at the full history) to turn the old ID of something into an alias for something new. This might be easier for things that started as some kinds of nodes or relations (for ways I’m really not sure it is as easy).

This alone could help a lot to reduce the need to create high-level identifiers for something that is still in its infancy (such as a point representing a hamlet (Tag:place=hamlet - OpenStreetMap Wiki), which can start as a node but become an area, and maybe over time even become a village or town). Also, if we explicitly document this approach, we could make them stable for the outside world (again, considering that OpenStreetMap always points to what is known to exist in the current state, so for historical meaning outside users would need the date).

Since some people here already discuss RDF/Wikidata etc., maybe we could eventually have some sort of script or proxy designed to “upgrade” things that changed. This would require both some online link and (since it would very likely be used a lot, especially for data conflation) a way for others to run it locally without rate limits. I could try to write the code for this, but I’m still interested in the strategies/algorithms we could use!
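A minimal sketch of what such an “upgrade” step could look like, assuming a hypothetical alias table that maps a retired element ID to whatever replaced it. The table contents, the name resolve, and relation/42 are made up for illustration; none of this is an existing OSM API, and how the table would be built from the full history is left open.

```python
from typing import Dict, Optional

# Hypothetical alias table: retired OSM element -> element that replaced it.
# None marks an element that is gone with no successor (the "410 gone" case).
ALIASES: Dict[str, Optional[str]] = {
    "node/654": "way/987",      # POI node redrawn as an outline
    "way/987": "relation/42",   # outline later turned into a relation (made-up ID)
    "node/111": None,           # duplicate that was simply deleted
}


def resolve(osm_id: str, aliases: Dict[str, Optional[str]]) -> Optional[str]:
    """Follow alias links until reaching an ID that has no further alias.

    Returns the current ID, or None if the chain ends in a deleted element.
    Raises ValueError if the table contains a cycle.
    """
    seen = set()
    current = osm_id
    while current in aliases:
        if current in seen:
            raise ValueError(f"alias cycle at {current}")
        seen.add(current)
        successor = aliases[current]
        if successor is None:
            return None
        current = successor
    return current
```

An ID that never appears in the table resolves to itself, so a consumer could pass every stored ID through resolve and only react when the result differs. Because the table is plain data, the same lookup could run behind an online proxy or locally against a downloaded copy.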

PS: the suggestion that we explicitly make some sort of metadata search for the evolution of some items does not replace the need for persistent IDs, nor (as I agree with @Minh_Nguyen’s comments) for us to try to improve data conflation and/or terminology crosswalks without relying too much on outsiders. It’s easier for data already in OpenStreetMap format, but for comparing external data, latitude/longitude plus maybe some extra metadata would tend to be usable, especially to match data to help humans.
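A rough sketch of that latitude/longitude-plus-metadata matching, assuming each record is a plain dict with lat, lon and name keys. The 50 m radius, the case-insensitive exact-name comparison, and the function names are assumptions for illustration, not a recommended conflation algorithm; the result is only a shortlist for a human to review.

```python
import math
from typing import Dict, List, Tuple


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a sphere of radius 6371 km."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def candidates(external: Dict, osm_features: List[Dict],
               max_m: float = 50.0) -> List[Tuple[float, Dict]]:
    """Return (distance, feature) pairs within max_m whose name matches,
    ignoring case, sorted nearest first."""
    wanted = external["name"].casefold()
    hits = []
    for feature in osm_features:
        d = haversine_m(external["lat"], external["lon"],
                        feature["lat"], feature["lon"])
        if d <= max_m and feature["name"].casefold() == wanted:
            hits.append((d, feature))
    hits.sort(key=lambda pair: pair[0])  # sort by distance only
    return hits
```

Several hits for one external record would correspond to the duplicated-restaurant situation described earlier in the thread, which is exactly where a human decision is needed.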

salgo60 · December 27, 2022, 10:42am

Isn’t one challenge that OpenStreetMap lacks SKOS support?

A Wikidata item and an OSM item are “always same as” it would make sense to support SKOS ?

same as
- on the subject of owl:sameAs, see “When owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data”
nearly same as
narrower
broader

Wikidata supports using SKOS as a qualifier but I would like to see it also in OSM…

Example

A church
- is the OSM object the same as the Wikidata object, or is the Wikidata object also the cemetery or all the buildings next to the church?

Having identifiers is key; see FAIR Data F1: “Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset”

But defining the relation using SKOS also makes sense, and if we change the item in OSM, it may make sense to change just the SKOS relation…

pangoSE · December 27, 2022, 11:37am

I like this. So this in practice could be done with a new tool and set of tags.
The plain wikidata tag should be avoided and replaced by
wikidata:sameas
wikidata:narrower
Etc.
StreetComplete could have a new view where it presents the WD object and the OSM object and asks whether the OSM one is sameas, broader, narrower, etc.
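A small sketch of how such tags could be read by a tool, assuming the proposed keys wikidata:sameas, wikidata:narrower and wikidata:broader (none of which are established OSM tags) map onto the SKOS mapping properties exactMatch, narrowMatch and broadMatch. It also assumes the SKOS reading of direction, where “A narrowMatch B” says B is the narrower concept, so wikidata:narrower=Q… would mean the Wikidata item is narrower than the OSM object. The QIDs in the usage note are placeholders.

```python
from typing import Dict, List, Tuple

# Hypothetical tag keys from the proposal above, mapped to SKOS mapping properties.
TAG_TO_SKOS: Dict[str, str] = {
    "wikidata:sameas": "skos:exactMatch",
    "wikidata:narrower": "skos:narrowMatch",
    "wikidata:broader": "skos:broadMatch",
}


def to_triples(osm_id: str, tags: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Turn the proposed wikidata:* tags of one OSM element into
    (subject, predicate, object) triples; other tags are ignored."""
    triples = []
    for key, qid in sorted(tags.items()):
        prop = TAG_TO_SKOS.get(key)
        if prop is not None:
            triples.append((f"osm:{osm_id}", prop, f"wd:{qid}"))
    return triples
```

For a church polygon tagged wikidata:sameas=Q111 and wikidata:broader=Q222, this would yield one exactMatch triple for the building and one broadMatch triple for, say, an item covering the whole churchyard. Changing the relationship later would then mean editing one tag key rather than dropping the link.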

dieterdreist · December 27, 2022, 2:23pm

A Wikidata item and an OSM item are “always same as” it would make sense to support SKOS ?

They are similar, but Wikidata and OpenStreetMap items, while (ideally) somehow referring to the same thing, aren’t the “same” in a mathematical sense: they are defined independently through different systems, which may contradict each other if you compare their properties and the meanings of those properties and relationships. And both systems change all the time, independently of each other.

SomeoneElse · December 27, 2022, 4:01pm

Wikipedia / Wikidata aren’t even consistent within themselves. This wikidata item allegedly matches these two wikipedia entries, yet those pages have a very different idea of the extent of the country that they describe. OSM has objects for both, and wikipedia in a sense has too (it has different language pages), but both wikipedia pages are linked to one wikidata item, which makes no sense.

SK53 · December 27, 2022, 4:44pm

Another example: Oldmoor Wood. Originally added to Wikidata as a “human settlement”, it was only changed to “woodland” after 6 years. This shows the item refers to two things: a name which can be erroneously attached to other (or non-existent) things, and a wood. Obviously, someone decided that the name was more important for persistence. If I’d been doing it, this would have been marked as obsolete and a new valid identifier created with some link to the previous value. In this case the resolution is straightforward, but I’ve come across many similar examples in Wikidata where this is not simple. The commonest is where a Wikidata item refers both to a village or other settlement and to an administrative boundary with the same name.

aighes · December 27, 2022, 5:04pm

There are also plenty of cases the other way around. Consider all those historic buildings that nowadays contain a museum. There is a Wikipedia/Wikidata entry for the building and another one for the museum, but in OSM there is only one polygon describing everything.