Persistent and stable identifiers

fititnt · December 24, 2022, 1:45pm

This is, by definition, impossible. A hint is that the popular name for the practice of maintaining such IDs for others to use is authority control.

Even features that seem 100% procedural, say latitude and longitude in a popular reference system such as WGS84, still need to be maintained: the algorithm needs to be published, and those who create hardware and software need to understand how to translate it to other forms. The “maintaining” might be a book in a library that’s good enough to still be usable, but the book itself needs to be maintained. This is less clear with things we take for granted, like what “0 1 2 3 4 5 6 7 8 9” means, but even datetime formats have conventions.

By the same logic, IDs created by direct procedural translation of another thing also need to be maintained (even if that only means publishing the procedure and incentivizing people to use it over alternatives).

Then we have opaque identifiers, which are considered the best for the very long term, because new people have less incentive to change them unless there are actual functional issues. The ARK community has discussions on why they are against even having the organization as part of the prefix (as happens with some old DOIs): organizations eventually change names, so they might try to change the entire prefix or… give up keeping a record of the old ones.

Note that even centuries ago, when very few people (often elites sponsored by kings) were able to write books, very few books survived, as people could decide by the cover that a book didn’t seem worthy. Much more content is produced today, so the role of authority control is very, very important. Most people still complain about OpenStreetMap’s internal representation without being aware that it is closer to a library catalog, with full history even for very specific nodes, than to disposable geometries that would be mere layers. Note how upset not just the Data Working Group but also old contributors become when they see people deleting content: their reaction is similar to librarians seeing someone burning books.

However, unlike DOIs (where it takes more effort to deserve one, and there is a strict set of minimal metadata), new kinds of IDs representing concepts, such as one for OpenStreetMap, may from time to time be created by mistake and in massive quantities. If an ID is a duplicate, then the alias needs a pointer to the recommended reference. And (as they do on Wikidata) things that are poorly defined might be worth deleting, even without a user request. But even this procedure needs to be documented, so that people act with less fear.

My argument here is the following: the idea of identifiers for places necessarily means authority control. Think centuries into the future. However, their geospatial nature could easily allow the equivalent of concept identifiers on OpenStreetMap to be updated far, far more by bots and by companies than happens on Wikidata after the initial setup by humans (or by the first organization that starts the definition while using its own identifier as one of the properties), because it is easier to make inferences based on location, which resolve better than in Wikidata (most of the time OpenStreetMap already only accepts things located in space and time).

So, replying to @o_andras: while the idea of UUIDs, as used by Overture, is in fact perfect for a distributed concept, if we assume OpenStreetMap acts as authority control (which is how it already behaves, given how well it documents things, including the full history of every node), we would still have something like a serial incremental number, at least for the concepts that have some level of notoriety. But even for things that would be heavily automated (think of a few big collaborators adding data), my next point makes me think that no single big player could create an identifier alone without others agreeing on its relevance.

(Hypothesis) strategy to deal with both notoriety and long term survival for concepts for place: require baseline standards on how interlinked is the definition

I had one idea about how to deal with persistence of identifiers for things that may not have sufficient information and are not clearly likely to be notorious (the way administrative boundaries are): we enforce minimal metadata, even more strictly than Wikidata, and maybe even some delay (like weeks). The DOI system, while it allows those authorized to issue codes to also have private uses, does this for what’s expected to be used in public:

https://www.doi.org/doi_handbook/4_Data_Model.html#4.3.1

So, generic amenities (whose relevance is not clear) could get some sort of identifier, not necessarily as big as a UUID version 4, but not the same kind as the shorter ones; otherwise we would run far more easily into the issues that @SimonPoole pointed out. The types of places of interest that are likely to attract more spam (where people are even paid to add metadata, as happens on Wikipedia) would then have stricter requirements, though always focused on what makes them well defined enough to be interlinked (for example, leaving out of the calculation ephemeral data that most users would add anyway, like what the place sells).

A “DOI approach” would automatically mean that some human needs to decide if a place deserves a code (even if this is somewhat algorithmic, as today on Wikipedia, where after some time a page may become a Wikidata Q item), and even then it would need much, much more metadata. This would mean that, when evaluating whether a place can have an identifier, we take into consideration whether users have added the date of inception, the etymology of the name, whether the shop is part of some brand, and so on: things that are less likely to change. Sadly, we cannot add personal information on OpenStreetMap, but if the shop had a famous founder listed in another authority control (like a Wikidata Q item), then that person could be added as founder.

The idea of enforcing even stricter metadata for the same kind of shorter identifiers that administrative boundaries could have does not mean these places could not have other identifiers, more algorithmic ones or ones created by user/company request. We can do both. But the “first-class” persistent identifiers (kept even if places cease to exist) should, in my opinion, only be allowed if, at least in theory, they could eventually be cared for.

Since several people commented about the challenge of knowing when something changed or not, a good safe approach is to intentionally make it hard to give identifiers for the first wave of amenities, and I mean not “Wikidata level notoriety”, but “Wikipedia level notoriety” (aka already be famous). This alone could allow sufficient time to think (while already seeing how the encoding works), so we could start to define the minimum metadata that both humans and anyone else (like companies) would need to provide.

fititnt · December 24, 2022, 2:33pm

A built-in “notification system” might be for OpenStreetMap what interlinking equivalent pages between languages was for Wikipedia.

Starting with far fewer items also helps to perfect tools that trigger notifications based on changes. Today, for example, some people already watch for changes to pages on the Wiki, and since people on OpenStreetMap are more exigent than those on Wikipedia/Wikidata, if we create such items, people will be interested in knowing what changed since their last visit.

There is also the obvious advantage of having another way to query the concepts, although today this is still somewhat possible with advanced queries. What is not possible is a way for people to learn about changes customized to their needs. If someone cares about a geographic region (like their city), they could focus on the concept of that region, and it would be easier for anyone to create apps that filter updates for that person.

Some people already come to OpenStreetMap just to fix an error in some place they found. I would say that once we know how to trigger notifications, new users who changed a point of interest (not something like just a road) could be notified at least once when someone else also updates that point of interest. For power users this might not be great, but new users could become more engaged with the project.

Disclaimer: despite what I’m saying about triggering notifications, I do not have a full proof of concept for this, and I know people complain that Wikibase does not handle it very well. But if I had to build one feature to convince everyone that it is worth going ahead, notifying about changes (with an option to ignore some types of minor modifications) seems to be what would deliver the greatest impact. Note that, compared with Wikipedia, collaborators on OpenStreetMap (except on the Wiki itself) are not notified about changes at all.
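To make the "notify, but ignore minor modifications" idea a bit more concrete, here is a minimal Python sketch. It is not a proof of concept for OpenStreetMap itself: it only compares two tag dictionaries for the same feature. The function names and the list of "minor" keys are assumptions made up for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical set of tag keys whose changes count as "minor"; purely illustrative.
MINOR_KEYS: Set[str] = {"source", "check_date", "note"}


def significant_changes(
    old: Dict[str, str],
    new: Dict[str, str],
    ignored: Set[str] = MINOR_KEYS,
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (old_value, new_value)} for tags that differ, skipping ignored keys."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if key in ignored:
            continue  # the watcher asked not to hear about this kind of change
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


def should_notify(old: Dict[str, str], new: Dict[str, str],
                  ignored: Set[str] = MINOR_KEYS) -> bool:
    """True when at least one non-ignored tag was added, removed or changed."""
    return bool(significant_changes(old, new, ignored))
```

With this shape, each watcher could carry their own `ignored` set, which is one possible way to get notifications customized per person.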

SomeoneElse · December 24, 2022, 3:32pm

(it’s unclear to me whether you’re talking about something like “placekey.io” here or some database that someone in OSM has control over)

(assuming you’re talking about “something that people in OSM have control over”):
If X and Y are just OSM tags, then storing something derived from other OSM tags inside OSM doesn’t make any sense. Of course, you can store it outside, and use it to check when something in OSM has changed.
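As a rough illustration of "store it outside, and use it to check when something in OSM has changed", the sketch below keeps a hash of each element's tags in an external list and reports which elements no longer match. The function names and the `type/id` reference format are assumptions for this example, not an existing tool.

```python
import hashlib
import json
from typing import Dict, List


def tag_fingerprint(tags: Dict[str, str]) -> str:
    """Hash a tag dictionary so that the result does not depend on key order."""
    canonical = json.dumps(tags, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_elements(stored: Dict[str, str],
                     current: Dict[str, Dict[str, str]]) -> List[str]:
    """Compare stored fingerprints against current tags.

    stored maps an element reference such as "way/2" to its last known
    fingerprint; current maps the same references to their present tags.
    Elements that are missing from current are reported as changed too.
    """
    changed = []
    for ref, old_fingerprint in stored.items():
        tags = current.get(ref)
        if tags is None or tag_fingerprint(tags) != old_fingerprint:
            changed.append(ref)
    return sorted(changed)
```

The external list stays entirely outside OSM, so whoever maintains it decides which tags go into the fingerprint.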

Like with everything else in OSM, whoever volunteers to do that.

If it’s in a list outside OSM it’s not a problem - whoever maintains that list is in charge.

(if you’re talking about placekey.io or a similar externally-maintained database)
That’s just another primary key in another third-party database which we have no control over, of which there are lots in OSM already (wikipedia, wikidata, FHRS, etc.). They can be useful - for example should https://www.openstreetmap.org/way/117236239/history change hands, the FHRS ID for it will likely change too. Lots of tools are available to keep track of changes. Subject to licensing, you could do something similar with other third-party data.

Any non-open data from a third-party may be “here today, gone tomorrow” - there are lots of defunct startups in this area, so unlike open data I wouldn’t rely on it as a long-term solution.

Jez_Nicholson · December 24, 2022, 8:07pm

Okay, I’ll throw out an unpopular opinion: this is all academic. The majority of nodes do not get deleted and replaced, and the ones that can be logically linked to Wikidata even less so. You all are creating clever new schemes to solve something that doesn’t need solving.

Shoot me.

SomeoneElse · December 24, 2022, 8:32pm

My experience is that that is exactly correct. I do various QA on bits of OSM (making sure certain sorts of relations are still valid, filling in gaps in route relations where they have been introduced by accident, that sort of thing). I’ve been doing this for a couple of years now and all I’ve seen is that occasionally a cycle route will be split into a few pieces when it gets unwieldy, and perhaps a super-relation will get created. If you look here you can see that there’s maybe been 1 change per month of the currently 1600+ trails checked (UK+IE).

Elsewhere, perhaps nodes for POIs might get replaced by polygons, but that’s easy to spot too.

Minh_Nguyen · December 24, 2022, 9:50pm

I think I share the view that what we’re able to model in Wikidata probably suffices for most of what we ourselves intend to do with stable identifiers. However, the subtext behind this thread is that, apparently, some OSM data consumers have found it so important to fashion an identifier-based conflation system for OSM data that Overture Maps is touting it as one of their selling points. I guess this is an exercise in seeing if we could bring some of that functionality in-house.

Stable identifiers aren’t really about tracking the history of a real-world feature over time. It’s more about being able to hang some extra metadata off of an OSM feature with some confidence that it refers to the same real-world feature. Change tracking only becomes relevant to the extent that either OSM or the external data source can become outdated.

I don’t have much experience with authority control on POIs, but I’ve seen that linear referencing schemes naturally arise from trying to tie static road data to historic or real-time traffic and incident data. Even so, at some point, the best stable identifier scheme compares poorly to obtaining fresh data and purging stale data. Users don’t care whether you managed to match the traffic jam to exactly the right spot as much as they want the colors on the map to be current. Likewise, restaurant reviews from a decade ago aren’t necessarily trustworthy anymore, even if the cuisine and owner stay the same.

dieterdreist · December 25, 2022, 12:51am

Since several people commented about the challenge of knowing when something changed or not, a good safe approach is to intentionally make it hard to give identifiers for the first wave of amenities, and I mean not “Wikidata level notoriety”, but “Wikipedia level notoriety” (aka already be famous).

we already have identifiers for things with wikidata level notoriety, it’s the tag “wikidata”. These also already cover all wikipedia level notoriety things.

IanH · December 26, 2022, 3:24am

I see nothing wrong if we or another party use one or more entity tracking IDs. The OSM key needs to be well-formed according to some established naming rules. The key name shouldn’t have to be altered because the name of the agency or source of the index changed. This should prevent any updates other than those caused by changes to the object that the node(s) represent.

o_andras · December 26, 2022, 11:34am

I don’t know, I was just throwing alternatives at the wall

That post is about Placekey.

That’s not what I suggested. In that post, X@Y is supposed to be the Placekey ID derived from X and Y. From what I understood from their site, the IDs depend on both geospatial location and something else not well specified.

That is the problem of having to maintain an extra DB. If some address isn’t in OSM, no big deal, maybe you can still find the street, just not the housenumber. But if you can’t assume high-quality IDs, then might as well have no IDs at all and save yourself some (lots of) work.

I think the point of this thread is to find a solution, but that doesn’t lead to a solution unless we can guarantee some things on OSM’s side.

Of course, people may be OK with less than ideal (e.g. ID123 from the external DB is now mapped on OSM by w987 instead of n654), and that’s fine. But in that case we don’t need to have this discussion at all on OSM.

“The majority of cars on the roads don’t have accidents. Telling people to drive slowly, to not drink before driving, etc are just solutions to a problem that doesn’t exist.”

This exaggerated tongue-in-cheek “analogy” is just to point out, again, that if the ID system isn’t trustworthy, then it’s basically worthless IMO (see right above).

BTW I’m neither for nor against stable IDs, I’m for OSM. I’m just here discussing possible ways to marry both, without sacrificing anything in OSM.

OP disagrees?

@Minh_Nguyen you can’t add everything and anything you map on OSM to Wikidata, like the mom&pops greengrocer shop at the corner, or the less than popular cafe that’s nonetheless been there for decades.

See above.

Minh_Nguyen · December 26, 2022, 11:48am

I was responding to @Jez_Nicholson, not the OP. (When you reply to a whole post, Discourse counterintuitively indicates that post in a little icon in the top-right corner.)

Sure, there’s a long tail of things in OSM that may not ever make it into Wikidata. However, my point is that the use cases for identifiers that normally occupy our attention as mappers – largely solvable with the help of Wikidata – are not the end-all-and-be-all of use cases for identifiers. In other words, I don’t think a (kind of) stable identifier system based on OSM would be a total fool’s errand, even if we ourselves may not enjoy any tangible benefits from such a system. Whether it’s strategically important for this project to work on this system is another matter.

SomeoneElse · December 26, 2022, 11:59am

It isn’t - as I said above, I use OSM IDs as “permanent IDs” for external checks and they are more than good enough for that.

o_andras · December 26, 2022, 2:40pm

I know, but who’s “we”? “We” seems to suggest that everyone is looking into stable IDs for the same purposes, hence “OP disagrees”, because you can’t use Wikidata for OP’s purpose. And I’m not sure what even are the purposes that “we ourselves” have. I don’t have a use for it, for example.

And here it seems you’re contradicting yourself. Yes, there’s no fixed set of purposes common to everyone, and not even common to “us mappers”, so saying that “our” purposes (as mappers) are largely solvable by wikidata doesn’t make sense. OP certainly doesn’t think the problem is solvable by wikidata, otherwise I don’t think they would have created this thread. (It’s implicit that OP is also a mapper)

Your examples of routes can be explained simply by “relations”. Relations are a mystery to the majority of mappers, so the fraction of mappers that touch relations is for certain minuscule. And how often is it that some bicycle route, or hiking trail, or whatever else changes in reality? There’s not much reason to keep modifying routes on OSM if they don’t change in the physical world. Plus, I think that a relation does not show up in changesets only because its constituent objects have been modified, right? Given these points, it’s only natural that the IDs of those relations are kind of stable, but you can’t say the same for nodes or ways in general.

In some dense areas I’ve seen the same restaurant/etc mapped twice or thrice (as a node or way), right next to each other, for no good reason. If you have to pick one of them, which do you expect to be the most stable? The one closest to its physical position? The one with the best details? The oldest? All of them are good candidates to be the final representative, but all of them are good candidates to be removed too.

Adamant1 · December 26, 2022, 3:03pm

Essentially zero since the project started according to this chart. Understandably, since they are extremely obtuse and way too easy to mess up. Also, in a lot of cases they probably aren’t used properly anyway. So stable identifiers would be a major improvement in the places where the use cases overlap, although obviously not a 1:1 replacement.

SomeoneElse · December 26, 2022, 3:06pm

No - ways as well, and very generic ways at that. Occasional updates are needed as I mentioned earlier, but it’s not a major issue.

Minh_Nguyen · December 26, 2022, 6:24pm

I know not every mapper has the same goals or ideas. There’s no contradiction; I’m just trying to point out why this discussion may seem academic to some here but still have value. Do you disagree?

dieterdreist · December 26, 2022, 7:06pm

And how often is it that some bicycle route, or hiking trail, or whatever else changes in reality

it is not exceptional for a hiking trail to change in reality, and changes to hiking relations happen even more because every way split of a member will create a new version

fititnt · December 27, 2022, 5:01am

So, from the discussions and from what seems to be common when looking at very old IDs and relations that still exist: the relations/ways/nodes on OpenStreetMap, because of their full history plus the care taken to avoid any kind of deletion (which is what people complain about as not allowing fast imports), are, by definition, already close to persistent/external identifiers. What’s actually not as desirable (compared to most authority control) is when a concept requires more than one unique identifier, which can happen, for example, when a way is split because a smaller part needs different metadata, so the initial information is copied to all its new parts.

Potential counterargument: what about newly created features that duplicate something and then get deleted? What about old things (like entire ways without any metadata, likely the result of bad imports) that someone eventually deletes in the future? My reply: even persistent identifiers such as DOIs can be aliased to new ones (see Changing or deleting DOIs - Crossref), and those truly in error (in OSM terms, either early duplicates not used outside, or very old elements without any metadata at all to be used by anyone) could point to a truly “410 Gone” forever page without even an explanation (like https://www.crossref.org/_defunct-doi/).

So, I’m genuinely open to criticism or comments here. But my argument is that, inside OpenStreetMap, the data is far more often than not already consistent (even before applying any schema/validation on top of it), likely even allowing the decent level of full retroactive research anyone would expect from most well crafted library heading systems.

Potential counterargument: but if we compare Wikidata (SPARQL) to OpenStreetMap (Overpass QL), the Wikidata approach seems more organized! My reply: the way OpenStreetMap is organized (explicitly a geodatabase, and also far less scary to contribute to than Wikidata directly), even with its free tagging approach, means that using the most popular conventions, such as the ones used for rendering, would make OpenStreetMap data in RDF form much better interlinked than Wikidata is for places. But it would need inference turned on to have it pre-cached (which could easily expand the data to the point of making BlazeGraph collapse). Contrary to naive opinions, Wikidata is far less complete for places than OpenStreetMap is, and it is no surprise that OSM data is often used to argue other datasets.

So, does saying node/way/relation IDs are okay mean the discussion of “persistent IDs” is not necessary? It is still necessary. The actual downside is not quality, but that duplicates are not as easy to track from outside. The expectation of a library catalog (or, in DOI terms, a Registration Agency) would be that if users ask for an old ID that evolved into something else, we expand it into the new set of IDs.

Hypothesis: have a dedicated metadata search for IDs that “evolved”

The early idea from @SimonPoole and the other comments from @SomeoneElse and @Jez_Nicholson (that nodes/ways may not be trivial to use as external references, but are unlikely to be removed) are already reasonable.

In addition to the idea of eventually having explicitly persistent identifiers (because even relations have limitations), my hypothesis is to have some strategy (even if not in the main API, but in something that looks at the full history) to turn an old ID into an alias for something new. This might be easier for things that started as some kinds of nodes or relations (for ways I’m really not sure it is as easy).

This alone could help a lot to reduce the need to create high-level identifiers for something that is still in its infancy (such as a point representing a hamlet (Tag:place=hamlet - OpenStreetMap Wiki) that can start as a node but become an area, and maybe over time even become a village or town). Also, if we explicitly document this approach, then we could make the IDs stable for the outside world (again, considering that OpenStreetMap always points to what is known to exist in the current state, so for historical meaning outside users would need the date).

Since some people here already discuss RDF/Wikidata etc., maybe we could eventually have some sort of script or proxy designed to “upgrade” things that changed. This would require both some online endpoint and (since it would very likely be used a lot, especially for data conflation) a way for others to run it locally without rate limits. I could try to write the code for this, but I’m still interested in the strategies/algorithms we could use!
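As a starting point for that discussion, here is a minimal Python sketch of what the core of such an "upgrade" script could look like, assuming an alias table has already been derived somehow (for example from full history). The table contents, the `type/id` reference format and the status names are all hypothetical; how to build the table is exactly the open question.

```python
from typing import Dict, Optional, Tuple

# Hypothetical alias table: old reference -> newer reference.
# A value of None is a tombstone for an ID that is gone for good (the "410" case).
ALIASES: Dict[str, Optional[str]] = {
    "node/654": "way/987",      # e.g. a POI node that was redrawn as an area
    "way/987": "relation/42",   # e.g. the area later became a multipolygon
    "node/111": None,           # e.g. an early duplicate deleted without successor
}


def resolve(ref: str,
            aliases: Dict[str, Optional[str]] = ALIASES,
            max_hops: int = 50) -> Tuple[str, Optional[str]]:
    """Follow alias pointers and return (status, reference).

    status is "current" if ref has no alias entry, "moved" if at least one
    alias was followed, and "gone" if the chain ends in a tombstone.
    Raises ValueError when the table contains a cycle or an overlong chain.
    """
    seen = {ref}
    current = ref
    hops = 0
    while current in aliases:
        target = aliases[current]
        if target is None:
            return ("gone", None)
        if target in seen or hops >= max_hops:
            raise ValueError(f"alias cycle or overlong chain starting at {ref}")
        seen.add(target)
        current = target
        hops += 1
    return ("moved" if hops else "current", current)
```

Because the resolver is only a dictionary lookup in a loop, the same table could be served online and also downloaded to run locally without rate limits.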

PS: the suggestion that we explicitly build some sort of metadata search for the evolution of items does not replace the need for persistent IDs or, in line with @Minh_Nguyen’s comments, which I agree with, the need to improve data conflation and/or terminology crosswalks without relying too much on outsiders. It’s easier for data already in OpenStreetMap format, but for comparing external data, latitude/longitude plus maybe some extra metadata would tend to be usable, especially to match data in order to help humans.

salgo60 · December 27, 2022, 10:42am

Isn’t one challenge that OpenStreetMap lacks SKOS support?

A Wikidata item and an OSM item are “always same as” it would make sense to support SKOS ?

same as
- on the subject of owl:sameAs, see “When owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data”
nearly same as
narrower
broader

Wikidata supports using SKOS as a qualifier but I would like to see it also in OSM…

Example

A church
- is the OSM object same as the Wikidata object or is the Wikidata object also the cemetery or all buildings next to the church,

Having identifiers is key, see FAIRDATA F1: “Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset”

but also defining the relation using SKOS makes sense, and if we change the item in OSM it maybe makes sense to change just the SKOS relation…

pangoSE · December 27, 2022, 11:37am

I like this. So this in practice could be done with a new tool and set of tags.
Wikidata as tag should be avoided and replaced by
wikidata:sameas
wikidata:narrower
Etc.
StreetComplete could have a new view where it presents the wd object and the osm object and asks whether the osm one is sameas, broader, narrower, etc.
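A minimal Python sketch of how a data consumer might read such tags and turn them into SKOS mapping statements. Only `wikidata:sameas` and `wikidata:narrower` come from the proposal above; the other two tag keys, the choice of SKOS property for each key, the reading of the direction (the OSM object is the subject) and the Q-numbers in the usage example are assumptions for illustration.

```python
from typing import Dict, List, Tuple

SKOS = "http://www.w3.org/2004/02/skos/core#"
WD = "http://www.wikidata.org/entity/"
OSM = "https://www.openstreetmap.org/"

# Proposed tag key -> SKOS mapping property. The pairing is an assumption;
# "wikidata:nearly_sameas" and "wikidata:broader" are hypothetical key names.
TAG_TO_SKOS: Dict[str, str] = {
    "wikidata:sameas": SKOS + "exactMatch",
    "wikidata:nearly_sameas": SKOS + "closeMatch",
    "wikidata:narrower": SKOS + "narrowMatch",
    "wikidata:broader": SKOS + "broadMatch",
}


def skos_triples(osm_ref: str, tags: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Build (subject, predicate, object) triples from the proposed tags.

    osm_ref is a reference such as "way/123"; tag values may hold several
    Wikidata IDs separated by semicolons, as is usual for multi-value OSM tags.
    """
    subject = OSM + osm_ref
    triples = []
    for key, predicate in TAG_TO_SKOS.items():
        for qid in tags.get(key, "").split(";"):
            qid = qid.strip()
            if qid:
                triples.append((subject, predicate, WD + qid))
    return triples
```

For example, `skos_triples("way/123", {"building": "church", "wikidata:broader": "Q222"})` would state that the (placeholder) item Q222 is a broader match for the church building, which is the cemetery-plus-church situation described earlier in the thread.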

dieterdreist · December 27, 2022, 2:23pm

A Wikidata item and an OSM item are “always same as” it would make sense to support SKOS ?

they are similar, but wikidata and OpenStreetMap items, while (ideally) somehow referring to the same thing, aren’t “same” in a mathematical way, they are both independently defined through different systems, which may contradict each other if you compare their properties and meanings of those properties and relationships. And both systems change all the time, independently of each other.