Persistent and stable identifiers

ezekielf · December 30, 2022, 9:01pm

The place node for my local city was first created 15 years ago and the ID is still the same. It is not difficult for a careful mapper to keep the history, thereby keeping the ID stable as well. So we can and do have stable IDs in OSM, but that can only be as long as the object’s type (node, way, relation) is not changed. While I agree that it’s not a massive problem, I do find it a bit unfortunate. I always feel a little bit bad when I need to change the type of a feature, since I know that will break the history and change the ID. Wikidata IDs or other external identifiers offer a workaround since they can stay the same across OSM type changes, but it seems a bit odd that we don’t have our own global object IDs that could serve this purpose as well.

It’s certainly true that sometimes a changed object is now different enough that it should not keep the same ID. However, I don’t find this a compelling argument against the concept of stable IDs in OSM. If we had such a thing then we’d come up with guidelines for when keeping the old ID is appropriate and when a new one is. This is what is already happening when other external identifiers are added to, removed from, or replaced on OSM objects.

eisa01 · December 31, 2022, 11:11am

I’d say there would need to be a different persistent ID for the concept “building” and one for the “amenity” in your example

And the “amenity” ID would need to be carried over regardless of whether it ends up tagged on a node, a way, etc.

Anton_Khorev · December 31, 2022, 2:36pm

The issue as stated in the original post doesn’t exist because abstract persistent identifiers with no particular purpose don’t exist. Since they don’t exist you don’t add them to osm.

Malls sometimes have visible numbers for these slots. If I only map shops inside as nodes I usually add them as addr:door. This tag would serve as an “entity link”. But ultimately it’s a type-id-version combination if you want some kind of id to track POI changes over time. You record somewhere all of the elements with their versions that correspond to POIs that you’ve surveyed. When you resurvey the place, you compare your records with changes both in the osm db and on the ground. This will require you to some degree to work against delete-happy users who delete closed shops/restaurants/etc.
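The record-and-compare workflow described above could be sketched roughly as follows. This is only an illustration: `ElementRef`, `compare_survey`, and the shape of `current_versions` are hypothetical names, and how the current versions are fetched from the OSM database is left out.

```python
from typing import NamedTuple


class ElementRef(NamedTuple):
    """One OSM element at a specific version, e.g. ("node", 1000, 3)."""
    type: str
    id: int
    version: int


def compare_survey(recorded, current_versions):
    """Classify each recorded element against the current database state.

    recorded: iterable of ElementRef noted during the previous survey.
    current_versions: dict mapping (type, id) to the current version number,
        or to None when the element has been deleted. A missing key is
        treated the same way as a deleted element.
    """
    report = {"unchanged": [], "modified": [], "deleted": []}
    for ref in recorded:
        now = current_versions.get((ref.type, ref.id))
        if now is None:
            report["deleted"].append(ref)
        elif now == ref.version:
            report["unchanged"].append(ref)
        else:
            report["modified"].append(ref)
    return report
```

Elements that land in "modified" or "deleted" are the ones to check against the ground on the next resurvey.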

When I recheck POIs I usually keep the oldest one. The “best details” (most tags) are often on the newly added node because it came from some database/SEO/marketing source when they care more about adding their stuff to osm rather than keeping osm data consistent.

Anton_Khorev · December 31, 2022, 2:54pm

Having stable ids looks like a good idea for things that are notable and change slowly. In this case you have careful mappers not erasing the edit history and wikidata ids. With smaller POIs this idea doesn’t look as good.

rtnf · January 2, 2023, 11:17pm

I’m currently experimenting with this problem. Here’s my basic approach:

Make a completely separate database that stores a persistent and stable identity for each geographical object. Here we could store more complex metadata about that object, not limited by the current OSM tagging scheme.
Link each record in that database to its corresponding “unstable” OSM ID.
Set up an RSS-based notification that monitors every change to the corresponding “unstable” OSM ID. Rely on a semi-manual crowdsourcing approach to resolve any conflict (to minimize friction with fellow OSM contributors, I’d rather update the OSM ID in the external db than enforce a new rule in OSM, for example).
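A minimal in-memory sketch of the first two steps might look like this. The class name `StableRegistry`, its methods, and the use of UUIDs as stable identifiers are all assumptions made for illustration. The RSS monitoring step is not modelled: `relink` stands in for the semi-manual fix applied once a change has been noticed.

```python
import uuid


class StableRegistry:
    """External store mapping a stable identifier to a changeable OSM element."""

    def __init__(self):
        self._records = {}  # stable id -> {"osm": (type, id), "meta": dict}
        self._by_osm = {}   # (type, id) -> stable id

    def register(self, osm_type, osm_id, **meta):
        """Create a new stable identifier for an OSM element."""
        stable_id = str(uuid.uuid4())
        self._records[stable_id] = {"osm": (osm_type, osm_id), "meta": dict(meta)}
        self._by_osm[(osm_type, osm_id)] = stable_id
        return stable_id

    def relink(self, stable_id, osm_type, osm_id):
        """Point an existing stable identifier at a replacement OSM element."""
        record = self._records[stable_id]
        del self._by_osm[record["osm"]]
        record["osm"] = (osm_type, osm_id)
        self._by_osm[(osm_type, osm_id)] = stable_id

    def lookup(self, osm_type, osm_id):
        """Return the stable identifier for an OSM element, or None."""
        return self._by_osm.get((osm_type, osm_id))

    def osm_element(self, stable_id):
        """Return the (type, id) pair currently linked to a stable identifier."""
        return self._records[stable_id]["osm"]
```

For example, if a shop mapped as node 1000 is redrawn as way 2000, calling `relink(stable_id, "way", 2000)` keeps the external identifier and its metadata while the OSM reference changes.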

mikelmaron · January 3, 2023, 3:56pm

The majority of nodes do not get deleted and replaced, and the ones that can be logically linked to Wikidata even less so.

Anecdotally agree. Would love to see analysis that makes an attempt to quantify how stable identifiers are in practice over time. Even with a naive interpretation of what keeps an object “the same”, it would be informative to know by type of feature and geography how frequently an OSM identifier ends up deleted or referencing something else.
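One naive way to frame such an analysis is sketched below. It assumes each element's history has already been reduced to its first and last tag sets. The input format, the function names, and the rule that a changed value for the main key means "referencing something else" are all illustrative assumptions, not an agreed definition.

```python
from collections import defaultdict


def classify(first_tags, last_tags, key):
    """Naive outcome for one element.

    "deleted" if the element no longer exists, "repurposed" if the value of
    its main key changed, and "stable" otherwise.
    """
    if last_tags is None:
        return "deleted"
    if last_tags.get(key) != first_tags.get(key):
        return "repurposed"
    return "stable"


def stability_rates(histories):
    """Share of each outcome per feature key.

    histories: iterable of dicts with "key" (e.g. "shop"), "first_tags",
    and "last_tags" (None when the element has been deleted).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for h in histories:
        outcome = classify(h["first_tags"], h["last_tags"], h["key"])
        counts[h["key"]][outcome] += 1
    rates = {}
    for key, outcomes in counts.items():
        total = sum(outcomes.values())
        rates[key] = {outcome: n / total for outcome, n in outcomes.items()}
    return rates
```

Grouping by geography as well would only need an extra field in each history record and in the grouping key.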

SimonPoole · January 3, 2023, 4:33pm

Define “something else” (see also Persistent and stable identifiers - #7 by SimonPoole). If we had that, then we could potentially come to a different conclusion than that it is application/use case dependent (and correspondingly what a useful persistent id is for a specific application).

mikelmaron · January 3, 2023, 4:46pm

I understand the trouble with an absolute definition of change. I think using some naive assumptions for the analysis would be a fantastic starting point to ground the discussion in some practical measurements.

SimonPoole · January 3, 2023, 4:59pm

The problem is that any such analysis is just not going to lead anywhere (regardless of the criteria), because while what @Jez_Nicholson writes is true, a stable id that works by chance (even if it is quite likely) is definitely not going to cut it. I’ve personally broken the premise 100s of times by changing the tags on shop/amenity/leisure nodes.

As I pointed out in Persistent and stable identifiers - #21 by SimonPoole, we do have a possibility to freeze a specific object, and we can, if on de-referencing it we determine that it isn’t the latest version or its status has changed, determine a new persistent id. For some applications this is perfectly good enough, for others not so.

Tordanik · January 7, 2023, 1:08pm

Such an analysis could be useful to check whether certain assumptions hold. For example, my current assumption is that identifier instability for POI-type elements is mostly caused by actions that could be avoided with the two relatively feasible steps of

introducing a shared ID space for all OSM element types
evolving editing culture to discourage unnecessary breaks in object identity, and to encourage clean separation between different entities (such as buildings and their occupants)

Whether we want to do that would be a separate question, but I do see value in knowing what our options are.

And of course I’m talking about a “naive” definition of change here. There will inevitably be the edge cases/“Ship of Theseus” problems, but my feeling is that rough guidelines for when an object is/isn’t the same are enough for a practically useful solution.

mmd · January 7, 2023, 1:38pm

Are you implying here that a node 1000 in v1 might be turned into way 1000 in v2 to keep the id stable? Wouldn’t this break the history of a node or way, unless you redefine today’s concept of an object history to allow changing object types?

Tordanik · January 7, 2023, 1:52pm

That would indeed be one way of doing it. It could also be implemented as deleting node 1000 and creating way 1000 at v1 as long as the API prevents the creation of way 1000 while a node 1000 exists. Either way, it would be a major breaking change to the API.
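The first variant, where a single ID keeps one history across a type change, could be modelled roughly like this. The classes `Version` and `SharedIdObject` are hypothetical and do not reflect how the current API stores history.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    number: int
    element_type: str  # "node", "way" or "relation"
    tags: dict


@dataclass
class SharedIdObject:
    """An object whose id is shared across element types (hypothetical model)."""
    id: int
    versions: list = field(default_factory=list)

    def add_version(self, element_type, tags):
        """Append a new version; the element type may differ from the last one."""
        version = Version(len(self.versions) + 1, element_type, dict(tags))
        self.versions.append(version)
        return version

    @property
    def current(self):
        """The most recent version of the object."""
        return self.versions[-1]
```

Here object 1000 can be a node at v1 and a way at v2, which is exactly the redefinition of object history that the question above points at.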

Something along those lines would, however, appear to be the only approach that would enable element IDs (as opposed to, say, additional attributes stored in tags) to be even potentially viewed as stable identifiers. Anything that changes when a mapper turns a node into a way or area cannot be a stable identifier of a real-world feature.

dieterdreist · January 7, 2023, 2:06pm

Anything that changes when a mapper turns a node into a way or area cannot be a stable identifier of a real-world feature.

Wouldn’t you quite often turn many nodes into just one way? What about two relations replacing an object that was represented as a way (one that combined what are two objects after the change)?

mmd · January 7, 2023, 2:33pm

I think this would cause fewer side effects. The issue I see is that by using a shared serial number object for all OSM object types, you end up with fairly large numbers for all object types. Some applications assume 32-bit numbers for way and relation ids, and would no longer work.
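As a small illustration of that failure mode, the sketch below packs a way id into a signed 32-bit field, the way a consumer with that assumption might. The function names and the choice of a signed little-endian field are assumptions made for the example.

```python
import struct

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_int32(element_id):
    """True if the id can be stored in a signed 32-bit integer."""
    return INT32_MIN <= element_id <= INT32_MAX


def pack_way_id(way_id):
    """Serialize a way id as a signed 32-bit little-endian integer.

    Raises struct.error when the id is outside the 32-bit range.
    """
    return struct.pack("<i", way_id)
```

With a shared counter, way and relation ids would be drawn from the same sequence as node ids, so a consumer like this would start failing as soon as a way received an id above `INT32_MAX`.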

Also, the API currently won’t let a user define the object id of a newly created object. You have to provide a negative placeholder id, which is then replaced by the next available number in the serial number object.

In theory, you could allow positive values here, and make sure that the object doesn’t exist yet. It will get messy though, if a user decides to undelete node 1000 while way 1000 still exists. This new implicit dependency between object types would indeed be a breaking change.
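The implicit dependency between object types can be made concrete with a toy model. Everything here is hypothetical, including the `SharedIdStore` class and the `create_with_id` extension. It is a sketch of the rule "one live element per id", not of the real API.

```python
import itertools


class SharedIdStore:
    """Toy store with one id sequence shared by nodes, ways and relations."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._live = {}         # id -> type of the currently visible element
        self._deleted = set()   # (type, id) pairs that have been deleted

    def create(self, element_type, placeholder_id):
        """Create an element; the negative placeholder is replaced by a new id."""
        if placeholder_id >= 0:
            raise ValueError("new elements must carry a negative placeholder id")
        new_id = next(self._counter)
        self._live[new_id] = element_type
        return new_id

    def delete(self, element_type, element_id):
        if self._live.get(element_id) != element_type:
            raise KeyError(f"no live {element_type} {element_id}")
        del self._live[element_id]
        self._deleted.add((element_type, element_id))

    def create_with_id(self, element_type, element_id):
        """Hypothetical extension: reuse a freed id for another element type."""
        if element_id in self._live:
            raise ValueError(f"id {element_id} is in use by a live element")
        if all(i != element_id for _, i in self._deleted):
            raise ValueError(f"id {element_id} was never allocated")
        self._live[element_id] = element_type

    def undelete(self, element_type, element_id):
        """Restore a deleted element unless another type now holds its id."""
        if (element_type, element_id) not in self._deleted:
            raise KeyError(f"{element_type} {element_id} was not deleted")
        if element_id in self._live:
            raise ValueError(f"id {element_id} is in use by a live element")
        self._deleted.discard((element_type, element_id))
        self._live[element_id] = element_type

    def live_type(self, element_id):
        """Type of the live element with this id, or None."""
        return self._live.get(element_id)
```

In this model, undeleting the node fails while the way with the same id is live, which is the messy case described above.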

Minh_Nguyen · January 7, 2023, 4:25pm

To elaborate on this point: Some data consumers (such as the Overpass API, Mapbox Streets, or OpenMapTiles) already find it necessary to combine the three ID namespaces into one or derive another kind of ID from these IDs within the same namespace. For some, the motivation is to stick something in GeoJSON’s built-in feature ID property; a simple numeric ID compresses better than a string ID prefixed by the element type.

Data consumers combine the namespaces by adding some arbitrary number to way and relation IDs or applying some other predictable formula to avoid numeric collisions. This reduces the available numeric range for each element type, and there’s some pain whenever a data consumer needs to change the formula on its clients. That said, if such a solution were to be implemented for existing elements, then the same solution could apply to negative IDs for yet-to-be-created elements.
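One example of such a predictable formula is sketched below. It is purely illustrative and is not the scheme used by any of the data consumers named above: the multiplier, the type codes, and the function names are assumptions.

```python
# Hypothetical type codes; any fixed, collision-free assignment would do.
TYPE_CODES = {"node": 0, "way": 1, "relation": 2}
CODE_TYPES = {code: name for name, code in TYPE_CODES.items()}


def combine(element_type, element_id):
    """Fold the element type into a single integer id."""
    return element_id * 3 + TYPE_CODES[element_type]


def split(combined_id):
    """Recover (element_type, element_id) from a combined id."""
    element_id, code = divmod(combined_id, 3)
    return CODE_TYPES[code], element_id
```

Because Python's `divmod` rounds toward negative infinity, the round trip also holds for negative placeholder ids. The cost is the one mentioned above: the usable numeric range for each element type shrinks, here by a factor of three.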

Wikidata took a different approach, last year adopting a new, partly redundant property that treats a consolidated OSM element ID as a non-numeric ID, using node/, way/, or relation/ as a namespace. There were some concerns about stability in response to that proposal, and I continue to hear some grousing about how the property creator interpreted consensus in that discussion. Regardless, a non-numeric ID would be the most extensible representation, at a cost to convenience and data compression.
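A non-numeric consolidated ID of the form `node/123` is straightforward to build and parse. The helper names below are hypothetical, and the validation rules are an assumption about what a strict consumer might enforce.

```python
VALID_TYPES = ("node", "way", "relation")


def format_element(element_type, element_id):
    """Build a consolidated id such as "way/1000"."""
    if element_type not in VALID_TYPES:
        raise ValueError(f"unknown element type: {element_type!r}")
    return f"{element_type}/{element_id}"


def parse_element(text):
    """Split a consolidated id into (element_type, element_id)."""
    element_type, sep, raw_id = text.partition("/")
    if element_type not in VALID_TYPES or not sep:
        raise ValueError(f"not a consolidated element id: {text!r}")
    if not (raw_id.isascii() and raw_id.isdigit()):
        raise ValueError(f"element id is not a positive integer: {raw_id!r}")
    return element_type, int(raw_id)
```

The string form never collides across types and leaves room for further namespaces, at the cost of the larger, less compressible representation noted above.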

fititnt · January 9, 2023, 3:05pm

People here will also be interested in this thread on the OSM Data Model started yesterday: