Service offering changesets with their respective old objects

Hi,
I was thinking about analysing changeset changes, and one of the issues right at the start of that pipeline is that changesets only contain the new, i.e. then-current, versions of the changed objects. So one always starts by fetching ALL the version-1 objects in that changeset. This is pretty resource-hungry and time-consuming, which makes tools like OSMCha slow.
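
To make that expensive step concrete, here is a minimal, untested sketch using only the public OSM API (the changeset id is a placeholder; the sed extraction assumes the usual one-element-per-line formatting of the osmChange download, and it naively assumes “version - 1” is the previous state):

CS=123456  # placeholder, replace with a real changeset id

# one extra API request per touched object - this is the resource-hungry part
curl -s "https://www.openstreetmap.org/api/0.6/changeset/$CS/download" \
  | grep -E '<(node|way|relation) ' \
  | sed -E 's/.*<(node|way|relation) id="([0-9]+)".* version="([0-9]+)".*/\1 \2 \3/' \
  | while read type id version; do
      prev=$((version - 1))
      # version 1 means the object was created in this changeset, nothing to fetch
      [ "$prev" -ge 1 ] && curl -s "https://www.openstreetmap.org/api/0.6/$type/$id/$prev"
    done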

Is there any service offering the changesets as separate objects, including their respective old object versions?

I guess that would be a fantastic one-stop shop for all these “let's analyse changesets live” ideas that aren't in a position to set up a full-planet PostGIS database just to quickly fetch version-1 objects.

Flo

Do augmented diffs from Overpass meet your need?

I just had a look at the API, and I don't even understand what I get in return from the API call. It's not changeset IDs, it's not real intermediate changes, it's some “before”/“after”, but to me this looks like a query on the current OSM dataset processing object timestamps or something.

Not really what i had in mind.

The format of the change is something I had in mind, though.

I thought more about a kind of “minutely diff” file containing the changesets with their corresponding before/after objects.

Is the reason for your question that OSMCha is slow or do you have a different use case in mind?

OSMCha has a backend batch job that queries the changes diff for each changeset with an adiff query from its own Overpass API instance and caches the result in a JSON format per changeset on AWS S3. So usually getting the changes in the client is fast, as it only needs to request that prepared JSON.

I was going to suggest those cached S3 JSONs (see also the blog post “Preparing accurate history and caching changesets”), but it seems they're currently empty (#719).

I’d like to do more QA based on live changesets coming in (OSMCha was just an example of an already existing user of such information). And for me, OSMCha is unusably slow. Most of the changesets don’t even load at all. I tried multiple browsers, and since the move it’s basically dead for me.

The point is - imagine we’d like to watch for changesets building a swastika. If I wanted to hide the change, I would create a polygon - and then, in a later changeset, move some of the nodes into the places they need to be.

As I don’t see the “full” picture of the applied change, it’s hard to do geometry detection. I’d need the before, the after, and all linked objects to make that decision.

And I could imagine tons of validations, like self-intersecting polygons, long-distance node moves, or linking nodes between foreign objects (linking ways to polygons). For all of these I would need before and after, and possibly the linked objects.

Currently this is pretty heavy lifting, as one would need an OSM DB mirror, dump all “to be touched” objects before applying the changeset, and then enrich the changeset with them. And you might even want all ways affected by the node moves in your enriched changeset.

So what I imagined was a service where I could request “enriched” changesets which have before/after objects, attached/linked objects, and probably even a “bbox” of the before data. (Imagine a swastika built out of multiple polygons that are never touched in the same changeset.)

And then simply attach your validation pipeline, whatever that is, to that service.

Flo

Overpass Augmented Diffs, or adiff queries in general, actually do provide some of those requirements. The OSMCha backend has a validation pipeline that compares changeset before/after state with osm-compare and feeds findings into the “Flagged Features” tab in the client.

However, using adiff to query the changes in a changeset by its bbox and time range, as OSMCha and achavi do for lack of a better option, has a number of issues: many changesets can’t be queried at all, and many have missing or even wrong changes. See also the related issue lists of OSMCha and Overpass.
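
For reference, the queries those tools send are roughly of this shape (an approximation only, not the exact query either tool sends; the time range and bbox are made up for illustration):

curl -s 'https://overpass-api.de/api/interpreter' --data-urlencode 'data=
[adiff:"2024-02-28T17:00:00Z","2024-02-28T17:30:00Z"][bbox:50.0,7.0,50.1,7.1];
(
  node(changed:"2024-02-28T17:00:00Z","2024-02-28T17:30:00Z");
  way(changed:"2024-02-28T17:00:00Z","2024-02-28T17:30:00Z");
  relation(changed:"2024-02-28T17:00:00Z","2024-02-28T17:30:00Z");
);
out meta geom;'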

So I agree, there should be a service that is dedicated to and optimized for requesting full changesets. I already suggested evaluating potential alternatives (osmcha-frontend#652) that might be used/extended to build such a service.

The Augmented Diffs have already been mentioned, but they are optimized for minute diffs or regional diffs and do not treat changesets specially.

The background of this is that the OSM data model is not fit for the straightforward use cases of “changeset visualisation”. The mother of all problems is the moved node in an otherwise unchanged way, but further issues exist. Every service is necessarily a workaround.

I suggest that you try the following, which will deliver all objects touched in the changeset in the state immediately before the changeset started. Note that there are changesets which may contain multiple versions of the same object, so “version - 1” is not a generally safe approach and is not considered further here.

Download and process the changeset (replace 148036051 with the actual changeset id, but it is a useful example) per

# assumes created_at is the second attribute on the <changeset ...> line,
# i.e. awk field $3; substr cuts out the 20-character timestamp
curl 'https://www.openstreetmap.org/api/0.6/changeset/148036051' \
  | grep -E '<changeset' | awk '{ print substr($3,13,20); }'

This gives you the start date of the changeset.

Download and process the list of objects in the changeset (replace the id again) per

# each element line like '<node id="123" ...>' becomes an Overpass QL
# statement like 'node(123);'
curl 'https://www.openstreetmap.org/api/0.6/changeset/148036051/download' \
  | grep -E '<(node|way|relation)' \
  | awk '{ print substr($1,2)"("substr($2,5,length($2)-5)");"; }'

This is the list of touched objects, one Overpass QL statement per line (e.g. node(123456789);).

Send (replace {{start_date}} with the obtained date and {{touched_objects}} with the obtained object list)

[date:"{{start_date}}"];
(
{{touched_objects}}
);
out meta;

to get a strict list of the touched objects as they were immediately before the changeset.
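
Putting the three steps together, a small script along these lines should work end to end (an untested sketch: it assumes the public instance at overpass-api.de and the same attribute ordering as the snippets above, with a sort -u added since a changeset may touch the same object more than once):

#!/bin/sh
CHANGESET=${1:-148036051}

# step 1: start date (created_at) of the changeset
START_DATE=$(curl -s "https://www.openstreetmap.org/api/0.6/changeset/$CHANGESET" \
  | grep -E '<changeset' | awk '{ print substr($3,13,20); }')

# step 2: touched objects as Overpass QL statements, deduplicated
TOUCHED=$(curl -s "https://www.openstreetmap.org/api/0.6/changeset/$CHANGESET/download" \
  | grep -E '<(node|way|relation)' \
  | awk '{ print substr($1,2)"("substr($2,5,length($2)-5)");"; }' \
  | sort -u)

# step 3: state of those objects immediately before the changeset started
curl -s 'https://overpass-api.de/api/interpreter' \
  --data-urlencode "data=[date:\"$START_DATE\"];($TOUCHED);out meta;"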

The remaining caveats are:

  • In rare circumstances two or more changesets are intertwined. In this case the response is well-defined but may contain parts of the intertwined changesets. This is in general not solvable, due to the design of the OSM data model.
  • The objects are of limited use because they may lack geometry. There are various options for geometry in Overpass which have different pros and cons. Pick your favourite.
  • The approach is limited to about 50000 objects per request because Overpass has a 1 MB upper limit for the request size. If you care about the few changesets that exceed that size (if any), then you can simply split the list into two or more requests (see the sketch below).
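
For the splitting, something along these lines would do (a sketch that assumes the object list was saved to a hypothetical file touched_objects.txt and that START_DATE is set as in the script above; 40000 statements per chunk stays safely under the 1 MB limit):

split -l 40000 touched_objects.txt chunk_
for f in chunk_*; do
  curl -s 'https://overpass-api.de/api/interpreter' \
    --data-urlencode "data=[date:\"$START_DATE\"];($(cat "$f"));out meta;"
done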

AFAIK Overpass in particular also has a problem in cases where multiple changes, from different changesets, were applied to a given object within the same second.

Also in cases where no edits from changeset A happened after any edits from changeset B (though that may count as a special case of “two or more changesets are intertwined”).

AFAIK Overpass in particular also has a problem in cases where multiple changes, from different changesets, were applied to a given object within the same second.

The Overpass API works as intended here.

To get back to the geometry problem: in the absence of explicit linking of versions, the ways and nodes effectively get linked via the timestamps.

In particular, this is a well-defined and easily understandable concept. The geometry of a way at a given timestamp is then computed from the coordinates of the latest versions of the referenced nodes that were already in place at that timestamp.
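
This rule is exactly what a date-scoped Overpass query applies. If the example data below were in the database, the geometry of way 100 as of a given instant could be requested like this (a sketch against the public instance):

curl -s 'https://overpass-api.de/api/interpreter' \
  --data-urlencode 'data=[date:"2024-03-01T08:11:12Z"];way(100);out meta geom;'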

In contrast, if you were to allow multiple different coordinates for a node within the same second, then you would forgo way geometry altogether. An example:

<node id="1001" lat="50.1" lon="1.1"
  version="1" timestamp="2024-03-01T07:00:00Z"/>
<node id="1001" lat="50.1" lon="1.2"
  version="2" timestamp="2024-03-01T08:11:12Z"/>
<node id="1001" lat="50.1" lon="1.3"
  version="3" timestamp="2024-03-01T08:11:12Z"/>
<node id="1002" lat="50.2" lon="1.1"
  version="1" timestamp="2024-03-01T07:00:00Z"/>
<node id="1002" lat="50.2" lon="1.2"
  version="2" timestamp="2024-03-01T08:11:12Z"/>
<node id="1002" lat="50.2" lon="1.3"
  version="3" timestamp="2024-03-01T08:11:12Z"/>
<way id="100" version="1" timestamp="2024-03-01T07:00:00Z">
  <nd ref="1001"/>
  <nd ref="1002"/>
</way>

Now, if you were to base geometry on node versions, would the way at some point in time have had

  • a geometry (50.1 1.2) (50.2 1.3), based on applying 1001v2 then 1002v3, or
  • a geometry (50.1 1.3) (50.2 1.2), based on applying 1002v2 then 1001v3, or
  • neither, because some arbitrary rule skips some node versions, denying the existence of either of these two intermediate states?

OTOH, changesets are, on a semantic level, a group of changes that the human user decided belong together. The community has more than once even urged people to give a meaningful comment on every changeset.

How likely is it that a user is shaping two groups of changes in parallel and completing them at the same moment? The much more likely course of events is that the user completes one changeset and then the next one seconds to minutes later, and only artifacts of the upload process might result in this situation. Or an editor that disregards the rules for good changesets.

In fact, only StreetComplete exhibits that behaviour. In every other editor, the human users apparently work on one task after another, properly comment the changesets, and have no need to upload multiple changesets in the same second.

There is zero priority to implement a special mode for StreetComplete changesets in Overpass which, as a severe side effect, would make it much harder to explain the data model to fellow mappers and data users, for no real benefit.

Mixing changes is actually a frequent issue (list) when querying StreetComplete changesets. I quickly found a current example just by looking at a couple of SC changesets. Overall, those are not such a big deal, as the edits are from the same user, often within seconds.

But from a single-changeset perspective this is just wrong:

For changeset 148249084, the comment says it’s about benches, but the tag diff also includes shelter from v2 of way 406047691 (history), which belongs to the previous changeset 148249083.

Why not? We would “just” need to forget about timestamps and compare versions instead?

The idea is to fix this by having a full history database that extends the data model with versioned refs and intermediate “minor” versions (for elements affected by moved nodes). Those would be derived on import and update, so at query time it would be all about resolving id references, without any timestamps involved.

The way in the example above would get two additional minor versions that represent the geometry-only changes, plus versioned refs, in the derived full history database:

<way id="100" version="1" timestamp="2024-03-01T07:00:00Z">
  <nd ref="1001v1"/>
  <nd ref="1002v1"/>
</way>
<way id="100" version="1.1" timestamp="2024-03-01T08:11:12Z">
  <nd ref="1001v2"/>
  <nd ref="1002v2"/>
</way>
<way id="100" version="1.2" timestamp="2024-03-01T08:11:12Z">
  <nd ref="1001v3"/>
  <nd ref="1002v3"/>
</way>

The assignment to minor versions would probably be done by changeset (split into multiple minor versions when changesets are mixed).

Now, if we wanted to visualise the changeset containing moved nodes 1001v2 and 1002v2, they would resolve to way 100v1.1, and we would compare that to its previous version 100v1.

The concept of minor versions seems to be implemented by OSHDB and osm-wayback:

The @contributionChangesetId can be different from the general @changesetId in cases where a contribution stems from changed child elements referenced by an OSM element, e.g. when only the nodes of an OSM way are rearranged or moved. This is sometimes called a “minor version”.

Response Parameters — ohsome API 1.10.1 documentation

We call these changes minor versions. To account for these edits, we add a metadata field to an object called minor version that is 0 for newly created objects and > 0 for any number of minor version changes between a major version. When another major version is created, the minor version is reset to 0.

State of the Map US 2018: OpenStreetMap Data Analysis Workshop - Blog von Jennings Anderson

I get some of the problems now - I had a look at the replication changesets, and one of the issues is that ways and relations do not have versioned refs.

So we always assume we have only one version, but we drop the temporal change of objects on the floor.

So it’s basically impossible to get an exact replica of the changes the user made from the changesets alone, as they might be intermingled with other changesets, and you never know which node versions a way referenced.

doh

So the only place we REALLY know what the user uploaded is the API - so we would need to write out the changes uploaded by a user from the API. And even then: a changeset may be open for hours, with changes being added to it at any point in time. So there is no “point in time” for a changeset. A changeset is just a loose aggregation of changes, and we only use the version to avoid “last write wins” type problems, by notifying the user of conflicts, but on an individual object level.

Hmmm

Flo