What is the structure behind the paths of GeoFabrik's osc.gz files?

I would like to see how much Mapillary imagery helps OpenStreetMap. My first attempt at this analysis is here, but it has the serious mistake of not accounting for changesets. I would like to correct that error and look into changesets. As an example, I looked at Lithuania’s GeoFabrik page.

The line I find most relevant:

  • .osc.gz files that contain all changes in this region, suitable e.g. for Osmosis updates

The .osc.gz link leads me to:

where state.txt is:

# original OSM minutely replication sequence number 5572108
timestamp=2023-05-12T20\:21\:24Z
sequenceNumber=3693

and 000/ contains the empty directories of 001 and 002, and the non-empty 003:

If we download 563.osc.gz and decompress it (curl http://download.geofabrik.de/europe/lithuania-updates/000/003/563.osc.gz -o 563.osc.gz, then gzip 563.osc.gz --decompress), we get 563.osc. The first few lines:

<?xml version='1.0' encoding='UTF-8'?>
<osmChange version="0.6" generator="osmium/1.14.0">
  <modify>
    <node id="32083700" version="25" timestamp="2023-01-02T20:10:26Z" uid="0" user="" changeset="0" lat="54.6681112" lon="25.2388765">
      <tag k="highway" v="motorway_junction"/>
      <tag k="name" v="Tūkstantmečio g."/>
    </node>
    <node id="56619670" version="4" timestamp="2023-01-02T08:23:47Z" uid="0" user="" changeset="0" lat="54.7106057" lon="25.3659565"/>
    <node id="56619671" version="4" timestamp="2023-01-02T08:23:47Z" uid="0" user="" changeset="0" lat="54.7104167" lon="25.3655478"/>
    <node id="101293595" version="21" timestamp="2023-01-02T20:07:43Z" uid="0" user="" changeset="0" lat="55.3669568" lon="23.8882335">
      <tag k="crossing" v="uncontrolled"/>
      <tag k="crossing:island" v="no"/>
      <tag k="highway" v="crossing"/>
    </node>
    <node id="249526548" version="12" timestamp="2023-01-02T08:23:47Z" uid="0" user="" changeset="0" lat="54.7126483" lon="25.3739687"/>
    <node id="266373801" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7025542" lon="25.2678349"/>
    <node id="266373802" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7026282" lon="25.2676597"/>
    <node id="266373803" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7026471" lon="25.2670489"/>
    <node id="266373804" version="3" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7025945" lon="25.2665295"/>
  </modify>
  <delete>
    <node id="266373805" version="4" timestamp="2022-11-06T07:23:49Z" uid="0" user="" changeset="0" lat="54.7025665" lon="25.2666452"/>
  </delete>
  <modify>
    <node id="266373808" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7022033" lon="25.2660959"/>
    <node id="266373809" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7019941" lon="25.2663473"/>
    <node id="266373810" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7018598" lon="25.2665701"/>
    <node id="266373811" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7017539" lon="25.266781"/>
    <node id="266373812" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7017206" lon="25.2668672"/>
    <node id="266373813" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7017328" lon="25.2673217"/>
    <node id="266373814" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7017784" lon="25.2674612"/>
    <node id="266373815" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7018759" lon="25.2676558"/>
    <node id="266373816" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7020041" lon="25.2679182"/>
    <node id="266373817" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7021039" lon="25.2678356"/>
    <node id="266373818" version="5" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7021954" lon="25.2677495"/>
    <node id="266373819" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7022408" lon="25.2677484"/>
    <node id="266373820" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7023579" lon="25.2675678"/>
    <node id="266373821" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7024179" lon="25.2675946"/>
    <node id="266373822" version="5" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7024073" lon="25.2677976"/>
    <node id="266373823" version="4" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.702462" lon="25.2678543"/>
    <node id="266373824" version="5" timestamp="2023-01-02T15:04:09Z" uid="0" user="" changeset="0" lat="54.7024963" lon="25.2678598"/>
    <node id="267148401" version="3" timestamp="2023-01-02T11:27:28Z" uid="0" user="" changeset="0" lat="54.7035275" lon="25.29938">
      <tag k="noexit" v="yes"/>
    </node>

This looks promising. The .txt files are similar to the one quoted above. To proceed, though, it seems I need to understand how these osc.gz files are structured.

By what logic do OpenStreetMap edits end up in 563.osc.gz rather than in another file, e.g. 564.osc.gz? I am interested in how this path structure is set up, and which changes go into each file.
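
As a rough way to see what a single diff file holds, the sketch below tallies the actions and element types in an OsmChange document and collects the changeset attribute values. It assumes the layout shown in the excerpt above; the function name tally_osc and the shortened sample are illustrative only, not part of any OSM tool.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tally_osc(fileobj):
    """Return (counts, changeset_ids) for an OsmChange document.

    counts maps (action, element type) to the number of objects;
    changeset_ids is the set of changeset attribute values seen.
    """
    root = ET.parse(fileobj).getroot()
    counts = Counter()
    changeset_ids = set()
    for action in root:      # <create>, <modify> or <delete> blocks
        for obj in action:   # <node>, <way> or <relation> objects
            counts[(action.tag, obj.tag)] += 1
            changeset_ids.add(obj.get("changeset"))
    return counts, changeset_ids


# Shortened sample in the shape of the excerpt (attributes trimmed).
SAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<osmChange version="0.6" generator="osmium/1.14.0">
  <modify>
    <node id="56619670" version="4" changeset="0"/>
  </modify>
  <delete>
    <node id="266373805" version="4" changeset="0"/>
  </delete>
  <modify>
    <node id="266373808" version="4" changeset="0"/>
  </modify>
</osmChange>"""

counts, changeset_ids = tally_osc(io.BytesIO(SAMPLE))
print(counts)
print(changeset_ids)

# For the real download, open the compressed file directly:
#   import gzip
#   with gzip.open("563.osc.gz", "rb") as f:
#       counts, changeset_ids = tally_osc(f)
```

In the excerpt every object carries changeset="0", so a tally like this can describe what changed but cannot link an edit back to a changeset.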

Alas, the osmchange files that you’re looking at now don’t contain changeset metadata either. As an example, here is the osmchange file for a recent changeset of mine. The source for that change is in changeset metadata, here. The relevant “minutely change file” (analogous to your osc files at GeoFabrik) is here. You can see that there is no source information in that file.


Thank you. How did you find the right URL for the minutely change file?

What you need is the world-wide “Latest Weekly Changesets” (changesets-latest.osm.bz2) dump from https://planet.openstreetmap.org/.

Tools to import that into a PostgreSQL database are e.g.: ChangesetMD, osmchanges-postgres. The Osmium Tool changeset-filter command can filter that file by timestamp and bbox before importing.

As a preview of the content structure of that file you could look at a minutely changeset file, e.g. 027.osm.gz.

There are also cloud-based options to query changesets as an alternative:

For a quick analysis of recent changesets you can search for “mapillary” in changeset comments or tags with these online tools:


To answer the original question, the structure of the OsmChange replication files is described in Planet.osm/diffs.
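
Going by that description, the sequence number is zero-padded to nine digits and split into three groups of three, which gives the directory path. A minimal sketch in Python (the helper names replication_path and diff_url are made up for illustration; the base URL is the one from the question):

```python
def replication_path(sequence_number: int) -> str:
    """Map a replication sequence number to its AAA/BBB/CCC path."""
    if not 0 <= sequence_number < 10**9:
        raise ValueError("sequence number must fit in nine digits")
    padded = f"{sequence_number:09d}"
    return f"{padded[0:3]}/{padded[3:6]}/{padded[6:9]}"


def diff_url(base_url: str, sequence_number: int) -> str:
    """Build the URL of the .osc.gz file for a sequence number."""
    return f"{base_url.rstrip('/')}/{replication_path(sequence_number)}.osc.gz"


BASE = "http://download.geofabrik.de/europe/lithuania-updates/"

print(replication_path(3693))  # sequenceNumber from the quoted state.txt
print(diff_url(BASE, 3563))    # the file downloaded in the question
```

With the numbers from the question, sequence 3693 is dated 2023-05-12 and the objects in 563.osc.gz (sequence 3563) are dated 2023-01-02. That is 130 days and 130 sequence numbers apart, which is consistent with one file per day for this extract.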


Yes. That’s “all changesets since the dawn of time” (or the introduction of the concept of changesets). If you just want to search for text in it you can try:

bzcat changesets-latest.osm.bz2 | grep -B 3 -i 'sometext' > sometext.txt

where “-B 3” says “please print 3 lines of context before the line that contains ‘sometext’”. With grep you can also print lines after the match with -A.
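
If grep's line-based context is too coarse, the same search can be sketched in Python so that each hit is a whole changeset with its tags. This assumes the dump is an osm root element containing changeset elements with tag children; the function name changesets_mentioning and the two sample changesets are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def changesets_mentioning(fileobj, needle):
    """Yield (changeset id, tags) for changesets whose tags mention needle."""
    needle = needle.lower()
    context = ET.iterparse(fileobj, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "changeset":
            continue
        tags = {t.get("k", ""): t.get("v", "") for t in elem.findall("tag")}
        if any(needle in k.lower() or needle in v.lower() for k, v in tags.items()):
            yield elem.get("id"), tags
        root.clear()  # drop finished changesets so memory use stays flat


# Invented sample data in the shape of the changeset dump.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <changeset id="1" created_at="2023-05-01T10:00:00Z">
    <tag k="comment" v="Added crossings"/>
    <tag k="source" v="Mapillary"/>
  </changeset>
  <changeset id="2" created_at="2023-05-01T11:00:00Z">
    <tag k="comment" v="Fixed a typo"/>
  </changeset>
</osm>"""

hits = list(changesets_mentioning(io.BytesIO(SAMPLE), "mapillary"))
print(hits)

# For the real dump, stream it without unpacking it to disk first:
#   import bz2
#   with bz2.open("changesets-latest.osm.bz2", "rb") as f:
#       for changeset_id, tags in changesets_mentioning(f, "mapillary"):
#           print(changeset_id, tags.get("comment", ""))
```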


I knew it was below https://planet.openstreetmap.org/replication/, clicked on “minute” in there, and then drilled down until I had found the file with the timestamp that corresponds to my change. The files and directories are written sequentially, and I knew the time of my change from the original changeset.
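
The drill-down by timestamp could in principle be scripted by reading state files instead of clicking through directories. A small sketch that parses a state.txt like the one quoted in the question, assuming the key=value layout with backslash-escaped colons shown there (parse_state is a hypothetical helper, not an existing tool):

```python
from datetime import datetime, timezone


def parse_state(text):
    """Parse state.txt content into a sequence number and a UTC timestamp."""
    raw = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        raw[key] = value.replace("\\:", ":")  # undo the escaped colons
    timestamp = datetime.strptime(raw["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "sequenceNumber": int(raw["sequenceNumber"]),
        "timestamp": timestamp.replace(tzinfo=timezone.utc),
    }


# The state.txt content quoted in the question.
STATE = r"""# original OSM minutely replication sequence number 5572108
timestamp=2023-05-12T20\:21\:24Z
sequenceNumber=3693
"""

state = parse_state(STATE)
print(state["sequenceNumber"], state["timestamp"].isoformat())
```

Comparing the parsed timestamp with the time of the change you are looking for tells you whether to look at a lower or higher sequence number.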


Osmium Tool can convert to OPL File Format (Object Per Line) that’s easier to grep and also filter by time, e.g. since Mapillary started, like this:

osmium changeset-filter --after 2023-05-01T00:00:00Z -f opl changesets-latest.osm.bz2 | grep 'sometext'

Country-based analysis will be a bit more involved, see woodpeck’s comment in the other thread.
