OMA - a new OSM File Format

About a year ago I wrote a study on the OSM data model (in German) and discussed it in the German forum. This discussion led me to develop a new OSM file format, which I have been working on over the past year (along with a converter and a library for using it).

It’s almost ready now, but before I finalize version 1.0, I’d love to get some feedback. So I wrote a blog post about the new format. I’m planning to add more posts that will go into more detail in the coming weeks.

More information can be found on GitHub:

That sounds awesome!

In your post you describe in great detail how to parse “smaller” objects. How does this work with your chunks and larger objects, for example if I want to get the Rhine river instead of the Viktorstraße?

How do you think your file format would perform on rendering tasks, where I need more or less 75% of the data at some point? I would think resolving the nodes and ways still makes sense, though the clustering might not be so useful (or am I wrong?). And would this save conversion time?

Have you thought about being able to update your files, or would the intention be to keep the *.pbf, keep it updated and convert it to your format?

That’s where the blocks come in: Only the blocks with key waterway need to be searched (and only the ones in the chunks with ways). That’s an even smaller part of the file than was needed for the Viktorstraße. Querying the Rhine river took 2.2 seconds. Creating this image took 3.6 seconds (it included querying the boundary of Germany):

(The Rhine river at the south west border of Germany seems not to be part of Germany. At least it’s not in the extract.)

I wondered how long it takes to create a map with all rivers (waterway=river) of Germany. That was even faster (2.4 seconds), because knowing the value of the waterway key I could make use of the slices:

That depends strongly on what you want to draw. If you need 75% of the file you obviously need to read 75% of the file. Reading the whole file takes a little bit more than 2 minutes. So if you need 75% it should take about 1:30 at least.

There is still one advantage: You can query the elements in the order you need them: First the background areas, then the ways and so on. You don’t need to keep them all in memory and sort them. And in most cases the querying time doesn’t differ much. (Respecting layer=* might complicate things though.)

But there are tasks that take much time. For example if you want to query everything that’s lit (lit=yes) you need to read the whole file. There is no shortcut in this case.
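
To illustrate why the waterway queries touch so little of the file while lit=yes forces a full scan, here is a minimal Python sketch. The layering (chunks per element type, blocks per key, slices per value) follows the description in this thread; the class and function names, the list of block keys, and the in-memory layout are assumptions made for illustration and are not the actual Oma library API.

```python
from dataclasses import dataclass, field

# Hypothetical keys that get their own block; all other elements end up
# in a catch-all block "". This list is NOT taken from the Oma format.
BLOCK_KEYS = ("highway", "waterway", "building")


@dataclass
class Element:
    kind: str   # "node", "way" or "area"
    tags: dict


@dataclass
class Chunk:
    kind: str
    # blocks[key][value] -> list of elements (one "slice")
    blocks: dict = field(default_factory=dict)

    def add(self, element):
        # Simplification: assumes an element carries at most one block key.
        key = next((k for k in BLOCK_KEYS if k in element.tags), "")
        value = element.tags.get(key, "")
        self.blocks.setdefault(key, {}).setdefault(value, []).append(element)


def query(chunks, key, value=None, kind=None):
    """Return (matching elements, number of slices that had to be read)."""
    matches, slices_read = [], 0
    for chunk in chunks:
        if kind is not None and chunk.kind != kind:
            continue                      # skip chunks of other element types
        for block_key, block in chunk.blocks.items():
            if key in BLOCK_KEYS and block_key != key:
                continue                  # skip blocks of other keys
            for slice_value, elements in block.items():
                if block_key == key and value is not None and slice_value != value:
                    continue              # skip slices of other values
                slices_read += 1
                matches += [e for e in elements
                            if key in e.tags
                            and (value is None or e.tags[key] == value)]
    return matches, slices_read


ways = Chunk("way")
for tags in (
    {"waterway": "river", "name": "Rhein"},
    {"waterway": "stream"},
    {"highway": "residential", "lit": "yes"},
    {"highway": "motorway"},
    {"building": "yes"},
    {"barrier": "fence"},
):
    ways.add(Element("way", tags))
nodes = Chunk("node")
nodes.add(Element("node", {"highway": "street_lamp", "lit": "yes"}))
chunks = [ways, nodes]

rivers, n_rivers = query(chunks, "waterway", "river", kind="way")
waterways, n_waterways = query(chunks, "waterway", kind="way")
lit, n_lit = query(chunks, "lit", "yes")
print(n_rivers, n_waterways, n_lit)
```

In this toy data set the waterway=river query reads one slice, the waterway query reads two, and the lit=yes query has to read every slice of every chunk, because lit is not a block key here.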

There are plans for change files, but I haven’t started with that yet; the file format needs to be stable first, I think. (There are no plans for history files, though.)

If I understand this correctly, you are dropping the topology information by storing coordinates rather than node references. For some use cases this will be an advantage, and for others it will create problems. For example, simplifying boundaries or other adjacent polygons (or polygon borders and lines, e.g. fences or walls) will only be possible for individual objects, and gaps or overlaps may turn up as a side effect of the simplification.

That’s correct, although I wouldn’t call it “dropping the topology”. Assuming that nodes with identical coordinates are identical, you can recalculate the topology whenever you need it. With the Oma format, the focus regarding which operations are fast has shifted. (That’s why data formats are considered more important than algorithms in computer science.)

Concerning the assumption: I think it is quite reasonable, because it’s very difficult to create two separate nodes with identical coordinates with the popular editors. But there might be cases where this does not work as expected.

You have to be careful in this case, of course. Depending on your needs you might have to recalculate the topology constraints for the objects in question, as I pointed out above. I never tried things like that though.
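
As a rough illustration of recalculating topology from coordinates, here is a minimal Python sketch that finds points shared by more than one way. It relies on the assumption stated above (identical coordinates mean identical nodes). The function name and the data layout are hypothetical; coordinates are stored as integers in units of 1e-7 degrees so that they can be compared exactly. Only the junction coordinate and the two street names come from this thread; all other points are made up.

```python
from collections import defaultdict


def shared_points(ways):
    """Map every coordinate used by more than one way to those ways' names."""
    users = defaultdict(set)
    for name, coords in ways.items():
        for point in coords:
            users[point].add(name)
    return {p: sorted(names) for p, names in users.items() if len(names) > 1}


# (lat, lon) in units of 1e-7 degrees. Only the shared point
# N51.2739774, E7.1994162 is real; the other points are invented.
ways = {
    "Viktorstraße": [(512739000, 71990000), (512739774, 71994162), (512741000, 71999000)],
    "Krautsberg": [(512739774, 71994162), (512735000, 71996000)],
}

junctions = shared_points(ways)
print(junctions)
```

Two nodes that merely happen to lie at the same position would be reported as a junction too, which is exactly the case where the assumption fails.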

This use case ties in with the rendering question: typically, to render at low zoom, you need to simplify borders, rivers, motorways, …

I thought about this simplification question, but could not come up with an example where there is a real problem.

I tried to simplify motorways and rivers. In case of motorways my algorithm was able to remove about 30% of the nodes. No visual difference can be seen. For further removal, the ways first need to be reconnected (OSM tends to dissect highways quite often), which is something that would take much more time to implement for me. (For rivers about 70% of the nodes could be removed and there is some minor difference visible. Still, reconnection would improve things here too.)
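
The reconnection step could look roughly like the following Python sketch. It is not the implementation used for the numbers above, only a greedy illustration with a hypothetical function name. It joins ways whose last point equals the first point of another way, and it ignores tags, reversed ways and branching junctions, all of which a real implementation would have to handle.

```python
def reconnect(ways):
    """Greedily join ways whose end point equals another way's start point."""
    chains = [list(way) for way in ways]
    merged = True
    while merged:
        merged = False
        for i, a in enumerate(chains):
            for j, b in enumerate(chains):
                if i != j and a[-1] == b[0]:
                    chains[i] = a + b[1:]   # drop the duplicated joint point
                    del chains[j]
                    merged = True
                    break
            if merged:
                break                        # restart the scan after each merge
    return chains


# Three pieces of one dissected way, given in arbitrary order.
pieces = [
    [(0, 0), (1, 0)],
    [(2, 0), (3, 1)],
    [(1, 0), (2, 0)],
]
chains = reconnect(pieces)
print(chains)
```

After reconnecting, a simplifier sees one long line instead of several short ones and is no longer forced to keep every former end point.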

As I would like to understand this problem, maybe one of you two could give a concrete example where references to separate nodes are clearly better?

Let me give it a try:

In current file formats, node 297920737 connects Viktorstraße and Krautsberg. No matter how I modify the position of that node, it will always connect both roads.

After you resolve both ways, there is no connection anymore, as I understand your description. Both ways will contain a point at N51.2739774, E7.1994162, but they might be simplified in different ways. I would assume that this will result in gaps while rendering, and I’m not sure about routing or about queries like “roads connected to Viktorstraße”.

Don’t get me wrong, the above might not be the problem your file format tries to solve.

Well yes, that could happen. But it depends strongly on how you simplify. If you just remove nodes, the gap would also be there with referenced nodes. To avoid this, simplification can only be done by moving nodes, never by removing them. I’m not sure if this is really what I think of as simplifying. Anyway, I got the idea.

Here’s a very common simplification algorithm used for geodata: Ramer–Douglas–Peucker algorithm - Wikipedia
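
To make the gap problem discussed above concrete, here is a small Python sketch of Ramer-Douglas-Peucker on planar coordinates, plus a variant that never removes a given set of points. The function names, the toy coordinates and the `keep` parameter are assumptions for illustration; the set of junction points is assumed to be known already, for example from node references or from comparing coordinates.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))            # clamp to the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: drop points within epsilon of the simplified line."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right             # the split point is in both halves


def rdp_keep(points, epsilon, keep):
    """Like rdp, but split at points in `keep` so they are never removed."""
    result, start = [], 0
    for i, p in enumerate(points):
        if i == len(points) - 1 or (i > start and p in keep):
            part = rdp(points[start : i + 1], epsilon)
            result += part if not result else part[1:]
            start = i
    return result


junction = (5.0, 0.1)
main_road = [(0.0, 0.0), junction, (10.0, 0.0)]
side_road = [junction, (5.0, 5.0)]

plain = rdp(main_road, 0.5)
safe = rdp_keep(main_road, 0.5, {junction})
print(plain)   # the junction is gone, so side_road no longer touches main_road
print(safe)    # the junction survives
```

The plain variant removes the junction from the main road regardless of whether it was stored as a node reference or as a coordinate, so the side road ends in the air. The second variant avoids this by treating junctions as fixed points.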

For areas, this might be the better choice for simplifying geometry: Visvalingam–Whyatt algorithm - Wikipedia

You have to be careful in this case, of course. Depending on your needs you might have to recalculate the topology constraints for the objects in question, as I pointed out above. I never tried things like that though.

Yes, for many use cases it is not relevant whether two overlapping lines are the same or just “coincidentally” overlapping, or whether two ways are connected or just happen to end at the same coordinates. But if you need this information, I think you should not throw it away and try to reconstruct it afterwards. If you don’t need it, then it’s great if certain queries on the file perform better than with any other file format, and nice names as well.

OK. I’ll just accept it like this, and keep it in mind during future development. However, for the time being I’m not going to change the format to keep the topological information. The overhead and the drawbacks this would cause would far outweigh the improvement (in my opinion).

The second post of my blog series is now published:

Third post:
