Data Normalisation & OSM

Continuing the discussion from Multiple delimited names in the name tag:

OSM is the epitome of an unnormalised database, and works pretty well as such.

Normalising issues such as the one in the thread is often very complex, as @Minh_Nguyen demonstrates in his reply.

This is true even for well-defined use cases. I have fond memories of a data model for handling public holidays in Switzerland which required about 50 entities IIRC, but which just would not have worked, for instance, in the UK. My post box examples in the lightning talk I gave at the Heidelberg SotM are simple examples of normalisation problems which OSM simply does not have to worry about. There are many things in OSM way more complicated than a post box.

For the record, I’m a huge fan of the power of appropriate approaches to normalisation in systems design, but as someone who did it for a living I’m also aware that it can be an obstacle when managing the “Good, fast, cheap: choose two” constraint.

4 Likes

Even in an ideal scenario, I believe there are at least two groupings that need to optionally allow further customisation via stricter specifications, plus a clean-up stage (i.e. the topic of this thread):

  1. By region. Start by defaulting at world level, then get specific.
  2. Both for memory efficiency and because conflict rules (which would break reasoning) optionally need to be tested one “layer” at a time. This also allows us to unpack nodes/ways that actually have several concepts mapped together, so it is really not viable to do such processing without stages.
  3. Things that are too hard to do at the more semantic levels, but trivial to do while loading the data (e.g. parsing things like these delimiters, opening hours, etc.), should be done very early. When in doubt, discard the data and warn.

Point 1 is mostly how the world already works. While it is desirable to have a single global schema, sometimes it is easier to tolerate small shifts by region.

Point 2 is mostly how the average data consumer (with the exception of those who actually render full maps, like Carto) tends to work. They already tend to focus on small parts, like just roads plus some amenities, while not caring about trees mapped individually (which, again, would blow up memory).

Point 3 is mostly because it takes far fewer lines of code, allows some hardcoded optimisations, and also deals with deprecated data without actually needing to change it. It would be possible, for example, to use even high-level semantics to select data (e.g. RDF, SHACL, etc.) but treat values as opaque strings, and then some code actually parses those strings into something we may or may not use. Another use of point 3 is that, if it is too hard to use other methods to “upgrade” data to other models, someone could just use IF/ELSE logic to coerce the data into the tags they would prefer.
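
As a rough sketch of point 3 (the function name, the coercion table and the layer check are my own illustrative assumptions, not an existing pipeline), an early clean-up stage could look something like this:

```python
import logging

log = logging.getLogger("loader")

# Hypothetical coercion table: deprecated tag/value pairs rewritten to the
# ones this pipeline prefers (the entries are placeholders, not a proposal).
COERCE = {("old_key", "old_value"): ("preferred_key", "preferred_value")}


def load_element(tags: dict) -> dict:
    """Early, cheap clean-up stage run while loading the data."""
    cleaned = {}
    for key, value in tags.items():
        # Plain if/else coercion of deprecated data into the preferred tags.
        if (key, value) in COERCE:
            key, value = COERCE[(key, value)]

        # Trivial-at-load-time parsing: layer values are expected to be integers.
        if key == "layer":
            try:
                value = str(int(value))
            except ValueError:
                log.warning("discarding unparsable layer value %r", value)
                continue  # when in doubt, discard the value and warn

        cleaned[key] = value
    return cleaned
```

The point is that all of this happens while loading, before any semantic reasoning, so deprecated or unparsable values never reach the later stages.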

Does it? I would rather say the non-normalized part of OSM, i.e. everything not documented in the wiki, is quite likely not considered by the common renderers.
For example: OSM does indeed allow me to insert highway=BundesstraĂźe, but no one will render it, and most likely another mapper will change it to the proper, normalized tag highway=primary. :wink:

My experience is rather: OSM prefers normalized data, but does not enforce it.
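
A tiny sketch of what “not considered by the common renderers” means in practice (the whitelist below is purely illustrative, not any renderer’s actual style rules):

```python
# Illustrative only: a renderer-style whitelist of documented highway values.
RENDERED_HIGHWAY_VALUES = {"motorway", "trunk", "primary", "secondary", "residential"}


def style_for(tags: dict):
    value = tags.get("highway")
    # Undocumented, non-normalized values such as "BundesstraĂźe" simply fall
    # through and are not rendered at all.
    return value if value in RENDERED_HIGHWAY_VALUES else None


print(style_for({"highway": "primary"}))       # primary
print(style_for({"highway": "BundesstraĂźe"}))  # None -> ignored until a mapper fixes it
```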

1 Like

I took some time to read past attempts to “make OpenStreetMap data more semantic” in the RDF sense; however, the ones that are actually still used today have a strong relationship with delivering the full thing. The more academic and/or buzzwordy the terms, the less likely they are to be adopted. And I think there are practical issues (e.g. people actually tried them in production, but they didn’t have the performance of the alternatives).

Because OpenStreetMap data is almost always explicitly anchored with spatial relations, even a de facto optimised representation of it in a traditional triplestore would be massive. So if Wikidata is already looking for alternatives to handle their query engine, there is no tool today able to handle the full OSM data. The closest to this is this paper

but it takes as much as 48 days to process the full planet. So, realistically speaking, it is better to work towards using the explanation of how concepts are represented in tags to then rewrite queries so that they can be run very efficiently on Overpass. Since these queries would be complex for humans to write by hand, I think we could already use the concept representation for validation, and in some years for the full query. If necessary, we can try to optimise some cache for Overpass, but generic SPARQL databases will never cope with OSM.
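
To make the “rewrite into Overpass” idea concrete, here is a minimal sketch that sends a hand-written Overpass QL query to the public Overpass API; the assumption is that a future tool would generate queries like this one from the formal description of a concept such as “primary highway in Switzerland”:

```python
import json
import urllib.parse
import urllib.request

# Hand-written Overpass QL; the idea is that a tool would eventually generate
# queries like this from the formal description of the concept.
QUERY = """
[out:json][timeout:60];
area["ISO3166-1"="CH"]["admin_level"="2"]->.searchArea;
way["highway"="primary"](area.searchArea);
out center 10;
"""


def run(query: str) -> dict:
    data = urllib.parse.urlencode({"data": query}).encode()
    with urllib.request.urlopen("https://overpass-api.de/api/interpreter", data) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for element in run(QUERY)["elements"]:
        print(element["id"], element["tags"].get("name", "(unnamed)"))
```

The query itself stays cheap to run; the hard part, which a human no longer writes by hand, is producing it from the concept definitions.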

One problem is that pretty much every buzzword we might use to explain something new before it is actually ready has, at some point in the past, been promised as revolutionary and then not delivered. It might feel hostile at first, but even the DWG’s complaints about imports (from what I perceive) exist because by far the easiest way to break OSM is through imports, so it makes sense. On Wikidata, bad imports would be far harder to perceive than on OpenStreetMap, because on OSM anyone can see things on the map; but how do you visualise abstract things that may be lost on Wikidata? If we assume that it is easier to see errors on OSM than on Wikidata, and that average non-experts, not massive imports, tend to improve the result, things make sense.

With all this said, and being realistic that even Overpass developers have argued in public that it is hard for someone new, like me, to get “buy-in” from OSM developers for their ideas: it seems it is not about being hostile to new ideas, just that developers have a strong culture of “talk is cheap, show the code”. I do understand people got some hope from Sophox being somewhat fast; however, Blazegraph was one of the worst triplestores at achieving GeoSPARQL compliance.

So yes, I fully agree we could try ways to formalise parts of the data, like the separators within fields, and other simpler things. I also agree this has massive potential to be reused inside OpenStreetMap, because every developer will prefer it (but, again, unless things are very broken, it is better to assume the original data is kept unchanged).

About better formalising the tags and concepts themselves

OK, going back to what I think might be easier: the closest thing to a place that stores how values are expected to look is the OpenStreetMap Wiki. TagInfo is the closest production-ready use of data from the Wiki (and even the Data Items got stuck). Also, even Wikibase and Semantic MediaWiki somewhat store data as if it were wiki content (they are more formal than infoboxes, but the underlying storage is still just plain text in an SQL database).

However, the current implementation of the OpenStreetMap infobox equivalents doesn’t hold sufficient information. And if we start adding new parameters, since these would obviously be used to check the consistency of the data, we need to check the consistency of whatever checks the consistency. Add to this that even if we restrict who can add new parameters, over the years things might conflict with each other, so we would literally need to plan the whole thing ahead.

Also, on OpenStreetMap there have been complaints: for example, the idea of attaching a specific tag to a Wikidata item was discarded because it was often used wrongly. This means that eventually even the identifier for what the “primary highway” concept represents must never be the tag we use for it; that would be the same as trying to argue with a human ontologist that the name of a person is the person (TL;DR: having an IRI to represent the abstract idea of “primary highway” that is different from the tags allows for stricter checks, including formalising differences between countries). So my argument here is that, since we cannot rely on external identifiers (such as a code to express the concept of World/Continent/Country-name), we need to formalise this ourselves, because it is used by other rules.
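
As a hedged illustration of that separation (the IRI and the per-country entries are invented for this example), the concept identifier could live apart from the tags that express it:

```python
# Invented IRI: the abstract concept is deliberately not the tag itself.
PRIMARY_HIGHWAY = "https://example.org/osm-concepts/PrimaryHighway"

# How the concept is expressed in tags, with room for per-country refinements.
CONCEPT_TAGS = {
    PRIMARY_HIGHWAY: {
        "default": {"highway": "primary"},
        # A country profile can tighten the check without redefining the
        # concept, e.g. also expecting a "ref" of a certain shape (illustrative).
        "DE": {"highway": "primary", "ref": r"^B \d+$"},
    }
}


def expected_tags(concept: str, country: str = "default") -> dict:
    # Fall back to the world-level profile when no country profile exists.
    profiles = CONCEPT_TAGS[concept]
    return profiles.get(country, profiles["default"])
```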

I know this might feel hard, but structural concepts cannot be offloaded, not even to Wikidata. Labels and translations, OK, but not something that could break a continuous integration pipeline; that would never get “buy-in”. I mean, eventually we could start with the low-hanging fruit, like the concept of the world as a whole, then attach default rules, so that if someone somewhere in the world types one additional zero, as in highway=residential + maxspeed=200, then in the worst case the generic rule would apply and reject that data.
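
A minimal sketch of such a world-level default rule (the 120 km/h threshold and the numeric-only parsing are assumptions for illustration, not a proposed standard):

```python
# Hypothetical world-level default: implausible maxspeed for residential roads.
MAX_PLAUSIBLE = {"residential": 120}  # km/h; a country profile could override this


def accept_maxspeed(tags: dict) -> bool:
    highway = tags.get("highway")
    raw = tags.get("maxspeed", "")
    if highway not in MAX_PLAUSIBLE or not raw.isdigit():
        return True  # no applicable rule, or a unit/format this sketch does not handle
    return int(raw) <= MAX_PLAUSIBLE[highway]


assert accept_maxspeed({"highway": "residential", "maxspeed": "20"})
assert not accept_maxspeed({"highway": "residential", "maxspeed": "200"})  # extra zero rejected
```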

The case of the initial proposal (about how to split fields) might not even need full RDF; we could just encode such information along with the tags themselves (and, in the case of regexes, since there is more than one flavour of regex, we would need to recommend one for every popular language). And for things that are too complicated to express by rules, this must also deliver snippets of software that anyone could use for that rule.
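
For the splitting rule itself, the shipped snippet could be as small as this (the semicolon-with-optional-whitespace pattern is an assumption about the delimiter convention, and it is written in Python’s regex flavour; other languages would need their own equivalent):

```python
import re

# Assumed delimiter convention: semicolons, optionally padded with whitespace.
SPLIT_PATTERN = re.compile(r"\s*;\s*")


def split_names(value: str) -> list[str]:
    # Drop empty parts produced by stray or trailing delimiters.
    return [part for part in SPLIT_PATTERN.split(value) if part]


print(split_names("Wien; Vienna ;Vienne"))  # ['Wien', 'Vienna', 'Vienne']
```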

@fititnt if you want a discussion please don’t post so much at any one time; it makes the thread a bit of a monologue.

@aighes I think you may be conflating normalisation with harmonisation when discussing a highway tag. The latter (having values which are synonyms or near-synonyms as a tag value) is a different issue, and I agree that it works pretty well for widely used tags. One is about the structures used to store the data, the other is about the data values themselves.

2 Likes