OpenStreetMap DTD, XSD, JSON Schema, etc for commonly shared data

I’m interested in eventually creating mapping data at a more semantic level (easier automated data integration) and documenting how people can use it with existing standards. However, I noticed that despite the reference format being XML (whose tooling is more mature than JSON and binary formats), the wiki at OSM XML - OpenStreetMap Wiki admits this:

No official .xsd Schema exists. See below and the OSM XML/XSD, and OSM XML/DTD pages for details of unofficial attempts to define the format in those languages!

Since syntax precedes semantics (and is computationally cheaper to check), it would already make sense to have DTDs/XSDs/etc. ready to use!
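Just to make the idea concrete, here is a minimal, unofficial sketch of what such a DTD fragment could look like for the node/tag subset of OSM XML (purely illustrative, not a proposal for the actual content; ways, relations, bounds, etc. are omitted):

<!-- Unofficial illustration only: covers the node/tag subset of OSM XML -->
<!ELEMENT osm (node)*>
<!ATTLIST osm version   CDATA #REQUIRED
              generator CDATA #IMPLIED>
<!ELEMENT node (tag)*>
<!ATTLIST node id        CDATA #REQUIRED
               lat       CDATA #IMPLIED
               lon       CDATA #IMPLIED
               version   CDATA #IMPLIED
               changeset CDATA #IMPLIED
               user      CDATA #IMPLIED
               uid       CDATA #IMPLIED
               visible   (true|false) #IMPLIED
               timestamp CDATA #IMPLIED>
<!ELEMENT tag EMPTY>
<!ATTLIST tag k CDATA #REQUIRED
              v CDATA #REQUIRED>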

The idea draft

  1. I see no problem in also keeping the Wiki documentation as it is today (because it is important for humans!). However, this idea would eventually add some sort of central repository for the DTD / XSD / etc., and for all other common formats used to exchange data between tools even when they are not XML (e.g. for JSON output the common practice would be a JSON Schema). That is important for non-humans (or for tools that help humans, especially developers, find errors).

  2. Then, in a separate repository (while avoiding a split into several ones, e.g. something like a monorepo for test cases), have examples of data using these formats. While it is not ideal to store huge files, it could accumulate a considerable amount of data over time (including small binary formats), which is why it should not live in the same main repo. Also, given how CI works, it is much more convenient to git clone than to download individual files.

  3. Today it is viable to use free GitHub Actions to install the dependencies (which are likely to be somewhat heavy; and while the specs are unlikely to change over time, the validation tools might break) and automate online checks of the validators themselves. That would make it viable to receive small changes over time without people needing to install the tooling on their own computers.

What I could do to help this idea

Well, I’m new to OpenStreetMap, so I am learning things as I go! Setting up the minimum from the idea above, just by looking at what already exists in the wild, is viable. I have no problem making everything public domain all the way (so others can copy-paste things into their codebases without needing to cite authors). However, if this gets minimally usable, the ideal scenario would be to recreate the repositories somewhere with review by others, preferably one or more developers of software, APIs or tools that exchange data in some way. I care about making this viable for others to keep running, but at first it could be easier to do a quick rush.

I’m not sure how active I will be on OpenStreetMap in the next years, and since this is something fewer people engage with (but which is important for developers), my idea with this post is not just to get feedback on which schemas are worth adding, but also to find some approach to donate the result for others to review. On that note, one good question: how could someone donate schemas? It’s something between code and data.

Other comments

Anyone, feel free to make other comments or suggestions! It might take some time for me to compile more practical examples; at this moment I’m mostly reading what is on the Wiki, so obviously I’m likely to miss what is relevant.

I have also noticed that (while less common on OpenStreetMap than in other communities) people sometimes rely on projects that end up unmaintained. Schemas themselves don’t stop working, but since I’m new here, making it easier for others to validate changes makes the whole thing more future-proof.


Edit 1: One initial tag, “data-validation”, was removed, as it might lead to confusion.


OSM isn’t built on standardization, at least not semantic standardization. As this is originally a British project, perhaps it’s fitting that we have only an unwritten constitution. Any tags you like is such a fundamental principle to OSM tagging that a schema can only at best be a set of commonly accepted presets, if you will. Rigorously cataloguing the set of approved tags or even the set of tags with their own wiki articles would get you a bizarre collection of tags with obvious gaps and inconsistencies.

Depending on your use case, id-tagging-schema in conjunction with the name suggestion index can serve as a starting point for an XML schema. These repositories don’t claim to be authoritative or comprehensive, but the JSON files and build scripts are used by many tools such as iD and Overpass turbo, so they have quite a bit of influence.


About semantics vs syntax

Yes! Even with simple syntax validation, it is possible to hardcode a simplified list of “allowed” tags! However, the focus here is more at the file level, or data transport level. One major limitation is that the expressiveness of enforcing pairs of <tag k="" v="" /> (key=value) might not scale very well. Actually, even projects that already do some semantic suggestion take additional context into account. For example, the Name Suggestion Index seems to make an open-world assumption: if it finds something it does not understand, it does not forbid it. These kinds of things (in particular with very large dictionaries) are very complicated to do with this type of syntax validation.

Code example

I will use a small bit of code from https://wiki.openstreetmap.org/wiki/OSM_XML:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.0.2">
<node id="1831881213" version="1" changeset="12370172" lat="54.0900666" lon="12.2539381" user="lafkor" uid="75625" visible="true" timestamp="2012-07-20T09:43:19Z">
<tag k="name" v="Neu Broderstorf" />
<tag k="traffic_sign" v="city_limit" />
</node>
</osm>

In this case, all the lines and attributes could be encoded by languages that allow specifying the syntax of a file, including the <tag k="name" v="Neu Broderstorf" /> and <tag k="traffic_sign" v="city_limit" />. It is also true that the combination name=Neu Broderstorf + traffic_sign=city_limit MAY have some meaning (semantics), but the syntax only cares about which values are allowed inside the k="" and the v="" (even if empty), as in <tag k="" v="" />.
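As an illustration of that point, here is a minimal, unofficial XSD sketch that would accept the example above. It only covers the elements and attributes shown (no ways, relations or bounds), and the type choices are my own assumptions rather than anything official:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Unofficial sketch: node/tag subset only; the types are guesses, not official -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="osm">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="node" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="tag" minOccurs="0" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:attribute name="k" type="xs:string" use="required"/>
                  <xs:attribute name="v" type="xs:string" use="required"/>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="id" type="xs:long" use="required"/>
            <xs:attribute name="version" type="xs:positiveInteger"/>
            <xs:attribute name="changeset" type="xs:long"/>
            <xs:attribute name="lat" type="xs:decimal"/>
            <xs:attribute name="lon" type="xs:decimal"/>
            <xs:attribute name="user" type="xs:string"/>
            <xs:attribute name="uid" type="xs:long"/>
            <xs:attribute name="visible" type="xs:boolean"/>
            <xs:attribute name="timestamp" type="xs:dateTime"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="version" type="xs:string" use="required"/>
      <xs:attribute name="generator" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>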

Often these values are expressed with some sort of regex over the full Unicode range, without differentiating code points by writing system, so “א” (Hebrew abjad), “ऄ” (Devanagari abugida), “A” (Latin alphabet) and “日” (kanji logogram) would ideally be treated the same (unless some application does make a distinction, in which case it is better to reflect what actually is and let users decide about that). What is common, however, are restrictions on control characters: the most popular are the ones in the ASCII table such as NUL, LF, DEL, … (not their printable representation, but the low-level binary encoding).

Important note: for obvious reasons, even if some tool accepts a range that would cause problems (Unicode ranges often cause issues and bugs), it makes no sense to publicly document this kind of unintended behaviour, even if it exists in some tool. This would mean both accepting security reports privately and contacting developers only in private. If necessary, at least for major tools, we could discuss such cases exclusively in private from the start.

From the syntax point of view, the tag keys and the tag values would each be described by a regex (which might vary by application), similar to whatever is acceptable for changeset="", lat="", etc.
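A sketch of how such a restriction could be expressed in XSD, assuming the only goal is to reject Unicode control characters and apply the 255-character limit of the main API (real tools may accept or reject slightly different ranges, so this would need to be checked against each implementation):

<!-- Sketch only: a reusable simple type for tag keys/values.
     \p{Cc} is the Unicode "control" category (NUL, LF, DEL, ...). -->
<xs:simpleType name="tagValueType" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:restriction base="xs:string">
    <xs:maxLength value="255"/>
    <xs:pattern value="\P{Cc}*"/>
  </xs:restriction>
</xs:simpleType>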

Likely “target audience”

Considering the context, the deliverables here are most relevant for developers of tools (a very, very limited target audience) and for people building other tools to validate files which contain OpenStreetMap data.

Maybe a second audience would be anyone already interested in maintaining syntax (not semantic) schemas in already published standard languages.

Conflicts of opinion are less applicable

Since the audience is already small, and the validation is actually documenting how computers deal with shared data, the likelihood of differing opinions is far smaller. The “ground truth” becomes what the software is already doing (except in special security-related cases, where we would not replicate bugs).

One edge case might be when it is not a security-related issue, but someone tries to use the repositories either to complain about APIs or to “change the files” so they no longer reflect the ground truth (e.g. a person who wants to change reality because they don’t like it). This is something I was already thinking of writing down explicitly, so maintainers do not get drawn in and can redirect such requests to other channels.

The need to get more examples of data

One obvious example would be to create the syntax schemas for the v0.6 API; however, even the Wiki mentions that there are some variations between tools. So I think these variations first need example cases; after enough examples, the way to create the schemas might be either fully individual ones or some type of extension of the reference one.

Also, there are the formats used for synchronizing data. These would also need to be documented, even if only to say that they reuse some other schema. I do understand that they have likely been working for over a decade, but less formal documentation at the data transport level (not talking here about user tags) is still relevant. So if the thing becomes good enough over time, either others could reuse some existing example, or, once something is implemented (i.e. not merely planned, but with actual proof that it is usable), there would be a central place to publish it.
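For reference, the replication/diff files use the osmChange container. A minimal, hedged illustration of its shape (element content trimmed; real files carry full node/way/relation elements inside the create/modify/delete blocks):

<?xml version="1.0" encoding="UTF-8"?>
<osmChange version="0.6" generator="example-tool">
  <create>
    <!-- newly created elements go here -->
  </create>
  <modify>
    <node id="1831881213" version="2" changeset="12345678" lat="54.0900666" lon="12.2539381" timestamp="2012-07-20T09:43:19Z">
      <tag k="name" v="Neu Broderstorf"/>
    </node>
  </modify>
  <delete>
    <!-- deleted elements go here -->
  </delete>
</osmChange>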

Also, as soon as the first files and tests are ready and well tested, any new version will pretty much be a lot of copy and paste.

The OSM API XML and JSON formats are maintained in the openstreetmap/openstreetmap-website repo on GitHub (the Rails application that powers OpenStreetMap). This might be a good location to store XSDs, in case the maintainers feel it’s worthwhile including them in the repo.

Different tools have sometimes introduced incompatible changes and extensions.

A noteworthy example is Overpass API with its XML format extension for out geom; output. Incompatible changes not only affect the elements themselves, but also the sequence in which they appear in an XML message.

As an example, the OSM API always returns data as nodes, ways, relations (in that sequence), sorted by id. Overpass API, on the other hand, is free to print elements in whatever sequence was defined in the query.

Some tools may omit some of the attributes, such as user id, display name or changeset id (an incompatible change). You may have come across such XML files when using Overpass API. I expect similar changes to occur once GDPR-related changes are implemented at the API level.
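For illustration, a node from such reduced output can look like this (a hypothetical example; the exact attribute set depends on the tool and the query options):

<node id="1831881213" lat="54.0900666" lon="12.2539381">
  <tag k="name" v="Neu Broderstorf"/>
  <tag k="traffic_sign" v="city_limit"/>
</node>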

I remember a few people bringing up the XSD schema topic on the dev mailing list: [OSM-dev] Schema for v0.6 OSM files? … maybe try and search for similar posts. I also found another example posted by one of the JOSM maintainers (OSMSchema.xsd · GitHub), based on work by someone on the old forum.

By the way, my impression is that there’s only limited interest in such a schema. I worked with both Overpass API and OSM API, and don’t recall that I ever needed them.


I am trying very hard to understand the need for this. I know XML files need XSD files to describe/make sense of what is in them. I compare it, put very simply, to a bunch of CSV data files where you need a description of the tables, header lines and columns to do automated tasks.
I have worked with a mapping package (Mapforce) that could read XML files and perform the mapping to a database application, and it did so using the XSD to know the structure and syntax of the XML.
On the other side, it knew the database structure as given by the database definitions for the application.

Biased by this experience, I think it would be possible to devise a certified OSM clone where all the attributes are unambiguously and uniquely defined, then map all known variants of the same attribute or attribute set that exist in the OSM database to the one unique, unambiguous attribute in the unified clone, and be sure of what each element is and means.

Data users would then no longer be forced to handle all possible variants. They could simply work with what’s in the certified clone. Sort of a hub, for them. I know many applications do a lot of preprocessing to create or fill their own unambiguous database clone for the application.

Do I have the right idea of what this topic is about?


Hmm, you are thinking of additional creative ways to use the definitions in one or more formats! The program you mentioned reminds me of another type of definition, the XML transformation language (XSLT). Other container formats (say, JSON) do have their own ways to enforce syntax checks (so it would make sense to use the best standard for each container), but it may not be as trivial to find mature standards to convert between formats across all potential containers.
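Just as a minimal sketch of that kind of definition (purely illustrative, not part of any proposed repository), an XSLT stylesheet that pulls every key=value pair out of an OSM XML file could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- For every tag element, print "key=value" on its own line -->
  <xsl:template match="tag">
    <xsl:value-of select="@k"/>
    <xsl:text>=</xsl:text>
    <xsl:value-of select="@v"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <!-- Suppress the default copying of other text nodes -->
  <xsl:template match="text()"/>
</xsl:stylesheet>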

I think the primary goal should be syntax validation (and very, very basic documentation) of the fields of each individual format on its own, something viable to keep in the public domain (and that also allows copy-pasting), with the test cases in a second repository. Since there are standards that let someone encode how to convert between formats, and these standards do not require significant changes to the main definitions, there may well be people willing to encode those conversions.

About naming things

Naming things is hard. There’s also a problem with branding and soft forks (something nearly equal, but not the same, where we need to be strict). And since one main use is to make the files easy for tools to consume directly, the names shouldn’t change.

Actually, even though at least two repositories would be needed and the full URLs could change over the years, changing individual files would break even our own scripts and potentially very complex test cases.

With all this said, I think the canonical name should be opaque (such as S1234), with some pattern for the version that is also opaque, like an incremental suffix (regardless of beta status or not). However, there should always be some way (maybe simply not specifying a version at all) to point to the “latest version”. That way, the file names would always be easy to parse with scripts and very compact to write rules against, while what people call them (S1234V1 = acme v0.6) could live in some human-editable metadata table.
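A hypothetical sketch of that human-editable metadata table, here expressed as a small XML index (all identifiers and titles below are invented for illustration):

<!-- Hypothetical index: opaque, stable identifiers on the left,
     human-readable names on the right -->
<schemas>
  <schema id="S1234" latest="S1234V2" title="OSM API v0.6 XML (reference)"/>
  <schema id="S1235" latest="S1235V1" title="Overpass API XML (out geom extension)"/>
</schemas>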

Generating documentation about which file contains the syntax for what is not hard, but people could change their minds over the years, while the de facto files would have less incentive to change.

Trivia: this opaque numbering could help a lot for things with only small changes. If things get too repetitive (maybe differing by just a small field), some automation could perhaps generate the files.

About test cases

Since there could potentially be a large number of small specifications for very similar things, the test cases would also start to be very similar.

Also, input vs output starts to make less sense. Programs will often add a timestamp or a label naming the generator, so automated compliance tests could ignore this noise by default. Ignoring times and ignoring names, there’s room for very compact scripts to test a lot of things (maybe even brute-force round trips between formats), as long as some human-handcrafted index says the program is expected to work in that context.
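One way to normalize that noise before comparing files is an identity transform that simply drops the volatile attributes. A sketch, assuming generator and timestamp are the only attributes treated as noise:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Copy everything as-is by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Drop the attributes that vary between runs so two exports can be diffed -->
  <xsl:template match="@generator|@timestamp"/>
</xsl:stylesheet>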

The more complex data transformation round-trip implementations are something I am unlikely to be able to bootstrap in a few weeks; however, at least having opaque naming for the test cases makes such scripts much easier. Also, as soon as the developer of just one application creates automation to test the others, this becomes generalized.

About archiving and URLs

Yes, at least the specifications for the OpenStreetMap API itself could be added on the site. However, the repository with the specifications would be more like https://www.schemastore.org/ (several providers): there is always a centralized copy, even if developers don’t have a site or their site goes offline while their tools keep being used.

Something like a PURL, e.g. w3id

While for the very long term (including backup) we could get a few editors and publish on something like Zenodo (e.g. an additional copy with a DOI), for the production URL I think we could go with https://w3id.org/. The underlying implementation allows the use of Apache (htaccess rules) to construct redirects, so production repositories could be anywhere. PURL from Archive.org is similar (and has been running for far more years), but its interface has fewer options.

But to go with W3ID: while anyone can ask for a prefix as long as they are the first to request it, the ideal for a serious case would be to have more people (they don’t need to be the same ones maintaining the repository) on the list of those allowed to change the prefix. This is somewhat similar to DNS today, except that all changes are public.

The idea of PURLs is the closest to what librarians would do on the web. I will not request any prefix there; I’m just saying that if we have some trusted people (ideally a few already trusted by developers), they could be the ones allowed to request changes to the entire non-permanent URL.

Let’s try something that could work for the next 20 years. If we look at the past, as popular as GitHub is today, its equivalent back then would have been something like SourceForge, so we should assume URLs can change, and an additional level of indirection makes sense.

The argument about OpenStreetMap data itself vs how to parse it (very long-term archiving issues)

I noticed that, at this moment, some places already have backups from over 10 years ago, and some dumps are even archived for the very, very long term (e.g. similar to what libraries do) by different individuals. But there is one inconsistency here: snapshots of data are likely to survive 200 years without any accompanying way to explain how they could be understood in 200 years.

This might seem more obvious for binary formats; however, even the text format, XML, would be something alien in the long term if it had to be reverse engineered (and even then, that could be done with human errors). As time passes, archivists may delete things that don’t seem relevant, so without proper treatment even code examples showing how the format works might not survive, or will survive only in places no one would find in the very long term.

So, having more places that store how to parse the OSM formats (at least the ones used for backups) is also relevant in this context. And if it is not complex to explain, in a few pages of code, how to convert to another standard in the same area, then that should also be done, because in the long term people might use the specification to convert the data by hand to another format that is more likely to have programs able to read it. Having “DOIs” (Zenodo is one place that can even store code) helps with this, and the metadata allows others to find it in the very long term.