Why are HTML special characters (like &amp;) found in .osm files?

greencaps · January 24, 2010, 12:21pm

It puzzles me to find HTML Special Characters in .osm files.

In a recent download from geofabrik (http://download.geofabrik.de/osm/) of Turkey, turkey.osm.bz2, there is a node with id 312788027. You can see the data for that node here: http://api.openstreetmap.org/api/0.6/node/312788027 You see an & in “Cambo İskender & Kebap”, but if you view the source of that page you will see “Cambo İskender &amp; Kebap”.

Now for a browser it is OK and expected that the source contains &amp; and not a single &.

But why in an .osm file? As far as I know, XML does not require an & to be written as &amp;.

The .osm data is for all applications. Why should applications find HTML special characters in .osm data? In my opinion this does not make sense. I did not investigate the planet dump for this (soooo… much time needed) but assume that the planet dump will contain them too.

I write this to be corrected if I’m wrong.

Tordanik · January 24, 2010, 12:33pm

I didn’t know myself, but I googled (or rather, “wikipedia-ed”) it and found this:

Bikeman2000 · January 24, 2010, 6:47pm

The necessity of entities depends on the file’s character encoding, not on the file type.
Outdated encodings like ISO-8859 had a very limited number of characters.
To “extend” this number, web browsers supported “entities” (like &auml; for ä), which are obsolete when you use UTF-8.

In XML, UTF-8 is the default encoding; the support for older encodings is just for backwards compatibility.
http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding
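To make the link between encoding and entities concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The choice of Python and the sample `<name>` element are assumptions for illustration only; nothing here comes from the OSM tooling. It serializes the same element once as UTF-8 and once as US-ASCII. In the ASCII output the serializer has to fall back to a numeric character reference (strictly a character reference rather than a named entity), while the UTF-8 output can store the character directly.

```python
import xml.etree.ElementTree as ET

# A hypothetical element holding one non-ASCII character.
elem = ET.Element("name")
elem.text = "ä"

# UTF-8 can represent "ä" directly, so no reference is needed.
as_utf8 = ET.tostring(elem, encoding="utf-8")

# US-ASCII cannot, so the serializer writes a numeric character reference.
as_ascii = ET.tostring(elem, encoding="us-ascii")

print(as_utf8)
print(as_ascii)

# Both byte strings parse back to the same character data.
same_text = ET.fromstring(as_utf8).text == ET.fromstring(as_ascii).text
print(same_text)
```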

An XML file contains markup and character data. If you have an XML-file containing

<name>John</name>

“John” is the character data and the rest is the markup.
Since markup and character data are stored together in a single file, an XML parser needs a way to distinguish between markup and character data.
Because XML is stored in a text file, the only way to give a parser this information is to choose certain characters as separators.
This results in the restriction that you must not use certain characters within the character data.
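A small sketch of that restriction, assuming Python's standard library (`xml.etree.ElementTree` and `xml.sax.saxutils.escape`) and reusing the name from the first post inside a `<name>` element: a bare & is read by the parser as the start of a reference such as `&amp;`, so the unescaped document is rejected, while the escaped one parses back to the original text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_name = "Cambo İskender & Kebap"

# Unescaped: "&" starts a reference, but "& Kebap" is not a valid one,
# so the document is not well-formed and the parser raises an error.
try:
    ET.fromstring(f"<name>{raw_name}</name>")
    parsed_unescaped = True
except ET.ParseError:
    parsed_unescaped = False

# Escaped: "&" is written as "&amp;" in the file ...
escaped = escape(raw_name)
elem = ET.fromstring(f"<name>{escaped}</name>")

# ... and the parser hands the application a plain "&" again.
value = elem.text

print(parsed_unescaped)
print(escaped)
print(value)
```

So an application that reads the file through an XML parser should never see `&amp;` itself; it only shows up when the file is read as raw text.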

greencaps · January 25, 2010, 9:55am

Thank you both for your effort to explain this case. I spent some time reading the mentioned document, but it dazzles me after a while. So many specifications and no examples. I need small examples to understand.

I’m willing to accept that “This results in the restriction that you must not use certain characters within the character data.” But I do not understand why & belongs to them. Could you give an example that shows its restricted use?

Reading the specification, I understand it to say that an XML document that is UTF-8 encoded can still contain non-UTF-8 text.

Now I would like to use this possibility. See the node below:

The place name Мотилі is in Cyrillic UTF-8 (12 bytes altogether). I would like to add an extra tag with the place name Мотилі in 1-byte Cyrillic (6 bytes altogether). (I would then have to specify a code page too, I think.)

What should I add to make this possible?
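The byte counts above can be checked with a short sketch. It assumes Python and picks windows-1251 as the 1-byte Cyrillic code page, which is only one possible choice; the `<tag k="name" …>` element is a hypothetical stand-in for the node's tag. It also shows where XML records the encoding: in a single declaration at the top of the document, which applies to the whole file rather than to one tag.

```python
import xml.etree.ElementTree as ET

name = "Мотилі"

# Same six characters, two encodings.
utf8_bytes = name.encode("utf-8")
cp1251_bytes = name.encode("windows-1251")  # assumed code page
print(len(utf8_bytes), len(cp1251_bytes))   # 12 6

# Hypothetical tag element carrying the place name.
elem = ET.Element("tag", {"k": "name", "v": name})

# Serializing the whole document in the 1-byte encoding adds an XML
# declaration that names the encoding for the entire file.
doc_1251 = ET.tostring(elem, encoding="windows-1251")
print(doc_1251)

# A parser that honours the declaration recovers the same text.
round_trip = ET.fromstring(doc_1251).get("v")
print(round_trip == name)
```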