Why are HTML special characters (like &amp;) found in .osm files?

It puzzles me to find HTML Special Characters in .osm files.

In a recent download from geofabrik (http://download.geofabrik.de/osm/) of Turkey, turkey.osm.bz2, there is a node with id 312788027. You can see the data for that node here: http://api.openstreetmap.org/api/0.6/node/312788027 . You see an & in “Cambo İskender & Kebap”, but if you go to the source of that page you will see “Cambo İskender &amp; Kebap”.

Now for a browser it is OK and expected that &amp; is written in the source and not a single &.

But why in an .osm file? As far as I know, XML does not require an & to be written as &amp;.

The .osm data is for all applications. Why should applications find HTML special chars in .osm data? In my opinion this does not make sense. I did not investigate the planet dump for this (soooo… much time needed) but I assume that the planet dump will contain them too.

I write this to be corrected if I’m wrong.

I didn’t know myself, but I googled (or rather, “wikipedia-ed”) it and found this:

The necessity of entities depends on the file’s character encoding, not on the filetype.
Outdated encodings like ISO-8859 had a very limited number of characters.
To “extend” this number, web browsers supported “entities” (like &nbsp; or &auml;), which are obsolete when you use UTF-8.

In XML, UTF-8 is the default encoding; the support for older encodings is just for backward compatibility.
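
To make the difference between HTML's named entities and what XML itself defines a bit more concrete, here is a minimal Python sketch using only the standard library. The sample strings are made up for illustration and do not come from any .osm file.

```python
import html
import xml.etree.ElementTree as ET

# HTML defines many named entities; html.unescape knows them.
a_umlaut = html.unescape("&auml;")  # the single character U+00E4

# XML itself predefines only five: &amp; &lt; &gt; &quot; &apos;
amp_text = ET.fromstring("<name>Tom &amp; Jerry</name>").text

# An HTML-only entity is unknown to an XML parser when no DTD declares it.
try:
    ET.fromstring("<name>K&auml;se</name>")
    html_entity_ok = True
except ET.ParseError:
    html_entity_ok = False

# In a Unicode-capable document the character can simply be written directly.
direct = ET.fromstring("<name>K\u00e4se</name>").text
```

So &auml; is a convenience that HTML adds, while &amp; belongs to XML itself and is understood by every XML parser.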

An XML file contains markup and character data. If you have an XML-file containing


“John” is the character data and the rest is the markup.
Since markup and character data are stored together in a single file, an XML parser needs a way to distinguish between markup and character data.
Because XML is stored in a text file the only possibility to give a parser this information is to choose certain characters as separators.
This results in the restriction that you must not use certain characters within the character data.

Thank you both for your effort to explain this case. I spent some time reading the mentioned document, but it dazzles me after a while. So many specifications and no examples. I need small examples to understand.

I’m willing to accept “This results in the restriction that you must not use certain characters within the character data.” But I do not understand why an & belongs to them. Could you give an example that shows its restricted use?
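
A small example of the kind asked for here, as a sketch in Python with the standard library only. The `parses` helper is a hypothetical name introduced for this illustration, and the `<tag k="…" v="…"/>` line imitates how .osm files store tags. The point is that & is the character that starts an entity reference (such as &lt;, which is how a literal < is written inside data), so a bare & leaves the parser waiting for an entity name that never comes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def parses(doc):
    """Return True if doc is well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

name = "Cambo İskender & Kebap"

# Writing the value as-is puts a bare & into the attribute.
raw_doc = '<tag k="name" v="' + name + '"/>'

# quoteattr adds the surrounding quotes and turns & into &amp;
escaped_doc = '<tag k="name" v=' + quoteattr(name) + '/>'

raw_ok = parses(raw_doc)          # the bare & is rejected
escaped_ok = parses(escaped_doc)  # the escaped form is accepted

# The application never sees "&amp;": the parser hands back a plain "&".
value = ET.fromstring(escaped_doc).get("v")
```

An application that reads the file through an XML parser gets the plain & back, so the &amp; only exists in the file on disk.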

As I read the specification, it states that an XML document that is UTF-8 encoded can still contain non-UTF-8 text.

Now I would like to use this possibility. See the node below:

The placename Мотилі is in Cyrillic UTF-8 (12 bytes altogether). I would like to add an extra tag with the placename Мотилі in 1-byte Cyrillic (6 bytes altogether). (I would have to specify a codepage too then, I think.)

What should I add to make this possible?
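
A sketch of the byte counts involved, in Python with the standard library only. It assumes Windows-1251 as the 1-byte Cyrillic code page, which is a choice made for this illustration; the post does not name one. As far as I know, an XML document carries a single encoding, declared once at the top, so a second encoding cannot be mixed in for one tag. What can be done is to write the whole document in the other encoding.

```python
import xml.etree.ElementTree as ET

# The placename written with escapes so the code points are unambiguous.
place = "\u041c\u043e\u0442\u0438\u043b\u0456"  # Мотилі

utf8_len = len(place.encode("utf-8"))    # two bytes per Cyrillic letter
cp1251_len = len(place.encode("cp1251")) # one byte per letter (assumed code page)

# Serialize a whole (tiny) document in Windows-1251; the XML declaration
# that ElementTree writes names the encoding for the entire file.
node = ET.Element("tag", {"k": "name", "v": place})
doc_1251 = ET.tostring(node, encoding="windows-1251")

# A parser reads the declaration and returns the same Unicode string.
round_trip = ET.fromstring(doc_1251).get("v")
```

Either way the application receives the same characters after parsing; only the size of the file on disk differs.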