Transliterations of names to other scripts than Latin

Users of Latin script have come to expect that, for areas where it is not the local script, maps are available that show names of geographic locations in Latin script in addition to the local script, so that they are able to read them. This is not controversial, and I think it should be equally non-controversial that users of writing systems other than Latin are entitled to the same expectation. I am most familiar with Bulgarian Cyrillic script (which is foreign to me; my native language is Dutch): paper maps in Cyrillic of areas Bulgarians commonly travel to are available, and Google Maps shows many place names first in Cyrillic and then in the local script if the preferred language is set to Bulgarian.

With this post I would like to start a thread to discuss whether and how to add transliterated names to locations on OSM. This subject has been touched upon in other threads such as

https://community.openstreetmap.org/t/name-ru-bij-nederlandse-plaatsen/78589/24

https://community.openstreetmap.org/t/tool-to-find-and-fix-incomplete-multi-lingual-names/143881/7

https://community.openstreetmap.org/t/discussie-over-het-automatisch-vertalen-van-osm-name-tags/102590/26

https://community.openstreetmap.org/t/tagging-names-transformed-from-one-language-to-another/125144

but has not been discussed systematically afaik.

I would like to address some of the concerns in these earlier discussions here.

  1. It should be data consumers that should provide transliterations, not OSM

Transliteration usually requires local knowledge such as the pronunciation of the name. Cyrillic has been developed for writing the Bulgarian language, so it is adapted to it and is almost completely phonetic. To be able to correctly transliterate a foreign name, its pronunciation needs to be known. Afaik AI (LLM) is limited to written language so does not have the necessary information about pronunciation. Therefore, it needs a human to correctly transliterate a name, and it can’t be automated. I assume this is also true for other scripts that are usually developed to write the sounds of a specific language (Japanese katakana for instance). These correct transliterations need to be stored somewhere, and since OSM aims to store geographic information, it would be a good place to store it. We already store exonyms, so why not transliterations? Or would Wikidata be a better place to store all non-local names?

  2. Most places don’t have a name in foreign languages that is different from the local name.

Transliteration is not translation: the name is still in the local language, but written in a script that is not commonly used for it. The same is often true for place names in other writing systems that are transliterated to Latin script, but that doesn’t stop us from adding it to the map… it’s so useful for map users that can’t read the local script, so why not help them?

  3. I can’t verify that the added transliteration is correct.

It’s true that there is often no ground truth that can be used to verify correct transliteration as the foreign script is unlikely to be displayed on any sign. I think we have to assume good faith here, and hope that the mapper who added it is familiar enough with both the location (how the name is pronounced) and the foreign script. There are often other mappers with the same familiarity who can check the correctness and adjust if necessary. Just like translation, transliteration is not an exact science and sometimes choices have to be made (whether the place has an exonym in the foreign language or not, to transliterate or translate, which foreign character to use to transliterate a sound, etc.) that can be disputed. That doesn’t mean it’s complete anarchy, however: most cases are obvious, there are conventions on how to transliterate difficult cases, and it can be discussed which transliteration is most likely to be understood correctly by map users that read that script. I think the fuzziness of transliteration is similar to that of tagging the surface of an unpaved highway: plenty of discussions here on what the differences are between ground, dirt, compacted and gravel… Anyway, if you can’t verify the correctness of a tag, that is not a reason to remove it: Don’t remove tags that you don’t understand

  4. The list of foreign names will become too long

That’s only a problem if we run out of database space. We could limit the number of names in foreign scripts that are only slightly different (Bulgarian Cyrillic vs. Russian Cyrillic, for instance) by deciding that the Cyrillic transliteration should be in name:ru=* (Russians being the largest group of users of the Cyrillic script) and name:bg should only be added if the spelling is different. There is such an agreement between the Serbian and Bulgarian OSM communities, see here https://community.openstreetmap.org/t/serbian-names-for-bulgarian-places/111488. The same may be possible with other closely related writing systems such as Arabic & Persian, Devanagari & Bengali, Thai & Lao, etc. (I don’t know enough about these scripts to know if this is feasible). We could also agree that if for a certain writing system, several different transliterations are possible, we should add only one: the one most useful to map users. We should also not transliterate alt_name, etc. If a true exonym exists, it should always be preferred.

  5. Transliterated names should have a different tag to be able to distinguish them from true exonyms

That can be done, but I don’t see (yet) why it is necessary or useful to make the distinction.

Do you want to limit this discussion strictly to transliteration, or would you also like to cover the broader topic of how geographic names are adapted between different languages and writing systems?

For context, Ukrainian legislation actually uses two different approaches depending on the direction:

The first is strict “export” transliteration of Ukrainian names into the Latin script according to the official resolution of the Cabinet of Ministers of Ukraine:

| Ukrainian name | Latin transliteration |
|---|---|
| Бровари | Brovary |
| Бориспіль | Boryspil |
| Запоріжжя | Zaporizhzhia |
| Одеса | Odesa |

The second is the “import” of foreign geographic names into Ukrainian. This process goes way beyond simple transliteration, as it actively incorporates phonetics, morphology, and lexical adaptation. For instance, there are official rules for representing Bulgarian geographic names in Ukrainian, which often result in translating descriptive parts rather than just changing the script:

| Original | Adapted |
|---|---|
| Еминска планина | Емінські гори |
| Хасковско възвишение | Хасковська височина |
| нос Калиакра | мис Каліакра |

Don’t extend this further - for new tagging standards, use name:[native language]-Cyrl for transliterations, same as the recent standard to use name:[native]-Latn for Latin transliterations.

I appreciate your focus on the practical user experience that a project like ours should enable. I think we both largely agree that users expect this information to a degree:

The difficulty is that these transcriptions or transliterations fall between the cracks. While the local community can of course maintain the native-language names and overseas language communities can help maintain exonyms, who is able to maintain the transcriptions and transliterations that are somewhat foreign to both groups of mappers?

This is specialist information that generally requires training, except for certain cases where the information happens to be signposted. But transcriptions and transliterations are only useful on a map if they appear predictably, not if they appear haphazardly because we were only mapping the signs.

For comparison, name:pronunciation=*/name:*-fonipa=* also requires some specialized knowledge of IPA, and we only tag it on an exceptional basis. But IPA pronunciation guides don’t appear as often on rendered maps. And in practice, most mappers add these tags by looking up names in Wiktionary or other guides. Would we welcome similarly looked-up transcriptions and transliterations in OSM?

The first issue that comes to mind is that some language pairs have a variety of “best” transliteration schemes, depending on the audience and use case. The notion that the local community knows best is too simplistic. What’s to say that the signposted transliteration scheme is the one that the user knows how to read? This is a particular problem for the indigenous languages of Southeast Asia, Africa, and North America. A succession of wildly different orthographies often reuses the same symbols for conflicting purposes. Unfortunately, not all of these schemes have standard codes in BCP 47 yet.

Often when we talk about relying on Wikidata, we’re referring to the labels and aliases at the top of every item. These are important, but if a data consumer needs to distinguish between exonyms and transliterations or between multiple transliteration schemes, it should instead consult properties such as native label (P1705) and name (P2561), especially if qualified by transliteration or transcription (P2440), Hanyu Pinyin transliteration (P1721), ISO 9:1995 (P2183), etc.
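
As a rough illustration of what consulting those properties could look like for a data consumer, here is a small sketch that walks an entity's statements and picks out names that carry a transliteration qualifier. It assumes the general shape of Wikidata's entity JSON (`claims`, `mainsnak`, `qualifiers`); the function name `transliterated_names` and the sample entity are invented for this example and are not real Wikidata data.

```python
def transliterated_names(entity, name_props=("P1705", "P2561"), translit_prop="P2440"):
    """Return (language, name, transliteration) triples found in an entity dict."""
    results = []
    claims = entity.get("claims", {})
    for prop in name_props:
        for statement in claims.get(prop, []):
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # skip "unknown value" / "no value" statements
            value = snak["datavalue"]["value"]  # monolingual text: text + language
            qualifiers = statement.get("qualifiers", {}).get(translit_prop, [])
            for qualifier in qualifiers:
                if qualifier.get("snaktype") == "value":
                    results.append(
                        (value["language"], value["text"], qualifier["datavalue"]["value"])
                    )
    return results


# Hand-made sample in the assumed shape; not copied from a real item.
sample_entity = {
    "claims": {
        "P1705": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "datavalue": {"value": {"text": "Бровари", "language": "uk"}},
                },
                "qualifiers": {
                    "P2440": [{"snaktype": "value", "datavalue": {"value": "Brovary"}}]
                },
            }
        ]
    }
}

print(transliterated_names(sample_entity))
```

Labels and aliases never pass through this code at all, which is the point: the statement-level data keeps the name, its language and the transliteration apart.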

Wikidata doesn’t require any name to be verifiable on the ground, so more widespread coverage is possible than in OSM. On the other hand, OSM goes into much greater geographic detail than Wikidata. A data consumer needing comprehensive coverage of transliterated names of points of interest would need to fall back to an automated transliteration library. Incidentally, I’ve proposed that Planetiler support Wikidata name statements for map-optimized labels, but data consumers will only take notice once these properties and qualifiers reach critical mass.

Unverifiable tags are not tags one does not understand. They are tags one understands as unverifiable. “Don’t remove tags you don’t understand” does not apply here, but verifiability does, because transliterations are not an objective fact: Verifiability - OpenStreetMap Wiki

Честит Ден на славянската писменост! (Happy Day of Slavic Literacy!) :slight_smile:

This makes the transliteration verifiable, so I don’t think adding these names to the map could be controversial. Bulgaria has a similar system, described in a law, so this is used to transliterate Bulgarian names to Latin script (stored in int_name) because it is what is shown on signage and is therefore the most useful system for Latin-reading map users.

Again this makes the transliteration verifiable and adding these names to the map should not be controversial. The list contains translations of descriptive parts of names, so these could be considered official exonyms. It shows one of the dilemmas of transliteration: should the descriptive part be translated, transliterated, or omitted? It depends on language traditions, but I think in most cases these traditions are clear and not controversial.

Do you want to transliterate only place names, or anything that has a name in osm, i.e. all streetnames, shops, … ? Because if so, space requirements could be considerably larger.

Personally, I don’t see the point of storing this information in osm. I think that on-the-fly machine transliteration could do a decent job, if not now, then in a few years.

And remember, it doesn’t have to be perfect, it’s only to help people who can’t read the local script to pronounce a name. And even a perfect transliteration is no guarantee that they can.
To put it another way, if I see a perfect transliteration of a Chinese village name in Latin script, I’d probably still do a poor job of saying it out loud. I mean, it could be pronounced in an English voice, or Dutch, or something else. Maybe there are sounds in the local language that can’t even be written in a transliteration.
A text-to-speech engine could be of more use.

This would be formally more correct, but not very practical because there are different versions of Cyrillic script. Which key should be used for Јаковци, the Serbian Cyrillic version of the Bulgarian village name Яковци? name:bg-Cyrl-sr=Јаковци maybe? This can quickly get very complicated, while I don’t yet see a practical use case for it.

Using name:ru has practical advantages: it is already recognised by data consumers, that it is Cyrillic script is clear from the tag value, and in this case it is useful as a fallback value in case a name in another language that uses Cyrillic script (name:bg for instance) is not available. We could even decide that if name:bg would have the same spelling as name:ru, it doesn’t need to be and shouldn’t be added. Most other languages that use Cyrillic script are also Slavic languages, so their users will likely be able to understand the name:ru name even if it is a translation, while speakers of non-Slavic languages that use Cyrillic often have Russian as their second language. I assume that for other sets of languages that use the same or similar scripts, there is such a fallback language too (Arabic, Hindi, …) while for scripts that are used by only 1 language there is no ambiguity at all (katakana is used for name:ja only).
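
For data consumers, the fallback idea described above might look roughly like this. It is only a sketch: `pick_label` and its `fallback_chain` parameter are made-up names, not part of any existing renderer, and treating name:ru as the shared Cyrillic fallback is the proposal under discussion here, not established practice.

```python
def pick_label(tags, preferred_lang, fallback_chain=("ru",)):
    """Return the (key, value) of the first usable name tag, or None."""
    candidates = [f"name:{preferred_lang}"]
    candidates += [f"name:{lang}" for lang in fallback_chain]
    candidates.append("name")  # last resort: the local name
    for key in candidates:
        value = tags.get(key)
        if value:
            return key, value
    return None


# Tag values taken from names mentioned in this thread.
london = {"name": "London", "name:ru": "Лондон"}
brussels = {"name": "Bruxelles", "name:ru": "Брюссель", "name:bg": "Брюксел"}

print(pick_label(london, "bg"))    # no name:bg, so name:ru is used
print(pick_label(brussels, "bg"))  # name:bg exists and wins
```

Under this scheme name:bg only needs to be tagged where the Bulgarian spelling differs from the fallback value.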

First impression: name:sr, clearly, because you just said it’s the Serbian version? Don’t include bg anywhere in the key, that’s implied by the geography.

Second impression: I’d refer to the Serbian OSM community’s established practice on how to enter both versions of names inside the country, to see if it’s applicable here.

If that established practice is to use name:ru for the Serbian Cyrillic names, then I’d say that’s pretty weird and shouldn’t be propagated on elements outside the national border.

Double tagging name:sr-Latn and name:sr-Cyrl on the Bulgarian village seems like the cleanest solution to me, but I would understand if existing tagging weighs towards a different solution. Using name:ru for names that are Cyrillic but not Russian seems quite prone to being overwritten. Use the language code that the Cyrillic was produced in/for.

That seems to me unlikely to be very popular with the inhabitants of a number of Cyrillic-using, but non-Russian, countries!

Just to add some background information, without entering into the discussion what should be tagged and what not:

IETF BCP 47 specifies everything that is needed for the various options of languages, scripts and regions:

  • es-Arab for plain transliteration of Spanish to Arabic script
  • zh-Latn-pinyin for Chinese transliterated into Latin script using the pinyin system
  • yue-Hant-HK for Cantonese using traditional Han characters as spoken in Hong Kong
  • ca-valencia for Catalan as spoken among Valencian people (not related to a defined province or state, hence no country/region code)
  • en-t-jp for text literally translated from Japanese to English
  • uk-Latn-t-fr literally translated from French to Ukrainian, using Latin script

All of these combinations are used in some places in OSM.
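
To make the structure of such keys a bit more concrete, here is a simplified sketch of how a data consumer might split the suffix of a name:* key into its parts. `parse_name_key` is a hypothetical helper, and it implements only a small subset of BCP 47 (no extended language subtags, no private use, and only the "t" extension), so treat it as an illustration rather than a conforming parser.

```python
def parse_name_key(key):
    """Split a key like 'name:zh-Latn-pinyin' into language, script, region, variants."""
    prefix = "name:"
    if not key.startswith(prefix):
        raise ValueError(f"not a localized name key: {key!r}")
    subtags = key[len(prefix):].split("-")
    parsed = {
        "language": subtags[0].lower(),
        "script": None,
        "region": None,
        "variants": [],
        "transformed_from": None,
    }
    rest = subtags[1:]
    for index, subtag in enumerate(rest):
        if len(subtag) == 1:
            # Singleton: start of an extension. Only "t" (transformed content) is handled.
            if subtag.lower() == "t":
                parsed["transformed_from"] = "-".join(rest[index + 1:]) or None
            break
        if len(subtag) == 4 and subtag.isalpha() and parsed["script"] is None:
            parsed["script"] = subtag.title()    # e.g. Latn, Cyrl, Hant
        elif len(subtag) == 2 and subtag.isalpha() and parsed["region"] is None:
            parsed["region"] = subtag.upper()    # e.g. HK
        elif len(subtag) == 3 and subtag.isdigit() and parsed["region"] is None:
            parsed["region"] = subtag            # UN M.49 numeric region code
        else:
            parsed["variants"].append(subtag.lower())  # e.g. pinyin, valencia
    return parsed


for key in ["name:es-Arab", "name:zh-Latn-pinyin", "name:yue-Hant-HK",
            "name:ca-valencia", "name:uk-Latn-t-fr"]:
    print(key, parse_name_key(key))
```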

I think my motivation for bringing this up is my interest in the tension there is between verifiability and usefulness. I think that just like perfection should not be the enemy of the good, verifiability should not be the enemy of something damn useful like smoothness or in this case transliteration.

I think the verifiability good practice is often applied and interpreted too strictly on OSM so that it obstructs adding very useful info to the map. 100% objectivity is desirable but not necessary: the wiki itself mentions the distinction between stream and river as an example: it is not 100% objective and the verification criterion (to be able to jump across) is somewhat subjective, but that doesn’t stop us from making the distinction. I already mentioned the distinction between ground, compacted and gravel as another example. One mapper may judge that he can jump across a stream, the next one may judge he can’t, and they may start an edit war between them. The same can happen between transliterators, but that shouldn’t be a reason not to add transliterations. A tag may not be verifiable by all mappers, but I think it’s sufficient if it is verifiable by some. There are thousands of Bulgarians with sufficient knowledge of Dutch pronunciation so they are capable of transliterating Dutch place names, and a few of them are probably mappers. I am personally capable of checking their work too. Even local mappers with no knowledge of the foreign script tag being added to their home town can do something to verify it. A mapper in Reading, UK who notices that name:ja=リディング has been added to the city’s node can paste that into a back-transliteration tool, find it spells “ridingu”, suspect that the mapper adding it has guessed the pronunciation wrong and comment on it on the changeset.
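
The back-transliteration check in the Reading example is simple enough to sketch. The table below is a tiny, hand-picked subset of katakana that covers only this one name; `KATAKANA_TO_ROMAJI` and `back_transliterate` are invented names, and a real tool would of course need the full syllabary.

```python
# Only the characters needed for the Reading example. The two-character
# combination ディ ("di") has to be tried before the single character デ ("de").
KATAKANA_TO_ROMAJI = {
    "ディ": "di",
    "リ": "ri",
    "デ": "de",
    "ン": "n",
    "グ": "gu",
}


def back_transliterate(text):
    """Convert katakana to a rough Latin spelling, two-character units first."""
    pieces = []
    position = 0
    while position < len(text):
        pair = text[position:position + 2]
        if len(pair) == 2 and pair in KATAKANA_TO_ROMAJI:
            pieces.append(KATAKANA_TO_ROMAJI[pair])
            position += 2
        elif text[position] in KATAKANA_TO_ROMAJI:
            pieces.append(KATAKANA_TO_ROMAJI[text[position]])
            position += 1
        else:
            pieces.append("?")  # character outside this small table
            position += 1
    return "".join(pieces)


print(back_transliterate("リディング"))  # ridingu
```

The result starts with "ri" (as in "reed") rather than "re" (as in "red"), which is the kind of hint a local mapper could act on without reading Japanese.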

Historically, cartographers have always done one of two things: either forced their own exonyms onto maps, or warped local names to fit their own language, script, and phonetics. They also loved throwing out generic descriptive terms (which I totally disagree with, by the way :slightly_smiling_face:). So it’s a fair question: why should OSM contributors be any different?

The short answer is that maps have always been tools of power. Look at the USSR era: they literally had detailed, official manuals on how to Russify geographic names. If you try hard enough, you can easily manufacture a fake “Russian reality” purely through toponyms (which, sadly, is exactly what’s happening in the occupied territories right now). For me, having a Russian version for all or most Ukrainian names in the OSM-database is a complete nightmare scenario.

If we are to move towards creating transliterations, possible steps could include making it mandatory to discuss the addition of derived transliterations and adaptations between several language communities, primarily involving those local to the specific area. For instance, the Ukrainian and Bulgarian communities could hold this kind of discussion to find a mutual consensus.

I don’t understand. Can you give an example?

So, using Wikidata is not an option?

I expect that the needed amount of space for transliterations will be limited by the availability and motivation of mappers who are skilled enough to add these. In this wiki, the example of not “adding an Inuktitut transliteration of a small village in Indonesia” is mentioned. I think we can safely assume that this will not happen even if we relax those criteria, because it is highly unlikely that there is a mapper who is both capable (knows both Inuktitut and the local conditions in Indonesia) and motivated (sees the usefulness of it and is willing to spend time on it) to add them manually.

I don’t have the expertise to judge that, but I do see the results of what Google Maps is doing with transliteration of Dutch place names to Cyrillic:

Here, every single transliteration is questionable: the machine doesn’t know that Gorinchem is pronounced Gorkum, ch is transliterated Ч instead of Х (back transliterated, Schelluinen becomes Stseloeinen and Kedichem Keditsem, but Gorinchem is done correctly if that was the correct pronunciation), Vuren has become Voeren and the names of the provinces are a combination of transliteration of the compass direction and the Bulgarian exonym of Holland/Brabant while I think it should be a full translation. If this represents the current state of the art, I have my doubts if machine transliteration will be able to do a decent job within a few years :wink:

Here you have a point: in fact much of the disputes about how to transliterate are about how to best solve a case where a perfect transliteration is not possible due to lack of a suitable character in the destination script to represent a sound (German ü doesn’t have a clear Cyrillic equivalent, but there is a convention that this is done with ю yu). How to transliterate tonal languages like Chinese must be a big issue (maybe it is possible only to Vietnamese Latin?).

It’s the name written using Serbian Cyrillic, but the language is still Bulgarian so it should be name:bg-XXX=. It’s a tiny village (21 inhabitants) not close to the Serbian border so it’s unlikely to have a truly Serbian name.

When the transliterated name is the same for several Cyrillic-using languages (quite common because the differences between Cyrillic alphabets are only a few letters), it would be efficient to have a key in which to store that name so it doesn’t need to be repeated for all those languages. I think the systematic name for such a key would be name:und-Cyrl, but that has the disadvantage that it is probably not understood by most data consumers and would have to be learned. What I am proposing is to add the additional meaning of “a transliteration of the name into Cyrillic without specific language” to name:ru . I realise that this is politically sensitive. name:en is already used in a similar way for names transliterated into Latin without really being in the English language and even though there are many people who don’t like Donald Trump :slight_smile:

I don’t know how things are in other parts of the world, but Google Maps in Ukraine is a real mess. They absolutely do not care about data quality, completeness, or accuracy. So in your example, the data on Google Maps may simply be incorrect — generated by a flawed system or taken from an unreliable source, it doesn’t really matter.

The only way to represent a name properly is through a well-designed “language → language” system, like the ones I linked for the “Ukrainian → Russian” and “Bulgarian → Ukrainian” pairs.

Transliteration, on the other hand, is a “letters → letters” system, and in such systems phonetic accuracy is usually far from the top priority.

I’m Dutch, and until today I had no idea that Gorcum is the official pronunciation :wink:, I’ve always said ā€˜goriegem’.

As for the others, I’d probably recognize the names if someone were to ask me about them. I doubt that a better transliteration would make a big difference.

No matter how good a transliteration, foreigners will always have a hard time using Dutch sounds like ā€˜ui’, ā€˜ij’, ā€˜g’ etc. And even without a Cyrillic - Latin conversion there will be confusion: A Spanish speaking person wouldn’t know about Gorinchem / Gorcum, and might want to see something like Jórijem or Jórkem.

I have questioned that statement elsewhere, but I’ll go into a bit more detail now: Cyrillic alphabet variations are just as diverse as the Latin ones, and transliteration schemes wildly vary among languages. Yes, a “plain vanilla” name such as London will invariably be name:und-Cyrl=Лондон across the board. However, anything more complex produces a plethora of variations: for example, the capital of Belgium is variously called Брюссель (ru, uk+some Central Asian languages), Брусель (be), Брюксел (bg), Брисел (sr), as well as Брүссель, Брюсэль. A typical Serbian user would be baffled seeing some of those on the map, and probably prefer the original Bruxelles.

Disagreement notwithstanding, the general problem you presented is real. I’m cautiously leaning towards the idea of using Wikidata to store all language names, but I’m still open to ideas.

That’s a good idea! I think it could lead to a re-writing of this wiki from “When to avoid transliteration” to “How to add transliteration”, i.e. from a “No, unless…” to a “Yes, but…” attitude. We could split it off into a separate page and list a number of guidelines on how to do it. These could include:

  1. No machine transliteration (same rules as automated edits and bulk imports)
  2. Required (or recommended?) knowledge (both on the source and target script and language as well as local conditions)
  3. Cooperation with local and foreign language OSM communities
  4. Which names to transliterate (settlements, administrative units, etc. but not brands, private business names unless in use, etc.)
  5. …

as well as some words on verifiability and the guideline that transliterations are allowed and if you don’t understand them, apply Don’t remove tags that you don’t understand

Paragraphs describing basic consensus on how to transliterate between the main scripts could follow.