Tagging names transformed from one language to another

After a quick search on this forum, I couldn’t find a discussion specifically about this topic. If this has already been covered, my apologies.

I believe most active mappers here are familiar with the concepts of exonym and endonym, but for clarity, here is a brief explanation.

What are endonyms and exonyms?

  • Endonym – The name of a geographical feature used by local inhabitants in their native language. For example, Wien for Austria’s capital and Firenze for the well-known Italian city.
  • Exonym – A name for a geographical feature used in a different language that differs from the local name. For instance, the capital of Austria is known as Wien in German, but Vienna in English, Vienne in French, and Відень in Ukrainian. Similarly, Firenze in Italian is Florence in English, Florencia in Spanish, and Florenz in German.

Exonyms are especially common for country names, capitals, and major cities. However, there are also many lesser-known geographical features that only have a name in one language.

Examples of endonyms and exonyms in Ukraine

| Endonym (name, name:uk) | Exonym (name:en) |
|---|---|
| Чорне море | Black Sea |
| Дунай | Danube |
| Димерчин ставок | |
| Крим | Crimea |
| Закарпаття | Transcarpathia |
| Харитонівська сільська громада | |
| Київ | Kyiv |
| Витягайлівка | |
| Цмоки | |
| вулиця Леонтія Свічки | |
| 2-й провулок Сергайовки | |
| урочище Попові Корита | |

Although OSM Wiki guidelines suggest avoiding transliterated, transcribed, or translated names in multilingual tags, name:en often contains such transformed versions.

For example, here are some names listed under name:en for places in Belarus, even though none of these can truly be considered part of the standard English lexicon:

Nyasvizh
Stowbtsy
Valozhyn
Kletsk
Viliejka
Stolin
Luniniec
Kapyĺ

And similarly for Egypt:

Manial Shiha
Hadayek helwan
Wadi Hof
Ghamaza Al-Kubra
Qiblya
Dahshur

Proposal: Using BCP 47 Extension T Standard for transformed content

The BCP 47 standard allows specifying transformed content, including transliterations, transcriptions, and even translations. Here is a quote from the standard:

Identification of transformed content can be done using the ‘t’ extension defined in this document. This extension is formed by the ‘t’ singleton followed by a sequence of subtags that would form a language tag as defined by BCP47. This allows the source language or script to be specified to the degree of precision required.

(…)

For example:

| Language Tag | Description |
|---|---|
| ja-t-it | The content is Japanese, transformed from Italian. |
| ja-Kana-t-it | The content is Japanese Katakana, transformed from Italian. |
| und-Latn-t-und-cyrl | The content is in the Latin script, transformed from the Cyrillic script. |

And here is how language tags would appear in OpenStreetMap:

name:ja-t-it
name:ja-Kana-t-it
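
To make the key structure concrete, here is a minimal Python sketch of how a data consumer might split such a key into its parts. The function name `parse_name_key` is hypothetical, and the sketch assumes that everything after the last colon is a language tag, so a key like name:pronunciation would be misread as a language.

```python
import re

def parse_name_key(key):
    """Split an OSM key such as 'name:ja-Kana-t-it' into (base, target, source).

    base   -- the key without its language suffix, e.g. 'name'
    target -- the language (and script) the value is written in
    source -- the language the value was transformed from, or None
    """
    base, sep, lang = key.rpartition(":")
    if not sep:
        # Plain key without a language suffix, e.g. 'name'.
        return key, None, None
    # BCP 47 tags are case-insensitive, so accept '-t-' and '-T-'.
    parts = re.split(r"-t-", lang, maxsplit=1, flags=re.IGNORECASE)
    target = parts[0]
    source = parts[1] if len(parts) == 2 else None
    return base, target, source

print(parse_name_key("name:ja-t-it"))       # ('name', 'ja', 'it')
print(parse_name_key("name:ja-Kana-t-it"))  # ('name', 'ja-Kana', 'it')
print(parse_name_key("name:uk"))            # ('name', 'uk', None)
```

The sketch does not validate subtags against the IANA registry; it only separates the target from the source.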

Why is this needed?

Exonyms vary significantly. Consider different names for Germany:

| Name | Etymology |
|---|---|
| Deutschland (endonym) | From Old High German diutisc (“of the people”), derived from Proto-Germanic þeudō (“people”). -land means “country,” so Deutschland means “land of the people.” |
| Німеччина (Ukrainian exonym) | From Old East Slavic нѣмьць (“mute person”), referring to Germanic people who did not speak Slavic languages. |
| Germany (English exonym) | From Latin Germania, a term used by the Romans for the lands beyond the Rhine. The origin is unclear but may derive from Gaulish germani (“neighbors” or “twins”). |
| Allemagne (French exonym) | Derived from the name of the Alemanni tribe, who lived in southwestern Germany and Alsace. |

These are true exonyms and, in my opinion, belong in name:uk, name:en, name:fr, etc.

However, for a small Ukrainian town like Згурівка, other language versions are actually transformations of the original Ukrainian name.

Using the “t” singleton for transformed names

For such cases, the t singleton can indicate that a name has been transformed from the original language:

name:be-t-uk = Згурыўка
name:ru-t-uk = Згуровка
name:en-t-uk = Zghurivka
name:crh-t-uk = Zhuriwka
name:ro-t-uk = Zgurivka

Benefits and challenges of this approach

This approach ensures more accurate tagging, as a transformed version of Згурівка does not truly belong to Belarusian, Russian, or English linguistic systems. It also helps distinguish between native English names like New York or Main Street and transliterated names such as Khreshchatyk or Zghurivka, which do not form part of the standard English lexicon. However, implementing this system requires data users to adapt their processing algorithms to recognize the -t-<lang_code> extension in tagging keys, which may introduce an initial technical challenge.
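
As a rough illustration of the adaptation data users would need, the Python sketch below separates plain language names from transformed ones in a tag dictionary. The function name `split_names` and the sample tags are illustrative assumptions, not part of any existing tool.

```python
def split_names(tags):
    """Separate name:<lang> values from name:<lang>-t-<source> values."""
    native = {}       # target language -> name
    transformed = {}  # (target language, source language) -> name
    for key, value in tags.items():
        if not key.startswith("name:"):
            continue
        lang = key[len("name:"):]
        target, sep, source = lang.partition("-t-")
        if sep:
            transformed[(target, source)] = value
        else:
            native[target] = value
    return native, transformed

tags = {
    "name": "Згурівка",
    "name:uk": "Згурівка",
    "name:en-t-uk": "Zghurivka",
    "name:ru-t-uk": "Згуровка",
}
native, transformed = split_names(tags)
print(native)       # {'uk': 'Згурівка'}
print(transformed)  # {('en', 'uk'): 'Zghurivka', ('ru', 'uk'): 'Згуровка'}
```

A consumer that ignores the extension would simply find no name:en here, which is the backwards-compatibility cost mentioned above.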

Advanced usage: Specifying transformation methods

BCP 47 extension T allows even more precise tagging by indicating the transformation method. This can be useful for historical and cartographic research.

Example: Different Romanization standards for the same Ukrainian place name:

| Transformation Method | Tag | Transformed Name |
|---|---|---|
| US Board on Geographic Names (1965) | name:und-Latn-t-uk-Cyrl-m0-bgn-1965 | Kam’yanyy Brid |
| UNGEGN Standard (2012) | name:und-Latn-t-uk-Cyrl-m0-ungegn-2012 | Kamianyi Brid |
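
For readers who want to see how the m0 field could be read out of such a tag, here is a simplified Python sketch. It assumes that field keys inside the extension are one letter followed by one digit (as with m0) and that no other extension follows the t extension; `parse_t_extension` is a hypothetical helper name.

```python
import re

# A field key inside the 't' extension: one letter plus one digit, e.g. 'm0'.
FIELD_KEY = re.compile(r"[a-z][0-9]")

def parse_t_extension(tag):
    """Split a language tag into target, source, and extension fields."""
    subtags = tag.lower().split("-")  # BCP 47 tags are case-insensitive
    if "t" not in subtags:
        return {"target": "-".join(subtags), "source": None, "fields": {}}
    i = subtags.index("t")
    source, fields, current = [], {}, None
    for sub in subtags[i + 1:]:
        if FIELD_KEY.fullmatch(sub):
            current = sub            # start of a new field such as 'm0'
            fields[current] = []
        elif current is None:
            source.append(sub)       # still inside the source language tag
        else:
            fields[current].append(sub)
    return {
        "target": "-".join(subtags[:i]),
        "source": "-".join(source) or None,
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }

print(parse_t_extension("und-Latn-t-uk-Cyrl-m0-bgn-1965"))
# {'target': 'und-latn', 'source': 'uk-cyrl', 'fields': {'m0': 'bgn-1965'}}
```

The output is lowercased for comparison purposes only; a real consumer would probably want to preserve the conventional casing of script subtags.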

In conclusion, I believe that my proposal to use the BCP 47 extension T is important, especially for OpenStreetMap, as there are not many databases in the world that are as open and international. Accuracy in specifying languages, alphabets, and such transformed exonyms is essential for the development of mapping and global interaction within the project. This is a topic worth discussing, and I look forward to a constructive discussion. Thank you for your attention!

(Related topic: Filling in the name:en tag in Ukraine)

Bear in mind that this is tricky, because it is not just English where Ukrainian (or any place with a non-Latin script) gets transliterated, but every single language that doesn’t use Cyrillic writing. For handling small villages, it makes more sense to use a tag like name:uk-Latn. It is not strange for a small village to not have any name:en at all (or name:de, name:fr, name:nl, etc.).

Just to clarify an important point to ensure I am understood correctly: I am against mass cross-transliteration, transcription, or translation.

In my opinion, unfortunately, this has already happened with some languages, such as English. How many of these points on the map truly have names that are genuinely in English?

One problem with specifying the method of transformation is that we often simply don’t know it. Etymology is a notoriously vague field. Sure, we know that name:nl=Kyiv comes from the Ukrainian pronunciation of Київ, but for less well-known places this is often a guess, even if the exonym has quite obviously been created from either the Ukrainian or the Russian pronunciation.

What we do know and can reasonably prove is what the exonym is. What its source is, however, is much less straightforward. Kyiv aside, if some place has a name:nl (perhaps it became famous in the Netherlands for some reason), did that exonym come from transliteration according to Ukrainian pronunciation, or was it borrowed from name:en or name:de, which already had that exonym?

You are right about name:en sometimes being abused to simply put a romanized version of a name somewhere though.

Are you proposing to supplement name:en=* with name:en-t-uk=* or to replace name:en=* with name:en-t-uk=*?

Can you elaborate on why researchers would need OSM to explicitly record these etymologies versus Wiktionary or Wikidata’s lexicographical data project? For a proposal to get traction within the OSM community, it needs a more concrete use case than “for research”.

This assumes that there’s a single most obvious romanization scheme from Cyrillic Ukrainian to any other language, but sometimes there are multiple transliteration or transcription schemes between the same two writing systems, which might give us a reason to indicate the scheme in tags.

For example, the difference between name:zh-Latn-pinyin=* and name:zh-Latn-tongyong=* can be relevant in Taiwan. However, both already have official variant codes in the IANA language subtag registry, making the -t- extension redundant. The PRC prefers that German-language publications use Lessing–Othmer romanization rather than Hanyu Pinyin as in other languages, though I’m unsure if this is common enough that OSM would need to tag name:de-Latn-t-zh-Hans-m0-lessing=*. (First, Lessing–Othmer would need to be registered with CLDR, the official registration authority. Until then, I guess we’d have to use name:de-x-lessing=*?)

I’m not very confident that we’d be able to replace all this usage with tags that indicate the specific transformation, simply because we don’t always know which transformation scheme was originally used. Maybe replacing these name:en=* with name:und-Latn=* would be a more achievable goal.

Even so, I don’t think it’s fair to say that these transliterations are “fake” English. Unlike some other languages, English is a largely unregulated language; no official standard governs the use of exonyms in general. English is also very accommodating of loanwords from other languages. It’s better to think of name:en=* as the answer to the question, “What would an English speaker use?” That could be an argument for redundant name:en=* and name:und-Latn=*, or for data consumers to add logic to fall back to name:und-Latn=* when name:en=* is missing before falling back to name=*.
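
The fallback order described here could look roughly like the following Python sketch. The function name `pick_label` and its parameters are illustrative, and the sample tags are made up for the example.

```python
def pick_label(tags, lang="en", script="Latn"):
    """Return the first available name in a fixed fallback order.

    Order: name:<lang>, then name:und-<script>, then plain name.
    """
    for key in (f"name:{lang}", f"name:und-{script}", "name"):
        value = tags.get(key)
        if value:
            return value
    return None

village = {"name": "Згурівка", "name:und-Latn": "Zghurivka"}
print(pick_label(village))  # Zghurivka

city = {"name": "Київ", "name:en": "Kyiv", "name:und-Latn": "Kyiv"}
print(pick_label(city))     # Kyiv
```

With this order, a small village needs only name:und-Latn=* to get a readable label, while well-known places can still carry a distinct name:en=*.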

What matters here is not the intermediate stages of the name but the language in which the endonym—the original name—is written. The transformation can vary greatly, even taking the form of a translation. However, it is still a transformation.

I’m not proposing anything official just yet. This is more of a call for discussion rather than an official suggestion.

First of all, I wasn’t aware of these projects. I’m more focused on the development of the OpenStreetMap tagging system, which is why I brought this up here. I recently discovered this tagging standard, and it connected with my realization that there are many names that should not be recorded in English, but in OpenStreetMap, they are.

But is this really the right approach? Would language tags lose their meaning if we started treating them this way?

It seems that we’ve ended up with some inconsistency in language tagging, and my proposal offers a way to gradually bring more clarity to it. I appreciate the efforts of everyone who has contributed to the internationalization of the world map.

But different languages, even ones that use similar scripts, transcribe names differently. For example, name:en-t-uk=Zaporizhzhia, name:de-t-uk=Saporischschja, and name:pl-t-uk=Zaporiżżia (as opposed to name:pl=Zaporoże). So which of these would go in name:uk-Latn?

I want to provide a real-life example.

After the abolition of the national-territorial autonomy of the Crimean Tatars and their genocide in 1944, the Crimean Peninsula saw a massive renaming of toponyms. The names of settlements in the Crimean Tatar language were replaced with Russian names. In the vast majority of cases, this was not a translation or a transformation of the old names into Russian Cyrillic. Simply new names, unrelated to historical ones, were introduced.

Here is a list of several such renamings:

| name:crh | name:ru |
|---|---|
| Aqmeçit | Черноморское |
| Ablaq Acı | Калиновка |
| Abuzlar | Высокое |
| Abulğazı | Вольное |
| Aq Baş | Вячеславовка |

Later, I don’t know exactly when, but probably in connection with the transfer of Crimea to the administrative subordination of the Ukrainian SSR, Ukrainian translations of these Russian names appeared.

name:uk
Чорноморське
Калинівка
Високе
Вільне
Вʼячеславівка

Currently, Crimea is occupied by Russia, and most settlements still have the same Russian names and their corresponding Ukrainian names. By the way, according to the DWG decision, in accordance with “Ground truth,” the name tag currently contains these Russian names.

On September 7, 2023, after the full-scale invasion of Ukraine by Russia, a law came into effect that restores the historical Crimean Tatar names to many settlements in Crimea.

Here are some of these names. According to Ukrainian orthography, these are names of foreign origin.

| Official name in Ukraine before 2023 | After 2023 |
|---|---|
| Куйбишеве | Албат |
| Новоульяновка | Отарчик |
| Фурмановка | Актачи |
| Ударне | Бочала |
| Ульяновка | Султан-Сарай |
| Завіт-Ленінський | Кучук-Алкали |

So, what do we have?

According to the current tagging scheme, the tags name, name:ru, and name:uk are already filled, and there is no place for the names that were restored to the settlements in 2023. These are also not alt_name:uk or old_name:uk. (I will add that very often these tags are used to stuff things that don’t fit anywhere else.)

How it could be according to my proposal:

name and name:ru, as much as I, as a Ukrainian, would not want it, would remain as they were before the de-occupation of the peninsula.

But instead, there would be an opportunity to record the names once translated from Russian in the name:uk-t-ru tag, and those restored in 2023 in the name:uk-t-crh or official_name:uk-t-crh tag.

Final table:

name (imposed by Russia)
name:ru (imposed by Russia)
name:uk-t-ru (still in use by the Ukrainian-speaking population of Crimea)
name:crh (Crimean Tatar names)
name:uk-t-crh (new names in Ukrainian, transformed from Crimean Tatar)

In order for a data consumer to present the value of name:uk-t-ru=* or name:uk-t-crh=* to a Ukrainian-speaking end user, it would either need to make an editorial decision that one is better than the other or ask the user. Since no existing data consumer is quite that detail-oriented, the practical result is that only Russian and Crimean Tatar speakers will see anything in their language; Ukrainian speakers would see labels in Russian. That’s probably not what you intended.
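
To show what such an editorial decision would look like in code, here is a hedged Python sketch in which the consumer passes an ordered list of preferred source languages. The function `pick_transformed` and the preference list are assumptions made for illustration; no existing renderer is known to work this way.

```python
def pick_transformed(tags, target, source_preference):
    """Pick a label for `target`, preferring certain source languages.

    Falls back to plain name=* when nothing matches, which is what a
    consumer unaware of the extension would show anyway.
    """
    plain = tags.get(f"name:{target}")
    if plain:
        return plain
    for source in source_preference:
        value = tags.get(f"name:{target}-t-{source}")
        if value:
            return value
    return tags.get("name")

tags = {
    "name": "Завет-Ленинский",
    "name:ru": "Завет-Ленинский",
    "name:uk-t-ru": "Завіт-Ленінський",
    "name:uk-t-crh": "Кучук-Алкали",
}
print(pick_transformed(tags, "uk", ["crh", "ru"]))  # Кучук-Алкали
print(pick_transformed(tags, "uk", ["ru", "crh"]))  # Завіт-Ленінський
print(pick_transformed(tags, "uk", []))             # Завет-Ленинский
```

The last call shows the outcome described above: without a configured preference, the Ukrainian-speaking user ends up with the Russian label.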

For backwards compatibility, we’d need to consider the proper values of less specific subkeys, namely name:uk=*. This brings us back to the starting point. In the event of a geopolitical dispute, the on-the-ground rule favors the de facto, in situ name over the others. In the event of a tie, multiple names can be separated by some delimiter.[1] Other keys besides name=* don’t require such strict adherence to the on-the-ground rule.

I’m a bit puzzled that not all of these keys are listed in your final table. If you’re already open to using official_name:*=*, there isn’t technically a conflict between the Crimean Tatar–based and Russian-based names that would require more specific codes. It could be merely the distinction between name:uk=* and official_name:uk=*.

If that approach is untenable because of contested notions of “official”, then maybe there could be a combination of name:uk-UA=* plus name:uk-RU=*, or nat_name:uk-UA=* plus nat_name:uk-RU=*, with name:uk=* having multiple names in it. But any change along these lines will probably require buy-in from more local mappers who understand the situation better than someone like me. :man_shrugging:


  1. Which delimiter to use is a subject of controversy, but the semicolon currently enjoys more robust software support than other delimiters.

The purpose of tags is to indicate not for whom the information is needed, but what exactly this information is. Choosing what is better for the data user is not the task of OpenStreetMap contributors, as there will be many data users and they will be diverse. I gave an example where, instead of figuring out where to place a second name in Ukrainian, one could improve the tagging scheme itself to show how the two Ukrainian names differ in their nature.

Yes, the choice between name:uk-t-ru and name:uk-t-crh is indeed a problematic question. Theoretically, one could establish a logical chain starting from name=* as the common name, then linking it to name:ru=*, which would duplicate the common name and explicitly indicate that it is Russian. From there, name:uk-t-ru=* would naturally follow as the Ukrainian name transformed from the Russian one.

Of course, this approach assumes that, by default, a data consumer would want to display the common name and any foreign-language transformations derived from it. However, I am not entirely sure this would work as seamlessly in practice as it does in theory. Language interconnections are complex, and ultimately, it may all come down to configuring name rendering manually to achieve a high-quality result—just as it does now.

By explicitly marking transformations, we provide more information about a name, not less. Consequently, the data consumer gains more flexibility, not less.

It seems that regional codes would also be useful. Thank you for mentioning them.

I think it’s definitely worth identifying (and correcting) name:en tags that are blatant transliterations, and am certainly not opposed to tagging such transliterations appropriately, but I caution against assuming that small villages in Ukraine, Belarus, Egypt and other countries don’t have ‘standard’ or otherwise legitimate names “in English”. It may very well be that that transliteration is commonly used in English, or conversely that an “official” transliteration is NOT actually used by anyone.

Kapyĺ, Belarus is a great example of a transliteration that is very obviously not used in English: we don’t put acute accents on the letter ‘l’! The diacritic there is a dead giveaway that it’s not actually used in English.

Conversely Qiblya, Egypt probably is the “English” name; an “official” transliteration would likely be “Qibliyah”, or something similar, so I would be careful about just blindly assuming it’s a “junk” name.

I decided to structure the name tags used for one of the settlements in Crimea according to the proposed scheme. The scheme demonstrates the interrelations between the names. In addition to these relationships, a clear hierarchy is outlined. The Crimean Tatar, Russian, and Ukrainian names serve as the source for other exonyms.

name = Завет-Ленинский # Commonly used name, endonym, serves as the source for related exonyms
 └── name:ru = Завет-Ленинский # Indicates that the commonly used name is in Russian
      ├── name:yi-t-ru = זאַוועט לענינאַ # Exonym of the commonly used Russian name in Yiddish
      ├── name:ko-t-ru = 자비트레닌스키 # Exonym of the commonly used Russian name in Korean
      └── name:uk-t-ru = Завіт-Ленінський # Endonym in Ukrainian, used by the Ukrainian-speaking population of Crimea, was the official name from 1991 to 2014, before the occupation of Crimea
          └── name:en-t-uk = Zavit-Leninskyi # Exonym in English, derived from the Ukrainian endonym, transliterated according to the Ukrainian national transliteration standard, which is also BGN—2019 and UNGEGN—2012

name:crh-Arab = كۇچۇك حالقالی # Crimean Tatar historic endonym. Arabic script was the first alphabet of the Crimean Tatar language
 └── name:crh = Küçük Alqalı # Crimean Tatar historic endonym, written in modern Crimean Tatar Latin script
      ├── name:az-Arab-t-crh = کوچوک حلقه‌لی # Exonym of the name in Azerbaijani, Arabic script
      ├── name:ota-t-crh = كوچك حلقه‌لی # Exonym of the name in Ottoman Turkish
      ├── name:fa-t-crh = حلقه‌لی کوچک # Exonym of the name in Persian (Farsi)
      └── name:uk-t-crh = Кучук-Алкали # Exonym of the name in Ukrainian, also the official name from Ukraine's perspective since 2023
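
The hierarchy above is drawn by hand. As a sketch of how far it can be recovered from the keys alone, the Python snippet below links each transformed key to any key whose target language matches its source. The helper names `parse` and `children_of` are hypothetical, and only a subset of the tags from the tree is used.

```python
def parse(key):
    """Return (base, target, source) for a key like 'name:en-t-uk'."""
    base, _, lang = key.rpartition(":")
    target, sep, source = lang.partition("-t-")
    return base, target, (source if sep else None)

def children_of(tags, parent_key):
    """Keys whose '-t-' source equals the parent's target language."""
    pbase, ptarget, _ = parse(parent_key)
    return sorted(
        k for k in tags
        if parse(k)[0] == pbase and parse(k)[2] == ptarget
    )

tags = {
    "name:ru": "Завет-Ленинский",
    "name:uk-t-ru": "Завіт-Ленінський",
    "name:en-t-uk": "Zavit-Leninskyi",
    "name:crh": "Küçük Alqalı",
    "name:uk-t-crh": "Кучук-Алкали",
}
print(children_of(tags, "name:ru"))        # ['name:uk-t-ru']
print(children_of(tags, "name:uk-t-ru"))   # ['name:en-t-uk']
print(children_of(tags, "name:uk-t-crh"))  # ['name:en-t-uk']
```

Note that name:en-t-uk attaches to both Ukrainian names here: the key records only that the source is Ukrainian, not which of the two Ukrainian names was transformed.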

It would be interesting to hear about other places with complex situations involving parallel foreign names and try to test my tagging scheme on their examples.

There is some truth to this, but we do consider data users when designing tagging systems. We don’t cater to the wants of one data user over another, but we do at least try to imagine what a concrete use case for a new tagging style will be. Without that there is no point in collecting the data.

So far, it is obvious to me that when transformed names have separate, more detailed tags, the data will become more valuable and easier to process for users. However, unfortunately, I must admit that I don’t yet have precise use cases.

Here’s what I have come up with so far. Imagine an application that utilizes the structured data from name:en-t-uk (a transformed name from Ukrainian) alongside name:uk (the original name). With this information, the app could provide both an English pronunciation (with an accent) and the original Ukrainian pronunciation. Additionally, tapping on the name could display a label: Transformed name—original pronunciation available: “Listen”. All the necessary data for such a feature would already be present.
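
A minimal Python sketch of the lookup such an app would need is shown below. The function `label_with_original` and the shape of its return value are assumptions for illustration; the pronunciation playback itself is out of scope.

```python
def label_with_original(tags, target):
    """Return the label for `target` plus the original name, if transformed."""
    prefix = f"name:{target}-t-"
    for key, value in tags.items():
        if key.startswith(prefix):
            source = key[len(prefix):]
            return {
                "label": value,
                "source_lang": source,
                # The original name the app could offer to pronounce.
                "original": tags.get(f"name:{source}"),
            }
    plain = tags.get(f"name:{target}")
    if plain:
        return {"label": plain, "source_lang": None, "original": None}
    return None

tags = {"name": "Згурівка", "name:uk": "Згурівка", "name:en-t-uk": "Zghurivka"}
print(label_with_original(tags, "en"))
# {'label': 'Zghurivka', 'source_lang': 'uk', 'original': 'Згурівка'}
```

When source_lang is set, the app knows both that the label is a transformation and which language's pronunciation to offer.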

On the other hand, take the Wikipedia article on Kraków: I have no idea whether that is Wikipedia doing Wikipedia things, or whether the common name in English is actually Kraków rather than Krakow or Cracow.

I have severe doubts about ó actually being used; even Poles in Poland were not using żółćęśąźń when support for these special letters was really bad on typical computers.

Plenty of places/streets/parks in Poland have

  • (1) old German name used in past
  • (2) German name used briefly during German occupation
  • (3) currently used German name
  • (4) prewar Polish name
  • (5) Polish name used during Russian occupation/puppetry
  • (6) current Polish name
  • (7) current official Polish name

Sometimes the same name applies to several of those.

In particular, determining whether a distinct German name still exists for an object is tricky.

Whether German names used briefly during the German occupation, and old extinct names, are mappable at all is also an open question.

As more and more time passes since the Russian army was kicked out, I guess it is a growing question whether names from that time are supposed to be mapped in OSM.

Oh, and you can get regular street name changes. There is still an ongoing process of renaming things that celebrated Russian occupiers and their puppets.

Wikipedia doing dumb Wikipedia things. :rofl:

It’s complicated. In the anglophone world, diacritical marks have come in and out of fashion many times over the past couple centuries. Superficially, the language is liberal about borrowing names from other languages but conservative about keeping their diacritical marks. In general, Western European languages get more of a pass than Slavic languages, to say nothing of Vietnamese. It also depends on the audience: an academic journal in the humanities or social sciences will likely retain the diacritics, whereas an atlas for schoolchildren or a newswire report is unlikely to bother with such nuances. My city has had edit wars over the acute in its name – not only in OSM but also on signs in real life. Do you see the acute over the “E” on this public toilet? I suspect it’s been a target of vandalism.

The point about Kapyĺ is not so much about the diacritic per se but rather about the choice between scientific/bibliographic and practical orthographies. I keep having to remind developers that ICU’s automatic transliteration library is not the easy magic wand that developers wish for it to be. It favors scientific transliteration schemes that are unambiguous and lossless when transforming in either direction – constraints that are often at odds with the natural processes for borrowing names into a language.

Alternatively, the feature could have a name:pronunciation=* tag (or subkeys thereof) that allows an English text-to-speech engine to pronounce the name more authentically. Or it could have a wikidata=* tag that allows a data consumer to fetch the Wikidata item, which can provide access to structured data about transliterated names, pronunciations, and etymologies.

The BCP 47 extension syntax certainly allows us to be more expressive and specific, but it doesn’t change the fact that OSM is primarily about geographic data. External projects like Wikidata are not so good at managing geodata but provide much more flexibility in recording linguistic content. There is plenty of software today that joins these projects together for insights and end user functionality that would never be possible with one project alone.

The name:pronunciation tag, unfortunately, is not very common. According to taginfo, there are only 7,405 occurrences worldwide.
However, the names we are discussing are actively added by OSM contributors, despite recommendations to limit the inclusion of transliterations.

By the way, I even found instances of language tags using “t”:

| Tag | Value |
|---|---|
| coastline:name:en-t-zh | Shore of Thousand Creek |
| name:en-t-zh | Shore of Thousand Creek |
| name:en-t-zh | Double Dragon Stream |
| name:zh-t-en | 見聞 |
| name:en-t-zh | Lai Sam Accient Trail |
| name:en-t-zh | Green Dragon River |
| name:en-t-zh | Lai Sam Accient Trail |
| name:en-t-zh | Lai Kuk Ancient Trail |
| name:en-t-zh | Lai Kuk Ancient Trail |
| name:zh-t-en | 走私嶺 |
| name:yue-t-en | 雲山 |
| old_name:en-t-zh | Seven Sisters Ranges |
| name:en-t-zh | Shore of Thousand Creek |
| name:en-t-zh | Double Dragon Stream |
| name:en-t-zh | Double Dragon Stream |
| name:en-t-zh | Lai Sam Accient Trail |
| name:en-t-zh | Lai Sam Accient Trail |
| name:en-t-zh | Lai Kuk Ancient Trail |
| name:en-t-zh | Lai Sam Accient Trail |
| name:en-t-zh | Lai Sam Accient Trail |
| name:zh-t-en | 鎮島 |

It seems that they were added by user @Kovoschiz.

But OpenStreetMap itself significantly expands the category of geographic data beyond its classical understanding before our project emerged. And I find it hard to think of data more geographic than foreign-language names of features.

After all, I’m not suggesting adding any new data—just providing more detailed tagging for the existing ones.