[RFC] Feature Proposal - Add languages: tags for name rendering

Jarek · November 14, 2024, 1:47am

Do any of these examples in OSM not currently have appropriately tagged name:de / name:fr / name:en / etc.? It seems to me that any voice navigation software that cares can use name:<lang> and won’t have a problem. And any voice navigation software that doesn’t care won’t be updated to read a languages:* tag.

Sorry but this is either a software defect or a configuration error. That specific example is not something we should be trying to fix in OSM.

(Or consider the alternative: if you weren’t in the car, and the satnav “correctly” spoke German names to monolingual English speakers, would they understand the names better than the Englishfied versions?)

Again it seems to me that any software that can’t apply a “pronounce street names in Germany with a German speech pack” heuristic is not going to be updated to read languages:*.

And this only applies to data which has only a name with no name:<lang>, so the heuristics wouldn’t be needed in most of the common multilingual areas (where software could indeed conceivably struggle to figure out what language pack to use in one of 22 tiny Belgian exclaves, or in Biel/Bienne).

Jarek · November 14, 2024, 1:50am

The following point in the wiki page cuts off without finishing, can you check on it and clarify what you intend?

“It may open ways to improve searching by address, as addr:street tags could more reliably matched to street name:<language code> tags, particularly in areas where addr:street”

aseigo · November 14, 2024, 6:25am

Whoops, the last few words were deleted by accident, apparently while I was reformatting it to wiki markdown … it should (and now does) end with “where addr:street tags are not multi-lingual.”

This happens in some Canadian cities, for example, where the street names are in “<French Street Type> <Street Name> <English Street Type>” form (e.g. “Rue John Road”) but the house addresses are entered in one language only (e.g. “John Road”).

For full disclosure: I’m not sure how much impact this has in practice, as I haven’t had a chance to look at how e.g. nominatim’s code handles these specific cases, though a few searches I did seem to show impacts.

aseigo · November 14, 2024, 6:30am

languages:presentation_order is certainly clear; and languages:presentation_order:local for where a local override-with-fallback is needed?

alan_gr · November 14, 2024, 6:38am

You mean dropping the “official” wording completely? That might be a good idea as linking rendering to official languages seems risky. Official status may refer to things like the right to do business with the state or courts in a particular language - there is no automatic link to displaying on a map.

I’m still not completely sure why two tags are needed though. If the algorithm uses the “most local” language list, what is the dividing line between the two tags?

aseigo · November 14, 2024, 6:57am

Yes, they have name:<code> entries, but there is no hint as to which is the default and so without applying heuristics they pick name. When name is the same as one of these name:<code> entries, and/or it successfully applies location-aware heuristics (hit and miss in multi-lingual areas), then it can be deduced … but when name has more than one language name in it, it becomes increasingly complicated to deduce which languages are represented.

Keep in mind that “language the user has their nav set to” may not be the same as the language of the streets you are driving through. This happens all the time in the country where I live, where the language of street signs can change multiple times over the course of a half-hour drive, and where many people prefer to use navigation in their native language rather than whatever the town they are driving through speaks.

Can you explain how the following is intended to work:

set the navi language to English
drive through a multi-lingual town where there are no name:en entries
get place names from OSM entries with no language metadata, only name, name:fr, and name:de entries, all of which are different, since name is a by-hand combination of the fr and de entries

Again, the text-to-speech could use heuristics, it could use that to read specific language-specific entries only from the set of name:<code> entries (assuming those are even provided by the driving instructions, which IME they typically are not), but those are a lot of hoops we are asking the software to jump through.

Or OSM could instead provide language metadata that would enable detecting which localizations to use.

Yes, it is more understandable when voiced in the correct language.

In the case I mentioned, the driver can read German (and even speak to some degree) fine. They are used to the street names in German. The English vocalizations were nearly indecipherable. Comparing them to the street signs was not straight-forward.

It gets even worse if you try to share those wrongly-enunciated instructions with locals, as they will struggle as well. Again, I know this from first-hand experience, on both sides of that interaction.

A related “funny story” is when Biel/Bienne rolled out an automated VoIP system for city services that was poorly configured and would pronounce some names with the wrong language pack, and it was also hard to figure out for native speakers living there. (That story courtesy of my partner, who lived there at the time.)

Vocalizing names in the proper language is very useful.

“German names in Germany” is easy mode, as that is another (mostly) monolingual situation. The more difficult cases are street names in a French-speaking and/or bilingual town in a mostly-German-speaking canton of Switzerland, or the Haida names that now appear in some name fields in Haida Gwaii.

The motivation for this proposal is these multi-lingual use cases, where names appear in multiple languages, including putting two or more names in the name field either with some separator character (“French Name / Flemish Name” in Brussels, e.g.) or just bodged together (<French Street Type> <Street Name> <English Street Type> in various Canadian cities).

It would be great if navi software was sophisticated enough to sort this out, but the OSM dataset makes that harder than necessary. I’ve personally witnessed it result in failure.

Some places have name:<lang> tags in languages other than the ones generally spoken in the location, and which do not appear on street signs.

Some places have names in use which are not in the commonly used local language (e.g. name:hai in Canada), and which also have name:<lang> entries in the commonly spoken language.

Some places, such as the Chinatown or other ethnic communities in major North American cities, also have preferred names in official use that do not match the general language preferences of the area.

The S. Africa issue is similarly complex, where some names are localized and some are, as a matter of general use, not.

Yes, there are many places which are mono-lingual, or which are easy to figure out. This is for the rest of the world. It incurs little, if any, extra cost for the places which are mono-lingual.

edit: Note that “no extra cost” includes preserving the name field as it currently is. No changes to the dataset are needed for places where the current scheme works fine, and the rendered results will be the same as they currently are.

aseigo · November 14, 2024, 7:13am

Yes, it seems to be misleading for many people, as seen repeatedly in this thread.

For cases where:

the local preference is to favour an entirely different language from what is otherwise used around it (Cantonese in Vancouver’s Chinatown, Haida in Haida Gwaii), but where the standard language fallbacks should be respected in case the matching name:<lang> tags are missing.
there are many official languages, but only a small number in local use
a locality may have multiple local languages in official use (official), though typically the place names are only in one or two of those (preferred)… however, in cases where the name is not in those one or two commonly used languages (preferred), then the list of official languages can be consulted to figure out which one of the name:<lang> entries would make sense to show instead.

This also avoids having to replicate the whole set of languages at every administrative level where language changes occur. That in turn avoids later having to edit them in many places when languages are added, removed, or reordered, and having to keep them consistent across an entire country.
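One way to read this point is as an inheritance lookup: a data consumer walks outward from the most local boundary and uses the first language list it finds. The following is a minimal Python sketch under that assumption. The tag key `languages:preferred`, the boundary dictionaries, and the function name are illustrative only; the thread has not settled on final tag names.

```python
from typing import Optional

# Hypothetical tag key; the final name is still under discussion.
LANG_KEY = "languages:preferred"


def effective_languages(boundaries: list[dict], key: str = LANG_KEY) -> Optional[list[str]]:
    """Return the language list from the most local boundary that has one.

    `boundaries` is ordered from most local (e.g. a town) to least local
    (e.g. the country). Boundaries without the tag inherit from a parent.
    """
    for tags in boundaries:
        raw = tags.get(key)
        if raw:
            return [code.strip() for code in raw.split(";") if code.strip()]
    return None


# Illustrative data only: a bilingual town inside a canton inside a country.
country = {"name": "Country", LANG_KEY: "de;fr;it"}
canton = {"name": "Canton"}                  # no tag, so it inherits
town = {"name": "Town", LANG_KEY: "fr;de"}   # local override

print(effective_languages([town, canton, country]))
print(effective_languages([canton, country]))
```

With this shape, only the boundaries where the list actually changes need the tag; everything else inherits it.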

When it was “official” vs “preferred”, it also made semantic sense to encode regional ordering differences (e.g. English-French vs French-English) in a preferred tag, but that becomes less of an issue when dropping the official terminology.

Basically, it covers a variety of edge-cases, and most commonly only the main tag will be needed.

On that note:

I feel this is a non-issue.

If the language is an official language of the government offices in a locality, but street names do not appear in that language, nothing would be altered.

Furthermore, the preference tag would define the common language selection for map features even when there are localized names defined, while allowing falling back to any official language should it appear in the dataset.
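The preferred-then-official fallback described in this post can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the proposal's normative algorithm: the semicolon-separated list format, the function names, and the sample tags are all hypothetical.

```python
def split_codes(value: str) -> list[str]:
    """Split a semicolon-separated language list such as 'fr;de'."""
    return [code.strip() for code in value.split(";") if code.strip()]


def pick_names(tags: dict, preferred: str, official: str = "") -> list[str]:
    """Pick names to show: preferred languages first, official ones as fallback.

    Falls back to the plain `name` tag when no listed language has a
    matching name:<code> entry, so existing data renders as it does today.
    """
    for codes in (split_codes(preferred), split_codes(official)):
        names = []
        for code in codes:
            value = tags.get(f"name:{code}")
            if value and value not in names:
                names.append(value)
        if names:
            return names
    return [tags["name"]] if "name" in tags else []


# Illustrative tags: name:ja exists for localization only and is ignored.
street = {"name": "A / B", "name:fr": "A", "name:de": "B", "name:ja": "C"}
print(pick_names(street, preferred="fr;de", official="de;fr;it"))

# Only an Italian name is tagged, so the official list decides.
square = {"name": "Piazza", "name:it": "Piazza"}
print(pick_names(square, preferred="fr;de", official="de;fr;it"))
```

Features that carry only a `name` tag pass through unchanged, which matches the "no extra cost for monolingual places" point made earlier in the thread.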

aseigo · November 14, 2024, 7:25am

It doesn’t necessarily have to do with law (though often it does happen to map to that), but generally that’s the idea: what is the standard list of languages in use for naming, and which of them are actually in use in this given locale.

You offer some excellent examples of how tricky it can be!

In your example for the Ohlone name: if the preferred language set was krb;cst;css;en then it would Just Work™ out of the box. The Ohlone name would be selected if available (in whichever of the three dialects it appears in), with the English name appended where that is provided as well.

The “shared character set, different words” example seen in CJK languages that you bring up is another excellent case where language metadata is a requirement for the sensible/correct thing to occur reliably.

It is probably true we can’t fix 100% of cases, but right now OSM is failing in too many multi-lingual cases. Hopefully we can improve that with some small changes.

alan_gr · November 14, 2024, 7:26am

Taking the case of Ireland again, it’s an objective fact that Irish and English are official languages throughout the country. So a tag called language:official, if the tag name means anything, would refer to Irish and English.

As I understand it, this could have the side effect of dual rendering all names in the country, contrary to what happens now. To get back to the current situation for most of the country, I think your idea would be to set the preferred language to English only, and then the official tag would be ignored. That would still require significant work to map Gaeltacht areas and tie them to admin units - possibly it could be done by setting the preferred language to Irish on lots of level 10 admin units (townlands).

But this may be less relevant if you are moving away from the official tag to one that is more clearly about rendering/presentation, rather than a statement of fact about formally recognised official languages.

Minh_Nguyen · November 14, 2024, 9:31am

I’ve asked some folks who work on OSM-based navigation software to offer their perspective.

Possibly outdated by now, but my recollection while working on the Mapbox Navigation SDK was that users preferred things that were sometimes contradictory. In general, our typical target user wanted to search for things in their own language and see places and POIs in their own language on the map, but see and hear street names in a signposted language to avoid getting lost. In the event of multiple options, they wanted to see and hear names in their preferred language or writing system to the maximum extent, falling back to related languages. Ultimately, a butchered name was better than a completely unspeakable one – but users definitely made sure we knew about the butchered names.

Language fallbacks are incredibly complicated. Someone who uses Serbian in Latin might be OK with hearing a Russian fallback, but displaying Cyrillic text would be unfortunate. Someone who uses Japanese would rather hear an English fallback than Chinese. Chinese speakers in Singapore are a lot more open to an English fallback than Chinese speakers in China. Someone contributed an Esperanto localization (because open source), but no text-to-speech engine has an Esperanto voice, so we wound up mapping it to the Italian voice. Making matters worse, there isn’t a one-to-one mapping from written language codes to spoken language codes, but you need both for text-to-speech. What to do when the basemap data says it’s in Traditional Chinese (zh-Hant) but the only available voices are in Cantonese (yue) and Mandarin (cmn)?

These were unsolved problems for us. For better or worse, we ended up displaying and vocalizing name=* in the navigation UI but displaying the user’s preferred language on the basemap. In most software components, we used a mix of the truncation fallback algorithm, CLDR language matching algorithm, and an ad hoc fallback mechanism. However, we would tell the TTS engine to sound out every name in the user’s own language. If this caused the American English voice to butcher a Polish name, at least it would match the hapless American tourist also trying to read the sign at the same time. On the bright side, both Android and iOS let the user choose a preferred language fallback list, which we honored, so the user could work around any suboptimal fallbacks on their own.
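For readers unfamiliar with the term, "truncation fallback" is commonly understood as the lookup scheme from RFC 4647: subtags are dropped from the right until a match is found. The Python sketch below illustrates that idea against `name:<code>` tags. It is not the Mapbox implementation; the function names and sample tags are hypothetical, and real software would also need case normalization and the CLDR matching mentioned above.

```python
def truncation_candidates(tag: str) -> list[str]:
    """List a language tag and its progressively truncated forms."""
    subtags = tag.split("-")
    candidates = []
    while subtags:
        candidates.append("-".join(subtags))
        subtags.pop()
        # A trailing single-character subtag (e.g. "x") cannot stand alone.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return candidates


def lookup_name(tags: dict, user_languages: list[str]):
    """Return (language, name) for the first match, else (None, plain name)."""
    for language in user_languages:
        for candidate in truncation_candidates(language):
            value = tags.get(f"name:{candidate}")
            if value:
                return candidate, value
    return None, tags.get("name")


# Placeholder values for illustration only.
place = {
    "name": "Local name",
    "name:zh-Hant": "Traditional Chinese name",
    "name:en": "English name",
}
print(truncation_candidates("zh-Hant-TW"))
print(lookup_name(place, ["zh-Hant-TW", "en"]))
print(lookup_name(place, ["sr-Latn"]))
```

Note that truncation alone never maps a written code such as zh-Hant to a spoken one such as yue or cmn; that gap is exactly the unsolved problem described above.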

To clarify, they made a point of renaming it from “Coyote Ridge” to “Máyyan 'Ooyákma – Coyote Ridge”. That is, the English name (name:en=*) now contains the Chochenyo name. “Coyote Ridge” is the old_name=* now. I’ve added the pronunciation, which does make a noticeable difference in the TTS engines I tried (Amazon Polly and VoiceOver).

wardmuylaert · November 14, 2024, 12:43pm

That is quite the sweeping assertion. When I tell English speakers my Dutch language first name “Ward”, they do not even understand what I am saying most of the time. Don’t even get me started on my last name which uses vowel sounds they do not have nor would be able to link to the letters.

I am no Vietnamese speaker, but know a Vietnamese woman whose name is spelled Trang and she says you have to pronounce it like “Chong”. If I heard that while driving, it would be of no help to find the right sign.

Now don’t get me wrong, if you know the local language then sure, it can make sense, but that does not “just work” everywhere.

SomeoneElse · November 14, 2024, 12:49pm

@aseigo I think what would really help would be some concrete examples. Have a look at some of the examples above and say perhaps

How you’d tag https://www.openstreetmap.org/relation/6266995/history and Node History: Dingle / Daingean Uí Chúis (52241235).
How you’d tag Relation: San Jose (112143).
How you’d tag Node History: St Davids (3712052604).

SomeoneElse · November 14, 2024, 1:07pm

My recollection was that it was actually the other way around, but you’ve probably been there much more recently than me

To some extent it applies everywhere. I used to create specific sat-nav maps from Garmin data, and due to the dash layout of the car I had at the time I was reliant on audio for directions. Basically you need to make sure that whatever the map thinks is the “name” is short enough to read out and spelt so that the TTS will recognise it. This meant suppressing “some long names”, and certainly not using the raw name tag from OSM (which as discussed above can have all sorts of stuff in it). It also meant some phonetic renames, such as “Brian Cluff Way” for Brian Clough Way (way 140964831). However, this data modification was always made in the process of making the map, not in the OSM data.
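A map-build step of the kind described above (suppress long names, respell names the TTS engine gets wrong) could look roughly like this Python sketch. The override table, the length threshold, and the function name are all assumptions made for illustration; nothing here reflects the actual tooling that was used.

```python
from typing import Optional

# Hypothetical respelling table, maintained outside OSM as described above.
PHONETIC_OVERRIDES = {"Brian Clough Way": "Brian Cluff Way"}

# Arbitrary threshold chosen for illustration.
MAX_SPOKEN_LENGTH = 40


def spoken_name(name: str) -> Optional[str]:
    """Return the text to hand to the TTS engine, or None to stay silent."""
    if name in PHONETIC_OVERRIDES:
        return PHONETIC_OVERRIDES[name]
    if len(name) > MAX_SPOKEN_LENGTH:
        return None  # suppress names that are too long to read out
    return name


print(spoken_name("Brian Clough Way"))
print(spoken_name("High Street"))
```

Keeping the overrides in the build step rather than in OSM matches the point that the source data itself was never modified.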

With regard to languages, saying the name of something like Eteläinen Rautatiekatu in Finland presumably would be the same regardless of the language of the driver? People navigating need to match what is said to what is written on the sign, and there it’ll be in Finnish (perhaps with the Swedish version in small letters underneath). Translating to anything other than the main name on the sign makes no sense.

trigpoint · November 14, 2024, 1:43pm

Pronunciation is often a case of knowing; being a native speaker doesn’t help.

Way: Belvoir Street (3076563), pronounced “Beaver”

Node: The Ercall (356741658), pronounced “Arcle”

There are lots more where standard rules of English don’t apply in England.

Cholmondeley is pronounced Chumley.

Both have been used at my places of work as conference room names and gave us many laughs at non-locals asking for the belle voir room.

Must confess I hate voice navigation, and always turn it off and use screen directions.

Minh_Nguyen · November 14, 2024, 6:29pm

Navigation applications have to cater to a variety of user behaviors. Some rely on voice instructions while others rely on visual instructions. Some need the map to always point north while others need the map to rotate with their direction of travel. People with more spatial awareness or local knowledge may dispense with turn-by-turn navigation altogether and rely on a route preview map. That said, with wearable devices like watches and smart glasses, pure voice-based navigation becomes a lot more important. OSM is consumed aurally, but we don’t pay enough attention to this use case because rendered maps like OSM Carto are so entrenched in mapper workflows.

Besides active turn-by-turn navigation, users with vision impairment or blindness rely even more heavily on text-to-speech as an assistive technology, both for navigation and for basic map exploration. In some jurisdictions, government regulations require applications to make maps accessible through screen readers. The good news is that these users tend to be quite lenient about proper pronunciation. I’ve always been amazed at how skillfully blind users navigate user interfaces with screen readers zipping through text at a rate of upwards of 550 words per minute – twice as fast as the average seeing user can read. The flip side is that both code-switching into other languages and skipping unpronounceable words can seriously throw off the user.

Hybrid names, like the Franglais in Canada, probably sound rather silly to end users. I can’t be too critical of this practice, because I know it’s an uneasy compromise between language communities that sometimes view each other with suspicion. But just as Wikipedia is not paper, we aren’t signmakers and don’t always have to encode exactly what’s on the sign. Recalling one of the examples I gave earlier, I really hope no one feels the need to tag this restaurant as name=Seafood 金山 Kim Sơn 酒家 Buffet just because of this sign that literally puts the most important language front and center. If there’s an overriding need to balance competing interests by stuffing multiple names in name=*, then at least we can avoid jumbling the words and use a predictable, unambiguous delimiter, so that the heuristic of “find a match between name:*=* and name=*” gets a sporting chance.
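The "find a match between name:*=* and name=*" heuristic can be sketched in Python as follows, assuming a known set of delimiters. The delimiter set, the regular expressions, and the function name are assumptions for illustration; the sample tags are simplified.

```python
import re

# Matches name:<code> keys such as name:fr or name:zh-Hant, but not
# unrelated keys such as name:etymology.
LANG_KEY = re.compile(r"name:([a-z]{2,3}(?:-[A-Za-z0-9]+)*)")
# Assumed delimiters: ";", "/" or a spaced hyphen, with optional whitespace.
DELIMITER = re.compile(r"\s*(?:;|/| - )\s*")


def infer_name_languages(tags: dict) -> list[tuple]:
    """Split `name` on delimiters and label each part with a language code.

    A part is labelled None when no name:<code> value matches it exactly.
    If two languages share a value, the first one encountered wins.
    """
    by_value = {}
    for key, value in tags.items():
        match = LANG_KEY.fullmatch(key)
        if match:
            by_value.setdefault(value, match.group(1))
    parts = [p for p in DELIMITER.split(tags.get("name", "")) if p]
    return [(part, by_value.get(part)) for part in parts]


delimited = {"name": "Biel/Bienne", "name:de": "Biel", "name:fr": "Bienne"}
jumbled = {"name": "Rue John Road", "name:fr": "Rue John", "name:en": "John Road"}

print(infer_name_languages(delimited))
print(infer_name_languages(jumbled))
```

The delimited name resolves to one language per part, while the jumbled hybrid yields a single unlabelled part, which is the failure mode the paragraph above is warning about.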

By the way, that seafood restaurant is on a street that has signposted names in different languages, but only the English name is used in addresses. iD apparently led a mapper to stick both names in addr:street=*.

github.com/openstreetmap/iD

Address field dropdowns should choose a name in the relevant language

opened 06:55PM - 14 Nov 24 UTC

1ec5

localization field

As far as I know, most postal services and other addressing authorities use only one name at a time in each part of an address (street, city, etc.), even if OSM would give those features multiple values in `name=*` as a linguistic compromise. For example, I recently had to [retag](https://www.openstreetmap.org/changeset/159141323) a bunch of POIs where a mapper had previously accepted iD’s suggestion of `addr:street=Bellaire Boulevard;Đại Lộ Sàigòn` based on [this nearby street](https://www.openstreetmap.org/way/1067465473), which is tagged `name=Bellaire Boulevard;Đại Lộ Sàigòn`. This street has two values in `name=*` because of [dual wayfinding signs](https://www.mapillary.com/app/?pKey=314599710013458). But neither the U.S. Postal Service nor the county records manager recognizes “Đại Lộ Sàigòn” as a street name for addressing purposes. Even if they did, they still wouldn’t recognize a dual name. (Do other countries’ postal systems work like this? Please correct me if I’m wrong.) Ideally, iD would somehow know that it should use `name:en=Bellaire Boulevard` instead, because that’s the standard language for addressing in most of the U.S. We might be able to use data/territory_languages.json for this purpose, but it gets tricky in multilingual countries where the countrywide default language may not be a good fit. Since iD also consults Nominatim, maybe it can access the `default_language=*` from the surrounding boundaries, or it could base the decision on the interface language. (Not great, but this is a fallback.) Or maybe iD could simply use the `name=*` but truncate it at the first delimiter (a semicolon in this case).
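The two options floated in the issue (prefer the name in the addressing language, otherwise truncate `name=*` at the first delimiter) can be combined in a short Python sketch. This is not iD code, which is written in JavaScript; the function name and parameters are hypothetical, and how the addressing language would be determined is left open, as it is in the issue.

```python
from typing import Optional


def suggest_addr_street(street_tags: dict,
                        addressing_language: Optional[str] = None,
                        delimiter: str = ";") -> str:
    """Suggest a single-language value for addr:street from a street's tags."""
    if addressing_language:
        localized = street_tags.get(f"name:{addressing_language}")
        if localized:
            return localized
    # Fallback: keep only the part of name before the first delimiter.
    name = street_tags.get("name", "")
    return name.split(delimiter, 1)[0].strip()


street = {
    "name": "Bellaire Boulevard;Đại Lộ Sàigòn",
    "name:en": "Bellaire Boulevard",
}
print(suggest_addr_street(street, addressing_language="en"))
print(suggest_addr_street({"name": street["name"]}))
```

The truncation fallback only helps when the addressing language happens to come first in `name=*`, so it is a weaker option than an explicit language hint.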

Jarek · November 15, 2024, 3:51am

Sorry, I still can’t really understand the relevance of this scenario. You’ve stated that your user prefers voice instructions in English and German street names pronounced in German. So the application would be configured to read name:de in German, and the other instructions in English.

I don’t understand what OSM could tag to make this configuration not required. Which languages:preferred or languages:display_order or whatever would you tag? fr;de or de;fr? Why? How will it help? What about the users driving in this same area who want the instructions in Polish and the street names in French?

Or is this entire scenario only about having multiple languages mashed together in the name field? But I don’t see how having a semicolon-separated name or no name at all would make it any better. The user still has to tell the app if they prefer street names in a given language, and at that point the first thing the app should be doing is checking for name:<lang>. The only thing I can see that would make heuristics based on name easier is if name is tagged and has exactly one language in it, but that’s not acceptable in most multilingual areas – whatever separator (;, /, -, whitespace) ends up being used.

Or is it about historical or not-really-used-but-still-tagged names in other languages? For example, the person who configured their application to read German street names then driving to Poland and being told to look for Langgasse or Fleischergasse in Gdańsk? (I would argue that shouldn’t really be tagged name:de, but in practice it is at least a bit, although old_name:de seems more widespread in Gdańsk.) But then I don’t see any name:en on streets in Biel/Bienne, and barely any in Brussels, so maybe tagging not-widely-used street names isn’t really a widespread problem?

aseigo · November 15, 2024, 7:39am

Ask yourself: how does the software know which language (or languages) the name field is in?

That is not the proposed solution, of course.

The proposed solution is to add new tags which note what the preferred languages are for local names so that instead of going to the name tag, renderers can instead pick out the localized tag (name:<lang code>) and render those.

Currently, when there are multiple names in the name tag, it is hard or even implausible for software to figure out what is going on in there.

Complicating this, many features have names in multiple languages that are not used locally, but are there purely for localization when rendering maps in other languages not used locally.

So renderers can’t even just look at the set of name:<lang code> entries and assume those are the correct set of localized names to use in default rendering.

By being able to pick the localized names that match the locally-appropriate set of languages for names, we can still have localized name:<lang code> tag sets and only rely on name as a fallback, while still having properly and consistently rendered local names.

SomeoneElse · November 15, 2024, 11:32am

It is OSM, so it can only assume that it might be a hodge-podge of names in different languages that may or may not correspond to signage.

However, the relevant question (which you may be trying to address) is “which name:xx are useful to present to the user”.

aseigo · November 15, 2024, 11:35am

Exactly, which elides the issue of unfortunate values in name.

In the problem statement in the proposal it says: “However, there is currently no mechanism for an OSM renderer to determine which name:<language-code> tags should be shown in which order.”

If that was not clear or visible enough, I can address that in the upcoming revision. Wording suggestions, etc. welcome.

alan_gr · November 15, 2024, 12:29pm

On a different aspect, the proposal mentions that the current approach “creates disputes among contributors over the content of the default name tag, with discussions often driven by regional language politics and local editing customs, rather than focused on usability or data quality”.

The implication is that the proposal would help reduce these disputes. I find that doubtful. For example currently there is an active discussion on this forum about (among other things) the use of dual Spanish-Catalan “name=” in certain areas near Catalonia but outside its borders. I don’t see how the proposal would change things - there would be the same disagreement about what the official and preferred languages actually are, and the level of administrative unit at which they are valid.

I don’t think the proposal stands or falls on this specific issue, but it might be better not to present it as an advantage unless you have a clear justification for how it would help with disputes. The only thing I can think of is that it would make it easier to implement a consensus if one is reached. But usually the problem is reaching the consensus in the first place.