Multiple delimited names in the name tag

What about storing in the names key not a list of names, but an unordered list of name: suffixes, from which consumers can choose a single one or combine several to their liking? In the case of Londonderry/Derry (node 267762522 on OpenStreetMap), that’d make names=en_GB;en_IE.
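To make the suggestion concrete, here is a minimal Python sketch of how a consumer might read such a key. The names key is the hypothetical one proposed above, the function names are made up for illustration, and the name:en_GB and name:en_IE values are only example data, not a claim about how the node is or should be tagged.

```python
def resolve_names(tags, key="names"):
    """Return (suffix, name) pairs for each suffix listed in the hypothetical `names` key."""
    suffixes = [s.strip() for s in tags.get(key, "").split(";") if s.strip()]
    resolved = []
    for suffix in suffixes:
        value = tags.get(f"name:{suffix}")
        if value is not None:  # skip suffixes that have no matching name:* tag
            resolved.append((suffix, value))
    return resolved


def label(tags, separator=" / "):
    """Combine all resolved names, falling back to plain `name` if nothing resolves."""
    names = [name for _, name in resolve_names(tags)]
    unique = list(dict.fromkeys(names))  # drop repeats, keep first-seen order
    return separator.join(unique) if unique else tags.get("name", "")


# Illustrative tags only.
example = {
    "name": "Londonderry/Derry",
    "names": "en_GB;en_IE",
    "name:en_GB": "Londonderry",
    "name:en_IE": "Derry",
}
print(label(example))
```

A consumer that wants only one name could take the first pair from resolve_names instead of joining them; one that ignores the new key entirely still has name to fall back on.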

name:en_gb is not strictly true: many GB people refer to Derry/Londonderry, not just Londonderry as that tag would imply. It would not be acceptable to display only one or the other on a map such as osm.org.

Phil (trigpoint)


From this particular data consumer’s point of view, I would love to have ready-made data for “this name is in this language”. Not just for multiple names, but for all of them… so I knew which voice to use for cycle.travel’s turn-by-turn directions. Right now it uses en-GB everywhere, which results in some entertainingly bad French pronunciation.

But then, greater minds than mine have struggled on that one too. I used to have an iPod Shuffle which used text-to-speech to tell you what the upcoming track was called. Unfortunately it thought Saint Etienne were a French band and was accordingly keen to tell me that the next track was called “Lak a Moteurwey”.


why would it be in “name”? It would mean retagging all “name” tags. Rather it could be an additional tag so things can be introduced slowly without forcing any changes to current data.

ok I see, it is “names” not “name”

(@dieterdreist, I don’t think you interpreted the post right, or you replied to the wrong post.)


I think the core challenge this thread is grappling with is “How can I (a data consumer) determine what the name of a thing is in the local languages?”

We’re mostly focused on delimiters in name=* because that is the current field for “the name in local language”. Other ways of determining this like the suggestion of names=en_GB;en_IE are on-topic for that underlying challenge.


I suspect that “Londonderry/Derry” could be considered a single name at this point. There’s a reason people call it “Stroke City” but don’t call Brussels the equivalent of “Spaced Dash City” in its local languages. (Though it sounds like “Squiggly City” might be a possibility too…) If that means a renderer might label the city redundantly, at least there’s an interesting explanation for it.

Aside from that special case, you’ve hit upon the point that keeps getting missed in this thread: that language is not the only reason for a name or *_name or name:* tag to contain multiple equally valid values. Sometimes we can differentiate with subkeys, but just as with linguistic differences, there still needs to be a name to “split the difference”, so to speak. Other times, there’s really no reasonable way to qualify which name is used in which context.

Incidentally, I’m so glad the Karlsruhe schema doesn’t specify a slash as a value delimiter for house numbers:

1806/127/2/6/15/48/4B Huỳnh Tấn Phát

These house numbers are often (but not always) based on the numbered streets leading up to them, some of which have slashes and dashes in their names too:

As in Northern Ireland, Vietnamese people have a special nickname for these millions of addresses and streets: siêu xuyệt, meaning “super-slash”. Data consumers tend to misinterpret these addresses, causing real-world problems for residents.

… on which note see this from 2018 - “… a symbol used only by mathematicians, dictionaries and Unix programmers” :slight_smile:

You’ll have to ask the admins @forum-governance to do that. We mods don’t have access to the tools to move messages around.

Technically it can go further to name:en-GB-u-sd-gbeng (England) and name:en-GB-u-sd-gbnir (Northern Ireland) etc. But in general that’s not correct either, as you explained.

I’d map several POIs there, one for each office. The co-location maybe even has a name of its own?


Your suggestion works in some cases but not others, due to the “One feature, one element” principle. Sometimes I do map multiple POIs, such as for this engineer’s office that has enough desks and employees that I could imagine each one serving a different sewerage district, mapped as separate POIs right next to it.

On the other hand, I can’t stress enough that multiple professionals often share the same office without a distinct name for the office. Two dentists have a shared practice at this single office at unit #3, incidentally signposted in two languages. Their partnership has no publicly visible name of its own, unlike the shared practice next door in unit #4:

I’ve also seen situations where a single desk doubles as multiple distinct businesses. These businesses may have different names and websites, but in the real world, they might exist only as different tabs in a browser window on the computer on a desk in an office.

It’s not just professionals: some retail businesses also combine brands in a manner that makes a multi-POI representation purely arbitrary. This is really just a single restaurant that happens to serve items from two brands:

I can’t tell you whether a slash or dash or space is technically the most appropriate delimiter. But if I order a Taco Bell burrito with a side of KFC mashed potatoes from the restaurant’s single counter, my credit card statement might well say KFCTACOBELL on it. :stuck_out_tongue:

The key thing when mapping a combination Taco Bell is, of course, to remember the addr:housenumber tag to eliminate confusion.

Imagico posted two articles on his blog, insightful as always:

From reading there, mappers already do as I proposed, by turning the default_language key into such a list, of course separated by semicolon.

MUCH better than writing out the names in the names key and thereby losing language and script info. Should fix several problems at once:

What remains is how to handle the non-language-based multiples.

Seems like both progress and confusion simultaneously here. I think Imagico is on a correct track, which is good, though it takes us only part of the way on this journey. Part of the confusion seems to arise from the gap between what the default_language key appears to be doing and what it might actually be doing. It seems to work for many cases, but will it work for all of them? That’s a tough problem to solve.

And “multiples” remain problematic. Yet I sense progress, perhaps us coming a bit closer.

With Asa’s “suffixes” I think you’d need to invent a new key or subkey.

I must say that I disagree with Imagico’s analysis.

Firstly, default_language=* may be a fine idea for places where administrative areas have a single primary language, but this falls down completely in numerous places where local language dominance doesn’t follow administrative boundaries. A prime example of this is neighborhoods dominated by ethnic minorities in many of the world’s cities. These neighborhoods may have one or more local languages that appear on signage, place names, and other features, but which still remain a minority in the smallest administrative area that surrounds them. These neighborhoods also do not necessarily have sharp borders that one could draw a polygon around. Minority-owned businesses named in minority languages may intermingle with majority-culture businesses as well as those of yet other minorities. For example, a predominantly Chinese neighborhood in New York City may overlap with an Italian neighborhood, a Korean neighborhood, and a Jewish one, and on the street between them one may see an intermingling of businesses with their primary names in 5 different languages and as many or more scripts. default_language completely fails to accommodate this type of real-world complexity.

Next he says:

Isn’t the logic of splitting compound name tags awfully complicated? Wouldn’t it be much easier to just standardize on semicolon as separator?

I don’t think there is a huge difference between supporting one or supporting three separators. The detection of different scripts to separate compound names without separator (as it is common in particular in Africa) is a different matter. But i am pretty sure once there is a viable way to get multilingual name display without an undesired delimiter, the local communities would not be opposed to changing that. For the moment this complexity is there to support all the common multilingual name tagging variants with equal determination.

As discussed previously in this thread, there are numerous valid single names that include -, /, spaces, and other similar punctuation. Semicolons simply require fewer cases to be escaped than the other common delimiters.
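As a rough illustration of why the semicolon is the less painful choice, here is a small Python sketch of a splitter. It assumes a doubled semicolon (;;) stands for a literal semicolon inside a name; that escaping convention is an assumption of this sketch, and split_values is a made-up helper, not any editor's or renderer's actual code.

```python
def split_values(value):
    """Split a tag value on ';', treating ';;' as an escaped literal semicolon (assumed convention)."""
    parts = []
    current = []
    i = 0
    while i < len(value):
        if value[i] == ";":
            if i + 1 < len(value) and value[i + 1] == ";":
                current.append(";")  # escaped semicolon stays inside the current name
                i += 2
                continue
            parts.append("".join(current).strip())  # a lone semicolon ends the current name
            current = []
        else:
            current.append(value[i])
        i += 1
    parts.append("".join(current).strip())
    return [p for p in parts if p]  # drop empty pieces from stray delimiters


print(split_values("Derry;Londonderry"))
print(split_values("Londonderry/Derry"))
```

Names containing a slash, dash, or space pass through untouched, so only the rare name with a real semicolon needs escaping.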


I’m glad to see some experimentation on the osm-carto front. I look forward to some of these experiments being made available for mappers to experience firsthand.

What @imagico has prototyped needn’t be mutually exclusive with what Americana has implemented. It’s entirely possible that a perfect solution in a renderer would require the combination of multiple approaches, but realistically some data consumers will need something in between that and the most basic handling of name.

As always, the devil is in the details. So far this approach relies on some details that hopefully can be clarified by the time it goes into production. Here are a few things that come to mind:

  • default_language itself was rejected in 2018. There seems to be some enthusiastic support for it in this thread, so I wonder if that vote’s participants were missing important context. It may be an uphill battle to convince editors to support a rejected key.

  • It would be essential to mitigate the risk of vandalism or accidental breakage. Boundary relations break all the time in a manner that is often difficult to fix without local knowledge. To mitigate this risk, the FAQ points out that boundary-based defaults could be subject to a delay, similar to the changeset review process implemented by some data consumers such as Mapbox. I’m intrigued by this possibility but also apprehensive about introducing another process similar to the coastline process that works except when it doesn’t.

  • Since language use does not neatly correspond to administrative boundaries in the real world, this prototype relies on arbitrary features to be tagged with default_language. This is equivalent to another rejected 2017 proposal in all but name.

  • Since a space is typically used as a delimiter in North Africa and Hong Kong, the prototype tries to detect boundaries between different scripts. This is naturally only a rudimentary implementation with room for improvement, but I suspect it would be the most fragile aspect of this approach.

    The Unicode standard comes with an algorithm for script boundary detection as part of text segmentation; however, there are plenty of edge cases in POI and place names that this algorithm would consider to be degenerate. One challenge is that many common characters are script-neutral, such as “15” or “131”. But clearly Latin letters like “E” can also be part of a non-Latin name. The Japanese community has even standardized on a mix of rōmaji, kana, and kanji in POI names, based on the literal contents of signs.

    To be clear, I think it’s OK for a data consumer to develop sophisticated heuristics along these lines. MapLibre GL/Mapbox GL uses similar heuristics to decide whether it’s appropriate to rotate CJK text vertically. But if splitting strings on semicolons is merely a quick fix for “ugly breakage”, to be viewed askance, then what are we to make of text segmentation heuristics that may not even be reliable?
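For a sense of how fragile this kind of heuristic is, here is a deliberately crude Python sketch of splitting a space-delimited name where the script changes. It is not the prototype's code and not the Unicode segmentation algorithm: the standard library does not expose the Unicode Script property, so this guesses the script from the first word of each character's Unicode name, which is an assumption made for the sake of a short example. Both function names are invented here.

```python
import unicodedata


def script_of(ch):
    """Crude script guess from the Unicode character name; None for script-neutral characters."""
    if not ch.isalpha():
        return None  # digits, punctuation, and combining marks are treated as neutral
    try:
        return unicodedata.name(ch).split()[0]  # e.g. "LATIN", "ARABIC", "CJK"
    except ValueError:
        return None  # character has no name in this Unicode database


def split_on_script_change(name):
    """Group space-separated words, starting a new group when the script clearly changes."""
    groups = []
    current_script = None
    for word in name.split():
        scripts = {s for s in map(script_of, word) if s}
        # A word counts as neutral if it has no letters or mixes several scripts.
        word_script = scripts.pop() if len(scripts) == 1 else None
        if groups and (word_script is None or current_script is None
                       or word_script == current_script):
            groups[-1].append(word)
            current_script = current_script or word_script
        else:
            groups.append([word])
            current_script = word_script
    return [" ".join(g) for g in groups]


# Latin followed by Arabic, written with escapes to keep the source unambiguous.
arabic = "\u0627\u0644\u062f\u0627\u0631 \u0627\u0644\u0628\u064a\u0636\u0627\u0621"
print(split_on_script_change("Casablanca " + arabic))
```

Note how a neutral token like “15” just sticks to whatever came before it, and how a word mixing scripts is never split at all. A Japanese name that deliberately mixes rōmaji, kana, and kanji across spaces would be chopped into pieces by this sketch, which is exactly the kind of edge case described above.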

Christoph’s prototype does attempt to address this case too, by requiring the component names to be repeated in alt_name or similar. Unfortunately, none of the *_name keys would apply to the monolingual cases I brought up above, such as the shared dentist’s office in Cupertino, so the prototype would simply hide the label, communicating to the mapper that the POI is mistagged. I’m not optimistic that any proposal to redefine alt_name would ever fly, but there has been a suggestion to introduce a new key for this purpose:

In fairness to Christoph, his prototype is primarily aimed at solving the Unihan/Arabic font selection problem, which is a legitimate goal, and only incidentally about laying out compound monolingual or multilingual labels. I quibble with the notion that one goal is inherently more worthy than the other, and especially with a cold calculation based on population, but every software project is entitled to its own priorities. Certainly it’s good to finally choose a Chinese font for Chinese and a Nastaliq font for Urdu, but I don’t see it as inextricably tied to other labeling issues.

For Americana, selecting a font for the label has been slightly less problematic than for a statically rendered style. A combination of happy accidents results in the style automatically selecting not only the user’s preferred language but also, for CJK only, their preferred font in their preferred language. Consequently, the delimiter issue has been more of a focus, but there’s lots of room for improvement around fonts too.

Firstly, default_language=* may be a fine idea for places where administrative areas have a single primary language, but this falls down completely in numerous places where local language dominance doesn’t follow administrative boundaries. A prime example of this is neighborhoods dominated by ethnic minorities in many of the world’s cities. These neighborhoods may have one or more local languages that appear on signage, place names, and other features, but which still remain a minority in the smallest administrative area that surrounds them. These neighborhoods also do not necessarily have sharp borders that one could draw a polygon around. Minority-owned businesses named in minority languages may intermingle with majority-culture businesses as well as those of yet other minorities. For example, a predominantly Chinese neighborhood in New York City may overlap with an Italian neighborhood, a Korean neighborhood, and a Jewish one, and on the street between them one may see an intermingling of businesses with their primary names in 5 different languages and as many or more scripts. default_language completely fails to accommodate this type of real-world complexity.

The default language could be tagged on a per-object basis. It could either be the standard or be used to override a value from an enclosing polygon.
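A minimal Python sketch of that override logic, assuming the consumer has already worked out which areas enclose the feature and sorted them smallest first. The function name and the input shape are hypothetical, and the semicolon-separated list follows the usage described earlier in the thread.

```python
def effective_default_language(feature_tags, enclosing_areas):
    """Return the language list for a feature: its own tag wins, else the nearest enclosing area's."""
    own = feature_tags.get("default_language")
    if own:
        return [code.strip() for code in own.split(";") if code.strip()]
    for area_tags in enclosing_areas:  # assumed to be ordered smallest area first
        inherited = area_tags.get("default_language")
        if inherited:
            return [code.strip() for code in inherited.split(";") if code.strip()]
    return []  # nothing tagged anywhere: the consumer must fall back to plain `name`


# Illustrative data: a shop that overrides the language of the city around it.
city = {"default_language": "en"}
shop = {"name": "Example", "default_language": "zh;en"}
print(effective_default_language(shop, [city]))
print(effective_default_language({"name": "Other"}, [city]))
```

The spatial lookup is the expensive and breakable part, which this sketch skips entirely; the per-object tag at least avoids it for the features that carry one.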

It’s worth remembering that whatever is chosen will need to be acceptable to data consumers who are only looking at the name tag, and no other fields.
