Multiple delimited names in the name tag

Incidentally, I’m so glad the Karlsruhe schema doesn’t specify a slash as a value delimiter for house numbers:

1806/127/2/6/15/48/4B Huỳnh Tấn Phát

These house numbers are often (but not always) based on the numbered streets leading up to them, some of which have slashes and dashes in their names too:

As in Northern Ireland, Vietnamese people have a special nickname for these millions of addresses and streets: siêu xuyệt, meaning “super-slash”. Data consumers tend to misinterpret these addresses, causing real-world problems for residents.

… on which note see this from 2018 - “… a symbol used only by mathematicians, dictionaries and Unix programmers” :slight_smile:

You’ll have to ask the admins @forum-governance to do that. We mods don’t have access to the tools to move messages around.

Technically it can go further to name:en-GB-u-sd-gbeng (England) and name:en-GB-u-sd-gbnir (Northern Ireland) etc. But in general that’s not correct either, as you explained.

I’d map several POIs there, one for each office. The co-location maybe even has a name of its own?

Your suggestion works in some cases but not others, due to the “One feature, one element” principle. Sometimes I do map multiple POIs, such as for this engineer’s office that has enough desks and employees that I could imagine each one serving a different sewerage district, mapped as separate POIs right next to it.

On the other hand, I can’t stress enough that multiple professionals often share the same office without a distinct name for the office. Two dentists have a shared practice at this single office at unit #3, incidentally signposted in two languages. Their partnership has no publicly visible name of its own, unlike the shared practice next door in unit #4:

I’ve also seen situations where a single desk doubles as multiple distinct businesses. These businesses may have different names and websites, but in the real world, they might exist only as different tabs in a browser window on the computer on a desk in an office.

It’s not just professionals: some retail businesses also combine brands in a manner that makes a multi-POI representation purely arbitrary. This is really just a single restaurant that happens to serve items from two brands:

I can’t tell you whether a slash or dash or space is technically the most appropriate delimiter. But if I order a Taco Bell burrito with a side of KFC mashed potatoes from the restaurant’s single counter, my credit card statement might well say KFCTACOBELL on it. :stuck_out_tongue:

The key thing when mapping a combination Taco Bell is, of course, to remember the addr:housenumber tag to eliminate confusion.

Imagico posted two articles on his blog, insightful as always:

From reading there, mappers already do as I proposed, by turning the default_language key into such a list, of course separated by semicolon.

MUCH better than writing out the names in the names key and thereby losing language and script info. Should fix several problems at once:

Remains, how to handle the non-language based multiples.

There seems to be both progress and confusion here. I think Imagico is on the right track, which is good, though it takes us only part of the way. Some of the confusion seems to arise from the gap between what the default_language key appears to do and what it might actually be doing. It seems to work for many cases, but will it work for all of them? That’s a tough problem to solve.

And “multiples” remain problematic. Yet I sense progress, perhaps a step closer.

With Asa’s “suffixes” I think you’d need to invent a new key or subkey.

I must say that I disagree with Imagico’s analysis.

Firstly, default_language=* may be a fine idea for places where administrative areas have a single primary language, but this falls down completely in numerous places where local language dominance doesn’t follow administrative boundaries. A prime example of this is neighborhoods dominated by ethnic minorities in many of the world’s cities. These neighborhoods may have one or more local languages that appear on signage, place names, and other features, but which still remain a minority in the smallest administrative area that surrounds them. These neighborhoods also do not necessarily have sharp borders that one could draw a polygon around. Minority-owned businesses named in minority languages may intermingle with majority-culture businesses as well as those of yet other minorities. For example, a predominantly Chinese neighborhood in New York City may overlap with an Italian neighborhood, a Korean neighborhood, and a Jewish one, and on the street between them one may see an intermingling of businesses with their primary names in 5 different languages and as many or more scripts. default_language completely fails to accommodate this type of real-world complexity.

Next he says:

Isn’t the logic of splitting compound name tags awfully complicated? Wouldn’t it be much easier to just standardize on semicolon as separator?

I don’t think there is a huge difference between supporting one or supporting three separators. The detection of different scripts to separate compound names without separator (as it is common in particular in Africa) is a different matter. But i am pretty sure once there is a viable way to get multilingual name display without an undesired delimiter, the local communities would not be opposed to changing that. For the moment this complexity is there to support all the common multilingual name tagging variants with equal determination.

As discussed previously in this thread, there are numerous valid names that include -, /, spaces, and other similar punctuation in something that is validly a single name. Semicolons simply require fewer cases to be escaped than the other common delimiters.
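To make the escaping point concrete, here is a minimal sketch of how a data consumer might split a semicolon-delimited value while leaving slashes, dashes, and spaces alone. The function name split_names is hypothetical, and treating a doubled semicolon (;;) as an escaped literal semicolon is an assumption about the convention in use, not something settled in this thread.

```python
def split_names(value: str) -> list[str]:
    """Split a tag value on single semicolons.

    Assumption: ';;' stands for one literal ';' inside a name.
    Slashes, dashes, and spaces are never treated as delimiters.
    """
    parts, current, i = [], [], 0
    while i < len(value):
        if value[i] == ";":
            if i + 1 < len(value) and value[i + 1] == ";":
                current.append(";")  # escaped semicolon stays in the name
                i += 2
                continue
            parts.append("".join(current).strip())  # end of one name
            current = []
        else:
            current.append(value[i])
        i += 1
    parts.append("".join(current).strip())
    return [p for p in parts if p]  # drop empty fragments


print(split_names("Brussel;Bruxelles"))
print(split_names("KFC/Taco Bell"))
```

With this approach a value such as KFC/Taco Bell passes through as one name, and only the rare name containing a real semicolon needs escaping.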

I’m glad to see some experimentation on the osm-carto front. I look forward to some of these experiments being made available for mappers to experience firsthand.

What @imagico has prototyped needn’t be mutually exclusive with what Americana has implemented. It’s entirely possible that a perfect solution in a renderer would require the combination of multiple approaches, but realistically some data consumers will need something in between that and the most basic handling of name.

As always, the devil is in the details. So far this approach relies on some details that hopefully can be clarified by the time it goes into production. Here’s a few things that come to mind:

  • default_language itself was rejected in 2018. There seems to be some enthusiastic support for it in this thread, so I wonder if that vote’s participants were missing important context. It may be an uphill battle to convince editors to support a rejected key.

  • It would be essential to mitigate the risk of vandalism or accidental breakage. Boundary relations break all the time in a manner that is often difficult to fix without local knowledge. To mitigate this risk, the FAQ points out that boundary-based defaults could be subject to a delay, similar to the changeset review process implemented by some data consumers such as Mapbox. I’m intrigued by this possibility but also apprehensive about introducing another process similar to the coastline process that works except when it doesn’t.

  • Since language use does not neatly correspond to administrative boundaries in the real world, this prototype relies on arbitrary features to be tagged with default_language. This is equivalent to another rejected 2017 proposal in all but name.

  • Since a space is typically used as a delimiter in North Africa and Hong Kong, the prototype tries to detect boundaries between different scripts. This is naturally only a rudimentary implementation with room for improvement, but I suspect it would be the most fragile aspect of this approach.

    The Unicode standard comes with an algorithm for script boundary detection as part of text segmentation; however, there are plenty of edge cases in POI and place names that this algorithm would consider to be degenerate. One challenge is that many common characters are script-neutral, such as “15” or “131”. But clearly Latin letters like “E” can also be part of a non-Latin name. The Japanese community has even standardized on a mix of rōmaji, kana, and kanji in POI names, based on the literal contents of signs.

    To be clear, I think it’s OK for a data consumer to develop sophisticated heuristics along these lines. MapLibre GL/Mapbox GL uses similar heuristics to decide whether it’s appropriate to rotate CJK text vertically. But if splitting strings on semicolons is merely a quick fix for “ugly breakage”, to be viewed askance, then what are we to make of text segmentation heuristics that may not even be reliable?
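To illustrate why script-based splitting is fragile, here is a rough sketch of the idea. It is neither the prototype’s code nor the Unicode segmentation algorithm: Python’s standard library does not expose the Unicode Script property, so the first word of each character’s Unicode name is used as a coarse stand-in. The names rough_script and split_on_script_change are hypothetical.

```python
import unicodedata
from typing import Optional


def rough_script(ch: str) -> Optional[str]:
    """Coarse script label for a character, or None if script-neutral."""
    if not ch.isalpha():
        return None  # digits, spaces, and punctuation carry no script
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None  # e.g. LATIN, ARABIC, CJK


def split_on_script_change(name: str) -> list[str]:
    """Split a space-delimited name wherever the script changes."""
    groups: list[list[str]] = []
    current_script = None
    for token in name.split():
        # A token's script is that of its first non-neutral character.
        script = next((s for s in map(rough_script, token) if s), None)
        same_run = script is None or current_script is None or script == current_script
        if groups and same_run:
            groups[-1].append(token)  # neutral or same script: keep together
        else:
            groups.append([token])  # script changed: start a new name
        if script is not None:
            current_script = script
    return [" ".join(g) for g in groups]


print(split_on_script_change("الرباط Rabat"))
print(split_on_script_change("JR 東京駅"))
```

The first call separates the Arabic and Latin names as intended. The second also splits, even though a sign reading JR 東京駅 is a single name, which is exactly the kind of false positive that mixed-script names produce.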

Christoph’s prototype does attempt to address this case too, by requiring the component names to be repeated in alt_name or similar. Unfortunately, none of the *_name keys would apply to the monolingual cases I brought up above, such as the shared dentist’s office in Cupertino, so the prototype would simply hide the label, communicating to the mapper that the POI is mistagged. I’m not optimistic that any proposal to redefine alt_name would ever fly, but there has been a suggestion to introduce a new key for this purpose:

In fairness to Christoph, his prototype is primarily aimed at solving the Unihan/Arabic font selection problem, which is a legitimate goal, and only incidentally about laying out compound monolingual or multilingual labels. I quibble with the notion that one goal is inherently more worthy than the other, and especially with a cold calculation based on population, but every software project is entitled to its own priorities. Certainly it’s good to finally choose a Chinese font for Chinese and a Nastaliq font for Urdu, but I don’t see it as inextricably tied to other labeling issues.

For Americana, selecting a font for the label has been slightly less problematic than for a statically rendered style. A combination of happy accidents results in the style automatically selecting not only the user’s preferred language but also, for CJK only, their preferred font in their preferred language. Consequently, the delimiter issue has been more of a focus, but there’s lots of room for improvement around fonts too.

Firstly, default_language=* may be a fine idea for places where administrative areas have a single primary language, but this falls down completely in numerous places where local language dominance doesn’t follow administrative boundaries. A prime example of this is neighborhoods dominated by ethnic minorities in many of the world’s cities. These neighborhoods may have one or more local languages that appear on signage, place names, and other features, but which still remain a minority in the smallest administrative area that surrounds them. These neighborhoods also do not necessarily have sharp borders that one could draw a polygon around. Minority-owned businesses named in minority languages may intermingle with majority-culture businesses as well as those of yet other minorities. For example, a predominantly Chinese neighborhood in New York City may overlap with an Italian neighborhood, a Korean neighborhood, and a Jewish one, and on the street between them one may see an intermingling of businesses with their primary names in 5 different languages and as many or more scripts. default_language completely fails to accommodate this type of real-world complexity.

The default language could be tagged on a per-object basis. It could either be the standard or be used to override a value from an enclosing polygon.
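A minimal sketch of that lookup order, assuming the feature’s tags and the tags of its enclosing areas (innermost first) are already available as plain dictionaries. The function resolve_default_languages and the way enclosing areas are supplied are hypothetical; finding the enclosing polygons is the hard part and is not shown.

```python
def resolve_default_languages(feature_tags: dict, enclosing_areas: list[dict]) -> list[str]:
    """Return language codes for a feature.

    A default_language tag on the feature itself wins; otherwise the
    innermost enclosing area that carries the tag supplies the value.
    The value is read as a semicolon-separated list of codes.
    """
    for tags in [feature_tags, *enclosing_areas]:
        value = tags.get("default_language", "")
        codes = [c.strip() for c in value.split(";") if c.strip()]
        if codes:
            return codes
    return []  # nothing tagged anywhere: the consumer must fall back


# Hypothetical example: a shop tagged zh inside a city tagged en.
shop = {"name": "Example", "default_language": "zh"}
city = {"default_language": "en"}
print(resolve_default_languages(shop, [city]))
print(resolve_default_languages({"name": "Example"}, [city]))
```

The per-object tag handles the intermingled-neighborhood case, at the cost of tagging many individual features.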

It’s worth remembering that whatever is chosen will need to be acceptable to data consumers who are only looking at the name tag, and no other fields.

The name tag should be deprecated now that name:* tags exist. It just invites arguments.

name:* tags already allow names to be given in whatever languages are needed. The existence of the name tag just creates confusion.

Introducing another tag, something like “local_language”, containing the local language (or languages, separated by semicolons) could provide the needed information about how the name is displayed on local signs.

From the software side, that would be a clean, straightforward, and simple way to tag names in various languages, and on the social side it would remove the arguing and fighting about what should go in the name tag.

The name tag has lost its basic function of showing what is on the ground, since that rule is not applied everywhere and has demonstrably been sidestepped because of various political biases.

Even where there is agreement on how to put multiple languages in the name tag, it is impractical. Using a multilingual name just clutters the map. The map renderer has to know the local rules and parse the content based on those rules.

Deprecating the name tag, using name:* tags, and using a local_language tag would provide a universal way to tag names and render maps, with better control of rendering, regardless of the languages used for naming.

I don’t want to sound discouraging, but deprecating name=* seems like a total non-starter. It is perhaps the oldest, most consistently used (paired) tag in OSM. It is likely supported by every single use case (router, renderer, text-to-speech module…) out there, and all would need to be rewritten or modified if we were to deprecate name=*. I doubt this would happen without great upset to our project.

Modifying the name=* tag, like with name: subtags, well, we’re listening and discussing, as that could actually fly.

This is simply one person’s opinion (mine).

It does not have to be removed. It may be generated, based on other rules that are described in a way that allows generating the content of the name tag.
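As a rough sketch of what such a generation rule could look like: the proposed local_language tag lists language codes separated by semicolons, and name is assembled from the matching name:* values. The function generate_name is hypothetical, and the separator is left as a parameter because the choice of delimiter is the very thing under debate here.

```python
from typing import Optional


def generate_name(tags: dict, separator: str = " / ") -> Optional[str]:
    """Build a display name from name:* tags listed in local_language.

    local_language is the tag proposed in this thread, not an established
    key. Falls back to any existing name tag when nothing can be generated.
    """
    codes = [c.strip() for c in tags.get("local_language", "").split(";") if c.strip()]
    names: list[str] = []
    for code in codes:
        localized = tags.get(f"name:{code}")
        if localized and localized not in names:  # skip gaps and duplicates
            names.append(localized)
    return separator.join(names) if names else tags.get("name")


tags = {"local_language": "nl;fr", "name:nl": "Brussel", "name:fr": "Bruxelles"}
print(generate_name(tags))
```

Consumers that read only name would still get a usable value, while the order and choice of languages would be governed by local_language.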

Hm, that is a helpful clarification, and I didn’t think of your answer quite that way. Thank you!

Welcome to this discussion! As you can see, this discussion has dragged on quite a while and is a bit difficult to follow. Your point about the importance of name:* is well taken, but various posts above have touched on why it’s insufficient to rely on a key that indicates which name:* to display. Here are some links to individual comments as a starting point, to avoid repeating these arguments:

If you’re following me so far, what remains is a debate about whether to insert human-readable punctuation or a machine-readable delimiter between multiple tag values when the values happen to be names (but aren’t alternative names, official names, short names, or the names of destinations, brands, operators, or owners).

From the software side, that would be a clean, straightforward, and simple way to tag names in various languages, and on the social side it would remove the arguing and fighting about what should go in the name tag.

Sure, removing the name tag would stop the arguing about what to put in the tag, but introducing local_language as a new tag would start new arguments about exactly the same topics (which languages to add and in which order).
