Multiple delimited names in the name tag

trigpoint · January 11, 2023, 11:52am

ZeLonewolf:

I would make name the “formatted for display” name (basically what it tends to be today), and introduce a new key called simply names which would be “a semicolon-delimited list of names used locally”.

What about storing in the names key not a list of names, but an unordered list of suffixes to name: from which consumers can choose a single one, or combine several to their liking? In the case of Node: Derry/Londonderry (267762522) | OpenStreetMap that’d make names=en_GB;en_IE.

name:en_gb is not strictly true: many people in GB refer to the city as Derry/Londonderry, not just Londonderry as that tag would suggest. It would not be acceptable to display only one or the other on a map such as osm.org.

Phil (trigpoint)

Richard · January 11, 2023, 12:13pm

From this particular data consumer’s point of view, I would love to have ready-made data for “this name is in this language”. Not just for multiple names, but for all of them… so I knew which voice to use for cycle.travel’s turn-by-turn directions. Right now it uses en-GB everywhere, which results in some entertainingly bad French pronunciation.

But then, greater minds than mine have struggled on that one too. I used to have an iPod Shuffle which used text-to-speech to tell you what the upcoming track was called. Unfortunately it thought Saint Etienne were a French band and was accordingly keen to tell me that the next track was called “Lak a Moteurwey”.

dieterdreist · January 11, 2023, 12:41pm

why would it be in “name”? It would mean retagging all “name” tags. Rather it could be an additional tag so things can be introduced slowly without forcing any changes to current data.

ok I see, it is “names” not “name”

Herrieman · January 11, 2023, 1:36pm

(@dieterdreist, I don’t think you interpreted the post right, or you replied to the wrong post.)

Adam_Franco · January 11, 2023, 3:06pm

I think the core challenge this thread is grappling with is “How can I (a data consumer) determine what the name of a thing is in the local languages?”

We’re mostly focused on delimiters in name=* because that is the current field for “the name in local language”. Other ways of determining this like the suggestion of names=en_GB;en_IE are on-topic for that underlying challenge.
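To make the names=en_GB;en_IE idea concrete, here is a minimal sketch of how a data consumer might resolve it. The names key is only a suggestion from this thread, the function name is made up, and the tag values are illustrative rather than copied from the actual node.

```python
def local_names(tags):
    """Resolve a hypothetical names=* list of suffixes into name:* values."""
    suffixes = [s.strip() for s in tags.get("names", "").split(";") if s.strip()]
    resolved = [tags[f"name:{s}"] for s in suffixes if f"name:{s}" in tags]
    # Fall back to the plain name tag when nothing can be resolved.
    return resolved or ([tags["name"]] if "name" in tags else [])


# Illustrative tags only; the real node may be tagged differently.
tags = {
    "name": "Derry/Londonderry",
    "names": "en_GB;en_IE",
    "name:en_GB": "Londonderry",
    "name:en_IE": "Derry",
}
print(local_names(tags))
```

A consumer that only understands name would keep working unchanged, since the fallback never touches the existing tag.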

Minh_Nguyen · January 11, 2023, 6:05pm

I suspect that “Londonderry/Derry” could be considered a single name at this point. There’s a reason people call it “Stroke City” but don’t call Brussels the equivalent of “Spaced Dash City” in its local languages. (Though it sounds like “Squiggly City” might be a possibility too…) If that means a renderer might label the city redundantly, at least there’s an interesting explanation for it.

Aside from that special case, you’ve hit upon the point that keeps getting missed in this thread: that language is not the only reason for a name or *_name or name:* tag to contain multiple equally valid values. Sometimes we can differentiate with subkeys, but just as with linguistic differences, there still needs to be a name to “split the difference”, so to speak. Other times, there’s really no reasonable way to qualify which name is used in which context.

Name lists in OSM Americana; or: how I learned to love semicolons

There are also cases where the authorities on either side apply their name to both sides, such as this road that runs just inches north of the border between Michigan and Ohio (which once fought a war over the border). To this day, Williams County, Ohio, continues to post street signs on its side of the border calling it County Road T, even though the road lies entirely within Hillsdale County, Michigan, which maintains both sides as Territorial Road. They aren’t being petty: deliverers and emergency responders need to be able to find the addresses on either County Road T or Territorial Road.

Or consider the case I brought up in this openstreetmap-carto issue of a street where two authorities have joint authority over both sides of the street. They disagree about the road name, to the point of posting competing signs up and down the street at regular intervals. Should it go without a name in favor of loc_name and reg_name? If there’s this much outcry about less sophisticated renderers showing semicolons in labels, imagine if the labels went away entirely because of an absent name.

Minh_Nguyen · January 11, 2023, 6:48pm

Incidentally, I’m so glad the Karlsruhe schema doesn’t specify a slash as a value delimiter for house numbers:

github.com/streetcomplete/StreetComplete

House number validator rejects slash-laden house numbers in Vietnam

opened 07:57PM - 12 Mar 22 UTC · closed 06:40AM - 14 Mar 22 UTC · 1ec5 · bug

The address quest requires a relatively rigid format for `addr:housenumber` that… allows a single slash but requires a single digit or single letter to follow it: https://github.com/streetcomplete/StreetComplete/blob/88d71c4a792376d82d5b1863553cf96f551088f7/app/src/main/java/de/westnordost/streetcomplete/quests/address/HousenumberAnswerValidator.kt#L5-L6

This doesn’t work for the vast majority of urban addresses in Vietnam. Vietnam’s urban address format includes house numbers that resemble a POSIX relative file path that starts from a lane off an arterial road and recurses through each alley until you reach the destination. For example, this is a [valid address](https://congan.com.vn/doi-song/song-o-noi-so-nha-dai-hon-so-chung-minh-thu_36586.html):

> Số 1806/127/2/6/15/48/2A Huỳnh Tấn Phát
> Khu phố 6, thị trấn Nhà Bè, huyện Nhà Bè
> Thành phố Hồ Chí Minh

translation:

> Number 1806/127/2/6/15/48/2A, Huỳnh Tấn Phát street
> Ward 6, Nhà Bè town, Nhà Bè district
> Ho Chi Minh City

or more verbosely (but no one would write this):

> Number 2A, alley 48, alley 15, alley 6, alley 2, alley 127, alley 1806, Huỳnh Tấn Phát street
> Ward 6, Nhà Bè town, Nhà Bè district
> Ho Chi Minh City

There’s even a word for the most deeply nested addresses, [_siêu xuyệt_](https://en.wiktionary.org/wiki/siêu_xuyệt#Vietnamese), but 2–4 slashes are not at all uncommon. StreetComplete would be very helpful for surveying addresses in these dense neighborhoods, since it’s very difficult to collect them more systematically. [Sometimes these addresses even get deleted](https://www.openstreetmap.org/changeset/59631625) by Westerners who are unfamiliar with Vietnamese addressing and mistake them for vandalism or botched imports.

/ref https://github.com/openstreetmap/iD/pull/4235#discussion_r133900734

These house numbers are often (but not always) based on the numbered streets leading up to them, some of which have slashes and dashes in their names too:

As in Northern Ireland, Vietnamese people have a special nickname for these millions of addresses and streets: siêu xuyệt, meaning “super-slash”. Data consumers tend to misinterpret these addresses, causing real-world problems for residents.
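For illustration, a validator that tolerates these nested house numbers could look roughly like the sketch below. This is not StreetComplete’s actual validator. The pattern and function names are hypothetical, and the rule that each component is digits plus at most one ASCII letter is a simplifying assumption.

```python
import re

# Assumed shape of one component: digits, optionally followed by one letter,
# e.g. "1806" or "2A". Real-world house numbers may be messier than this.
_COMPONENT = r"\d+[A-Za-z]?"
NESTED_HOUSENUMBER = re.compile(rf"{_COMPONENT}(?:/{_COMPONENT})*")


def is_plausible_housenumber(value):
    """Accept any number of slash-separated components, not just one slash."""
    return NESTED_HOUSENUMBER.fullmatch(value) is not None


def alley_path(value):
    """Split into components: outermost alley first, the house itself last."""
    return value.split("/")


print(is_plausible_housenumber("1806/127/2/6/15/48/2A"))
print(alley_path("1806/127/2/6/15/48/2A"))
```

The point is only that the slash here is part of a single value and must not be treated as a value delimiter.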

SomeoneElse · January 11, 2023, 6:58pm

… on which note see this from 2018 - “… a symbol used only by mathematicians, dictionaries and Unix programmers”

apm-wa · January 12, 2023, 6:12pm

You’ll have to ask the admins @forum-governance to do that. We mods don’t have access to the tools to move messages around.

Kovoschiz · January 13, 2023, 11:22am

Technically it can go further to name:en-GB-u-sd-gbeng (England) and name:en-GB-u-sd-gbnir (Northern Ireland) etc. But in general that’s not correct either, as you explained.

Hungerburg · January 14, 2023, 12:02am

I’d map several POIs there, one for each office. The co-location maybe even has a name of its own?

Minh_Nguyen · January 15, 2023, 5:51am

Your suggestion works in some cases but not others, due to the “One feature, one element” principle. Sometimes I do map multiple POIs, such as for this engineer’s office that has enough desks and employees that I could imagine each one serving a different sewerage district, mapped as separate POIs right next to it.

On the other hand, I can’t stress enough that multiple professionals often share the same office without a distinct name for the office. Two dentists have a shared practice at this single office at unit #3, incidentally signposted in two languages. Their partnership has no publicly visible name of its own, unlike the shared practice next door in unit #4:

I’ve also seen situations where a single desk doubles as multiple distinct businesses. These businesses may have different names and websites, but in the real world, they might exist only as different tabs in a browser window on the computer on a desk in an office.

It’s not just professionals: some retail businesses also combine brands in a manner that makes a multi-POI representation purely arbitrary. This is really just a single restaurant that happens to serve items from two brands:

I can’t tell you whether a slash or dash or space is technically the most appropriate delimiter. But if I order a Taco Bell burrito with a side of KFC mashed potatoes from the restaurant’s single counter, my credit card statement might well say KFCTACOBELL on it.

Richard · January 15, 2023, 8:20am

The key thing when mapping a combination Taco Bell is, of course, to remember the addr:housenumber tag to eliminate confusion.

Hungerburg · January 23, 2023, 8:26pm

Imagico posted two articles on his blog, insightful as always:

From reading there, mappers already do as I proposed, by turning the default_language key into such a list, of course separated by semicolon.

MUCH better than writing out the names in the names key and thereby losing language and script info. Should fix several problems at once:

What remains is how to handle the non-language-based multiples.

stevea · January 24, 2023, 2:16am

There seems to be both progress and confusion here. I think Imagico is on the right track, which is good, though it takes us only part of the way. Part of the confusion seems to arise from the logic of what the default_language key “thinks it can” do: it appears to be doing it, but might not actually be. It seems to work for many cases, but will it work for all of them? That’s a tough problem to solve.

And “multiples” remain problematic. Yet I sense progress, perhaps a step closer.

With Asa’s “suffixes” I think you’d need to invent a new key or subkey.

Adam_Franco · January 24, 2023, 3:03am

I must say that I disagree with Imagico’s analysis.

Firstly, default_language=* may be a fine idea for places where administrative areas have a single primary language, but this falls down completely in numerous places where local language dominance doesn’t follow administrative boundaries. A prime example of this is neighborhoods dominated by ethnic minorities in many of the world’s cities. These neighborhoods may have one or more local languages that appear on signage, place names, and other features, but which still remain a minority in the smallest administrative area that surrounds them. These neighborhoods also do not necessarily have sharp borders that one could draw a polygon around. Minority-owned businesses named in minority languages may intermingle with majority-culture businesses as well as those of yet other minorities. For example, a predominantly Chinese neighborhood in New York City may overlap with an Italian neighborhood, a Korean neighborhood, and a Jewish one and on the street between them one may see an intermingling of businesses with their primary names in 5 different languages and as many or more scripts. default_language completely fails to accommodate this type of real-world complexity.

Next he says:

Isn’t the logic of splitting compound name tags awfully complicated? Wouldn’t it be much easier to just standardize on semicolon as separator?

I don’t think there is a huge difference between supporting one or supporting three separators. The detection of different scripts to separate compound names without separator (as it is common in particular in Africa) is a different matter. But i am pretty sure once there is a viable way to get multilingual name display without an undesired delimiter, the local communities would not be opposed to changing that. For the moment this complexity is there to support all the common multilingual name tagging variants with equal determination.

As discussed previously in this thread, there are numerous valid names that include -, /, spaces, and other similar punctuation in something that is validly a single name. Semicolons simply require fewer cases to be escaped than the other common delimiters.
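As a rough sketch of what semicolon splitting involves on the consumer side: the function below splits a name value on semicolons and leaves slashes, dashes, and spaces alone. It assumes a doubled semicolon (;;) stands for a literal semicolon inside a name. That escape rule is an assumption made for this example, and the function name is made up.

```python
def split_names(value):
    """Split a name value on ';', treating ';;' as an escaped literal ';'."""
    parts, current, i = [], "", 0
    while i < len(value):
        if value.startswith(";;", i):
            current += ";"  # assumed escape: ';;' means a literal semicolon
            i += 2
        elif value[i] == ";":
            parts.append(current.strip())
            current = ""
            i += 1
        else:
            current += value[i]
            i += 1
    parts.append(current.strip())
    return [p for p in parts if p]


print(split_names("County Road T;Territorial Road"))
print(split_names("Derry/Londonderry"))
```

A name like Derry/Londonderry comes through as a single value, which is the case that slash or dash delimiters cannot handle without extra rules.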

Minh_Nguyen · January 24, 2023, 3:09am

I’m glad to see some experimentation on the osm-carto front. I look forward to some of these experiments being made available for mappers to experience firsthand.

What @imagico has prototyped needn’t be mutually exclusive to what Americana has implemented. It’s entirely possible that a perfect solution in a renderer would require the combination of multiple approaches, but realistically some data consumers will need something in between that and the most basic handling of name.

As always, the devil is in the details. So far this approach relies on some details that hopefully can be clarified by the time it goes into production. Here’s a few things that come to mind:

default_language itself was rejected in 2018. There seems to be some enthusiastic support for it in this thread, so I wonder if that vote’s participants were missing important context. It may be an uphill battle to convince editors to support a rejected key.
It would be essential to mitigate the risk of vandalism or accidental breakage. Boundary relations break all the time in a manner that is often difficult to fix without local knowledge. To mitigate this risk, the FAQ points out that boundary-based defaults could be subject to a delay, similar to the changeset review process implemented by some data consumers such as Mapbox. I’m intrigued by this possibility but also apprehensive about introducing another process similar to the coastline process that works except when it doesn’t.
Since language use does not neatly correspond to administrative boundaries in the real world, this prototype relies on arbitrary features to be tagged with default_language. This is equivalent to another rejected 2017 proposal in all but name.
Since a space is typically used as a delimiter in North Africa and Hong Kong, the prototype tries to detect boundaries between different scripts. This is naturally only a rudimentary implementation with room for improvement, but I suspect it would be the most fragile aspect of this approach.

The Unicode standard comes with an algorithm for script boundary detection as part of text segmentation; however, there are plenty of edge cases in POI and place names that this algorithm would consider to be degenerate. One challenge is that many common characters are script-neutral, such as “15” or “131”. But clearly Latin letters like “E” can also be part of a non-Latin name. The Japanese community has even standardized on a mix of rōmaji, kana, and kanji in POI names, based on the literal contents of signs.

To be clear, I think it’s OK for a data consumer to develop sophisticated heuristics along these lines. MapLibre GL/Mapbox GL uses similar heuristics to decide whether it’s appropriate to rotate CJK text vertically. But if splitting strings on semicolons is merely a quick fix for “ugly breakage”, to be viewed askance, then what are we to make of text segmentation heuristics that may not even be reliable?
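To show how fragile this kind of heuristic can be, here is a deliberately crude sketch of splitting on spaces where the script changes. It is neither the prototype’s code nor the Unicode segmentation algorithm. It guesses a script from the first word of each character’s Unicode name, which is a shortcut chosen for this example, and the sample names are illustrative.

```python
import unicodedata


def script_of(ch):
    """Crude script guess: first word of the Unicode character name."""
    if not ch.isalpha():
        return None  # digits, punctuation: treated as script-neutral
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return None  # unnamed character


def split_on_script_change(name):
    """Split at spaces where neighbouring words use different scripts."""
    groups = []
    prev_script = None
    for word in name.split():
        scripts = {script_of(c) for c in word} - {None}
        # Neutral or mixed-script words get no script of their own.
        script = next(iter(scripts)) if len(scripts) == 1 else None
        if groups and (script is None or prev_script is None or script == prev_script):
            groups[-1].append(word)
        else:
            groups.append([word])
        if script is not None:
            prev_script = script
    return [" ".join(g) for g in groups]


print(split_on_script_change("皇后大道 Queen's Road"))
print(split_on_script_change("Cafe 15 咖啡"))
```

Script-neutral words such as “15” simply stick to whatever came before, and a word that mixes scripts is never a split point, so names that deliberately mix scripts would need further special cases.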

Christoph’s prototype does attempt to address this case too, by requiring the component names to be repeated in alt_name or similar. Unfortunately, none of the *_name keys would apply to the monolingual cases I brought up above, such as the shared dentist’s office in Cupertino, so the prototype would simply hide the label, communicating to the mapper that the POI is mistagged. I’m not optimistic that any proposal to redefine alt_name would ever fly, but there has been a suggestion to introduce a new key for this purpose:

Minh_Nguyen · January 24, 2023, 3:31am

In fairness to Christoph, his prototype is primarily aimed at solving the Unihan/Arabic font selection problem, which is a legitimate goal, and only incidentally about laying out compound monolingual or multilingual labels. I quibble with the notion that one goal is inherently more worthy than the other, and especially with a cold calculation based on population, but every software project is entitled to its own priorities. Certainly it’s good to finally choose a Chinese font for Chinese and a Nastaliq font for Urdu, but I don’t see it as inextricably tied to other labeling issues.

For Americana, selecting a font for the label has been slightly less problematic than for a statically rendered style. A combination of happy accidents results in the style automatically selecting not only the user’s preferred language but also, for CJK only, their preferred font in their preferred language. Consequently, the delimiter issue has been more of a focus, but there’s lots of room for improvement around fonts too.

dieterdreist · January 24, 2023, 8:49am

Firstly, default_language=* may be a fine idea for places where administrative areas have a single primary language, but this falls down completely in numerous places where local language dominance doesn’t follow administrative boundaries. A prime example of this is neighborhoods dominated by ethnic minorities in many of the world’s cities. These neighborhoods may have one or more local languages that appear on signage, place names, and other features, but which still remain a minority in the smallest administrative area that surrounds them. These neighborhoods also do not necessarily have sharp borders that one could draw a polygon around. Minority-owned businesses named in minority languages may intermingle with majority-culture businesses as well as those of yet other minorities. For example, a predominantly Chinese neighborhood in New York City may overlap with an Italian neighborhood, a Korean neighborhood, and a Jewish one and on the street between them one may see an intermingling of businesses with their primary names in 5 different languages and as many or more scripts. default_language completely fails to accommodate this type of real-world complexity

the default language could be tagged on a per object basis. It could either be the standard or be used to override a value from an enclosing polygon.
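A minimal sketch of that override logic, assuming the consumer has already worked out which polygons enclose the feature (the spatial lookup is not shown). The function name and the innermost-first ordering are assumptions for this example, and the semicolon-separated value follows the usage mentioned earlier in the thread.

```python
def effective_default_language(feature_tags, enclosing_tags_chain):
    """Return the language codes that apply to a feature.

    enclosing_tags_chain: tags of enclosing polygons, innermost first.
    A default_language on the feature itself wins over any enclosing value.
    """
    for tags in [feature_tags, *enclosing_tags_chain]:
        value = tags.get("default_language")
        if value:
            return [code.strip() for code in value.split(";") if code.strip()]
    return []


city = {"default_language": "en"}
shop = {"name": "Example Shop", "default_language": "zh"}
print(effective_default_language(shop, [city]))
print(effective_default_language({"name": "Other Shop"}, [city]))
```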

pnorman · January 27, 2023, 10:40pm

It’s worth remembering that whatever is chosen will need to be acceptable to data consumers who are only looking at the name tag, and no other fields.