I’m glad to see some experimentation on the osm-carto front. I look forward to some of these experiments being made available for mappers to experience firsthand.
What @imagico has prototyped needn’t be mutually exclusive with what Americana has implemented. It’s entirely possible that a perfect solution in a renderer would require a combination of multiple approaches, but realistically some data consumers will need something in between that and the most basic handling of `name`.
As always, the devil is in the details. So far this approach relies on some details that will hopefully be clarified by the time it goes into production. Here are a few things that come to mind:
- `default_language` itself was rejected in 2018. There seems to be some enthusiastic support for it in this thread, so I wonder whether that vote’s participants were missing important context. It may be an uphill battle to convince editors to support a rejected key.
- It would be essential to mitigate the risk of vandalism or accidental breakage. Boundary relations break all the time in a manner that is often difficult to fix without local knowledge. To mitigate this risk, the FAQ points out that boundary-based defaults could be subject to a delay, similar to the changeset review process implemented by some data consumers such as Mapbox. I’m intrigued by this possibility but also apprehensive about introducing another process similar to the coastline process, which works except when it doesn’t.
- Since language use does not neatly correspond to administrative boundaries in the real world, this prototype relies on arbitrary features being tagged with `default_language`. This is equivalent to another proposal, rejected in 2017, in all but name.
- Since a space is typically used as a delimiter in North Africa and Hong Kong, the prototype tries to detect boundaries between different scripts. This is naturally only a rudimentary implementation with room for improvement, but I suspect it would be the most fragile aspect of this approach.
The Unicode standard comes with an algorithm for script boundary detection as part of text segmentation; however, there are plenty of edge cases in POI and place names that this algorithm would consider to be degenerate. One challenge is that many common characters are script-neutral, such as “15” or “131”. But clearly Latin letters like “E” can also be part of a non-Latin name. The Japanese community has even standardized on a mix of rōmaji, kana, and kanji in POI names, based on the literal contents of signs.
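To illustrate the fragility, here is a rough sketch of my own (not the prototype’s actual logic) of naive script-run splitting. Since the standard library doesn’t expose the Unicode Script property (UAX #24) directly, it approximates a script label from the first word of each character’s Unicode name. Note how a script-neutral “15” is simply absorbed into the surrounding run, while a lone Latin “E” inside a Japanese name would force a spurious split:

```python
import unicodedata

def crude_script(ch):
    """Rough script label for one character, or None if script-neutral.

    Approximation only: the real Unicode Script property (UAX #24) is not
    in the stdlib, so we take the first word of the character's Unicode
    name (e.g. "LATIN", "CYRILLIC", "CJK") and treat non-letters as neutral.
    """
    if not ch.isalpha():
        return None  # digits, spaces, punctuation carry no script
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def script_runs(text):
    """Split text wherever the crude script label changes; neutral
    characters (like "15") are absorbed into the current run."""
    runs = []
    current = ""
    current_script = None
    for ch in text:
        s = crude_script(ch)
        if s is None or current_script is None or s == current_script:
            current += ch
            if current_script is None:
                current_script = s
        else:
            runs.append(current)
            current, current_script = ch, s
    if current:
        runs.append(current)
    return runs
```

Even this toy version shows the problem: `script_runs("E電気")` splits the name in two, even though the “E” is part of a single Japanese name.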
To be clear, I think it’s OK for a data consumer to develop sophisticated heuristics along these lines. MapLibre GL/Mapbox GL uses similar heuristics to decide whether it’s appropriate to rotate CJK text vertically. But if splitting strings on semicolons is merely a quick fix for “ugly breakage”, to be viewed askance, then what are we to make of text segmentation heuristics that may not even be reliable?
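For comparison, the vertical-rotation decision mentioned above can be approximated like this. This is a sketch of the general idea only, not MapLibre GL’s actual code: permit vertical layout only when every letter-like character belongs to a script traditionally set vertically.

```python
import unicodedata

# Unicode-name prefixes for scripts traditionally set vertically.
# (Illustrative only; not MapLibre GL's actual implementation.)
VERTICAL_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA",
                     "KATAKANA-HIRAGANA", "HANGUL", "IDEOGRAPHIC")

def allows_vertical(text):
    """True if every letter-like character is CJK/kana/Hangul, so the
    whole label could plausibly be rotated to vertical writing mode."""
    for ch in text:
        if ch.isalpha():
            if not unicodedata.name(ch, "").startswith(VERTICAL_PREFIXES):
                return False
    return True
```

The point stands either way: a renderer can reasonably apply such heuristics as progressive enhancement, but they are a shakier foundation for deciding what a name fundamentally *means*.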
Christoph’s prototype does attempt to address this case too, by requiring the component names to be repeated in `alt_name` or similar. Unfortunately, none of the `*_name` keys would apply to the monolingual cases I brought up above, such as the shared dentist’s office in Cupertino, so the prototype would simply hide the label, communicating to the mapper that the POI is mistagged. I’m not optimistic that any proposal to redefine `alt_name` would ever fly, but there has been a suggestion to introduce a new key for this purpose: