Multiple delimited names in the name tag

aighes · December 31, 2022, 1:48pm

Here I would be careful, as this might be part of real names (depending on how the mappers wrote it).

IanH · December 31, 2022, 2:13pm

That’s why it is important not to confuse the map renderer with extra characters used only to create a list of names. Individual translations need to be in their respective name:LANG=value tags so the renderer can determine how to display the correct place names.

Minh_Nguyen · December 31, 2022, 4:08pm

If only it were so simple. We’ve already established that both the slash and the hyphen are ambiguous because they aren’t only used as delimiters. Moreover, in some places like Morocco, Hong Kong, and Jerusalem, the delimiter is just a space. Am I expected to replace the semicolon with a space when mapping a bilingual Chinese–English POI elsewhere in the world?

To reiterate, a substring match would be too naïve, especially if a mere space can be a delimiter between names. “Milan (o)”? “Habana (La)”?

But let’s suppose we ignore the pesky African and Asian languages and cater to only the European languages’ delimiters. What is the order of precedence for these features, which are just the tip of the iceberg?

Of course name:* tags are important. However, this discussion presupposes that there are situations where, despite these name:* tags, there’s still a need to place multiple names in name, due to multilingualism, a geopolitical dispute, or some other intentional ambiguity. I agree that delimiters can be confusing, which is why I’m advocating for the one delimiter that causes the least confusion.

It’s not as if this is an unsolved problem. The semicolon already works in a lot of software. However, it doesn’t look pretty in openstreetmap-carto, so mappers are incentivized to preserve the status quo from more than a decade ago when openstreetmap-carto’s labeling represented the state of the art.

SomeoneElse · December 31, 2022, 5:17pm

Unfortunately, until OSM Carto shows that it is capable of working with OSM data as it is now and not how it was 10 years ago, it isn’t going to be able to be part of a sensible discussion going forward. For the most part it’s not a technical restriction; it’s essentially a social one within the project.

stevea · December 31, 2022, 9:49pm

Precisely. Let us (OSM’s “good conscience” going forward) be the sensible discussion. Let Carto be Carto, which might “pick up the ball and run with it” eventually, but for now appears to be a blot on wider “social understanding” of improving this relatively minor, quite solvable problem in our data.

If a single (and the “most popular,” even as it is a sort of “front door” to OSM data) renderer is the source of confusion or dislike of the current toolchain, let’s thank @SomeoneElse for pointing this out, encourage semicolon to be the delimiter of choice (perhaps so it eventually becomes “the standardized” delimiter), know that some are going to “hold their nose” as they see this (incorrectly?) render in Carto, and move ahead with much better data in our map. Renderers will catch up with well-defined / well-structured data. Or, at least, they should.

Tag. Tag well. And (does it need to be said again?!), to the extent you can, don’t tag for renderers!

TheSwavu · January 1, 2023, 5:31am

Yes. In AU and NZ we have dual named geographic features that have “/” as part of the name.

dieterdreist · January 1, 2023, 10:35am

What is the order of precedence for these features, which are just the tip of the iceberg?

Way: ‪Tunnel sous la Manche / Channel Tunnel ‒ Tunnel Ferroviaire Nord / Running Tunnel North‬ (‪143253058‬) | OpenStreetMap

are we sure this is correctly tagged? Seems like two English and French names in the name tag, eventually one has to go in alt_name. Why is English and French “split”? How can it be solved with a semicolon?

Way: ‪Dover/Douvres - Calais‬ (‪209213884‬) | OpenStreetMap

with semicolon, how would you do it here?

Relation: ‪Trentino-Alto Adige/Südtirol‬ (‪45757‬) | OpenStreetMap

also how is this written with semicolons?

Hungerburg · January 1, 2023, 11:27am

The region has three officially sanctioned names: name_it, name_de and name_lld. Lld is too minor a minority, so dropped. So concatenation of name_it;name_de remains.

Fact is more complicated: First, in this case, if user agent is in English locale, name_en is neither name_it nor name_de but a mix of both (it just copies name), so deduplication will fail. Perhaps, because the region actually does not have a proper English name? Much like Bolzano/Bozen, where the value in name_en only says, that in the US/UK/? the city commonly gets referred to by its Italian name, instead of its German name? Unlike e.g. Munich or Vienna, that have an original English name.

Second problem, if the user agent is in e.g. French locale, how to construct an unbiased name from name_it;name_de, and how to know, that it should be based on it/de?

Minh_Nguyen · January 1, 2023, 5:06pm

These would be Dover - Calais;Douvres - Calais and Trentino-Alto Adige;Trentino-Südtirol, respectively, but I’m not necessarily advocating for the use of a semicolon in these cases. Rather, I’m pointing out that the absence of semicolons here makes semicolons necessary in other situations.

These are examples of customary combinations familiar to a local community that can’t easily be derived by concatenating name:en with name:fr or name:it with name:de. As we discussed earlier, name is a fine key for such shorthand, which inevitably includes dashes and slashes. But as long as name is used for this purpose, then the other purpose of displaying a rote list of names must use a different delimiter if a data consumer is to recognize it as a list. Note that ‪Trentino-Alto Adige/Südtirol‬ is one of the regions that, according to the wiki, ostensibly uses a dash as the delimiter between two languages’ names, but here we can see the reality is not as simple.

SomeoneElse · January 1, 2023, 5:24pm

Given that that’s a ferry route between England and France, perhaps use “Dover” (English) for the English end and Calais (French) for the French end? **

** with apologies to whoever wrote that gag for Spitting Image in the 1980s

aighes · January 1, 2023, 6:43pm

This wasn’t exactly my point. Delimiters like “/” or “-” are common in several areas as part of a monolingual name. So asking software to split on them is not possible without a lot of mistakes. If multiple names have to be listed with equal priority (for whatever reason local mappers have), the data definitely needs another delimiter.
In OSM the most common one is “;”, which additionally seems to be uncritical for real-world names.

Listing values in the name tag for equally important names of an object seems to be the consensus, based on actual usage in such areas. So I think this is nothing we need to talk about. But what needs to be discussed is how to make these listed names machine-readable.
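To make the machine-readability point concrete, here is a minimal Python sketch. The function names are made up for illustration and are not taken from any existing data consumer. It assumes a consumer that either treats slash and hyphen as list delimiters or splits only on the semicolon, and runs both over names mentioned in this thread.

```python
import re


def split_naive(name):
    """Hypothetical consumer that treats every slash or hyphen as a list delimiter."""
    return [part.strip() for part in re.split(r"[/-]", name) if part.strip()]


def split_semicolon(name):
    """Hypothetical consumer that treats only the semicolon as a list delimiter."""
    return [part.strip() for part in name.split(";") if part.strip()]


# A customary combined name is torn into three fragments by the naive split.
naive = split_naive("Trentino-Alto Adige/Südtirol")
# ['Trentino', 'Alto Adige', 'Südtirol']

# The same name survives intact when only the semicolon counts as a delimiter.
intact = split_semicolon("Trentino-Alto Adige/Südtirol")
# ['Trentino-Alto Adige/Südtirol']

# A semicolon list keeps the hyphens inside each listed name.
listed = split_semicolon("Dover - Calais;Douvres - Calais")
# ['Dover - Calais', 'Douvres - Calais']
```

The sketch does not decide whether a given name should be a list at all; it only shows that a consumer can tell the two cases apart when the list delimiter is not also ordinary punctuation.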

Hungerburg · January 2, 2023, 12:22am

This prompts me to spell out my intermediate summary of this topic: It is not about displaying names, but it is about labelling stuff in a way to make map users feel at home.

Which directly would lead me to propose a new tag name_label, which is a format string, eg. in the case of Bolzano/Bozen this one might work out fine:

<name_user-agent-locale><consumer-delimiter-global-start><name_it>?<consumer-delimiter-local><?name_de>?<consumer-delimiter-end>

where <> marks placeholders and the question mark means: only shown when name_user-agent-locale=name_it|de, resp. (in the case of the hyphen) not shown if there is a match.

Minh_Nguyen · January 2, 2023, 1:51am

I wrote “labeling” as shorthand, because openstreetmap-carto happens to only label features’ names, which is quite reasonable. It just happens to consume name verbatim, which is unfortunate. Other renderers, and indeed other kinds of data consumers, may have reasons to label things based on other name keys or other non-name-related keys.

For example, many navigation applications ignore a motorway’s name entirely in favor of a route number to avoid longwinded instructions. Or consider that a map may want to append some of its own explanatory text to a road label as an alternative to introducing yet another confusing color or dash pattern for roads. Obviously, glosses such as “(under construction)” and “(closed)” shouldn’t be hard-coded in any name tag in OSM.

Due to this diversity, I don’t think mappers have enough context to explicitly tag what a data consumer should display, only enough to tag what is true about the feature. One of those facts can be the feature’s “native name(s)”.

If I understand your proposal correctly, this name_label tag would clarify what’s in each part of name. This is not far from the language_format key that @imagico informally proposed back in 2017. However, I don’t think either that simpler syntax or your more complex syntax would be worth the effort for just allowing a renderer to append a native name onto a localized name without duplication. Christoph was trying to solve other problems at the same time, such as language-aware font selection.

Any kind of name metadata key would run into the same problem that no one would want to repeat this information on every individual feature, yet regional defaults are both unenforceable (as we’ve seen in South Tyrol) and impractical for data consumers.

In my opinion, the only suggestion so far that adheres to the KISS principle is to adopt the semicolon delimiter more broadly. But even that is beyond my ambitions: I would just like mappers to be able to use the semicolon without fearing criticism for making the map look ugly. This incremental improvement wouldn’t in any way preclude a more comprehensive solution in the future.

dieterdreist · January 2, 2023, 2:56pm

great you are not advocating for the semicolon in this case, we already do have the individual language names for these and the term “Trentino-Alto Adige/Südtirol” is a common form for the region, it is even part of the Italian constitution, Art. 116 La Costituzione - Articolo 116 | Senato della Repubblica

La Regione Trentino-Alto Adige/Südtirol è costituita dalle Province autonome di Trento e di Bolzano.

note how the region name is “bilingual” while the provinces are named only in Italian (in German it would be Trient and Bozen).

Minh_Nguyen · January 2, 2023, 3:46pm

So then we’re in agreement that the vast majority of multilingual name concatenations, which are not specified as such in a constitution, do not strictly need to be invented by mappers using the same punctuation?

ZeLonewolf · January 2, 2023, 5:29pm

Then it sounds like it’s just the local name, rather than multiple delimited names in one field.

In that case, shouldn’t it be the exact same string for the name:it Italian forms?

dieterdreist · January 2, 2023, 11:54pm

I am not mapping in an area with such complications, (so I do not personally bother if “the community” decides to do it one way or the other) but I know from discussions that there can be a lot of tension around language and names in these areas, and that the status quo is the result of years of discussion, so I would rather not touch it.

no, as I wrote, it is “a common form”, not the only one, there is a purely Italian version that doesn’t have any “Südtirol” in it (the Italian alphabet doesn’t have an ü)

Matija_Nalis · January 3, 2023, 5:37pm

Yes. The most common uses, I think, are either:

a local name, or
a combination of multiple names (often in multiple languages), usually separated with some (semi-random) ASCII separator.

What I would advise against is a “new and improved solution” that caters to the latter. It is a bad idea from several standpoints:

for all the data-centric purists, it is an absolutely horrible idea to intentionally break database normalization. It has been shown time and again - in OSM too - that using separate tags is superior to trying to cram multiple values into one tag separated by ; or whatever (e.g. sidewalk:left=yes is a better solution than sidewalk=left, which becomes highly noticeable as the situation gets more complex, e.g. sidewalk:left=separate). And here, name:hr=* + name:it=* (+ something like “default_language=hr - it” on the polygon for the county where that is recommended) would be a better idea than trying to standardize a specific way in which name=hr_name / it_name would be abused, especially as the situation gets more complicated (by e.g. implementing user preferences).
trying to do it the wrong way around is bound to be much more complex, with a huge number of problematic cases. So: it is simple to merge name:hr with name:it, but it is reliably hard to extract separate name:hr and name:it given their mix in name. (Just as it is much easier to mix flour and salt than to extract salt from flour given their mixture; that is why I use the term “wrong way”.)
then there is the separator issue; if one still insists on forcing multiple values into one key (which is frowned upon in most databases), using an ASCII separator like “;” or “/” is probably a bad idea (as it may already be used elsewhere). Although I would highly discourage trying to stuff multiple values into one key, if one were to go that way, it would probably be better to use the dedicated Unicode information separator characters for that purpose.

Maybe the renderer can pull in some external comprehensive source of what’s spoken or signposted in every locality

As noted above, it can be specified per-locality polygon.

but a very tempting simpler alternative is to just use name

it is tempting, and it is a problem. Just as it is tempting currently, when the most popular renderer will just take name and go with it verbatim – because it is simpler to do. The problem is that it is often not what the user wants.

At this point, we don’t even need to know that it’s Italian, just that this seven-character-long name has already appeared earlier in the label.

Oh, I totally agree that this is an unwanted consequence of putting multiple things into one key. It’s just that my suggested solution is not “let’s see how we can better deduplicate the different values put into the name key” but instead “let’s NOT put duplicate information into the name key in the first place”.

Is there an example of an online map (doesn’t have to be OSM-based) that implements such complex fallbacks dynamically based on user preferences alone?

On user preferences alone, no. There should be a reasonable default provided by the map depending on the language. So for my example online desktop map, the first part (a Croatian in Croatia, or a Spaniard in Spain) would remain the same (name / loc_name / official_name / alt_name). The second part (a user in a foreign country) could be approximated by the CLDR matching you mention, and it would be reasonable for a desktop online map (although it should be exposed to the user so they can additionally modify it to their preferences, as e.g. streetcomplete-mapstyle does for styles).

But that online desktop map behaviour might be quite different from the behaviour the same user wants when on the ground (i.e. I might not speak Chinese at all, so it’s unwanted on my desktop, but if I’m in China, I definitely want my mobile app to display Chinese characters too, so I can compare them with traffic signs, for example).

So preferences change even for same user looking at the same part of the map, depending on what that map is being used for.

But you know, OSM Americana is easy to fork – a Croatia-focused style can afford to hard-code some assumptions about its users’ language skills

Sure. But if I’m going to fork&hardcode it, I can go with anything, right? (and I’d probably prefer mobile offline solution then). The advantage of OSM Americana to me seemed exactly that it automatically gets some of the user preferences and caters its display to accommodate them, with reasonable fallbacks.

What I miss in solution like OSM Americana is more user control over preferences (e.g. in example above I might choose not to see official_name or loc_name even when in Croatia (but just name and alt_name if it exists), or in other case using on the ground I might want to see int_name as well as local Chinese name – yeah I know it is called “Americana” for a reason and it is not a feature request there, but I’m trying to show what I personally find good and what I would prefer as direction).

TL;DR: What I thought was interesting in this discussion was the idea of how this catering to user preferences can be improved even further by standardizing on some tags (and maybe one day implemented on osm.org too). It is just that I find standardizing on separator chars in name=* a worse solution (because it seems more prone to misinterpretation, de-normalizes data needlessly, requires a hugely bigger amount of work, etc.) than improving standardization on something like default_language=* or language_format=*.

Minh_Nguyen · January 3, 2023, 8:02pm

First of all, default_language has been around for years. I appreciate the intention, but in practice it’s just for a user’s edification, similar to the “Languages spoken” row in the infobox of a Wikipedia article. As a tool for identifying the languages contained in name, it suffers from multiple flaws:

Language usage doesn’t necessarily adhere to boundaries, as the numerous examples earlier in this thread demonstrate.
As with default access restrictions, it requires geocoding each feature to determine how to render it in the most basic way possible. This is a poor tradeoff compared to inline approaches.
There’s no contingency for when one of the names in name contains the same punctuation character that’s in default_language. In other words, what if there are more hyphens in name than in default_language? What if there are fewer? What if the delimiter is just a space?
Ironically, the syntax for default_language lacks any delimiters around the language codes. When the Brussels boundary relation says fr - nl, you’re left wondering whether the spaces are there because the hyphen is already surrounded by spaces or just to avoid confusion with the dialect of French spoken in the Netherlands. (fr-nl is a valid BCP 47 code.)
Subnational extracts would no longer be self-contained. To accurately render a map of Antwerp, you’d need to download the entire Belgium extract or rely on an external geocoder.
Now someone can really disrupt OSM by changing a country’s default_language to ho - ho - ho!. Not only would one edit vandalize everything at every zoom level in the country, but it would also require inordinate resources to clean up any renderer unlucky enough to have rendered tiles based on this bad data.

As a tool for constructing a comprehensive native name label based on name:* keys, it also suffers from multiple flaws:

What to do when one of the referenced keys is missing?
If default_language=ab/cd (ef) but name:ab is identical to name:ef, should the renderer know to throw out the second half of this tag? What other arbitrary combinations are possible?
OSM XML doesn’t support newlines in tag values, but what if a newline is the best delimiter?
What should default_language be when the delimiter isn’t inherent to all the features contained within the boundary but rather depends on the map designer’s stylistic preference? (Do we agree that this is a legitimate opinion for a designer to have?)

None of these challenges is insurmountable, but I don’t think we should let perfect be the enemy of the good. Why not acknowledge the reality that name has multiple values in it? If default_language is such a good idea, it can contain a semicolon too.

Secondly, I agree in principle that a flat, semicolon-delimited value list isn’t naturally extensible to additional metadata. Anyone using name would have to do so in the most language-agnostic way possible, making no assumptions about what language each name is in. I still think a list of names in name is a useful tool for both renderers and geocoders despite this limitation.

Further enhancements such as automated transliteration or language-aware font selection would benefit from something more structured. However, I don’t think that should necessarily block the more basic need to display native names without duplication or with some additional formatting, especially since lists of native names have been present on the map for years and some users have come to expect them.
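As a rough illustration of that basic need, the following Python sketch appends native names to a localized name and drops exact duplicates. Everything in it is an assumption made for the example: the function name, the label layout with parentheses, and the tag values, which are illustrative rather than copied from the live database.

```python
def build_label(tags, locale):
    """Combine the localized name with the native names listed in `name`.

    Hypothetical helper: assumes `name` is a semicolon-delimited list and that
    an exact string match is enough to detect a duplicate.
    """
    localized = tags.get(f"name:{locale}")
    native = [n.strip() for n in tags.get("name", "").split(";") if n.strip()]
    if not localized:
        # No name in the user's language: fall back to the native names.
        return ", ".join(native)
    extras = [n for n in native if n != localized]
    if not extras:
        return localized
    return f"{localized} ({', '.join(extras)})"


# Illustrative tags for a bilingual place, written as a semicolon list.
tags_list = {"name": "Bolzano;Bozen", "name:it": "Bolzano", "name:de": "Bozen"}
label_de = build_label(tags_list, "de")   # 'Bozen (Bolzano)'
label_it = build_label(tags_list, "it")   # 'Bolzano (Bozen)'

# The same place with a dash inside `name`: the duplicate cannot be detected.
tags_dash = {"name": "Bolzano - Bozen", "name:de": "Bozen"}
label_dup = build_label(tags_dash, "de")  # 'Bozen (Bolzano - Bozen)'
```

The function never needs to know which language each native name is in, only whether a string has already appeared in the label.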

There’s no shortage of tagging schemes that rely on the semicolon delimiter, yet apparently the lack of formal arrays and dictionaries in our data model hasn’t stopped data consumers from using these schemes effectively.

I disagree that we need to use a hidden control character to separate values in a tag. Unicode includes these control characters for situations where a single character is absolutely necessary and the string is guaranteed to be processed before display to a user, something no browser does by itself. Given the uphill battle for acceptance of the semicolon, I think the prospects for a hidden control character are basically nil. Besides, there’s already a ;; escape sequence for the cases where an individual value in a list itself contains a semicolon. It’s only been used on a handful of features, and all of them look like typos except for this café:

Node: ‪;;Coffee‬ (‪9725500420‬) | OpenStreetMap

Compare that one example to all the names with dashes, slashes, or spaces in them.
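For what it’s worth, honoring that escape sequence takes only a few lines. This is a minimal Python sketch, assuming that ;; decodes to one literal semicolon and that pairs of semicolons are consumed left to right; the function name is made up for the example.

```python
def split_values(value):
    """Split a tag value on single semicolons, decoding ';;' as a literal ';'.

    Hypothetical helper: pairs of semicolons are consumed left to right, so an
    odd run such as ';;;' reads as one literal semicolon followed by a delimiter.
    """
    values, current = [], []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == ";":
            if i + 1 < len(value) and value[i + 1] == ";":
                current.append(";")  # escaped semicolon stays in the value
                i += 2
                continue
            values.append("".join(current).strip())  # single ';' ends a value
            current = []
        else:
            current.append(ch)
        i += 1
    values.append("".join(current).strip())
    return [v for v in values if v]


plain = split_values("Bolzano;Bozen")  # ['Bolzano', 'Bozen']
escaped = split_values(";;Coffee")     # [';Coffee']
mixed = split_values("A;;B;C")         # ['A;B', 'C']
```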

Matija_Nalis · January 4, 2023, 1:57am

I absolutely agree it suffers from multiple flaws in its current state (even its own wiki says so!). That’s why I said “improve on a solution like default_language=* / language_format=*” and not “use it verbatim in its current state”. For starters, it should have clearly delineated variables (e.g. something more along the lines of “${name:hr} - ${name:sr-Latn}”).

Some are easily fixable by having better syntax, some have counterparts that are already handled (e.g. coastlines), some are inherent but manageable (e.g. an Antwerp extract would either have to duplicate default_language for itself, or have the extract generator add it automatically, or store it separately somewhere, or simply have the user define their own preferred rendering, which may be the same as or different from the “official” one), and some are actually easier to handle than the alternatives (e.g. the vandalism case requires fixing just one tag, instead of many thousands of potentially modified objects for which a clean revert is problematic if the user just blatantly replaced multiple name tags and those objects changed afterwards).

None of those problems seem insurmountable to me; but yes, they would need extra discussion if people are interested in such a more versatile solution, which is why I suggested it for consideration (if no one is interested, I certainly do not intend to start a one-man crusade over it).

What to do when one of the referenced keys is missing?

Whatever we want, eh? The simplest solution, “just substitute a null string”, is admittedly not very nice, but even some basic handling (remove the trailing fixed characters before a null variable) produces much nicer results. Or, if needed, one can go for more advanced approaches (e.g. POSIX shell variable expansion, a ternary conditional operator, etc.) or even hardcode some rules. But I’d personally prefer to keep it somewhat simple.
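A minimal Python sketch of that “basic handling”, assuming the ${key} placeholder syntax floated above. The function and its rule are made up for illustration and are not part of any existing tag definition: literal text between two placeholders counts as a separator and is emitted only between two values that are actually present.

```python
import re

# Matches ${key}; the key may contain colons and hyphens, e.g. name:sr-Latn.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand(template, tags):
    """Expand ${key} placeholders, skipping separators next to missing keys."""
    # With one capture group, split() alternates literal text and key names:
    # [literal, key, literal, key, ..., literal]
    parts = PLACEHOLDER.split(template)
    if len(parts) == 1:
        return template  # no placeholders at all
    out = []
    emitted_value = False
    for i in range(1, len(parts), 2):
        value = tags.get(parts[i], "")
        if not value:
            continue  # missing key: drop it together with its separator
        if emitted_value:
            out.append(parts[i - 1])  # separator preceding this placeholder
        out.append(value)
        emitted_value = True
    # Leading and trailing literal text is kept verbatim.
    return parts[0] + "".join(out) + parts[-1]


template = "${name:hr} - ${name:it}"
both = expand(template, {"name:hr": "Rijeka", "name:it": "Fiume"})  # 'Rijeka - Fiume'
only_hr = expand(template, {"name:hr": "Rijeka"})                   # 'Rijeka'
only_it = expand(template, {"name:it": "Fiume"})                    # 'Fiume'
```

Bracketing text such as parentheses around an optional name would need a richer rule than this one, which is where the more advanced expansion styles mentioned above would come in.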

OSM XML doesn’t support newlines in tag values, but what if a newline is the best delimiter?

The same as you would do in the name=* case: you’d have to represent it somehow (the common syntax is usually the ASCII sequence \n, but one could use UTF8 shenanigans instead). But it would be much simpler to have a few experienced users add it once for the whole country/locality in a more powerful/customizable editor than to depend on zillions of users on the ground, with their different apps, to all support it correctly, and to have all those zillions of users educated to use it correctly when mapping every single name.

What should default_language be when the delimiter isn’t inherent to all the features contained within the boundary but rather depends on the map designer’s stylistic preference? (Do we agree that this is a legitimate opinion for a designer to have?)

Absolutely; in fact, giving that freedom to map designers is one of my main goals behind that idea (and even more than that: I think every user should have the possibility to become a simple map designer by tweaking the rendering profile to their needs if they so choose. Sure, the majority won’t, but they should be able to). So, instead of having name=aaa / bbb that some random mapper on the ground has chosen as “best” and having the map designer be at their mercy, the map designer would be the one with the power to choose what to render. They could take that default_language and use it verbatim (e.g. similar to the current osm.org map), or they may decide to take that “-” inside default_language (or “/” or “;” …) and replace it with a newline or a picture of a red star or whatever. Or they might extend the provided default_language with a newline and int_name. Or they may decide to disregard default_language altogether and render names for the whole world in Croatian only (or whatever). Or they can have a simple (or complex) set of user preferences they want to follow, depending on the user’s will. IOW, map designers should be able to decide whatever they feel is best for their specific use case!

Why not acknowledge the reality that name has multiple values in it?

I do acknowledge it is the reality (what I do not particularly accept is the claim that we should encourage users to do more of it, instead of less of it). In fact, that reality is actually one of the main reasons I think it is a very uphill battle trying to convince every mapper in the world to map names in some specific way. It would IMHO be much simpler (and more realistically doable) to hand-craft and curate a few hundred (or thousand) default_language tags than to try to handle every single name=* tag in existence and their additions/changes (even with such an effort supplemented by very good AI bots).

If default_language is such a good idea, it can contain a semicolon too.

Sure, it can, if some locality really likes such separation of names with literal semicolon “;” characters (hey, I’m not judging!)

My experience is somewhat different about that “effectively”. Sure, it is possible to support them, and some have done it (to some extent at least), but the majority I’ve seen do not seem to really handle them (and, even more importantly, it is often impossible to even define them that way in an ideal or even useful way, even with all the goodwill of data consumers at one’s disposal, as mentioned before with the sidewalk example). (Additionally, parsing it sucks at efficiency: if I want to find every element that offers Croatian cuisine, I have to do an extremely inefficient fulltext search on cuisine=* tags to find results containing the croatian substring – if it were instead tagged as cuisine:croatian=yes, it would be a very fast indexed find.)
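To show what the substring search gets wrong, here is a small Python sketch comparing it with a match on the individual values of a semicolon list. The helper names and the cuisine values are illustrative assumptions. The sketch only covers the correctness of the match, not the indexing cost raised above.

```python
def matches_substring(tag_value, wanted):
    """Fulltext-style check: true whenever `wanted` occurs anywhere in the value."""
    return wanted in tag_value


def matches_token(tag_value, wanted):
    """List-aware check: true only when `wanted` is one of the listed values."""
    return wanted in [v.strip() for v in tag_value.split(";")]


# Illustrative cuisine=* value where the wanted word is only part of another value.
false_hit = matches_substring("latin_american;pizza", "american")  # True
no_hit = matches_token("latin_american;pizza", "american")         # False
real_hit = matches_token("american;burger", "american")            # True
```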