Multiple delimited names in the name tag

I wrote “labeling” as shorthand, because openstreetmap-carto happens to only label features’ names, which is quite reasonable. It just happens to consume name verbatim, which is unfortunate. Other renderers, and indeed other kinds of data consumers, may have reasons to label things based on other name keys or other non-name-related keys.

For example, many navigation applications ignore a motorway’s name entirely in favor of a route number to avoid longwinded instructions. Or consider that a map may want to append some of its own explanatory text to a road label as an alternative to introducing yet another confusing color or dash pattern for roads. Obviously, glosses such as “(under construction)” and “(closed)” shouldn’t be hard-coded in any name tag in OSM.

Due to this diversity, I don’t think mappers have enough context to explicitly tag what a data consumer should display, only enough to tag what is true about the feature. One of those facts can be the feature’s “native name(s)”.

If I understand your proposal correctly, this name_label tag would clarify what’s in each part of name. This is not far from the language_format key that @imagico informally proposed back in 2017. However, I don’t think either that simpler syntax or your more complex syntax would be worth the effort for just allowing a renderer to append a native name onto a localized name without duplication. Christoph was trying to solve other problems at the same time, such as language-aware font selection.

Any kind of name metadata key would run into the same problem that no one would want to repeat this information on every individual feature, yet regional defaults are both unenforceable (as we’ve seen in South Tyrol) and impractical for data consumers.

In my opinion, the only suggestion so far that adheres to the KISS principle is to adopt the semicolon delimiter more broadly. But even that is beyond my ambitions: I would just like mappers to be able to use the semicolon without fearing criticism for making the map look ugly. This incremental improvement wouldn’t in any way preclude a more comprehensive solution in the future.

Great that you are not advocating for the semicolon in this case. We already have the individual language names for these, and the term “Trentino-Alto Adige/Südtirol” is a common form for the region; it is even part of the Italian constitution, Art. 116 (La Costituzione - Articolo 116 | Senato della Repubblica):

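La Regione Trentino-Alto Adige/Südtirol è costituita dalle Province autonome di Trento e di Bolzano. (“The Region Trentino-Alto Adige/Südtirol is composed of the autonomous Provinces of Trento and Bolzano.”)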

Note how the region name is “bilingual” while the provinces are named only in Italian (in German they would be Trient and Bozen).

So then we’re in agreement that the vast majority of multilingual name concatenations, which are not specified as such in a constitution, do not strictly need to be invented by mappers using the same punctuation?

Then it sounds like it’s just the local name, rather than multiple delimited names in one field.

In that case, shouldn’t it be the exact same string for the name:it Italian forms?

I am not mapping in an area with such complications (so I personally do not mind whether “the community” decides to do it one way or the other), but I know from discussions that there can be a lot of tension around language and names in these areas, and that the status quo is the result of years of discussion, so I would rather not touch it.

No, as I wrote, it is “a common form”, not the only one. There is a purely Italian version that doesn’t have any “Südtirol” in it (the Italian alphabet doesn’t have an ü).

Yes. I think the most common uses are either:

  • a local name, or
  • a combination of multiple names (often in multiple languages), usually separated by some (semi-random) ASCII separator.

What I would advise against is a “new and improved solution” that caters to the latter. It is a bad idea from several standpoints:

  • For data-centric purists, it is an absolutely horrible idea to intentionally break database normalization. It has been shown time and again, in OSM too, that using separate tags is superior to trying to cram multiple values into one tag separated by ; or whatever (e.g. sidewalk:left=yes is a better solution than sidewalk=left, which becomes highly noticeable as the situation gets more complex, e.g. sidewalk:left=separate). And here, name:hr=* + name:it=* (plus something like “default_language=hr - it” on the polygon for the county where that is recommended) would be a better idea than trying to standardize a specific way in which name=hr_name / it_name would be abused, especially as the situation gets more complicated (e.g. by implementing user preferences).
  • Trying to do it the wrong way around is bound to be much more complex, with a huge number of problematic cases. It is simple to merge name:hr with name:it; it is reliably hard to extract separate name:hr and name:it from their mix in name. (It is likewise much easier to mix flour and salt than to extract salt from the mixture; that is why I use the term “wrong way”.)
  • Then there is the separator issue. If one still insists on forcing multiple values into one key (which is frowned upon in most databases), using an ASCII separator like “;” or “/” is probably a bad idea, as it may already be used elsewhere. Although I would highly discourage trying to stuff multiple values into one key, if one were to go that way, it would probably be better to use the dedicated Unicode information separator characters for that purpose.
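
To illustrate the asymmetry described in the second bullet, here is a minimal Python sketch. The function names are hypothetical and the tag values are chosen only for illustration; this is not how any particular renderer works. Merging separate name:* tags is mechanical, while splitting a combined name on a guessed separator loses the language of each part and can cut a single name in two.

```python
def merge_names(tags, languages, separator=" / "):
    """Build a combined label from separate name:<lang> tags (the easy direction)."""
    parts = []
    for lang in languages:
        value = tags.get(f"name:{lang}")
        # Skip missing languages and values that are already in the label.
        if value and value not in parts:
            parts.append(value)
    return separator.join(parts)


def split_name_naively(name, separator="/"):
    """Guess the individual names by splitting on a separator (the hard direction)."""
    return [part.strip() for part in name.split(separator)]


# Illustrative tag values only.
tags = {"name:hr": "Rijeka", "name:it": "Fiume"}
print(merge_names(tags, ["hr", "it"]))  # Rijeka / Fiume

# Splitting recovers two strings, but not which language each one is in.
print(split_name_naively("Rijeka / Fiume"))  # ['Rijeka', 'Fiume']

# Splitting also breaks a form that is commonly written as one name.
print(split_name_naively("Trentino-Alto Adige/Südtirol"))  # ['Trentino-Alto Adige', 'Südtirol']
```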

Maybe the renderer can pull in some external comprehensive source of what’s spoken or signposted in every locality

As noted above, it can be specified per-locality polygon.

but a very tempting simpler alternative is to just use name

It is tempting, and it is a problem. Just as it is tempting currently, when the most popular renderer just takes name and goes with it verbatim, because that is simpler to do. The problem is that it is often not what the user wants.

At this point, we don’t even need to know that it’s Italian, just that this seven-character-long name has already appeared earlier in the label.

Oh, I totally agree that this is an unwanted consequence of putting multiple things into one key. It’s just that my suggested solution is not “let’s see how we can better deduplicate the different values put into the name key” but instead “let’s NOT put duplicate information into the name key in the first place”.

Is there an example of an online map (doesn’t have to be OSM-based) that implements such complex fallbacks dynamically based on user preferences alone?

On user preferences alone, no. There should be a reasonable default provided by the map depending on the language. So, for example, on my online desktop map, the first part (a Croatian in Croatia, or a Spaniard in Spain) would remain the same (name / loc_name / official_name / alt_name). The second part (a user in a foreign country) could be approximated by the CLDR matching you mention, and that would be reasonable for a desktop online map (although it should be exposed to the user so they can further modify it to their preferences, as e.g. streetcomplete-mapstyle does for styles).

But that online desktop map behaviour might be quite different from the behaviour the same user wants when on the ground (i.e. I might not speak Chinese at all, so it’s unwanted on my desktop, but if I’m in China, I definitely want my mobile app to display Chinese characters too, so I can compare them with traffic signs, for example).

So preferences change even for the same user looking at the same part of the map, depending on what that map is being used for.

But you know, OSM Americana is easy to fork – a Croatia-focused style can afford to hard-code some assumptions about its users’ language skills

Sure. But if I’m going to fork&hardcode it, I can go with anything, right? (and I’d probably prefer mobile offline solution then). The advantage of OSM Americana to me seemed exactly that it automatically gets some of the user preferences and caters its display to accommodate them, with reasonable fallbacks.

What I miss in solution like OSM Americana is more user control over preferences (e.g. in example above I might choose not to see official_name or loc_name even when in Croatia (but just name and alt_name if it exists), or in other case using on the ground I might want to see int_name as well as local Chinese name – yeah I know it is called “Americana” for a reason :smile: and it is not a feature request there, but I’m trying to show what I personally find good and what I would prefer as direction).


TL;DR: What I thought was interesting in this discussion was the idea of how this catering to user preferences can be improved even further by standardizing on some tags (and maybe one day implemented on osm.org too). It is just that I find standardizing on separator characters in name=* a worse solution (because it seems more prone to misinterpretation, de-normalizes data needlessly, requires a hugely bigger amount of work, etc.) than improving standardization on something like default_language=* or language_format=*.

First of all, default_language has been around for years. I appreciate the intention, but in practice it’s just for a user’s edification, similar to the “Languages spoken” row in the infobox of a Wikipedia article. As a tool for identifying the languages contained in name, it suffers from multiple flaws:

  • Language usage doesn’t necessarily adhere to boundaries, as the numerous examples earlier in this thread demonstrate.
  • As with default access restrictions, it requires geocoding each feature to determine how to render it in the most basic way possible. This is a poor tradeoff compared to inline approaches.
  • There’s no contingency for when one of the names in name contains the same punctuation character that’s in default_language. In other words, what if there are more hyphens in name than in default_language? What if there are fewer? What if the delimiter is just a space?
  • Ironically, the syntax for default_language lacks any delimiters around the language codes. When the Brussels boundary relation says fr - nl, you’re left wondering whether the spaces are there because the hyphen is already surrounded by spaces or just to avoid confusion with the dialect of French spoken in the Netherlands. (fr-nl is a valid BCP 47 code.)
  • Subnational extracts would no longer be self-contained. To accurately render a map of Antwerp, you’d need to download the entire Belgium extract or rely on an external geocoder.
  • Now someone can really disrupt OSM by changing a country’s default_language to ho - ho - ho!. Not only would one edit vandalize everything at every zoom level in the country, but it would also require inordinate resources to clean up any renderer unlucky enough to have rendered tiles based on this bad data.

As a tool for constructing a comprehensive native name label based on name:* keys, it also suffers from multiple flaws:

  • What to do when one of the referenced keys is missing?
  • If default_language=ab/cd (ef) but name:ab is identical to name:ef, should the renderer know to throw out the second half of this tag? What other arbitrary combinations are possible?
  • OSM XML doesn’t support newlines in tag values, but what if a newline is the best delimiter?
  • What should default_language be when the delimiter isn’t inherent to all the features contained within the boundary but rather depends on the map designer’s stylistic preference? (Do we agree that this is a legitimate opinion for a designer to have?)

None of these challenges is insurmountable, but I don’t think we should let perfect be the enemy of the good. Why not acknowledge the reality that name has multiple values in it? If default_language is such a good idea, it can contain a semicolon too.

Secondly, I agree in principle that a flat, semicolon-delimited value list isn’t naturally extensible to additional metadata. Anyone using name would have to do so in the most language-agnostic way possible, making no assumptions about what language each name is in. I still think a list of names in name is a useful tool for both renderers and geocoders despite this limitation.

Further enhancements such as automated transliteration or language-aware font selection would benefit from something more structured. However, I don’t think that should necessarily block the more basic need to display native names without duplication or with some additional formatting, especially since lists of native names have been present on the map for years and some users have come to expect them.
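
As a rough illustration of “displaying native names without duplication”, here is a minimal Python sketch. The function name and the sample values are assumptions for illustration, not a description of how any existing renderer does it. It treats name as a flat, language-agnostic list and drops any entry that already appears in the label.

```python
def build_label_lines(localized_name, name):
    """Return label lines: the localized name first, then any native names
    from the semicolon-delimited name value that are not already shown."""
    lines = [localized_name] if localized_name else []
    for native in name.split(";"):
        native = native.strip()
        # No language knowledge is needed, only an exact string comparison.
        if native and native not in lines:
            lines.append(native)
    return lines


# Illustrative values only.
print(build_label_lines("Brussels", "Bruxelles;Brussel"))   # ['Brussels', 'Bruxelles', 'Brussel']
print(build_label_lines("Bruxelles", "Bruxelles;Brussel"))  # ['Bruxelles', 'Brussel']
```

How the resulting lines are joined (newline, bullet, parentheses) is left to the map designer in this sketch.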

There’s no shortage of tagging schemes that rely on the semicolon delimiter, yet apparently the lack of formal arrays and dictionaries in our data model hasn’t stopped data consumers from using these schemes effectively.

I disagree that we need to use a hidden control character to separate values in a tag. Unicode includes these control characters for situations where a single character is absolutely necessary and the string is guaranteed to be processed before display to a user, something no browser does by itself. Given the uphill battle for acceptance of the semicolon, I think the prospects for a hidden control character are basically nil. Besides, there’s already a ;; escape sequence for the cases where an individual value in a list itself contains a semicolon. It’s only been used on a handful of features, and all of them look like typos except for this café:

Compare that one example to all the names with dashes, slashes, or spaces in them.
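
For reference, splitting a value list while honoring the ;; escape sequence takes only a few lines. The following Python sketch uses a hypothetical function name and made-up sample values; it assumes a doubled semicolon always stands for one literal semicolon and reads the string greedily from left to right.

```python
def split_semicolon_list(value):
    """Split a tag value on ';', treating ';;' as an escaped literal semicolon."""
    values = []
    current = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == ";":
            if i + 1 < len(value) and value[i + 1] == ";":
                # Escaped semicolon: keep one ';' as part of the current value.
                current.append(";")
                i += 2
                continue
            # Unescaped semicolon: the current value ends here.
            values.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    values.append("".join(current).strip())
    # Drop empty entries caused by leading, trailing, or repeated delimiters.
    return [v for v in values if v]


# Made-up sample values.
print(split_semicolon_list("Bruxelles;Brussel"))  # ['Bruxelles', 'Brussel']
print(split_semicolon_list("A;;B;C"))             # ['A;B', 'C']
```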

I absolutely agree it suffers from multiple flaws in its current state (even its own wiki says so!). That’s why I said to improve on a solution like default_language=* / language_format=* and not “use it verbatim in its current state”. For starters, it should have clearly delineated variables (e.g. something more along the lines of “${name:hr} - ${name:sr-Latn}”).

Some are easily fixable with better syntax, some already have handled counterparts (e.g. coastlines), some are inherent but manageable (e.g. an Antwerp extract would either have to duplicate default_language for itself, or have the extract generator add it automatically, or store it separately somewhere, or simply have the user define their own preferred rendering, which may be the same as or different from the “official” one), and some are actually easier to deal with than the alternatives (e.g. the vandalism case requires fixing just one tag, instead of many thousands of potentially modified objects for which a clean revert is problematic if the user blatantly replaced multiple name tags and those objects changed afterwards).

None of those problems seem insurmountable to me; but yes, they would need extra discussion if people are interested in such a more versatile solution, which is why I suggested it for consideration (if no one is interested, I certainly do not intend to start a one-man crusade over it :smile:).

  • What to do when one of the referenced keys is missing?

Whatever we want, eh? The simplest solution, “just substitute a null string”, is admittedly not very nice, but even some basic handling (remove the fixed characters preceding a null variable) produces much nicer results. Or one can go more advanced ways if needed (e.g. POSIX shell variable expansion, ternary conditional operator, etc.) or even hardcode some rules. But I’d personally prefer to keep it somewhat simple.
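
The “basic handling” could look roughly like the following Python sketch. Everything here is an assumption for illustration: the ${key} placeholder syntax follows the example given earlier, the function name is hypothetical, and the sample values are placeholder words rather than real place names. The text in front of a placeholder is treated as its separator and is dropped together with the placeholder when the referenced key is missing or would repeat a value already shown. Text before the first or after the last placeholder is ignored by this sketch.

```python
import re

# Matches ${key}, e.g. ${name:hr}
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def render_label(template, tags):
    """Fill a label template from tags, skipping missing or repeated values
    along with the fixed separator text that precedes them."""
    out = []
    shown = set()
    pos = 0
    for match in PLACEHOLDER.finditer(template):
        separator = template[pos:match.start()]
        pos = match.end()
        value = tags.get(match.group(1))
        if not value or value in shown:
            continue  # drop this placeholder and the separator before it
        if out:
            out.append(separator)  # only separate from a value already emitted
        out.append(value)
        shown.add(value)
    return "".join(out)


template = "${name:hr} - ${name:sr-Latn}"
# Placeholder words, not real place names.
print(render_label(template, {"name:hr": "Primjer", "name:sr-Latn": "Primer"}))   # Primjer - Primer
print(render_label(template, {"name:hr": "Primjer"}))                              # Primjer
print(render_label(template, {"name:hr": "Primjer", "name:sr-Latn": "Primjer"}))  # Primjer
```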

  • OSM XML doesn’t support newlines in tag values, but what if a newline is the best delimiter?

The same as you would do in the name=* case: you’d have to represent it somehow (the common syntax is usually the ASCII sequence \n, but one could use UTF-8 shenanigans instead). But it would be much simpler to have it added once by a few experienced users in a more powerful/customizable editor for the whole country/locality than to depend on zillions of users on the ground with their different apps all supporting it correctly, and to have all those zillions of users educated to use it correctly when mapping every single name.

  • What should default_language be when the delimiter isn’t inherent to all the features contained within the boundary but rather depends on the map designer’s stylistic preference? (Do we agree that this is a legitimate opinion for a designer to have?)

Absolutely. In fact, giving that freedom to map designers is one of my main goals behind that idea (and even more than that: I think every user should have the possibility to become a simple map designer by tweaking the rendering profile to their needs, if they so choose. Sure, the majority won’t, but they should be able to). So, instead of having name=aaa / bbb that some random mapper on the ground has chosen as “best” and leaving the map designer at their mercy, the map designer would be the one with the power to choose what to render. They could take that default_language and use it verbatim (e.g. similar to the current osm.org map), or they may decide to replace the “-” inside default_language (or “/” or “;” …) with a newline or a picture of a red star or whatever. Or they might extend the provided default_language with a newline and int_name. Or they may decide to disregard default_language altogether and render names for the whole world in Croatian only (or whatever). Or they can have a simple (or complex) set of user preferences they want to follow, depending on the user’s wishes. IOW, map designers should be able to decide whatever they feel is best for their specific use case!

Why not acknowledge the reality that name has multiple values in it?

I do acknowledge it is the reality (what I do not particularly accept is the claim that we should encourage users to do more of it, instead of less of it). In fact, that reality is one of the main reasons I think it is a very uphill battle trying to convince every mapper in the world to map names in some specific way. It would IMHO be much simpler (and more realistically doable) to hand-craft and curate a few hundred (or thousand) default_language tags than to try to handle every single name=* tag in existence and their additions/changes (even with such an effort supplemented by very good AI bots).

If default_language is such a good idea, it can contain a semicolon too.

Sure, it can, if some locality really likes such separation of names with literal semicolon “;” characters (hey, I’m not judging!)

My experience with that “effectively” is somewhat different. Sure, it is possible to support them, and some have done it (to some extent at least), but the majority I’ve seen do not seem to really handle them. Even more importantly, it is often impossible to even define them that way in an ideal or even useful way, even with all the goodwill of data consumers at one’s disposal, as mentioned before with the sidewalk example. Additionally, parsing it is inefficient: if I want to find every element that offers Croatian cuisine, I have to do an extremely inefficient full-text search on cuisine=* tags to find results containing the croatian substring; if it were instead tagged as cuisine:croatian=yes, it would be a very fast indexed find.
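
To make the lookup tradeoff concrete, here is a small Python sketch with made-up features and a hypothetical function name. It shows one way a data consumer could pay the parsing cost once, at import time, by splitting the semicolon-delimited values into an index; whether that extra step is acceptable is exactly the point under debate here.

```python
from collections import defaultdict


def build_value_index(features, key):
    """Map each individual value of a semicolon-delimited tag to the set of
    feature IDs that carry it."""
    index = defaultdict(set)
    for feature_id, tags in features.items():
        for value in tags.get(key, "").split(";"):
            value = value.strip()
            if value:
                index[value].add(feature_id)
    return index


# Made-up features keyed by an arbitrary ID.
features = {
    1: {"cuisine": "croatian;pizza"},
    2: {"cuisine": "italian"},
    3: {"cuisine": "seafood; croatian"},
}
cuisine_index = build_value_index(features, "cuisine")
print(sorted(cuisine_index["croatian"]))  # [1, 3]
```

With a key such as cuisine:croatian=yes, the same question becomes a plain key lookup with no splitting step.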

It would be nice if this topic could stay focused on multiple delimited names in the name tag and not get sidetracked into a discussion of language defaults for areas. Although the ideas are related, this topic is already quite long. Perhaps one of the @mods-general could split off the messages about language defaults into a separate linked thread?

I’m not hard against, but do note that many existing posts are quite intertwined with parts about disadvantages of “multiple delimited names in the name tag” as well as suggestions to modify/improve them (which should IMHO definitely remain in this thread), as well as suggestions for alternative ways to accomplish similar result (which might indeed benefit from being in new thread).

Perhaps new replies at least should each be split in two different messages? (one in new thread commenting on parts of messages related to default_language-alike methods, and one message in this thread commenting only on the name-alike method. Although I do envision it would be hard to keep such messages usefully crossreferenced, if one tries to compare their pro and contra. :slightly_frowning_face: )

default_language’s blast radius is too large for any data consumer to use for any use case that involves reuse or caching. Think of all the commotion whenever the coastline breaks and floods the world, or when the Great Lakes dry up, and how long it takes for the Standard layer to recover. Now imagine that multiplied by literally everything in a country at every zoom level. No changeset can cause anywhere near that scale of disruption by modifying individual name tags. Meanwhile, any legitimate change to a default_language tag would require modifying every name tag in the country. This is one of those ideas that sounds great on paper until considering how OSM is produced.

I will consider this, but I won’t be able to give this potential creation of a new language default thread any attention for 6 hours or so.

The good news: OSM Americana now supports the semicolon delimiter in name, name:*, and also ref (for things like terminal gate numbers and highway exit numbers).

The bad news: OSM Americana can’t support slashes, dashes, and spaces as delimiters. But just imagine if the places that use these delimiters were to migrate to semicolons.

Pompously announcing a really bad idea doesn’t make it less bad. The “other prominent OSM data consumers” are only doing a quick fixup to avoid ugly breakage in the name of being lenient in what they accept, that is not the same as “support”.

As has been pointed out multiple times in this thread, things are not so simple. Often in (proper) multi-lingual regions the -actual- name of the place is composite and is customarily written with a separator.

Please stop trying to rearrange the world according to a naive, CS-driven, concept of normalization.

The idea with semicolons may look neat and clean to a computer programmer, but it solves precious little for the name tag. name tags with multiple languages in them are just the very tip of the iceberg when it comes to problematic content. We also have descriptive names, names with extra info, categories in names, names with full route descriptions (any PT route). Each of these has its own particular problems for data users. When you add semicolons to the mix, you just pile yet another format on top of all those that already exist.

If we are looking pragmatically at the situation, then the de facto use of the name tag has for a long time been as the label or display name of the place. That is written down nowhere because we always strive for a name tag that adheres to the definition in the wiki. However, in reality it is what mappers tend to do (because of the feedback they get from the map) and because it helps to avoid conflicts. Maybe it’s time to just accept that the name tag is one of the ‘human’ tags in OSM, only to be displayed but not interpreted by a computer. As long as we make sure that the necessary data is also available in tags that are machine-readable, that’s a workable compromise.

My personal suggestion here would be to introduce a new tag display_name and get carto to render that in preference wherever it now renders name. Then advertise the tag among data users and start slowly moving non-names into the new tag. No mass edits necessary. Just rename tags when you come upon a problematic use. If it takes 10 years to get to a clean name state, that’s fine. No rush.

I certainly don’t think it is a particularly good idea for a single data user to impose a format for a tag that breaks pretty much everybody else’s map.

That’s fine. If the simulated screenshot would be wrong in any of these places, then by all means the name should stay as is. For the features that are using a semicolon, however, it’s clear that the mapper’s intention was not for the user to see a semicolon. If Americana is avoiding ugly breakage, then you’ve written a better headline than I was able to come up with.

Can you elaborate on what’s broken as a result of Americana or any of these other data consumers (plural) interpreting a semicolon as a value separator? Americana still renders slashes, dashes, and spaces as slashes, dashes, and spaces. If you’re concerned about semicolons getting misinterpreted, so far, I’ve come across only one name in the whole world that properly contains a semicolon in the real world – and it’s escaped as ;;. So if anything, that feature is broken in any data consumer that does not interpret the semicolon as Americana does.

This would be grist for a separate topic, about which I’m pretty sure you and I would see eye to eye.

If separating multiple equally primary names with the standard semicolon separator is such a bad idea then it would be helpful to explain why you think that. I don’t particularly want to see further proliferation of multiple names stuffed in one tag in cases where name + alt_name + *_name + name:* would be a better representation. However, with multiple names in the name tag being a common political compromise, I’d much prefer to see the standard semicolon delimiter used in those cases.

I’m glad to hear that you recognize the challenge that we face, and I appreciate that you have some concrete ideas to solve the problem. If this thread has shown anything, it’s that a lack of data standardization can cause real problems for real data consumers when alternate tagging schemes are in competition. If the community comes up with a better data modeling solution, I’m confident that the Americana project and the broader US mapping community would adopt it.

In the meantime, with my “maintainer of a community renderer project” hat on, I support the views of my fellow maintainers that supporting semi-colon delimiters is the least bad option available in the face of multiple conflicting methods to solve the same problem. Sitting around and waiting for the community to invent a better solution is inconsistent with the zeitgeist of the community around our renderer. Supporting innovation is an explicit goal of the project, and I expect we will continue to innovate in the future on long-standing challenges in OSM-based cartography. If that philosophy exposes areas where the OSM data model can do better, I consider that a positive outcome.

I recognize the unfortunate situation that rendered names with a semi-colon will look poor when rendered on maps that have not chosen to interpret a semi-colon as a delimiter. Rather than complain about the situation, I hope those on this thread with strong feelings will consider this a call to action to work with the community to solve it properly. I appreciated your response to a question on tagging standards during the recent OSMF election:

The evolution of tagging is a question I consider a core responsibility of the community that should not be decided top-down by the OSMF board. However, it is a topic where the board could give the necessary support to bring the topic forward by organizing a working group. As with the data model, such a working group would need to start with a study that researches the different options of standardization or consolidation of our tagging system, so that the community can have an informed discussion. Only then can we talk about how the OSMF can support a concrete evolution step.

If you were serious about this, and it wasn’t just an offhand statement to mollify the portion of the electorate that feels strongly about tagging standardization, consider this an opportunity to put your suggestion into action.

  • As has been pointed out, places do have composite names (@lonvia touched on other complexities that in the end cause similar issues). While they might be built by concatenating semi-independent strings, the result is still a name in its own right.

  • Turning the previously unstructured name tag into a structured tag is just a tremendously bad idea. It changes the semantics of one of the most used attributes (and @Minh_Nguyen was asking for all punctuation to be converted to semicolons, not just handling the odd misused tag) and will lose information on a big scale.

PS: poster child example Biel/Bienne - Wikipedia

The problem right now with non-standardized delimiters is that there is no way to distinguish the case of a “name in its own right” from there being two equally valid but different names used by different local linguistic groups. This is specifically the situation that normalizing on the semicolon separator for equally valid but different names can help distinguish. If the name really is hyphenated, don’t change the hyphen to a semicolon. If the name really isn’t hyphenated in practice but there are two different versions that are equally prominent and valid, then don’t use a hyphen, use a semicolon.

I believe that you are misunderstanding Minh’s comments. If the single name is understood to include hyphens and other punctuation, then they should be kept. The cases where a semicolon should be used are those where local speakers of one language use one name and local speakers of another use a different name, and those linguistic groups don’t have a unified understanding that the name should be compounded.
