Multiple delimited names in the name tag

ezekielf · December 28, 2022, 5:31am

Lots of good points raised in this thread. It seems to me that there are some situations where a standard semicolon delimiter offers a nice amount of flexibility to a data consumer. If I’m making a map, the unordered list Bozen;Bolzano tells me that two different names are used locally and the tags name:de=Bozen & name:it=Bolzano tell me which language each name is associated with. From this information I can generate a variety of different label styles depending on the purpose of the map. Some examples:

This works because each name is distinct (even if similar) so displaying both does not feel redundant. In this case, structuring the data for flexibility seems good to me. However, this should only be for cases where both names truly are equally important locally. If the local community considers one primary and the other secondary, putting one in name and the other in alt_name makes that clear and should be preferred.
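As a rough sketch of that flexibility (the helper name `names_by_language` and the label styles are illustrative assumptions, not any renderer's actual behavior), a data consumer could split the semicolon list and match each value against the name:xx tags:

```python
def names_by_language(tags):
    """Map each value of the semicolon-delimited name tag to the language
    codes whose name:<code> tag holds exactly the same value."""
    local_names = [n.strip() for n in tags.get("name", "").split(";") if n.strip()]
    result = {}
    for local in local_names:
        result[local] = sorted(
            key.split(":", 1)[1]
            for key, value in tags.items()
            if key.startswith("name:") and value == local
        )
    return result


tags = {"name": "Bozen;Bolzano", "name:de": "Bozen", "name:it": "Bolzano"}
by_lang = names_by_language(tags)

# Two hypothetical label styles built from the same data
label_inline = " / ".join(by_lang)   # both names on one line
label_stacked = "\n".join(by_lang)   # one name per line
```

The map maker chooses the style; the data only says which names are in local use and which language each belongs to.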

The cases where some words are redundant between the multiple names are less clear to me. While Carabinieri Bozen;Carabinieri Bolzano would seem perfectly accurate, the label style examples above would look much more cluttered with the word Carabinieri repeated. It certainly would be nice to be able to construct “Carabinieri Bozen / Bolzano” or “Carabinieri Bozen–Bolzano” from the list of both full names.

The Canadian Province of New Brunswick is bilingual and officially goes by its French name, Nouveau-Brunswick, as well. A map label of “New / Nouveau-Brunswick” wouldn’t be wrong, but in this case “New Brunswick – Nouveau-Brunswick” seems preferable to me. Probably because Brunswick by itself refers to a different place.

Ottawa, Canada signs its streets in French and English, but the unique part of the name is the same in both languages (and sometimes the prefix and suffix words are too). For example “rue Percy St.” which expands to “rue Percy Street” or “av. Powell Ave.” which expands to “avenue Powell Avenue”. It looks like Ottawa mappers have chosen to put the English variant in the name tag, but let’s imagine they used a semicolon delimited list. Here are some examples of how these two names might be displayed on a map:

That’s quite a lot of redundancy just to show that the French word for street is rue, but it is a reasonable representation of the signage on the ground. Matching the signage exactly with the format “rue Percy St.” and “av. Powell Ave.” would be the most space efficient option, though perhaps not universally preferred.

An automated rule set might behave a bit differently for each of these examples:

Carabinieri Bozen;Carabinieri Bolzano → “Carabinieri Bozen / Bolzano”
New Brunswick;Nouveau-Brunswick → “New Brunswick – Nouveau-Brunswick”
rue Percy;Percy Street → “rue Percy St.”
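A minimal sketch of what such a rule set could look like, assuming exactly two names and whole-word comparison only (the function name `combine_names` and the three rules are my own illustration, not an existing implementation):

```python
def combine_names(name_tag):
    """Combine a two-value semicolon list into one label, printing shared
    words only once where a simple word-level rule applies."""
    names = [n.strip() for n in name_tag.split(";") if n.strip()]
    if len(names) != 2:
        return " / ".join(names)
    a, b = names[0].split(), names[1].split()

    # Rule 1: shared leading words are printed once, followed by the
    # differing remainders (always leave at least one word to differ).
    prefix = 0
    while prefix < min(len(a), len(b)) - 1 and a[prefix] == b[prefix]:
        prefix += 1
    if prefix:
        shared = " ".join(a[:prefix])
        return f"{shared} {' '.join(a[prefix:])} / {' '.join(b[prefix:])}"

    # Rule 2: the end of the first name is the start of the second, so the
    # two names are merged around the overlapping words.
    for size in range(min(len(a), len(b)), 0, -1):
        if a[-size:] == b[:size]:
            return " ".join(a + b[size:])

    # Rule 3: no shared whole words, so both names are shown in full.
    return f"{names[0]} – {names[1]}"
```

With these rules the first two inputs give the labels listed above, while the third gives “rue Percy Street”; abbreviating to “St.” as on the signs would need an additional abbreviation table, which this sketch leaves out.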

There are probably other semi-redundant multi-name situations where a different output would be optimal. An automated rule set like this would be difficult to get right, though it is an interesting problem to think about. Simply formatting these types of names exactly as one thinks they ought to be printed on a map certainly is enticing, but it limits what map makers are able to do with the data. I think shifting more delimited names to use the standard semicolon would be a net positive in the long run. For more complex redundant word cases, perhaps it doesn’t make sense currently, but for the simple cases the benefit seems quite clear to me.

aighes · December 28, 2022, 9:30am

Agree, Carabinieri is a loanword, Bozen / Bolzano is an additional description/location. Something like brand + location.

Agree, New Brunswick and Nouveau-Brunswick belong together.

Here I disagree. rue Percy St. isn’t easy to understand and somehow breaks the unity of rue + Percy and Percy + Street. Of course, if you have knowledge of both languages and are able to pick the useful part, it’s fine. But I assume the majority of end users would simply assume the road is called rue Percy Street. If space is an issue, I would rather see only one of the names, as it’s less confusing. At least on vector-tiles (so I get my preferred one). On static tiles, both should display (maybe just replace ; with /).

In summary, for geographical names (places, roads, admin. areas) I would expect all the single names to be displayed as is. On amenities the “brand” part should be combined if it is the same for all the single names.

SK53 · December 28, 2022, 1:08pm

Good time to upload this composite image of road names in N. Wales. Immediately relevant is the sign Lon Efelyn with the conjoined construction to “Ffordd Hwfa Road” used to save space, but it can be seen that even in most cases this doesn’t work because of differences in English & Welsh orthography (Tyrner/Turner, Efelyn/Evelyn). One would have to look at other languages; I’ve certainly seen street signs of the “rue Percy St.” form, but would have to check where. It may cause confusion if such differences led to rather different ways of showing the name.

Also worth noting is that “Hwfa Road” has been spray painted out on the sign for that road (an illustration of cultural issues).

ezekielf · December 28, 2022, 2:32pm

This is a good point. If a map maker chose to do this, I’d imagine they’d style the prefix and suffix words a bit differently to set them apart from the main word. For example, on the signs, rue and St. use a smaller font size than Percy. Displaying both names is the clearer option, but the point is that giving map style designers options is a good thing! Most maps probably wouldn’t do this, but if someone figures out the logic to make it work and chooses this style for their map, that’s great. I want to enable creative and interesting diversity in map design.

Minh_Nguyen · December 28, 2022, 4:50pm

If Ottawans have their own idiosyncratic local style for combining the names in writing or speech, just as Montrealers greet strangers with an ambivalent “Bonjour-Hi”, then I certainly have no objection to that information being recorded on the map for all to see, as long as there’s no expectation that a map would avoid the redundancy somehow.

Meanwhile, in the linguistic gumbo of southern Louisiana, some places prioritize English on street name signs, others prioritize French or Spanish, and still others bear an English name borrowed from Vietnamese, but none attempt to combine the names as far as I’m aware.

(Wikimedia Commons has an extensive collection of multilingual street name signs in the U.S.)

Mateusz_Konieczny · December 28, 2022, 7:13pm

And one more wrinkle related to the “tag what is signed” idea. Poland has a relatively low barrier for double language settlement signs like this one:

Nevertheless, only the Polish name is put into the name tag (see Node: Gąsiorowice (863897920) | OpenStreetMap) as it is the clearly dominant local name.

ezekielf · December 28, 2022, 8:01pm

This seems appropriate. I’m not suggesting that name should always include what is signed, only that mappers in a particular area may decide that is the appropriate thing to do if multiple names are widely used by residents of that area. And when this decision has been made, a semicolon delimited list is preferable to other ways of putting multiple names in one tag. There are plenty of signs in both English and French where I live, but we don’t have a large French speaking population. Instead, this is for the benefit of French speaking tourists from neighboring Quebec. Therefore it would not be appropriate to tag objects with both French and English in the name tag. However in an area where there is significant usage of both French and English it could be appropriate.

stevea · December 29, 2022, 12:24am

I want to agree with Zeke, as I do agree with him. Yet, I also want to “agree plus.” What I mean is that my state has dozens, maybe in the low hundreds of name:xy=* tags (where xy is a language code). Now, there’s a lot of tourism, I can see that making sense, even as I also see that not every small town or street sign is going to be named in 117 languages (and some things are). So, when Zeke says “be appropriate,” yes, sure: it is appropriate to be bilingual in or around Québec, or instead of French, maybe Spanish, or Korean in Koreatown or Flemish and two other languages in Flanders or whatever. That’s fine. However, while the number of name=* tags can be one, or two such as name:fr, two can become two hundred. We shouldn’t “struggle to get there,” but nor should we ignore that our tagging is as elastic to our growth as we wish to make it. We don’t want to ridiculously-overload, but there isn’t anything wrong with a “you wanna back up the truck and put a lot of languages here?” Maybe in a sustainable “grow the world’s languages on our map” campaign? Yeah, I might nod my head at something like that: the sky’s the limit, and I think we can achieve limits of reasonableness if it gets out of hand.

Balance. There is “appropriate,” there is “blue sky ahead, let’s stretch our wings and imagine…”, too.

Semicolons aren’t going away. Tags of the form name:xy=* (in many languages) aren’t going away. We have “well-accepted syntax,” so, our map will grow as it will, especially as we “garden it ahead.”

Yeah, that’s right.

Minh_Nguyen · December 29, 2022, 5:38pm

(Sorry, wrong link – that should’ve been pointing to a sign on the other side of the street, kind of hard to spot on your own. I need to be more careful about testing Bing Streetside links before sharing them.)

Minh_Nguyen · December 29, 2022, 5:59pm

Thanks for this graphic! It often helps to center a discussion around a concrete visualization of the intended effect. Note that there are even more possible treatments. For example, this 1939 National Geographic map puts the anglicized name in a gloss but places the gloss on either side of the main label, wherever space allows:

This 1962 map of Libya (with a color scheme suspiciously similar to Americana) places the romanized Arabic name together with an anglicized name in a gloss and also places an Arabic-script name independently, again wherever space allows:

Even if a map lays out CJK horizontally alongside Latin characters, it might need to choose a delimiter other than a hyphen, en dash, or em dash. A standard Western dash could be confused with the character “一”, which has a specific meaning in Chinese, Japanese, and Korean. These languages use a two-em dash instead (⸺), but that doesn’t fit well on a map.

I’ve seen English-language maps that place delimiters such as a bullet (•), triangle (▴), or swung dash (⁓) between foreign-language names of equal standing, as a less ambiguous alternative to a dash, slash, or interpunct (·/．). Unfortunately, I don’t have any photos to share offhand.

Matija_Nalis · December 31, 2022, 1:27am

Perhaps I misunderstood, but it seems to me that you say “we don’t need vector tiles to solve this problem”, and then provide a link to a service which is using vector tiles to accomplish it. Could americanamap.org follow user preferences like that if it were using classical TMS? And if not, isn’t using vector tiles then a prerequisite for solving the problem?

My point was that you can’t “localize in French” (or Chinese, or Russian, or Greek, or …) depending on user preferences unless you have some service to actually show them. Sure, you could run hundreds of different TMS tile servers, one for each language, and that would handle it, but at a cost of hundreds of times more resources, which does not seem feasible to me. Or, you can have one vector tile server, which uses fewer resources than one TMS server (not to mention hundreds of them). One of those solutions is realistic, the other does not seem so to me.

Sure, that is an additional step, and how to handle it should be decided after vector tiles are implemented. Because without it, we don’t need to bother about user preferences, as we can’t really implement them. Or so it seems to me. But for that additional step, I have suggested ways to handle it in my previous post.

ZeLonewolf · December 31, 2022, 1:36am

A pre-generated raster-based tileserver is completely capable of implementing the exact same thing that Americana does in vector tiles, for a single language and list of fallback languages. Also, since raster stacks typically render high-zoom tiles on the fly (>z12 for the standard tile layer), it is also possible, at least theoretically, to do it dynamically as @TomH has noted.

However, neither a vector nor a raster stack can implement language fallbacks correctly if multiple names are shoved into a single tag using arbitrary delimiters.

This is a data modeling problem, not a rendering problem; let’s not mix them up.

Matija_Nalis · December 31, 2022, 2:16am

It seems to me like we’re missing what the other is trying to say. I’m talking about the need to handle hundreds of languages (both as a primary, and as fallbacks, so it is a matrix of hundreds * hundreds); you seem to be of the opinion that it is quite OK to settle on a single one (with a few fallbacks).
I personally would find such a single-language solution (whatever language is chosen) unsatisfactory. I.e. no matter which data model is chosen, if the solution forces rendering in “single language and list of fallback languages” for the whole world (thus disregarding user preferences) I would not see it as an improvement over current osm.org raster TMS rendering (which also disregards user preferences).

It seems to me that you insist that it is one or the other, exclusively. What I’m trying to point out is my belief that the best data model in the world is not going to help you much if you’re unable to render it in the end (and it is my impression it is not realistically possible to render any globally usable solution with raster-based TMS). So, if you want a working solution that will appeal to the people, you need both.
Yes, data model is important, I absolutely agree. But so is the ability to render it. Or do you disagree?

ezekielf · December 31, 2022, 2:26am

The point is that this thread is about the data modeling (that can enable a variety of different rendering options), not about what the rendering on osm.org will or won’t do.

Matija_Nalis · December 31, 2022, 3:10am

Fair point; but the decision about how it should be mapped should in my opinion be seriously influenced by what can be done with it afterwards. (e.g. see quite complex set of preferences which might’ve been ideal from data modeling POV; but lack of realistically available practical implementations after it is implemented probably makes it much less desirable).

Sure, the map at osm.org is one of the more popular data consumers. But the decision would influence all other renderers too (who would have to cater to it, if current concepts are turned upside-down). The more they need to change, the less desirable the solution likely is (i.e. in the ideal case, renderers don’t need to change at all but will still happily deliver close-to-optimal renderings).

stevea · December 31, 2022, 3:46am

I think zooming out a bit and recognizing what are widely acknowledged to be “friction points” in OSM might be in order here.

Yes, OSM is, fundamentally, data. And it is correct for a thread like this to ask “how should we best model our data?” That question is in the context of OSM being (again), fundamentally data. What is known by most people, but not (always) recognized by many, especially more novice consumers of OSM data, is that the [renderer, router, parser of OSM data, text-to-speech module…] is a rather bespoke (custom), hand-tuned, changing-and-improving-over-time… entity. Carto, for example, has its “wide appeal as a mapping renderer to please many for many general-purposes uses,” and other use-cases have their idiosyncrasies (by design).

I think part of why there is some “missing each other” here is that the “generic” idea that OSM is fundamentally data is often (very often) lost when people discuss what is fundamentally a data-modeling problem. Those who do recognize that, those who don’t want said data modeling to “be seriously influenced by what can be done with (them) afterwards…” (there was more after that was so stated, but I’ll truncate there) is where I politely say “Bzzzzzt!” (foul buzzer sound). We really don’t (and shouldn’t) base data-modeling decisions on how that “would influence all other renderers.” Renderers are responsible for rendering OSM’s data (as are routers to route them, and text-to-speech parsers to speak them). These really should happen independently, and hence our admonishment “don’t tag for the renderer.”

Given “ideal” data, if renderers are “sub-optimal” or “close-to-optimal” (but not), that’s not the fault of the data, it is the fault of the renderer (again, downstream use-case parser). You can call me a “data purist,” but OSM really goes down a rabbit hole (which we don’t want to) when this sort of “tweak the data for a/the parser” happens. I’m glad to see that some are “nipping that in the bud” before it full-flowers into a widespread misunderstanding.

Tag. Tag well. (And, its adjunct of “Wiki. Wiki well.” goes along with that. Substitute “Write documentation” for “wiki” as you wish). Parsers, routers, renderers, text-to-speech interpreters, downstream use-cases you haven’t yet even thought of? Forget about 'em. It’s YOUR job (our job) to tag “what is.” The rest will take care of itself.

Now, if you are writing a renderer, we can start another thread.

stevea · December 31, 2022, 4:07am

Also, I’d very happily accept a version of something (raster, vector…doesn’t matter) if it correctly implements multiple language rendering, as data groomers grow two, three, several languages, testing that their implementation and deployment are successful, rather than have us attempt to get “hundreds” or “all of them” (languages) implemented…and we end up languishing under the crush of “too much at once.”

Stubbing (rendering code, for example) works. Yes, some (corner cases, what some might call “minority” use cases…) will have to get in line and wait longer, it’s true. But when their turn comes, it’s far more likely to work, as it’s been “bullet-proofed” with earlier (perhaps easier) cases. No? Your case is buggy? That can be handled (in turn). But to “debug the whole world at once” is simply unrealistic.

Stub. Stub well.

Minh_Nguyen · December 31, 2022, 8:04am

To be clear, the focus of this discussion is the name key, which is not in a single language worldwide. This discussion arose because even a map that uses name:* keys can achieve an extra bit of sophistication by including the name(s) in the local language(s).

If we suppose that a renderer is already showing the name in one or more of the user’s preferred languages, all that’s left is to show any remaining names in the local languages. Maybe the renderer can pull in some external comprehensive source of what’s spoken or signposted in every locality, but a very tempting simpler alternative is to just use name. What a renderer can do with name depends on how its values are delimited:

If the values are separated by an arbitrary, human-readable delimiter, then it can only display the entire name tag verbatim, potentially repeating names that have already been listed.
If the values are separated by a predictable semicolon, then it can display only the names that haven’t been listed yet. It can also apply some nice punctuation or spacing between the names.

Note that names may overlap between unrelated languages. It’s “Bolzano” to both Italian and English, so an English speaker would want the Italian name to be omitted. No matter how important Italian is locally, its name for the city is redundant to the English name. At this point, we don’t even need to know that it’s Italian, just that this seven-character-long name has already appeared earlier in the label.
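To make the semicolon case concrete, here is a small sketch under stated assumptions: the function `build_label`, its parameters, and the name:en=Bolzano tag are hypothetical, chosen only to mirror the Bolzano example above.

```python
def build_label(tags, preferred_languages, separator=" / "):
    """Show names in the user's preferred languages first, then append any
    local names from the semicolon-delimited name tag not already shown."""
    shown = []
    for lang in preferred_languages:
        value = tags.get(f"name:{lang}")
        if value and value not in shown:
            shown.append(value)
    # Plain string comparison: the language of each local name is irrelevant,
    # only whether the same text already appears in the label.
    for local in tags.get("name", "").split(";"):
        local = local.strip()
        if local and local not in shown:
            shown.append(local)
    return separator.join(shown)


tags = {
    "name": "Bozen;Bolzano",
    "name:de": "Bozen",
    "name:it": "Bolzano",
    "name:en": "Bolzano",  # assumed for illustration
}
english_label = build_label(tags, ["en"])
german_label = build_label(tags, ["de"])
```

For an English speaker the Italian “Bolzano” is dropped as a repeat and only “Bozen” is added; the separator is a styling choice left to the renderer.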

I would caution against relying too heavily on user preferences. Let’s consider your ideal desired behavior again:

Is there an example of an online map (doesn’t have to be OSM-based) that implements such complex fallbacks dynamically based on user preferences alone? What does the form look like to specify your preferences? Most users would not want to fill out a questionnaire just to see a map.

In reality, some language fallbacks are handled automatically for the user based on a very simple user preference. OSM Americana currently implements the ICU locale fallback algorithm, which is the bare minimum needed to make browser preferences align with the data in the map tiles. Users don’t need to explicitly add en as a fallback to en-US because Americana knows to strip off the region code.
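The region-stripping step can be sketched roughly as follows. This is a simplified truncation of the locale code, not the full ICU algorithm and not Americana’s actual code; `fallback_chain` and `pick_name` are hypothetical names.

```python
def fallback_chain(locale):
    """Return the locale followed by progressively shorter prefixes,
    e.g. a region-specific code and then the bare language code."""
    parts = locale.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def pick_name(tags, locale):
    """Pick the first name:<code> tag found along the fallback chain,
    falling back to the plain name tag as a last resort."""
    for code in fallback_chain(locale):
        value = tags.get(f"name:{code}")
        if value:
            return value
    return tags.get("name")


chain = fallback_chain("en-US")
picked = pick_name({"name": "Bozen;Bolzano", "name:en": "Bolzano"}, "en-US")
```

The user only states en-US; the lookup tries name:en-US, then name:en, then name.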

If you speak a language like Serbian, you’d probably appreciate the more nuanced fallbacks in the CLDR language matching algorithm or MediaWiki’s homegrown alternative:

Typical UIs use English as a last resort fallback, but for maps it would be better to use the local language, hence the focus on name. Maybe the behavior could vary depending on the country you’re looking at. This crosses the boundary into what needs to be implemented server-side, where there’s less ability to respond to dynamic user preferences. But you know, OSM Americana is easy to fork – a Croatia-focused style can afford to hard-code some assumptions about its users’ language skills. Americana tries to avoid making assumptions because the U.S. is such a multilingual country.

The most sophisticated fallback strategies are difficult to implement, but internationalization is never an all-or-nothing affair. For a renderer unable to implement language identification and language-aware transliteration, presenting name is not a terrible alternative. The only catch is that it can contain multiple names separated by one of several punctuation characters that can reasonably appear inside a name too.

stevea · December 31, 2022, 8:12am

That is outstanding, Minh, thank you. The fallback chains diagram is most informative.

I think what might be happening at a LOT of “levels” of this discussion simultaneously is questioning where (or there being multiple questions of where) the fallback decisions happen. People have thought about this way more than me, for sure.

One thing that hasn’t been discussed is how much what OSM does is “seen as more-standard” behavior. In some sense, what we say and do now could seriously influence how things go forward. (Like that isn’t true a fair bit already).

Minh really nails it with fork-ability and how something “already somewhat like what YOU might like” isn’t terribly far away or impossible. It’s a “use your words,” (spec it out) and implement chain.

I love what I’m seeing here: excellent words (and even diagrams).

dieterdreist · December 31, 2022, 12:21pm

If the values are separated by an arbitrary, human-readable delimiter, then it can only display the entire name tag verbatim, potentially repeating names that have already been listed.

If the values are separated by a predictable semicolon, then it can display only the names that haven’t been listed yet. It can also apply some nice punctuation or spacing between the names.

currently, the separators “ / “ or “ - “ are used, that’s not any possible arbitrary delimiter but just 2 alternatives

Note that names may overlap between unrelated languages. It’s “Bolzano” to both Italian and English, so an English speaker would want the Italian name to be omitted.

a simple check for substring included would make it, it doesn’t matter to the English reader if the “Bolzano” she sees is meant to be Italian or English, it is the same.

No matter how important Italian is locally, its name for the city is redundant to the English name. At this point, we don’t even need to know that it’s Italian, just that this seven-character-long name has already appeared earlier in the label.

exactly, and hence we don’t want to repeat it. We do not really need to switch from “ - “ to semicolons to omit the localized string if it is already contained in the local label.
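a rough sketch of that substring check, where the function name and the parenthesised fallback format are only assumptions for illustration:

```python
def label_with_local(localized, local_label):
    """Show only the local label when it already contains the localized
    name; otherwise show the localized name with the local label after it."""
    if localized in local_label:
        return local_label
    return f"{localized} ({local_label})"


contained = label_with_local("Bolzano", "Bozen - Bolzano")
not_contained = label_with_local("Munich", "München")
```

this only avoids repeating the localized name; it does not split the local label into its individual names.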