[RFC] Feature Proposal - Add languages: tags for name rendering

aseigo · November 13, 2024, 9:57am

There are a few ongoing discussions on the OSM Community forum about the use of the name tag for features in multi-lingual contexts. The problems stem from treating the name tag as the to-be-rendered localized name.

Improvements that address the main concerns could be made by giving renderers enough information in the OSM dataset as to how to select appropriate name: tags for rendering in a given region.

Two new tags, languages:official and languages:preferred, are proposed for use on administrative boundaries to do just that. A detailed description of the problem, and of how these tags could address the issues, is on the OSM wiki here:

https://wiki.openstreetmap.org/wiki/Proposal:Add_languages:_tags_for_name_rendering

Updates

A summary of some of the outcomes of the discussion thus far:

The use of the word official in the tag names causes a variety of questions and concerns. It will be changed in an upcoming draft to something like languages:display_order. I’m still gathering input/feedback on improved naming.
A corner-case that affects a small, but non-zero, number of places has been identified: admin boundaries which have different rules to the place names within them. This has been noted in the Open Questions section of the proposal and before moving to a vote, that issue will be addressed.
It was not entirely clear to everyone that the name field would remain as it currently is. Under the proposal, name serves both as a backwards-compatibility feature and as a fallback in cases that cannot otherwise be provided for.
Multiple names in the same language, as currently serviced by the old_name, alt_name (and possibly other?) tags, are not in scope for this proposal, which is focused on providing language metadata for names. I will clarify this in the next draft of the proposal.

Also, a big “thank-you” to everyone who has provided feedback so far. The number of examples, interesting edge-cases, oddities, specializations, and concerns that have been collected has been fantastic, not to mention super helpful!

Kovoschiz · November 13, 2024, 10:43am

language:*= already exists. There was something more complex proposed, but it’s not really needed now. Proposal:Language Purpose - OpenStreetMap Wiki
Creating an ordered list of languages in them seems a risk for controversy in other cases. language:aa=official + language:bb=official would be equal in status.

SimonPoole · November 13, 2024, 11:07am

This has the same issues as Proposal:Defaults - OpenStreetMap Wiki

requires the relevant boundaries (potentially hierarchically) to be included in any extract (just the concept of having to recursively fetch the admin boundary data in an editor prior to editing is a no-go for essentially anywhere outside of Liechtenstein)
doesn’t cater for situations in which the relevant tag values differ on non-admin boundaries
technically it is not particularly straightforward how to use the information in a reasonable fashion on the fly for rendering, i.e. you are going to have to build the equivalent of one of the many location → country/region/whatever indices, which raises the question: why not use one of them to start with?

The only advantage to attaching the information to boundaries is that users can use conventional OSM editors to do that over adding the information to an external file/DB (for example GitHub - simonpoole/geocontext: Country specific geo/transport related properties), but in the end you are not going to be using the information without intermediate processing.

aseigo · November 13, 2024, 11:31am

Creating an ordered list of languages in them seems a risk for controversy in other cases

No matter what the politics are, there are limited choices in rendering, be it visual, audio, etc. Audio renderings are always linear in terms of time, and visual are linear in terms of space.

Regardless of politics, an ordering is needed.

This probably isn’t a real-world issue, however. Outside of actual zones of conflict, name orderings are usually defined, or have an ad-hoc agreement applied because street signage, postal addresses, etc. face the same linearity issues.

language:aa=official + language:bb=official would be equal in status

This would be worse in practice as it would result in renderers not applying the officially mandated ordering. Most places that have multilingual naming and signage which follows it have official guidelines for ordering.

See the example of Canada and Quebec in the proposal.

Kovoschiz · November 13, 2024, 11:35am

The signposted language ordering would be in language:preferred= . But many jurisdictions have equally important official languages. Another case might be one primary language, with the others being equally less important secondary languages. These situations can’t be handled.

aseigo · November 13, 2024, 11:42am

If we look at the actual map, this is an additional 2-3 boundaries in most cases, assuming that neighbourhood and city boundaries are often already loaded. These boundaries could also be extracted as a separate layer for use in renderers.

I looked at the expected costs in terms of additional data and processing time and they both seem to be negligible in practice.

If you have numbers that demonstrate otherwise, please share them so I can adjust my viewpoint.

Do you have examples of this in the real world, specifically for languages? Language policies are generally set by governmental agencies, and typically map to boundaries within their jurisdiction.

I looked around at various cases around the world, though certainly was not exhaustive. So, again, if you have examples of this in the real world, I would appreciate looking at them.

(As an interesting (current) counter-example: In the case of the Haida Gwaii example, there currently is not an administrative boundary. However, that is currently an omission as there is a recognized land title that is now overseen by the Haida nation government.)

I included a diagram of the algorithm, and extracting the administrative boundaries in an area is not particularly difficult. Traversing them up to country level is a series of spatial queries, which I suspect themselves may be optimized with pre-processing.
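A minimal sketch of that upward traversal, assuming the enclosing boundaries have already been found and sorted innermost first, and that the tag values are semicolon-separated language codes. Both points are assumptions made for illustration, as are the function names and the sample tags; this is one reading of the idea, not the algorithm from the proposal's diagram.

```python
def parse_languages(value):
    """Split a semicolon-separated tag value into a list of language codes."""
    return [code.strip() for code in (value or "").split(";") if code.strip()]


def resolve_language_lists(boundaries):
    """Walk boundaries from innermost to outermost.

    `boundaries` is a list of tag dicts, innermost first. Returns the nearest
    non-empty preferred list and the nearest non-empty official list; either
    may be empty if no enclosing boundary carries the tag.
    """
    preferred, official = [], []
    for tags in boundaries:
        if not preferred:
            preferred = parse_languages(tags.get("languages:preferred"))
        if not official:
            official = parse_languages(tags.get("languages:official"))
        if preferred and official:
            break  # nothing further up can change the result
    return preferred, official


# Hypothetical data: a city without language tags, inside a region that sets
# a preferred order, inside a country that sets the official list.
enclosing = [
    {"admin_level": "8", "name": "Example City"},
    {"admin_level": "4", "languages:preferred": "fr"},
    {"admin_level": "2", "languages:official": "en;fr"},
]
preferred, official = resolve_language_lists(enclosing)
```

Because the walk stops as soon as both lists are known, the common case touches only the two or three boundaries nearest the feature.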

I would not expect renderers to do additional queries to the dataset, one by one, as they slowly increase their search space up to country-level, as you note:

I assume the data would end up included in such indices.

What the proposal provides is a way to notate within the dataset what those values should be in a way that is easily accessible and standardized, allowing contributors to maintain this data with familiar editing tools and practices, and to do so in a way that maps to how these regulations are typically worked out in the real world: by administrative bodies with specific regional oversight.

Perhaps, though, I am misunderstanding the point you are attempting to make here, in which case I apologize in advance and would ask for clarification.

aseigo · November 13, 2024, 11:44am

The idea is to model rendering practices and mandates. As noted above, rendering is (with a few exceptions) linearized in space or (in the case of audio) time.

The idea is not to model political details, but to allow software rendering the map to more accurately, appropriately, and consistently do so.

Do you have real-world use cases where the rendering of languages must be un-ordered, or even is in practice (e.g. on the streets) unordered?

Kovoschiz · November 13, 2024, 11:50am

I’m asking about the official languages, not the signposted language order. If that’s not what you want, it should be dropped from this proposal.

aseigo · November 13, 2024, 12:00pm

This faces the same set of issues when it comes to rendering, however.

Where there isn’t a preferred regional ordering, the nearest official language list is the ordering. The purpose of that is to allow regional variation (see Quebec or Biel/Bienne), or even deviation (see Haida Gwaii), to be taken into account when rendering.

Perhaps the issue you are raising stems from the choice of the word “official”. Naming is indeed hard.

If you have alternative suggestions for “languages:official” which are clear, accurate, and will not devolve into the same problems, that’d be great and I’d be more than happy to adapt the proposal!

FWIW, I passed on the idea of language:<code>:official=yes|no (something I did consider at earlier stages of the drafting process), as it does not help when deciding how to order rendering. For many countries, that will be the only list, and in most cases with multiple official languages it would mean using languages:preferred to define the ordering for rendering (probably at the country level) which just moves the problem to “what is the order of the preferred list?!”

Also, I hate to ask again, but: do you have an example of a country which has multiple languages which are considered of equal standing and which also has no prescribed ordering of the languages for the purposes of maps, signs, etc.?

edit: I wonder if languages:order:official and languages:order:preferred, or even languages:order:regional, (or similar) would be clearer for people? It seems a bit unfortunate to bloat the key names, but if it is less of a problem for people …

SimonPoole · November 13, 2024, 12:02pm

Every case of suppressed minorities with their own language when usage doesn’t match the official administrative boundaries.

You are making lots of assumptions about the environment, for example assuming a spatial database with essentially all relevant boundary geometries. Yes, if you have that, then retrieving the information is straightforward (though you will still need to determine via a spatial query which boundary polygons the object intersects with/is covered by).
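To make the "covered by" step concrete, here is a rough sketch of the kind of test a consumer without a spatial database would have to run for itself. It is a plain ray-casting check against a single outer ring; real admin boundaries are multipolygons with holes, so this is an illustration under simplifying assumptions, and the boundary data and function names are hypothetical.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of (lon, lat) pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        crosses = (yi > lat) != (yj > lat)
        if crosses and lon < (xj - xi) * (lat - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def covering_boundaries(lon, lat, boundaries):
    """Return the names of boundaries whose outer ring contains the point."""
    return [b["name"] for b in boundaries if point_in_ring(lon, lat, b["ring"])]


# Hypothetical square boundaries, treated as planar coordinates.
boundaries = [
    {"name": "country", "ring": [(0, 0), (10, 0), (10, 10), (0, 10)]},
    {"name": "region", "ring": [(2, 2), (6, 2), (6, 6), (2, 6)]},
    {"name": "elsewhere", "ring": [(20, 20), (30, 20), (30, 30), (20, 30)]},
]
```

The test itself is cheap; the cost being pointed at is having every relevant boundary geometry available in the first place.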

aseigo · November 13, 2024, 12:12pm

Can you point to a region where official signage and names for official use are in the language of the suppressed minority, but the official mandate is to use something else, and this does not map to administrative boundaries?

Note, I’m not arguing that suppressed minorities are not an issue. I actually included one such example in the draft.

And we do have ways of representing this information in the OSM dataset, via name:<code>.

What we ought to be looking for is a map that reflects practice, not alternative political concepts. That is an unresolvable can of worms. So I’m wondering where such a situation exists that would cause incorrect rendering.

If we can identify such concrete use cases, perhaps there is a way to model that as well that is suitable. Without concrete use cases, we’re chasing theoretical anima.

Not at all! The “failure case” of this proposal is the current status quo. That is not accidental, but purposeful, because we can’t assume perfect data coverage.

The name tag is always there as a fallback. That also happens to be the current status quo. Which means that in the worst case scenario where necessary data is missing, be that administrative boundaries or languages tags on them, it devolves into the current situation.
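That fallback chain can be sketched as follows, assuming semicolon-separated language lists, a " / " separator between names, and a rule that the preferred list is tried before the official one. All three are assumptions for illustration rather than details confirmed in this thread, and the function names and sample street are hypothetical.

```python
def parse_languages(value):
    """Split a semicolon-separated tag value into a list of language codes."""
    return [code.strip() for code in (value or "").split(";") if code.strip()]


def label_for(feature_tags, preferred="", official="", separator=" / "):
    """Pick a label: preferred languages, then official ones, then `name`."""
    for order in (parse_languages(preferred), parse_languages(official)):
        names = [feature_tags[f"name:{code}"] for code in order
                 if f"name:{code}" in feature_tags]
        names = list(dict.fromkeys(names))  # drop repeats, keep order
        if names:
            return separator.join(names)
    # No boundary data, or no matching name:<code> tags: today's behaviour.
    return feature_tags.get("name")


street = {
    "name": "Rue Exemple / Example Street",
    "name:fr": "Rue Exemple",
    "name:en": "Example Street",
}
```

With no language lists at all, `label_for(street)` returns the plain name value, which is the status quo described above.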

But where that data exists, the renderers can be improved. And data, of course, can be added where missing.

And despite probably not covering some % of the dataset, it does unarguably address the examples provided. “Some admin boundaries might be wrong” is letting the perfect ruin the good.

That said, if you have a way to improve the proposal which also improves the status quo even when the OSM dataset or specific extracts are incomplete, I’m all ears/eyes.

SimonPoole · November 13, 2024, 1:05pm

The issue is not that I wouldn’t like better ways to support multi-lingual name rendering, it is just that you are banking on a lot of currently non-existing specific support for the scheme so that it is actually useful. And if the defaults scheme is anything to go by that is unlikely to be forthcoming just like that (though the pain points might be somewhat different in this case).

Your case would be far better if you could demonstrate, at least at prototype level, support for osm2pgsql and one of the vector tile generators.

lonvia · November 13, 2024, 1:15pm

How does this proposal differ from the already established default_language tag?

aseigo · November 13, 2024, 1:15pm

This is indeed true, as noted in the Open Questions section of the proposal.

I feel it is a surmountable obstacle, though.

Agreed, it needs someone(s) to put in some effort on the implementation side.

Where I am slightly more optimistic than with the defaults schema proposal is that this is more limited in scope and has a clearer direct benefit in terms of the presentation of rather high-value data (namely, names).

But I completely agree with you that this is not the sort of thing that could just be thrown out into the void and Hopefully Maybe the legwork will magically be done.

This is indeed a good next step.

Do you have a recommendation on which vector tile generator would have the most impact in terms of demonstrating the possibilities to the broadest audience?

To be honest, I do not wish to spend time and energy developing something nobody would accept in the first place, so there is value in early rounds of peer review … after all, if the concept is somehow actually unworkable, I can always spend my time drawing building outlines somewhere instead.

alan_gr · November 13, 2024, 3:51pm

I didn’t fully understand the Biel/Bienne example. I can see how you would end up with the same rendering as now for streets and so on within the city. But what about the city of Biel/Bienne itself? Wouldn’t that change to being labelled as Bienne / Biel? Is that intended?

I think there are quite a few similar situations where the name conventionally used for a region or city is a kind of compromise that doesn’t exactly match objects inside the city:

The town of Dingle / Daingean Uí Chúis in Ireland is within a “Gaeltacht” (Irish Gaelic speaking area). So most streets have name= set to the Irish name. That would work OK if the Gaeltacht could be defined as an admin boundary with the only preferred language set to Irish Gaelic (how easily that can be done is a separate question, but let’s leave that aside for now). But the town itself is an exception, using a compound of name:en and name:ga.

In fact both the country Ireland and the state Northern Ireland are somewhat similar - the compound names or labels currently in the “name” field don’t match names contained within those entities.

Vitoria-Gasteiz in Spain is another example. Streets in the city have name=(Basque name)/(Spanish name), which could be handled by setting the preferred language on the city boundary (or a larger admin area containing it). But then how would you display the conventional Vitoria-Gasteiz label?

aseigo · November 13, 2024, 4:13pm

…

Great examples, and they do indeed demonstrate a flaw in the current proposal. (Thank you for that!)

The problem you note stems from the administrative boundary also being used for the settlement itself.

In a discussion with a co-worker and fellow OSM contributor, they raised a similar question about the potential for odd features that should just default to the name tag. This led to the “Special case: defined on a feature” section in the proposal.

However, you are absolutely right that administrative areas which do not follow the same preferences as the contents are not resolved with that.

One possibility would be to treat boundary=administrative as a special case, and always use the content in name … but that is rather unsatisfactory in that it would keep all the flaws currently in the dataset for administrative boundaries (inconsistent separators, etc.).

Another possibility would be a third optional tag named languages:self (or similar) which would be used for the administrative boundary itself. I’m not entirely sure I love that either, but it would work …

Another possibility is a type=languages feature, separating language from admin boundaries entirely, but that seems even less desirable as it means adding a lot of these, and usually they will share all the nodes of an admin boundary except in these specific cases … and then there would need to be an ad-hoc type=languages feature just “inside” the admin boundary itself. Yuck.

I’ll keep thinking on this, but if you have any suggestions please fire away. Currently my favoured answer would be languages:self.

That one works because the town with the English/Gaelic name form is a node within the administrative district. It would fall into the “Special case: defined on a feature” section of the proposal. I will add that there in a future revision.

aseigo · November 13, 2024, 4:29pm

Good question!

default_language does not cover:

general multi-lingual cases well / at all, which is precisely what this proposal is focused on resolving. Monolingual cases are a bonus, but not really a concern as name already services those regions just fine.
situations where the preferred language ought to fall back to another official language if name:<preferred language> is missing (see the Haida Gwaii example in the proposal)
it does not help editing software nearly as much, due to the “one language” concept

default_language could perhaps be retrofitted to behave similarly to the proposal, assuming that changing the content to a semicolon-separated list (or equivalent) would not break current uses of it.
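A small sketch of what reading such a retrofitted value might look like, assuming a semicolon separator (the usual OSM convention for multi-valued tags). The function name is hypothetical. An existing single-code value parses to a one-element list, so a list-aware consumer would handle old and new values alike.

```python
def parse_language_list(value):
    """Parse a default_language-style value into an ordered list of codes."""
    codes = []
    for part in value.split(";"):
        code = part.strip()
        if code and code not in codes:  # skip empty parts and repeats
            codes.append(code)
    return codes


single = parse_language_list("fr")       # existing single-language form
multi = parse_language_list("fr; de")    # hypothetical list form
```

The compatibility risk sits on the other side: a consumer that compares the raw value against a single code would stop matching as soon as a list appears.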

TBH, I’m not sure what the OSM policy is on changing the format of a value?

It would still benefit from a preferred-vs-default tag, so there would be at least one new tag still. I did consider just encoding the whole preferred/default in a single key/value pair, but the results were anything but pretty, and how inheritance would work was even harder to describe.

In any case, there are 708 uses of default_language in the database, so it’s not a particularly large set of tags … an automated update of them could be done, assuming there is any useful outcome from this proposal.

okainov · November 13, 2024, 4:50pm

But that’s exactly where this would cause issues. I don’t think order should matter at all tbh, this just adds extra complexity.

okainov · November 13, 2024, 4:51pm

Please add the link to this thread instead. No one uses the wiki to discuss stuff nowadays imo…

okainov · November 13, 2024, 4:51pm

Copying my post from the linked thread as it’s more relevant here:

I looked at the recent proposal and I got the problem statement… But I didn’t get how the proposal is going to improve the situation with name alone.
If the feature has all kinds of name:lang tags and has no “name”, all good. But what about when it has name? Do we expect it to exist in general? To contain multiple languages? This is not clear from the proposal text…