It's all in the details

So here I was looking at what values were in use for the key cemetery.

And I was wondering why there were two lines for ‘cillin’. It actually took me quite a while before I saw the difference between ‘cillin’ and ‘cillín’.

Just in case you don't see the difference,

it is all in the second i: i versus í

With cillín being the version documented on the wiki. :slightly_smiling_face:

Likely just a typo or the person’s device couldn’t handle the fada (accent). I see that you have fixed them. :slight_smile:

For those wondering what a cillín is: Cillín - Wikipedia

Well, errors like these are what you get when you ignore the convention to only use ASCII characters for keys and classification values. :man_shrugging:

In TagInfo, you can see the exact Unicode names of the characters:

=cillín

vs =cillin

While i vs í is visible, homoglyphs (letters that look the same but are different) are common between Latin & Cyrillic. The TagInfo feature was added after trying to differentiate between addr:city=Русе & addr:city=Pyce :wink:

Who else is glad that the convention to tag in British English has avoided ı versus i versus İ incidents? :man_raising_hand:
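
For those who have not run into these incidents, a small sketch of how they tend to arise, assuming software that changes case with locale-independent rules (as Python's `str.upper()` and `str.lower()` do). The function name `round_trip` is hypothetical.

```python
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

def round_trip(text):
    """Upper-case and then lower-case a string with Python's default rules."""
    return text.upper().lower()

# The dotless i comes back as a plain ASCII i: the original letter is lost.
print(repr(round_trip(DOTLESS_SMALL_I)))

# The dotted capital lower-cases to "i" followed by a combining dot (U+0307),
# so it no longer compares equal to a plain "i" either.
print(repr(DOTTED_CAPITAL_I.lower()))
```

Any tag value containing these letters would survive or not depending on which case-folding rules each data consumer happened to apply.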

i’d say it’s more likely they don’t know how to input accents on their device, not that the device can’t handle them. It’s a common enough character to be supported on pretty much all devices in 2026.

saying that, there are lots of fonts out there that don’t handle ŵ and ŷ

There’s currently only 200 of these in OSM. Wouldn’t the prudent course of action be to correct the obvious deviation from our naming conventions now and go to cemetery=cillin?

Tagging @b-unicycling who likely mapped most of them.

Is it fair to say that primary feature keys and iterative refinement of them should stick to ASCII as much as possible? We have plenty of non-ASCII in secondary keys. It’s unavoidable in freeform text keys of course.

Among keyword-based secondary keys, non-ASCII values are somewhat common in cuisine=*. English has more of a tendency to preserve diacritics when borrowing foreign terms describing foods. The most commonly tagged one is probably açaí. (There’s that acute on the I again.) There’s a long tail of other values that lack common English translations or anglicizations, often from Turkish, Hungarian, and Vietnamese cuisine: çiğköfte, kürtőskalács, bánh_xèo, etc. I’m unaware of a major problem so far. These values are still rare enough that data consumers don’t need to special case them. For humans, these alphabets are so laden with diacritics that you can tell at a glance that it isn’t ASCII.

Could we strip out the diacritics anyways, for simplicity’s sake? Yes, but not without sacrificing some clarity. Diacritics easily form minimal pairs in some alphabets. The other day, I spotted a new Vietnamese sandwich shop. Inside, the décor shows you how it’s made:

It’s a new twist on an old classic. The peanuts and shallots are unusual but lend extra depth to the flavor profile. Maybe you’re wondering what kind of “meat” it has. A traditional bánh mì might have cold cuts, thịt nguội. This one instead has thịt người – human flesh.
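
The collision is easy to reproduce. This is only a sketch of one common way of stripping diacritics (decompose to NFD, then drop combining marks), not a claim about how any particular OSM tool does it; `strip_diacritics` is a made-up helper name.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose to NFD, then drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

cold_cuts = "th\u1ecbt ngu\u1ed9i"        # thịt nguội
human_flesh = "th\u1ecbt ng\u01b0\u1eddi"  # thịt người

print(strip_diacritics(cold_cuts))
print(strip_diacritics(human_flesh))
print(strip_diacritics(cold_cuts) == strip_diacritics(human_flesh))
```

Both phrases collapse to the same ASCII string, so the distinction cannot be recovered from the stripped value alone.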

Yeah, similar happens in Irish: Irish speakers have long pushed for the use of the fada to be guaranteed - it may happen soon

“The síneadh fada/diacritic mark is no trivial matter in linguistic terms. This small mark can be the difference, for instance, between a slice of delicious cake, cáca, or what you have to pick up after your dog does his business, caca.”

To avoid any confusion, in school we were taught “cáca milis” - sweet cake.

I don’t really see a problem with tagging specialized local features using the proper non ASCII characters. The people who add/edit them are usually going to be locals who know the proper spelling and are able to use the correct characters, or people with special interests in these features who can also be expected to care about the proper spelling. Anyone else who might not be as familiar with the type of feature is going to have to look it up in the wiki anyway and might as well copy paste the tag value from there.

Forced transliteration of accented characters into similar looking ASCII characters is not a good idea. I’m sure there are plenty of cases where that leads to the original meaning getting lost or transformed and it’s also a little disrespectful of the local culture.

If only ASCII characters have to be used (for the main tags), the best way to do that is to describe the feature in more generic terms in plain english, e.g. cemetery=unbaptised_children. This is also a good idea if the type of feature is more widespread and also occurs in other places, but under different names. If there are particular local specializations that aren’t covered by the more generic description, the specific local name can be added as a sub-tag (*=cillín)

And here we are, talking about the problems.

This is working under the assumption that keys and values in OSM have some literal meaning. They do not. Or if they do, then only literally in OSMenglish. They have the meaning we give them, usually in the wiki. highway=unclassified can be used on streets with an official classification. power=plant has nothing to do with photosynthesis, even though it confuses the hell out of me every time I see the tag. And cuisine=thit_nguoi is unlikely to be misinterpreted either, as thịt người is better tagged with cuisine=cannibalic in OSMenglish.

We are an international project. Using diacritics and non-Latin scripts in tags that are not of a name type is asking for trouble.

As far as I know, the only non-ASCII values that have been documented for cuisine=* so far are açaí and kürtőskalács. I’m not surprised about açaí, since açaí bowl vendors have always been very consistent about the spelling. No technical concerns came up when it was proposed for inclusion in id-tagging-schema, only that it was still fairly rare at the time. If someone had instead proposed cemetery=cillín, who knows, maybe someone would’ve said something, because the lone diacritic is easier to overlook and the cemetery=* tagging scheme is much more boring.

To me, the rest of the non-ASCII cuisine=* values feel like the mapper needed to capture some detail and wanted the community to sort it out later if they really need ASCII values. I appreciate that a mapper can any-tag-you-like a niche fact about the world. The mapper needs to unambiguously communicate their intent to the global anglophone community in order for a proper OSM English translation to arise later with any confidence. In the meantime, it isn’t a major problem if data consumers refuse to support the tag or if some unnormalized tags float around. The tagging scheme has enough entropy as it is. I suppose an alternative would be for the mapper to tag cuisine:wikidata=*, similar to when someone encounters a rare denomination:wikidata=*. No one expects a data consumer to process rare denomination QIDs either.

The bigger issue for cuisine=* is still that it’s filled with every kind of specialty, from well-recognized regional/ethnic cuisines to individual dishes. It’s the latter that tends to produce these unadapted loanwords, because it isn’t really a classification scheme. Until we settle on a solution to that problem, good luck distinguishing bánh bò and bánh bó: both are plausible as a shop’s specialty, and neither has an adapted word in British English, let alone OSM English. If we need to prioritize internationalization by avoiding diacritics, then Q5004795 and Q5004796 are at least as useful as cow_cake_or_crawling_cake[1] and packed_cake[2], respectively.


  1. Bánh bò could literally mean either and there’s little agreement on the real etymology. ↩︎

  2. English doesn’t have a word for bánh, but “cake” is one common translation. ↩︎

It’s worth noting that the approach introduced in Carto 6.0.0 for only rendering shop and office values above a usage threshold also rejects non-Latin characters (only alphanumerics plus “_;-” are accepted). This is primarily a cheap and cheerful approach to avoiding SQL injection problems, but is based on the principle that tag values should be OSMenglish. So shop=açaí would not be rendered even if above the usage threshold (25).
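
As a rough illustration of the behaviour described above (not Carto's actual code, which lives in its SQL and style files), the check can be sketched like this. The pattern, the names `SAFE_VALUE` and `would_render`, and the reading of "above" as strictly greater than 25 are all assumptions.

```python
import re

# Assumed whitelist: ASCII letters, digits, underscore, semicolon and hyphen.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_;-]+")

# Threshold taken from the post; "above" is read here as strictly greater.
USAGE_THRESHOLD = 25

def would_render(value, usage_count):
    """Return True if a value is both common enough and whitelist-clean."""
    is_common = usage_count > USAGE_THRESHOLD
    is_clean = SAFE_VALUE.fullmatch(value) is not None
    return is_common and is_clean

print(would_render("bakery", 1000))          # common and ASCII-only
print(would_render("a\u00e7a\u00ed", 1000))  # common, but contains ç and í
print(would_render("bakery", 10))            # ASCII-only, but too rare
```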

Even though cuisine=açaí got tabled in the schema itself, several name-suggestion-index entries have been using it for several years. On the bright side, NSI can help to shepherd the database toward a reformed tag if necessary.

Ah, crud, that also rules out the Usenet-safe alternative of cuisine=ba'nh_bo` versus cuisine=ba'nh_bo'. I promise I wasn’t trying to hack into any databases using delicious delicacies. Fortunately, no one expects OSM Carto to special-case a single culture’s bakeries and street food stands anyways. Certainly not while they’re distracted by bigger issues like name formatting. (Edit: Wrong link, sorry.)

At least one software stack is capable of handling 8-bit keywords. In Doña Ana County, New Mexico, OpenStreetMap Americana marks many desert roads with route shields based on network=US:NM:Doña_Ana, one of many keywords that the community approved back in 2022. Even ASCII’s homeland dabbles in diacritics once in a while.

More recently, OpenHistoricalMap has begun migrating from network=* to network:wikidata=* in part to better accommodate places where country codes and rigid hierarchical schemes poorly match historical reality. There have been some murmurs in OSMUS Slack about making a similar transition. This general approach could advance the language agnosticism that some here are advocating for without risking dataloss.

Anyways, all these tags are quite tame compared to pyörä_väistää_aina_autoa=*, which even eschews the standard Boolean yes in favor of jep:

Because this is something very local, we used plain text values in local language.

Without any expectation of data consumer support, there’s been little fuss about this well-used key over the past 13 years, apart from an amusing suggestion to transliterate this Finnish phrase as if it were German for the benefit of software systems. After all, what else is OSM English if not British English with Central European compatibility fixes? :smiley:

That makes no sense and is absolutely the wrong way to avoid SQL injection.
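
The usual alternative is parameter binding, where the value never becomes part of the SQL text. A minimal sketch, using Python's built-in `sqlite3` purely as a stand-in (the real rendering stack uses a different database, and the table layout and function name here are invented for the example):

```python
import sqlite3

def store_and_fetch(values):
    """Insert tag values via bound parameters and read them back unchanged."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tags (k TEXT, v TEXT)")
    # The "?" placeholders keep the values out of the SQL statement itself.
    conn.executemany(
        "INSERT INTO tags (k, v) VALUES (?, ?)",
        [("cuisine", v) for v in values],
    )
    rows = conn.execute(
        "SELECT v FROM tags WHERE k = ? ORDER BY rowid", ("cuisine",)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

tricky = ["ba'nh_bo`", "a\u00e7a\u00ed", "x'; DROP TABLE tags; --"]
print(store_and_fetch(tricky) == tricky)
```

Quotes, diacritics, and even a would-be injection string are stored and returned as ordinary data, so no character whitelist is needed for safety.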

I’m not gonna open that can of worms of the Brits oppressing the Irish by forcing their standards onto them, again. It is a regional feature that, in my opinion, deserves to have its regional spelling preserved.

if cillíní are only found in Ireland, then I believe the correct Irish spelling should be kept. Of course there’d “only be 200” if they’re found in one place.

“keys” I definitely understand, but “classification values” is a bit of a new one on me?

OSM data doesn’t have a defined schema. I’d argue that that’s why it succeeded when other projects from around the same time didn’t. Duck tagging and using “any tags you like” are both very much alive and well.

People will submit slightly different spellings of the same thing and it’s perfectly OK (as happened here) to combine these into the most popular one (the meaning is exactly the same).
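
Spotting candidates for that kind of merge can be sketched as below. The counts are invented for illustration, the function names are hypothetical, and the output is only a list of candidates for a human to review, since folding can also lump together values that genuinely mean different things.

```python
import unicodedata
from collections import defaultdict

def fold(value):
    """Reduce a value to a comparison key: no combining marks, case-folded."""
    decomposed = unicodedata.normalize("NFD", value)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()

def find_variant_groups(counts):
    """Group values whose folded keys collide; keep groups with 2+ spellings."""
    groups = defaultdict(dict)
    for value, count in counts.items():
        groups[fold(value)][value] = count
    return {key: variants for key, variants in groups.items() if len(variants) > 1}

def most_popular(variants):
    """Return the spelling with the highest usage count."""
    return max(variants, key=variants.get)

# Made-up usage counts, not real TagInfo numbers.
counts = {"cill\u00edn": 10, "cillin": 2, "lawn": 40}

for key, variants in find_variant_groups(counts).items():
    print(key, variants, "->", most_popular(variants))
```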

People (a bit like the Lizardman’s Constant) will sometimes submit utter rubbish (just this morning I reverted a change that had set a surface value to “in sometown somedistrict somelargerdistrict somecountry”). I don’t believe that implausible value was a bad-faith edit; someone just did not understand OSM.

The upshot of all this is that basically anything can be valid in a value, but that it also makes sense to try and standardise.

Personally, I’d probably have tried to internationalise that, because the concept of “describing how and when cyclists must give way to cars” is something that happens outside Finland too. However, someone contributed the key and it is now widely used there, and as it takes multiple values, not just yes or no, avoiding a simple yes actually makes some sense.

According to Wikipedia there are a few missing.

As of 2021, there were 1,693 recorded cillíní on the island of Ireland.

I take that to mean a tagging scheme that relies on a fixed set of keywords as values. (Data items call these “well-known values” as opposed to “infinite values”.) Examples include highway=*, service=*, and shop=*. Some keys like cuisine=* are slightly more flexible, allowing for what the wiki often calls “user-defined values” at the end of a long table of values. For these keys, a full proposal vote would be overkill even for the most ardent supporters of the proposal process. I suspect the guidance about ASCII is shorthand for something more nuanced about what we consider to be a “good” user-defined value, not to be taken so literally.

To be clear, I’m not against using transliterations and translations in keywords if it avoids miscommunication. We have a number of payment:*=* subkeys that would be difficult to use otherwise. However, one should always use a transliteration scheme appropriate to the language. It’s also worth considering if transliteration or translation would be a good tradeoff: incremental convenience for data consumers at the expense of intuitiveness and expressability for mappers.

This discussion could’ve been about the fact that some words have multiple conventional spellings that look alike. This happens even in the most traditional British English and the most traditional OSM English. Consolidating these spellings is a no-brainer. The wiki and editor presets can be a big help with that.