Hebrew with Niqqud (diacritics) - עברית עם ניקוד

עברית

זו הצעה פורמלית לאופן שבו יש להשתמש בעברית עם ניקוד ב OSM. ההצעה מבוססת על דיון בערוץ הטלגרם OSM Israel שהחל עם הודעה זו.

הרקע:
תוספת של ניקוד בעברית יכולה להבהיר מהו ההיגוי הנכון. מצד שני, הניקוד יכול לגרום לבעיות בחיפוש של שמות, בתצוגה של שמות ובאיחוד של אלמנטי OSM על פי שם, כפי שמקובל גם בשמות של רחובות.

ההצעה
מוצע להשתמש בסיומת השפה he-Niqqud עבור עברית עם ניקוד, בדומה לסיומות השפה zh-Hans, zh-Hant, ו- ko-Latn המופיעות בדף הוויקי name tag localization.

הפרדה כזו תתרום גם לצרכני המידע של OSM. צרכני מידע המעוניינים בניקוד ישתמשו בסיומת השפה he-Niqqud בעדיפות ראשונה וב he בעדיפות שניה. צרכני מידע שלא רוצים או לא יכולים לתמוך בניקוד, ימשיכו להשתמש בסיומת השפה he.

עוד מוצע שאין להשתמש בעברית עם ניקוד בתגיות עם סיומת he או בתגיות ללא סיומת שפה. הדבר יבטיח שחשיפה לניקוד עברי תתבצע רק עבור אותם צרכני מידע שיכולים ורוצים בכך.

English

Hebrew with Niqqud (diacritics) is the way to indicate vowels, which are omitted in modern Hebrew writing, using a set of ancillary glyphs.

The context:
Hebrew with Niqqud is useful for clarifying the pronunciation of names. On the other hand, Niqqud could cause issues with searching by name, with the presentation of names, and with name-based grouping of OSM elements, as is done for street names.

This is a formal proposal for the way Hebrew with Niqqud should be used in OSM. The proposal is based on a discussion held in the OSM Israel Telegram channel. The discussion started with this message.

The proposal
The proposal is to use a separate new language suffix he-Niqqud for Hebrew with Niqqud, similar to the zh-Hans, zh-Hant, and ko-Latn language suffixes listed in the OSM wiki page on name tag localization.

Such separation would benefit consumers of OSM data. Data consumers that want Niqqud would use the he-Niqqud as their first priority language suffix and he as the second priority. Data consumers that cannot or do not want Niqqud will continue to use the he language suffix.

It is also proposed that Hebrew with Niqqud should not be used with the :he tag suffix or when no language suffix is used. This will ensure that Hebrew with Niqqud is exposed only to data consumers that can and want to use it.
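The rule above (no Niqqud in the :he suffix or in tags without a language suffix) could be checked mechanically. Here is a minimal Python sketch, assuming that "Niqqud" means any combining mark in the Hebrew block U+0591 to U+05C7 (points and cantillation marks). The function names are made up for illustration and are not part of any OSM tool.

```python
import unicodedata


def has_hebrew_marks(text: str) -> bool:
    """Return True if text contains a Hebrew point or cantillation mark."""
    return any(
        # Category "Mn" (nonspacing mark) skips punctuation in the same range,
        # such as maqaf (U+05BE) and sof pasuq (U+05C3).
        0x0591 <= ord(ch) <= 0x05C7 and unicodedata.category(ch) == "Mn"
        for ch in text
    )


def find_pointed_plain_tags(tags: dict) -> list:
    """List the keys that should stay Niqqud-free but contain marks anyway."""
    plain_keys = ("name", "name:he")
    return [key for key in plain_keys if has_hebrew_marks(tags.get(key, ""))]
```

A QA script could run `find_pointed_plain_tags` over an element's tags and flag any key it returns.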

Great idea.

I suggest using

he-diacritics
or
he-nikud

I wouldn’t realize what niqqud means, even though I speak Hebrew.

Danny

The language tags in OSM follow the BCP 47 standard. That in turn refers to ISO 15924 for script tags such as “Hant” or “Latn”; that standard doesn’t appear to distinguish Hebrew with and without Niqqud (they’re both “Hebr”).

In order to follow BCP-47, the tag used for “Hebrew with Niqqud” would need the private-use marker: something like he-x-Niqqud or he-Hebr-x-Niqqud.
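To make the private-use rule concrete: the RFC 5646 grammar defines `privateuse = "x" 1*("-" (1*8alphanum))`. The sketch below checks only that part of a tag; it does not validate the language or script subtags in front of it, and `private_use_ok` is a hypothetical helper name.

```python
import re

# privateuse = "x" 1*("-" (1*8alphanum)) in the RFC 5646 ABNF.
PRIVATE_USE = re.compile(r"x(-[A-Za-z0-9]{1,8})+")


def private_use_ok(tag: str) -> bool:
    """Check that the tag has a well-formed private-use sequence after '-x-'."""
    head, sep, tail = tag.partition("-x-")
    if not sep:
        # No private-use marker at all, e.g. "he-Niqqud".
        return False
    return PRIVATE_USE.fullmatch("x-" + tail) is not None
```

Under this check `he-x-Niqqud` and `he-Hebr-x-Niqqud` pass, while `he-Niqqud` does not because it has no `x` marker.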

Practically speaking, any search engine these days needs to know how to perform case and diacritic folding. Even your browser’s find in page function can do that.

That said, if Hebrew is rarely written with Niqqud in practice and these marks are only used as a pronunciation guide, then a dedicated key for pronunciation guides would be appropriate. name:he-x-menuqad=* would be a more standards-compliant form for ktiv menuqad. (Hebr is presumed for Hebrew, and extension subtags must be lowercase up to eight characters long.)

If that’s too complicated, then there’s always alt_name:he=*.

(There is also name:pronunciation=*. I tried to keep it focused on IPA, but inevitably it has come to be a mix of other systems including ad hoc pronunciation respellings. In retrospect, IPA should’ve always been in name:*-fonipa=*. We’re starting to move in that direction. Keys that are explicit about their contents can be very helpful.)

I support this proposal as long as no Niqqud is introduced into existing tags. I think the proposed suffix separation (he, he-Niqqud) is a clean solution.

Even if all search engines support this properly (which I doubt), introducing Niqqud into the existing tags has too many drawbacks:

  • Makes poweruser-oriented text queries significantly harder (e.g. Overpass queries, regexes, etc)
  • Introduces Niqqud into apps which prefer not to use niqqud.
  • Causes some data loss: it is not sufficient to strip the niqqud in order to obtain the Niqqud-free version of a word, since some words are written slightly differently without niqqud.
  • Causes inconsistencies in name and name:he, where some users would add Niqqud and others wouldn’t.
  • Introduces those same inconsistencies into tags that point to the name tags, such as address tags.
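The data-loss point in the list above can be illustrated with a short Python sketch. It assumes that stripping niqqud means removing combining marks in the Hebrew block U+0591 to U+05C7, and it uses the word for "blouse" as the example, which to my knowledge is normally written with an extra vav when unpointed. Strings are given as escapes to avoid bidi rendering problems.

```python
import unicodedata


def strip_niqqud(text: str) -> str:
    """Remove Hebrew points and cantillation marks from text."""
    # NFD first, so precomposed presentation forms are split into letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch
        for ch in decomposed
        if not (0x0591 <= ord(ch) <= 0x05C7 and unicodedata.category(ch) == "Mn")
    )
    return unicodedata.normalize("NFC", kept)


# Pointed spelling: het, qubuts, lamed, sheva, tsadi, qamats, he.
POINTED = "\u05d7\u05bb\u05dc\u05b0\u05e6\u05b8\u05d4"
# Usual unpointed spelling: het, vav, lamed, tsadi, he.
UNPOINTED = "\u05d7\u05d5\u05dc\u05e6\u05d4"

# Stripping drops the marks but cannot add the vav.
print(strip_niqqud(POINTED) == UNPOINTED)
```

The stripped result has four letters while the usual unpointed spelling has five, so a consumer cannot derive name:he from a pointed value by stripping alone.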

I agree this should be a separate tag and no nikud should be used in the default name or name:he tags so that things like this don’t happen:

(This was already moved to a different tag so now renders normally)

OK, as long as it isn’t he-Niqqud specifically. If you want to be able to type a short he-niqqud, you can propose it for inclusion in the IANA Language Subtag Registry. Until then, you’ll need to use he-x-niqqud. Other language communities have seen success using this approach on a temporary basis.

My 2 cents:

Yes, adding it in a separate tag sounds good. name:he-x-niqqud is as good as any.

Worth noting but not relevant for this discussion: most renderers are currently really bad at rendering niqqud. This is a separate issue that must be fixed by each of them, and it’s not going to be easy. Some renderers don’t even handle RTL correctly, and this is an extra complication on top of that.

Bikeshedding the spelling of 'niqqud'

Personally I think “nikud” or “nikkud” feels better than “niqqud”. However, Wikipedians have already brought this up and seemed to have settled on “niqqud” one way or another. I’m happy to just go with that.


For the handful of cases where niqqud is necessary to clarify pronunciation, such as “נחל הרועה” (English Wikipedia, OSM sample), I would even go as far as suggesting that the name and name:he tags should include minimal niqqud, like “נחל הרועָה” – just a kamats to differentiate the female form from the male form “נחל הרועֶה”. But then, this might necessitate a separate tag which is guaranteed to never have niqqud. For example:

name=נחל הרועָה
name:he=נחל הרועָה
name:he-x-niqqud=נַחַל הָרוֹעָה
name:he-x-no-niqqud=נחל הרועה

Which is definitely over-complicating it and a bad idea. So I’m interested in others’ opinions on this edge case.

IMO, allowing any use of Niqqud in name or name:he nullifies the value of this proposal.

There are too many drawbacks to adding niqqud to the existing tags, many of which were mentioned in previous posts. That includes breaking renderers.

I think your proposal is more sound and backward-compatible if you flip the roles:

  • name and name:he must have no niqqud, maintaining the status quo.
  • name:he-x-niqqud as proposed by @zstadler.
  • name:he-x-minimal-niqqud includes only the minimal necessary niqqud and is used only when the pronunciation of the niqqud-free version is ambiguous.

Thank you for investigating the Wikipedians’ decision! I share your personal taste on this, but since it’s an insignificant matter I think we should follow Wikipedia’s decision solely for the sake of consistency.

Agree with:

  • keeping name and name:he as-is and diacritics free
  • using :he-x-niqqud for diacritics

Yes, that sounds good. I don’t see anyone objecting, right? The only minor point of contention is the exact tag:

  • name:he-x-niqqud
  • name:he-x-menuqad
  • name:he-x-Niqqud (start with capital?)
  • name:he-x-nikud (alt spelling)
  • etc…

We could make a poll for that, or just go with name:he-x-niqqud unless anyone has a real problem with it? (@Dindia, I think you’d probably figure it out, especially based on the contents of the tag.)

As for name:he-x-minimal-niqqud, there are very few cases where it’s needed but I think it can be invaluable in those cases. So IMO it should also be documented, only to be used when it’s actually useful. Renderers that can handle niqqud may choose to use it instead of name:he when it exists.

Hi NeatNit.

I totally support your initiative, but I have a real problem with using the letter q where it’s not supposed to be. This includes Qatzrin, Qiryat Yovel as well as niqqud.
The most common spelling by far is nikkud, and that is what, in my opinion, should be in OSM as well.

Danny

I think the closer we are to something that can be incorporated into the standard in the future by removing the x- subtag, the better. The standard can be found here: https://www.rfc-editor.org/rfc/rfc5646.txt

I’ve skimmed the standards and here are some findings.

The standard appears to encourage Title case for the “script subtag”. So Nikud is better than nikud.

Furthermore, it appears a “script subtag” MUST be exactly 4 letters (Hence zh-Hans, zh-Hant, and ko-Latn). Here is the list of currently standardized script subtags: ISO 15924 Alphabetical Code List

I’m also wondering if what we’re doing here qualifies as a “Script” or something more refined (Variant subtag? Extension subtag?). Wikipedia references this Unicode chapter about Middle Eastern languages https://www.unicode.org/versions/Unicode15.0.0/ch09.pdf , it treats both Hebrew Nikud and Arabic Tashkil as “vocalizations”. So, if I read this pedantically, Nikud is just an optional “vocalization” on top of the Hebrew script and not a separate script. In contrast, Japanese Hiragana uses totally different letters than Katakana, and therefore they qualify as separate scripts (Subtags Hira, Kana).

But then again, there is a subtag for Hiragana-Katakana combos (Hrkt) so it doesn’t appear there is a 1-to-1 correspondence between letter systems and script subtags, and maybe it’s completely fine to have Nikud in a separate script tag (Hebrew letters + vocalizations).

I think we’d need to consult someone highly familiar with the standards if we want this done in the future. In any case, we can choose whatever we’d like as long as the x- is there in the meantime!

Update: BCP 47 explicitly says orthography may be registered as “Variant subtags”.

Line 2594 in https://www.rfc-editor.org/rfc/rfc5646.txt :

Dialect or other divisions or variations within a language, its
orthography, writing system, regional or historical usage,
transliteration or other transformation, or distinguishing
variation MAY be registered as variant subtags.

I think Nikud qualifies here. If this is correct, then the rule is 5 to 8 characters. Line 252:

variant = 5*8alphanum

Here are some random examples (extracted from the subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry )

basiceng - Basic English
pahawh3 - Pahawh Hmong Third Stage Reduced orthography
oxendict - Oxford English Dictionary spelling
moderat - The moderate (conservative, i.e. Danish-like) spelling
tongyong - Tongyong Pinyin romanization
bohairic - Bohairic dialect of Coptic
vecdruka - Latvian orthography used before 1920s ("vecā druka")

So, something in that spirit would be:

  • he - Hebrew language, Hebrew script, unspecified variant (already in standard).
  • he-nikud - Variant is explicitly with nikud
  • he-nonikud - Variant is explicitly no nikud
  • he-minnikud - Variant is explicitly the minimal necessary nikud for correct vocalization, and sticks to the 8 character practical limit.

Specifically for OSM: we would not use he-nonikud at all, and we would strictly prohibit nikud in he. This would be completely compatible with the underlying standard.

As for the Nikud vs Niqqud vs Nikkud debate, I would go with nikud simply because minnikud is within 8 characters. (I also personally like this since it’s written just like it is vocalized).
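A quick way to compare the candidate spellings against the variant length rule is a small Python check. This is only a sketch: the full RFC 5646 rule is `variant = 5*8alphanum / (DIGIT 3alphanum)`, and passing it says nothing about whether a subtag would actually be accepted for registration. `is_variant_shaped` is a made-up name.

```python
import re

# variant = 5*8alphanum / (DIGIT 3alphanum) in the RFC 5646 ABNF.
VARIANT = re.compile(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}")


def is_variant_shaped(subtag: str) -> bool:
    """Return True if subtag has the shape of a BCP 47 variant subtag."""
    return VARIANT.fullmatch(subtag) is not None


for candidate in ("nikud", "nonikud", "minnikud", "niqqud", "minniqqud"):
    # Print each candidate with its length and whether it fits the rule.
    print(candidate, len(candidate), is_variant_shaped(candidate))
```

Of these candidates only `minniqqud` (9 characters) falls outside the rule, which is the practical argument for the shorter spelling.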

Update: Specified that the formal maximum is 8 characters.

I have come to the same conclusions :)

So to sum up, the OSM tags to be used immediately would be:

name:he and name with no nikud, ever;
name:he-x-nikud with full nikud;
name:he-x-minnikud with only partial/minimal nikud when name:he is ambiguous

Most renderers will by default use the no-nikud variant. Interested renderers may choose to display the other two options when they are available.
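For data consumers, the fallback could look roughly like the Python sketch below. The tag keys are the ones summed up above; the order between full and minimal nikud for an interested consumer is my assumption, and `pick_hebrew_name` is a hypothetical helper, not an existing API.

```python
def pick_hebrew_name(tags: dict, want_nikud: bool = False):
    """Pick the Hebrew name to display, or None if no candidate tag is set."""
    if want_nikud:
        # Assumed priority: full nikud, then minimal nikud, then plain Hebrew.
        order = ("name:he-x-nikud", "name:he-x-minnikud", "name:he", "name")
    else:
        # Default consumers never look at the nikud tags.
        order = ("name:he", "name")
    for key in order:
        value = tags.get(key)
        if value:
            return value
    return None
```

A renderer that cannot shape combining marks well would call it with the default `want_nikud=False` and never see a pointed string, as long as mappers keep name and name:he nikud-free.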

(Edit: of course, these can be used wherever language suffixes are used, not just name.)

Give me a :-1: if you have any objection to this proposal.

Worth noting that minniqud is also within the 8 character limit but no one so far has rooted for x-niqud, so right now x-nikud is the top candidate unless objections are raised.

In any case, if we standardize this in BCP 47 in the future, I think we should get some Hebrew expert to at least have a glance, as the choice would live in computer systems until the end of time. If niqud or something else is deemed better, we could transition to that the moment we register formally and remove the x-.

Let’s try not to over-engineer things.

I find it hard to see mappers entering he-x-minnikud/he-x-minniqqud in addition to or instead of he-x-nikud/he-x-niqqud. I find it even less likely that data consumers will use he-x-minnikud/he-x-minniqqud .

I feel I have to repeat myself:

I believe @SafwatHalaby got a bit sidetracked and was talking about a potential future addition to the upstream language code spec. The OSM tags would only be as I said, which is almost identical to your initial proposal.
