Arabic diacritics and nominatim search results

SafwatHalaby · June 15, 2017, 7:04am

Arabic words can be written in full form, e.g. چِلبُوَع, or in a form without diacritics, e.g. چلبوع. The latter form is by far dominant in the digital world, because typing and processing it is much easier.

In OSM, both forms are used, and Nominatim appears to have no awareness of this: a tag written in one form is not found when the other form is queried. I think this is a serious issue.

Has this been discussed before?

I am willing to try to create a bot that strips all the diacritics globally, if that’s deemed to be the proper solution. But perhaps a more ideal solution would be some sort of internal database normalization.

hadw · June 15, 2017, 8:46am

I presume this also applies to the Hebrew script.

I’m not sure that this is really the best place to reach the Nominatim developers, but my guess is that few, if any, of them have any knowledge of Arabic scripts. My impression, even for the map itself, is that mapping in many Arabic-script-using countries is done by European expatriates, and one of my general concerns is the overuse of the Latin alphabet, and even partially anglicised names, where those are not the primary scripts and languages used locally. (In some of the larger countries, mapping seems to be done by an English-speaking intelligentsia.)

I’d therefore suggest that the first step is to get more Arabic/Urdu/Farsi/Pushtu users involved in OSM. From that pool, you should find a few interested in working on the Nominatim code.

Failing that, what is likely to be useful is references to standard texts (in English) and open source software, relating to the algorithms used for comparing machine readable Arabic script texts.

(Some examples of where you would expect to see almost entirely Arabic script, or at least dual Arabic and Latin, but don’t, are https://www.openstreetmap.org/search?query=pakistan#map=10/27.5679/68.3020 and https://www.openstreetmap.org/search?query=dubai#map=14/25.1774/55.2641 Iran seems to be the main exception, although Iraq also seems to be mainly Arabic, but with some exceptions.)

(As additional background on the issue here: scripts for Semitic languages, in everyday use, don’t encode short vowels (a bit like English shorthand, for those old enough to remember, which drops all the vowels). The consonants convey enough information for people to recognize the words, even though not enough to pronounce a word phoneme by phoneme. In more formal contexts, the vowels are fully encoded. I believe that is always the case for the Quran.

In Unicode, the vowel marks are coded as separate characters, and the font engine is expected to combine them with the base character.)
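To make the point about separate code points concrete, here is a small Python sketch using only the standard `unicodedata` module. The word is spelled out with escapes on the assumption that it matches the example word from the first post; the helper name `describe` is made up for illustration.

```python
import unicodedata

# Assumed spelling of the example word from the first post, with vowel marks.
# Each vowel mark is its own code point following the consonant it modifies.
WITH_MARKS = "\u0686\u0650\u0644\u0628\u064F\u0648\u064E\u0639"


def describe(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c))
        for c in text
    ]


for row in describe(WITH_MARKS):
    print(*row)
```

The base letters are reported with category `Lo` (other letter) and the vowel marks with `Mn` (nonspacing mark), which is what lets software tell them apart.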

SafwatHalaby · June 15, 2017, 8:57am

That is technically correct, but no one seems to be using Hebrew diacritics in OSM.

Additional background info: Arabic/Hebrew readers can read fine without diacritics and can intuitively deduce them on the fly based on previous knowledge of the words, and diacritics are almost always absent in printed or digital text. Diacritics are useful for people learning the language. They are also often present in works of literature (esp. where the rhyming matters, e.g. poems), religious texts, highly formal contexts, and whenever there’s a chance of mispronunciation (e.g. names of places, transliterations).

SomeoneElse · June 15, 2017, 9:11am

I’d ask the question over at Nominatim on Github: https://github.com/openstreetmap/Nominatim - that’s where the Nominatim developers are most likely to see it.

SafwatHalaby · June 15, 2017, 9:13am

If the Unicode encoding is as you describe, then the algorithm is absolutely trivial:


bool compare_arabic_or_hebrew(str1, str2)
{
      return str1.strip_all_diacritic_chars() == str2.strip_all_diacritic_chars()
}

Edit: This works for Arabic, but not necessarily for Hebrew.
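For reference, a runnable Python sketch of the pseudocode above. It assumes that "diacritic" means only the eight marks U+064B (FATHATAN) through U+0652 (SUKUN); whether that is the right set is exactly the open question in this thread. The function names are illustrative and are not part of Nominatim.

```python
# Assumption: only U+064B..U+0652 (FATHATAN through SUKUN) count as diacritics.
ARABIC_MARKS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def strip_arabic_marks(text):
    """Remove the assumed set of Arabic vowel marks from text."""
    return "".join(c for c in text if c not in ARABIC_MARKS)


def compare_arabic(str1, str2):
    """Compare two strings while ignoring the assumed Arabic vowel marks."""
    return strip_arabic_marks(str1) == strip_arabic_marks(str2)


# The example word from the first post, with and without marks (assumed spelling).
with_marks = "\u0686\u0650\u0644\u0628\u064F\u0648\u064E\u0639"
without_marks = "\u0686\u0644\u0628\u0648\u0639"
print(compare_arabic(with_marks, without_marks))
```

Hebrew is deliberately left out, in line with the edit note above.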

SafwatHalaby · June 15, 2017, 9:16am

Will do. I posted here in an attempt to know if there were any previous discussions within the community, since I couldn’t find any.

hadw · June 15, 2017, 10:28am

Google returns nothing for strip_all_diacritic_chars, and the Unicode description of the Arabic characters (The Unicode Standard, Version 4.0, ISBN 0-321-18578-1) doesn’t seem to classify a subset of them as diacriticals, so for someone not familiar with Arabic you need to define this function in more detail. If you interpret a diacritical as any character whose sample glyph appears next to a dotted circle in the Unicode specification, there seem to be several that are normally included in text.

Also, as specified, it will strip “diacriticals” in other scripts. That may or may not be desirable.

More subtly, Arabic script is extensively used for languages that are phonetically very different from Arabic, and where vowels are much more important to the meaning. I suspect that diacriticals are also omitted in everyday use in those languages, but the validity of deliberately stripping them in, say, Farsi, Urdu or Bahasa Malaysia might be rather different, so it needs to be considered. (According to Writing Systems of the World (ISBN 0-8048-1293-4), vowels are rarely marked in Urdu.)

From an engineering point of view, there is presumably an index that needs adding or rebuilding, although that can be handled, given time.

hadw · June 15, 2017, 11:10am

The Unicode database does define diacritics: ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt

Is this the exact set of characters you would want ignored in all languages using Arabic script?

064B..0652 ; Diacritic # Mn [8] ARABIC FATHATAN..ARABIC SUKUN
0657..0658 ; Diacritic # Mn [2] ARABIC INVERTED DAMMA..ARABIC MARK NOON GHUNNA
06DF..06E0 ; Diacritic # Mn [2] ARABIC SMALL HIGH ROUNDED ZERO..ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
06E5..06E6 ; Diacritic # Lm [2] ARABIC SMALL WAW..ARABIC SMALL YEH
06EA..06EC ; Diacritic # Mn [3] ARABIC EMPTY CENTRE LOW STOP..ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE

This does include vowels, but doesn’t include all the “dotted circle” characters.
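A rough Python sketch of how such a set could be built mechanically from PropList-style lines. The five ranges are the ones quoted above, written in the `XXXX..YYYY` form that PropList.txt uses; the parser name `parse_diacritics` and the trimmed-down input are assumptions for illustration, and a real tool would read the full file instead.

```python
import re

# The five Arabic ranges quoted above, reduced to "range ; property".
PROP_LINES = """\
064B..0652 ; Diacritic
0657..0658 ; Diacritic
06DF..06E0 ; Diacritic
06E5..06E6 ; Diacritic
06EA..06EC ; Diacritic
"""

# Matches either a single code point or a "start..end" range with the
# Diacritic property; comment lines and other properties are skipped.
RANGE_RE = re.compile(
    r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*Diacritic\b"
)


def parse_diacritics(text):
    """Return the set of characters listed with the Diacritic property."""
    chars = set()
    for line in text.splitlines():
        match = RANGE_RE.match(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2) or match.group(1), 16)
        chars.update(chr(cp) for cp in range(start, end + 1))
    return chars


ARABIC_DIACRITICS = parse_diacritics(PROP_LINES)
print(len(ARABIC_DIACRITICS))
```

Characters outside these ranges, such as the other "dotted circle" marks, are left untouched by a filter built this way.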

wowik · June 15, 2017, 11:37am

The Russian letters Ёё can be written as Ее, so the same problem should occur in word searches.

Ё - ‘CYRILLIC CAPITAL LETTER IO’ (U+0401), Е - ‘CYRILLIC CAPITAL LETTER IE’ (U+0415)
ё - ‘CYRILLIC SMALL LETTER IO’ (U+0451), е - ‘CYRILLIC SMALL LETTER IE’ (U+0435)

hadw · June 15, 2017, 12:07pm

The Russian case is technically rather different, in that you are asking for a character substitution to be made before indexing and comparison. In the Arabic (Hebrew) cases, the request is to have the (vowel etc.) character deleted completely, before comparison.

For the proposed method to work here, the Ё would need to be encoded as Ё U+0415 U+0308, which may or may not display as well. (ё and ё display more obviously alike on my browser).

The main standard does define this equivalence, so you could pre-process it mechanically into the explicit diacritic based on rules parsed from the specification.
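In Python, that mechanical pre-processing can be sketched with the standard `unicodedata` module: canonical decomposition (NFD) turns the precomposed letter into base letter plus combining mark, after which the mark can be dropped like an Arabic vowel mark. The function name `fold_marks` is made up for this example, and dropping every `Mn` character is an assumption that is probably too aggressive for real use.

```python
import unicodedata


def fold_marks(text):
    """Decompose text (NFD) and drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# U+0401 decomposes into U+0415 followed by U+0308 (combining diaeresis).
print(unicodedata.normalize("NFD", "\u0401") == "\u0415\u0308")
print(fold_marks("\u0401") == "\u0415")
```

One caveat with this blanket approach: й (U+0439) also decomposes, into и plus a combining breve, so it would be folded to и as well. A per-language list of marks to strip would avoid that.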

SafwatHalaby · June 15, 2017, 12:38pm

I would say it’s almost the same. Strip some chars before indexing or comparison in Nominatim. The only difference is stripping rather than swapping, which is a programmatically negligible difference.

The Hebrew case is more complicated: stripping some diacritics requires adding letters. But Hebrew can be ignored, since no one is using diacritics for Hebrew in OSM.

I think I can easily provide a list of all Unicode characters to be stripped. Visually, they are all detached lines or circles above or below characters. I can later look at the list you provided to tell you if that’s all there is. The function would check whether each char is in the list and, if so, remove it from the string.

I don’t know how those other writing systems (Farsi, Urdu, etc.) work, or whether they use the same charset. Worst case scenario: strip exclusively for name:ar, which is guaranteed to be Arabic.
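A minimal Python sketch of that worst-case fallback, where marks are stripped only from the `name:ar` tag and every other tag is left as entered. The function `search_keys` and the plain-dict tag representation are hypothetical and do not reflect how Nominatim actually indexes data; the mark set is again assumed to be U+064B..U+0652.

```python
# Assumption: only U+064B..U+0652 are stripped, and only for name:ar.
ARABIC_MARKS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def search_keys(tags):
    """Return a copy of tags with vowel marks removed from name:ar only."""
    keys = {}
    for key, value in tags.items():
        if key == "name:ar":
            value = "".join(c for c in value if c not in ARABIC_MARKS)
        keys[key] = value
    return keys


# Hypothetical object tagged with the same marked name in two tags.
marked = "\u0686\u0650\u0644\u0628\u064F\u0648\u064E\u0639"
print(search_keys({"name": marked, "name:ar": marked}))
```

The same stripping would have to be applied to the user's query before lookup, otherwise a query typed with marks would still miss the stripped index entry.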