How can I distinguish Nominatim failing in general from Nominatim failing to process queries with Japanese/Chinese characters?

Sometimes the public Nominatim server responds with a 503 error because it is overloaded by badly behaved scrapers.

But some searches (I noticed it on ones with Japanese and Chinese characters) also fail with a 503 error, apparently for a different reason.

For example, CN, 210005, 南京, 江苏省南京市秦淮区汉中路1号南京国际金融中心1&2楼 fails with a 503 error while others work fine.

Is it possible to distinguish these cases, so that not both end up failing with a timeout error? (See Configuration Settings - Nominatim Manual.)

Or, if a search fails with a 503 error, should I retry a query known to work, to check whether Nominatim is down altogether?
Or should I avoid any searches with Japanese/Chinese characters? (Are other scripts also affected?)

I assume you need to split your search term for Nominatim; otherwise it’s too complicated and it replies directly with a 503.

province: 江苏省
city: 南京市
district: 秦淮区
street: 汉中路
house number: 1号
name: 南京国际金融中心
floor: 1&2楼

南京国际金融中心,1&2楼,1号,汉中路,秦淮区,南京市,江苏省,210005,CN works for Nominatim
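If you have the address already decomposed like this, you can also pass the components to Nominatim’s structured search endpoint instead of one comma-joined free-form string. A stdlib-only sketch that just builds the request URL; the exact assignment of this address to the `street`/`city`/`state` parameters is my own guess, and the floor/unit part is left out on purpose:

```python
from urllib.parse import urlencode

# Nominatim /search structured query parameters: street, city, county,
# state, country, postalcode, amenity (must not be mixed with q=).
params = {
    "amenity": "南京国际金融中心",
    "street": "1号 汉中路",
    "city": "南京市",
    "state": "江苏省",
    "postalcode": "210005",
    "country": "CN",
    "format": "jsonv2",
}
url = "https://nominatim.openstreetmap.org/search?" + urlencode(params)
```

The CJK components are percent-encoded by `urlencode`, so the resulting URL is safe to use with any HTTP client.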

Edit: If you leave out floor and house number, Nominatim even finds it. :wink:

well, in my case sadly I don’t have enough info to decompose the address automatically, or for some other queries it also fails in decomposed form,

while some work without decomposition,

and some work, but not right now (and it would be nice to distinguish those from cases that are not worth retrying)

A 503 error rarely has to do with servers being overloaded. You usually get a 504 when that happens.

Nginx sends back a 503 when you are temporarily blocked for going over the rate limit of 1 req/s. And the Nominatim software itself sends back a 503 when a SQL request to the database takes too long. You can distinguish those two by looking at the body: nginx usually sends some elaborate HTML message, while from the internal process you get a simple text back with the message ‘Query took too long to process.’ (unless you requested debug output, in which case the debug output is still available in the body).
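Based on that description, a client could sketch the distinction roughly like this. The function name and the HTML heuristic are my own; only the ‘Query took too long to process.’ marker comes from the explanation above:

```python
def classify_503(body: str) -> str:
    """Rough classification of a 503 response body from the public Nominatim server."""
    if "Query took too long to process" in body:
        # Internal Nominatim timeout: usually a problem with this
        # specific query, so retrying the same query rarely helps.
        return "query-timeout"
    if body.lstrip().lower().startswith(("<html", "<!doctype")):
        # nginx rate-limit page: back off and retry later.
        return "rate-limited"
    return "unknown"
```

A “query-timeout” result suggests skipping or simplifying the query, while “rate-limited” suggests waiting and retrying as-is.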

I agree that 503 might not be the best error code for the latter case, because the failure is permanent more often than not. (It can occasionally happen because another query has put a lock on the database, but that is really rare.) So what would be a better error code? A simple 500? Or maybe 507 Insufficient Storage?

As for Chinese script, that is a known issue with logographic scripts without word spaces. Commas will certainly help. Also, don’t even try addresses that include unit or floor numbers; Nominatim cannot resolve those. Once you remove the 1&2楼 part of the query, you get a nothing-found response pretty quickly (which, to be honest, smells like a bug in Nominatim).

Maybe 413 Content Too Large - HTTP | MDN ?

507 Insufficient Storage - HTTP | MDN is claimed to describe temporary problems, though the 400 series describes client problems and this is a server-side inefficiency.

422 Unprocessable Content - HTTP | MDN also sort of fits.

507 looks best

Is inventing a new error code a sane solution?

I definitely got it a few times when Nominatim was failing altogether, for others as well.

See e.g. Occasional error 503 on searching for a place name

oh, thanks - as a workaround I will try to ban all such queries, and rather query with the floor stripped out

I will also try to look at other failing examples

in the meantime I will probably skip all requests including Japanese and Chinese characters

would it make sense to create an issue for that on the Nominatim tracker?

(my current workaround is

import regex

def is_affected_by_asian_bug(text):
    # matches any run of Han (Chinese), Bopomofo, Hiragana or Katakana characters;
    # the regex module treats str patterns as Unicode by default
    pattern = regex.compile(r'[\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+')
    return bool(pattern.search(text))

In my case I am still trying to get minimal end-to-end processing working reliably, so I am fine with skipping processing even for 99% of the world)

Ah true, the case where the request has been in the waiting queue for too long. That is a 503. So it might not be completely trivial to change the error code for the other cases. I’ll have a look.

Japanese addresses should be working okayish. We’ve added a split algorithm for them as part of GSoC 2023. In any case, Hiragana and Katakana are not an issue; it’s just long sequences of Kanji/Han script. Korean might be another candidate for trouble, but I haven’t had a chance yet to learn more about that.
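Given that only long unbroken Han runs are the problem, the skip-everything workaround earlier in the thread could be narrowed so kana-only (and short-Han) queries still go through. A stdlib-only sketch: the range check covers just the basic CJK Unified Ideographs block, and the threshold of 4 is my own guess, not something from Nominatim:

```python
def longest_han_run(text: str) -> int:
    """Length of the longest unbroken run of Han characters in text."""
    def is_han(ch: str) -> bool:
        # Basic CJK Unified Ideographs block only; extension blocks
        # (e.g. U+3400-U+4DBF) are omitted for brevity.
        return '\u4e00' <= ch <= '\u9fff'

    run = best = 0
    for ch in text:
        run = run + 1 if is_han(ch) else 0
        best = max(best, run)
    return best

def is_affected_by_asian_bug(text: str, threshold: int = 4) -> bool:
    # Assumed threshold: flag only queries with 4+ consecutive Han characters.
    return longest_han_run(text) >= threshold
```

With this version, a Hiragana-only query like こんにちは is no longer flagged, while the long Han sequence in the failing example above still is.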

In the meantime I’ve found the root cause for your failing query. The word statistics on the server are outdated after more than a year of applying updates. Updating them now.

I’m not sure 507 tells a normal user the right thing. They might consider “storage” to be their fault, i.e. their hard disk or RAM. In my understanding it can also mean something like too many requests in that corner of the network (too many payloads coming towards a server), which identifies as storage as well. Perhaps the user doesn’t want a detailed message and just needs to know to try again later, without a number.

these errors should be intercepted by data consumers and clearer ones shown to users,

but ideally the API would give the data consumer clearer feedback about what went wrong, to distinguish overall server problems from a specific query being problematic