Discourse Translator - Pricing and Language Coverage

Forking off from We Need The Discourse Translation Plugin, I looked into the different translation services that could be used with the Discourse translation plugin.

Language coverage

Actually, I was surprised that the coverage of all services is pretty good. Part of this may be due to the Americas and large parts of Africa having English, French, Spanish or Portuguese as sole official or at least co-official languages.

DeepL


The most obvious missing language is Arabic; all of North Africa and most of the Middle East would be green otherwise.

LibreTranslate


LibreTranslate surprises with support for Arabic, Hindi and Indonesian. It has the least support for languages in Eastern Europe.

Yandex


Only very few black spots. Has support for a lot of Russian minority languages, as well as some Indian languages.

Bing


Basically no black spots except Somali and Sinhalese (Sri Lanka). Also has support for even more minority languages.

Google


No black spots, a lot of support for minority languages (though none in Russia).

Note on language coverage

For our purpose, which is that people can communicate with each other, I think we don’t need support for minority languages / dialects and probably not even all official languages. Yandex, Bing and Google support, for example, Welsh, Irish and Galician. I believe everyone in Ireland, the UK or Spain can speak English or Castilian Spanish, respectively.

Pricing

 Translation Service | Price per month
---------------------+--------------------------------------------------------------------------------
 LibreTranslate      | free if self-hosted; conditions differ if hosted ($9 flat rate, but rate-limited)
 Yandex              | $6 per 1M chars
 Bing                | €9 per 1M chars (first 2M free)
 Google              | $20 per 1M chars (first 0.5M free)
 DeepL               | €20 per 1M chars (first 0.5M free)

For comparison, this post has a length of about 3000 chars, so it would cost about 6 cents to translate into a single language with Google or DeepL.
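The per-post arithmetic can be checked with a few lines of Python. This is only a rough sketch: it uses the list prices from the table above, ignores the free tiers, and treats $ and € as roughly equal. The function name `cost_in_cents` is made up for this example.

```python
# Prices per 1M characters as listed in the pricing table (free tiers ignored;
# $ and € are treated as roughly equal for this back-of-the-envelope check).
PRICE_PER_MILLION = {"Yandex": 6.0, "Bing": 9.0, "Google": 20.0, "DeepL": 20.0}


def cost_in_cents(chars, price_per_million):
    """Cost of translating `chars` characters into one language, in cents."""
    return chars * price_per_million * 100 / 1_000_000


# A post of about 3000 characters, like the one above.
for service, price in PRICE_PER_MILLION.items():
    print(f"{service}: {cost_in_cents(3000, price):.1f} cents")
```

The cost scales linearly with both the post length and the number of target languages, so translating the same post into five languages costs five times as much.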

Quality

On quality of translation, while it may be the most important part of seamless communication, I cannot say much.
From my experience just with German → English, Google is “ok”, DeepL is almost perfect.
But maybe there are some reviews and comparisons to be found around the internet; I haven’t looked.

I have been looking around a little bit and testing with a randomly selected, non-trivial real post from the forum: Ende des Forums ... (Page 2) / users: Germany / OpenStreetMap Forum

This comparison site offers only three engines.
https://imtranslator.net/compare/
In my test Microsoft gives a reasonable translation, while Google clearly fails to get the message across.

Fun fact: In the Google search, in the automatically compiled FAQ “What is the most accurate online translator?”, Google itself answers: DeepL Translate

This site gives an extensive comparison of Google, Microsoft, DeepL and Amazon, with graphs. It was conducted by DeepL, so it might be somewhat biased, but even if you reduce every statement to half, DeepL is still the clear winner.

Finally, I tried my text directly myself (German->English):

Funnily enough, in my eyes Bing provided a better translation than DeepL. I did not expect that; while writing this post I thought the win would clearly go to DeepL.

Considering pricing, my gut feeling is that a post in the forum is about 1k chars on average. Typical posts are about 0.5k, the more elaborate ones 1.5-3k. For better numbers we would need some database statistics.

So my conclusions so far

  • I would rule out Google, the quality isn’t good enough in any of the tests
  • my personal winner is Bing: best quality and the best pricing
  • I would invite more people to play with the two translator links above and give their interpretation, especially in other languages

I would definitely go for best translation, based on current OSM forum posts. Quality of speech engines, translations and auto-completion usually differs for varying context, so we should test it with actual OSM talk.

Is it possible to get a list of all the languages used in the old forums and mailing lists (and maybe others) to get a sense of which are currently the “OSM languages”?

Maybe we can check

https://openstreetmap.community/

These are the counts for diary entries in all languages with 10 or more posts:

 language_code |     english_name     | count 
---------------+----------------------+-------
 en            | English              | 19474
 de            | German               |  2584
 ru            | Russian              |  1977
 fr            | French               |  1216
 es            | Spanish              |  1177
 pt-BR         | Brazilian Portuguese |  1019
 it            | Italian              |   576
 ja            | Japanese             |   477
 pl            | Polish               |   296
 nl            | Dutch                |   260
 hr            | Croatian             |   256
 pt            | Portuguese           |   205
 sv            | Swedish              |   203
 zh-CN         | Chinese (China)      |   158
 hu            | Hungarian            |   146
 ar            | Arabic               |   126
 zh-TW         | Chinese (Taiwan)     |   123
 ab            | Abkhazian            |   122
 th            | Thai                 |   117
 id            | Indonesian           |   113
 uk            | Ukrainian            |   102
 fa            | Persian              |    85
 zh            | Chinese              |    75
 tr            | Turkish              |    69
 ko            | Korean               |    69
 cs            | Czech                |    62
 vi            | Vietnamese           |    57
 fi            | Finnish              |    53
 da            | Danish               |    51
 sk            | Slovak               |    45
 ro            | Romanian             |    41
 gl            | Galician             |    30
 ca            | Catalan              |    28
 he            | Hebrew               |    25
 el            | Greek                |    24
 tl            | Tagalog              |    22
 sw            | Swahili              |    21
 nb            | Norwegian Bokmål     |    20
 bg            | Bulgarian            |    19
 ms            | Malay                |    14
 my            | Burmese              |    14
 be            | Belarusian           |    13
 sr            | Serbian              |    12
 lv            | Latvian              |    12
 eo            | Esperanto            |    11
 bs            | Bosnian              |    10

618 is the answer :wink:

That’s what I found in the fluxBB database of the current forum.

There is no language associated with a post message in fluxBB, but it would be possible to compute it with language detection if we need it.

We have the language selected by the users for the interface, which gives:

MariaDB [forum]> select language, count(*) from users group by 1 order by 2 desc;
+----------------------+----------+
| language             | count(*) |
+----------------------+----------+
| English              |    31566 |
| German               |     1343 |
| Russian              |      845 |
| Polish               |      233 |
| French               |      102 |
| Ukrainian            |       63 |
| Brazilian_Portuguese |       58 |
| Italian              |       39 |
| Czech                |       11 |
| Finnish              |        8 |
| Hebrew               |        7 |
| Simplified_Chinese   |        7 |
| Arabic               |        1 |
| Dutch                |        1 |
| Traditional_Chinese  |        1 |
+----------------------+----------+
15 rows in set (0.139 sec)

For active users during the last year only:

+----------------------+----------+
| language             | count(*) |
+----------------------+----------+
| English              |     3395 |
| German               |      359 |
| Russian              |      158 |
| Polish               |       69 |
| French               |       17 |
| Ukrainian            |       13 |
| Brazilian_Portuguese |        8 |
| Italian              |        8 |
| Finnish              |        5 |
| Czech                |        2 |
| Simplified_Chinese   |        2 |
| Hebrew               |        1 |
| Traditional_Chinese  |        1 |
+----------------------+----------+

But anyway, 6 cents per translation of such a post per language kind of shocked me. I didn’t know auto-translation services were that expensive. Before, I believed that the cost per post would be far in the sub-cent area (i.e. a factor of 10-100 cheaper).

@cquest since you have the data nicely arranged right there, maybe you have enough to calculate a rough estimate, similar to how this guy has done it: How to estimate the cost of translation using the translator plugin - #3 by lee-dohm - support - Discourse Meta

Thanks for the analysis. However, I think it is quite off concerning English because of guys like myself who post mostly in German, but have the UI set to English because they just can’t stand the horrible translations of English technical terms which just should not be translated. :slight_smile:

On the other hand, it is safe to assume that 99% of the posts in the regional forums are in the country-specific language, and we already have those numbers.

Can you tell how many posts were created in the last month?

That would give us an idea on the volume of potentially translatable posts. Then we just need to look into our crystal ball and guess how many translation requests from interested other language speakers we would get if it was just a button press away. My guess would be that it would mostly happen in general areas like general talk, OSMF, tagging if we have that. But very little in the regional forums, mostly when a discussion turns global.

Currently, many (most?) discussions in the German forum are not really exclusive to Germany, e.g. most “how to tag X?” topics, or even topics like “someone mapped this situation [in Germany] weirdly”, because the fact that it is in Germany is not that relevant unless it concerns legislation - the tagging schemes are international after all.

They are just in German.

My point: My hope for the auto-translator functionality is that topics about tagging, software, other q&a, discussions about new developments in the OSM world etc. are not silo’ed in country-specific categories only because they happen to be (started) in German, but rather live in the category they belong to from a content point of view.
So, the “Germany” category could really be just for topics that just concern things in Germany: FOSSGIS stuff, tagging specific to Germany, local communities and meetups etc.

Here is the average monthly number of posts, and the average monthly volume in MB:

MariaDB [forum]> select left(month,4) as year, avg(nb) as avg, round(avg(msg)/1024/1024,1) as msg_MB from (select left(from_unixtime(posted),7) as month, count(*) as nb, sum(length(message)) as msg from osm_posts group by 1 order by 1) as stats group by 1 order by 1;
+------+-----------+--------+
| year | avg       | msg_MB |
+------+-----------+--------+
| 2006 |   37.1667 |    0.0 |
| 2007 |   96.4000 |    0.1 |
| 2008 |  945.1667 |    0.5 |
| 2009 | 3244.5000 |    1.9 |
| 2010 | 6527.2500 |    3.7 |
| 2011 | 6637.1667 |    3.8 |
| 2012 | 7468.9167 |    4.5 |
| 2013 | 7242.7500 |    4.4 |
| 2014 | 7080.0833 |    4.1 |
| 2015 | 6143.5833 |    3.6 |
| 2016 | 4603.0833 |    2.8 |
| 2017 | 4528.6667 |    2.7 |
| 2018 | 4347.5833 |    2.6 |
| 2019 | 3408.7500 |    2.0 |
| 2020 | 3307.9167 |    2.1 |
| 2021 | 3079.9167 |    2.0 |
| 2022 | 2783.0000 |    1.7 |
+------+-----------+--------+
17 rows in set (6.744 sec)
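To turn these numbers into a rough monthly bill, here is a small Python sketch. It takes the 2022 row (about 2783 posts per month) and reads the volume figure as 1.7 × 1024 × 1024 bytes, which is what `sum(length(message))/1024/1024` yields. Several things here are assumptions and not from the thread: that the free allowances from the pricing table reset every month, that every translated post is translated exactly once per target language, and the helper name `monthly_cost`. Since `length()` counts bytes, the character count is an upper bound for non-ASCII text such as Russian.

```python
def monthly_cost(chars_per_month, price_per_million, free_chars=0,
                 translated_fraction=1.0, target_languages=1):
    """Estimated monthly bill for one service, in that service's currency."""
    billable = chars_per_month * translated_fraction * target_languages - free_chars
    # Nothing is billed while the volume stays inside the free allowance.
    return max(billable, 0) * price_per_million / 1_000_000


# 2022 row: 1.7 * 1024 * 1024 bytes per month (upper bound on characters).
CHARS_2022 = int(1.7 * 1024 * 1024)

# (price per 1M chars, free chars per month) taken from the pricing table;
# the monthly reset of the free allowance is an assumption.
SERVICES = {
    "Yandex": (6.0, 0),
    "Bing": (9.0, 2_000_000),
    "Google": (20.0, 500_000),
    "DeepL": (20.0, 500_000),
}

for name, (price, free) in SERVICES.items():
    everything = monthly_cost(CHARS_2022, price, free)
    one_in_ten = monthly_cost(CHARS_2022, price, free, translated_fraction=0.1)
    print(f"{name}: {everything:.2f} if every post is translated once, "
          f"{one_in_ten:.2f} if only 10% are")
```

The `translated_fraction` and `target_languages` parameters are the "crystal ball" part: the real bill depends mostly on how often readers actually press the translate button and into how many languages.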

This is just looking at the past; the future may bring an increase in posts thanks to a more modern tool, new adopters, less of a language barrier, etc.

When we switched the old phpBB forum at OSM France to Discourse, activity increased a lot!

I think that as far as French-language exchanges are concerned, most of them happen on the French forum. We have a very active community, which answers the vast majority of questions, and most French people are poor at foreign languages (I am a perfect example).

I think that the majority of French speakers outside France (Switzerland, Belgium, Canada, Africa) post mostly in English. We have very few French speakers from outside France who post on the French forum.

Translated with DeepL

A few thoughts around translation volume and costs.

To minimise the translation volume and thus the cost risk, the translation function could be limited to a few categories and languages at the beginning. Experience can then be gained in this “sandbox”, and scaling up to unlimited language translation should then be possible easily and safely.

I’m not sure if that’s a realistic option, but it would also be possible to offer the translation provider a cooperation: translation corrections (for available or new languages) by the OSM community in exchange for free translation service from the provider. The prerequisite is, of course, that this is also technically feasible.

I think this approach underlies the DeepL translation website (DeepL Translate: The world's most accurate translator). There it is also possible to correct the translation by clicking on a word, which will probably be used to train the translation AI. A win-win situation for provider and user.

Another advantage would be that the translation AI can learn how the OSM community communicates, which should also improve translation quality quickly.

This list of diary posts that Tom shared is very telling, since I assume it’s not biased by the communication channel these people use (that can be the forum, a mailing list or even Telegram) and gives a good picture of the activity of OSM communities.

Do we know what the coverage of these languages is per translation engine? Maybe it should be optimized to at least cover languages with >50 posts?

It looks like the plugin is calling the API more often than expected, only to detect in which language the post is written.

I discovered that because my free 500k-character quota on DeepL was exhausted in just a couple of days on an instance I was barely using.

There is another option to detect post languages: GitHub - peterc/whatlanguage: A language detection library for Ruby that uses bloom filters for speed.

No API, only local

I’ve also sent a mail to DeepL to see if they have special offers for free / open projects… it costs nothing to ask :wink:

The DeepL fork in particular or the translator plugin in general?

DeepL does not provide a language detection endpoint on its API, so the plugin is translating the post to the default instance language.

Google and Microsoft have a language detection endpoint; I don’t know if it counts against the quota.

I don’t know much about Ruby, but the code looks quite simple to me and it should not be too complex to improve it by doing language detection locally and/or mixing APIs.

Comparing the table posted by @TomH with the languages supported by each translation engine yields this overview:

 Translation Engine | % of translatable diary posts
--------------------+-------------------------------
 LibreTranslate     | 97.0%
 DeepL              | 95.7%
 Yandex             | 99.6%
 Bing               | 99.5%
 Google             | 99.5%

Not sure if this helps to make a decision, given the high coverage overall.
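For anyone who wants to redo this comparison with other language lists, here is a minimal Python sketch of one way to compute such a percentage. It is not necessarily how the figures above were obtained. The counts are an excerpt of the diary table posted earlier, the "supported" set is made up for illustration and is not any engine's real language list, and treating regional variants such as `pt-BR` as covered by `pt` is my own assumption.

```python
def coverage_percent(post_counts, supported):
    """Share of posts (in %) written in a language the engine can translate."""
    def is_supported(code):
        # Assumption: a regional variant like "pt-BR" counts as covered
        # when the base language "pt" is supported.
        return code in supported or code.split("-")[0] in supported

    total = sum(post_counts.values())
    covered = sum(n for code, n in post_counts.items() if is_supported(code))
    return 100 * covered / total


# Excerpt of the diary-entry counts from the table earlier in the thread.
diary_counts = {"en": 19474, "de": 2584, "ru": 1977, "pt-BR": 1019,
                "ar": 126, "ab": 122}

# Hypothetical engine: this set is invented for the example.
example_engine = {"en", "de", "ru", "pt"}

print(f"{coverage_percent(diary_counts, example_engine):.1f}% translatable")
```

Because English alone accounts for the bulk of the entries, any engine that supports the top handful of languages ends up well above 90%, which matches the high coverage across the board.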

This sounds borked: so the plugin translates the text of the post automatically just in order to decide whether to show the :globe_with_meridians: (translate) button at all? It could then just translate everything automatically without user interaction, which would result in a smaller number of API calls. (Maybe this is the reason why there is no support for DeepL in the mainline version of the plugin?)

FYI, LibreTranslate quality decreases when the translation is not to or from English.

French → German will in fact be French → English → German and of course, the quality decreases.

I installed it some time ago to do some local tests.


Language detection is really something that could be done locally. It is much simpler than translation, and a lot of code exists to do this without calling an API, for DeepL and the others too.
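To give an idea of how little is needed for a first local check, here is a toy Python sketch that guesses the language by counting common stopwords and only then decides whether a translate button makes sense. It is purely illustrative: the word lists are tiny, the function names are made up, and this is neither the plugin's code nor the bloom-filter approach used by whatlanguage. A real setup would use a proper detection library.

```python
import re

# Tiny illustrative stopword lists; real detectors use far larger models.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "des", "un", "une"},
}


def detect_language(text, min_hits=2):
    """Return the best-matching language code, or None if the text is unclear."""
    # Split into lowercase words made of letters only (Unicode-aware).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(1 for w in words if w in stop)
              for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


def needs_translation(text, reader_lang):
    """Offer a translation only if the post is clearly in another language."""
    detected = detect_language(text)
    return detected is not None and detected != reader_lang


print(detect_language("Das ist ein Test und die Karte ist nicht fertig"))
print(needs_translation("The map is in the middle of the city", "en"))
```

With a check like this running locally, the paid API would only be called when a reader actually requests a translation, not for every post that is displayed.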

Sure, this goes beyond using a plugin or using a fork of a plugin though. It requires a developer (Ruby).