Precision and Recall of ldig (twitter language detection)

In the previous article, I introduced ldig (Language Detection with Infinity-Gram: site, blog), which detects tweet languages.
Several readers asked about ldig’s precision and recall, so I calculated them.

lang     size  detected  correct  precision  recall
cs       5329      5330     5319     0.9979  0.9981
da       5478      5483     5311     0.9686  0.9695
de      10065     10076    10014     0.9938  0.9949
en       9701      9670     9569     0.9896  0.9864
es      10066     10075     9989     0.9915  0.9924
fi       4490      4472     4459     0.9971  0.9931
fr      10098     10097    10048     0.9951  0.9950
id      10181     10233    10167     0.9936  0.9986
it      10150     10191    10109     0.9920  0.9960
nl       9671      9579     9521     0.9939  0.9845
no       8560      8442     8219     0.9736  0.9602
pl      10070     10079    10054     0.9975  0.9984
pt       9422      9441     9354     0.9908  0.9928
ro       5914      5831     5822     0.9985  0.9844
sv       9990     10034     9866     0.9833  0.9876
tr      10310     10321    10300     0.9980  0.9990
vi      10494     10486    10479     0.9993  0.9986
total  149989             148600             0.9907
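
For readers who want to check the figures, here is a minimal sketch of how the two ratios in the table can be recomputed from the counts, assuming precision = correct / detected and recall = correct / size (the table values are consistent with these definitions). The function name and the choice of rows are only illustrative; this is not ldig's own evaluation script.

```python
# Counts copied from three rows of the table: (size, detected, correct).
counts = {
    "cs": (5329, 5330, 5319),
    "da": (5478, 5483, 5311),
    "no": (8560, 8442, 8219),
}


def precision_recall(size, detected, correct):
    """Return (precision, recall) for one language.

    precision = correct / detected  (how many detections were right)
    recall    = correct / size      (how many true tweets were found)
    """
    return correct / detected, correct / size


for lang, (size, detected, correct) in counts.items():
    p, r = precision_recall(size, detected, correct)
    print(f"{lang} {p:.4f} {r:.4f}")

# Overall figure from the "total" row: correct / size.
total_accuracy = 148600 / 149989
print(f"total {total_accuracy:.4f}")
```

Applying the same function to the remaining rows should reproduce the rest of the precision and recall columns.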

The total data size is not equal to the total number of detected tweets because ldig outputs “” (an empty label) when the maximum probability is lower than 0.6.
The data sizes also differ from those in the previous article because the dataset has been updated.
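
The 0.6 cutoff can be pictured with a small sketch. The function `pick_label` and the probability dictionaries below are hypothetical and only illustrate the rule described above (an empty label when the maximum probability is lower than 0.6); they are not taken from ldig's source.

```python
def pick_label(probabilities, threshold=0.6):
    """Return the most probable language, or "" if it is not confident enough.

    `probabilities` maps language codes to probabilities (hypothetical input
    format). The empty string stands for "no language detected".
    """
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    return label if prob >= threshold else ""


# A confident prediction keeps its label.
print(repr(pick_label({"en": 0.9, "de": 0.1})))

# Two similar languages splitting the probability mass yield the empty label,
# so this tweet counts toward "size" but not toward any "detected" column.
print(repr(pick_label({"da": 0.55, "no": 0.45})))
```

Tweets that fall below the threshold lower recall for their true language without affecting any language's precision.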

I reckon that pushing accuracy beyond 99% doesn’t make much sense. What do you think?


5 Responses to Precision and Recall of ldig (twitter language detection)

  1. Num Nomi says:

    Is there any way to run this code from within Java?
    I am using Java for a project, and I need to call language detection for Twitter from within Java.

    Thank you!

  2. Anastasija says:

    Dear Shuyo,

    I read all the articles and I am doing some research for a paper at university about using n-grams for detecting the language of short texts. I find tweets a very descriptive example, and the tool is very useful.
    However, I found a problem with languages that are not in the list of 17 languages but are very similar to one of them, for example Dutch and Afrikaans (they seem to have the same language roots dating from the 15th or 16th century). So now, for example, some Afrikaans tweets appear among the Dutch tweets I try to detect.

    So do tweets in other languages get classified as the most similar language in the list of 17 languages?

    • shuyo says:

      I recognize the problem too (e.g. Indonesian is very similar to Malay, Czech to Slovak, and so on).
      As you say, all texts are classified into one of the 17 languages (or “Indistinguishable”).
      Since any classifier can only predict labels that appear in its training corpus, an Afrikaans corpus is necessary to support it.
      For that, the corpus annotator (me! :D) would need to be able to distinguish Dutch from Afrikaans….

      • Anastasija says:

        Dear shuyo,
        Thank you for the reply. I will try to create a corpus for the languages that sneak in, and I can most probably provide you with it if you are interested in extending the language models currently available.
