language-detection supported 17 language profiles for short messages

language-detection (http://code.google.com/p/language-detection/ , langdetect) now supports 17 new language profiles generated from a Twitter corpus.
They are published in the trunk of the langdetect repository (and will be packaged sooner or later).

The 17 languages are listed below.

  • cs : Czech
  • da : Danish
  • de : German
  • en : English
  • es : Spanish
  • fi : Finnish
  • fr : French
  • id : Indonesian
  • it : Italian
  • nl : Dutch
  • no : Norwegian
  • pl : Polish
  • pt : Portuguese
  • ro : Romanian
  • sv : Swedish
  • tr : Turkish
  • vi : Vietnamese

On short messages (tweets and the like), these profiles score about 2 points higher in accuracy than the bundled profiles generated from Wikipedia abstracts.

language  test size  original profiles    new profiles
                     correct  accuracy    correct  accuracy
cs             4269     4261    100.0%       4258    100.0%
da             5484     5255     96.0%       5188     95.0%
de             9608     8495     88.0%       9020     94.0%
en             9630     8796     91.0%       9188     95.0%
es            10133     9721     96.0%       9943     98.0%
fi             2241     2238    100.0%       2236    100.0%
fr            10067     9719     97.0%       9906     98.0%
id            10184     9869     97.0%      10061     99.0%
it            10167     9844     97.0%       9960     98.0%
nl             9680     8449     87.0%       9399     97.0%
no            10505    10148     97.0%      10015     95.0%
pl             9886     9833     99.0%       9852    100.0%
pt             9456     8720     92.0%       9170     97.0%
ro             4057     3791     93.0%       3993     98.0%
sv             9932     9670     97.0%       9762     98.0%
tr            10309    10145     98.0%      10251     99.0%
vi            10932    10832     99.0%      10832     99.0%
total        146540   139786     95.4%     143034     97.6%
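
Each accuracy value is the number of correct detections divided by the test size. As a quick check of the total row, here is a minimal Python sketch. The function and variable names are only illustrative and are not part of langdetect; the three numbers are copied from the table above.

```python
def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage of correctly detected messages."""
    return 100.0 * correct / total


# Totals copied from the last row of the table above.
TEST_SIZE = 146540
ORIGINAL_CORRECT = 139786   # bundled (Wikipedia-based) profiles
NEW_CORRECT = 143034        # new (Twitter-based) profiles

original_acc = accuracy(ORIGINAL_CORRECT, TEST_SIZE)
new_acc = accuracy(NEW_CORRECT, TEST_SIZE)
gain = new_acc - original_acc  # difference in percentage points

print(f"original profiles: {original_acc:.1f}%")
print(f"new profiles:      {new_acc:.1f}%")
print(f"gain:              {gain:.1f} points")
```

The same function can be applied to any single row to recompute a per-language accuracy from its correct count and test size.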

The Twitter corpus (training and test) used to generate the profiles is based on tweets collected via the ‘sample’ method of the Twitter Streaming API, all of which I annotated myself.
The corpus sizes are as follows.

language         training   test
cs : Czech           3514   4342
da : Danish          4007   5645
de : German         44115   9998
en : English        44335  10167
es : Spanish        44976  10296
fi : Finnish         3400   2310
fr : French         44279  10410
id : Indonesian     44912  10395
it : Italian        44183  10562
nl : Dutch          45033  10109
no : Norwegian       4272  10721
pl : Polish          5040  10282
pt : Portuguese     44505   9749
ro : Romanian        3490   4151
sv : Swedish        44330  10232
tr : Turkish        45024  10828
vi : Vietnamese      5029  11065

This corpus was originally built for a new short-text detector, but I confirmed that profiles generated from it for langdetect also perform better on short messages than the bundled profiles. So the new profiles are published too.
When using these profiles, convert the text to lowercase before detection. langdetect tends to remove all-uppercase words as acronyms, while Twitter-like short messages are often written entirely in uppercase for emphasis.
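
The lowercasing step can be sketched as a small preprocessing function. This is only an illustration in Python, chosen for brevity; langdetect itself is a Java library, and the helper name `normalize_short_message` is hypothetical, not part of its API. The idea is to lowercase the message first and pass the result to the detector.

```python
def normalize_short_message(text: str) -> str:
    """Lowercase a short message before language detection.

    All-uppercase words would otherwise risk being dropped as acronyms,
    which can leave an emphatic all-caps tweet with almost no text.
    """
    return text.lower()


# Example: an all-caps tweet written for emphasis.
tweet = "THIS IS SO COOL"
print(normalize_short_message(tweet))
```

The normalized string, not the raw tweet, is what would then be handed to the detector.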

Of course, the existing profiles are still bundled as before. They have higher accuracy for news text and the like! :D

The prototype of the language detector for short texts is published at https://github.com/shuyo/ldig .
The name is short for “Language Detection with Infinity-Gram”.
I wrote a presentation on ldig, but so far it is in Japanese only. I’ll translate it into English later…


3 Responses to language-detection supported 17 language profiles for short messages

  1. Hi Shuyo,

    I’ve implemented a test-based project to test your langdetect.jar project, and so far there have been no false positives. I’d like to say thanks to you for the great job. I posted about it on my blog http://www.hashcode.eti.br/?p=441-language-detection-using-pdf-documents or by direct link to github https://github.com/shairontoledo/langdetect-document-matcher-test .

  2. Gaurav Tuli says:

    Hey Shuyo,
    The accuracy of your approach looks outstanding. Thanks for sharing your work. Where can I get the twitter corpus you used?
