language-detection (langdetect, http://code.google.com/p/language-detection/ ) now provides new detection profiles for 17 languages, generated from a Twitter corpus.
They are published in the trunk of the langdetect repository and will be included in a packaged release sooner or later.
The 17 languages are as follows.
- cs : Czech
- da : Danish
- de : German
- en : English
- es : Spanish
- fi : Finnish
- fr : French
- id : Indonesian
- it : Italian
- nl : Dutch
- no : Norwegian
- pl : Polish
- pt : Portuguese
- ro : Romanian
- sv : Swedish
- tr : Turkish
- vi : Vietnamese
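For short messages such as tweets, these profiles are about 2 percentage points more accurate overall than the bundled profiles generated from Wikipedia abstracts (97.6% versus 95.4% in the test below), although accuracy drops slightly for Danish and Norwegian.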
| language | test size | original profiles: correct | original profiles: accuracy | new profiles: correct | new profiles: accuracy |
|---|---|---|---|---|---|
| cs | 4269 | 4261 | 100.0% | 4258 | 100.0% |
| da | 5484 | 5255 | 96.0% | 5188 | 95.0% |
| de | 9608 | 8495 | 88.0% | 9020 | 94.0% |
| en | 9630 | 8796 | 91.0% | 9188 | 95.0% |
| es | 10133 | 9721 | 96.0% | 9943 | 98.0% |
| fi | 2241 | 2238 | 100.0% | 2236 | 100.0% |
| fr | 10067 | 9719 | 97.0% | 9906 | 98.0% |
| id | 10184 | 9869 | 97.0% | 10061 | 99.0% |
| it | 10167 | 9844 | 97.0% | 9960 | 98.0% |
| nl | 9680 | 8449 | 87.0% | 9399 | 97.0% |
| no | 10505 | 10148 | 97.0% | 10015 | 95.0% |
| pl | 9886 | 9833 | 99.0% | 9852 | 100.0% |
| pt | 9456 | 8720 | 92.0% | 9170 | 97.0% |
| ro | 4057 | 3791 | 93.0% | 3993 | 98.0% |
| sv | 9932 | 9670 | 97.0% | 9762 | 98.0% |
| tr | 10309 | 10145 | 98.0% | 10251 | 99.0% |
| vi | 10932 | 10832 | 99.0% | 10832 | 99.0% |
| total | 146540 | 139786 | 95.4% | 143034 | 97.6% |
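As a quick check of the overall figures, the accuracies in the total row can be recomputed from the counts of correct detections. This is only a small arithmetic sketch; the variable and function names are made up for illustration and are not part of langdetect.

```python
# Totals copied from the "total" row of the table above.
TEST_SIZE = 146540
CORRECT_ORIGINAL = 139786  # bundled Wikipedia-based profiles
CORRECT_NEW = 143034       # new Twitter-based profiles


def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage of the test set."""
    return 100.0 * correct / total


original = accuracy_percent(CORRECT_ORIGINAL, TEST_SIZE)
new = accuracy_percent(CORRECT_NEW, TEST_SIZE)

print(f"original profiles: {original:.1f}%")              # 95.4%
print(f"new profiles:      {new:.1f}%")                   # 97.6%
print(f"difference:        {new - original:.1f} points")  # 2.2 points
```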
The Twitter corpus (training and test) used to generate the profiles consists of tweets collected via the ‘sample’ method of the Twitter Streaming API, all of which I annotated myself.
The corpus sizes are as follows.
| language | training | test |
|---|---|---|
| cs : Czech | 3514 | 4342 |
| da : Danish | 4007 | 5645 |
| de : German | 44115 | 9998 |
| en : English | 44335 | 10167 |
| es : Spanish | 44976 | 10296 |
| fi : Finnish | 3400 | 2310 |
| fr : French | 44279 | 10410 |
| id : Indonesian | 44912 | 10395 |
| it : Italian | 44183 | 10562 |
| nl : Dutch | 45033 | 10109 |
| no : Norwegian | 4272 | 10721 |
| pl : Polish | 5040 | 10282 |
| pt : Portuguese | 44505 | 9749 |
| ro : Romanian | 3490 | 4151 |
| sv : Swedish | 44330 | 10232 |
| tr : Turkish | 45024 | 10828 |
| vi : Vietnamese | 5029 | 11065 |
This corpus was originally built for developing a new short-text detector, but I confirmed that, when used as langdetect profiles, it gives higher accuracy on short messages than the bundled profiles. So I am publishing the new profiles too.
When using these profiles, convert the text to lower case before detection. langdetect tends to remove all-upper-case words as acronyms, whereas Twitter-like short messages are often written entirely in upper case for emphasis.
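The lower-casing step might look like the sketch below. It is written in Python only for illustration (langdetect itself is a Java library), and the function name is hypothetical. The returned string is what you would then pass to the detector.

```python
def normalize_for_twitter_profiles(text: str) -> str:
    """Lower-case the text so that all-upper-case words are not
    treated as acronyms and dropped before detection."""
    # Note: str.lower() is locale-independent, so language-specific
    # casing rules (e.g. the Turkish dotted/dotless i) are not applied.
    return text.lower()


print(normalize_for_twitter_profiles("I LOVE THIS SONG SO MUCH"))
```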
Of course, the existing profiles are still bundled as before. They have higher accuracy for news text and the like!
A prototype of the language detector for short texts is published at https://github.com/shuyo/ldig .
The name is short for “Language Detection with Infinity-Gram”.
I wrote a presentation about ldig, but so far it is in Japanese only. I’ll translate it into English later…
Hi Shuyo,
I’ve implemented a test project for your langdetect.jar, so far without any false positives. I’d like to say thanks to you for the great job. I posted about it on my blog at http://www.hashcode.eti.br/?p=441-language-detection-using-pdf-documents , or you can go directly to GitHub: https://github.com/shairontoledo/langdetect-document-matcher-test .