I have released ldig (Language Detection with Infinity-Gram), a new language detection (language identification) prototype for Twitter.
It detects tweets in 17 languages with 99.1% accuracy: Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, Dutch, Norwegian, Polish, Portuguese, Romanian, Swedish, Turkish and Vietnamese.
ldig specializes in noisy short texts (3 words or more) and is limited to Latin-alphabet languages, because input text can first be separated into character-type blocks, and detection among Latin-alphabet languages is the most difficult case.
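As a rough illustration of what "separating into character-type blocks" can mean, here is a minimal stdlib sketch that splits text into runs of one Unicode script, using the first word of each character's Unicode name as a coarse type. This is only an assumption for illustration, not ldig's actual preprocessing:

```python
import unicodedata

def char_type(ch):
    """Coarse character type: the first word of the Unicode name,
    e.g. "LATIN", "CYRILLIC", "HIRAGANA"; spaces get their own type."""
    if ch.isspace():
        return "SPACE"
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return "OTHER"
    return name.split()[0]

def split_by_char_type(text):
    """Split text into (type, run) pairs of a single character type."""
    runs = []
    for ch in text:
        t = char_type(ch)
        if runs and runs[-1][0] == t:
            runs[-1][1].append(ch)
        else:
            runs.append((t, [ch]))
    return [(t, "".join(cs)) for t, cs in runs]

# Non-Latin blocks (e.g. Cyrillic, kana) are easy to tell apart this way;
# only the Latin runs need a statistical classifier.
print(split_by_char_type("hola мир"))
```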
My earlier library, language-detection (langdetect), is not good at short texts, so many users seem to have trouble applying it to language detection on Twitter.
langdetect uses character 3-grams as features, which are insufficient evidence for short text detection.
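To see why a tweet yields so few 3-gram features, here is a minimal sketch of character n-gram extraction (the boundary padding is an assumption for illustration, not langdetect's exact preprocessing):

```python
def char_ngrams(text, n=3):
    """Extract overlapping character n-grams from a text.
    Pads with one space on each side so word boundaries become features."""
    text = " " + text.strip() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A short tweet produces only a handful of 3-grams,
# so there is little evidence for the classifier to work with.
print(char_ngrams("good morning"))
```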
I supposed that maximal substrings [Okanohara+ 09] would make sufficient features for short text detection, and prepared a Twitter corpus in 17 languages.
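For intuition, a maximal substring is one that cannot be extended left or right without lowering its occurrence count, so the class of equal-occurrence substrings is represented once instead of exploding combinatorially. The following is a naive O(n³) sketch of that definition for a tiny text; practical implementations (as in the paper) use suffix-array-style data structures instead:

```python
def maximal_substrings(text, min_count=2):
    """Naively find substrings that occur at least min_count times and
    whose count drops for every one-character extension left or right."""
    # Count every substring (fine for a tiny example, hopeless at scale).
    count = {}
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = text[i:j]
            count[s] = count.get(s, 0) + 1
    alphabet = set(text)
    result = []
    for s, c in count.items():
        if c < min_count:
            continue
        right_maximal = all(count.get(s + ch, 0) < c for ch in alphabet)
        left_maximal = all(count.get(ch + s, 0) < c for ch in alphabet)
        if right_maximal and left_maximal:
            result.append((s, c))
    return sorted(result)

# "bra" is not maximal: every "bra" is part of an "abra" with the same count.
print(maximal_substrings("abracadabra"))
```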
The training/test corpus sizes and the accuracy of the ldig prototype are shown below.
I'm preparing a Catalan corpus with some help (THANKS!), so the number of supported languages will increase soon.
ldig currently writes out its model in numpy binary format, but I will change this to a more portable format, something like MessagePack, so that the ldig detector can be ported to other platforms easily.
My presentation about ldig is here (but it is written in Japanese!)
- Language Detection with Infinity-Gram (in Japanese)
And I'll present a paper on it at the annual conference of the Association for Natural Language Processing in Japan (NLP2012).
I'll publish the paper on this blog after the conference (but it's in Japanese!).
I have also published a slide deck about Twitter language detection in English.
- [Okanohara+ 09] Daisuke Okanohara and Jun'ichi Tsujii, "Text Categorization with All Substring Features"