Category Archives: Language Detection

Short Text Language Detection with Infinity-Gram

I gave a talk on language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology). Here are the slides. Tweets are too short for their languages to be detected precisely. I guess one reason is that features extracted from a … Continue reading

Posted in Language Detection, NLP | 22 Comments

Estimation of ldig (twitter Language Detection) for LIGA dataset

LIGA [Tromp+ 11] is a Twitter language detection method for 6 languages (German, English, Spanish, French, Italian and Dutch). It uses a graph of 3-grams to capture long-distance features and achieves 95-98% accuracy. The authors have published their dataset here, which has 9066 tweets, … Continue reading

Posted in Language Detection, NLP, twitter | Leave a comment

Precision and Recall of ldig (twitter language detection)

In the previous article, I introduced ldig (Language Detection with Infinity-Gram: site, blog), which detects the language of tweets. I received some requests for ldig’s precision and recall, so I calculated them. lang size detected correct precision recall cs 5329 5330 … Continue reading

Posted in Language Detection, NLP | 2 Comments

Language Detection for twitter with 99.1% Accuracy

I released a newer prototype of Language Detection (Language Identification) with Infinity-Gram (ldig), a language detection prototype for Twitter. https://github.com/shuyo/ldig It detects tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, … Continue reading

Posted in Language Detection, NLP, Python | 16 Comments

Should language profiles be bundled or not?

I’m going to add Maven support to language-detection, but I have some trouble with the language profiles… language-detection keeps its language profiles separate from the jar file because there are so many of them (the jar without profiles is only around 100KB, while the jar with them is … Continue reading

Posted in Java, Language Detection | 3 Comments

language-detection supported 17 language profiles for short messages

language-detection (http://code.google.com/p/language-detection/, langdetect) now supports 17 new language profiles generated from a Twitter corpus. They are published in the trunk of the langdetect repository (and will be packaged sooner or later). http://code.google.com/p/language-detection/source/browse/#svn%2Ftrunk%2Fprofiles.sm The 17 languages are listed below. cs : Czech da … Continue reading

Posted in Language Detection, NLP | 1 Comment

langdetect is updated (added profiles of Estonian / Lithuanian / Latvian / Slovene, and so on)

I have updated my language detection library “langdetect”. http://code.google.com/p/language-detection/ The new features are as follows: Added 4 language profiles for Estonian, Lithuanian, Latvian and Slovene. Added getLangList() for retrieving a list of loaded language profiles. Added support for generating a language profile from … Continue reading

Posted in Language Detection, NLP | 1 Comment

Updated LangDetect Library (4x faster)

I’ve updated LangDetect (Language Detection Library for Java). http://code.google.com/p/language-detection/ Download the package “langdetect-01-24-2011.zip” from the Download List. This update makes detection 4x faster, based on improvement code posted by Elmer Garduno. Many thanks!! Table: time for 100 detection runs on the test data (ms). … Continue reading

Posted in i18n, Java, Language Detection, NLP | 4 Comments