Category Archives: Language Detection

Short Text Language Detection with Infinity-Gram

I talked about language detection (language identification) for twitter at NAIST(NARA Institute of Science and Technology). This is its slide. Tweets are too short to detect their languages precisely. I guess that one reason is because features extracted from a … Continue reading

Posted in Language Detection, NLP | 24 Comments

Why is Norwegian and Danish identification difficult?

I re-post the estimation table of ldig (twitter language detection). lang size detected correct precision recall cs 5329 5330 5319 0.9979 0.9981 da 5478 5483 5311 0.9686 0.9695 de 10065 10076 10014 0.9938 0.9949 en 9701 9670 9569 0.9896 0.9864 … Continue reading

Posted in Language Detection | 8 Comments

Estimation of ldig (twitter Language Detection) for LIGA dataset

LIGA[Tromp+ 11] is a twitter language detection for 6 languages (German, English, Spanish, French, Italian and Dutch). It uses a graph with 3-grams for long distance features and detects 95-98% accuracy. They open their dataset here which has 9066 tweets, … Continue reading

Posted in Language Detection, NLP, twitter | 2 Comments

Precision and Recall of ldig (twitter language detection)

In the previous article, I introduced ldig (Language Detection with Infinity-Gram: site, blog), which detects tweet languages. There are some requests to tell ldig’s precision and recall, so I calculated them. lang size detected correct precision recall cs 5329 5330 … Continue reading

Posted in Language Detection, NLP | 5 Comments

Language Detection for twitter with 99.1% Accuracy

I released a newer prototype of Language Detection ( Language Identification ) with Infinity-Gram (ldig), a language detection prototype for twitter. https://github.com/shuyo/ldig It detects tweets in 17 languages with 99.1% accuracy (Czech, Dannish, German, English, Spanish, Finnish, French, Indonesian, Italian, … Continue reading

Posted in Language Detection, NLP, Python | 25 Comments

Whether language profile should be bundled or not?

I’m going to support maven for language-detection, but I have some troubles about language profiles… language-detection have separated language profiles from jar file because they are lots of (meanwhile jar without profiles has only 100KB around, jar with ones has … Continue reading

Posted in Java, Language Detection | 4 Comments

language-detection supported 17 language profiles for short messages

language-detection( http://code.google.com/p/language-detection/ , langdetect) supported newly 17 language detection generated from twitter corpus. These are published at trunk of langdetect repository (which will be packaged sooner or later). http://code.google.com/p/language-detection/source/browse/#svn%2Ftrunk%2Fprofiles.sm Those 17 languages are as the below. cs : Czech da … Continue reading

Posted in Language Detection, NLP | 3 Comments