Category Archives: Language Detection
Short Text Language Detection with Infinity-Gram
I gave a talk on language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology). Here are the slides. Tweets are too short for their languages to be detected precisely. I guess that one reason is that the features extracted from a … Continue reading
Posted in Language Detection, NLP
22 Comments
Estimation of ldig (twitter Language Detection) for LIGA dataset
LIGA [Tromp+ 11] is a Twitter language detection method for 6 languages (German, English, Spanish, French, Italian and Dutch). It uses a graph of 3-grams to capture long-distance features and achieves 95-98% accuracy. The authors have published their dataset here, which has 9066 tweets, … Continue reading
Posted in Language Detection, NLP, twitter
Leave a comment
Precision and Recall of ldig (twitter language detection)
In the previous article, I introduced ldig (Language Detection with Infinity-Gram: site, blog), which detects the languages of tweets. I received some requests for ldig’s precision and recall, so I calculated them. lang size detected correct precision recall cs 5329 5330 … Continue reading
Posted in Language Detection, NLP
2 Comments
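For readers who want to see how the size, detected and correct columns in the excerpt above relate to precision and recall, here is a minimal sketch. The function name `precision_recall` and the counts in the usage line are hypothetical illustrations, not taken from ldig or from the post's results; the sketch assumes that "size" is the number of test tweets in a language, "detected" is the number of tweets labeled with that language, and "correct" is the number of those labels that are right.

```python
def precision_recall(size, detected, correct):
    """Return (precision, recall) for one language.

    size:     number of test tweets whose true language is this one (assumed meaning)
    detected: number of tweets the detector labeled with this language (assumed meaning)
    correct:  number of those labeled tweets that really are in this language
    """
    # Precision: of everything labeled as this language, how much was right.
    precision = correct / detected if detected else 0.0
    # Recall: of everything truly in this language, how much was found.
    recall = correct / size if size else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(size=200, detected=250, correct=150)
print(f"precision={p:.3f} recall={r:.3f}")
```

The guards against zero denominators are a design choice for this sketch: a language that is never detected gets a precision of 0.0 rather than raising an error.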
Language Detection for twitter with 99.1% Accuracy
I released a newer version of Language Detection (Language Identification) with Infinity-Gram (ldig), a language detection prototype for Twitter. https://github.com/shuyo/ldig It detects the language of tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, … Continue reading
Posted in Language Detection, NLP, Python
16 Comments
Whether language profile should be bundled or not?
I’m going to add Maven support to language-detection, but I have some trouble with the language profiles… language-detection keeps its language profiles separate from the jar file because there are a lot of them (the jar without profiles is only around 100KB, while the jar with them has … Continue reading
Posted in Java, Language Detection
3 Comments
language-detection supported 17 language profiles for short messages
language-detection (http://code.google.com/p/language-detection/, langdetect) now supports 17 new language profiles for short messages, generated from a Twitter corpus. They are published in the trunk of the langdetect repository (and will be packaged sooner or later). http://code.google.com/p/language-detection/source/browse/#svn%2Ftrunk%2Fprofiles.sm The 17 languages are as follows. cs : Czech da … Continue reading
Posted in Language Detection, NLP
1 Comment
langdetect is updated(added profiles of Estonian / Lithuanian / Latvian / Slovene, and so on)
My language detection library “langdetect” has been updated. http://code.google.com/p/language-detection/ The added features are the following: Added 4 language profiles for Estonian, Lithuanian, Latvian and Slovene. Added support for retrieving the list of loaded language profiles with getLangList(). Added support for generating a language profile from … Continue reading
Posted in Language Detection, NLP
1 Comment
Updated LangDetect Library (4x faster)
I’ve updated LangDetect (Language Detection Library for Java). http://code.google.com/p/language-detection/ Download the package “langdetect-01-24-2011.zip” from the Download List. This update makes detection 4x faster, based on improvement code posted by Elmer Garduno. Many thanks!! Table: time for 100 detection runs on the test data (ms). … Continue reading
Posted in i18n, Java, Language Detection, NLP
4 Comments