Evaluation of ldig (Twitter Language Detection) on the LIGA dataset

LIGA [Tromp+ 11] is a Twitter language detector for 6 languages (German, English, Spanish, French, Italian and Dutch).
It uses a graph over 3-grams to capture long-distance features and reaches 95-98% accuracy.
They have published their dataset here (9,066 tweets), so it is possible to compare ldig (Language Detection with Infinity-Gram: site, blog) against their results.

First, their dataset needs to be converted into a format that ldig can read.
Here is a Ruby script to do that.

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

# Convert the LIGA dataset (one tweet per file, grouped into per-language
# directories) into ldig's format: one "lang<TAB>text" line per tweet.
open("liga_dataset.txt", "wb:UTF-8") do |f|
  ["de_DE","en_UK","es_ES","fr_FR","it_IT","nl_NL"].each do |dir|
    lang = dir[0,2]   # language code = first two letters of the directory name
    Dir.glob("#{dir}/*.txt") do |file|
      # read each tweet, replace control characters with spaces, and trim
      line = open(file, "rb:UTF-8") {|tf| tf.read}.gsub(/[\u0000-\u001f]/, " ").strip
      f.puts "#{lang}\t#{line}"
    end
  end
end

Here is the evaluation of ldig on the generated dataset, liga_dataset.txt.

lang size detected correct precision recall
de 1479 1469 1463 0.9959 0.9892
en 1505 1504 1489 0.9900 0.9894
es 1562 1550 1541 0.9942 0.9866
fr 1551 1545 1539 0.9961 0.9923
it 1539 1532 1526 0.9961 0.9916
nl 1430 1429 1425 0.9972 0.9965
total 9066 8983 0.9908

This shows that ldig achieves over 99% accuracy on their dataset.

Reference

  • [Erik Tromp and Mykola Pechenizkiy 11] “Graph-Based N-gram Language Identification on Short Texts”

Precision and Recall of ldig (twitter language detection)

In the previous article, I introduced ldig (Language Detection with Infinity-Gram: site, blog), which detects the language of tweets.
There have been some requests for ldig's precision and recall, so I calculated them.

lang size detected correct precision recall
cs 5329 5330 5319 0.9979 0.9981
da 5478 5483 5311 0.9686 0.9695
de 10065 10076 10014 0.9938 0.9949
en 9701 9670 9569 0.9896 0.9864
es 10066 10075 9989 0.9915 0.9924
fi 4490 4472 4459 0.9971 0.9931
fr 10098 10097 10048 0.9951 0.9950
id 10181 10233 10167 0.9936 0.9986
it 10150 10191 10109 0.9920 0.9960
nl 9671 9579 9521 0.9939 0.9845
no 8560 8442 8219 0.9736 0.9602
pl 10070 10079 10054 0.9975 0.9984
pt 9422 9441 9354 0.9908 0.9928
ro 5914 5831 5822 0.9985 0.9844
sv 9990 10034 9866 0.9833 0.9876
tr 10310 10321 10300 0.9980 0.9990
vi 10494 10486 10479 0.9993 0.9986
total 149989 148600 0.9907

The total data size does not equal the number of detected tweets because ldig outputs "" as the language when the maximum probability is lower than 0.6.
Also, the data size differs from that in the previous article because the dataset has been updated.
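
For readers who want to reproduce such numbers from their own detection results, here is a minimal Python sketch of per-language precision and recall with a rejection label. The (gold, predicted) pair representation is just an assumption for illustration, not ldig's actual output format.

from collections import defaultdict

def precision_recall(pairs):
    """pairs: iterable of (gold_lang, predicted_lang) tuples,
    where predicted_lang may be "" when the detector rejects the text."""
    size = defaultdict(int)      # gold tweets per language
    detected = defaultdict(int)  # tweets predicted as each language
    correct = defaultdict(int)   # correctly predicted tweets per language
    for gold, pred in pairs:
        size[gold] += 1
        if pred:
            detected[pred] += 1
        if pred == gold:
            correct[gold] += 1
    result = {}
    for lang in sorted(size):
        precision = correct[lang] / detected[lang] if detected[lang] else 0.0
        recall = correct[lang] / size[lang]
        result[lang] = (size[lang], detected[lang], correct[lang], precision, recall)
    return result

# tiny usage example
print(precision_recall([("en", "en"), ("en", ""), ("de", "de"), ("de", "en")]))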

I had reckoned that pushing accuracy above 99% would not be very meaningful, so what should come next?


Language Detection for twitter with 99.1% Accuracy

I have released a new prototype of Language Detection (Language Identification) with Infinity-Gram (ldig), a language detector for Twitter.

It detects tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, Dutch, Norwegian, Polish, Portuguese, Romanian, Swedish, Turkish and Vietnamese).
ldig specializes in noisy short texts (more than 3 words) and is limited to Latin-alphabet languages, because input text can be split into character-type blocks beforehand and distinguishing among Latin-alphabet languages is the most difficult case.
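
To make the idea of "character type blocks" concrete, here is a rough Python illustration that splits text into runs of the same script; this is only a sketch of the concept, not ldig's actual preprocessing.

import unicodedata

def script_of(ch):
    """Very rough script bucket based on the Unicode character name (illustration only)."""
    if not ch.isalpha():
        return "other"
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC", "GREEK", "ARABIC", "HANGUL",
                   "HIRAGANA", "KATAKANA", "CJK"):
        if script in name:
            return script
    return "other"

def char_type_blocks(text):
    """Split text into consecutive runs of the same (rough) script."""
    blocks = []
    for ch in text:
        s = script_of(ch)
        if blocks and blocks[-1][0] == s:
            blocks[-1][1] += ch
        else:
            blocks.append([s, ch])
    return [(s, run) for s, run in blocks]

print(char_type_blocks("Hello привет 世界"))
# [('LATIN', 'Hello'), ('other', ' '), ('CYRILLIC', 'привет'), ('other', ' '), ('CJK', '世界')]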

My language-detection library (langdetect) is not good at short texts, so many users seem to have trouble with language detection on Twitter.
langdetect uses character 3-grams as features, which is not enough for short texts.
I supposed that maximal substrings [Okanohara+ 09] would provide sufficient features for short-text detection, and prepared a Twitter corpus for 17 languages.
The training/test corpus sizes and the evaluation of the ldig prototype are shown below.

lang training test correct accuracy
cs 4581 5329 5319 99.81
da 5480 5476 5308 96.93
de 43930 9659 9611 99.50
en 44912 9612 9497 98.80
es 44921 10127 10050 99.24
fi 4576 4490 4464 99.42
fr 44142 10063 10014 99.51
id 44873 10183 10163 99.80
it 44045 10152 10110 99.59
nl 44933 9677 9532 98.50
no 7525 8513 8192 96.23
pl 12854 10070 10059 99.89
pt 44464 9459 9359 98.94
ro 6114 5902 5812 98.48
sv 44339 9952 9870 99.18
tr 44787 10309 10301 99.92
vi 10413 10494 10481 99.88
total 496889 149467 148142 99.11
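
As a rough illustration of what a "maximal substring" is (a substring that occurs strictly more often than every one-character extension of it, to the left or to the right), here is a naive Python sketch. This is only a toy version for small strings; a practical implementation would need an efficient construction such as a suffix-array-based approach as in [Okanohara+ 09].

from collections import Counter

def maximal_substrings(text, max_len=8):
    """Naive extraction of maximal substrings: substrings occurring strictly
    more often than every one-character extension of them.  Quadratic and
    slow -- for illustration only."""
    counts = Counter()
    n = len(text)
    # count all substrings up to length max_len + 1 (extensions included)
    for i in range(n):
        for j in range(i + 1, min(i + max_len + 1, n) + 1):
            counts[text[i:j]] += 1

    maximal = []
    for s, c in counts.items():
        if len(s) > max_len:
            continue
        extensions = [t for t in counts
                      if len(t) == len(s) + 1 and (t.startswith(s) or t.endswith(s))]
        if all(counts[t] < c for t in extensions):
            maximal.append((s, c))
    return sorted(maximal, key=lambda x: -x[1])

# toy example on a tiny "corpus" of concatenated text
print(maximal_substrings("does it work? it works! it is working", max_len=5)[:10])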

I'm preparing a Catalan corpus with some help (THANKS!), so the supported languages will increase soon.

ldig currently writes out its model in numpy binary format, but I will switch to a more portable format such as MessagePack, so that the ldig detector can easily be ported to other platforms.
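
The idea is simply to trade the numpy-specific file for a language-neutral container. The snippet below sketches that direction; the dictionary layout is hypothetical and is not ldig's actual model format.

import numpy as np
import msgpack  # third-party: pip install msgpack

# Hypothetical model parameters -- not ldig's actual model layout.
weights = np.zeros((1000, 17), dtype=np.float32)

# Current style: numpy binary format (easy from Python, awkward elsewhere).
np.save("model.npy", weights)

# Portable style: MessagePack, readable from most languages.
with open("model.msgpack", "wb") as f:
    f.write(msgpack.packb({
        "shape": list(weights.shape),
        "data": weights.ravel().tolist(),   # plain list of floats
    }))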

The ldig presentation slides are here (but they are written in Japanese! 😀 )

I will also present its paper at the annual conference of the Association for Natural Language Processing in Japan (NLP2012).
I'll publish the paper on this blog after the conference (but it's in Japanese! :P).

P.S.

I have published a slide deck about Twitter language detection in English.

Reference

  • [Daisuke Okanohara and Jun’ichi Tsujii 09] “Text Categorization with All Substring Features”

Repository Migration from Subversion to git on Google Code Project

I migrated language-detection's repository on Google Code Project from Subversion to git.
This is because its directory layout has to change a lot for Maven support. (I HATE branching in Subversion! :D)

Google Code Project supports Subversion, git and Mercurial as version control systems. The repositories are independent and exclusive, so you have to migrate between them manually.
So here is the migration process from Subversion to git, step by step.

1. Prepare git and git-svn

I used them on Cygwin. But I reckon it is easier on Linux 😀

2. Migrate into local git repositories

First, migrate the Subversion repository into local git repositories (because the hosted repositories are exclusive).
The Subversion repository on Google Code Project holds the history of not only the code (trunk/branches/tags) but also the wiki. With git, the code and the wiki are stored in separate repositories.
So you need to migrate two repositories, assuming you didn't use branches/tags.

Execute the commands below in an appropriate directory.

$ cd your-working-directory
$ git svn clone -s http://your-project-name.googlecode.com/svn/       # code repository (-s: standard trunk/branches/tags layout)
$ git svn clone http://your-project-name.googlecode.com/svn/wiki/     # wiki repository (no -s)

“your-working-directory” and “your-project-name” must be replaced with your own names. Note that the second command does not take the -s option.
If you use branches and tags, you may need some extra steps to recreate the tags in the git repository, but I didn't look into that because I didn't use them 😛

If you run into an error like the one below with git-svn on Cygwin (it seems to happen often on Windows 7 x64?),

10192702 [main] perl 9904 C:\cygwin\bin\perl.exe: *** fatal error - unable to remap C:\cygwin\bin\cygsvn_client-1-0.dll to same address as parent: 0xAB0000 != 0xAE0000

then run rebaseall from cmd.exe as below.

$ cd c:\cygwin\bin
$ ash rebaseall -v

3. Switch the project’s VCS to git

Switching to git is done on the Source tab of the project’s administration console.
The operation does not touch the Subversion repository itself; it only restricts access to the selected system. So you can go back to Subversion at any time.

4. Push the local repositories to the new ones

Execute the commands below.

$ cd svn
$ git push https://code.google.com/p/your-project-name/ master
$ cd ..
$ cd wiki 
$ git push https://code.google.com/p/your-project-name.wiki/ master

Then the git repositories will be available. You should check that the project’s wiki looks the same as before.
I like that the code and wiki repositories are separated.

Reference:
http://d.hatena.ne.jp/kkobayashi_a/20090615/p1 (in Japanese)


Should language profiles be bundled or not?

I’m going to add Maven support to language-detection, but I’m having some trouble with the language profiles…

language-detection has kept its language profiles separate from the jar file because they are large (the jar without profiles is only around 100KB, while the jar with them is over 1MB).
However, there are some reasonable requests to bundle them in the jar file:

  • With separate profiles, the application needs to know the directory where they are installed.
  • Hadoop can’t distribute language profiles that live outside the jar file.
  • I want to register the library in Maven Central, and a profile-bundled jar is more useful there.

Also, a user showed me how to generate a jar file with selected profiles using a Maven template, so a developer who wants a slim library can build a jar with only the necessary profiles (or even none).

So I am starting to think that the jar file packaged for the language-detection library could bundle all language profiles. What do you think?


language-detection now supports 17 language profiles for short messages

language-detection ( http://code.google.com/p/language-detection/ , langdetect) now supports 17 new language profiles generated from a Twitter corpus.
They are published in the trunk of the langdetect repository (and will be packaged sooner or later).

The 17 languages are as follows.

  • cs : Czech
  • da : Danish
  • de : German
  • en : English
  • es : Spanish
  • fi : Finnish
  • fr : French
  • id : Indonesian
  • it : Italian
  • nl : Dutch
  • no : Norwegian
  • pl : Polish
  • pt : Portuguese
  • ro : Romanian
  • sv : Swedish
  • tr : Turkish
  • vi : Vietnamese

These profiles perform more than 2 points better at short-message detection (tweets and so on) than the bundled profiles generated from Wikipedia abstracts.

language  test size  correct (original)  accuracy (original)  correct (new)  accuracy (new)
cs 4269 4261 100.0% 4258 100.0%
da 5484 5255 96.0% 5188 95.0%
de 9608 8495 88.0% 9020 94.0%
en 9630 8796 91.0% 9188 95.0%
es 10133 9721 96.0% 9943 98.0%
fi 2241 2238 100.0% 2236 100.0%
fr 10067 9719 97.0% 9906 98.0%
id 10184 9869 97.0% 10061 99.0%
it 10167 9844 97.0% 9960 98.0%
nl 9680 8449 87.0% 9399 97.0%
no 10505 10148 97.0% 10015 95.0%
pl 9886 9833 99.0% 9852 100.0%
pt 9456 8720 92.0% 9170 97.0%
ro 4057 3791 93.0% 3993 98.0%
sv 9932 9670 97.0% 9762 98.0%
tr 10309 10145 98.0% 10251 99.0%
vi 10932 10832 99.0% 10832 99.0%
total 146540 139786 95.4% 143034 97.6%

The Twitter corpus (training and test) used to generate the profiles is based on tweets collected via the ‘sample’ method of the Twitter Streaming API, all annotated by myself.
The corpus sizes are as follows.

language training test
cs : Czech 3514 4342
da : Danish 4007 5645
de : German 44115 9998
en : English 44335 10167
es : Spanish 44976 10296
fi : Finnish 3400 2310
fr : French 44279 10410
id : Indonesian 44912 10395
it : Italian 44183 10562
nl : Dutch 45033 10109
no : Norwegian 4272 10721
pl : Polish 5040 10282
pt : Portuguese 44505 9749
ro : Romanian 3490 4151
sv : Swedish 44330 10232
tr : Turkish 45024 10828
vi : Vietnamese 5029 11065

These corpora were originally prepared for a new short-text detector, but I confirmed that, used as langdetect profiles, they also give better short-message detection performance than the bundled profiles. So the new profiles are published as well.
When using these profiles, the text to detect should be converted to lower case. langdetect tends to drop all-uppercase words as acronyms, while Twitter-like short messages are often written entirely in upper case for emphasis.
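
As a trivial sketch of that preprocessing step (the function below is only an illustration and not part of the langdetect API):

def preprocess_for_detection(text):
    """Lowercase the whole message before detection, because langdetect tends
    to drop all-uppercase words as acronyms, while tweets often use ALL CAPS
    for emphasis.  Illustration only -- not part of the langdetect API."""
    return text.lower()

print(preprocess_for_detection("THIS IS SO COOL, geweldig!"))   # -> "this is so cool, geweldig!"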

Of course, the existing profiles are still bundled as before. They have higher accuracy for news text and the like! 😀

The prototype of this short-text language detector is published at https://github.com/shuyo/ldig .
The name is short for “Language Detection with Infinity-Gram”.
I wrote a presentation about ldig, but it is still in Japanese only. I’ll translate it into English later…


langdetect is updated (added profiles for Estonian / Lithuanian / Latvian / Slovene, and so on)

My language detection library “langdetect” was updated.

The added features are the following:

  • Added 4 language profiles of Estonian, Lithuanian, Latvian and Slovene.
  • Support for retrieving the list of loaded language profiles via getLangList()
  • Support for generating a language profile from plain text

and fixed some bugs.

I also published a test data set for 21 languages based on the Europarl Parallel Corpus ( http://www.statmt.org/europarl/ ) so that anyone can verify the library under the same conditions.

It randomly samples 1000 sentences (lines) per language from the Europarl corpus.
Each line has the form “[language code]\t[plain text in UTF-8]” so that it can be used with the batch-test tool of langdetect.
The Europarl corpus contains some very short sentences (e.g. only 2 words!) that langdetect is not very good at. I kept them in for fairness! 😀
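
A minimal Python sketch of how such a test file could be built is below; the per-language file layout (one plain-text Europarl file per language, named like "ep-da.txt") is an assumption for illustration, not the actual script used to build europarl.test.

import glob
import os
import random

random.seed(0)  # fixed seed for reproducibility
with open("europarl.test", "w", encoding="utf-8") as out:
    # Assumed layout: one plain-text Europarl file per language, e.g. "ep-da.txt".
    for path in sorted(glob.glob("europarl/ep-??.txt")):
        lang = os.path.basename(path)[3:5]              # language code from the file name
        with open(path, encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
        for line in random.sample(lines, 1000):         # 1000 random sentences per language
            out.write(lang + "\t" + line + "\n")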

The following is a result with this test data.

bg (985/1000=0.99): {bg=985, ru=4, mk=11}
cs (993/1000=0.99): {sk=6, en=1, cs=993}
da (971/1000=0.97): {da=971, no=28, en=1}
de (998/1000=1.00): {de=998, da=1, af=1}
el (1000/1000=1.00): {el=1000}
en (997/1000=1.00): {fr=1, en=997, nl=1, af=1}
es (995/1000=1.00): {pt=4, en=1, es=995}
et (996/1000=1.00): {de=1, fi=2, et=996, af=1}
fi (998/1000=1.00): {fi=998, et=2}
fr (998/1000=1.00): {it=1, sv=1, fr=998}
hu (999/1000=1.00): {id=1, hu=999}
it (998/1000=1.00): {it=998, es=2}
lt (998/1000=1.00): {lv=2, lt=998}
lv (999/1000=1.00): {pt=1, lv=999}
nl (977/1000=0.98): {de=1, sv=1, nl=977, af=21}
pl (999/1000=1.00): {pl=999, nl=1}
pt (994/1000=0.99): {it=1, hu=1, pt=994, en=1, es=3}
ro (999/1000=1.00): {ro=999, fr=1}
sk (987/1000=0.99): {sl=2, sk=987, ro=1, lt=1, et=1, cs=8}
sl (972/1000=0.97): {hr=27, sl=972, en=1}
sv (990/1000=0.99): {da=2, no=8, sv=990}
total: 20843/21000 = 0.993

This result was obtained with the batchtest tool.

java -jar lib/langdetect.jar -d profiles -s 0 --batchtest europarl.test

The random seed (-s) is set to 0 for reproducibility.

Since Danish (da), Dutch (nl) and Slovene (sl) are very similar to Norwegian (no), Afrikaans (af) and Croatian (hr) respectively, their detection accuracies are lower than the others.
(Norwegian, though, has some letters of its own that are not used in Dutch, so its accuracy stays higher.)
