langdetect is updated (added profiles of Estonian / Lithuanian / Latvian / Slovene, and so on)

My language detection library “langdetect” has been updated.

The new features are as follows:

  • Added 4 language profiles: Estonian, Lithuanian, Latvian and Slovene.
  • Added getLangList(), which returns the list of loaded language profiles.
  • Added support for generating a language profile from plain text.

Some bugs were also fixed.

I also published a test data set for 21 languages based on the Europarl Parallel Corpus ( http://www.statmt.org/europarl/ ) so that anyone can verify the library under the same conditions.

It consists of 1000 sentences (lines) sampled at random for each language from the Europarl corpus.
Each line has the form “[language code]\t[plain text in UTF-8]” so that it can be fed to the batch test tool of this langdetect library.
The Europarl corpus contains some very short sentences (e.g. only 2 words!), which langdetect is not very good at. I kept them in for fairness! 😀
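
As a rough illustration of the line format described above, here is a minimal Java sketch that samples sentences with a fixed seed and writes or reads “[language code]\t[plain text]” lines. It is not the script that produced the published test set; the class name TestLineSketch and its methods are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; not the code used to build the published test data.
final class TestLineSketch {

    // Pick up to n sentences at random; a fixed seed makes the choice repeatable.
    static List<String> sample(List<String> sentences, int n, long seed) {
        List<String> copy = new ArrayList<>(sentences);
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, Math.min(n, copy.size())));
    }

    // Build one test line: language code, a tab, then the plain text.
    static String toLine(String lang, String text) {
        return lang + "\t" + text;
    }

    // Split a test line back into {language code, text} at the first tab.
    static String[] parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("missing tab separator: " + line);
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("Tere hommikust.", "Head aega.", "Palun.");
        for (String s : sample(sentences, 2, 0L)) {
            System.out.println(toLine("et", s));
        }
    }
}
```

When writing such lines to a file, the encoding should be set explicitly to UTF-8 (for example with java.nio.charset.StandardCharsets.UTF_8) rather than left to the platform default.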

The following is the result on this test data.

bg (985/1000=0.99): {bg=985, ru=4, mk=11}
cs (993/1000=0.99): {sk=6, en=1, cs=993}
da (971/1000=0.97): {da=971, no=28, en=1}
de (998/1000=1.00): {de=998, da=1, af=1}
el (1000/1000=1.00): {el=1000}
en (997/1000=1.00): {fr=1, en=997, nl=1, af=1}
es (995/1000=1.00): {pt=4, en=1, es=995}
et (996/1000=1.00): {de=1, fi=2, et=996, af=1}
fi (998/1000=1.00): {fi=998, et=2}
fr (998/1000=1.00): {it=1, sv=1, fr=998}
hu (999/1000=1.00): {id=1, hu=999}
it (998/1000=1.00): {it=998, es=2}
lt (998/1000=1.00): {lv=2, lt=998}
lv (999/1000=1.00): {pt=1, lv=999}
nl (977/1000=0.98): {de=1, sv=1, nl=977, af=21}
pl (999/1000=1.00): {pl=999, nl=1}
pt (994/1000=0.99): {it=1, hu=1, pt=994, en=1, es=3}
ro (999/1000=1.00): {ro=999, fr=1}
sk (987/1000=0.99): {sl=2, sk=987, ro=1, lt=1, et=1, cs=8}
sl (972/1000=0.97): {hr=27, sl=972, en=1}
sv (990/1000=0.99): {da=2, no=8, sv=990}
total: 20843/21000 = 0.993

This was obtained with the batch test tool:

java -jar lib/langdetect.jar -d profiles -s 0 --batchtest europarl.test

The random seed (-s) is set to 0 for reproducibility.
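
To make the result lines easier to read: each one gives the expected language, the number of correct detections out of the total with the resulting ratio, and a count of every language that was actually detected. The sketch below shows how such a line could be tallied from (expected, detected) pairs. It is only an assumption about the bookkeeping, not the batch test tool's actual code, and the class BatchTallySketch is hypothetical. Because it uses a sorted map, the keys come out in alphabetical order, which differs from the order in the output shown earlier.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; not the implementation of langdetect's batch test.
final class BatchTallySketch {

    // expected language -> (detected language -> count)
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

    // Record one detection result for a sentence whose true language is known.
    void record(String expected, String detected) {
        counts.computeIfAbsent(expected, k -> new TreeMap<>())
              .merge(detected, 1, Integer::sum);
    }

    // Format one line such as "da (971/1000=0.97): {da=971, en=1, no=28}".
    String summary(String lang) {
        Map<String, Integer> row = counts.getOrDefault(lang, new TreeMap<String, Integer>());
        int total = 0;
        for (int c : row.values()) {
            total += c;
        }
        int correct = row.getOrDefault(lang, 0);
        double accuracy = total == 0 ? 0.0 : (double) correct / total;
        return String.format(Locale.ROOT, "%s (%d/%d=%.2f): %s",
                lang, correct, total, accuracy, row);
    }

    public static void main(String[] args) {
        // Reproduce the Danish row from the counts reported in the post.
        BatchTallySketch tally = new BatchTallySketch();
        for (int i = 0; i < 971; i++) tally.record("da", "da");
        for (int i = 0; i < 28; i++) tally.record("da", "no");
        tally.record("da", "en");
        System.out.println(tally.summary("da"));
        // prints: da (971/1000=0.97): {da=971, en=1, no=28}
    }
}
```

The total is computed the same way over all rows: 20843 correct out of 21000 sentences gives 0.993 when rounded to three decimal places.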

Because Danish (da), Dutch (nl) and Slovene (sl) are very similar to Norwegian (no), Afrikaans (af) and Croatian (hr) respectively, their detection accuracies are lower than the others.
(Norwegian, however, has some letters of its own that are not used in Dutch, so its accuracy stays higher.)

This entry was posted in Language Detection, NLP.

2 Responses to langdetect is updated(added profiles of Estonian / Lithuanian / Latvian / Slovene, and so on)

  1. Pingback: Accuracy and performance of Google's Compact Language Detector - Blog - SearchWorkings.org

  2. Margus Waffa says:

    Thank You for this work!

    I come across your work when reading presentation in http://www.ndl.go.jp/jp/aboutus/pdf/IIPC_shibata.pdf
    Lets hope we will resolve all the problems on time.

    Many needed sites and sites of users, on public web should also have an option on cPanel and notified by email, that main server would them self make a copy of sites and send copy to archive centre by pipe made for that.
