I have updated my language detection library “langdetect”.
This release adds the following:
- 4 new language profiles: Estonian, Lithuanian, Latvian and Slovene
- getLangList(), which returns the list of loaded language profiles
- Generation of a language profile from plain text
It also fixes some bugs.
I also published a test data set for 21 languages, based on the Europarl Parallel Corpus ( http://www.statmt.org/europarl/ ), so that anyone can verify the library under the same conditions.
It consists of 1000 sentences (lines) randomly sampled for each language from the Europarl corpus.
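How the sampling was actually done is not described here, so the following is only a minimal sketch of one common approach: shuffle the lines with a seeded random generator and keep the first n. The class name LineSampler and the method sample are hypothetical and are not part of langdetect.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of langdetect): draws a reproducible random sample of lines. */
final class LineSampler {
    /**
     * Returns up to n lines chosen at random from the input.
     * The input list is left untouched; the same seed always yields the same sample.
     */
    static List<String> sample(List<String> lines, int n, long seed) {
        List<String> copy = new ArrayList<>(lines);      // work on a copy
        Collections.shuffle(copy, new Random(seed));     // seeded shuffle
        int size = Math.min(n, copy.size());             // cap at the available number of lines
        return new ArrayList<>(copy.subList(0, size));
    }
}
```

Calling LineSampler.sample(lines, 1000, 0L) once per language file would give 1000 lines per language, assuming each file holds one sentence per line.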
Each line has the form “[language code]\t[plain text in UTF-8]” so that it can be fed to the batch test tool of this langdetect library.
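For anyone who wants to read the data set in their own code, here is a small sketch of splitting one line into its two fields. TestLineParser and parse are hypothetical names for illustration and are not part of langdetect. The sketch assumes the first tab on the line is the separator, and that the file has already been decoded as UTF-8.

```java
/** Hypothetical helper (not part of langdetect): splits one test-data line. */
final class TestLineParser {
    /**
     * Splits "[language code]\t[plain text]" at the first tab.
     * Returns a two-element array: {language code, text}.
     */
    static String[] parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("missing tab separator: " + line);
        }
        // Everything after the first tab is text, even if it contains further tabs.
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }
}
```

For example, TestLineParser.parse("en\tResumption of the session") gives the language code "en" and the text "Resumption of the session".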
The Europarl corpus contains some very short sentences (e.g. only 2 words!), which langdetect is not very good at. I left them in for fairness!
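The following is the result on this test data. Each line shows the true language, the number of correctly detected sentences out of the total with the resulting accuracy, and how many sentences were detected as each language.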
bg (985/1000=0.99): {bg=985, ru=4, mk=11}
cs (993/1000=0.99): {sk=6, en=1, cs=993}
da (971/1000=0.97): {da=971, no=28, en=1}
de (998/1000=1.00): {de=998, da=1, af=1}
el (1000/1000=1.00): {el=1000}
en (997/1000=1.00): {fr=1, en=997, nl=1, af=1}
es (995/1000=1.00): {pt=4, en=1, es=995}
et (996/1000=1.00): {de=1, fi=2, et=996, af=1}
fi (998/1000=1.00): {fi=998, et=2}
fr (998/1000=1.00): {it=1, sv=1, fr=998}
hu (999/1000=1.00): {id=1, hu=999}
it (998/1000=1.00): {it=998, es=2}
lt (998/1000=1.00): {lv=2, lt=998}
lv (999/1000=1.00): {pt=1, lv=999}
nl (977/1000=0.98): {de=1, sv=1, nl=977, af=21}
pl (999/1000=1.00): {pl=999, nl=1}
pt (994/1000=0.99): {it=1, hu=1, pt=994, en=1, es=3}
ro (999/1000=1.00): {ro=999, fr=1}
sk (987/1000=0.99): {sl=2, sk=987, ro=1, lt=1, et=1, cs=8}
sl (972/1000=0.97): {hr=27, sl=972, en=1}
sv (990/1000=0.99): {da=2, no=8, sv=990}
total: 20843/21000 = 0.993
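To make the numbers above easier to check, here is a small sketch of how one summary could be recomputed from the per-language detection counts. AccuracySummary and summarize are hypothetical names, not the batch test tool's own code, and the sketch assumes the counts map is not empty.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of langdetect): recomputes one summary from detection counts. */
final class AccuracySummary {
    /**
     * lang is the true language of the test sentences;
     * detected maps each detected language to the number of sentences assigned to it.
     * Returns text in the form "lang (correct/total=accuracy)".
     */
    static String summarize(String lang, Map<String, Integer> detected) {
        int total = 0;
        for (int count : detected.values()) {
            total += count;                               // all sentences of this language
        }
        int correct = detected.getOrDefault(lang, 0);     // sentences detected as the true language
        double accuracy = (double) correct / total;
        return String.format(Locale.ROOT, "%s (%d/%d=%.2f)", lang, correct, total, accuracy);
    }
}
```

With the Danish counts {da=971, no=28, en=1}, summarize("da", counts) gives "da (971/1000=0.97)". The overall figure is computed the same way from the sums: 20843 correct out of 21000.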
This was obtained with the batch test tool:
java -jar lib/langdetect.jar -d profiles -s 0 --batchtest europarl.test
The random seed (-s) is set to 0 for reproducibility.
Since Danish (da), Dutch (nl) and Slovene (sl) are very similar to Norwegian (no), Afrikaans (af) and Croatian (hr) respectively, their detection accuracies are lower than the others'.
(Norwegian, however, has some letters of its own that are not used in Dutch, so its accuracy stays higher.)