I released a new prototype, Language Detection (Language Identification) with Infinity-Gram (ldig), a language detector for Twitter.
It detects tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, Dutch, Norwegian, Polish, Portuguese, Romanian, Swedish, Turkish and Vietnamese).
ldig is specialized for noisy short text (more than 3 words) and is limited to Latin-alphabet languages, because input text can be separated into blocks by character type and detection within the Latin alphabet is the most difficult.
My language-detection library (langdetect) is not good at short text detection, so most users seem to have trouble with language detection for Twitter.
langdetect uses character 3-grams as features, which is insufficient for short text detection.
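To make the point about short text concrete, here is a minimal sketch of character 3-gram extraction. It is a simplified illustration, not langdetect's actual feature extraction code, and the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text, n=3):
    """Return the list of overlapping character n-grams in `text`."""
    # A text of length L yields L - n + 1 n-grams (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# A very short message yields only a handful of 3-gram features.
tweet = "good night"
trigrams = char_ngrams(tweet)
print(len(trigrams), trigrams)
```

A two-word message gives fewer than ten trigrams, so a detector that relies on them alone has very little evidence to work with.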
I supposed that maximal substrings [Okanohara+ 09] would make sufficient features for short text detection, and prepared a Twitter corpus in 17 languages.
The training/test corpus sizes and the evaluation results of the ldig prototype are below.
| lang | training | test | correct | accuracy (%) |
|---|---|---|---|---|
| cs | 4581 | 5329 | 5319 | 99.81 |
| da | 5480 | 5476 | 5308 | 96.93 |
| de | 43930 | 9659 | 9611 | 99.50 |
| en | 44912 | 9612 | 9497 | 98.80 |
| es | 44921 | 10127 | 10050 | 99.24 |
| fi | 4576 | 4490 | 4464 | 99.42 |
| fr | 44142 | 10063 | 10014 | 99.51 |
| id | 44873 | 10183 | 10163 | 99.80 |
| it | 44045 | 10152 | 10110 | 99.59 |
| nl | 44933 | 9677 | 9532 | 98.50 |
| no | 7525 | 8513 | 8192 | 96.23 |
| pl | 12854 | 10070 | 10059 | 99.89 |
| pt | 44464 | 9459 | 9359 | 98.94 |
| ro | 6114 | 5902 | 5812 | 98.48 |
| sv | 44339 | 9952 | 9870 | 99.18 |
| tr | 44787 | 10309 | 10301 | 99.92 |
| vi | 10413 | 10494 | 10481 | 99.88 |
| total | 496889 | 149467 | 148142 | 99.11 |
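The accuracy column is the number of correctly detected test tweets divided by the number of test tweets, as a percentage. A short sketch that recomputes a few rows of the table (the dictionary and function names are illustrative only):

```python
# (test, correct) counts copied from the table above.
results = {
    "cs": (5329, 5319),
    "da": (5476, 5308),
    "no": (8513, 8192),
    "total": (149467, 148142),
}


def accuracy(test, correct):
    """Accuracy in percent: correct detections over test tweets."""
    return 100.0 * correct / test


for lang, (test, correct) in results.items():
    print(f"{lang}: {accuracy(test, correct):.2f}")
```

Danish and Norwegian have the lowest accuracy in the table.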
I’m preparing a Catalan corpus with some help (THANKS!), so the number of supported languages will increase soon.
ldig currently writes out its model in numpy binary format, but I will change it to a more portable format such as MessagePack, so that the ldig detector can probably be ported to other platforms easily.
The presentation of ldig is here (but it is written in Japanese!)
- Language Detection with Infinity-Gram (in Japanese)
And I’ll present the paper at the annual conference of the Association for Natural Language Processing in Japan (NLP2012).
I’ll publish the paper on this blog after the conference (but it’s in Japanese!).
P.S.
I published slides about Twitter language detection in English.
Reference
- [Daisuke Okanohara and Jun'ichi Tsujii 09] “Text Categorization with All Substring Features”
Would it be possible to get your training set, and add to it?
I’m thinking about what license and format I can publish it with.
I reckon its format may be pairs of tweet ID and language label.
As for the license … my company has not published data like this before, so I’m afraid I cannot decide without some agreement.
What about your test set?
If you release just the test data, at least people can compare their approach to yours.
I admit that publishing it would be useful for people working on the same theme.
But because both the training and test sets consist of other people’s tweets, I’m afraid I have to publish them very carefully.
And at the beginning I did not have publishing in mind, so half of the data has no tweet ID…
That is another reason why publishing is difficult.
Hello Shuyo!
I have worked with others who have used Twitter as training data; they also had concerns about releasing Twitter messages. Providing message ID and language label would be enough, and I would also find this data very useful! It would be important to release this as soon as possible, as I have found that messages often become impossible to access. Simon Carter released language ground truth for 5000 messages (http://ilps.science.uva.nl/resources/twitterlid) as id-language pairs, but I was only able to recover about 80% of them from Twitter. I would also like some information about how you prepare the ground truth – is it done by humans, or is there some automatic language identification involved?
Cheers
Marco
Hi Marco,
I must negotiate within my company because we have not released such data before, so I can’t release it right now.
But I want to, so sooner or later…
I cannot help it that the more time passes, the fewer tweets remain accessible.
I prepared, and am still preparing, the corpus both automatically and manually.
First, tweets are separated by the users’ time zone and tentatively labeled with langdetect and other detectors.
Then I check and correct the tentative labels.
The more accurate ldig becomes, the more useful it is for discovering label errors by itself.
Hello Shuyo!
I have more questions for you. Other researchers have pointed out that over 50% of messages on twitter are not in English (http://semiocast.com/downloads/Semiocast_Half_of_messages_on_Twitter_are_not_in_English_20100224.pdf) and that the second most-common language is Japanese. It seems interesting to me that Japanese is not included in your training set! Another point I have noticed is that you have included Indonesian but not Malay. In my experience I have found that these two are quite difficult to distinguish. I suspect that your system will identify many Malay messages as Indonesian. Will you try to distinguish Malay from Indonesian eventually? I would also like to share with you the Indigenous Tweets project of Kevin Scannell, who has been collecting Twitter messages in a large number of low-density languages: http://indigenoustweets.com/
Cheers
Marco
Hi Marco,
Yes, I know Japanese is the second most common language on Twitter.
I want to develop a short-text language detector for each character type, and I consider Latin-alphabet languages the most difficult.
That is why ldig doesn’t support Japanese.
I also guess that Japanese and Chinese, which share Kanji as a character type, can be distinguished more easily because Japanese additionally has Kana.
I want to support Malay too, but I reckon it is too difficult for me…
As far as I can tell from http://en.wikipedia.org/wiki/Differences_between_Malaysian_and_Indonesian ,
I doubt whether ldig AND I can distinguish Indonesian from Malay.
> Indigenous Tweets project
Oh, it is very interesting!!
I avoid collecting tweets by user because I guess such a corpus may be highly biased.
But that is not realistic for minor languages.
I dream of supporting all languages, but it is impossible by myself…
Hi, I tested your library and so far it works great for me.
My question: is it possible to call it for one tweet at a time?
E.g. in the __init__ method, I would load the model or ldig, then each time I receive a tweet, I call a method with the text of the tweet and it gives me back the identified language.
I was thinking about quickly doing it myself, but I am asking just in case you already have it.
thanks for sharing,
bests -
Hi,
You can do it if you create a class for it.
I have not prepared anything for it yet.
I am planning to speed up ldig’s training, and I may prepare it after that…
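As a rough sketch of the "load once, detect per tweet" pattern discussed above: the class below takes two caller-supplied functions, one that loads a model and one that classifies a single text. None of these names are ldig's API. `TweetLanguageDetector`, `load_toy_model` and `classify_toy` are hypothetical, and the toy keyword model only stands in for a real detector so that the example runs.

```python
class TweetLanguageDetector:
    """Load a model once, then classify one text per call."""

    def __init__(self, load_model, classify):
        # Hypothetical hooks: load_model() returns a model object,
        # classify(model, text) returns a language label.
        self._model = load_model()  # the expensive step runs only once
        self._classify = classify

    def detect(self, text):
        return self._classify(self._model, text)


# Toy stand-ins (NOT ldig): a keyword set per language.
def load_toy_model():
    return {"en": {"the", "and"}, "de": {"der", "und"}}


def classify_toy(model, text):
    words = set(text.lower().split())
    # Pick the language whose keyword set overlaps the text most.
    return max(model, key=lambda lang: len(words & model[lang]))


detector = TweetLanguageDetector(load_toy_model, classify_toy)
print(detector.detect("the cat and the dog"))
print(detector.detect("der Hund und die Katze"))
```

To use a real detector, the two toy functions would be replaced by code that loads the trained model and scores a single text.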
Hello
I want to use ldig for Farsi (Persian). Is it possible to give it a Persian corpus and train it for Farsi?
As you may know, Persian is very similar to Urdu and Arabic too.
In the presentation I found a command like: ldig.py -m [] --init [] -x [] --ff=[]
Can I use it with my own corpus in Farsi?
I want to do it if I can, but I don’t have a large enough corpus of Arabic, Persian and so on.
Although slight modifications are necessary because ldig has some code specialized for the Latin alphabet, I expect the --init and --learning commands can generate an ldig model for Arabic-alphabet languages like the ones you mentioned.
thanks for your reply
but when I use the --init command with a Persian corpus, I receive this message:
no label data at 1 in UPEC.txt
no label data at 2 in UPEC.txt
no label data at 3 in UPEC.txt
no label data at 4 in UPEC.txt
.
.
and I use this command:
python ldig.py -m models/model.small --init UPEC.txt -x maxsubst/maxsubst.vcproj --ff=2
For the -x option I use the “maxsubst.vcproj” file in the ldig files you uploaded,
and the format of the corpus file is like this:
اولین ADJ_SUP
سیاره N_SING
خارج P
از P
منظومه N_SING
شمسی ADJ_SIM
دیده V_PP
شد V_PRE
See
http://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447/60
for the data format.
The -x option needs the maxsubst extractor executable rather than the project file, so it is necessary to build the maxsubst tool, e.g.
$ g++ -I cybozulib/include/ -O3 -o maxsubst maxsubst.cpp