Language Detection for twitter with 99.1% Accuracy

I have released ldig (Language Detection with Infinity-Gram), a new language detection (language identification) prototype for twitter.

It detects the language of tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, Dutch, Norwegian, Polish, Portuguese, Romanian, Swedish, Turkish and Vietnamese).
ldig specializes in noisy short text (more than 3 words) and is limited to Latin-alphabet languages, because input text can be separated into character-type blocks and detection among Latin-alphabet languages is the most difficult case.
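
As a rough illustration of separating text into character-type blocks, here is a minimal sketch using only the Python standard library. It is not ldig's own code: the helper names `char_type` and `split_by_char_type` are hypothetical, and taking the first word of the Unicode character name as the script label is a simplifying assumption.

```python
import unicodedata
from itertools import groupby


def char_type(ch):
    """Return a coarse script label for one character (illustrative heuristic)."""
    if not ch.isalpha():
        return "OTHER"
    # Unicode names start with the script, e.g. "LATIN SMALL LETTER A".
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "OTHER"


def split_by_char_type(text):
    """Group consecutive characters that share a script label into blocks."""
    return [(label, "".join(chars)) for label, chars in groupby(text, key=char_type)]


blocks = split_by_char_type("hello こんにちは привет")
# Only the Latin blocks would be handed to a Latin-alphabet detector.
latin_only = [block for label, block in blocks if label == "LATIN"]
print(blocks)
print(latin_only)
```

Once the blocks are separated this way, non-Latin blocks can go to other detectors, and only the hard case (Latin against Latin) remains.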

My language-detection library (langdetect) is not good at short text, so most users seem to have trouble with language detection for twitter.
langdetect uses character 3-grams as features, which is insufficient for short text.
I supposed that maximal substrings [Okanohara+ 09] would make sufficient features for short text detection, and prepared a twitter corpus in 17 languages.
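
To make the two kinds of features concrete, here is a small brute-force sketch. It only illustrates the idea and is not how ldig extracts features (ldig relies on a separate compiled maxsubst tool). The function names are hypothetical, and counting inside a single string instead of a whole corpus is an assumption made to keep the example short. A substring is treated as maximal when every one-character extension, to the left or to the right, occurs strictly fewer times.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def maximal_substrings(text, min_count=2):
    """Brute-force maximal substrings of text with at least min_count occurrences."""
    # Count every substring occurrence (quadratic; fine for a toy example only).
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    extendable = set()
    for sub, c in counts.items():
        if len(sub) < 2:
            continue
        # If a shorter string occurs exactly as often as its one-character
        # extension, it never appears without that character: not maximal.
        if counts[sub[1:]] == c:
            extendable.add(sub[1:])
        if counts[sub[:-1]] == c:
            extendable.add(sub[:-1])
    return {s: c for s, c in counts.items() if c >= min_count and s not in extendable}


trigrams = char_ngrams("gracias")
print(len(trigrams), trigrams)   # a 7-character word yields only 5 trigrams
maximal = maximal_substrings("abracadabra")
print(maximal)
```

A short tweet gives a 3-gram model very few features to work with, while maximal substrings have no fixed length, so a long and distinctive string can act as a single feature.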
The training/test corpus sizes and the evaluation results of the ldig prototype are shown below.

lang   training    test  correct  accuracy (%)
cs         4581    5329     5319         99.81
da         5480    5476     5308         96.93
de        43930    9659     9611         99.50
en        44912    9612     9497         98.80
es        44921   10127    10050         99.24
fi         4576    4490     4464         99.42
fr        44142   10063    10014         99.51
id        44873   10183    10163         99.80
it        44045   10152    10110         99.59
nl        44933    9677     9532         98.50
no         7525    8513     8192         96.23
pl        12854   10070    10059         99.89
pt        44464    9459     9359         98.94
ro         6114    5902     5812         98.48
sv        44339    9952     9870         99.18
tr        44787   10309    10301         99.92
vi        10413   10494    10481         99.88
total    496889  149467   148142         99.11
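
For readers who want to check the numbers, the short sketch below recomputes the accuracy column from the test and correct counts in the table above. The variable names are only for illustration. It also shows that the total row is pooled over all test tweets, which is not the same as the plain mean of the per-language accuracies.

```python
# (test, correct) counts copied from the table above
results = {
    "cs": (5329, 5319), "da": (5476, 5308), "de": (9659, 9611),
    "en": (9612, 9497), "es": (10127, 10050), "fi": (4490, 4464),
    "fr": (10063, 10014), "id": (10183, 10163), "it": (10152, 10110),
    "nl": (9677, 9532), "no": (8513, 8192), "pl": (10070, 10059),
    "pt": (9459, 9359), "ro": (5902, 5812), "sv": (9952, 9870),
    "tr": (10309, 10301), "vi": (10494, 10481),
}


def accuracy(test, correct):
    """Accuracy in percent."""
    return 100.0 * correct / test


per_language = {lang: round(accuracy(t, c), 2) for lang, (t, c) in results.items()}

# Pooled accuracy: every test tweet counts equally, so large test sets weigh more.
total_test = sum(t for t, _ in results.values())
total_correct = sum(c for _, c in results.values())
pooled = round(accuracy(total_test, total_correct), 2)

# Unweighted mean of the per-language accuracies, for comparison.
macro = sum(accuracy(t, c) for t, c in results.values()) / len(results)

print(per_language)
print(total_test, total_correct, pooled)
print(macro)
```

Danish and Norwegian pull the unweighted mean down more than the pooled figure because their test sets are smaller than most of the others.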

I’m preparing a Catalan corpus with some help (THANKS!), so the number of supported languages will increase soon.

ldig currently writes out its model in numpy binary format, but I will change it to a more portable format such as MessagePack, so that the ldig detector can probably be ported to other platforms easily.

The presentation of ldig is here (but this is written in Japanese! 😀 )

And I’ll present the paper at the annual conference of The Association for Natural Language Processing in Japan (NLP2012).
I’ll publish the paper on this blog after the conference (but it’s in Japanese! :P).


I have published slides about twitter language detection in English.


  • [Daisuke Okanohara and Jun’ichi Tsujii 09] “Text Categorization with All Substring Features”

25 Responses to Language Detection for twitter with 99.1% Accuracy

  1. crutis says:

    Would it be possible to get your training set, and add to it?

    • shuyo says:

      I’m thinking about what license and format I can publish it with.
      I reckon its format may be pairs of twitter ID and language label.
      As for the license … my company has not published data like this before, so I’m afraid I cannot decide without some agreement.

      • What about your test set?

        If you release just the test data, at least people can compare their approach to yours.

      • shuyo says:

        I admit that publishing it would be useful for most people working on the same theme.
        But because both the training and test sets consist of other people’s tweets, I’m afraid I must publish them very carefully.
        And at the beginning I did not have publishing in mind, so half of the data has no tweet id…
        That is another reason why publishing is a problem.

    • Yasen says:

      Nice job, Shuyo!

      Since twitter has some language detection (not very good at first sight), do you have any numbers for their accuracy on the corpus you’re working with and have you done any comparison with their system?

      Good luck!

      • shuyo says:

        The accuracy of twitter’s own language detection is very bad because speed has top priority in their system (they MUST handle billions of tweets per day!).
        So I am not interested in their accuracy and don’t use their estimated language in corpus creation.

      • Yasen says:

        Good point! Forgot about speed 🙂

  2. Pingback: Precision and Recall of ldig (twitter language detection) | Shuyo's Weblog

  3. Pingback: Estimation of ldig (twitter Language Detection) for LIGA dataset | Shuyo's Weblog

  4. Marco Lui says:

    Hello Shuyo!

    I have worked with others who have used Twitter as training data, and they also had concerns about releasing Twitter messages. Providing message id and language label would be enough, and I would also find this data very useful! It would be important to release this as soon as possible, as I have found that often messages become impossible to access. Simon Carter released language ground truth for 5000 messages as id-language pairs, but I was only able to recover about 80% of them from Twitter. I would also like some information about how you prepare the ground truth – is it done by human, or is there some automatic language identification involved?


    • shuyo says:

      Hi Marco,
      I must negotiate within my company because we have not opened such data before, so I can’t release it right now.
      But I want to, so sooner or later…
      I cannot help it that the more time passes, the fewer tweets remain accessible.

      I have prepared, and am still preparing, the corpus both automatically and manually.
      First, tweets are separated by users’ timezone and tentatively labeled with langdetect and other detectors.
      Then I check and correct the tentative labels.
      The more accurate ldig is, the more useful it is for discovering label errors by itself. 😀

  5. Marco Lui says:

    Hello Shuyo!

    I have more questions for you. Other researchers have pointed out that over 50% of messages on twitter are not in English and that the second most-common language is Japanese. It seems interesting to me that Japanese is not included in your training set! Another point I have noticed is that you have included Indonesian but not Malay. In my experience I have found that these two are quite difficult to distinguish. I suspect that your system will identify many Malay messages as Indonesian. Will you try to distinguish Malay from Indonesian eventually? I would also like to share with you the Indigenous Tweets project of Kevin Scannell, who has been collecting Twitter messages in a large number of low-density languages.


    • shuyo says:

      Hi Marco,
      Yes, I know Japanese is the second most common language on twitter.
      I want to develop a short-text language detector for each character type, and I consider Latin-alphabet languages the most difficult.
      That is why ldig doesn’t support Japanese.
      I also guess that Japanese and Chinese, which share Kanji as a character type, can be distinguished more easily because Japanese additionally has Kana.

      I want to support Malay too, but I reckon it is too difficult for me…
      As far as I have read,
      I’m afraid neither ldig NOR I can distinguish Indonesian and Malay. 😀

      > Indigenous Tweets project

      Oh, it is very interesting!!
      I avoid collecting tweets by user because I guess such a corpus may be highly biased.
      But that is not realistic for minor languages. 😀
      I dream of supporting all languages, but it is impossible by myself…

  6. colin says:

    Hi, I tested your library and so far it works great for me.
    My question: is it possible to call it for one tweet at a time?
    E.g. in the __init__ method, I would load the model or ldig, then each time I receive a tweet, I call a method with the text of the tweet and it gives me back the identified language.
    I was thinking about quickly doing it myself .. just asking in case you already have it.
    thanks for sharing,
    bests –

    • shuyo says:

      You can do it if you create a class for it.
      I have not prepared anything for it yet.
      I am planning to speed up ldig’s learning first, so I may prepare it after that…

  7. zahra says:

    I want to use ldig for Farsi (Persian). Is it possible to give it a Persian corpus and train it for Farsi?
    As you may know, Persian is very similar to Urdu and Arabic too.
    In the presentation I found some command like: -m [] --init [] -x [] --ff=[]
    Can I use it with my own corpus in Farsi?

    • shuyo says:

      I want to do it if I can, but I don’t have enough corpus for Arabic, Persian and so on.
      Though a slight modification is necessary because ldig has some code specialized for the Latin alphabet, I expect the --init and --learning commands can generate an ldig model for Arabic-alphabet languages such as the ones you mentioned.

      • zahra says:

        thanks for your reply 🙂
        But when I use the --init command with a Persian corpus, I receive this message:
        no label data at 1 in UPEC.txt
        no label data at 2 in UPEC.txt
        no label data at 3 in UPEC.txt
        no label data at 4 in UPEC.txt

        and I use this command:
        python -m models/model.small --init UPEC.txt -x maxsubst/maxsubst.vcproj --ff=2

        For the -x option I use the “maxsubst.vcproj” file in the ldig files you uploaded,
        and the format of the corpus file is like this:
        اولین ADJ_SUP
        سیاره N_SING
        خارج P
        از P
        منظومه N_SING
        شمسی ADJ_SIM
        دیده V_PP
        شد V_PRE

      • shuyo says:

        about data format.
        The maxsubst option needs the executable extractor, so you need to build the maxsubst tool first, e.g.
        $ g++ -I cybozulib/include/ -O3 -o maxsubst maxsubst.cpp

  8. Dear Shuyo

    Thank you for ldig! I have tried it with some Italian tweets and it works very well.

    I am interested in using ldig models from another platform (erlang). I would like to build models with ldig, and then use the models from a “recogniser” written in erlang.

    You wrote:

    > ldig write out model as numpy binary format now, but I will modify it
    > into more portable format, MessagePack like, then detector of ldig
    > can probably port in other platforms easily.

    Did you make progress writing out models in a portable format?

    If you are agreeable, I would like to complete the necessary changes to (a fork of) ldig, and write the erlang recogniser. I have been authorised to release this code as open-source.

    Please could we discuss further by email?

    With thanks and best wishes


    (ps couldn’t comment from wordpress account)

    • shuyo says:

      Dear Ivan,

      Thank you for trying ldig!
      I am writing a faster implementation of ldig in C++ here.
      It handles the model in a more portable format, but the format may change and there is little documentation yet.
      If you are interested in it, I would be happy for you to read its code and ask me about it.


      • Dear Shuyo

        Thanks for your reply (sorry for late response).

        I saw the cpp branch after I wrote here. Unfortunately I got tangled up in dependencies (CMake, Boost, running on an old Mac, …). Meanwhile I have implemented the LIGA algorithm in python (model builder) and erlang (classifier) (I don’t really mind if the model builder is slow;).

        I am interested in your improvements / alternatives to LIGA so when and liga.erl are in their “optimisation” phase I’ll return and study your presentations :).

        Best wishes


  9. stepraux says:

    Hi Shuyo,

    Thank you for ldig, it is very impressive !

    I would like to use it as a java package, do you know if someone did a java implementation, or is planning to do so ?

    I will have a look at your cpp implementation.



  10. showdep says:

    Shuyo, thanks a lot for making it available for all. Question: how can I train ldig on my dataset and create my models? Thanks!
