Short Text Language Detection with Infinity-Gram

I gave a talk about language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology).
These are the slides.

Tweets are too short to detect their languages precisely. I guess one reason is that the features extracted from such a short text are not enough for detection.
Another reason is that tweets contain unique representations, for example, u for you, 4 for for, LOL, F4F, various emoticons, and so on.
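To illustrate why short texts provide so few features, here is a rough sketch of character n-gram extraction. This is a hypothetical helper, not ldig's actual feature extractor (ldig builds a trie over substrings of unbounded length, hence "infinity-gram"), but it shows how quickly the feature count shrinks with text length:

```python
def char_ngrams(text, n_max=3):
    # collect all character n-grams of length 1..n_max
    feats = []
    for n in range(1, n_max + 1):
        feats += [text[i:i + n] for i in range(len(text) - n + 1)]
    return feats

# a full word yields many overlapping features;
# a tweet-style abbreviation yields almost none
print(len(char_ngrams("hello")))  # 12 features
print(len(char_ngrams("u")))      # 1 feature
```

With only a handful of such features per tweet, a classifier has very little evidence to distinguish closely related languages.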

I developed ldig, a prototype short-text language detector that solves those problems.
ldig can detect the languages of tweets with over 99% accuracy across 19 languages.

The above slide explains how ldig solves those problems.

This entry was posted in Language Detection, NLP.

24 Responses to Short Text Language Detection with Infinity-Gram

  1. Jordan Dimov says:

    Great work.

    Just one minor issue. I cloned your git repository and ran it, but it didn't work for me out of the box. When I load the page, I see the textbox, but no results. On closer inspection, the AJAX calls go through, but the server gives the following error: "TypeError: 1013 is not JSON serializable" (I'm using Python 2.7).

    Turns out this is because that particular instance of the number 1013 is of type numpy.int32, which the standard JSON encoder in Python2.7 can not serialize. The solution is to have a custom encoder like this:

    import json
    import numpy

    class CustomJSONEncoder(json.JSONEncoder):
        def default(self, obj):
            # convert numpy integers to plain Python ints for serialization
            if isinstance(obj, numpy.int32):
                return int(obj)
            return json.JSONEncoder.default(self, obj)

    Then the call to json.dump should look like this:

    json.dump(detector.detect(text), self.wfile, cls=CustomJSONEncoder)

    And that fixes it.

    • shuyo says:

      Thank you very much!
      Although I am using Python 2.7.2, I could not reproduce your problem in my environment…
      The current version of json may support numpy.int32 (I haven't confirmed it yet).

      • shuyo says:

        Mmm, I tried some checks.
        – The json module in Python 2.7.2 doesn't support numpy objects either.
        – ldig doesn't put numpy objects into json.dump (all numerics are converted into formatted strings)
        So I guess your problem cannot happen. Did you modify the code somewhere?

  2. dbv says:

    Shuyo: Are there plans to support the CJK languages. Also, any plans for Python 3?

    • shuyo says:

      Since CJK detection is not as difficult as Latin-alphabet languages and I am not using Python 3, I have neither of the plans you mentioned at present.

  3. Randy says:

    Thanks for the great work, Shuyo! I'm relatively new to Python and I'm trying to use this library for a school project. I'm having a little trouble figuring out how to run it against a tweet string using the predefined training dictionaries. Does anyone know of any examples of this posted anywhere?

  4. Greg says:

    Dear Shuyo,

    Your work on language detection is really impressive!

    I have tested ldig on short European sentences (not Twitter-like) and it looks very promising (even though I don't have any annotated/gold data, so I can't do a truly accurate test; it's more of an impression from a bit of manual reading). I have also read that you may provide Catalan soon, and I hope you will do so.

    I am interested in only a few of the languages that ldig supports, so I am wondering if there is any possibility to constrain the probability calculation so that it chooses among only a subset of the supported languages. For example, I have noticed that it may be difficult to distinguish French from Romanian; since I do not expect to have Romanian input, French should be the only choice. Any clue on how I should do that?

    Last but not least, is there any way to remove the NumPy dependency so that it is pure Python?

    • shuyo says:

      Though I replied to you on mail, I’ll also copy the answer here.

      ldig can only detect its supported languages and tends to force every text into one of them.
      As the distributed model of ldig doesn't support Romanian,
      many Romanian texts may be detected as some Romance language like
      French, as you mentioned.
      Solving this conflict would require supporting Romanian, but I
      cannot do that because I don't have a large enough corpus in Romanian…

      Although excluding numpy from ldig is not impossible, I guess it would
      be considerably slower.
      So I have no plan to do it. Sorry.

      • Greg says:

        thank you for your replies !

        OK, so I think I misformulated my question: is there a way to choose/force the calculation so that ldig will totally ignore some languages it supports? (I am sure that I will only have input from a smaller subset of the 19, so I do not want any misdetection.)

      • shuyo says:

        Oh I see.
        Although ldig cannot do it, modifying ldig makes it possible.
        The predict method has the following lines.

        exp_w = numpy.exp(sum_w - sum_w.max())
        return exp_w / exp_w.sum()

        If you want to ignore Romanian (in the previous comment I mistakenly said the bundled model does not support Romanian… sorry…), insert code to eliminate the probabilities of the target languages between them:

        exp_w = numpy.exp(sum_w - sum_w.max())
        exp_w[15] = 0  # turn the Romanian probability into 0
        return exp_w / exp_w.sum()

        The language's index corresponds to the index of the language label in labels.json in the model directory.
        I hope this works for you.
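The masking trick above can be sketched as a standalone function. The name and signature here are hypothetical (in ldig the change goes inside the predict method itself), but the logic is the same: zero out the excluded languages' scores before normalizing, so the remaining probabilities still sum to 1:

```python
import numpy

def predict_subset(sum_w, exclude=()):
    # softmax over per-language scores, forcing excluded
    # language indices to zero probability
    exp_w = numpy.exp(sum_w - sum_w.max())
    for i in exclude:
        exp_w[i] = 0.0
    return exp_w / exp_w.sum()

# with index 2 excluded, the winner must come from indices 0 and 1
scores = numpy.array([1.0, 2.0, 3.0])
print(predict_subset(scores, exclude=[2]).argmax())  # 1
```

Note that this renormalizes after masking, so the output is still a proper probability distribution over the allowed languages.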

  5. Khanh says:

    Dear Shuyo,
    Your work is very impressive. Do you plan to extend your work to allow detecting Japanese on Twitter? By the way, I tried the Java langdetect library for detecting Japanese tweets, but there is the following problem:
    com.cybozu.labs.langdetect.LangDetectException: no features in text
    at com.cybozu.labs.langdetect.Detector.detectBlock(
    at com.cybozu.labs.langdetect.Detector.getProbabilities(
    at com.cybozu.labs.langdetect.Detector.detect(
    Would you please recommend a solution for the above problem, or another way of detecting the language of Japanese tweets? Thank you.

    • shuyo says:

      This exception is thrown when the target text has no available substring for detection.
      So please catch LangDetectException and handle it suitably (e.g. set the detected label to null).

      Since Japanese/Chinese detection is not as difficult as Latin-alphabet languages, I don't have such a plan at present. Sorry.

  6. Paul LaCrosse says:

    Any interest in a Scala version? This would allow python-like functional code, yet still run on the JVM ecosystem…

    • shuyo says:

      I have no plan to implement it in other platforms for the present. Sorry.
      Since ldig is released under the MIT license, you can re-implement it freely 😀

  7. Dear Shuyo, your work is very impressive. I'd like to test it under Python 2.7. I read the usage notes, but unfortunately it is not clear to me how to run your tool. Could you please provide me a sample file and a sample of the command line syntax?

    The error I have is:
    C:\Python27\Scripts\ldig-master> -m "c:\Python27\Scripts\ldig-master\models" "C:\Python27\Scripts\ldig-master\tweets.txt"
    Usage: [options] error: features file doesn't exist

    The file I use is:
    en\thello world
    en\tciao a tutti

    Thanks a lot

  8. colin says:

    Hi Shuyo,
    very nice library. I tested it last week and compared your results with the results provided by Twitter (since March 26, all tweets have a lang attribute in the Twitter online API). With Twitter, around 50% of tweets have no language, so I will continue with your library.
    Just two questions:
    1/ Is there an easy way to do online language detection? Something like calling a method with a tweet as input and the language as output.
    2/ Why do you need to provide the language name as input?

  9. colin says:

    OK, I looked at the code and made a quick patch to do online language detection.
    Since I don't know the language of the message, I don't provide anything. The results look OK, so I'll leave it like this.

    It means that the likelihood method simplifies to these 7 lines:
    label, text, org_text = normalize_text(msg)

    events = self.trie.extract_features(u"\u0001" + text + u"\u0001")
    y = predict(self.paramdata, events)
    predict_k = y.argmax()

    predict_lang = self.labels[predict_k]
    if y[predict_k] < 0.6: predict_lang = ""
    return predict_lang
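Colin's final thresholding step can be sketched in isolation. The function name and the example probabilities below are made up for illustration; the pattern is simply: accept the top language only when its probability clears a confidence threshold, otherwise report "unknown":

```python
import numpy

def pick_language(probs, labels, threshold=0.6):
    # return the top label only when its probability clears
    # the threshold; otherwise return "" for "unknown"
    k = int(probs.argmax())
    return labels[k] if probs[k] >= threshold else ""

labels = ["de", "en", "fr"]
print(pick_language(numpy.array([0.1, 0.8, 0.1]), labels))    # en
print(pick_language(numpy.array([0.4, 0.35, 0.25]), labels))  # "" (low confidence)
```

The 0.6 cutoff in colin's patch is a tunable trade-off: raising it reduces misdetections on ambiguous short texts at the cost of leaving more tweets unlabeled.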

  10. jebesen says:

    Hi Shuyo,

    great work.

    I am extremely interested in adding Catalan as a language to detect. Any guide on how to do it?


    • shuyo says:

      I created the Catalan corpus with the cooperation of Catalan speakers 😀
      Though some languages need particular normalization for accuracy, Catalan doesn't.
