Short Text Language Detection with Infinity-Gram

I talked about language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology).
Here are the slides.


Tweets are too short for their language to be detected precisely. One reason, I suspect, is that the features extracted from such a short text are simply too few for reliable detection.
Another reason is that tweets contain some unique representations, for example u for you, 4 for for, LOL, F4F, various emoticons, and so on.

I developed ldig, a prototype short-text language detector, to solve these problems.
ldig can detect the language of tweets with over 99% accuracy across 19 languages.

The slides above explain how ldig solves these problems.
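
To make the sparsity point concrete, here is a small illustration (this is not ldig's actual feature extraction, which uses maximal substrings, i.e. "infinity-grams", over a trie): counting how many character n-gram features a tweet-length text yields versus an ordinary-length one.

# -*- coding: utf-8 -*-
# Illustration only: short texts yield far fewer character n-gram
# features than ordinary-length texts.
def char_ngrams(text, n_max=3):
    grams = set()
    for n in range(1, n_max + 1):
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print len(char_ngrams(u"u up? lol"))  # tweet-like: only a handful of features
print len(char_ngrams(u"Tweets are too short to detect their language precisely."))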

27 thoughts on “Short Text Language Detection with Infinity-Gram”

  1. Great work.

    Just one minor issue. I cloned your git repository and ran server.py, but it didn’t work for me out of the box. When I load the page, I see the textbox but no results. On closer inspection, the AJAX calls go through, but the server gives the following error: “TypeError: 1013 is not JSON serializable” (I’m using Python 2.7).

    Turns out this is because that particular instance of the number 1013 is of type numpy.int32, which the standard JSON encoder in Python 2.7 cannot serialize. The solution is to have a custom encoder like this:

    import json
    import numpy

    class CustomJSONEncoder(json.JSONEncoder):
        def default(self, obj):
            # convert numpy integers to plain Python ints before encoding
            if isinstance(obj, numpy.int32):
                return int(obj)
            return json.JSONEncoder.default(self, obj)

    Then the call to json.dump should look like this:

    json.dump(detector.detect(text), self.wfile, cls=CustomJSONEncoder)

    And that fixes it.

    1. Thank you very much!
      Although I am also using Python 2.7.2, I could not reproduce your trouble in my environment…
      The current version of json may support numpy.int32 (I haven’t confirmed it yet).

      1. Hmm, I tried some checks.
        – The json module in Python 2.7.2 does not support numpy objects either.
        – server.py in ldig does not pass numpy objects into json.dump (all numerics are converted into formatted strings).
        So I guess your trouble should not happen. Did you modify the code somewhere?

    1. Thanks.
      Since CJK detection is not as difficult as that of Latin-alphabet languages, and I am not using Python 3, I have neither of the plans you mentioned at present.

  2. Thanks for the great work, Shuyo! I’m relatively new to Python and I’m trying to use this library for a school project. I’m having a little trouble figuring out how to run it against a tweet string using the predefined training dictionaries. Does anyone know of any examples of this posted anywhere?

    1. Prepare test data in which each line is formatted as below.

      [language label]\t[text]

      The language labels are like en, fr or de (all available language labels are listed under “Supported languages” at https://github.com/shuyo/ldig ) and ‘\t’ means a TAB character (\x09).

      Then run ldig.py like below.

      ldig.py -m [model directory] [test data path]

      I hope it works well.
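
      For example, assuming the bundled Latin model directory is models/model.latin (the directory name is illustrative; use whichever model directory the repository ships) and tweets.txt contains TAB-separated lines such as “en<TAB>hello world”, the call would look like this:

      python ldig.py -m models/model.latin tweets.txt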

  3. Dear Shuyo,

    Your work on language detection is really impressive!

    I have tested ldig on small European sentences (not Twitter-like) and it looks very promising (even if I do not have any annotated/gold data, so I cannot do a really accurate test; it is more a feeling from a little manual reading). Also, I have read that you may provide Catalan soon, and I hope you will do so.

    I am interested in only a few of the languages that ldig supports, so I am wondering whether there is any possibility of constraining the probability calculation so that it chooses among only a subset of the supported languages. For example, I have noticed that it may be difficult to distinguish French from Romanian; since I do not expect any Romanian input, French should be the only choice. Any clue on how I should do that?

    Last but not least, is there any way to remove NumPy so that it depends only on pure Python?

    1. Though I replied to you by mail, I’ll also copy the answer here.

      ----
      ldig can only detect its supported languages, and it tends to force a detection among them.
      Since the distributed model of ldig doesn’t support Romanian, many Romanian texts may be detected as some Romance language like French, as you mentioned.
      Solving this conflict would require supporting Romanian, but I cannot do that because I don’t have a large enough Romanian corpus…

      Although excluding numpy from ldig is not impossible, I guess it would be considerably slower then.
      So I have no plan to do it. Sorry.

      1. Thank you for your replies!

        OK, so I think I misformulated my question: is there a way to force the calculation so that ldig totally ignores some of the languages it supports? (I am sure that I will only have input from a smaller subset of the 19, so I do not want any misdetection.)

      2. Oh, I see.
        Although ldig cannot do it as-is, modifying ldig makes it possible.
        The predict method in ldig.py has the following lines.

        exp_w = numpy.exp(sum_w - sum_w.max())
        return exp_w / exp_w.sum()

        If you want to ignore Romanian (in the previous comment I mistakenly said the bundled model does not support Romanian… sorry…), insert code that zeroes out the probabilities of the target languages, like this.

        exp_w = numpy.exp(sum_w - sum_w.max())
        exp_w[15] = 0 # turn the Romanian probability into 0
        return exp_w / exp_w.sum()

        The language index corresponds to the index of its label in labels.json in the model directory.
        I hope this works for you.
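
        If you’d rather not hardcode the index, you can look it up from labels.json; a small sketch (model_dir stands for whatever directory you pass with -m, and “ro” is assumed to be the Romanian label):

        import json, os

        # labels.json lists the language labels in the same order as
        # the indexes of sum_w / exp_w
        labels = json.load(open(os.path.join(model_dir, "labels.json")))
        exp_w[labels.index("ro")] = 0  # zero out Romanian by label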

  4. Dear Shuyo,
    Your work is very impressive. Do you plan to extend it to detect Japanese on Twitter? By the way, I tried the Java langdetect library for detecting Japanese tweets, but I ran into the following problem:
    com.cybozu.labs.langdetect.LangDetectException: no features in text
    at com.cybozu.labs.langdetect.Detector.detectBlock(Detector.java:234)
    at com.cybozu.labs.langdetect.Detector.getProbabilities(Detector.java:220)
    at com.cybozu.labs.langdetect.Detector.detect(Detector.java:208)
    Could you please recommend a solution for the above problem, or for detecting Japanese tweets? Thank you.

    1. This exception is thrown when the target text has no substring available for detection.
      So please catch LangDetectException and handle it suitably (e.g. set the detected label to null).

      Since Japanese/Chinese detection is not as difficult as that of Latin-alphabet languages, I have no such plan at present. Sorry.

  5. Any interest in a Scala version? This would allow Python-like functional code, yet still run on the JVM ecosystem…

    1. I have no plan to implement it on other platforms for the present. Sorry.
      Since ldig is licensed under the MIT license, you can re-implement it freely 😀

  6. Dear Shuyo, your work is very impressive. I’d like to test ldig.py under Python 2.7. I read the usage notes at https://github.com/shuyo/ldig but unfortunately it is not clear to me how to run your tool. Could you please provide a sample file and a sample of the command-line syntax?

    The error I have is:
    C:\Python27\Scripts\ldig-master>ldig.py -m "c:\Python27\Scripts\ldig-master\models" "C:\Python27\Scripts\ldig-master\tweets.txt"
    Usage: ldig.py [options]

    ldig.py: error: features file doesn’t exist

    The file I use is:
    en\thello world
    en\tciao a tutti

    Thanks a lot
    A.Z

  7. Hi Shuyo,
    Very nice library. I tested it last week and compared your results with the results provided by Twitter (since March 26, all tweets have a lang attribute in the Twitter online API). With Twitter, around 50% of tweets have no language, so I will continue with your library.
    Just two questions:
    1/ Is there an easy way to do online language detection? Something like calling a method with a tweet as input and its language as output.
    2/ Why does the language name need to be provided as input?
    thanks
    colin

  8. OK, I looked at the code and made a quick patch for online language detection.
    Since I don’t know the language of the message, I don’t provide anything for the label. The results look OK, so I left it like this.

    It means that the likelihood method simplifies to these 7 lines:
    label, text, org_text = normalize_text(msg)

    events = self.trie.extract_features(u"\u0001" + text + u"\u0001")
    y = predict(self.paramdata, events)
    predict_k = y.argmax()

    predict_lang = self.labels[predict_k]
    if y[predict_k] < 0.6: predict_lang = ""
    return predict_lang

    1. Hi,
      I created the Catalan corpus only with the cooperation of Catalan speakers 😀
      Though some languages need particular normalization for accuracy, Catalan doesn’t.

  9. Hi Shuyo,

    This has 18,000 tagged Malay sentences. Using these sentences, can I add detection of Malay to the existing model? Also, what can I do to add CJK languages? You mentioned CJK-Kanji normalization in your slides for ldig. Does this mean that CJK languages are included in ldig, or were you talking about the Java langdetect library?

    Thanks

  10. Hi Shuyo,

    Nice work, congratulations! ^^

    I am interested in knowing the best way to use the Python package at https://github.com/shuyo/ldig as an import in other scripts. Is this something that you plan to add in the future? Something like:

    import ldig
    print ldig.detect(“some sample text”)

    As with other similar tools.
    Thanks!
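
    In the meantime, a rough wrapper based on colin’s patch above might look like this (untested sketch; normalize_text, predict, and the trie, paramdata and labels attributes are the names used in that comment, and loading them from a model directory is not shown):

    import numpy
    from ldig import normalize_text, predict

    def detect(detector, msg, threshold=0.6):
        # detector is assumed to already hold the loaded trie, paramdata and labels
        label, text, org_text = normalize_text(msg)
        events = detector.trie.extract_features(u"\u0001" + text + u"\u0001")
        y = predict(detector.paramdata, events)
        k = y.argmax()
        return detector.labels[k] if y[k] >= threshold else ""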
