Short Text Language Detection with Infinity-Gram

I talked about language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology).
Here are the slides.


Tweets are too short for their language to be detected precisely. One reason, I suspect, is that the features extracted from such a short text are simply too few for reliable detection.
Another reason is that tweets contain some unique representations, for example u for you, 4 for for, LOL, F4F, various emoticons, and so on.

I developed ldig, a prototype short-text language detector, to solve these problems.
ldig can detect the language of tweets with over 99% accuracy across 19 languages.

The slides above explain how ldig solves these problems.
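
To make the sparsity point concrete, here is a small illustration (this is not ldig's actual feature extraction, which uses maximal substrings, i.e. "infinity-grams", over a trie): counting how many character n-gram features a tweet-length text yields versus an ordinary-length one.

# -*- coding: utf-8 -*-
# Illustration only: short texts yield far fewer character n-gram
# features than ordinary-length texts.
def char_ngrams(text, n_max=3):
    grams = set()
    for n in range(1, n_max + 1):
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print len(char_ngrams(u"u up? lol"))  # tweet-like: only a handful of features
print len(char_ngrams(u"Tweets are too short to detect their language precisely."))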

27 thoughts on “Short Text Language Detection with Infinity-Gram”

  1. Great work.

    Just one minor issue. I cloned your git repository and ran server.py, but it didn’t work for me out of the box. When I load the page, I see the textbox but no results. On closer inspection, the AJAX calls go through, but the server gives the following error: “TypeError: 1013 is not JSON serializable” (I’m using Python 2.7).

    Turns out this is because that particular instance of the number 1013 is of type numpy.int32, which the standard JSON encoder in Python 2.7 cannot serialize. The solution is to have a custom encoder like this:

    import json
    import numpy

    class CustomJSONEncoder(json.JSONEncoder):
        def default(self, obj):
            # convert numpy integers to plain Python ints before encoding
            if isinstance(obj, numpy.int32):
                return int(obj)
            return json.JSONEncoder.default(self, obj)

    Then the call to json.dump should look like this:

    json.dump(detector.detect(text), self.wfile, cls=CustomJSONEncoder)

    And that fixes it.

    1. Thank you very much!
      Although I am also using Python 2.7.2, I could not reproduce your trouble in my environment…
      The current version of json may support numpy.int32 (I haven’t confirmed it yet).

      1. Hmm, I tried some checks.
        – The json module in Python 2.7.2 does not support numpy objects either.
        – server.py in ldig does not pass numpy objects into json.dump (all numerics are converted into formatted strings).
        So I guess your trouble should not happen. Did you modify the code somewhere?

    1. Thanks.
      Since CJK detection is not as difficult as that of Latin-alphabet languages, and I am not using Python 3, I have neither of the plans you mentioned at present.

  2. Thanks for the great work, Shuyo! I’m relatively new to Python and I’m trying to use this library for a school project. I’m having a little trouble figuring out how to run it against a tweet string using the predefined training dictionaries. Does anyone know of any examples of this posted anywhere?

    1. Prepare test data in which each line is formatted as below.

      [language label]\t[text]

      The language labels are like en, fr or de (all available language labels are listed under “Supported languages” at https://github.com/shuyo/ldig ) and ‘\t’ means a TAB character (\x09).

      Then run ldig.py like below.

      ldig.py -m [model directory] [test data path]

      I hope it works well.
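
      For example, assuming the bundled Latin model directory is models/model.latin (the directory name is illustrative; use whichever model directory the repository ships) and tweets.txt contains TAB-separated lines such as “en<TAB>hello world”, the call would look like this:

      python ldig.py -m models/model.latin tweets.txt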

  3. Dear Shuyo,

    Your work on language detection is really impressive!

    I have tested ldig on small European sentences (not Twitter-like) and it looks very promising (even if I do not have any annotated/gold data, so I cannot do a really accurate test; it is more a feeling from a little manual reading). Also, I have read that you may provide Catalan soon, and I hope you will do so.

    I am interested in only a few of the languages that ldig supports, so I am wondering whether there is any possibility of constraining the probability calculation so that it chooses among only a subset of the supported languages. For example, I have noticed that it may be difficult to distinguish French from Romanian; since I do not expect any Romanian input, French should be the only choice. Any clue on how I should do that?

    Last but not least, is there any way to remove NumPy so that it depends only on pure Python?

    1. Though I replied to you by mail, I’ll also copy the answer here.

      ----
      ldig can only detect its supported languages, and it tends to force a detection among them.
      Since the distributed model of ldig doesn’t support Romanian, many Romanian texts may be detected as some Romance language like French, as you mentioned.
      Solving this conflict would require supporting Romanian, but I cannot do that because I don’t have a large enough Romanian corpus…

      Although excluding numpy from ldig is not impossible, I guess it would be considerably slower then.
      So I have no plan to do it. Sorry.

      1. Thank you for your replies!

        OK, so I think I misformulated my question: is there a way to force the calculation so that ldig totally ignores some of the languages it supports? (I am sure that I will only have input from a smaller subset of the 19, so I do not want any misdetection.)

      2. Oh, I see.
        Although ldig cannot do it as-is, modifying ldig makes it possible.
        The predict method in ldig.py has the following lines.

        exp_w = numpy.exp(sum_w - sum_w.max())
        return exp_w / exp_w.sum()

        If you want to ignore Romanian (in the previous comment I mistakenly said the bundled model does not support Romanian… sorry…), insert code that zeroes out the probabilities of the target languages, like this.

        exp_w = numpy.exp(sum_w - sum_w.max())
        exp_w[15] = 0 # turn the Romanian probability into 0
        return exp_w / exp_w.sum()

        The language index corresponds to the index of its label in labels.json in the model directory.
        I hope this works for you.
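
        If you’d rather not hardcode the index, you can look it up from labels.json; a small sketch (model_dir stands for whatever directory you pass with -m, and “ro” is assumed to be the Romanian label):

        import json, os

        # labels.json lists the language labels in the same order as
        # the indexes of sum_w / exp_w
        labels = json.load(open(os.path.join(model_dir, "labels.json")))
        exp_w[labels.index("ro")] = 0  # zero out Romanian by label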

  4. Dear Shuyo,
    Your work is very impressive. Do you plan to extend it to detect Japanese on Twitter? By the way, I tried the Java langdetect library for detecting Japanese tweets, but I ran into the following problem:
    com.cybozu.labs.langdetect.LangDetectException: no features in text
    at com.cybozu.labs.langdetect.Detector.detectBlock(Detector.java:234)
    at com.cybozu.labs.langdetect.Detector.getProbabilities(Detector.java:220)
    at com.cybozu.labs.langdetect.Detector.detect(Detector.java:208)
    Could you please recommend a solution for the above problem, or for detecting Japanese tweets? Thank you.

    1. This exception is thrown when the target text has no substring available for detection.
      So please catch LangDetectException and handle it suitably (e.g. set the detected label to null).

      Since Japanese/Chinese detection is not as difficult as that of Latin-alphabet languages, I have no such plan at present. Sorry.

  5. Any interest in a Scala version? This would allow Python-like functional code, yet still run on the JVM ecosystem…

    1. I have no plan to implement it on other platforms for the present. Sorry.
      Since ldig is licensed under the MIT license, you can re-implement it freely 😀

  6. Dear Shuyo, your work is very impressive. I’d like to test ldig.py under Python 2.7. I read the usage notes at https://github.com/shuyo/ldig but unfortunately it is not clear to me how to run your tool. Could you please provide a sample file and a sample of the command-line syntax?

    The error I have is:
    C:\Python27\Scripts\ldig-master>ldig.py -m "c:\Python27\Scripts\ldig-master\models" "C:\Python27\Scripts\ldig-master\tweets.txt"
    Usage: ldig.py [options]

    ldig.py: error: features file doesn’t exist

    The file I use is:
    en\thello world
    en\tciao a tutti

    Thanks a lot
    A.Z

  7. Hi Shuyo,
    Very nice library. I tested it last week and compared your results with the results provided by Twitter (since March 26, all tweets have a lang attribute in the Twitter online API). With Twitter, around 50% of tweets have no language, so I will continue with your library.
    Just two questions:
    1/ Is there an easy way to do online language detection? Something like calling a method with a tweet as input and its language as output.
    2/ Why does the language name need to be provided as input?
    thanks
    colin

  8. OK, I looked at the code and made a quick patch for online language detection.
    Since I don’t know the language of the message, I don’t provide anything for the label. The results look OK, so I left it like this.

    It means that the likelihood method simplifies to these 7 lines:
    label, text, org_text = normalize_text(msg)

    events = self.trie.extract_features(u"\u0001" + text + u"\u0001")
    y = predict(self.paramdata, events)
    predict_k = y.argmax()

    predict_lang = self.labels[predict_k]
    if y[predict_k] < 0.6: predict_lang = ""
    return predict_lang

    1. Hi,
      I created the Catalan corpus only with the cooperation of Catalan speakers 😀
      Though some languages need particular normalization for accuracy, Catalan doesn’t.

  9. Hi Shuyo,

    This has 18,000 tagged Malay sentences. Using these sentences, can I add detection of Malay to the existing model? Also, what can I do to add CJK languages? You mentioned CJK-Kanji normalization in your slides for ldig. Does this mean that CJK languages are included in ldig, or were you talking about the Java langdetect library?

    Thanks

  10. Hi Shuyo,

    Nice work, congratulations! ^^

    I am interested in knowing the best way to use the Python package at https://github.com/shuyo/ldig as an import in other scripts. Is this something that you plan to add in the future? Something like:

    import ldig
    print ldig.detect(“some sample text”)

    As with other similar tools.
    Thanks!
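
    In the meantime, a rough wrapper based on colin’s patch above might look like this (untested sketch; normalize_text, predict, and the trie, paramdata and labels attributes are the names used in that comment, and loading them from a model directory is not shown):

    import numpy
    from ldig import normalize_text, predict

    def detect(detector, msg, threshold=0.6):
        # detector is assumed to already hold the loaded trie, paramdata and labels
        label, text, org_text = normalize_text(msg)
        events = detector.trie.extract_features(u"\u0001" + text + u"\u0001")
        y = predict(detector.paramdata, events)
        k = y.argmax()
        return detector.labels[k] if y[k] >= threshold else ""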
