My recent interests are mainly machine learning and natural language processing.


28 Responses to About

  1. Yusuf Nur says:


    I’m currently writing a thesis about text analysis and want to cite your
    “Language Detection Library for Java”. Have you written any publication about it? And how should I cite your project?

  2. shuyo says:

    Hi, I’m glad to hear that.
    Since there is no publication about the library, could you cite its author (Shuyo Nakatani), title (Language Detection Library for Java), and URL ( http://code.google.com/p/language-detection/ )?

  3. Yusuf Nur says:

    Ok, thanks.

  4. Haidong Gao says:

    Are you a student? I guess so. I am a student researching topic models. However, I am a newcomer to this field. Your blog posts have taught me much. Please accept my thanks.

    • shuyo says:

      Thank you, too!
      I am a web engineer and started studying machine learning about 2 years ago.
      Please tell me if you find any mistakes! 😀

  5. Beau Moore says:

    Hi Shuyo,

    Are you still actively developing your language-detection Java library?

    • shuyo says:

      I’m currently maintaining the library and experimenting with another idea for language detection.
      What are you interested in?

      • Beau Moore says:

        Is there any intention to port the library to C++? Is there any documentation regarding how to generate language profiles? I am very impressed with the abilities of the library, thank you for committing the time and effort. Would you be interested in some kind of assistance in furthering development and/or financial support?

      • shuyo says:

        I have no plan to port this library to C++.
        I also have no documentation for language profile generation, but the code is very short and straightforward, so I guess you could probably understand it by reading it.

        • com.cybozu.labs.langdetect.util.LangProfile
        • com.cybozu.labs.langdetect.Command#generateProfile()

        > Would you be interested in some kind of assistance in furthering development and/or financial support?

        Thanks for your proposal. You are welcome to fork this library!

        I am an engineer at Cybozu Labs, a research subsidiary of a Japanese groupware company.
        I’m interested in various methods of natural language processing and am developing another language detector!

  6. shimon says:

    Hello Mr. Shuyo,
    I want to try the language detector code you’ve written, and I see a library

    is required.
    What is this library? I couldn’t find any documentation about it. What does it do?

    Thanks!

  7. shimon says:

    Somehow the library name was not added to my question (perhaps because it was a URL),
    so I meant JSONIC 🙂

  8. shimon says:

    Hi again,
    Another question regarding the language detector:
    The detector returns two-letter strings identifying the language.
    Where can I find the mapping to the languages themselves? Are these standard language codes?

    • Beau Moore says:

      The two-letter codes are part of ISO 639-1; see http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes

      How about enhancement to the library to use the ISO639-3 3 character codes for identified languages as it is more inclusive?

      • shuyo says:

        Thanks for your answer.
        As the previous comment mentioned, the language codes used in langdetect are quite standard.

        > How about enhancement to the library to use the ISO639-3 3 character codes for identified languages as it is more inclusive?

        The language code is decided when a profile is generated, so three-character codes are probably available given corresponding profiles.
        But I haven’t verified it 😀
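
As a rough illustration of the mapping discussed in this thread, the sketch below uses the JDK's own `java.util.Locale` to turn a two-letter ISO 639-1 code into an English language name and a three-letter code. It does not use the language-detection library at all. The class `LanguageCodeNames` and its helper methods are made up for this example, and it assumes the detector's output is a plain language tag such as "en" or "ja". Note that `getISO3Language()` returns ISO 639-2/T codes, which is not necessarily the same set that ISO 639-3 profiles would use.

```java
import java.util.Locale;

// Hypothetical helper, not part of the language-detection library.
final class LanguageCodeNames {

    // English display name for a language tag such as "en" or "ja".
    // For an unknown code the JDK simply returns the code itself.
    static String englishName(String code) {
        return Locale.forLanguageTag(code).getDisplayLanguage(Locale.ENGLISH);
    }

    // Three-letter ISO 639-2/T code for a two-letter code.
    // Throws MissingResourceException if the JDK knows no such mapping.
    static String threeLetterCode(String code) {
        return Locale.forLanguageTag(code).getISO3Language();
    }

    public static void main(String[] args) {
        for (String code : new String[] {"en", "ja"}) {
            System.out.println(code + " -> " + englishName(code)
                    + " (" + threeLetterCode(code) + ")");
        }
    }
}
```

This only covers display; whether the detector can emit three-letter codes directly depends on how the profiles are named, as the reply above says.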

  9. Marco Lui says:

    Hello Shuyo

    I have been researching language identification as well, and I recently published a paper about it at IJCNLP2011. I investigated training language identification models using training data from multiple domains; you can find my paper at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf . There is also an implementation of my system available at http://www.csse.unimelb.edu.au/research/lt/resources/langid/ . I hope that you will be able to include it in future comparisons! My testing indicates that it is very competitive with CLD, as well as with your Java language-detection software.


  10. Pingback: Quora

  11. Sachin Walunj says:

    I’m new to Apache Mahout…
    Please give me the detailed steps to install it in Eclipse.

  12. Tom says:

    Nakatani-san, your libraries are very nicely done! We are working on closely related stuff, also in Python. Please send me an email; perhaps we can meet and exchange ideas. Are you in Tokyo?

  13. Dough Nguyen says:

    Hi Mr Shuyo,

    Thanks for your materials. They’re so helpful; I’ve been learning a lot about ML from them.
    I need to train CRF to recognize entities in a query, such as:

    where can I buy a cheap laptop that is small

    Can your Python implementation of CRF work for this purpose if I just use annotated training data, or do I need to do something extra? Thanks

  14. Parth says:

    I have tried to configure Mahout in Eclipse, but I got the following error and have not been able to solve it yet.
    Please help me.
    I am doing a research project on Mahout.

  15. munazza Khan says:

    I am currently working on language identification. I have been studying your blog, and it is very informative. Thank you so much for sharing. I have a question about the LIGA library: I need to find a proper tutorial for it. As I am a total newbie in this area, I need a bit of guidance. I have to use it in my project.

  16. Yaniv says:

    Hi Shuyo,
    Your Blog and project are very interesting!

    Can you suggest the best way to optimize your Language Detection Library for Java for identifying short text messages (like chat lines)?

    I’m looking to identify English, and only English, with very high precision while losing as few messages as possible.

    I want to do it in near real time, so I can’t use ldig as a service.
    Do you believe that I should load ALL profile.sm files, or only the English profile, or just some of them?

    I also thought of implementing some kind of majority vote that runs detection several times on a sentence and takes the majority result.
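
The majority idea above could be sketched roughly as follows. The `Supplier<String>` stands in for a single detection call and is an assumption of this example, not the library's API. The class `MajorityVote` and both method names are hypothetical. Repeating the call only helps if individual runs can disagree on the same input, which this thread does not confirm for the library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper; detectOnce is a placeholder for one detector call.
final class MajorityVote {

    // Runs the detector `runs` times and returns the most frequent label.
    // On a tie, the label that reached the top count first wins.
    static String majority(Supplier<String> detectOnce, int runs) {
        if (runs <= 0) throw new IllegalArgumentException("runs must be positive");
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (int i = 0; i < runs; i++) {
            String lang = detectOnce.get();
            int c = counts.merge(lang, 1, Integer::sum);
            if (c > bestCount) {
                best = lang;
                bestCount = c;
            }
        }
        return best;
    }

    // Fraction of runs that returned `target`, e.g. "en".
    // Requiring a high share trades recall for precision.
    static double share(Supplier<String> detectOnce, int runs, String target) {
        if (runs <= 0) throw new IllegalArgumentException("runs must be positive");
        int hits = 0;
        for (int i = 0; i < runs; i++) {
            if (target.equals(detectOnce.get())) hits++;
        }
        return (double) hits / runs;
    }
}
```

For an English-only filter, accepting a message only when `share(...)` for "en" exceeds a chosen threshold is one possible design; the right threshold would have to be tuned on real chat data.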


  17. Pete says:

    I’m writing a thesis about classification of documents, especially language detection. So I have a question about the profile files. How do you construct them? Does " mm":300023 mean that the n-gram " mm" occurred 300023 times in the training corpus?
    Thanks for your work and best regards from Germany
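
If the profile entries are raw occurrence counts, as the question assumes (the thread does not confirm this), a number like that could be produced by sliding a window over the training text. Below is a minimal sketch of such counting; `NgramCounter` is a hypothetical helper and not the library's own profile generator, which may normalize or filter the text differently.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper illustrating character n-gram counting.
final class NgramCounter {

    // Counts every character n-gram of length 1..maxN in the text.
    static Map<String, Integer> count(String text, int maxN) {
        Map<String, Integer> freq = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                freq.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // Prints each n-gram with its count for a tiny sample text.
        count(" mm mm", 3).forEach((gram, c) ->
                System.out.println("\"" + gram + "\": " + c));
    }
}
```

Under this reading, an entry such as " mm" with a count of 2 would mean the trigram space, m, m was seen twice in the sample text.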
