Language Detection Library for Java

I developed a language detection library for Java that can detect 49 languages in a given text (English, Japanese, Chinese, …).

The library achieves over 99% accuracy on a news corpus (see the presentation below).

Next, I'll try to replace Apache Nutch's language identification module with this library.

This entry was posted in i18n, Java, NLP, text analysis.

3 Responses to Language Detection Library for Java

  1. Shashi Kiran says:

    Hi Shuyo,

    Thank you for the library. Do you have any test projects that have been successful? If so,
    could you provide me with the details?

    Regards,
    SK

  2. Gheorghe Muresan says:

    Shuyo,

    Thanks for sharing this language detection tool.

    However, I have a question.
    I played a bit with the tool:
    I used the Google translator to translate “The quick brown fox jumped over the lazy dog” and to check if the language detector gets the language correctly.
    It detected correctly a bunch of languages, until I got to Russian (http://translate.google.com/#en/ru/The%20quick%20brown%20fox%20jumped%20over%20the%20lazy%20dog):
    @Test
    public final void testLanguageDetector() {
        LangDetector langDetector = new CybozuLangDetector();
        String text = "Быстрая коричневая лиса перепрыгнула через ленивую собаку";
        String detected = langDetector.detect(text);
        assertEquals("ru", detected);
    }
    The language detector tells me that it’s Afrikaans.
    I've looked at the provided language profiles: the Russian one has lots of Cyrillic characters, while the Afrikaans one has none, so it should be easy to distinguish between the two. When I deleted the Afrikaans profile and tried again, the text was classified as Dutch.

    I guess that the problem is with encoding, but I'm not sure how to fix it so that I can use the language detector successfully.

    Any suggestions would be appreciated.

    • shuyo says:

      Hi,
      Although I cannot say for certain without more detail, it seems that the Russian language profile is not loaded.
      If it is loaded, please check the character encoding of your source file.
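
To illustrate the character-encoding point, here is a minimal sketch that uses only the Java standard library. It is not part of the language detection library, and the `EncodingCheck` class and its method names are hypothetical. It shows what can happen when a source file saved as UTF-8 is read as ISO-8859-1: the Cyrillic letters turn into Latin-range characters, which would plausibly push a detector toward a Latin-script language.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the library.
class EncodingCheck {

    // Counts the code points in the text that belong to the Cyrillic script.
    static long countCyrillic(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC)
                .count();
    }

    // Simulates text that was saved in one charset but read back in another.
    static String misread(String text, Charset savedAs, Charset readAs) {
        return new String(text.getBytes(savedAs), readAs);
    }

    public static void main(String[] args) {
        // "Быстрая" written with Unicode escapes, so the encoding of this
        // file itself cannot change the string.
        String russian = "\u0411\u044b\u0441\u0442\u0440\u0430\u044f";
        String garbled = misread(russian, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(countCyrillic(russian)); // prints 7
        System.out.println(countCyrillic(garbled)); // prints 0
        System.out.println(garbled.length());       // prints 14
    }
}
```

If a check like this reports no Cyrillic characters for the string in the test, the compiler is probably reading the source file with the wrong charset. Compiling with `javac -encoding UTF-8`, or writing the literal with `\uXXXX` escapes, is the usual way to rule that out.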
