My recent interests are mainly machine learning and natural language processing.
I’m currently writing a thesis about text analysis and want to cite your
“Language Detection Library for Java”. Have you written any publication about it? And how should I cite your project?
Hi, I’m glad to hear that.
Since there is no publication about the library, would you cite its author (Shuyo Nakatani), title (Language Detection Library for Java) and URL (http://code.google.com/p/language-detection/)?
Are you a student? I guess so. I am a student researching topic models. However, I am a newcomer to this field. Your blog posts have taught me much. Please accept my thanks.
Thank you, too!
I am a web engineer and started to study machine learning about 2 years ago.
Please tell me if you find any mistakes! 😀
Are you still actively developing your language-detection Java library?
I’m currently maintaining the library and experimenting with another idea for language detection.
What are you interested in?
Is there any intention to port the library to C++? Is there any documentation regarding how to generate language profiles? I am very impressed with the abilities of the library, thank you for committing the time and effort. Would you be interested in some kind of assistance in furthering development and/or financial support?
I have no plans to port this library to C++.
I also have no documentation for language profile generation, but the code is very short and straightforward, so I guess you could understand it by reading it.
> Would you be interested in some kind of assistance in furthering development and/or financial support?
Thanks for your proposal. You are welcome to fork this library!
I am an engineer at Cybozu Labs, a research subsidiary of a Japanese groupware company.
I’m interested in various methods of natural language processing and am developing another language detector!
Hello Mr. Shuyo,
I want to try the language detector code you’ve written, and I see a library
What is this library? I couldn’t find any documentation about it. What does it do?
The langdetect library is distributed at the link below.
Please download and try. Thanks!
Somehow the library name was not added to my question (perhaps because it was a URL),
so I meant JSONIC 🙂
JSONIC is available here: http://sourceforge.jp/projects/jsonic/devel/ .
Hi again,
Another question regarding the language detector:
The detector returns two-letter strings identifying the language.
Where could I find the mapping to the languages themselves? Are these some standard language codes?
The two-letter codes are part of ISO 639-1; see http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
How about enhancement to the library to use the ISO639-3 3 character codes for identified languages as it is more inclusive?
Thanks for your answer.
As the previous comment mentioned, the language codes used in langdetect are quite standard.
> How about enhancement to the library to use the ISO639-3 3 character codes for identified languages as it is more inclusive?
The language code is decided when the profile is generated, so three-letter codes are probably available with the corresponding profiles.
But I haven’t verified it 😀
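For readers who want display names for the two-letter codes, a small lookup table is enough. The Python sketch below hard-codes a handful of ISO 639-1 codes for illustration only; the table and the `language_name` helper are hypothetical and not part of langdetect, and a real application would load the full ISO 639-1 list.

```python
# Illustrative subset of ISO 639-1 two-letter codes (not the full standard).
ISO_639_1_NAMES = {
    "en": "English",
    "ja": "Japanese",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
}


def language_name(code: str) -> str:
    """Return the English name for a two-letter code, or the code itself if unknown."""
    return ISO_639_1_NAMES.get(code.lower(), code)


print(language_name("ja"))  # known code
print(language_name("xx"))  # unknown code falls back to the code itself
```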
I have been researching language identification as well. I recently published a paper about it at IJCNLP2011, in which I investigated training language identification models using training data from multiple domains. You can find my paper at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf . There is also an implementation of my system available at http://www.csse.unimelb.edu.au/research/lt/resources/langid/ . I hope that you will be able to include it in future comparisons! My testing indicates that it is very competitive with CLD, as well as your Java language-detection software.
Thanks very much! I’ll try your system.
I’m new to Apache Mahout…
Please give me the detailed steps to install it in Eclipse.
Nakatani-san, your libraries are very nicely done! We are working on closely related stuff, also in Python. Please send me an email; perhaps we can meet and exchange ideas. Are you in Tokyo?
Hi Mr Shuyo,
Thanks for your materials. They’re so helpful; I’ve been learning a lot about ML from them.
I need to train CRF to recognize entities in a query, such as:
where can I buy a cheap laptop that is small
Can your Python implementation of CRF work for this purpose if I just use annotated training data, or do I need to do something extra? Thanks
For example, you can use the Treebank corpus for your purpose.
But in general, Python scripts are not practical for large-scale problems.
I guess you should use a CRF library implemented in C/C++ (e.g. CRF++).
Some CRF libraries are listed in the Software section of Wikipedia’s article on CRFs.
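Annotated training data for a sequence labeller such as a CRF is commonly written as one label per token, for example in the BIO scheme. Here is a Python sketch using the query above; the entity types (`PRICE`, `PRODUCT`, `SIZE`) and the `to_bio` helper are made up for illustration and are not tied to any particular CRF library.

```python
from typing import Dict, List, Tuple


def to_bio(tokens: List[str], spans: Dict[Tuple[int, int], str]) -> List[Tuple[str, str]]:
    """Convert entity spans (start, end-exclusive token indices) to BIO labels."""
    labels = ["O"] * len(tokens)
    for (start, end), entity in spans.items():
        labels[start] = f"B-{entity}"      # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{entity}"      # continuation tokens
    return list(zip(tokens, labels))


tokens = "where can I buy a cheap laptop that is small".split()
# Hypothetical annotation: token 5 = "cheap", 6 = "laptop", 9 = "small"
spans = {(5, 6): "PRICE", (6, 7): "PRODUCT", (9, 10): "SIZE"}
for token, label in to_bio(tokens, spans):
    print(token, label)
```

Most CRF toolkits expect a column format along these lines (one token and its label per line), but the exact file layout depends on the toolkit.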
I have tried to configure Mahout in Eclipse, but I got the following error and have not been able to solve it yet.
Please help me.
I am doing a research project on Mahout.
I am currently working on language identification. I have been studying your blog. It is very informative. Thank you so much for sharing. I have a question about the LIGA library. I need to find a proper tutorial for it. As I’m a total newbie in this area, I need a bit of guidance. I have to use it in my project.
I don’t know about LIGA because it is not my project and it is closed.
Your blog and project are very interesting!
Can you suggest the best way to optimize your Language Detection Library for Java for identifying short text messages (like chat lines)?
I’m looking to identify English, and only English, with very high precision while losing as little as possible.
I want to do it in “near real time”, so I couldn’t use Ldig as a service.
Do you think I should load ALL profile.sm files, only the English profile, or just some of them?
I also thought of implementing some kind of majority vote, which would run detection several times on a sentence and take the majority result.
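The majority idea above can be sketched in a few lines of Python. The `detect` argument is a hypothetical callable standing in for whatever detector is used (it is not the library's API). The sketch shows only the voting logic, and voting only helps if the detector can return different answers across runs.

```python
from collections import Counter
from typing import Callable


def majority_language(text: str, detect: Callable[[str], str], runs: int = 5) -> str:
    """Run a (possibly non-deterministic) detector several times and return the most common label."""
    votes = Counter(detect(text) for _ in range(runs))
    # most_common(1) returns a one-element list holding the (label, count) pair with the highest count
    label, _count = votes.most_common(1)[0]
    return label


if __name__ == "__main__":
    import itertools

    # Fake detector for demonstration only: cycles through fixed answers.
    answers = itertools.cycle(["en", "en", "nl"])
    print(majority_language("hello there", lambda _text: next(answers), runs=5))
```

For an English-only filter, one could additionally require that the winning label is `"en"` and that it wins by a clear margin, trading recall for precision.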
I’m writing a thesis about classification of documents, especially language detection. So I have a question about the profile files. How do you construct them? Does " mm":300023 mean that the n-gram " mm" occurred 300023 times in the training corpus?
Thanks for your work, and best regards from Germany.
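Under the reading suggested in the question, a profile entry is simply an n-gram paired with how often it occurred in the training text. A minimal Python sketch of that kind of counting follows; `count_ngrams` is a hypothetical helper, and the library's actual profile generation may normalize or filter the text differently.

```python
from collections import Counter


def count_ngrams(text: str, max_n: int = 3) -> Counter:
    """Count all character n-grams of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


freq = count_ngrams("a mm b mm", max_n=3)
print(freq[" mm"])  # prints 2: the trigram " mm" occurs twice in the sample text
```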