In the previous article, I introduced ldig (Language Detection with Infinity-Gram: site, blog), which detects the language of tweets.
I received some requests to report ldig’s precision and recall, so I calculated them.
| lang | size | detected | correct | precision | recall |
|---|---|---|---|---|---|
| cs | 5329 | 5330 | 5319 | 0.9979 | 0.9981 |
| da | 5478 | 5483 | 5311 | 0.9686 | 0.9695 |
| de | 10065 | 10076 | 10014 | 0.9938 | 0.9949 |
| en | 9701 | 9670 | 9569 | 0.9896 | 0.9864 |
| es | 10066 | 10075 | 9989 | 0.9915 | 0.9924 |
| fi | 4490 | 4472 | 4459 | 0.9971 | 0.9931 |
| fr | 10098 | 10097 | 10048 | 0.9951 | 0.9950 |
| id | 10181 | 10233 | 10167 | 0.9936 | 0.9986 |
| it | 10150 | 10191 | 10109 | 0.9920 | 0.9960 |
| nl | 9671 | 9579 | 9521 | 0.9939 | 0.9845 |
| no | 8560 | 8442 | 8219 | 0.9736 | 0.9602 |
| pl | 10070 | 10079 | 10054 | 0.9975 | 0.9984 |
| pt | 9422 | 9441 | 9354 | 0.9908 | 0.9928 |
| ro | 5914 | 5831 | 5822 | 0.9985 | 0.9844 |
| sv | 9990 | 10034 | 9866 | 0.9833 | 0.9876 |
| tr | 10310 | 10321 | 10300 | 0.9980 | 0.9990 |
| vi | 10494 | 10486 | 10479 | 0.9993 | 0.9986 |
| total | 149989 | | 148600 | | 0.9907 (accuracy) |
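The precision and recall columns can be reproduced from the counts in the table, assuming precision = correct / detected and recall = correct / size (this matches the listed values). The sketch below is only an illustration using three rows copied from the table; the function name `precision_recall` is hypothetical and is not part of ldig.

```python
# Counts copied from three rows of the table: (size, detected, correct).
counts = {
    "cs": (5329, 5330, 5319),
    "da": (5478, 5483, 5311),
    "no": (8560, 8442, 8219),
}


def precision_recall(size, detected, correct):
    """Return (precision, recall) rounded to four decimals.

    Assumed definitions (hypothetical helper, not ldig code):
    precision = correct / detected, recall = correct / size.
    """
    precision = correct / detected
    recall = correct / size
    return round(precision, 4), round(recall, 4)


for lang, (size, detected, correct) in counts.items():
    p, r = precision_recall(size, detected, correct)
    print(f"{lang}: precision={p:.4f} recall={r:.4f}")

# The figure in the total row is total correct / total size.
overall_accuracy = round(148600 / 149989, 4)
print(f"total: accuracy={overall_accuracy:.4f}")
```

The total row has no precision of its own here; its last figure is simply the share of all tweets that were labeled correctly.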
The total data size does not equal the total number of detected languages because ldig outputs “” (an empty string) as the language when the maximum probability is lower than 0.6.
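The thresholding rule can be sketched as follows. This is a minimal illustration and not ldig's actual code: the function `pick_language` and the dictionary of per-language probabilities are assumptions made for the example, and only the 0.6 cutoff comes from the text above.

```python
THRESHOLD = 0.6  # cutoff stated in the article


def pick_language(probabilities, threshold=THRESHOLD):
    """Return the most probable language, or "" if it is not confident enough.

    `probabilities` is a hypothetical dict mapping language codes to
    probabilities; ldig's internal representation may differ.
    """
    if not probabilities:
        return ""
    best = max(probabilities, key=probabilities.get)
    # An empty label is returned when the maximum probability is below the cutoff.
    return best if probabilities[best] >= threshold else ""


print(repr(pick_language({"da": 0.55, "no": 0.45})))  # below the cutoff
print(repr(pick_language({"en": 0.93, "nl": 0.07})))  # above the cutoff
```

Tweets that receive the empty label count toward the data size of their true language but toward no language's detected count, which is why the two totals differ.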
The data size also differs from the one in the previous article because the dataset has been updated.
I reckon that differences in accuracy above 99% do not mean much, but what do you think?
Hiya,
Is there any way to run this code from within Java?
I am using Java for a project, and I need to call language detection for Twitter from within Java.
Thank you!
You can do it by calling ldig as an external process (though it is slow).
Thanks.