Here again is the evaluation table for ldig (Twitter language detection).
| lang | size | detected | correct | precision | recall |
|---|---|---|---|---|---|
| cs | 5329 | 5330 | 5319 | 0.9979 | 0.9981 |
| da | 5478 | 5483 | 5311 | 0.9686 | 0.9695 |
| de | 10065 | 10076 | 10014 | 0.9938 | 0.9949 |
| en | 9701 | 9670 | 9569 | 0.9896 | 0.9864 |
| es | 10066 | 10075 | 9989 | 0.9915 | 0.9924 |
| fi | 4490 | 4472 | 4459 | 0.9971 | 0.9931 |
| fr | 10098 | 10097 | 10048 | 0.9951 | 0.9950 |
| id | 10181 | 10233 | 10167 | 0.9936 | 0.9986 |
| it | 10150 | 10191 | 10109 | 0.9920 | 0.9960 |
| nl | 9671 | 9579 | 9521 | 0.9939 | 0.9845 |
| no | 8560 | 8442 | 8219 | 0.9736 | 0.9602 |
| pl | 10070 | 10079 | 10054 | 0.9975 | 0.9984 |
| pt | 9422 | 9441 | 9354 | 0.9908 | 0.9928 |
| ro | 5914 | 5831 | 5822 | 0.9985 | 0.9844 |
| sv | 9990 | 10034 | 9866 | 0.9833 | 0.9876 |
| tr | 10310 | 10321 | 10300 | 0.9980 | 0.9990 |
| vi | 10494 | 10486 | 10479 | 0.9993 | 0.9986 |
| total | 149989 | | 148600 | 0.9907 (accuracy) | |
This shows that the accuracy for Norwegian and Danish is lower than for the other languages.
That is because Norwegian and Danish are very similar.
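The precision and recall columns in the table above can be reproduced from the raw counts. The sketch below assumes the usual definitions (precision = correct / detected, recall = correct / size), which match the published figures; the function name `precision_recall` is only illustrative.

```python
# Counts copied from the table above: (size, detected, correct).
counts = {
    "da": (5478, 5483, 5311),
    "no": (8560, 8442, 8219),
    "sv": (9990, 10034, 9866),
}


def precision_recall(size, detected, correct):
    """Assumed definitions: precision = correct / detected, recall = correct / size."""
    return correct / detected, correct / size


for lang, (size, detected, correct) in counts.items():
    p, r = precision_recall(size, detected, correct)
    print(f"{lang}: precision={p:.4f} recall={r:.4f}")
```

For Norwegian this gives a precision of 0.9736 and a recall of 0.9602, the lowest recall in the table.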
Here are the 25 most frequent words in the Norwegian and Danish word distributions.
| rank | word | Danish | Norwegian | amount |
|---|---|---|---|---|
| 1 | er | 0.0311 | 0.0238 | 0.0311 |
| 2 | det | 0.0287 | 0.0228 | 0.0287 |
| 3 | i | 0.0253 | 0.0275 | 0.0275 |
| 4 | på | 0.0165 | 0.0263 | 0.0263 |
| 5 | jeg | 0.0185 | 0.0202 | 0.0202 |
| 6 | og | 0.0188 | 0.0202 | 0.0202 |
| 7 | at | 0.0183 | 0.0083 | 0.0183 |
| 8 | å | 0.0001 | 0.0167 | 0.0167 |
| 9 | til | 0.0157 | 0.0140 | 0.0157 |
| 10 | en | 0.0149 | 0.0120 | 0.0149 |
| 11 | ikke | 0.0119 | 0.0146 | 0.0146 |
| 12 | har | 0.0122 | 0.0132 | 0.0132 |
| 13 | med | 0.0120 | 0.0117 | 0.0120 |
| 14 | som | 0.0044 | 0.0116 | 0.0116 |
| 15 | for | 0.0097 | 0.0115 | 0.0115 |
| 16 | du | 0.0111 | 0.0093 | 0.0111 |
| 17 | så | 0.0110 | 0.0072 | 0.0110 |
| 18 | der | 0.0101 | 0.0016 | 0.0101 |
| 19 | av | 0.0000 | 0.0093 | 0.0093 |
| 20 | den | 0.0089 | 0.0056 | 0.0089 |
| 21 | af | 0.0089 | 0.0000 | 0.0089 |
| 22 | om | 0.0052 | 0.0073 | 0.0073 |
| 23 | kan | 0.0073 | 0.0044 | 0.0073 |
| 24 | men | 0.0062 | 0.0072 | 0.0072 |
| 25 | de | 0.0066 | 0.0054 | 0.0066 |
Most of the high-frequency words, such as ‘er’ (roughly ‘is’ in English; the glosses below follow the same pattern), ‘det’ (‘it’) and ‘i’ (‘in’), are function words common to both languages, so it is very difficult to tell the two languages apart.
Words that are useful for identification are quite few. For example, ‘of’ in English corresponds to ‘af’ in Danish or ‘av’ in Norwegian, ‘me’ corresponds to ‘mig’ (da) or ‘meg’ (no), ‘now’ corresponds to ‘nu’ (da) or ‘nå’ (no), ‘just’ corresponds to ‘lige’ (da) or ‘like’ (no), ‘what’ corresponds to ‘hvad’ (da) or ‘hva’ (no), ‘no’ corresponds to ‘nej’ (da) or ‘nei’ (no), and so on.
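To see how few of the frequent words actually separate the two languages, one can rank some rows of the word table by a smoothed log ratio of their frequencies. This is only a sketch: the smoothing constant `EPS` and the helper `log_ratio` are assumptions for illustration, not part of ldig.

```python
import math

# (Danish, Norwegian) relative frequencies copied from the word table above.
freqs = {
    "er": (0.0311, 0.0238),
    "det": (0.0287, 0.0228),
    "i": (0.0253, 0.0275),
    "at": (0.0183, 0.0083),
    "å": (0.0001, 0.0167),
    "som": (0.0044, 0.0116),
    "der": (0.0101, 0.0016),
    "av": (0.0000, 0.0093),
    "af": (0.0089, 0.0000),
}

EPS = 1e-4  # hypothetical smoothing constant to avoid log(0)


def log_ratio(da, no, eps=EPS):
    """Positive values lean Danish, negative values lean Norwegian."""
    return math.log((da + eps) / (no + eps))


# Words with the largest absolute log ratio discriminate best.
ranked = sorted(freqs, key=lambda w: abs(log_ratio(*freqs[w])), reverse=True)
for word in ranked:
    print(f"{word:>4} {log_ratio(*freqs[word]):+.2f}")
```

With these numbers, ‘av’, ‘af’ and ‘å’ come out on top, while ‘er’, ‘det’ and ‘i’ have ratios close to zero and carry almost no information.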
The most serious problem is my own ability to identify Danish and Norwegian…
I collected tens of thousands of tweets in these languages, but I’m afraid that several percent of them are labeled incorrectly.
Of course, that causes detection errors.
If you know native speakers of Norwegian or Danish who are interested in language detection and corpus annotation, I would be glad if you could introduce me!
Is there a test someone can take to see how good their skills are at distinguishing Norwegian from Danish?
Hmm, perhaps I can prepare such a simple test,
but I myself have no confidence that I can distinguish difficult Norwegian and Danish sentences…
Hi,
I am writing software to handle 20+ languages, and the accuracy for Danish and Norwegian is pretty bad (in fact around 45% for Danish and 65% for Norwegian). Can you please suggest a simple solution to increase them up to 80%?
I really appreciate your answer.
Thank you,
I can’t say anything because I don’t know what model and features you use.
Do you use character-grams or just compare complete words? There are spelling differences that might come in handy for less frequent words, e.g. in general Danish has a lot more d/g/b where Norwegian has t/k/p (kage/kake, uddannelse/utdannelse). And Danish is more likely to end nouns in “-e” where Norwegian (Bokmål) ends them with “-er”, preterite with “-ede” where Norwegian (Bokmål) uses “-de/-te”. Say you have a tweet containing the word “kvoteflygtninge” – fairly infrequent word form so it might not be in your training data, but the ending might give it away as Danish (along with the g where Norwegian has k: kvoteflyktninger)
I missed your comment. Sorry.
My language detection library uses character 3-grams. I also think they are more effective than word features, as you mentioned.
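As a rough illustration of why character 3-grams can pick up the spelling differences mentioned above, here is a minimal sketch of trigram extraction. It is not the library's actual feature code; the function `char_ngrams` and the space padding at word boundaries are assumptions.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams, padding with a space at each end of the text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


danish = char_ngrams("kvoteflygtninge")
norwegian = char_ngrams("kvoteflyktninger")

# Trigrams that appear in only one of the two spellings.
only_danish = sorted(set(danish) - set(norwegian))
only_norwegian = sorted(set(norwegian) - set(danish))
print(only_danish)
print(only_norwegian)
```

Even for a word that never occurred in training, the Danish form contributes trigrams such as `ygt` and the word-final `ge `, while the Norwegian form contributes `ykt` and `er `, which are the g/k and -e/-er contrasts described in the comment.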
We use your langdetect lib in Apache Solr.
Wouldn’t it be possible to statistically compare the dictionaries from the Wikipedia corpus of no/da and extract a list of possibly unique words for each (quite frequent in one, not in the other), then clean up this list (I could help, being a native Norwegian) to generate a disambiguation dictionary for use in a 2nd pass in case you end up with “no” and “da” as 1st and 2nd candidate?
Also, I’m not happy with “no”/“zh” detection. A text with a majority of Chinese characters and a few names (capitalized first letter) gets classified as Norwegian with 0.99999…, and changing only one letter, typically turning a ‘sen’ trigram into something else, can flip it around to being “zh”. It looks as if Norwegian ‘-en’ and ‘-sen’ are so frequent that they dominate the choice, and that there are not enough bigram and trigram combinations for Chinese to match most texts, so should the profiles for Chinese be larger? And to me it makes sense that if, say, >50% of the characters in a text are Chinese, the text is tagged as Chinese.
> Wouldn’t it be possible to statistically compare the dictionaries from the Wikipedia corpus of no/da and extract a list of possibly unique words for each (quite frequent in one, not in the other), then clean up this list (I could help, being a native Norwegian) to generate a disambiguation dictionary for use in a 2nd pass in case you end up with “no” and “da” as 1st and 2nd candidate?
Word features are very sparse, so I guess it would need a very large dictionary to make an effective detector.
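For reference, the proposed second pass might look something like the sketch below. Everything here is hypothetical: `second_pass`, the `candidates` argument and the two marker sets are made up for illustration and are not part of langdetect. The marker words are just the pairs listed in the post, which also shows the sparsity problem: most short texts will contain none of them.

```python
import re

# Marker words taken from the pairs listed in the post; a usable dictionary
# would need to be far larger and checked by native speakers.
DANISH_MARKERS = {"af", "mig", "nu", "lige", "hvad", "nej"}
NORWEGIAN_MARKERS = {"av", "meg", "nå", "like", "hva", "nei"}


def second_pass(text, candidates):
    """Re-rank only when the top two candidates are 'da' and 'no'.

    `candidates` is a hypothetical list of language codes ordered by the
    first-pass detector's score. Returns the preferred language code.
    """
    if set(candidates[:2]) != {"da", "no"}:
        return candidates[0]
    words = re.findall(r"\w+", text.lower())
    da_hits = sum(w in DANISH_MARKERS for w in words)
    no_hits = sum(w in NORWEGIAN_MARKERS for w in words)
    if da_hits > no_hits:
        return "da"
    if no_hits > da_hits:
        return "no"
    return candidates[0]  # no evidence either way: keep the first-pass result


print(second_pass("hva gjør du nå", ["da", "no"]))
```

When no marker word is present, the function simply keeps the first-pass answer, so it can only help on texts that happen to contain dictionary entries.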
> Also, I’m not happy with “no”/“zh” detection. A text with a majority of Chinese characters and a few names (capitalized first letter) gets classified as Norwegian with 0.99999…, and changing only one letter, typically turning a ‘sen’ trigram into something else, can flip it around to being “zh”.
Though I cannot say for sure without seeing the actual data,
its distribution may differ from that of the training data.
I guess it may need some preprocessing suited to the features of the text.
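One possible form of such preprocessing is the script check suggested in the comment: count how many characters are Han characters before running the statistical detector. This is a minimal sketch under stated assumptions. The names `han_ratio` and `detect_with_script_check` and the `fallback_detect` callback are hypothetical, only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is checked, and Han characters also occur in Japanese text, so returning "zh" directly is a simplification.

```python
def han_ratio(text):
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    han = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return han / len(chars)


def detect_with_script_check(text, fallback_detect, threshold=0.5):
    """Return 'zh' when Han characters dominate, else defer to `fallback_detect`.

    `fallback_detect` stands in for whatever statistical detector is in use.
    """
    if han_ratio(text) > threshold:
        return "zh"
    return fallback_detect(text)


# A Chinese sentence followed by a Norwegian-looking surname.
print(detect_with_script_check("这是一个中文句子 Hansen", lambda t: "no"))
```

With this check in front, a single ‘sen’ trigram in a name can no longer outweigh a text that is mostly Chinese characters.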