Why is Norwegian and Danish identification difficult?

I re-post the evaluation table of ldig (Twitter language detection).

lang    size   detected   correct   precision   recall
cs       5329      5330      5319      0.9979   0.9981
da       5478      5483      5311      0.9686   0.9695
de      10065     10076     10014      0.9938   0.9949
en       9701      9670      9569      0.9896   0.9864
es      10066     10075      9989      0.9915   0.9924
fi       4490      4472      4459      0.9971   0.9931
fr      10098     10097     10048      0.9951   0.9950
id      10181     10233     10167      0.9936   0.9986
it      10150     10191     10109      0.9920   0.9960
nl       9671      9579      9521      0.9939   0.9845
no       8560      8442      8219      0.9736   0.9602
pl      10070     10079     10054      0.9975   0.9984
pt       9422      9441      9354      0.9908   0.9928
ro       5914      5831      5822      0.9985   0.9844
sv       9990     10034      9866      0.9833   0.9876
tr      10310     10321     10300      0.9980   0.9990
vi      10494     10486     10479      0.9993   0.9986
total  149989               148600                      0.9907 (= 148600 / 149989, overall accuracy)

This shows that the accuracies for Norwegian and Danish are lower than for the other languages.
That is because Norwegian and Danish are very similar.
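
For reference, the precision and recall columns can be reproduced directly from the counts. A minimal check, assuming precision = correct / detected and recall = correct / size (which matches the table above):

```
rows = {
    "da": dict(size=5478, detected=5483, correct=5311),
    "no": dict(size=8560, detected=8442, correct=8219),
}

for lang, r in rows.items():
    precision = r["correct"] / r["detected"]  # share of detected tweets that were right
    recall = r["correct"] / r["size"]         # share of that language's tweets that were found
    print(f"{lang}: precision={precision:.4f} recall={recall:.4f}")

# da: precision=0.9686 recall=0.9695
# no: precision=0.9736 recall=0.9602
```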

Here are the top 25 words in the word distributions of Danish and Norwegian (the last column is the larger of the two frequencies).

rank  word   Danish   Norwegian   max(da, no)
  1   er     0.0311   0.0238      0.0311
  2   det    0.0287   0.0228      0.0287
  3   i      0.0253   0.0275      0.0275
  4          0.0165   0.0263      0.0263
  5   jeg    0.0185   0.0202      0.0202
  6   og     0.0188   0.0202      0.0202
  7   at     0.0183   0.0083      0.0183
  8   å      0.0001   0.0167      0.0167
  9   til    0.0157   0.0140      0.0157
 10   en     0.0149   0.0120      0.0149
 11   ikke   0.0119   0.0146      0.0146
 12   har    0.0122   0.0132      0.0132
 13   med    0.0120   0.0117      0.0120
 14   som    0.0044   0.0116      0.0116
 15   for    0.0097   0.0115      0.0115
 16   du     0.0111   0.0093      0.0111
 17          0.0110   0.0072      0.0110
 18   der    0.0101   0.0016      0.0101
 19   av     0.0000   0.0093      0.0093
 20   den    0.0089   0.0056      0.0089
 21   af     0.0089   0.0000      0.0089
 22   om     0.0052   0.0073      0.0073
 23   kan    0.0073   0.0044      0.0073
 24   men    0.0062   0.0072      0.0072
 25   de     0.0066   0.0054      0.0066
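
For illustration, a distribution like the one above can be computed with a few lines of Python. This is only a sketch; the file names and the letter-based tokenization are assumptions, not what ldig does internally:

```
import re
from collections import Counter

def word_distribution(path, top=25):
    """Relative frequencies of the `top` most common words in a text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # lowercase and keep runs of letters (including å, æ, ø)
            counts.update(re.findall(r"[^\W\d_]+", line.lower()))
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(top)]

# hypothetical corpus files, one tweet per line
for lang in ("da", "no"):
    print(lang, word_distribution(f"tweets_{lang}.txt")[:5])
```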

Most of the high-frequency words, like 'er' (which roughly corresponds to 'is' in English; the glosses below work the same way), 'det' ('it') and 'i' ('in'), are function words shared by the two languages, so they are almost useless for telling the languages apart.
Words that are useful for identification are pretty few. For example, English 'of' corresponds to 'af' in Danish but 'av' in Norwegian, 'me' to 'mig' (da) / 'meg' (no), 'now' to 'nu' (da) / 'nå' (no), 'just' to 'lige' (da) / 'like' (no), 'what' to 'hvad' (da) / 'hva' (no), 'no' to 'nej' (da) / 'nei' (no), and so on.
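
These marker words suggest a very simple second-pass tie-breaker between 'da' and 'no'. The sketch below is not part of ldig; it only votes with the word pairs listed above and leaves the decision to the classifier when no marker appears:

```
import re

DANISH_MARKERS = {"af", "mig", "nu", "lige", "hvad", "nej"}
NORWEGIAN_MARKERS = {"av", "meg", "nå", "like", "hva", "nei"}

def break_da_no_tie(text):
    """Return 'da' or 'no' if the marker words decide it, otherwise None."""
    words = set(re.findall(r"[^\W\d_]+", text.lower()))
    da_hits = len(words & DANISH_MARKERS)
    no_hits = len(words & NORWEGIAN_MARKERS)
    if da_hits > no_hits:
        return "da"
    if no_hits > da_hits:
        return "no"
    return None  # still ambiguous, keep the classifier's original decision

print(break_da_no_tie("hvad sker der nu?"))  # -> da
print(break_da_no_tie("hva skjer nå?"))      # -> no
```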

The most serious problem is my own skill at identifying Danish and Norwegian…
I collected tens of thousands of tweets in these languages, but I'm afraid the labels contain errors of a few percent.
Of course, that also causes detection errors.

If you know native speakers of Norwegian or Danish who are interested in language detection and corpus annotation, I would be glad if you could introduce them to me! 😀


8 Responses to Why is Norwegian and Danish identification difficult?

  1. motre says:

    Is there a test someone can take to see how good their skills are at distinguishing Norwegian from Danish?

    • shuyo says:

      Mmm, I could perhaps prepare such a simple test,
      but I myself am not confident that I can distinguish difficult Norwegian and Danish sentences…

  2. musi says:

    Hi,

    I am writing software to handle 20+ languages, and the accuracy for Danish and Norwegian is pretty bad (in fact around 45% for Danish and 65% for Norwegian). Can you please suggest some simple solution to increase them to 80% or more?

    I really appreciate your answer.

    Thank you,

  3. k says:

    Do you use character n-grams or just compare complete words? There are spelling differences that might come in handy for less frequent words, e.g. in general Danish has a lot more d/g/b where Norwegian has t/k/p (kage/kake, uddannelse/utdannelse). And Danish is more likely to end nouns in “-e” where Norwegian (Bokmål) ends them with “-er”, and to form the preterite with “-ede” where Norwegian (Bokmål) uses “-de/-te”. Say you have a tweet containing the word “kvoteflygtninge” – a fairly infrequent word form, so it might not be in your training data, but the ending might give it away as Danish (along with the g where Norwegian has k: kvoteflyktninger).

    • shuyo says:

      I missed your comment. Sorry.
      My language detection library uses character 3-grams. I also guess they are more effective than word features, as you mentioned.
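
      For illustration, character 3-gram features can be extracted like this (only a sketch; the space padding at word boundaries is an assumption for the example, not necessarily what the library does):

      ```
      from collections import Counter

      def char_trigrams(text):
          """Count character 3-grams, with spaces marking word boundaries."""
          padded = " " + text.lower() + " "
          return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

      # spelling differences such as kage/kake show up directly in the 3-grams
      print(char_trigrams("kage") - char_trigrams("kake"))
      # Counter({'kag': 1, 'age': 1, 'ge ': 1})
      ```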

  4. We use your langdetect lib in Apache Solr.

    Wouldn’t it be possible to statistically compare the dictionaries from the Wikipedia corpora of no/da and extract a list of possibly unique words for each (quite frequent in one, not in the other), then wash this list (I could help, being a native Norwegian) to generate a disambiguation dictionary for use in a 2nd pass in case you end up with “no” and “da” as 1st and 2nd candidate?

    Also, I’m not happy with “no”/“zh” detection. A text with a majority of Chinese characters and a few names (capitalized first letter) gets classified as Norwegian with 0.99999…, and by changing only one letter, typically a ‘sen’ trigram, into something else, you can flip it around to being “zh”. It looks as if Norwegian ‘-en’ and ‘-sen’ are so frequent that they dominate the choice, and there are not enough bigram and trigram combinations for Chinese to match most texts, so the profiles for Chinese should be larger? And to me it makes sense that if, say, >50% of the characters in a text are Chinese, the text is tagged as Chinese.

    • shuyo says:

      > Wouldn’t it be possible to statistically compare the dictionaries from the Wikipedia corpora of no/da and extract a list of possibly unique words for each (quite frequent in one, not in the other), then wash this list (I could help, being a native Norwegian) to generate a disambiguation dictionary for use in a 2nd pass in case you end up with “no” and “da” as 1st and 2nd candidate?

      Word features are very sparse, so I guess it would need a very huge dictionary to make an effective detector.
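
      That said, a sketch of the comparison you propose could look like the following. The frequency dictionaries here are tiny illustrative stand-ins (partly taken from the table in the post, partly made up), not real Wikipedia counts:

      ```
      def distinctive_words(freq_a, freq_b, min_freq=1e-4, max_ratio=0.1):
          """Words frequent in A (>= min_freq) and at most max_ratio as frequent in B."""
          return sorted(
              w for w, f in freq_a.items()
              if f >= min_freq and freq_b.get(w, 0.0) <= f * max_ratio
          )

      da = {"af": 0.0089, "og": 0.0188, "hvad": 0.0012}  # illustrative values only
      no = {"av": 0.0093, "og": 0.0202, "hva": 0.0011}

      print(distinctive_words(da, no))  # ['af', 'hvad']
      print(distinctive_words(no, da))  # ['av', 'hva']
      ```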

      > Also, I’m not happy with “no”/“zh” detection. A text with a majority of Chinese characters and a few names (capitalized first letter) gets classified as Norwegian with 0.99999…, and by changing only one letter, typically a ‘sen’ trigram, into something else, you can flip it around to being “zh”.

      Though I cannot tell for sure without seeing the actual data,
      such a text may have a different distribution from the training data.
      I guess it may need some preprocessing based on features of the text.
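
      As a rough illustration of the kind of preprocessing I mean (this is not how langdetect works internally, and `fallback_detect` is just a placeholder for any existing detection call): if the majority of letters in a text are CJK characters, one could short-circuit to “zh” before running the usual n-gram detector.

      ```
      import unicodedata

      def cjk_ratio(text):
          """Fraction of alphabetic characters whose Unicode name contains 'CJK'."""
          letters = [c for c in text if c.isalpha()]
          if not letters:
              return 0.0
          cjk = sum(1 for c in letters if "CJK" in unicodedata.name(c, ""))
          return cjk / len(letters)

      def detect_with_script_check(text, fallback_detect, threshold=0.5):
          if cjk_ratio(text) > threshold:
              return "zh"                   # mostly Chinese characters
          return fallback_detect(text)      # otherwise defer to the normal detector
      ```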
