LIGA [Tromp+ 11] is a language detector for Twitter that covers 6 languages (German, English, Spanish, French, Italian and Dutch).
It uses a graph of 3-grams to capture long-distance features and achieves 95-98% accuracy.
The authors have published their dataset of 9066 tweets, so ldig (Language Detection with Infinity-Gram) can be compared against their results.
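To make the "graph of 3-grams" idea concrete, here is a rough sketch in Ruby. It is not the LIGA authors' code: the helper names `trigrams` and `add_to_graph` and the hash-based layout are assumptions made for illustration, and the scoring step of the real method is left out. The sketch only shows character 3-grams as nodes and transitions between consecutive 3-grams as edges, each counted per language.

```ruby
# Hypothetical illustration of a per-language trigram graph (not LIGA's actual code).

# Split a text into overlapping character 3-grams.
def trigrams(text)
  text.chars.each_cons(3).map(&:join)
end

# Count each 3-gram (node) and each pair of consecutive 3-grams (edge) for a language.
def add_to_graph(graph, lang, text)
  grams = trigrams(text)
  grams.each { |g| graph[:nodes][[g, lang]] += 1 }
  grams.each_cons(2) { |a, b| graph[:edges][[a, b, lang]] += 1 }
  graph
end

graph = { nodes: Hash.new(0), edges: Hash.new(0) }
add_to_graph(graph, "en", "hello")
add_to_graph(graph, "nl", "hallo")

graph[:edges].each { |(a, b, lang), n| puts "#{lang}: #{a} -> #{b} (#{n})" }
```

The edges are what carry the ordering information: two languages may share a 3-gram such as "llo" while differing in which 3-grams tend to precede it.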
First, their dataset needs to be converted into a format that ldig can read.
Here is a Ruby script that does the conversion.
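#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

# Merge the per-language tweet files into a single "lang<TAB>text" file.
File.open("liga_dataset.txt", "wb:UTF-8") do |out|
  ["de_DE", "en_UK", "es_ES", "fr_FR", "it_IT", "nl_NL"].each do |dir|
    lang = dir[0, 2] # "de_DE" -> "de"
    Dir.glob("#{dir}/*.txt") do |file|
      # Replace control characters (including tabs and newlines) with spaces
      # so that each tweet fits on one line.
      line = File.open(file, "rb:UTF-8") { |f| f.read }.gsub(/[\u0000-\u001f]/, " ").strip
      out.puts "#{lang}\t#{line}"
    end
  end
end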
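Before running ldig, it may be worth checking that the conversion produced the expected number of tweets per language. The following is a minimal sketch under the assumption that the output file uses the `lang<TAB>text` layout written by the script above; the helper name `count_by_language` is made up for this example.

```ruby
# Count tweets per language label in a "lang<TAB>text" file.
def count_by_language(path)
  counts = Hash.new(0)
  File.foreach(path, encoding: "UTF-8") do |line|
    lang, text = line.chomp.split("\t", 2)
    next if lang.nil? || text.nil? # skip blank or malformed lines
    counts[lang] += 1
  end
  counts
end

path = "liga_dataset.txt"
if File.exist?(path)
  count_by_language(path).sort.each { |lang, n| puts "#{lang}\t#{n}" }
end
```

If the conversion worked, the per-language counts should add up to the 9066 tweets of the dataset.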
Here are ldig's evaluation results on the generated dataset, liga_dataset.txt.
| lang | size | detect | correct | precision | recall |
|---|---|---|---|---|---|
| de | 1479 | 1469 | 1463 | 0.9959 | 0.9892 |
| en | 1505 | 1504 | 1489 | 0.9900 | 0.9894 |
| es | 1562 | 1550 | 1541 | 0.9942 | 0.9866 |
| fr | 1551 | 1545 | 1539 | 0.9961 | 0.9923 |
| it | 1539 | 1532 | 1526 | 0.9961 | 0.9916 |
| nl | 1430 | 1429 | 1425 | 0.9972 | 0.9965 |
| total | 9066 | | 8983 | | 0.9908 (accuracy) |
This shows that ldig achieves over 99% accuracy on their dataset.
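The figures in the table can be re-derived from the raw counts. The sketch below assumes precision = correct / detect and recall = correct / size, with overall accuracy = total correct / total size; these definitions reproduce the values listed in the table. The constant `RESULTS` and the helper `metrics` are names introduced only for this example.

```ruby
# Per-language counts copied from the table: [size, detect, correct].
RESULTS = {
  "de" => [1479, 1469, 1463],
  "en" => [1505, 1504, 1489],
  "es" => [1562, 1550, 1541],
  "fr" => [1551, 1545, 1539],
  "it" => [1539, 1532, 1526],
  "nl" => [1430, 1429, 1425],
}

# Return [precision, recall] for one language.
def metrics(size, detect, correct)
  [correct.to_f / detect, correct.to_f / size]
end

RESULTS.each do |lang, (size, detect, correct)|
  precision, recall = metrics(size, detect, correct)
  puts format("%s\t%.4f\t%.4f", lang, precision, recall)
end

total_size = RESULTS.values.sum { |size, _, _| size }
total_correct = RESULTS.values.sum { |_, _, correct| correct }
accuracy = total_correct.to_f / total_size
puts format("total\t%d\t%d\t%.4f", total_size, total_correct, accuracy)
```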
Reference
- [Tromp+ 11] Erik Tromp and Mykola Pechenizkiy, “Graph-Based N-gram Language Identification on Short Texts”