Estimation of ldig (twitter Language Detection) for LIGA dataset

LIGA [Tromp+ 11] is a Twitter language detection method for 6 languages (German, English, Spanish, French, Italian and Dutch).
It uses a graph of 3-grams to capture long-distance features and achieves 95-98% accuracy.
The authors have published their dataset of 9066 tweets here, so it is possible to compare ldig (Language Detection with Infinity-Gram: site, blog) with their results.

First, their dataset needs to be converted into a format that ldig can read.
Here is a Ruby script to convert it.

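#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

# Write one "lang<TAB>text" line per tweet into liga_dataset.txt.
File.open("liga_dataset.txt", "wb:UTF-8") do |out|
  ["de_DE","en_UK","es_ES","fr_FR","it_IT","nl_NL"].each do |dir|
    lang = dir[0,2]  # "de_DE" -> "de"
    Dir.glob("#{dir}/*.txt") do |file|
      # Replace control characters (including newlines) with spaces
      # so that each tweet stays on a single line.
      line = File.open(file, "rb:UTF-8") {|f| f.read}.gsub(/[\u0000-\u001f]/, " ").strip
      out.puts "#{lang}\t#{line}"
    end
  end
end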
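
As a quick sanity check on the converted file, the number of tweets per language can be counted and compared with the dataset's stated total of 9066. The following is only a minimal sketch: `count_by_lang` is a hypothetical helper written for this post, not part of ldig, and it assumes every line has the "lang<TAB>text" form produced by the script above.

```ruby
# Count tweets per language label in "lang<TAB>text" lines.
# count_by_lang is a hypothetical helper, not part of ldig.
def count_by_lang(lines)
  counts = Hash.new(0)
  lines.each do |line|
    lang, text = line.chomp.split("\t", 2)
    # Skip rows without a label or without any text.
    next if lang.nil? || text.nil? || text.empty?
    counts[lang] += 1
  end
  counts
end

# Print the per-language counts when the converted file is present.
if __FILE__ == $0 && File.exist?("liga_dataset.txt")
  File.open("liga_dataset.txt", "r:UTF-8") do |f|
    count_by_lang(f.each_line).sort.each { |lang, n| puts "#{lang}\t#{n}" }
  end
end
```

If the conversion worked, the six counts should add up to the number of tweets in the dataset.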

Here is the evaluation of ldig on the generated dataset, liga_dataset.txt.

lang    size  detect  correct  precision  recall
de      1479    1469     1463     0.9959  0.9892
en      1505    1504     1489     0.9900  0.9894
es      1562    1550     1541     0.9942  0.9866
fr      1551    1545     1539     0.9961  0.9923
it      1539    1532     1526     0.9961  0.9916
nl      1430    1429     1425     0.9972  0.9965
total   9066             8983     0.9908 (accuracy)

It shows that ldig achieves over 99% accuracy on their dataset.
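
The ratio columns can be recomputed from the counts. The sketch below assumes that precision is correct / detect, recall is correct / size, and the total figure is overall accuracy (total correct / total size). These definitions are not stated explicitly above, but they reproduce the figures in the table. The names `RESULTS`, `precision_recall` and `accuracy` are made up for this illustration.

```ruby
# Per-language counts copied from the table: [size, detect, correct].
RESULTS = {
  "de" => [1479, 1469, 1463],
  "en" => [1505, 1504, 1489],
  "es" => [1562, 1550, 1541],
  "fr" => [1551, 1545, 1539],
  "it" => [1539, 1532, 1526],
  "nl" => [1430, 1429, 1425],
}

# Assumed definitions: precision = correct / detect, recall = correct / size.
def precision_recall(size, detect, correct)
  [correct.to_f / detect, correct.to_f / size]
end

# Overall accuracy = total correct / total size.
def accuracy(results)
  total_size = results.values.sum { |size, _, _| size }
  total_correct = results.values.sum { |_, _, correct| correct }
  total_correct.to_f / total_size
end

RESULTS.each do |lang, (size, detect, correct)|
  prec, rec = precision_recall(size, detect, correct)
  puts format("%s %.4f %.4f", lang, prec, rec)
end
puts format("total %.4f", accuracy(RESULTS))
```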

Reference

  • [Erik Tromp and Mykola Pechenizkiy 11] “Graph-Based N-gram Language Identification on Short Texts”

2 Responses to Estimation of ldig (twitter Language Detection) for LIGA dataset

  1. Peter Wong says:

    Hi, I have read the LIGA [Tromp+ 11] paper and found that it used cross-validation on the LIGA dataset, i.e., it used only part of the dataset as training data and tested on the rest.
    I am wondering whether your results here were obtained by retraining ldig on the same dataset, or whether you trained ldig on other data and only tested on the LIGA data.
    Thank you!

    • shuyo says:

      Hi,
      I prepared a separate training dataset myself and used the LIGA dataset only as test data.
      My dataset is not public because I have not been able to resolve its copyright issues.
