On Feb 25, I attended the last day of CICLing 2011 (International Conference on Intelligent Text Processing and Computational Linguistics) at Waseda University, Japan.
I enjoyed it very much; it was my first time attending an international conference.
I'll write my impressions of several sessions.
Christopher D. Manning: Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?
The timing was good, as I had just read Chapter 10 of "Foundations of Statistical Natural Language Processing" (FSNLP).
FSNLP, published in 1999, reported that POS taggers reach 97% accuracy, and this talk said that the current best taggers still sit at about 97%. Over the last decade, accuracy has mainly improved on unknown words. An analysis of the tagging errors shows that mistakes in the training data cause over 40% of the errors.
So he argues that reaching nearly 100% accuracy requires improving not only the models and features but also the training data. That reasoning seems very natural to me as an engineer.
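Back-of-envelope arithmetic (my own, not from the talk) shows why the training data matters so much here: if over 40% of the remaining 3% error comes from annotation mistakes, cleaning the data alone would move accuracy well past 98%.

```python
# Rough ceiling estimate: the percentages are from the talk, the
# arithmetic is mine.
accuracy = 0.97
error = 1.0 - accuracy        # ~3% token error rate
from_bad_data = 0.40          # share of errors blamed on training data
ceiling = accuracy + error * from_bad_data
print(f"{ceiling:.3f}")       # -> 0.982, i.e. ~98.2% after data cleanup
```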
The paper for this session can be found on Manning’s site.
Dongwoo Kim, Alice Oh: Topic Chains for Understanding a News Corpus
This talk was about tracking news over time using LDA (Latent Dirichlet Allocation). It clusters news articles using similarity metrics computed between the topic distributions of documents.
The similarity metrics used are cosine similarity, Jaccard’s coefficient, Kendall’s tau coefficient, discounted cumulative gain, KL divergence, and Jensen-Shannon divergence. But I wonder about using the inner product as a metric.
It has a cleaner interpretation than cosine similarity: the inner product between the topic distributions of two documents is the probability that topics drawn independently from each document turn out to be the same.
Hiram Calvo, Kentaro Inui, Yuji Matsumoto: Co-related Verb Argument Selectional Preferences
This talk was about selectional preferences. Conventional selectional preference models learn the common subjects and objects of a verb independently, while this proposal models the co-occurrence of the subject and object nouns. So it rules out a strange phrase like “A human eats grass”.
It uses an SVM with features from PLSI (Probabilistic Latent Semantic Indexing) and PMI (Pointwise Mutual Information). I reckon this idea can be applied to other problems.
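For reference, the PMI feature measures how much more often a verb and an argument noun co-occur than chance would predict: PMI(v, n) = log(P(v, n) / (P(v) P(n))). A toy sketch with invented counts (not the paper's data or implementation):

```python
import math

def pmi(count_vn, count_v, count_n, total):
    # PMI(v, n) = log( P(v, n) / (P(v) * P(n)) ), estimated from
    # co-occurrence counts over `total` observed verb-argument pairs.
    p_vn = count_vn / total
    p_v = count_v / total
    p_n = count_n / total
    return math.log(p_vn / (p_v * p_n))

# Toy counts: the pair co-occurs 4x more often than chance predicts,
# so PMI is log(4) > 0; an implausible pair would score near or below 0.
print(pmi(count_vn=2, count_v=100, count_n=50, total=10_000))
```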
This paper won first place in the Best Paper awards.