- Apache Nutch
Apache Nutch is a web search engine that combines Lucene, Solr, a web crawler, page scoring (PageRank) and a pluggable distributed system.
Nutch’s crawler has a language identification plugin.
I want to replace Nutch’s LanguageIdentifier with our Language Detection library, but Apache Nutch’s documentation is quite sparse, so I researched the structure of Nutch myself.
I referred to the Nutch Wiki and the following presentations.
Nutch has three storage areas (crawl db, link db and segments) plus index/indexes.
It can also run on a distributed system with Hadoop HDFS, but I leave that aside here.
- crawl db : scheduling information for crawling
- link db : link graph information for page scoring
- segments : the resources themselves (texts of web pages, snippets, intermediate resources)
- index : the search index
- indexes : per-segment index differences produced by incremental updates
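As a rough illustration of this layout, the sketch below checks which of these storage areas exist under a crawl output directory. The subdirectory names (`crawldb`, `linkdb`, `segments`, `index`, `indexes`) and the helper `describe_crawl_dir` are assumptions made for this example, not something taken from the Nutch documentation, so verify them against your own Nutch version.

```python
from pathlib import Path

# Assumed subdirectory names and their roles (hypothetical, check your Nutch version).
EXPECTED_AREAS = {
    "crawldb": "scheduling information for crawling",
    "linkdb": "link graph information for page scoring",
    "segments": "fetched resources (page text, snippets, intermediate data)",
    "index": "merged search index",
    "indexes": "per-segment index differences from incremental updates",
}


def describe_crawl_dir(crawl_dir):
    """Return {area name: (directory exists, role)} for each expected storage area."""
    root = Path(crawl_dir)
    return {
        name: ((root / name).is_dir(), role)
        for name, role in EXPECTED_AREAS.items()
    }


if __name__ == "__main__":
    # "crawl" is a placeholder path for the crawl output directory.
    for name, (exists, role) in describe_crawl_dir("crawl").items():
        status = "found" if exists else "missing"
        print(f"{name:9s} {status:8s} {role}")
```

Running it against a crawl directory prints one line per storage area, which is a quick way to see how far a crawl has progressed before opening the data with the inspection tools.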
The nutch_readdb and nutch_segread commands can be used to inspect the link db and the segments.
Luke, the Lucene index administration tool, can be used to inspect the index.
How the Nutch Crawler Works
The nutch_crawl command automatically injects seed URLs, crawls web pages, parses their contents and generates the index.
Judging from the actual behavior of nutch_crawl, the Nutch indexing process works as follows.
1. inject : add seed URLs to webdb (a table in crawl db?) and link db (as vertices?)
2. generate : generate a fetchlist (an intermediate resource in segments?) from webdb
3. fetch : get contents (according to the fetchlist?) and store them in segments
4. parse : parse the contents in segments (Apache Tika, with a parser for each MIME type)
5. update crawldb : add the linked URLs extracted by parsing to webdb?
6. go back to step 2 if webdb is not empty
7. update linkdb : update link db with the outputs of the fetcher (and the parser?)
8. index : generate indexes from each segment
9. Dedup : delete duplicated documents (example : http://www.cnn.com and cnn.com)
10. IndexMerger : merge the updated indexes into index
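To make the control flow of these steps easier to follow, here is a minimal in-memory sketch in Python. It is not Nutch code: the toy web `TOY_WEB`, the function `crawl`, the `depth` parameter and the status strings are all hypothetical, fetching and parsing are collapsed into a dictionary lookup, and duplicates are detected simply by identical page text, which is an assumption made for the example rather than a description of how Nutch's Dedup works.

```python
from collections import defaultdict

# Hypothetical toy web: url -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("page a", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://a.example/"]),
    "http://c.example/": ("page a", []),  # same text as a.example: a duplicate
}


def crawl(seeds, web, depth=3):
    # inject: register the seed URLs as not yet fetched
    crawl_db = {url: "unfetched" for url in seeds}
    segments = []

    for _ in range(depth):
        # generate: build the fetchlist from URLs that still need fetching
        fetchlist = sorted(u for u, s in crawl_db.items() if s == "unfetched")
        if not fetchlist:
            break  # nothing left to crawl

        segment = {}
        for url in fetchlist:
            # fetch + parse: the toy web already returns text and outlinks
            text, outlinks = web.get(url, ("", []))
            segment[url] = {"text": text, "outlinks": outlinks}
            crawl_db[url] = "fetched"

        # update crawldb: add newly discovered URLs, then loop back to generate
        for page in segment.values():
            for link in page["outlinks"]:
                crawl_db.setdefault(link, "unfetched")
        segments.append(segment)

    # update linkdb: invert outlinks into inlinks (target -> set of sources)
    link_db = defaultdict(set)
    for segment in segments:
        for url, page in segment.items():
            for link in page["outlinks"]:
                link_db[link].add(url)

    # index: build one small index per segment
    indexes = [{url: page["text"] for url, page in seg.items()} for seg in segments]

    # dedup + merge: drop documents with already-seen text, merge the rest
    index, seen_texts = {}, set()
    for part in indexes:
        for url, text in sorted(part.items()):
            if text in seen_texts:
                continue
            seen_texts.add(text)
            index[url] = text

    return crawl_db, dict(link_db), index


if __name__ == "__main__":
    crawl_db, link_db, index = crawl(["http://a.example/"], TOY_WEB)
    print(crawl_db)
    print(link_db)
    print(index)
```

Starting from the single seed, the loop runs the generate, fetch, parse and update cycle twice before the fetchlist comes back empty. The duplicate page on c.example is crawled and stored in a segment but is dropped when the per-segment indexes are merged.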