Apache Nutch’s Architecture

Apache Nutch
http://nutch.apache.org/

Apache Nutch is a web search engine that consists of Lucene, Solr, a web crawler, page scoring (PageRank) and a pluggable distributed system.
Nutch’s crawler has a language identification plugin.

I want to replace Nutch’s LanguageIdentifier with our Language Detection library, but I’m afraid Apache Nutch’s documentation is quite poor, so I researched the structure of Nutch.

I referred to Nutch’s Wiki and the following presentations.

Storages

Nutch has 3 storages (crawl db, link db and segments) plus index/indexes.
These can be placed on a distributed file system with Hadoop HDFS, but I leave that aside here.

crawl db
scheduling for crawl
link db
linked graph information for page scoring
segments
the resources themselves (text of web pages, snippets, intermediate resources)
index
search index
indexes
per-segment index differences produced by incremental updates
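
As a rough mental model, the storages above can be sketched as plain Python data structures. This is only an illustration: every class and field name here is hypothetical, it says nothing about Nutch's actual on-disk format, and modeling the link db as a map of inlinks is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CrawlDb:
    """Scheduling state for the crawl (hypothetical model)."""
    # url -> "unfetched" or "fetched"
    status: Dict[str, str] = field(default_factory=dict)


@dataclass
class LinkDb:
    """Link graph used for page scoring (assumed here: url -> pages linking to it)."""
    inlinks: Dict[str, Set[str]] = field(default_factory=dict)


@dataclass
class Segment:
    """One crawl round: the fetch list plus fetched and parsed resources."""
    name: str
    fetchlist: List[str] = field(default_factory=list)
    content: Dict[str, str] = field(default_factory=dict)      # url -> raw page
    parsed_text: Dict[str, str] = field(default_factory=dict)  # url -> extracted text


@dataclass
class CrawlData:
    """The three storages plus the per-segment indexes and the merged index."""
    crawl_db: CrawlDb = field(default_factory=CrawlDb)
    link_db: LinkDb = field(default_factory=LinkDb)
    segments: List[Segment] = field(default_factory=list)
    # segment name -> index built from that segment only
    indexes: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # merged search index: url -> indexed text
    index: Dict[str, str] = field(default_factory=dict)


# Usage: register one seed URL and open a first segment for it.
data = CrawlData()
data.crawl_db.status["http://example.com/"] = "unfetched"
data.segments.append(Segment(name="segment-0001", fetchlist=["http://example.com/"]))
print(len(data.segments), data.crawl_db.status)
```

Keeping indexes (one per segment) separate from index (merged) mirrors the distinction in the list above.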

The nutch_readdb and nutch_segread commands can be used to inspect the link db and the segments.

Luke, Lucene’s index administration tool, can be used to inspect the index.

How the Nutch Crawler Works

The nutch_crawl command automatically injects seed URLs, crawls web pages, parses their contents and generates the index.

Based on the actual behavior of nutch_crawl, Nutch’s indexing workflow is as follows.

  1. inject : add seed URLs to webdb (a table in crawl db?) and link db (as vertices?)
  2. generate : generate a fetchlist (an intermediate resource in segments?) from webdb
  3. fetch : get contents (according to the fetchlist?) and store them in segments
  4. parse : parse the contents in segments (Apache Tika, a parser for each MIME type)
  5. update crawldb : add the linked URLs extracted by parsing to webdb?
  6. go back to step 2 if webdb is not empty
  7. update linkdb : update link db with the outputs of the fetcher (and parser?)
  8. index : generate indexes from each segment
  9. Dedup : delete duplicated documents (example : http://www.cnn.com and cnn.com)
  10. IndexMerger : merge the updated indexes into index
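
The steps above can be sketched as a toy in-memory simulation. It is not Nutch code. The WEB dictionary, the crawl function and the depth parameter are hypothetical names, and the sketch assumes three things: "webdb is not empty" means "unfetched URLs remain", duplicates are detected by a hash of the page text, and the link db stores inlinks.

```python
import hashlib
from collections import defaultdict

# Hypothetical in-memory "web": url -> (page text, outgoing links).
WEB = {
    "http://a.example/": ("page a", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://a.example/"]),
    "http://c.example/": ("page a", []),  # same text as a.example: a duplicate
}


def crawl(seeds, depth):
    crawl_db = {}               # url -> "unfetched" or "fetched"
    link_db = defaultdict(set)  # url -> set of pages linking to it
    segments = []               # one dict per crawl round

    # 1. inject: register the seed URLs
    for url in seeds:
        crawl_db.setdefault(url, "unfetched")

    for _ in range(depth):
        # 2. generate: build the fetchlist from unfetched URLs
        fetchlist = sorted(u for u, s in crawl_db.items() if s == "unfetched")
        if not fetchlist:
            break  # 6. nothing left to fetch, so stop looping
        segment = {"fetchlist": fetchlist, "content": {}, "outlinks": {}}

        # 3. fetch: store page contents in the segment
        for url in fetchlist:
            if url in WEB:
                segment["content"][url] = WEB[url][0]
            crawl_db[url] = "fetched"

        # 4. parse: the toy pages already carry their outgoing links
        for url in segment["content"]:
            segment["outlinks"][url] = WEB[url][1]

        # 5. update crawldb: newly discovered URLs become unfetched entries
        for links in segment["outlinks"].values():
            for target in links:
                crawl_db.setdefault(target, "unfetched")
        segments.append(segment)

    # 7. update linkdb: invert outlinks into inlinks
    for segment in segments:
        for source, links in segment["outlinks"].items():
            for target in links:
                link_db[target].add(source)

    # 8. index: one partial index per segment
    indexes = [dict(segment["content"]) for segment in segments]

    # 9. dedup by content hash, then 10. merge into a single index
    index, seen = {}, set()
    for partial in indexes:
        for url, text in sorted(partial.items()):
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # a document with the same text is already indexed
            seen.add(digest)
            index[url] = text
    return crawl_db, dict(link_db), index


crawl_db, link_db, index = crawl(["http://a.example/"], depth=3)
print(sorted(index))
```

In this toy run, c.example is fetched but dropped from the merged index because its text duplicates a.example, which is the kind of case the Dedup step is meant to handle.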

4 thoughts on “Apache Nutch’s Architecture”

  1. Thanks, I agree that Nutch has poor administration (too flexible to handle).
    Your Crawl Anywhere is a simple solution for this problem, isn’t it?
