Apache Nutch’s Architecture

Apache Nutch: http://nutch.apache.org/

Apache Nutch is a web search engine that consists of Lucene, Solr, a web crawler, page scoring (PageRank) and a pluggable distributed system.
Nutch's crawler has a language identification plugin.

I want to replace Nutch's LanguageIdentifier with our Language Detection library, but Apache Nutch's documentation is quite poor, so I researched the structure of Nutch.

I referred to Nutch's wiki and the following presentations.

Storages

Nutch has 3 storages (crawl db, link db and segments) plus the index/indexes.
They can be placed on a distributed file system with Hadoop HDFS, but I leave that aside here.

crawl db: scheduling for crawl
link db: linked graph information for page scoring
segments: the resources themselves (texts of web pages, snippets, intermediate resources)
index: search index
indexes: per-segment index differences produced by incremental updates
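
To make the role of the link db more concrete, here is a minimal sketch in Python (not Nutch's own code). It assumes that the link graph is kept as incoming links per URL, built by inverting the outlinks found while parsing. The function name invert_links and the example URLs are hypothetical.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source_url: [(target_url, anchor_text), ...]}
    into {target_url: [(source_url, anchor_text), ...]}."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            if target != source:  # this sketch ignores self-links
                inlinks[target].append((source, anchor))
    return dict(inlinks)


# Hypothetical outlinks, as a parser might extract them from two pages.
outlinks = {
    "http://a.example/": [
        ("http://b.example/", "B site"),
        ("http://c.example/", "C site"),
    ],
    "http://b.example/": [("http://c.example/", "see C")],
}
inlinks = invert_links(outlinks)
```

A page scoring step could then read, for each URL, which pages link to it and with which anchor text.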

The commands nutch_readdb and nutch_segread can be used to inspect the link db and the segments.

Luke, an administration tool for Lucene indexes, can be used to inspect the index.

http://code.google.com/p/luke/

How the Nutch Crawler Works

The nutch_crawl command injects seed URLs, crawls web pages, parses their contents and generates the index automatically.

http://wiki.apache.org/nutch/bin/nutch_crawl

Based on the actual behavior of nutch_crawl, Nutch's indexing workflow is as follows.

1. inject: add seed URLs to webdb (a table in crawl db?) and link db (as vertices?)
2. generate: generate a fetchlist (an intermediate resource in segments?) from webdb
3. fetch: get contents (according to the fetchlist?) and store them in segments
4. parse: parse contents in segments (Apache Tika, a parser for each MIME type)
5. update crawldb: add the linked URLs extracted by parsing to webdb?
6. go back to 2 if webdb is not empty
7. update linkdb: update link db with the outputs of the fetcher (and parser?)
8. index: generate indexes from each segment
9. Dedup: delete duplicated documents (example: http://www.cnn.com and cnn.com)
10. IndexMerger: merge updated indexes into index
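
The loop above can be sketched as a toy in-memory simulation in Python. This is only my reading of the workflow, not Nutch's implementation: the WEB dictionary stands in for real HTTP fetches, the names crawl and build_index are hypothetical, the link db update and scoring are left out, and I assume that Dedup detects duplicates by comparing a hash of the page content.

```python
import hashlib

# Hypothetical toy "web": url -> (page text, outlinks). Replaces real fetching.
WEB = {
    "http://www.example.com/": ("home", ["http://www.example.com/a", "http://example.com/"]),
    "http://example.com/": ("home", ["http://www.example.com/a"]),
    "http://www.example.com/a": ("page a", []),
}


def crawl(seeds, depth):
    # inject: register the seed URLs as not yet fetched
    crawl_db = {url: "unfetched" for url in seeds}
    segments = []
    for _ in range(depth):
        # generate: the fetchlist is every URL that is still unfetched
        fetchlist = [u for u, s in crawl_db.items() if s == "unfetched"]
        if not fetchlist:
            break  # nothing left to fetch, so stop looping
        segment = {}
        for url in fetchlist:
            # fetch + parse: look up the text and outlinks of the page
            text, links = WEB.get(url, ("", []))
            segment[url] = {"text": text, "outlinks": links}
            crawl_db[url] = "fetched"
        # update crawldb: newly discovered URLs become fetch candidates
        for page in segment.values():
            for link in page["outlinks"]:
                crawl_db.setdefault(link, "unfetched")
        segments.append(segment)
    return crawl_db, segments


def build_index(segments):
    # index + merge: one combined index built from all segments
    index, seen = {}, set()
    for segment in segments:
        for url, page in segment.items():
            digest = hashlib.sha256(page["text"].encode("utf-8")).hexdigest()
            if digest in seen:  # dedup: same content under another URL
                continue
            seen.add(digest)
            index[url] = page["text"]
    return index


crawl_db, segments = crawl(["http://www.example.com/"], depth=3)
index = build_index(segments)
```

In this toy run, the page at http://example.com/ has the same content as http://www.example.com/, so only the first one fetched ends up in the index.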

4 thoughts on “Apache Nutch’s Architecture”

Hi,

On my side, I think nutch needs a more user friendly administration interface.
Crawl Anywhere proposes an easy to use administration interface
http://www.crawl-anywhere.com

Dominique

Thanks, I agree that Nutch's administration is poor (it is too flexible to handle).
Your Crawl Anywhere is a simple solution for this problem, isn’t it?

I think everything typed was actually very reasonable.
However, think about this, what if you were to write a killer post title?
I mean, I don’t wish to tell you how to run your blog, but what if you added something that grabbed people’s attention?
I mean Apache Nutch’s Architecture | Shuyo’s Weblog is a little boring. You could look at Yahoo’s home page and watch how they create post headlines to get people
to click. You might add a related video or a related pic
or two to grab people interested about what you’ve written. Just my opinion, it could make your posts a little bit more interesting.

Hey, I think your site might be having browser compatibility issues.
When I look at your blog site in Safari, it looks fine but when opening
in Internet Explorer, it has some overlapping. I just wanted to give
you a quick heads up! Other than that, very good blog!

Published by shuyo
