- Apache Nutch
- http://nutch.apache.org/
Apache Nutch is a web search engine that combines Lucene, Solr, a web crawler, page scoring (PageRank) and a pluggable distributed system.
Nutch’s crawler has a language identification plugin.
I want to replace Nutch’s LanguageIdentifier with our Language Detection library, but I’m afraid Apache Nutch’s documentation is quite poor, so I researched the structure of Nutch.
I referred to the Nutch Wiki and several presentations.
Storages
Nutch has three storages (crawl db, link db and segments) plus index/indexes.
They can be placed on a distributed file system with Hadoop HDFS, but I leave that aside here.
- crawl db : scheduling information for the crawl
- link db : link graph information for page scoring
- segments : the resources themselves (texts of web pages, snippets, intermediate resources)
- index : the search index
- indexes : per-segment index differences for incremental updates
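To keep the roles of these storages straight, here is a minimal Python sketch that models them as in-memory structures. It is only my mental model, not Nutch's on-disk format: the class names, the fields (status, next_fetch, fetchlist and so on) and the choice to key the link db by the target page are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlEntry:
    """Scheduling state for one URL (hypothetical fields, not Nutch's schema)."""
    status: str = "unfetched"    # "unfetched" or "fetched"
    next_fetch: float = 0.0      # earliest time the URL may be fetched again


@dataclass
class Segment:
    """Resources gathered in one crawl round (hypothetical layout)."""
    fetchlist: list = field(default_factory=list)    # URLs chosen for this round
    content: dict = field(default_factory=dict)      # url -> raw page content
    parsed_text: dict = field(default_factory=dict)  # url -> extracted text


@dataclass
class Storages:
    crawl_db: dict = field(default_factory=dict)  # url -> CrawlEntry
    link_db: dict = field(default_factory=dict)   # target url -> set of source urls
    segments: list = field(default_factory=list)  # one Segment per round

    def add_link(self, src: str, dst: str) -> None:
        """Record that src links to dst, keyed by the target page (an assumption)."""
        self.link_db.setdefault(dst, set()).add(src)


store = Storages()
store.crawl_db["http://example.com/"] = CrawlEntry()
store.add_link("http://example.com/", "http://example.com/about")
store.segments.append(Segment(fetchlist=["http://example.com/"]))
```

The point of the sketch is the separation of concerns: the crawl db only knows what to fetch and when, the link db only knows who links to whom, and each segment holds the data of a single round.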
The nutch_readdb and nutch_segread commands can be used to inspect the link db and the segments.
Luke, the Lucene index inspection tool, can be used to inspect the index.
How the Nutch Crawler Works
The nutch_crawl command automatically injects seed URLs, crawls web pages, parses their contents and generates the index.
Judging from the actual behavior of nutch_crawl, Nutch’s indexing proceeds as follows.
1. inject : add seed URLs to webdb (a table in crawl db?) and link db (as vertices?)
2. generate : generate a fetchlist (an intermediate resource in segments?) from webdb
3. fetch : get contents (according to the fetchlist?) and store them in segments
4. parse : parse contents in segments (Apache Tika, with a parser for each MIME type)
5. update crawldb : add linked URLs extracted by parsing to webdb?
6. go back to step 2 (generate) if webdb is not empty
7. update linkdb : update link db with the outputs of the fetcher (and parser?)
8. index : generate indexes from each segment
9. Dedup : delete duplicated documents (example : http://www.cnn.com and cnn.com)
10. IndexMerger : merge updated indexes into index
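The steps above can be sketched as a toy loop in Python. This is not Nutch code and calls no Nutch API: FAKE_WEB stands in for real HTTP fetching, the crawl function and its depth parameter are hypothetical, and treating Dedup as "keep one document per content hash" is my assumption about how duplicates such as http://www.cnn.com and cnn.com would be detected.

```python
import hashlib

# Hypothetical in-memory "web": url -> (page text, outlinks). Replaces real fetching.
FAKE_WEB = {
    "http://a.example/": ("home page", ["http://a.example/x", "http://a.example/y"]),
    "http://a.example/x": ("same text", ["http://a.example/"]),
    "http://a.example/y": ("same text", []),
}


def crawl(seeds, depth):
    crawl_db = {}   # url -> "unfetched" | "fetched"
    link_db = {}    # target url -> set of source urls
    segments = []   # one dict per round: url -> (text, outlinks)

    # inject: register the seed URLs
    for url in seeds:
        crawl_db.setdefault(url, "unfetched")

    for _ in range(depth):
        # generate: build the fetchlist from URLs not fetched yet
        fetchlist = sorted(u for u, s in crawl_db.items() if s == "unfetched")
        if not fetchlist:
            break
        # fetch + parse: store text and extracted outlinks in a new segment
        segment = {u: FAKE_WEB.get(u, ("", [])) for u in fetchlist}
        segments.append(segment)
        # update crawl db: mark fetched pages, add newly discovered URLs
        for url, (_text, outlinks) in segment.items():
            crawl_db[url] = "fetched"
            for target in outlinks:
                crawl_db.setdefault(target, "unfetched")

    # update link db: record the links found in every segment
    for segment in segments:
        for url, (_text, outlinks) in segment.items():
            for target in outlinks:
                link_db.setdefault(target, set()).add(url)

    # index + dedup: keep the first document seen for each content hash
    index = {}
    for segment in segments:
        for url, (text, _outlinks) in segment.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            index.setdefault(digest, {"url": url, "text": text})
    return crawl_db, link_db, index


crawl_db, link_db, index = crawl(["http://a.example/"], depth=3)
print(sorted(doc["url"] for doc in index.values()))
# prints ['http://a.example/', 'http://a.example/x']
```

In this toy run the page at http://a.example/y is fetched but dropped at the dedup step because its text is identical to http://a.example/x. The loop also stops early once the fetchlist is empty, which corresponds to the "go back to generate" condition in the list.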
Hi,
On my side, I think nutch needs a more user friendly administration interface.
Crawl Anywhere proposes an easy to use administration interface
http://www.crawl-anywhere.com
Dominique
Thanks. I agree that Nutch’s administration is poor (too flexible to handle).
Your Crawl Anywhere is a simple solution to this problem, isn’t it?