Chris Mattmann edited this page Oct 13, 2015 · 3 revisions

Welcome to the nutch-python wiki!

Getting started with Nutch-Python

The API is still evolving rapidly, but the code below should get you running.

Start the Nutch Server

In a separate terminal, download and build the latest version of Nutch trunk, then start the server:

  1. git clone https://github.com/apache/nutch.git
  2. cd nutch
  3. ant runtime
  4. cd runtime/local
  5. ./bin/nutch startserver
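
Before pointing a script at the server, it can help to confirm that something is accepting connections on the expected port. The following is a minimal sketch using only the Python standard library. The helper name `server_is_listening` is hypothetical (it is not part of nutch-python), and the default address is an assumption taken from the `http://localhost:8081` URL used in the script on this page.

```python
import socket
from urllib.parse import urlsplit


def server_is_listening(url="http://localhost:8081", timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper: it only checks that the port accepts connections,
    not that the listener is actually a Nutch server.
    """
    parts = urlsplit(url)
    host = parts.hostname or "localhost"
    # Fall back to the scheme's default port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False
```

Calling `server_is_listening()` right after `./bin/nutch startserver` may return False for a few seconds while the server is still starting, so a short retry loop around it is reasonable.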

Get your Nutch-Python script going

from nutch.nutch import Nutch, SeedClient, Server, JobClient

# Connect to the Nutch server started above
sv = Server('http://localhost:8081')

# Create a seed list on the server
sc = SeedClient(sv)
seed_urls = ('http://espn.go.com', 'http://www.espn.com')
sd = sc.create('espn-seed', seed_urls)

# Set up the crawl
nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)

# Keep polling until there is no job left to run
while True:
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    if job is None:
        break
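
The loop above polls `progress()` with no pause between calls. A short sleep between polls is gentler on the server. The sketch below assumes only what the script above shows: the crawl object has a `progress()` method that returns `None` once there is nothing left to run. The helper `wait_for_crawl` and its parameters are hypothetical, not part of nutch-python.

```python
import time


def wait_for_crawl(crawl, poll_seconds=1.0, sleep=time.sleep):
    """Call crawl.progress() until it returns None.

    Returns the number of polls that reported a job. The sleep function
    is injectable so the loop can be exercised without real delays.
    """
    polls = 0
    while True:
        job = crawl.progress()
        if job is None:
            return polls
        polls += 1
        # Pause before asking the server again.
        sleep(poll_seconds)
```

With the script above, `wait_for_crawl(cc)` would take the place of the `while True` loop.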