Crawlpod - open source scalable web crawler

Dec 12, 2015

Intro

Earlier I wrote about Building a scalable distributed web crawler. Recently I built an open source one based on it.

Code

Interested in reading the code rather than this blog post? Here it is for you.

New Goals

Self-contained: should be able to run on a single node.
Fully asynchronous: no blocking calls anywhere.
Easy to plug in different providers for the underlying storage (cache, queue, etc.).

Tech Stack

The entire framework is written in Scala. Apparently, Scala is now my default JVM language.
Everything is an Akka actor, so there is a clear separation of responsibilities.
MongoDB is used for the various subsystems, accessed through the MongoDB Scala driver.
Dispatch for HTTP requests and jsoup as the DOM parser.
ScalaTest for testing and Logback for logging.
Json4s for handling JSON and Scala XML for XML.

Design

Let me explain the design in terms of the storage systems and the actors involved.

Storage systems

The following four storage systems are required. They can be implemented on top of various providers, depending on the use case and scale. Currently all four are implemented using MongoDB.

Queue is used to queue HTTP requests.
Request Store is used to determine whether a given request has already been processed.
Raw Store is used to cache the entire response from the HTTP request.
Json Store is used to store the JSON extracted from the response.

In its earlier avatar, Kafka was used for Queue, Couchbase was used for Request Store and Json Store and S3 was used for Raw Store.
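To make the pluggable-provider idea concrete, here is a minimal sketch of what two of these storage abstractions could look like. The trait and class names (RequestQueue, RequestStore, InMemoryQueue, InMemoryRequestStore) are hypothetical and not taken from the Crawlpod source, and requests are reduced to plain URL strings for brevity. Every operation returns a Future, so a MongoDB, Kafka or Couchbase provider could sit behind the same interface without blocking the caller.

```scala
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue}
import scala.concurrent.Future

// Hypothetical provider interfaces; names are illustrative only.
trait RequestQueue {
  def enqueue(url: String): Future[Unit]
  def dequeue(): Future[Option[String]] // None when the queue is empty
}

trait RequestStore {
  def isProcessed(url: String): Future[Boolean]
  def markProcessed(url: String): Future[Unit]
}

// In-memory providers, enough for a single-node run or a unit test.
class InMemoryQueue extends RequestQueue {
  private val items = new ConcurrentLinkedQueue[String]()

  def enqueue(url: String): Future[Unit] = {
    items.offer(url)
    Future.successful(())
  }

  // poll() returns null on an empty queue; Option(...) turns that into None.
  def dequeue(): Future[Option[String]] =
    Future.successful(Option(items.poll()))
}

class InMemoryRequestStore extends RequestStore {
  private val seen = ConcurrentHashMap.newKeySet[String]()

  def isProcessed(url: String): Future[Boolean] =
    Future.successful(seen.contains(url))

  def markProcessed(url: String): Future[Unit] = {
    seen.add(url)
    Future.successful(())
  }
}
```

Raw Store and Json Store would presumably follow the same shape, with a response body or a JSON document as the stored value.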

Actors involved

Now comes the interesting part, Actors.

All the world's a stage, and all the men and women merely players.
- William Shakespeare, As You Like It.

We have the following actors, each with a specific responsibility.

Controller Actor is the lead actor, which controls the flow; it is not yet mature.
Http Actor is a brave actor that sends the HTTP request. Once the HTTP response is received, it sends it to Extract Actor and Raw Store Actor.
Extract Actor is a soft, hardworking actor that receives the HTTP response and processes it. It sends newly extracted requests to Queue Actor and extracted JSON to Json Store Actor. Once the job is done, it reports to Controller Actor and also tells Request Store Actor that the request has been processed.
Queue Actor is the friend of Controller Actor and is responsible for enqueuing and dequeuing HTTP requests. When a request is dequeued, it sends the request to Request Store Actor to see whether it has to be processed.
Raw Store Actor is the local cache actor, which caches HTTP responses. If it doesn't have a response for a specific request, it sends the request to its brave friend, Http Actor. If it already has the response, it sends that to the hardworking Extract Actor.
Json Store Actor is an easy actor that just writes down all extracted JSON. It is the most underrated actor in the play and does its job very well.
Request Store Actor is the tough one in this lot: it does the heavy lifting of tracking all requests and also implements the mechanism to re-extract the data.
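The path a dequeued request takes through these actors can be summarised as a small decision function. This is only a sketch in plain Scala, not the actual Akka code: the names NextStep, Skip, FetchOverHttp, ExtractFromCache and Routing are hypothetical, and treating an already processed request as simply skipped is an assumption (the real Request Store Actor also supports re-extraction).

```scala
// Hypothetical model of the routing decisions; not the actual actor code.
sealed trait NextStep
case object Skip extends NextStep             // assumed: request already processed
case object FetchOverHttp extends NextStep    // no cached response: go to Http Actor
case object ExtractFromCache extends NextStep // cached response: go to Extract Actor

object Routing {
  // alreadyProcessed: the answer the Request Store would give for this request.
  // cachedBody: what the Raw Store holds for this request, if anything.
  def next(alreadyProcessed: Boolean, cachedBody: Option[String]): NextStep =
    if (alreadyProcessed) Skip
    else cachedBody match {
      case Some(_) => ExtractFromCache
      case None    => FetchOverHttp
    }
}
```

In the real system each branch would be a message sent to the corresponding actor rather than a returned value.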

It is better to read the code in CoreActors.scala. Also look at ControllerActor.scala, which is not yet mature.

Http

Currently Dispatch is used as the HTTP client, though it may move to Akka HTTP later. The HTTP client is used in the Http Actor.

Core models

Look at the Models file. The two important classes are explained here.

CrawlRequest - Contains all data required to send HTTP request.

case class CrawlRequest (
    url: String, // URL of request, with query params.
    extractor: String, // package.classname.methodname of extractor.
    method: String = "GET", // Http method.
    headers: Option[Map[String, String]] = None, // Optional Header.
    // Optional data that can be passed around.
    passData: Option[Map[String, String]] = None,
    requestBody: Option[String] = None, //Post request body
    cache: Boolean = true )

CrawlResponse - represents the http response with additional data.

case class CrawlResponse(
    request: CrawlRequest, // Original crawl request object
    status: Int, // http status. Say 200, 404, etc.,
    headers: Map[String, List[String]], // Response headers
    body: String, // response body
    created: Long = System.currentTimeMillis,
    timeTaken: Int = -1) // Time taken to get this response.

Now let's crawl.

The following is illustrative; it won't work as is.
It assumes that MongoDB is running somewhere and that mongodb.url is configured in application.conf.

Steps 1, 2, 3.

Create an extractor that returns an Extract object.

package net.crawlpod.extract

class Example {
  def init(response: CrawlResponse): Extract = {
    val dom = response.toDom
    // Heavy lifting next two lines,
    // Custom implementation for various pages of interest.
    val docs = extractDocsFromDom(dom) // Extract json docs.
    val requests = extractRequestsFromDom(dom) // Extract next set of urls.
    new Extract(docs,requests)
  }
}

Add an entry to the queue. Since MongoDB is used, we have to add a JSON object to the queue collection.

{
    "url" : "http://example.com",
    "extractor" : "net.crawlpod.extract.Example.init",
    "method" : "GET",
    "passData" : {
        "date1" : "01-Apr-2014",
        "date2" : "31-Mar-2015"
    },
    "cache" : true,
    "used" : false
}
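The extractor value above packs a class name and a method name into one string. One way such a string could be resolved at runtime is through Java reflection, sketched below. This is an assumption for illustration, not necessarily how Crawlpod does it: ExtractorResolver, split and invoke are hypothetical names, and the sketch assumes the class has a no-argument constructor and the method takes a single argument.

```scala
// Hypothetical helper; not taken from the Crawlpod source.
object ExtractorResolver {
  // Splits "package.ClassName.methodName" at the last dot.
  def split(extractor: String): (String, String) = {
    val idx = extractor.lastIndexOf('.')
    require(idx > 0 && idx < extractor.length - 1, s"Malformed extractor: $extractor")
    (extractor.substring(0, idx), extractor.substring(idx + 1))
  }

  // Instantiates the class via its no-arg constructor and reflectively
  // invokes the first public single-argument method with the given name.
  def invoke(extractor: String, arg: AnyRef): AnyRef = {
    val (className, methodName) = split(extractor)
    val clazz = Class.forName(className)
    val instance = clazz.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    val method = clazz.getMethods
      .find(m => m.getName == methodName && m.getParameterCount == 1)
      .getOrElse(throw new NoSuchMethodException(extractor))
    method.invoke(instance, arg)
  }
}
```

With the entry above, split would yield the class net.crawlpod.extract.Example and the method init, and invoke would call init with the crawl response as its only argument.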

Now run the application by launching net.crawlpod.core.CrawlPod.

Output

We can find more data added to the following collections in MongoDB.
- queue
- request
- raw
- json
Just export the data from json collection. We are done!

Just into second gear.

I hope this is just a start; it is not yet ready for general use.
A lot of ideas are yet to be implemented to make it easier to use.
Crawlpod needs documentation on its own site as it grows.

Kudos! You read till this point, so just go ahead and share it. Thanks!