
Crawling multiple start URLs in one job run

This page describes an alternative way of using the WebCrawler worker. One use case is to crawl multiple start URLs in a single job run (though in multiple workflow runs). The main idea is to send the start URLs as single records via the document push API (/smila/job/<crawlJobName>/record) instead of specifying them in the job definition.

It would be easy to define further variants that crawl start URLs produced by other workers, or to cover similar use cases.


Note: We have tested this only with the WebCrawler worker. Similar workflows should be possible with all crawler workers that support an input slot for follow-up links produced by the same crawler worker in a previous task, for example the FileCrawler worker. Some workers might expect internal attributes to be set in these follow-up link records. Please notify us if you run into such problems so that we can extend the respective worker to support this use case.



Workflow Definition

Add a workflow definition like the following:

POST /smila/jobmanager/workflows/
{
  "name":"multiWebCrawling",
  "modes":[
    "standard"
  ],
  "parameters": {
    "startUrl":"<send start urls via bulkbuilder>",
    "bulkLimitSize":1
  },
  "startAction":{
    "worker":"bulkbuilder",
    "output":{
      "insertedRecords":"linksToCrawlBucket"
    }
  },
  "actions":[
    {
      "worker":"webCrawler",
      "input":{
        "linksToCrawl":"linksToCrawlBucket"
      },
      "output":{
        "linksToCrawl":"linksToCrawlBucket",
        "crawledRecords":"crawledLinksBucket"
      }
    },
    {
      "worker":"deltaChecker",
      "input":{
        "recordsToCheck":"crawledLinksBucket"
      },
      "output":{
        "updatedRecords":"updatedLinksBucket",
        "updatedCompounds":"compoundLinksBucket"
      }
    },
    {
      "worker":"webExtractor",
      "input":{
        "compounds":"compoundLinksBucket"
      },
      "output":{
        "files":"fetchedLinksBucket"
      }
    },
    {
      "worker":"webFetcher",
      "input":{
        "linksToFetch":"updatedLinksBucket"
      },
      "output":{
        "fetchedLinks":"fetchedLinksBucket"
      }
    },
    {
      "worker":"updatePusher",
      "input":{
        "recordsToPush":"fetchedLinksBucket"
      }
    }
  ]
}

The differences from the standard "webCrawling" workflow are:

  • The start action is the bulkbuilder, not the webCrawler itself, so you can send records to this job using the document push API while it is running. We will use this to send records containing the start URLs.
  • The bulkbuilder parameter bulkLimitSize is set to 1 (byte), so each inserted record is written to a bulk of its own and crawled in its own workflow run. This way a fatal error caused by one start URL will not abort the crawl of another start URL.
  • The webCrawler parameter startUrl is set to a dummy value in the workflow: the parameter is required but not actually used here, and fixing it in the workflow means it does not have to be included in the job definition.
  • The job runs in "standard" mode instead of "runOnce" mode, which means that you have to finish it yourself after providing the start URLs.
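
If you prefer to register the workflow from a script instead of a raw REST client, the following minimal sketch shows one way to do it in Python with the requests library. The base URL http://localhost:8080 and the file name multiWebCrawling.json are assumptions for a default local SMILA installation; adjust them to your setup.

import json
import requests

# Assumed base URL of a default local SMILA installation; adjust as needed.
SMILA = "http://localhost:8080"

# Load the workflow definition shown above from a local file
# (hypothetical file name; save the JSON there yourself).
with open("multiWebCrawling.json") as f:
    workflow = json.load(f)

# Register the workflow with the job manager.
response = requests.post(SMILA + "/smila/jobmanager/workflows/", json=workflow)
response.raise_for_status()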

Job Definition

This could be the job definition:

{
  "name":"crawlMultipleStartUrls",
  "workflow":"multiWebCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"multiweb",
    "jobToPushTo":"indexUpdate",
    "linksPerBulk": 100,
    "filters":{
      "maxCrawlDepth": 5,
      "urlPatterns": {
         "include": [
           "http://.*eclipse\\.org/.*SMILA.*",
           "http://.*eclipse\\.org/.*smila.*"],
         "exclude": [".*\\?.*" ]
      }
    },
    "mapping":{
          "httpCharset": "Charset",
          "httpContenttype": "ContentType",
          "httpLastModified": "LastModifiedDate",
          "httpMimetype": "MimeType",
          "httpSize": "Size",
          "httpUrl": "Url",
          "httpContent": "Content"
    }
  }
}

The definition is very similar to a standard crawl job definition; it just does not include the start URL (which was already fixed to a dummy value in the workflow definition). Note that the urlPatterns are applied to all URLs found for every start URL, so the include patterns must match all start URLs you are planning to crawl, or possibly nothing will be crawled at all. In this example we want to crawl only different parts of eclipse.org hosts, so the include patterns work.
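
To get a feeling for how these patterns behave, the following sketch applies the include and exclude expressions from the job definition above to a few example URLs with Python's re module. It assumes the usual semantics that a URL is only crawled if it matches at least one include pattern and no exclude pattern; check the WebCrawler worker documentation for the exact filter behaviour.

import re

# Patterns from the job definition above (the doubled backslashes in JSON
# become single backslashes in the actual regular expressions).
include = [r"http://.*eclipse\.org/.*SMILA.*",
           r"http://.*eclipse\.org/.*smila.*"]
exclude = [r".*\?.*"]

def is_crawled(url):
    # Assumed semantics: at least one include pattern must match
    # and no exclude pattern may match.
    if not any(re.match(p, url) for p in include):
        return False
    if any(re.match(p, url) for p in exclude):
        return False
    return True

print(is_crawled("http://www.eclipse.org/smila/documentation.php"))  # True
print(is_crawled("http://wiki.eclipse.org/SMILA/Documentation"))     # True
print(is_crawled("http://wiki.eclipse.org/index.php?title=SMILA"))   # False (exclude pattern matches)
print(is_crawled("http://www.eclipse.org/downloads/"))               # False (no include pattern matches)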


A "stayOnHost/stayOnDomain" parameter will be implemented soon to better support such use cases. It will cause the crawler to ignore all links on a web page that to not point to the same host or domain than the URL of the web page itself.



Running the Job

Start the target job and the crawl job:

POST /smila/jobmanager/jobs/indexUpdate
POST /smila/jobmanager/jobs/crawlMultipleStartUrls

Both jobs are now RUNNING, but nothing else happens yet.
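
Starting the two jobs can also be done from a script, for example (same base URL assumption as above):

import requests

# Assumed base URL of a default local SMILA installation; adjust as needed.
SMILA = "http://localhost:8080"

# Start the target job first, then the crawl job.
for job_name in ("indexUpdate", "crawlMultipleStartUrls"):
    response = requests.post(SMILA + "/smila/jobmanager/jobs/" + job_name)
    response.raise_for_status()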

Push start URLs:

POST /smila/jobs/crawlMultipleStartUrls/record
{
  "_recordid": "startUrl",
  "httpUrl":"http://www.eclipse.org/smila"
}
POST /smila/jobs/crawlMultipleStartUrls/record
{
  "_recordid": "startUrl",
  "httpUrl":"http://wiki.eclipse.org/SMILA"
}

Things to note:

  • The value of the _recordid attribute is irrelevant, but the bulkbuilder requires it to be set.
  • The start URL must be provided as attribute httpUrl, regardless of the attribute mapping specified in the job.
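
With more than a handful of start URLs it is more convenient to push the records from a small script. A minimal sketch, using the push URL shown above and the same base URL assumption:

import requests

# Assumed base URL of a default local SMILA installation; adjust as needed.
SMILA = "http://localhost:8080"

start_urls = [
    "http://www.eclipse.org/smila",
    "http://wiki.eclipse.org/SMILA",
]

for url in start_urls:
    # The _recordid value is arbitrary, but the bulkbuilder requires one.
    record = {"_recordid": "startUrl", "httpUrl": url}
    response = requests.post(
        SMILA + "/smila/jobs/crawlMultipleStartUrls/record", json=record)
    response.raise_for_status()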

Finish the job:

POST /smila/jobmanager/jobs/crawlMultipleStartUrls/20120823-162109/finish

This will also cause the delta-delete to be triggered when the crawling is done. Note that you should disable delta-delete if you do not crawl all start URLs in each job run, or else the documents from start URLs not crawled in the latest job run will be removed from the index.
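
The finish request can be scripted in the same way. The sketch below assumes you have noted the job run id of the running crawl job (20120823-162109 in the example above):

import requests

# Assumed base URL of a default local SMILA installation; adjust as needed.
SMILA = "http://localhost:8080"

# Job run id of the running crawl job (example value from above).
job_run_id = "20120823-162109"

response = requests.post(
    SMILA + "/smila/jobmanager/jobs/crawlMultipleStartUrls/" + job_run_id + "/finish")
response.raise_for_status()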
