Crawling multiple start URLs in one job run

This page describes an alternative way of using the WebCrawler worker. It allows you to define multiple start URLs to be crawled in a single job run (but in multiple workflow runs) instead of only a single start URL. The main idea is to send each start URL as a simple record to the Bulkbuilder push API (/smila/job/<crawlJobName>/record) instead of specifying a single start URL as a parameter value in the job definition.

It would be easy to define further variants that crawl start URLs produced by some other worker, or to cover similar use cases.

Tip: Though we tested the following workflow and settings with the WebCrawler worker only, similar workflows using other crawler workers should work as well, provided that the used crawler worker is able to crawl follow-up links in its input slot that were produced by the very same worker in a previous task. An example of such a worker is the FileCrawler worker. Some workers might expect internal attributes to be set in these follow-up link records, which might cause problems. Please notify us if you observe such issues so that we can extend the respective worker accordingly.



Workflow Definition

You could add a workflow definition like the following:

POST /smila/jobmanager/workflows/
{
  "name":"multiWebCrawling",
  "modes":[
    "standard"
  ],
  "parameters": {
    "startUrl":"<send start urls via bulkbuilder>",
    "bulkLimitSize":1
  },
  "startAction":{
    "worker":"bulkbuilder",
    "output":{
      "insertedRecords":"linksToCrawlBucket"
    }
  },
  "actions":[
    {
      "worker":"webCrawler",
      "input":{
        "linksToCrawl":"linksToCrawlBucket"
      },
      "output":{
        "linksToCrawl":"linksToCrawlBucket",
        "crawledRecords":"crawledLinksBucket"
      }
    },
    {
      "worker":"deltaChecker",
      "input":{
        "recordsToCheck":"crawledLinksBucket"
      },
      "output":{
        "updatedRecords":"updatedLinksBucket",
        "updatedCompounds":"compoundLinksBucket"
      }
    },
    {
      "worker":"webExtractor",
      "input":{
        "compounds":"compoundLinksBucket"
      },
      "output":{
        "files":"fetchedLinksBucket"
      }
    },
    {
      "worker":"webFetcher",
      "input":{
        "linksToFetch":"updatedLinksBucket"
      },
      "output":{
        "fetchedLinks":"fetchedLinksBucket"
      }
    },
    {
      "worker":"updatePusher",
      "input":{
        "recordsToPush":"fetchedLinksBucket"
      }
    }
  ]
}

The differences to the standard "webCrawling" workflow are:

  • The start action is the bulkbuilder, not the webCrawler itself, so you can send records to this job using the document push API when it is running. We will use this to send records containing the start URLs.
  • The bulkbuilder parameter bulkLimitSize is set to 1 (byte), so each inserted record will be written to a bulk of its own and crawled in its own workflow run. This way, fatal errors caused by one start URL will not abort the crawl of another start URL.
  • The webCrawler parameter startUrl is set to a dummy value, because it is required, but we do not need it, so we do not have to include it in the job definition.
  • The job runs in "standard" mode instead of "runOnce" mode. This means that you have to finish it yourself after providing the start URLs.
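
If you keep the workflow definition in a JSON file, registering it can be scripted. The following is a minimal sketch in Python, assuming a default local SMILA installation with the REST API reachable at http://localhost:8080, the requests package installed, and the definition above stored in a file named multiWebCrawling.json (the base URL and file name are just examples):

import json
import requests

BASE_URL = "http://localhost:8080/smila"  # assumption: default local SMILA REST endpoint

# Load the workflow definition shown above from a local file (example file name).
with open("multiWebCrawling.json", encoding="utf-8") as f:
    workflow = json.load(f)

# Register the workflow at the job manager.
response = requests.post(f"{BASE_URL}/jobmanager/workflows/", json=workflow)
response.raise_for_status()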

Job Definition

This could be the job definition:

POST /smila/jobmanager/jobs/
{
  "name":"crawlMultipleStartUrls",
  "workflow":"multiWebCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"multiweb",
    "jobToPushTo":"indexUpdate",
    "linksPerBulk": 100,
    "filters":{
      "maxCrawlDepth": 3,
      "urlPatterns": {
         "include": [
           "http://.*eclipse\\.org/.*SMILA.*",
           "http://.*eclipse\\.org/.*smila.*"],
         "exclude": [".*\\?.*" ]
      }
    },
    "mapping":{
          "httpCharset": "Charset",
          "httpContenttype": "ContentType",
          "httpLastModified": "LastModifiedDate",
          "httpMimetype": "MimeType",
          "httpSize": "Size",
          "httpUrl": "Url",
          "httpContent": "Content"
    }
  }
}

The definition is very similar to a standard crawl job definition; it just does not include the start URL (which was already fixed to a dummy value in the workflow definition). Note that the urlPatterns will be applied to all URLs for each start URL, so the include patterns must be valid for all start URLs you are planning to crawl, or possibly nothing will be crawled at all. In this example we only want to crawl different parts of eclipse.org hosts, so the include patterns work for both start URLs.
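
To make sure your include and exclude patterns really accept the start URLs you are going to push, you can check them up front. The following is a minimal sketch in Python; it assumes that the patterns are applied as full matches against the complete URL, which is how the example patterns above are written:

import re

# Patterns from the job definition above.
include = [r"http://.*eclipse\.org/.*SMILA.*",
           r"http://.*eclipse\.org/.*smila.*"]
exclude = [r".*\?.*"]

# Start URLs we intend to push later (see "Running the Job" below).
start_urls = ["http://www.eclipse.org/smila",
              "http://wiki.eclipse.org/SMILA"]

def accepted(url):
    # Illustrative helper: mimics the assumed filter behaviour, i.e. a URL is kept
    # if it matches at least one include pattern and no exclude pattern.
    if not any(re.fullmatch(p, url) for p in include):
        return False
    return not any(re.fullmatch(p, url) for p in exclude)

for url in start_urls:
    print(url, "->", "crawled" if accepted(url) else "filtered out")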

Tip: You can also use the "stayOn" parameter for such use cases. It will cause the crawler to ignore all links on a web page that do not point to the same host or domain as the URL of the web page itself; see the WebCrawlerWorker parameters documentation for details.



Running the Job

Start the target job and the crawl job:

POST /smila/jobmanager/jobs/indexUpdate/
POST /smila/jobmanager/jobs/crawlMultipleStartUrls/

Both jobs are now in the RUNNING state, but nothing else happens yet.
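
If you drive the import from a script, you can start both jobs via the REST API and remember the job run id of the crawl job, which you will need for the finish call below. A minimal sketch, assuming a default local SMILA installation at http://localhost:8080, the requests package, and that the job start response contains the run id in a field named jobId (check the response of your SMILA version):

import requests

BASE_URL = "http://localhost:8080/smila"  # assumption: default local SMILA REST endpoint

# Start the target job first, then the crawl job.
requests.post(f"{BASE_URL}/jobmanager/jobs/indexUpdate/").raise_for_status()

response = requests.post(f"{BASE_URL}/jobmanager/jobs/crawlMultipleStartUrls/")
response.raise_for_status()

# Assumption: the start response contains the job run id in a field named "jobId";
# check the actual response of your SMILA version.
job_run_id = response.json()["jobId"]
print("crawl job run id:", job_run_id)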

Push start URLs:

POST /smila/job/crawlMultipleStartUrls/record/
{
  "_recordid": "startUrl",
  "httpUrl":"http://www.eclipse.org/smila",
  "crawlDepth": 4
}
POST /smila/job/crawlMultipleStartUrls/record/
{
  "_recordid": "startUrl",
  "httpUrl":"http://wiki.eclipse.org/SMILA",
}

Things to note:

  • The value of the _recordid attribute is irrelevant, but the bulkbuilder requires it to be set.
  • The start URL must be provided as attribute httpUrl, regardless of the attribute mapping specified in the job.
  • The optional crawlDepth parameter can be used to specify an individual crawl depth for the given start URL. If this parameter is not set, the WebCrawler worker's maxCrawlDepth parameter is used as the default. If maxCrawlDepth is not set either, the crawl depth is unlimited.
    • In the first pushed record above, "crawlDepth": 4 is set, so this value is used as the limit when following links.
    • In the second record, no crawlDepth is set, so "maxCrawlDepth": 3 (set in the crawl job above) will be used.
    • Hint: You can use "crawlDepth": -1 to make the crawl depth unlimited, regardless of any maxCrawlDepth setting.
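
Pushing the start URLs one record at a time is easy to script if you have many of them. A minimal sketch under the same assumptions as above (local SMILA at http://localhost:8080, requests installed); the URLs and per-URL crawl depths are just examples:

import requests

BASE_URL = "http://localhost:8080/smila"  # assumption: default local SMILA REST endpoint

# Example start URLs; a value of None means "use the job's maxCrawlDepth".
start_urls = {
    "http://www.eclipse.org/smila": 4,
    "http://wiki.eclipse.org/SMILA": None,
}

for url, crawl_depth in start_urls.items():
    record = {"_recordid": "startUrl", "httpUrl": url}
    if crawl_depth is not None:
        record["crawlDepth"] = crawl_depth
    response = requests.post(f"{BASE_URL}/job/crawlMultipleStartUrls/record/", json=record)
    response.raise_for_status()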

Finish the job:

POST /smila/jobmanager/jobs/crawlMultipleStartUrls/20120823-164700474635/finish/

(Of course, you have to adapt the job run id to the one of your own crawl job run.)
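
With the job run id captured when the crawl job was started, the finish call can be scripted as well; again a minimal sketch under the same assumptions as above:

import requests

BASE_URL = "http://localhost:8080/smila"  # assumption: default local SMILA REST endpoint
job_run_id = "20120823-164700474635"      # replace with the run id of your own crawl job run

# Finish the crawl job run; this also triggers delta-delete once crawling is done.
response = requests.post(f"{BASE_URL}/jobmanager/jobs/crawlMultipleStartUrls/{job_run_id}/finish/")
response.raise_for_status()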

This will also cause the delta-delete to be triggered when the crawling is done. Note that you should disable delta-delete if you do not crawl all start URLs in each job run; otherwise, the documents from the start URLs that were not crawled in the latest job run will be removed from the index.
