Revision as of 10:50, 23 August 2012
Crawling multiple start URLs in one job run
This page describes an alternative way of using the WebCrawler worker. One use case is to define multiple start URLs that should be crawled in a single job run (but in multiple workflow runs). The main idea is to send the start URLs as single records via the document push API (/smila/job/<crawlJobName>/record) instead of specifying a start URL in the job definition.
It would be easy to define further variants along the same lines, e.g. crawling start URLs produced by other workers.
Workflow Definition
You could add such a workflow definition:
{
  "name": "multiWebCrawling",
  "modes": [ "standard" ],
  "parameters": {
    "startUrl": "<send start urls via bulkbuilder>",
    "bulkLimitSize": 1
  },
  "startAction": {
    "worker": "bulkbuilder",
    "output": { "insertedRecords": "linksToCrawlBucket" }
  },
  "actions": [
    {
      "worker": "webCrawler",
      "input": { "linksToCrawl": "linksToCrawlBucket" },
      "output": {
        "linksToCrawl": "linksToCrawlBucket",
        "crawledRecords": "crawledLinksBucket"
      }
    },
    {
      "worker": "deltaChecker",
      "input": { "recordsToCheck": "crawledLinksBucket" },
      "output": {
        "updatedRecords": "updatedLinksBucket",
        "updatedCompounds": "compoundLinksBucket"
      }
    },
    {
      "worker": "webExtractor",
      "input": { "compounds": "compoundLinksBucket" },
      "output": { "files": "fetchedLinksBucket" }
    },
    {
      "worker": "webFetcher",
      "input": { "linksToFetch": "updatedLinksBucket" },
      "output": { "fetchedLinks": "fetchedLinksBucket" }
    },
    {
      "worker": "updatePusher",
      "input": { "recordsToPush": "fetchedLinksBucket" }
    }
  ]
}
The differences to the standard "webCrawling" workflow are:
- The start action is the bulkbuilder, not the webCrawler itself, so you can send records to the job using the document push API while it is running. We will use this to send records containing the start URLs.
- The bulkbuilder parameter bulkLimitSize is set to 1 (byte), so each inserted record is written to its own bulk and crawled in its own workflow run. This way a fatal error caused by one start URL will not abort the crawl of the other start URLs.
- The webCrawler parameter startUrl is fixed to a dummy value: the parameter is required, but its value is not actually needed here, and fixing it in the workflow means it does not have to be set in the job definition.
- The job runs in "standard" mode instead of "runOnce" mode. This means that you have to finish it yourself after providing the start URLs.
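The effect of the bulkLimitSize parameter can be sketched in a few lines of Python. This is not SMILA's actual bulkbuilder code, just an illustrative model of its size-based splitting: records are appended to a bulk until the serialized size reaches the limit, so a limit of 1 byte closes the bulk after every single record.

```python
import json

def split_into_bulks(records, bulk_limit_size):
    """Illustrative model of size-based bulk splitting: a bulk is
    closed as soon as its serialized size reaches the limit."""
    bulks, current, current_size = [], [], 0
    for record in records:
        current.append(record)
        current_size += len(json.dumps(record).encode("utf-8"))
        if current_size >= bulk_limit_size:
            bulks.append(current)
            current, current_size = [], 0
    if current:
        bulks.append(current)
    return bulks

start_urls = [
    {"_recordid": "startUrl", "httpUrl": "http://www.eclipse.org/smila"},
    {"_recordid": "startUrl", "httpUrl": "http://wiki.eclipse.org/SMILA"},
]

# With bulkLimitSize = 1, every record ends up in its own bulk,
# so each start URL is crawled in a separate workflow run.
print(len(split_into_bulks(start_urls, 1)))       # one bulk per record
print(len(split_into_bulks(start_urls, 10**6)))   # large limit: one shared bulk
```

With a larger limit, several start URLs would share a bulk and thus a workflow run, and a fatal error on one of them could abort the crawl of the others.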
Job Definition
This could be the job definition:
{
  "name": "crawlMultipleStartUrls",
  "workflow": "multiWebCrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "multiweb",
    "jobToPushTo": "indexUpdate",
    "linksPerBulk": 100,
    "filters": {
      "maxCrawlDepth": 5,
      "urlPatterns": {
        "include": [
          "http://.*eclipse\\.org/.*SMILA.*",
          "http://.*eclipse\\.org/.*smila.*"
        ],
        "exclude": [ ".*\\?.*" ]
      }
    },
    "mapping": {
      "httpCharset": "Charset",
      "httpContenttype": "ContentType",
      "httpLastModified": "LastModifiedDate",
      "httpMimetype": "MimeType",
      "httpSize": "Size",
      "httpUrl": "Url",
      "httpContent": "Content"
    }
  }
}
The definition is very similar to a standard crawl job definition; it just does not include the start URL (which was already fixed to a dummy value in the workflow definition). Note that the urlPatterns are applied to all URLs reached from each start URL, so the include patterns must be valid for all start URLs you plan to crawl, or possibly nothing will be crawled at all. In this example we only want to crawl different parts of eclipse.org hosts, so the include patterns work for all start URLs.
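Before starting a job it can be useful to check that the include/exclude patterns really cover all planned start URLs. The sketch below mimics the filter logic with ordinary regular expressions, under the assumption that a URL passes if it matches at least one include pattern (as a full match) and no exclude pattern; it is a sanity check, not SMILA's actual filter implementation.

```python
import re

# The patterns from the job definition above.
INCLUDE = [r"http://.*eclipse\.org/.*SMILA.*", r"http://.*eclipse\.org/.*smila.*"]
EXCLUDE = [r".*\?.*"]

def passes_filter(url):
    """Sketch of the urlPatterns logic (assumption: full-match semantics):
    keep a URL iff it matches some include pattern and no exclude pattern."""
    if not any(re.fullmatch(p, url) for p in INCLUDE):
        return False
    return not any(re.fullmatch(p, url) for p in EXCLUDE)

print(passes_filter("http://wiki.eclipse.org/SMILA/Documentation"))  # True
print(passes_filter("http://www.eclipse.org/smila/"))                # True
print(passes_filter("http://wiki.eclipse.org/SMILA?action=edit"))    # False ('?')
print(passes_filter("http://www.eclipse.org/jetty/"))                # False
```

Both start URLs used later on this page pass the filter, so the crawl can actually proceed from them.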
Running the Job
Start the target job and the crawl job:
POST /smila/jobmanager/jobs/indexUpdate
POST /smila/jobmanager/jobs/crawlMultipleStartUrls
The jobs are in RUNNING mode now, but nothing else happens.
Push start URLs:
POST /smila/job/crawlMultipleStartUrls/record
{
  "_recordid": "startUrl",
  "httpUrl": "http://www.eclipse.org/smila"
}
POST /smila/job/crawlMultipleStartUrls/record
{
  "_recordid": "startUrl",
  "httpUrl": "http://wiki.eclipse.org/SMILA"
}
Things to note:
- The value of the _recordid attribute is irrelevant, but the bulkbuilder requires it to be set.
- The start URL must be provided as attribute httpUrl, regardless of the attribute mapping specified in the job.
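The two pushes above can also be scripted. The sketch below builds the record payload for each start URL and shows how it could be POSTed with Python's standard library; the base URL http://localhost:8080 is an assumption, adjust it to your SMILA installation. To keep the example self-contained, it only prints the payloads instead of contacting a server.

```python
import json
import urllib.request

SMILA_BASE = "http://localhost:8080"  # assumption: default SMILA host/port

def build_start_record(url):
    """Build the record payload for one start URL. The _recordid value is
    arbitrary, but the bulkbuilder requires the attribute to be set."""
    return {"_recordid": "startUrl", "httpUrl": url}

def push_start_url(job_name, url):
    """POST one start URL record to the job's document push API."""
    request = urllib.request.Request(
        "%s/smila/job/%s/record" % (SMILA_BASE, job_name),
        data=json.dumps(build_start_record(url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Payloads that would be sent for the two start URLs of this example:
for start_url in ("http://www.eclipse.org/smila", "http://wiki.eclipse.org/SMILA"):
    print(json.dumps(build_start_record(start_url)))
```

Against a running SMILA instance you would call push_start_url("crawlMultipleStartUrls", start_url) for each URL instead of just printing the payloads.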
Finish the job:
POST /smila/jobmanager/jobs/crawlMultipleStartUrls/20120823-162109/finish
This will also cause the delta-delete to be triggered when the crawling is done. Note that you should disable delta-delete if you do not crawl all start URLs in each job run; otherwise the documents from the start URLs that were not crawled in the latest job run will be removed from the index.