SMILA/Documentation/Importing/Crawler/Web
The WebCrawler, WebFetcher, and WebExtractor workers are used for importing resources from a web server. For the big picture and how the workers interact, have a look at the [[SMILA/Documentation/Importing/Concept | Importing Concept]].
=== Web Crawler Worker ===
* Worker name: <tt>webCrawler</tt>
* Parameters:
** <tt>dataSource</tt>: ''(req.)'' name of the data source; currently only used to mark the produced records.
** <tt>startUrl</tt>: ''(req.)'' URL to start crawling at. Must be a valid URL, no additional escaping is done.
** <tt>waitBetweenRequests</tt>: ''(opt.)'' long value in milliseconds on how long to wait between HTTP requests (default: 0).
** <tt>linksPerBulk</tt>: ''(opt.)'' number of links in one bulk object for follow-up tasks (default: 10).
** <tt>linkErrorHandling</tt>: ''(opt., default "drop")'' specifies how to handle IO errors (e.g. network problems) when trying to access the URL in a link record to crawl. Possible values are:
*** "drop": The record will be ignored and not added to the "crawledRecords" output, but other links in the same bulk can be processed successfully.
*** "retry": The current task is finished with a recoverable error so that it might be retried later (depending on job/task manager settings). However, if the task cannot be finished successfully within the configured number of retries, it will finally fail, and the other links contained in the same bulk will not be crawled either.
{{Tip|This parameter does not affect errors reported by the accessed web server, like "404 Not Found", "400 Bad Request", or "500 Internal Server Error". Records causing such errors will always be dropped.}}
 
** <tt>filters</tt>: ''(opt.)'' a map of filter settings, i.e. instructions which links to include in or exclude from the crawl:
*** <tt>maxCrawlDepth</tt>: the maximum crawl depth when following links.
*** <tt>followRedirects</tt>: whether to follow redirects or not (default: false).
*** <tt>maxRedirects</tt>: maximum number of allowed redirects when following redirects is enabled (default: 1).
*** <tt>stayOn</tt>: whether to follow links that would leave the host or domain. Valid values: <tt>host</tt>, <tt>domain</tt> (default: follow all links).
**** ''Hint: The implementation of <tt>stayOn 'domain'</tt> is currently very simple. It takes the domain to be the host name without its first part, if at least two host name parts remain (<tt>www.foo.com -> foo.com</tt>, <tt>foo.com -> foo.com</tt>). Sometimes this won't fit (<tt>bbc.co.uk -> co.uk</tt>!); in such cases you should use URL patterns instead.''
*** <tt>urlPatterns</tt>: regex patterns for filtering crawled elements on the basis of their URL. Note that the crawler additionally reads the crawled sites' <tt>robots.txt</tt>. See [[SMILA/Documentation/Importing/Crawler/Web#User-Agent_and_robots.txt|below]] for details.
**** <tt>include</tt>: if include patterns are specified, at least one of them must match the URL. If no include patterns are specified, this is handled as if all URLs are included.
**** <tt>exclude</tt>: if at least one exclude pattern matches the URL, the crawled element is filtered out.
** <tt>mapping</tt>: ''(req.)'' specifies how to map link properties to record attributes:
*** <tt>httpUrl</tt>: ''(req.)'' mapping attribute for the URL.
*** <tt>httpMimetype</tt>: ''(opt.)'' mapping attribute for the mime type.
*** <tt>httpCharset</tt>: ''(opt.)'' mapping attribute for the character set.
*** <tt>httpContenttype</tt>: ''(opt.)'' mapping attribute for the content type.
*** <tt>httpLastModified</tt>: ''(opt.)'' mapping attribute for the link's last-modified date.
*** <tt>httpSize</tt>: ''(opt.)'' mapping attribute for the link content's size (in bytes).
*** <tt>httpContent</tt>: ''(opt.)'' attachment name where the link content is written to.
* Task generator: <tt>runOnceTrigger</tt>
* Input slots:
** <tt>linksToCrawl</tt>: Records describing links to crawl.
* Output slots:
** <tt>linksToCrawl</tt>: Records describing outgoing links from the crawled resources. Should be connected to the same bucket as the input slot.
** <tt>crawledRecords</tt>: Records describing crawled resources. For resources of mimetype <tt>text/html</tt> the records have the content attached. For other resources, use a webFetcher worker later in the workflow to get the content.
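Putting the parameters above together, the parameter map for a webCrawler worker in a job definition might look like the following fragment. This is an illustrative sketch only: the attribute names in the <tt>mapping</tt> section, the URLs, and the pattern values are made up, and the surrounding job-definition JSON is omitted.

<pre>
"parameters": {
  "dataSource": "web",
  "startUrl": "http://example.org/",
  "waitBetweenRequests": 500,
  "linksPerBulk": 10,
  "linkErrorHandling": "retry",
  "filters": {
    "maxCrawlDepth": 3,
    "followRedirects": true,
    "maxRedirects": 1,
    "stayOn": "host",
    "urlPatterns": {
      "include": [ "http://example\\.org/.*" ],
      "exclude": [ ".*\\.(png|jpg|css|js)$" ]
    }
  },
  "mapping": {
    "httpUrl": "Url",
    "httpMimetype": "MimeType",
    "httpSize": "Size",
    "httpContent": "Content"
  }
}
</pre>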
===== Filter patterns and normalization =====
When defining filter patterns, keep in mind that URLs are normalized ''before'' filters are applied. Normalization means:
* the URL will be made absolute when it's relative (e.g. /relative/link -> http://my.domain.de/relative/link)
* paths will be normalized (e.g. host/path/../path2 -> host/path2)
* scheme and host will be converted to lower case (e.g. HTTP://WWW.Host.de/Path -> http://www.host.de/Path)
** ''Hint: The path will not be converted to lower case!''
* fragments will be removed (e.g. host/path#fragment -> host/path)
* the default port 80 will be removed (e.g. host:80 -> host)
* 'opaque' URIs cannot be handled and will be filtered out automatically (e.g. javascript:void(0), mailto:andreas.weber@empolis.com)
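For illustration, the normalization steps and the <tt>urlPatterns</tt> semantics can be approximated in a few lines of Python. This is only a sketch with hypothetical helper names (<tt>normalize_url</tt>, <tt>url_accepted</tt>), not SMILA's actual (Java) implementation; in particular, whether SMILA anchors the regex patterns is not specified here.

```python
import posixpath
import re
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit

def normalize_url(link, base="http://my.domain.de/"):
    """Approximate the normalization rules above; returns None for 'opaque' URIs."""
    url = urljoin(base, link)        # make relative links absolute
    url, _ = urldefrag(url)          # remove "#fragment"
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None                  # javascript:, mailto:, ... cannot be crawled
    host = parts.netloc.lower()      # lower-case the host (the scheme is parsed lower-case)
    if host.endswith(":80"):
        host = host[:-3]             # remove the default port 80
    # normalize ".." in paths; note that the path itself is NOT lower-cased
    path = posixpath.normpath(parts.path) if parts.path else "/"
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

def url_accepted(url, include=(), exclude=()):
    """urlPatterns semantics: if include patterns exist, at least one must match;
    any matching exclude pattern filters the URL out."""
    if include and not any(re.search(p, url) for p in include):
        return False
    return not any(re.search(p, url) for p in exclude)
```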
  
 
==== Configuration ====
The configuration directory <tt>org.eclipse.smila.importing.crawler.web</tt> contains the configuration file <tt>webcrawler.properties</tt>, which can contain the following properties:

* proxyHost (default: none)
* proxyPort (default: 80)
* socketTimeout (default: none, i.e. no socket timeout)
* userAgent (default: "SMILA ([http://wiki.eclipse.org/SMILA/UserAgent http://wiki.eclipse.org/SMILA/UserAgent]; smila-dev@eclipse.org)").
  
The configuration properties <tt>proxyHost</tt> and <tt>proxyPort</tt> are used to define a proxy for the web crawler (i.e. the <tt>DefaultFetcher</tt> class uses them to configure its HTTP client), whereas the <tt>socketTimeout</tt> property defines the fetcher's timeout while retrieving data from the server. If you omit the <tt>socketTimeout</tt> property, the fetcher will set no timeout.
===== User-Agent and robots.txt =====
If you use SMILA for your own project, please change <tt>userAgent</tt> to your own name, URL, and email address. Apart from telling the web server who you are, the value is relevant for choosing the appropriate settings from the crawled sites' <tt>robots.txt</tt> files: the WebCrawler chooses the first set of "Disallow:" lines for which the "User-agent:" line is a case-insensitive substring of this configuration property.
It is currently not possible to ignore the <tt>robots.txt</tt> by configuration or job parameters.
Also, the web crawler uses only the basic <tt>robots.txt</tt> directives "User-agent:" and "Disallow:" as defined in the original standard (see [http://www.robotstxt.org/orig.html http://www.robotstxt.org/orig.html]). That means it does not use "Allow:" lines, and the values in "Disallow:" lines are not evaluated as regular expressions or other patterns. The crawler will just ignore all links that start exactly (case-sensitively) with one of the "Disallow:" values. The only exception is an empty "Disallow:" value, which means that all links on this site are allowed to be crawled.
Also, we do not observe "Crawl-delay:" parameters in <tt>robots.txt</tt>, so please take care that you use appropriate <tt>waitBetweenRequests</tt> (see above), <tt>[[SMILA/Documentation/JobDefinitions|taskControl/delay]]</tt> and scale-up settings for the web crawler worker yourself.
The <tt>robots.txt</tt> is fetched from the web server only once per site and job run. So changes in a <tt>robots.txt</tt> will not become effective until the crawl job is restarted.
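The matching rules described above can be sketched as follows. This is an illustrative approximation with hypothetical function names, not the actual WebCrawler code; e.g. the grouping of consecutive "User-agent:" lines is simplified compared to a full <tt>robots.txt</tt> parser.

```python
def disallowed_prefixes(robots_txt, user_agent):
    """Collect the "Disallow:" values of the first group whose "User-agent:"
    value is a case-insensitive substring of the configured userAgent."""
    groups = []  # list of (agent, disallow-values) pairs, in file order
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            groups.append((value, []))
        elif field == "disallow" and groups:
            groups[-1][1].append(value)
    for agent, disallows in groups:
        if agent.lower() in user_agent.lower():  # substring match, first group wins
            return disallows
    return []

def link_allowed(path, disallows):
    """A link is ignored iff it starts (case-sensitively) with a non-empty
    "Disallow:" value; an empty "Disallow:" value allows everything."""
    return not any(d and path.startswith(d) for d in disallows)
```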
  
 
===== Configuring a proxy =====
You can configure the proxy the web crawler should use by defining it in the configuration file (see above). E.g. to set up the web crawler to use a proxy at proxy-host:3128, use the following configuration:
<pre>
proxyHost=proxy-host
proxyPort=3128
</pre>
 
Alternatively, you can use the JRE system properties <tt>http.proxyHost</tt> and <tt>http.proxyPort</tt> (see [http://docs.oracle.com/javase/7/docs/technotes/guides/net/proxies.html http://docs.oracle.com/javase/7/docs/technotes/guides/net/proxies.html] for more information on proxy system properties).
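For the system-property alternative, the properties would be passed as VM arguments to the JVM that runs SMILA (the exact startup command depends on your installation), e.g.:

<pre>
-Dhttp.proxyHost=proxy-host
-Dhttp.proxyPort=3128
</pre>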
  
==== Internal structure ====
  
 
To make it easier to extend and improve the web crawler, it is divided internally into components. Each of them is a single OSGi service that handles one part of the crawl functionality and can be exchanged individually. The architecture looks like this:
[[Image:SMILA-Importing-Web-Crawler-Internal.png]]
The WebCrawler worker is started with an input bulk that contains records with URLs to crawl. (The exception to this rule is the start of the crawl process, where it gets a task without an input bulk, which causes it to generate an input record from its configured <tt>startUrl</tt> parameter.) Then the components are executed like this:
* First, the <tt>VisitedLinksService</tt> is asked whether this link was already crawled by someone else in this crawl job run. If so, the record is just dropped and no output is produced. Otherwise, the link is marked as visited in the <tt>VisitedLinksService</tt> and processing goes on.
* The <tt>Fetcher</tt> is called to get the metadata (e.g. the mime type). If the mime type of the resource is suitable for link extraction, the Fetcher also gets the content. Otherwise the content will only be fetched by the WebFetcher worker later in the crawl workflow, to save IO load.
* If the content of the resource was fetched, the <tt>LinkExtractor</tt> is called to extract outgoing links (e.g. it looks for <A> tags). It can produce multiple link records, each containing one absolute outgoing URL.
* If outgoing links were found, the current crawl depth is checked: if a maximum crawl depth is configured for this job and it is exceeded, the links are discarded. The current crawl depth is stored in each link record (using the attribute <tt>crawlDepth</tt>).
* The <tt>LinkFilter</tt> is called next to remove links that should not be followed (e.g. because they are on a different site) or remove duplicates.
* In the last step, the <tt>RecordProducer</tt> is called to decide how the processed record should be written to the <tt>recordBulks</tt> output bulk. The producer can also modify the records or split them into multiple records, if necessary for the use case.
Both fetcher and link filter will check the input links against the <tt>robots.txt</tt> file of the site. To prevent multiple access to the same site's <tt>robots.txt</tt>, the disallowed links will be stored in the job-run data of the crawl job. Only the fetcher will actually read the <tt>robots.txt</tt> if it has not been read yet. The link filter will only use already cached settings.
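The sequence of steps above can be sketched as a toy crawl step. All names here are simplified stand-ins for the SMILA components (VisitedLinksService, Fetcher, LinkExtractor, LinkFilter, RecordProducer), and records are modeled as plain dicts; this is not the actual worker code.

```python
def crawl_step(link, visited, fetch, extract_links, filter_links, produce):
    """One WebCrawler task for a single link record, following the steps above."""
    url, depth = link["httpUrl"], link["crawlDepth"]
    if url in visited:                 # VisitedLinksService: already crawled?
        return [], []                  # drop the record, produce no output
    visited.add(url)                   # otherwise mark as visited and go on
    metadata, content = fetch(url)     # Fetcher: metadata, plus content if extractable
    # LinkExtractor: only if content was fetched and the crawl depth is not exhausted
    out_urls = extract_links(content) if content and depth > 0 else []
    out_urls = filter_links(out_urls)  # LinkFilter: drop unwanted links, duplicates
    links_to_crawl = [{"httpUrl": u, "crawlDepth": depth - 1} for u in out_urls]
    crawled_records = produce(url, metadata, content)  # RecordProducer
    return links_to_crawl, crawled_records
```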
  
===== Scaling =====
Outgoing links are separated into multiple bulks to improve scaling: each outgoing link from the initial task that crawls the <tt>startUrl</tt> is written to a bulk of its own, while outgoing links from later tasks are written to separate bulks according to the <tt>linksPerBulk</tt> parameter. The outgoing crawled records are divided into bulks of at most 100 records.
==== Implementation details ====
  
 
* <tt>ObjectStoreVisitedLinksService</tt> (implements <tt>VisitedLinksService</tt>): Uses the <tt>ObjectStoreService</tt> to store which links have been visited, similar to the <tt>[[SMILA/Documentation/Importing/DeltaCheck#ObjectStoreDeltaService|ObjectStoreDeltaService]]</tt>. It uses a configuration file with the same properties in the same configuration directory, but named <tt>visitedlinksstore.properties</tt>.
* <tt>DefaultFetcher</tt>: Uses a GET request to read the URL. Currently, authentication is not supported. Writes the content to the attachment <tt>httpContent</tt> if the resource is of mimetype <tt>text/html</tt>, and sets the following attributes:
** <tt>httpSize</tt>: value from HTTP header <tt>Content-Length</tt> (-1 if not set), as a Long value.
** <tt>httpContenttype</tt>: value from HTTP header <tt>Content-Type</tt>, if set.
** <tt>httpLastModified</tt>: value from HTTP header <tt>Last-Modified</tt>, if set, as a DateTime value.
** <tt>_isCompound</tt>: set to <tt>true</tt> for resources that are identified as extractable compound objects by the running CompoundExtractor service.
* <tt>DefaultRecordProducer</tt>: Sets the record source and calculates the <tt>_deltaHash</tt> value for the DeltaChecker worker (first wins):
** if content is attached, calculate a digest.
** if the <tt>httpLastModified</tt> attribute is set, use it as the hash.
** if the <tt>httpSize</tt> attribute is set, concatenate it with the value of the <tt>httpMimetype</tt> attribute and use the result as the hash.
** if none of the above works, create a UUID to force an update.
* <tt>DefaultLinkExtractor</tt> (implements <tt>LinkExtractor</tt>): Simple link extraction from HTML <tt><A href="..."></tt> tags using the tagsoup HTML parser.
* <tt>DefaultLinkFilter</tt>: Links are normalized (e.g. fragment parts of URLs ("#...") are removed) and filtered against the specified filter configuration.
* The attribute <tt>crawlDepth</tt> is used to track the crawl depth of each link to support checking the crawl depth with the <tt>maxCrawlDepth</tt> filter: It's initialized with the <tt>maxCrawlDepth</tt> value for the start URL and decreased with each crawl step. If it reaches 0 in a <tt>linksToCrawl</tt> record, no links are extracted from this resource, but only a <tt>crawledRecord</tt> for the resource itself is produced.
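The "first wins" cascade of the <tt>DefaultRecordProducer</tt> can be sketched like this. It is a minimal sketch: the concrete digest algorithm is not specified on this page, so sha1 is an assumption here, as is the function name.

```python
import hashlib
import uuid

def delta_hash(record, content=None):
    """'First wins' cascade for the _deltaHash value (illustrative sketch)."""
    if content is not None:                   # 1. attached content -> digest
        return hashlib.sha1(content).hexdigest()
    if "httpLastModified" in record:          # 2. Last-Modified value as the hash
        return str(record["httpLastModified"])
    if "httpSize" in record:                  # 3. size concatenated with the mime type
        return str(record["httpSize"]) + str(record.get("httpMimetype", ""))
    return str(uuid.uuid4())                  # 4. a fresh UUID forces an update
```

Because step 4 yields a different value on every run, records without content, date, or size are always treated as changed by the DeltaChecker worker.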
  
 
=== Web Fetcher Worker ===
* Worker name: <tt>webFetcher</tt>
* Parameters:
** <tt>waitBetweenRequests</tt>: ''(opt., see Web Crawler)''
** <tt>linkErrorHandling</tt>: ''(opt., default "drop")'' specifies how to handle IO errors (e.g. network problems that might resolve after a while) when trying to access the URL in a link record. It is similar to the Web Crawler parameter with the same name. Possible values are:
*** "drop": The record will be ignored and not added to the "crawledRecords" output, but other links in the same bulk can be processed successfully.
*** "retry": The current task is finished with a recoverable error so that it might be retried later (depending on job/task manager settings). However, if the task cannot be finished successfully within the configured number of retries, it will finally fail, and the other links contained in the same bulk will not be crawled either.
{{Tip|This parameter does not affect errors reported by the accessed web server, like "404 Not Found", "400 Bad Request", or "500 Internal Server Error". Records causing such errors will always be written to the output unchanged.}}
** <tt>filters</tt>:
*** <tt>followRedirects</tt>: ''(opt., see Web Crawler)''
*** <tt>maxRedirects</tt>: ''(opt., see Web Crawler)''
*** <tt>urlPatterns</tt>: ''(opt., see Web Crawler)'' applied to the resulting URL of a redirect
**** <tt>include</tt>: ''(opt., see Web Crawler)''
**** <tt>exclude</tt>: ''(opt., see Web Crawler)''
** <tt>mapping</tt>: ''(req., see Web Crawler)''
*** <tt>httpUrl</tt>: ''(req.)'' the attribute containing the URL from which to fetch the content
* Output slots:
** <tt>fetchedLinks</tt>: The incoming records with the content of the resource attached.
  
The fetcher tries to get the content of a web resource identified by the attribute <tt>httpUrl</tt>, if the attachment <tt>httpContent</tt> is not yet set. Like the <tt>DefaultFetcher</tt> above, it does not do authentication to read the resource.
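Schematically, the WebFetcher's decision looks like this. The names <tt>fetch_if_needed</tt> and <tt>http_get</tt> are hypothetical and records are modeled as plain dicts; SMILA records and attachments are Java objects.

```python
def fetch_if_needed(record, http_get):
    """Fetch the resource behind httpUrl only if the httpContent
    attachment is not present yet (sketch of the WebFetcher behaviour)."""
    attachments = record.setdefault("attachments", {})
    if "httpContent" not in attachments:
        # plain GET request, no authentication
        attachments["httpContent"] = http_get(record["httpUrl"])
    return record
```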
  
 
=== Web Extractor Worker ===
* Worker name: <tt>webExtractor</tt>
* Parameters:
** <tt>waitBetweenRequests</tt>: ''(opt., see Web Crawler)''
** <tt>filters</tt>: ''(opt., see Web Crawler)''
*** <tt>maxCrawlDepth</tt>: ''(opt., see Web Crawler)''
*** <tt>followRedirects</tt>: ''(opt., see Web Crawler)''
*** <tt>maxRedirects</tt>: ''(opt., see Web Crawler)''
[[Category:SMILA]]

Revision as of 06:21, 17 September 2013

WebCrawler, WebFetcher and WebExtractor worker are used for importing files from a web server. For a big picture and the worker's interaction have a look at the Importing Concept.

Web Crawler Worker

  • Worker name: webCrawler
  • Parameters:
    • dataSource: (req.) name of data source, used only to mark produced records currently.
    • startUrl: (req.) URL to start crawling at. Must be a valid URL, no additional escaping is done.
    • waitBetweenRequests: (opt.) long value in milliseconds on how long to wait between HTTP requests (default: 0).
    • linksPerBulk: (opt.) number of links in one bulk object for follow-up tasks (default: 10)
    • linkErrorHandling: (opt., default "drop") specifies how to handle IO errors (e.g. network problems) when trying to access the URL in a link record to crawl: possible values are:
      • "drop": The record will be ignored and not added to the "crawledRecords" output, but other links in the same bulk can be processed successfully.
      • "retry": The current task is finished with a recoverable error so that it might be retried later (depending on job/task manager settings). However, if the task cannot be finished successfully within the configured number of retries, it will finally fail and all other links contained in the same bulk will not be crawled, too.
Idea.png
This parameter does not affect handling errors reported by the accessed webserver like "404 Not Found", "400 Bad Request", "500 Internal Server Error" etc. Records causing such errors will always be dropped.
    • filters: (opt.) A map containing filter settings, i.e. instructions which links to include or exclude from the crawl. This parameter is optional.
      • maxCrawlDepth: the maximum crawl depth when following links.
      • followRedirects: whether to follow redirects or not (default: false).
      • maxRedirects: maximum number of allowed redirects when following redirects is enabled (default: 1).
      • stayOn: whether to follow links that would leave the host resp. domain or not. Valid values: host, domain (default: follow all links).
        • Hint: implementation of stayOn 'domain' is currently very simple. It just takes the domain as the host name without the first part, if at least two host name parts remain. (www.foo.com -> foo.com, foo.com -> foo.com). Sometimes this won't fit (bbc.co.uk -> co.uk (!)), than you should use url patterns instead.
      • urlPatterns: regex patterns for filtering crawled elements on the basis of their URL. Note that the crawler additionally reads the crawled sites' robots.txt. See below for details.
        • include: if include patterns are specified, at least one of them must match the URL. If no include patterns are specified, this is handled as if all URLs are included.
        • exclude: if at least one exclude pattern matches the URL, the crawled element is filtered out
    • mapping (req.) specifies how to map link properties to record attributes
      • httpUrl (req.) mapping attribute for the URL
      • httpMimetype (opt.) mapping attribute for the mime type
      • httpCharset (opt.) mapping attribute for character set
      • httpContenttype (opt.) mapping attribute for the content type
      • httpLastModified (opt.) mapping attribute for the link's last modified date
      • httpSize (opt.) mapping attribute for the link content's size (in bytes)
      • httpContent (opt.) attachment name where the link content is written to
  • Task generator: runOnceTrigger
  • Input slots:
    • linksToCrawl: Records describing links to crawl.
  • Output slots:
    • linksToCrawl: Records describing outgoing links from the crawled resources. Should be connected to the same bucket as the input slot.
    • crawledRecords: Records describing crawled resources. For resources of mimetype text/html the records have the content attached. For other resources, use a webFetcher worker later in the workflow to get the content.
Filter patterns and normalization

When defining filter patterns, keep in mind that URLs are normalized before filters are applied. Normalization means:

  • the URL will be made absolute when it's relative (e.g. /relative/link -> http://my.domain.de/relative/link)
  • paths will be normalized (e.g. host/path/../path2 -> host/path2)
  • scheme and host will be converted to lower case (e.g. HTTP://WWW.Host.de/Path -> http://www.host.de/Path)
    • Hint: The path will not be converted to lower case!
  • fragments will be removed (e.g. host/path#fragment -> host/path)
  • the default port 80 will be removed (e.g. host:80 -> host)
  • 'opaque' URIs can not be handled and will be filtered out automatically (e.g. javascript:void(0), mailto:andreas.weber@empolis.com)

Configuration

The configuration directory org.eclipse.smila.importing.crawler.web contains the configuration file webcrawler.properties.

The configuration properties can contain the following properties:

  • proxyHost (default: none)
  • proxyPort (default: 80)
  • socketTimeout (default: none, i.e. no socket timeout)
  • userAgent (default: "SMILA (http://wiki.eclipse.org/SMILA/UserAgent; smila-dev@eclipse.org)").

The configuration properties proxyHost and proxyPort are used to define a proxy for the web crawler (i.e. the DefaultFetcher class is using these configuration to configure its HTTP client) whereas the socketTimeout parameter defines how the fetcher's timeout is while retrieving data from the server. If you omit the socketTimeout parameter, the fetcher will set no timeout.

User-Agent and robots.txt

If you use SMILA for your own project, please change userAgent to your an own name, URL and email address. Apart from telling the web server who you are, the value is relevant for choosing the appropriate settings from the crawled sites' robots.txt files: The WebCrawler chooses the first set of "Disallow:" lines for which the "User-agent:" line is a case-insensitive substring of this configuration property.

It is currently not possible to ignore the robots.txt by configuration or job parameters.

Also, the web crawler uses only the basic robots.txt directives, "User-agent:" and "Disallow:" as defined in the original standard (see http://www.robotstxt.org/orig.html). That means it does not use "Allow:" lines, and the values in "Disallow:" lines are not evalutated as regular or other expressions. The crawler will just ignore all links that start exactly (case-sensitive) with one of the "Disallow:" values. The only exception is an empty "Disallow:" value meaning that all links on this site are allowed to be crawled.

Also, we do not observe "Crawl-delay:" parameters in robots.txt, so please take care that you use appropriate waitBetweenRequests (see above), taskControl/delay and scale-up settings for the web crawler worker yourself.

The robots.txt is fetched from the web server only once per site and job run. So changes in a robots.txt will not become effective until the crawl job is restarted.

Configuring a proxy

You can configure the proxy the web crawler should use by defining the proxy in the configuration file (see above). E.g. to set up the web crawler to use a proxy at proxy-host:3128, use the following configuration:

proxyHost=proxy-host
proxyPort=3128

Alternatively you can also use the JRE system properties http.proxyHost and http.proxyPort (see http://docs.oracle.com/javase/7/docs/technotes/guides/net/proxies.html for more information on proxy system properties).

Internal structure

To make it easier to extend and improve the web crawler it is divided internally into components. Each of them is a single OSGi service that handles one part of the crawl functionality and can be exchanged individually to improve a single part of the functionality. The architecture looks like this:

SMILA-Importing-Web-Crawler-Internal.png

The WebCrawler worker is started with an input bulk that contains records with URLs to crawl. (The exception to this rule is the start of the crawl process where it gets a task without an input bulk, which causes it to generate an input record from its configured startUrl parameter). Then the components are executed like this:

* First, a <tt>VisitedLinksService</tt> is asked whether this link was already crawled by another task in this crawl job run. If so, the record is dropped and no output is produced. Otherwise, the link is marked as visited in the <tt>VisitedLinksService</tt> and processing goes on.
* The <tt>Fetcher</tt> is called to get the metadata (e.g. the mime type). If the mime type of the resource is suitable for link extraction, the <tt>Fetcher</tt> also gets the content. Otherwise the content will only be fetched later in the crawl workflow by the WebFetcher worker, to save IO load.
* If the content of the resource was fetched, the <tt>LinkExtractor</tt> is called to extract outgoing links (e.g. by looking for <tt>&lt;A&gt;</tt> tags). It can produce multiple link records, each containing one absolute outgoing URL.
* If outgoing links were found, the current crawl depth is checked: if a maximum crawl depth is configured for this job and it is exceeded, the links are discarded. The current crawl depth is stored in each link record (in the attribute <tt>crawlDepth</tt>).
* The <tt>LinkFilter</tt> is called next to remove links that should not be followed (e.g. because they are on a different site) and to remove duplicates.
* In a last step, the <tt>RecordProducer</tt> is called to decide how the processed record is written to the <tt>recordBulks</tt> output bulk. The producer can modify the records or split them into multiple records, if necessary for the use case.
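The control flow above can be sketched roughly as follows. This is illustrative Python pseudocode, not SMILA's actual Java implementation; the function names and the callables standing in for the OSGi components are assumptions:

```python
# Illustrative sketch of the component pipeline (hypothetical names, not
# SMILA's actual Java code); the callables stand in for the OSGi services.
def crawl_step(record, visited, fetcher, extract_links, link_filter, produce):
    url = record["httpUrl"]
    # 1. VisitedLinksService: drop records that were already crawled.
    if url in visited:
        return None, []
    visited.add(url)
    # 2. Fetcher: metadata always, content only if suitable for extraction.
    metadata, content = fetcher(url)
    record.update(metadata)
    # 3. LinkExtractor: absolute outgoing links from the content.
    outgoing = extract_links(content) if content is not None else []
    # 4. Crawl depth check: discard links once the depth budget is used up.
    depth = record["crawlDepth"]
    if depth <= 0:
        outgoing = []
    # 5. LinkFilter: remove unwanted links and duplicates.
    outgoing = link_filter(outgoing)
    # 6. RecordProducer decides what goes to the output bulk.
    follow_ups = [{"httpUrl": l, "crawlDepth": depth - 1} for l in outgoing]
    return produce(record), follow_ups
```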

Both the fetcher and the link filter check input links against the robots.txt file of the site. To prevent multiple accesses to the same site's robots.txt, the disallowed links are stored in the job-run data of the crawl job. Only the fetcher actually reads the robots.txt if it has not been read yet; the link filter only uses the already cached settings.
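The caching behaviour can be sketched like this (illustrative Python with hypothetical helper names; the real implementation keeps the disallowed links in the job-run data, not in a module-level dictionary):

```python
# Hypothetical sketch: only the fetcher reads robots.txt (once per site),
# the link filter just consults the cached result.
_disallowed_cache = {}  # site -> list of disallowed path prefixes

def _site_and_path(url):
    parts = url.split("/", 3)
    path = "/" + parts[3] if len(parts) > 3 else "/"
    return parts[2], path  # crude host/path extraction for the sketch

def fetcher_allowed(url, read_robots):
    site, path = _site_and_path(url)
    if site not in _disallowed_cache:
        _disallowed_cache[site] = read_robots(site)  # fetch robots.txt once
    return not any(path.startswith(p) for p in _disallowed_cache[site])

def filter_allowed(url):
    # The link filter never fetches robots.txt itself; unknown sites pass.
    site, path = _site_and_path(url)
    return not any(path.startswith(p) for p in _disallowed_cache.get(site, []))
```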

==== Scaling ====

Outgoing links are separated into multiple bulks to improve scaling: each outgoing link from the initial task that crawls the <tt>startUrl</tt> is written to a bulk of its own, while outgoing links from later tasks are written to separate bulks according to the <tt>linksPerBulk</tt> parameter. The outgoing crawled records are divided into bulks of at most 100 records.
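The partitioning rule can be sketched as follows (illustrative Python; the function name is an assumption):

```python
# Sketch of splitting outgoing links into bulks: one link per bulk for the
# initial task, linksPerBulk links per bulk for all later tasks.
def split_into_bulks(links, links_per_bulk, is_initial_task):
    size = 1 if is_initial_task else links_per_bulk
    return [links[i:i + size] for i in range(0, len(links), size)]
```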

==== Implementation details ====

* <tt>ObjectStoreVisitedLinksService</tt> (implements <tt>VisitedLinksService</tt>): Uses the ObjectStoreService to store which links have been visited, similar to the ObjectStoreDeltaService. It uses a configuration file with the same properties in the same configuration directory, but named <tt>visitedlinksstore.properties</tt>.
* <tt>DefaultFetcher</tt>: Uses a GET request to read the URL. Currently, authentication is not supported. Writes the content to attachment <tt>httpContent</tt> if the resource is of mimetype <tt>text/html</tt>, and sets the following attributes:
** <tt>httpSize</tt>: value of HTTP header <tt>Content-Length</tt> (-1, if not set), as a Long value.
** <tt>httpContenttype</tt>: value of HTTP header <tt>Content-Type</tt>, if set.
** <tt>httpMimetype</tt>: mimetype part of HTTP header <tt>Content-Type</tt>, if set.
** <tt>httpCharset</tt>: charset part of HTTP header <tt>Content-Type</tt>, if set.
** <tt>httpLastModified</tt>: value of HTTP header <tt>Last-Modified</tt>, if set, as a DateTime value.
** <tt>_isCompound</tt>: set to <tt>true</tt> for resources that are identified as extractable compound objects by the running CompoundExtractor service.
* <tt>DefaultRecordProducer</tt>: Sets the record source and calculates the <tt>_deltaHash</tt> value for the DeltaChecker worker (first wins):
** if content is attached, calculate a digest.
** if the <tt>httpLastModified</tt> attribute is set, use it as the hash.
** if the <tt>httpSize</tt> attribute is set, concatenate it with the value of the <tt>httpMimetype</tt> attribute and use this as the hash.
** if nothing else works, create a UUID to force an update.
* <tt>DefaultLinkExtractor</tt> (implements <tt>LinkExtractor</tt>): Simple link extraction from HTML <tt>&lt;A href="..."&gt;</tt> tags using the tagsoup HTML parser.
* <tt>DefaultLinkFilter</tt>: Links are normalized (e.g. fragment parts of URLs ("#...") are removed) and filtered against the specified filter configuration.
* The attribute <tt>crawlDepth</tt> is used to track the crawl depth of each link to support the <tt>maxCrawlDepth</tt> filter: it is initialized with the <tt>maxCrawlDepth</tt> value for the start URL and decreased with each crawl step. If it reaches 0 in a <tt>linksToCrawl</tt> record, no links are extracted from this resource; only a <tt>crawledRecord</tt> for the resource itself is produced.
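The "first wins" fallback chain of the DefaultRecordProducer can be sketched like this (illustrative Python, not the actual Java implementation; the digest algorithm shown here is an assumption):

```python
import hashlib
import uuid

# Sketch of the "first wins" _deltaHash fallback chain described above.
def delta_hash(record, content=None):
    if content is not None:                   # 1. digest of attached content
        return hashlib.sha1(content).hexdigest()
    if "httpLastModified" in record:          # 2. last-modified timestamp
        return str(record["httpLastModified"])
    if "httpSize" in record:                  # 3. size + mime type
        return str(record["httpSize"]) + record.get("httpMimetype", "")
    return str(uuid.uuid4())                  # 4. random UUID forces an update
```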

=== Web Fetcher Worker ===

* Worker name: <tt>webFetcher</tt>
* Parameters:
** <tt>waitBetweenRequests</tt>: ''(opt., see Web Crawler)''
** <tt>linkErrorHandling</tt>: ''(opt., default "drop")'' specifies how to handle IO errors (e.g. network problems that might resolve after a while) when trying to access the URL in a link record. It is similar to the Web Crawler parameter with the same name. Possible values are:
*** "drop": The record will be ignored and not added to the "fetchedLinks" output, but other links in the same bulk can be processed successfully.
*** "retry": The current task is finished with a recoverable error so that it might be retried later (depending on job/task manager settings). However, if the task cannot be finished successfully within the configured number of retries, it will finally fail and all other links contained in the same bulk will not be processed either.
{{Tip|This parameter does not affect the handling of errors reported by the accessed webserver like "404 Not Found", "400 Bad Request", "500 Internal Server Error" etc. Records causing such errors will always be written to the output unchanged.}}
** <tt>filters</tt>:
*** <tt>followRedirects</tt>: ''(opt., see Web Crawler)''
*** <tt>maxRedirects</tt>: ''(opt., see Web Crawler)''
*** <tt>urlPatterns</tt>: ''(opt., see Web Crawler)'' applied to the resulting URL of a redirect
**** <tt>include</tt>: ''(opt., see Web Crawler)''
**** <tt>exclude</tt>: ''(opt., see Web Crawler)''
** <tt>mapping</tt>: ''(req., see Web Crawler)''
*** <tt>httpUrl</tt>: ''(req.)'' name of the attribute that contains the URL from which to fetch the content
*** <tt>httpContent</tt>: ''(req.)'' attachment name the file content is written to
*** <tt>httpMimetype</tt>: ''(opt., see Web Crawler)''
*** <tt>httpCharset</tt>: ''(opt., see Web Crawler)''
*** <tt>httpContenttype</tt>: ''(opt., see Web Crawler)''
*** <tt>httpLastModified</tt>: ''(opt., see Web Crawler)''
*** <tt>httpSize</tt>: ''(opt., see Web Crawler)''
* Input slots:
** <tt>linksToFetch</tt>: Records describing crawled resources, with or without the content of the resource.
* Output slots:
** <tt>fetchedLinks</tt>: The incoming records with the content of the resource attached.

The fetcher tries to get the content of the web resource identified by attribute <tt>httpUrl</tt>, if attachment <tt>httpContent</tt> is not yet set. Like the DefaultFetcher above, it does not support authentication to read the resource.
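The fetch-only-if-needed decision can be sketched as follows (illustrative Python; <tt>http_get</tt> is a hypothetical callable standing in for the actual HTTP request):

```python
# Sketch of the WebFetcher decision: fetch the content only if the
# httpContent attachment is still missing on the incoming record.
def fetch_if_needed(record, http_get):
    if record.get("httpContent") is None:
        record["httpContent"] = http_get(record["httpUrl"])
    return record
```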

=== Web Extractor Worker ===

* Worker name: <tt>webExtractor</tt>
* Parameters:
** <tt>filters</tt>: ''(opt., see Web Crawler)''
*** <tt>followRedirects</tt>: ''(opt., see Web Crawler)''
*** <tt>maxRedirects</tt>: ''(opt., see Web Crawler)''
*** <tt>urlPatterns</tt>: ''(opt., see Web Crawler)''
**** <tt>include</tt>: ''(opt., see Web Crawler)''
**** <tt>exclude</tt>: ''(opt., see Web Crawler)''
** <tt>mapping</tt>: ''(req., see Web Crawler)''
*** <tt>httpUrl</tt>: ''(req., see Web Crawler)'' URLs of compound elements have the compound link as prefix, e.g. <tt>http://example.com/compound.zip/compound-element.txt</tt>
*** <tt>httpMimetype</tt>: ''(req., see Web Crawler)''
*** <tt>httpCharset</tt>: ''(opt., see Web Crawler)''
*** <tt>httpContenttype</tt>: ''(opt., see Web Crawler)''
*** <tt>httpLastModified</tt>: ''(opt., see Web Crawler)''
*** <tt>httpSize</tt>: ''(opt., see Web Crawler)''
*** <tt>httpContent</tt>: ''(opt., see Web Crawler)''
* Input slots:
** <tt>compounds</tt>
* Output slots:
** <tt>files</tt>

For each input record, an input stream to the described web resource is created and fed into the CompoundExtractor service. The produced records are converted to look like records produced by the file crawler. Additional internal attributes that are set:

* <tt>_deltaHash</tt>: computed as in the WebCrawler worker
* <tt>_compoundRecordId</tt>: record ID of the top-level compound this element was extracted from
* <tt>_isCompound</tt>: set to <tt>true</tt> for elements that are compounds themselves.
* <tt>_compoundPath</tt>: sequence of <tt>httpUrl</tt> attribute values of the compound objects needed to navigate to the compound element.
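For illustration, the prefixed element URL and the <tt>_compoundPath</tt> attribute might be derived like this (hypothetical Python sketch; the function name and the exact shape of <tt>_compoundPath</tt> for nested compounds are assumptions based on the attribute description above):

```python
# Hypothetical sketch: build the httpUrl of a compound element by prefixing
# it with the compound URLs, and record those URLs as _compoundPath.
def element_record(compound_urls, element_path):
    # compound_urls: httpUrl values of the compounds, outermost first
    url = compound_urls[-1] + "/" + element_path
    return {"httpUrl": url, "_compoundPath": list(compound_urls)}
```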

The crawler attributes httpContenttype, httpMimetype and httpCharset are currently not set by the WebExtractor worker.

If the element is not a compound itself, its content is added as attachment httpContent.

=== Sample web crawl job ===

Job definition for crawling from the start URL "http://wiki.eclipse.org/SMILA", pushing the imported records to the job "indexUpdateJob". An include pattern is defined to make sure that only URLs from "below" the start URL are crawled.

<source lang="javascript">
{
  "name":"crawlWebJob",
  "workflow":"webCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"web",
    "startUrl":"http://wiki.eclipse.org/SMILA",
    "jobToPushTo":"indexUpdateJob",
    "waitBetweenRequests": 100,
    "mapping":{
      "httpContent":"Content",
      "httpUrl":"Path"
    },
    "filters":{
      "urlPatterns":{
        "include":["http://wiki\\.eclipse\\.org/SMILA/.*"]
      }
    }
  }
}
</source>
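The <tt>urlPatterns</tt> filters use Java regular expressions that must match the whole URL. The evaluation logic can be sketched like this (illustrative Python; <tt>re.fullmatch</tt> approximates Java's <tt>matches()</tt>, and the function name is an assumption):

```python
import re

# Sketch of include/exclude urlPatterns evaluation: a URL passes if it
# matches at least one include pattern (when any are given) and no
# exclude pattern.
def url_passes(url, include=None, exclude=None):
    if include and not any(re.fullmatch(p, url) for p in include):
        return False
    if exclude and any(re.fullmatch(p, url) for p in exclude):
        return False
    return True
```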