SMILA/Documentation/Web Crawler

Revision as of 07:20, 19 March 2009

What does Web Crawler do

The Web Crawler collects data from the internet. Starting from an initial URL, it recursively follows links and crawls all linked web pages. Because web pages vary widely in structure and link heavily to other pages, the crawler's configuration lets you limit the downloaded data to match your needs.
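The recursive traversal described above can be pictured with a minimal sketch. It is not SMILA's implementation: the function `crawl`, the in-memory `links` dictionary standing in for real web pages, and the `max_depth` parameter are all hypothetical names chosen for illustration.

```python
from collections import deque


def crawl(start, links, max_depth):
    """Breadth-first walk over a link graph, at most max_depth links from start.

    `links` maps a URL to the URLs it links to (a stand-in for fetching
    and parsing real pages). Returns the URLs in the order they were found.
    """
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            # Depth limit reached: do not follow links from this page.
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append((target, depth + 1))
    return visited
```

Without a limit such as `max_depth`, a traversal like this would keep following links until no new pages are found, which is why the configuration below offers several ways to restrict a crawl.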

Crawling configuration

Defining Schema: org.eclipse.smila.connectivity.framework.crawler.web/schemas/WebIndexOrder.xsd

Crawling configuration explanation

The root element of the crawling configuration is IndexOrderConfiguration. It contains the following sub-elements:

  • DataSourceID – the identification of a data source.
  • SchemaID – specifies the schema for a crawler job.
  • DataConnectionID – describes which agent or crawler should be used.
    • Crawler – implementation class of a Crawler.
    • Agent – implementation class of an Agent.
  • CompoundHandling – specifies whether packed data (such as a zip file containing other files) should be unpacked and the files within crawled (YES or NO).
  • Attributes – lists all attributes that describe a website.
    • FieldAttribute (URL, Title, Content):
      • Type (required) – the data type (String, Integer or Date).
      • Name (required) – the attribute's name.
      • HashAttribute – specifies whether a hash should be created (true or false).
      • KeyAttribute – creates a key for this object, for example for the record ID (true or false).
      • Attachment – specifies whether the attribute returns its data as an attachment of the record.
    • MetaAttribute (MetaData, ResponseHeader, MetaDataWithResponseHeaderFallBack, MimeType):
      • Type (required) – the data type (String)
      • Name (required) – the attribute's name
      • Attachment - specifies whether the attribute returns its data as an attachment of the record.
        • ReturnType – the structure in which the metadata is returned:
          • MetaDataString – default structure, metadata is returned as a single string, for example:
<A n="ResponseHeader">
  <L>
    <V>Content-type: text/html</V>
  </L>
  ...
</A>
          • MetaDataValue – only the values of the metadata are returned, for example:
<A n="ResponseHeader">
  <L>
    <V>text/html</V>
  </L>
</A>
          • MetaDataMObject – metadata is returned as an MObject containing attributes with the metadata names and values, for example:
<A n="ResponseHeader">
  <O>
    <A n="Content-Type">
      <L>
        <V>text/html</V>
      </L>
    </A>
    ...
  </O>
</A>
  • Process – this element is responsible for selecting data
    • Website - contains all important information for accessing and crawling a website.
      • ProjectName - defines the project name
      • Sitemaps - support for Google site maps. The sitemap.xml, sitemap.xml.gz and sitemap.gz formats are supported. Links extracted from <loc> tags are added to the links of the current level. The crawler looks for the sitemap file in the root directory of the web server and then caches it for that particular host.
      • Header - request headers, each in the format "<header_name>:<header_content>", separated by semicolons.
      • Referer - includes a "Referer: URL" header in the HTTP request.
      • EnableCookies - enables or disables cookies for the crawling process (true or false)
      • UserAgent - element used to identify the crawler to the server as the specific user agent originating the request. The generated UserAgent string looks like the following: Name/Version (Description, Url, Email)
        • Name (required)
        • Version
        • Description
        • URL
        • Email
      • Robotstxt - element used to support robots.txt information. The Robots Exclusion Standard tells a crawler how to crawl a website, or rather which resources should not be crawled. See [1]
        • Policy: the following policies are offered for dealing with robots.txt rules:
          1. Classic. Simply obey the robots.txt rules. Recommended unless you have special permission to collect a site more aggressively.
          2. Ignore. Completely ignore robots.txt rules.
          3. Custom. Obey your own, custom robots.txt instead of the one discovered on the relevant site. In this case the Value attribute must contain the path to a locally available robots.txt file.
          4. Set. Follow only the rules that apply to the given set of robot names. In this case the Value attribute must contain the robot names, separated by semicolons.
        • Value: specifies the filename with the robots.txt rules for the Custom policy, or the set of agent names for the Set policy.
        • AgentNames: specifies the list of agents we advertise. This list should start with the same name as the UserAgent Name (that is, the crawler user-agent name used for the crawl job).
      • CrawlingModel: there are two models available:
        • Type: the model type (MaxBreadth or MaxDepth)
          1. MaxBreadth: crawls a web site following only a limited number of links.
          2. MaxDepth: crawls a web site up to a specified maximum crawling depth.
        • Value: the parameter for the chosen model (Integer)
      • CrawlScope: decides for each discovered URI whether it is within the scope of the current crawl.
        • Type: the following scopes are provided:
          1. Broad: accept all. This scope does not impose any limits on the hosts, domains, or URI paths crawled.
          2. Domain: accept if on the same 'domain' as the seeds (start URLs). This scope limits discovered URIs to the set of domains defined by the provided seeds. That is, any discovered URI belonging to a domain from which one of the seeds came is within scope. Using the seed 'brox.de', a domain scope will fetch 'bugs.brox.de', 'confluence.brox.de', etc. It will fetch all discovered URIs from 'brox.de' and from any subdomain of 'brox.de'.
          3. Host: accept if on exactly the same host as the seeds. This scope limits discovered URIs to the set of hosts defined by the provided seeds. If the seed is 'www.brox.de', then only items discovered on this host are fetched. The crawler will not go to 'bugs.brox.de'.
          4. Path: accept if on the same host as a seed and sharing its path prefix. This scope goes further still and limits the discovered URIs to a section of paths on the hosts defined by the seeds. Any host that has a seed pointing at its root (e.g. www.sample.com/index.html) will be included in full, whereas a host whose only seed is www.sample2.com/path/index.html will be limited to URIs under /path/.
        • Filters: every scope can have additional filters that select which URIs are considered to be within or out of scope (see the section Filters for details)
      • CrawlLimits: In addition to the limits imposed by the scope of the crawl, it is possible to enforce arbitrary limits on the duration and extent of the crawling process with the following settings:
        • SizeLimits:
          • MaxBytesDownload: stop after a fixed number of bytes have been downloaded (0 means unlimited).
          • MaxDocumentDownload: stop after downloading a fixed number of documents (0 means unlimited).
          • MaxTimeSec: stop after a certain number of seconds have elapsed (0 means unlimited).

These are not hard limits. Once one of them is reached, it triggers a graceful termination of the crawl job, which means that URIs already being crawled will be completed. As a result, the configured limit will be exceeded by some amount.

          • MaxLengthBytes: maximum number of bytes to download per document. The file is truncated once this limit is reached.
        • TimeoutLimits: Whenever the crawler connects to or reads from a remote host, it checks the timeouts and aborts the operation if any of them is exceeded. This prevents anomalous occurrences such as hanging reads or infinite connects.
          • Timeout: the total time allowed to connect to a website and download it. It therefore represents the sum of a ConnectTimeout and a ReadTimeout.
          • ConnectTimeout: Connect timeout in seconds. TCP connections that take longer to establish will be aborted.
          • ReadTimeout: Read (and write) timeout in seconds. Reads that take longer will fail. The default value for the read timeout is 900 seconds.
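
The CrawlScope types described above (Broad, Domain, Host and Path) can be illustrated with a small sketch. This is not the crawler's actual code: the function `in_scope` and its parameters are hypothetical, seeds are assumed to be full URLs including the scheme, and the Domain check is a plain suffix match on the seed's host name, which is a simplification.

```python
from urllib.parse import urlsplit


def in_scope(uri, seeds, scope):
    """Decide whether `uri` is within `scope` relative to the seed URLs."""
    if scope not in ("Broad", "Domain", "Host", "Path"):
        raise ValueError("unknown scope: " + scope)
    if scope == "Broad":
        return True  # no restriction on hosts, domains or paths

    target = urlsplit(uri)
    host = (target.hostname or "").lower()
    path = target.path or "/"

    for seed in seeds:
        parts = urlsplit(seed)
        seed_host = (parts.hostname or "").lower()
        if scope == "Domain":
            # The seed host itself or any subdomain of it.
            if host == seed_host or host.endswith("." + seed_host):
                return True
        elif scope == "Host":
            if host == seed_host:
                return True
        else:  # Path
            # Directory part of the seed path, e.g. /path/index.html -> /path/
            prefix = parts.path.rsplit("/", 1)[0] + "/"
            if host == seed_host and path.startswith(prefix):
                return True
    return False
```

For example, with the seed `http://www.sample2.com/path/index.html` and the Path scope, `http://www.sample2.com/path/a.html` is accepted while `http://www.sample2.com/other/a.html` is rejected. Filters, which the configuration applies on top of the scope, are left out of this sketch.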

See also