Difference between revisions of "SMILA/Documentation/Web Crawler"

Revision as of 04:58, 15 August 2008

XML Index Order

The following is an example of a Web Crawler index order:

<?xml version="1.0" encoding="UTF-8"?>
<IndexOrderConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <DataSourceID>Web_TEST</DataSourceID>
  <SchemaID>org.eclipse.eilf.connectivity.framework.crawler.web</SchemaID>
  <DataConnectionID>
    <Crawler>MyWebCrawler</Crawler>
  </DataConnectionID>
  <CompoundHandling>No</CompoundHandling>
  <Attributes>
    <Attribute Type="String" Name="Url" KeyAttribute="true">
      <FieldAttribute>Url</FieldAttribute>
    </Attribute>
    <Attribute Type="String" Name="Title">
      <FieldAttribute>Title</FieldAttribute>
    </Attribute>
    <Attribute Type="String" Name="Content" HashAttribute="true" Attachment="true" MimeTypeAttribute="Content">
      <FieldAttribute>Content</FieldAttribute>
    </Attribute>
    <Attribute Type="String" Name="MetaData" Attachment="false">
      <MetaAttribute Type="MetaData"/>
    </Attribute>
    <Attribute Type="String" Name="ResponseHeader" Attachment="false">
      <MetaAttribute Type="ResponseHeader">
        <MetaName>Date</MetaName>
        <MetaName>Server</MetaName>
      </MetaAttribute>
    </Attribute>
    <Attribute Type="String" Name="MetaDataWithResponseHeaderFallBack" Attachment="false">
      <MetaAttribute Type="MetaDataWithResponseHeaderFallBack"/>
    </Attribute>
  </Attributes>
  <Process>
    <WebSite ProjectName="Example Crawler Configuration" Header="Accept-Encoding: gzip,deflate; Via: myProxy" Referer="http://myReferer">
      <UserAgent Name="Crawler" Version="1.0" Description="Test crawler" Url="http://www.softaria.com" Email="crawler@example.com"/>
      <CrawlingModel Type="MaxIterations" Value="20"/>
      <CrawlScope Type="Broad">
        <Filters>
          <Filter Type="BeginningPath" WorkType="Select" Value="/test.html"/>
        </Filters>
      </CrawlScope>
      <CrawlLimits>
        <SizeLimits MaxBytesDownload="0" MaxDocumentDownload="10" MaxTimeSec="3600" MaxLengthBytes="1000000" />
        <TimeoutLimits Timeout="10000" />
        <WaitLimits Wait="0" RandomWait="false" MaxRetries="8" WaitRetry="0"/>
      </CrawlLimits>
      <Seeds FollowLinks="NoFollow">
        <Seed>http://www.brox.de</Seed>
      </Seeds>
      <Filters>
        <Filter Type="BeginningPath" WorkType="Unselect" Value="/something/">
          <Refinements>
            <TimeOfDay From="09:00:00" To="23:00:00"/>
            <Port Number="80"/>
          </Refinements>
        </Filter>
        <Filter Type="RegExp" WorkType="Unselect" Value="news"/>
        <Filter Type="ContentType" WorkType="Unselect" Value="image/jpeg"/>
      </Filters>
      <MetaTagFilters>
        <MetaTagFilter Type="Name" Name="author" Content="Blocked Author" WorkType="Unselect"/>
      </MetaTagFilters>
    </WebSite>
  </Process>
</IndexOrderConfiguration>

XSD Schema used for Web Crawler

<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid XML Studio 1.0.8.0 (http://www.liquid-technologies.com) -->
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:redefine schemaLocation="RootIndexOrderConfiguration.xsd">
    <xs:complexType name="Attribute">
      <xs:annotation>
        <xs:documentation>Attribute Specification</xs:documentation>
      </xs:annotation>
      <xs:complexContent mixed="false">
        <xs:extension base="Attribute">
          <xs:choice>
            <xs:element name="FieldAttribute" type="FieldAttributeType" />
            <xs:element name="MetaAttribute" type="MetaAttributeType" />
          </xs:choice>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
    <xs:complexType name="Process">
      <xs:annotation>
        <xs:documentation>Process Specification</xs:documentation>
      </xs:annotation>
      <xs:complexContent mixed="false">
        <xs:extension base="Process">
          <xs:sequence>
            <xs:element minOccurs="0" maxOccurs="unbounded" name="WebSite" type="WebSite" />
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
  <xs:simpleType name="CrawlScope">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Broad" />
      <xs:enumeration value="Domain" />
      <xs:enumeration value="Host" />
      <xs:enumeration value="Path" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="FollowLinksType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Follow" />
      <xs:enumeration value="NoFollow" />
      <xs:enumeration value="FollowLinksWithCorrespondingSelectFilter" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="FilterType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="BeginningPath" />
      <xs:enumeration value="RegExp" />
      <xs:enumeration value="ContentType" />
      <xs:enumeration value="CrawlScope" />
      <xs:enumeration value="HtmlMetaTag" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="FilterWorkType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Select" />
      <xs:enumeration value="Unselect" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="ModelType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="MaxIterations" />
      <xs:enumeration value="MaxDepth" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="FieldAttributeType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Url" />
      <xs:enumeration value="Title" />
      <xs:enumeration value="Content" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="MetaType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="MetaData" />
      <xs:enumeration value="ResponseHeader" />
      <xs:enumeration value="MetaDataWithResponseHeaderFallBack" />
    </xs:restriction>
  </xs:simpleType>
  <xs:complexType name="MetaAttributeType">
    <xs:sequence>
      <xs:element name="MetaName" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>      
    </xs:sequence>
    <xs:attribute name="Type" type="MetaType" use="required" />
    <!-- xs:attribute name="MetaName" type="xs:string" use="optional" / -->
  </xs:complexType>
  <xs:simpleType name="Robotstxt">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Classic" />
      <xs:enumeration value="Ignore" />
      <xs:enumeration value="Custom" />
      <xs:enumeration value="Set" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="HttpMethod">
    <xs:restriction base="xs:string">
      <xs:enumeration value="GET" />
      <xs:enumeration value="POST" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="HtmlMetaTagType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Name" />
      <xs:enumeration value="HttpEquiv" />
    </xs:restriction>
  </xs:simpleType>
  <xs:complexType name="WebSite">
    <xs:sequence>
      <xs:element minOccurs="0" name="UserAgent">
        <xs:complexType>
          <xs:attribute name="Name" type="xs:string" use="required" />
          <xs:attribute name="Version" type="xs:string" use="optional" />
          <xs:attribute name="Description" type="xs:string" use="optional" />
          <xs:attribute name="Url" type="xs:string" use="optional" />
          <xs:attribute name="Email" type="xs:string" use="optional" />
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="Robotstxt">
        <xs:complexType>
          <xs:attribute default="Classic" name="Policy" type="Robotstxt" use="optional" />
          <xs:attribute default="" name="Value" type="xs:string" use="optional" />
          <xs:attribute default="" name="AgentNames" type="xs:string" use="optional" />
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="CrawlingModel">
        <xs:complexType>
          <xs:attribute name="Type" type="ModelType" use="required" />
          <xs:attribute name="Value" type="xs:positiveInteger" use="required" />
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="CrawlScope">
        <xs:complexType>
          <xs:sequence>
            <xs:element minOccurs="0" name="Filters">
              <xs:complexType>
                <xs:sequence>
                  <xs:element maxOccurs="unbounded" name="Filter">
                    <xs:complexType>
                      <xs:complexContent mixed="false">
                        <xs:extension base="Filter" />
                      </xs:complexContent>
                    </xs:complexType>
                  </xs:element>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
          <xs:attribute default="Host" name="Type" type="CrawlScope" use="optional" />
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="CrawlLimits">
        <xs:complexType>
          <xs:sequence>
            <xs:element minOccurs="0" name="SizeLimits">
              <xs:complexType>
                <xs:attribute default="0" name="MaxBytesDownload" type="xs:integer" use="optional" />
                <xs:attribute default="0" name="MaxDocumentDownload" type="xs:integer" use="optional" />
                <xs:attribute default="0" name="MaxTimeSec" type="xs:integer" use="optional" />
                <xs:attribute default="0" name="MaxLengthBytes" type="xs:integer" use="optional" />
                <xs:attribute default="0" name="LimitRate" type="xs:integer" use="optional" />
              </xs:complexType>
            </xs:element>
            <xs:element minOccurs="0" name="TimeoutLimits">
              <xs:complexType>
                <xs:attribute default="0" name="Timeout" type="xs:integer" use="optional" />
                <xs:attribute default="0" name="DnsTimeout" type="xs:integer" use="optional" />
                <xs:attribute default="0" name="ConnectTimeout" type="xs:integer" use="optional" />
                <xs:attribute default="900" name="ReadTimeout" type="xs:integer" use="optional" />
              </xs:complexType>
            </xs:element>
            <xs:element minOccurs="0" name="WaitLimits">
              <xs:complexType>
                <xs:attribute default="0" name="Wait" type="xs:integer" use="optional" />
                <xs:attribute default="0" name="RandomWait" type="xs:boolean" use="optional" />
                <xs:attribute default="0" name="WaitRetry" type="xs:integer" use="optional" />
                <xs:attribute default="0" name="MaxRetries" type="xs:integer" use="optional" />
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="Proxy">
        <xs:complexType>
          <xs:choice>
            <xs:element name="ProxyServer">
              <xs:complexType>
                <xs:attribute name="Host" type="xs:string" use="required" />
                <xs:attribute name="Port" type="xs:string" use="required" />
                <xs:attribute default="" name="Login" type="xs:string" use="optional" />
                <xs:attribute default="" name="Password" type="xs:string" use="optional" />
              </xs:complexType>
            </xs:element>
            <xs:element name="AutomaticConfiguration">
              <xs:complexType>
                <xs:attribute name="Address" type="xs:string" use="required" />
              </xs:complexType>
            </xs:element>
          </xs:choice>
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="Authentication">
        <xs:complexType>
          <xs:sequence>
            <xs:element minOccurs="0" maxOccurs="unbounded" name="Rfc2617">
              <xs:complexType>
                <xs:attribute name="Host" type="xs:string" use="required" />
                <xs:attribute name="Port" type="xs:string" use="required" />
                <xs:attribute name="Realm" type="xs:string" use="required" />
                <xs:attribute name="Login" type="xs:string" use="required" />
                <xs:attribute name="Password" type="xs:string" use="required" />
              </xs:complexType>
            </xs:element>
            <xs:element minOccurs="0" maxOccurs="unbounded" name="HtmlForm">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="FormElements">
                    <xs:complexType>
                      <xs:sequence>
                        <xs:element maxOccurs="unbounded" name="FormElement">
                          <xs:complexType>
                            <xs:attribute name="Key" type="xs:string" use="required" />
                            <xs:attribute name="Value" type="xs:string" use="required" />
                          </xs:complexType>
                        </xs:element>
                      </xs:sequence>
                    </xs:complexType>
                  </xs:element>
                </xs:sequence>
                <xs:attribute name="CredentialDomain" type="xs:string" use="required" />
                <xs:attribute name="LoginUri" type="xs:string" use="required" />
                <xs:attribute name="HttpMethod" type="HttpMethod" use="required" />
              </xs:complexType>
            </xs:element>
            <xs:element minOccurs="0" maxOccurs="unbounded" name="SslCertificate">
              <xs:complexType>
                <xs:attribute name="ProtocolName" type="xs:string" use="required" />
                <xs:attribute name="Port" type="xs:string" use="required" />
                <xs:attribute name="TruststoreUrl" type="xs:string" use="required" />
                <xs:attribute default="" name="TruststorePassword" type="xs:string" use="optional" />
                <xs:attribute name="KeystoreUrl" type="xs:string" use="required" />
                <xs:attribute default="" name="KeystorePassword" type="xs:string" use="optional" />
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="Ssl">
        <xs:complexType>
          <xs:attribute name="TruststoreUrl" type="xs:string" use="required" />
          <xs:attribute default="" name="TruststorePassword" type="xs:string" use="optional" />
        </xs:complexType>
      </xs:element>
      <xs:element name="Seeds">
        <xs:complexType>
          <xs:sequence>
            <xs:element maxOccurs="unbounded" name="Seed" type="xs:string" />
          </xs:sequence>
          <xs:attribute default="Follow" name="FollowLinks" type="FollowLinksType" use="optional" />
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="Filters">
        <xs:complexType>
          <xs:sequence>
            <xs:element maxOccurs="unbounded" name="Filter">
              <xs:complexType>
                <xs:complexContent mixed="false">
                  <xs:extension base="Filter">
                    <xs:sequence>
                      <xs:element minOccurs="0" name="Refinements">
                        <xs:complexType>
                          <xs:sequence>
                            <xs:element minOccurs="0" name="TimeOfDay">
                              <xs:complexType>
                                <xs:attribute name="From" type="xs:time" use="required" />
                                <xs:attribute name="To" type="xs:time" use="required" />
                              </xs:complexType>
                            </xs:element>
                            <xs:element minOccurs="0" name="Port">
                              <xs:complexType>
                                <xs:attribute name="Number" type="xs:integer" use="required" />
                              </xs:complexType>
                            </xs:element>
                          </xs:sequence>
                        </xs:complexType>
                      </xs:element>
                    </xs:sequence>
                  </xs:extension>
                </xs:complexContent>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element minOccurs="0" name="MetaTagFilters">
        <xs:complexType>
          <xs:sequence>
            <xs:element maxOccurs="unbounded" name="MetaTagFilter">
              <xs:complexType>
                <xs:attribute name="Type" type="HtmlMetaTagType" use="required" />
                <xs:attribute name="Name" type="xs:string" use="required" />
                <xs:attribute name="Content" type="xs:string" use="required" />
                <xs:attribute name="WorkType" type="FilterWorkType" use="required" />
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="ProjectName" type="xs:string" use="required" />
    <xs:attribute default="false" name="Sitemaps" type="xs:boolean" use="optional" />
    <xs:attribute default="" name="Header" type="xs:string" use="optional" />
    <xs:attribute default="" name="Referer" type="xs:string" use="optional" />
    <xs:attribute default="true" name="EnableCookies" type="xs:boolean" use="optional" />
  </xs:complexType>
  <xs:complexType name="Filter">
    <xs:attribute name="WorkType" type="FilterWorkType" use="required" />
    <xs:attribute name="Value" type="xs:string" use="required" />
    <xs:attribute name="Type" type="FilterType" use="required" />
  </xs:complexType>
</xs:schema>

Attribute element

Field Attributes. The FieldAttribute element describes the web page information that should be included in the index. The following options exist:

  1. Url: the URL of the web page.
  2. Title: the title of the web page.
  3. Content: the content of the web page, emitted as byte[].
<Attribute Type="String" Name="Url" KeyAttribute="true">
  <FieldAttribute>Url</FieldAttribute>
</Attribute>
<Attribute Type="String" Name="Title">
  <FieldAttribute>Title</FieldAttribute>
</Attribute>
<Attribute Type="String" Name="Content" HashAttribute="true" Attachment="true" MimeTypeAttribute="Content">
  <FieldAttribute>Content</FieldAttribute>
</Attribute>

Metadata Attributes. The MetaAttribute element describes meta information that should be included in the index, such as HTML meta-tags and HTTP response headers. Its Type attribute selects one of the following:

  1. MetaData
    The MetaData type describes the meta-tag information that should be included in the index. Well-known meta-tags are, for example:
    description
    keywords
    ...
  2. ResponseHeader
    The ResponseHeader type describes the response header information that should be included in the index. Well-known response headers are, for example:
    accept-ranges
    server
    location
    ...
  3. MetaDataWithResponseHeaderFallBack
    The MetaDataWithResponseHeaderFallBack type describes meta-tag or response header information that should be included in the index.
<Attribute Type="String" Name="MetaData" Attachment="false">
  <MetaAttribute Type="MetaData"/>
</Attribute>
<Attribute Type="String" Name="ResponseHeader" Attachment="false">
  <MetaAttribute Type="ResponseHeader">
    <MetaName>Date</MetaName>
    <MetaName>Server</MetaName>
  </MetaAttribute>
</Attribute>
<Attribute Type="String" Name="MetaDataWithResponseHeaderFallBack" Attachment="false">
  <MetaAttribute Type="MetaDataWithResponseHeaderFallBack"/>
</Attribute>
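
The schema allows MetaName child elements for every MetaAttribute type, not only for ResponseHeader. The following is a minimal sketch that assumes MetaName restricts the indexed meta-tags in the same way it restricts the response headers in the example above; this behaviour is not confirmed by the text, and the attribute name "MetaTags" is purely illustrative:

```xml
<!-- Sketch only: assumes MetaName selects individual meta-tags by name -->
<Attribute Type="String" Name="MetaTags" Attachment="false">
  <MetaAttribute Type="MetaData">
    <MetaName>description</MetaName>
    <MetaName>keywords</MetaName>
  </MetaAttribute>
</Attribute>
```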

Process element

The Process element is responsible for selecting data. The schema definition of the Process element and its subelements is shown in the following picture:

WebCrawler-Process.gif

A crawling configuration is defined separately for each website to be crawled. The websites are crawled in the order of their WebSite elements. Of the child elements of WebSite, only Seeds is required to start crawling.
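
As a rough illustration, the smallest WebSite configuration that the schema above should accept contains just the required ProjectName attribute and a Seeds element. The project name is made up for this sketch; the seed URL is taken from the index order example:

```xml
<Process>
  <!-- ProjectName is the only required attribute of WebSite in the schema -->
  <WebSite ProjectName="Minimal Crawler Configuration">
    <!-- Seeds is the only required child element -->
    <Seeds>
      <Seed>http://www.brox.de</Seed>
    </Seeds>
  </WebSite>
</Process>
```

Because FollowLinks is omitted here, the schema default "Follow" applies to the Seeds element.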

WebSite

The WebSite element contains all information needed for accessing and crawling a web site. The available attributes are:

  • ProjectName: defines the project name (required).
  • Sitemaps: enables support for Google sitemaps (default: false). The sitemap.xml, sitemap.xml.gz and sitemap.gz formats are supported. Links extracted from <loc> tags are added to the links of the current level. The crawler looks for the sitemap file in the root directory of the web server and then caches it for the particular host.
  • Header: additional request headers in the format "<header_name>:<header_content>", separated by semicolons.
  • Referer: includes a 'Referer: URL' header in the HTTP request.
  • EnableCookies: enables or disables cookies for the crawling process (default: true).
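
A sketch of a WebSite element that sets all five attributes might look as follows. The Header and Referer values are copied from the index order example above; switching Sitemaps on and EnableCookies off is chosen only to show non-default values:

```xml
<WebSite ProjectName="Example Crawler Configuration"
         Sitemaps="true"
         Header="Accept-Encoding: gzip,deflate; Via: myProxy"
         Referer="http://myReferer"
         EnableCookies="false">
  <!-- Two request headers are passed, separated by a semicolon -->
  <Seeds>
    <Seed>http://www.brox.de</Seed>
  </Seeds>
</WebSite>
```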

UserAgent

The UserAgent element is used to identify the crawler to the server as the specific user agent originating the request.

  • Name: agent name, the only required attribute.
  • Version
  • Description
  • Url
  • Email

The generated User-Agent string has the following form: Name/Version (Description, Url, Email).
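
For illustration, here is the UserAgent element from the index order example, annotated with the string it should produce if the pattern above is applied literally. The exact output of the crawler has not been verified:

```xml
<!-- Expected header, assuming the pattern Name/Version (Description, Url, Email):
     User-Agent: Crawler/1.0 (Test crawler, http://www.softaria.com, crawler@example.com) -->
<UserAgent Name="Crawler"
           Version="1.0"
           Description="Test crawler"
           Url="http://www.softaria.com"
           Email="crawler@example.com"/>
```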

Robotstxt

The Robotstxt element controls how the crawler handles robots.txt information.

  • Policy: four policies are offered for dealing with robots.txt rules (the default is Classic):
 1. Classic
    Simply obey the robots.txt rules. Recommended unless you have special permission to collect a site more aggressively.
 2. Ignore
    Completely ignore robots.txt rules.
 3. Custom
    Obey custom, user-defined robots.txt rules instead of those discovered on the relevant site. In this case the Value attribute must contain the path to the custom robots.txt file.
 4. Set
    Restrict the robot names whose rules are followed to the given set. In this case the Value attribute must contain the robot names, separated by semicolons.
  • Value: specifies the file name of the robots.txt rules for the Custom policy, or the set of agent names for the Set policy.
  • AgentNames: specifies the list of agents we advertise. This list should start with the same name as the UserAgent Name (i.e. the crawler user-agent name that is used for the crawl job).
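
The two policies that depend on the Value attribute could be configured roughly as shown below. The file path and the robot names are hypothetical placeholders, and the two elements are alternatives, since a WebSite may contain at most one Robotstxt element:

```xml
<!-- Alternative 1: Custom policy; Value is the path to a custom robots.txt
     file (the path below is a hypothetical placeholder) -->
<Robotstxt Policy="Custom" Value="config/custom-robots.txt" AgentNames="Crawler"/>

<!-- Alternative 2: Set policy; Value lists the robot names whose rules are
     followed, separated by semicolons (the names are hypothetical) -->
<Robotstxt Policy="Set" Value="Crawler;ExampleBot" AgentNames="Crawler"/>
```

In both cases AgentNames starts with "Crawler" so that it matches the UserAgent Name used in the index order example.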

CrawlingModel

Two crawling models are available:

 1. Max iterations: crawls a web site through a limited number of links.
 2. Max depth: crawls a web site up to a specified maximum crawling depth.

The CrawlingModel element has the following attributes, both of which are required:

  • Type: the model type, "MaxIterations" or "MaxDepth".
  • Value: the model parameter (a positive integer).
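
The index order example uses the MaxIterations model. As a sketch, the MaxDepth model is configured in the same way; the depth of 3 is an arbitrary value chosen for illustration:

```xml
<!-- Limit the crawl to a maximum crawling depth of 3 (illustrative value) -->
<CrawlingModel Type="MaxDepth" Value="3"/>
```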

CrawlScope

A crawl scope decides for each discovered URI whether it is within the scope of the current crawl.

  • Type: the following scopes are provided (the default is Host):
    • Broad: accept all
      This scope does not impose any limits on the hosts, domains, or URI paths crawled.
    • Domain: accept if on the same 'domain' (for some definition) as the seeds
      This scope limits discovered URIs to the set of domains defined by the provided seeds. That is, any discovered URI belonging to a domain from which one of the seeds came is within scope. Using the seed 'brox.de', a domain scope will fetch 'bugs.brox.de', 'confluence.brox.de', etc. It will fetch all discovered URIs from 'brox.de' and from any subdomain of 'brox.de'.
    • Host: accept if on exactly the same host as the seeds
      This scope limits discovered URIs to the set of hosts defined by the provided seeds.
      If the seed is 'www.brox.de', then only items discovered on this host are fetched. The crawler will not go to 'bugs.brox.de'.
    • Path: accept if on the same host and sharing a path prefix with the seeds
      This scope goes further still and limits the discovered URIs to a section of paths on the hosts defined by the seeds. Any host that has a seed pointing at its root (e.g. www.sample.com/index.html) will be included in full, whereas a host whose only seed is www.sample2.com/path/index.html will be limited to URIs under /path/.

Every scope can have additional filters to select URIs that will be considered to be within or out of scope (see the section Filters for details). For example:

<CrawlScope Type="Broad">
  <Filters>
    <Filter Type="BeginningPath" WorkType="Select" Value="/level3.html"/>
  </Filters>
</CrawlScope>
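
To illustrate the Path scope described above, the following sketch combines it with the seed from that description. The http:// scheme and the project name are assumptions added for this example; under this configuration the crawl would be expected to stay below /path/ on www.sample2.com:

```xml
<WebSite ProjectName="Path Scope Example">
  <!-- CrawlScope must precede Seeds in the WebSite element sequence -->
  <CrawlScope Type="Path"/>
  <Seeds>
    <!-- Only URIs under http://www.sample2.com/path/ should be in scope -->
    <Seed>http://www.sample2.com/path/index.html</Seed>
  </Seeds>
</WebSite>
```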
