SMILA/Documentation/Web Crawler


What does the Web Crawler do

The Web Crawler collects data from the internet. Starting from an initial URL, it recursively crawls all linked websites. Because web pages vary widely in structure and link heavily to other pages, the crawler's configuration lets you restrict the downloaded data to match your needs.

Crawling configuration

Defining Schema: org.eclipse.smila.connectivity.framework.crawler.web/schemas/WebIndexOrder.xsd

Crawling configuration explanation

The root element of the crawling configuration is IndexOrderConfiguration; it contains the following sub-elements:

  • DataSourceID – the identification of a data source.
  • SchemaID – specifies the schema for a crawler job.
  • DataConnectionID – describes which agent or crawler should be used.
    • Crawler – implementation class of a Crawler.
    • Agent – implementation class of an Agent.
  • CompoundHandling – specifies whether packed data (e.g. a ZIP archive containing files) should be unpacked and the contained files crawled (YES or NO).
  • Attributes – lists all attributes that describe a website.
    • FieldAttribute (URL, Title, Content):
      • Type (required) – the data type (String, Integer or Date).
      • Name (required) – the attribute's name.
      • HashAttribute – specifies whether a hash should be created (true or false).
      • KeyAttribute – creates a key for this object, for example for the record ID (true or false).
      • Attachment – specifies whether the attribute returns the data as an attachment of the record.
    • MetaAttribute (MetaData, ResponseHeader, MetaDataWithResponseHeaderFallBack, MimeType):
      • Type (required) – the data type (String).
      • Name (required) – the attribute's name.
      • Attachment – specifies whether the attribute returns the data as an attachment of the record.
        • ReturnType – the structure in which the metadata is returned:
          • MetaDataString – default structure; the metadata is returned as a single string, for example:
<A n="ResponseHeaser">
  <L>
    <V>Content-type: text/html</V>
  </L>
  ...
</A>
          • MetaDataValue – only the values of the metadata are returned, for example:
<A n="ResponseHeader">
  <L>
    <V>text/html</V>
  </L>
</A>
          • MetaDataMObject – the metadata is returned as an MObject containing attributes with the metadata names and values, for example:
<A n="ResponseHeader">
  <O>
    <A n="Content-Type">
      <L>
        <V>text/html</V>
      </L>
    </A>
    ...
  </O>
</A>
  • Process – this element is responsible for selecting the data to be crawled (see the example configuration sketch after this list).
    • Website – contains all important information for accessing and crawling a website.
      • ProjectName – defines the project name.
      • Sitemaps – enables support for Google sitemaps. The sitemap.xml, sitemap.xml.gz and sitemap.gz formats are supported. Links extracted from <loc> tags are added to the links of the current level. The crawler looks for the sitemap file in the root directory of the web server and then caches it for the particular host.
      • Header – request headers to send, each in the format "<header_name>:<header_content>" and separated by semicolons.
      • Referer – includes a "Referer: URL" header in the HTTP request.
      • EnableCookies – enables or disables cookies for the crawling process (true or false).
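
Putting these elements together, a complete configuration might look roughly like the sketch below. This is a minimal illustration assembled from the element descriptions above, not a verbatim copy of a shipped configuration: the data source ID, project name, seed comment and header values are placeholders, and details such as whether ProjectName, Header, Referer and EnableCookies appear as XML attributes or as child elements of Website, and the exact form of MetaAttribute, should be verified against WebIndexOrder.xsd.

<IndexOrderConfiguration>
  <DataSourceID>web</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.web</SchemaID>
  <DataConnectionID>
    <Crawler>WebCrawler</Crawler>
  </DataConnectionID>
  <CompoundHandling>No</CompoundHandling>
  <Attributes>
    <!-- the URL serves as the key for the record -->
    <Attribute Type="String" Name="Url" KeyAttribute="true">
      <FieldAttribute>Url</FieldAttribute>
    </Attribute>
    <Attribute Type="String" Name="Title">
      <FieldAttribute>Title</FieldAttribute>
    </Attribute>
    <!-- the page content is hashed and stored as an attachment of the record -->
    <Attribute Type="String" Name="Content" HashAttribute="true" Attachment="true">
      <FieldAttribute>Content</FieldAttribute>
    </Attribute>
    <!-- assumption: response headers returned as a single string (MetaDataString) -->
    <Attribute Type="String" Name="ResponseHeader">
      <MetaAttribute Type="ResponseHeader" ReturnType="MetaDataString"/>
    </Attribute>
  </Attributes>
  <Process>
    <Website ProjectName="ExampleProject"
             Header="Accept: text/html; Accept-Language: en"
             Referer="http://www.example.org"
             EnableCookies="true">
      <!-- seed URLs, filters and other website-specific settings go here -->
    </Website>
  </Process>
</IndexOrderConfiguration>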
