
{{note|This is deprecated for SMILA 1.0, the connectivity framework has been replaced by the new [[SMILA/Documentation#Importing | Importing framework]].}}


== Overview ==

The file system crawler recursively fetches all files from a given directory. Besides providing the content of files, it may also gather any file's metadata from the following list:

* full path
* file name only
* file size
* last modified date
* file content
* file extension

== Crawling configuration ==

The example configuration file is located at <tt>configuration/org.eclipse.smila.connectivity.framework/file.xml</tt>.

Defining schema: <tt>org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd</tt>.

== Crawling configuration explanation ==

See [[SMILA/Documentation/Crawler#Configuration]] for the generic parts of the configuration file.

The root element of the crawling configuration is <tt>DataSourceConnectionConfig</tt>; it contains the following sub-elements:

* <tt>DataSourceID</tt> – the identification of a data source.
* <tt>SchemaID</tt> – specifies the schema for a crawler job.
* <tt>DataConnectionID</tt> – describes which agent or crawler should be used.
** <tt>Crawler</tt> – implementation class of a Crawler.
** <tt>Agent</tt> – implementation class of an Agent.
* <tt>CompoundHandling</tt> – specifies whether packed data (like a ZIP containing files) should be unpacked and the files within crawled (<tt>YES</tt> or <tt>NO</tt>).
* <tt>Attributes</tt> – lists all attributes which describe a file.
** <tt>Attribute</tt>
*** attributes:
**** <tt>Type</tt> (required) – the data type (String, Integer or Date).
**** <tt>Name</tt> (required) – the attribute's name.
**** <tt>HashAttribute</tt> – specifies whether the attribute is included in the hash used for delta indexing (''true'' or ''false''). Must be ''true'' for at least one attribute, and that attribute must always have a value. Usually the attribute containing the ''LastModifiedDate'' is a good candidate to set this to ''true'' for.
**** <tt>KeyAttribute</tt> – specifies whether the attribute is used for creating the record ID (''true'' or ''false''). Must be ''true'' for at least one attribute. Together, the key attributes must identify the file uniquely, so usually you will set it to ''true'' for the attribute containing the ''Path'' file attribute.
**** <tt>Attachment</tt> – specifies whether the attribute's data is returned as an attachment of the record.
*** sub-elements:
**** <tt>FileAttributes</tt> – specifies the file property to write into the target attribute. The content of the element must be one of:
***** ''Name'': name of the file, without the directory path.
***** ''Path'': complete path including the file name.
***** ''Size'': size in bytes.
***** ''LastModifiedDate'': date of last modification.
***** ''Content'': content of the file. It is kept as unconverted binary data if written to an attachment; otherwise the crawler tries to detect the encoding and converts the content to a string (falling back to UTF-8 or the operating system's default encoding). See the sketch after this list.
***** ''FileExtension'': the part of the filename after the last "." character (without the dot), or an empty string if the filename does not contain a dot.
* <tt>Process</tt> – contains parameters for gathering data.
** <tt>BaseDir</tt> – the directory where the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a <tt>CrawlerCriticalException</tt> is thrown).
** <tt>Filter</tt> – selects file types and the crawling mode.
*** <tt>Recursive</tt> – ''true'' or ''false''.
*** <tt>CaseSensitive</tt> – ''true'' or ''false''.
*** <tt>Include</tt> – files to crawl.
**** <tt>Name</tt> – a pattern, e.g. "*.txt" (crawl all text files). Everything that is not included is excluded automatically. You can use a star (*) as a wildcard.
*** <tt>Exclude</tt> – files to leave out while crawling.
**** <tt>Name</tt> – a pattern, e.g. "*test*" (leave out all files that have "test" in the filename).
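
A minimal sketch of the ''Content''/''Attachment'' interplay described above, assuming the configuration schema from this page (the attribute name ''ContentAsText'' is made up for illustration):

<source lang="xml">
<!-- Stored as a raw binary attachment of the record -->
<Attribute Type="String" Name="Content" Attachment="true">
  <FileAttributes>Content</FileAttributes>
</Attribute>

<!-- Hypothetical variant without Attachment: the crawler tries to detect
     the encoding and stores the content as a string attribute value -->
<Attribute Type="String" Name="ContentAsText">
  <FileAttributes>Content</FileAttributes>
</Attribute>
</source>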

== Crawling configuration example ==

<source lang="xml">
<DataSourceConnectionConfig
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd">
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <DataConnectionID>
    <Crawler>FileSystemCrawlerDS</Crawler>
  </DataConnectionID>
  <CompoundHandling>Yes</CompoundHandling>
  <Attributes>
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
      <FileAttributes>LastModifiedDate</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Filename">
      <FileAttributes>Name</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Path" KeyAttribute="true">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Content" Attachment="true">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Extension">
      <FileAttributes>FileExtension</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Size">
      <FileAttributes>Size</FileAttributes>
    </Attribute>
  </Attributes>
  <Process>
    <BaseDir>c:\data</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.htm"/>
      <Include Name="*.html"/>
      <Include Name="*.xml"/>
    </Filter>
  </Process>
</DataSourceConnectionConfig>
</source>
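
The example above only uses <tt>Include</tt> patterns. A filter that also excludes files, as described in the configuration explanation, could look like this (a sketch; the patterns are examples, not part of the shipped configuration):

<source lang="xml">
<Filter Recursive="true" CaseSensitive="false">
  <!-- crawl all text files ... -->
  <Include Name="*.txt"/>
  <!-- ... but leave out those with "test" in the filename -->
  <Exclude Name="*test*"/>
</Filter>
</source>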

== Output example for default configuration ==

For a text file named <tt>crawler.txt</tt> located in <tt>c:\data</tt>, the crawler will create the following record (note how the record ID is built from the data source ID and the value of the key attribute ''Path''):

<source lang="xml">
<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">file:&lt;Path=c:\data\crawler.txt&gt;</Val>
  <Val key="_source">file</Val>
  <Val key="LastModifiedDate" type="datetime">2009-02-25T17:44:46+0100</Val>
  <Val key="Path">c:\data\crawler.txt</Val>
  <Val key="Filename">crawler.txt</Val>
  <Val key="Extension">txt</Val>
  <Val key="Size" type="long">36</Val>
  <Val key="_HASH_TOKEN">66f373e6f13498a65c7f5f1cf185611e94ab45630c825cc2028dda38e8245c7</Val>
  <Attachment>Content</Attachment>
</Record>
</source>

== Additional performance counters ==

The FileSystemCrawler adds some specific counters to the common counters:

* files: number of files visited
* folders: number of directories visited
* producerExceptions: number of filesystem-related errors

== See also ==
