
Difference between revisions of "SMILA/Documentation/Filesystem Crawler"

 
(45 intermediate revisions by 5 users not shown)
Line 1: Line 1:
== Filesystem Index Order ==
+
{{note|This is deprecated for SMILA 1.0. The connectivity framework is still functional but is intended to be replaced by scalable import based on SMILA's job management.}}
Following is an example of a Filesystem Index Order:
+
 
 +
== Overview ==
 +
 
 +
The file system crawler recursively fetches all files from a given directory. Besides providing the content of files, it may also gather any file's metadata from the following list:
 +
 
 +
* full path
 +
* file name only
 +
* file size
 +
* last modified date
 +
* file content
 +
* file extension
 +
 
 +
== Crawling configuration ==
 +
 
 +
The example configuration file is located at <tt>configuration/org.eclipse.smila.connectivity.framework/file.xml</tt>.
 +
 
 +
Defining Schema: <tt>org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd</tt>.
 +
 
 +
== Crawling configuration explanation ==
 +
 
 +
See [[SMILA/Documentation/Crawler#Configuration]] for the generic parts of the configuration file.
 +
 
 +
The root element of the crawling configuration is <tt>DataSourceConnectionConfig</tt>. It contains the following sub elements:
 +
 
 +
* <tt>DataSourceID</tt> – the identification of a data source
 +
* <tt>SchemaID</tt> – specifies the schema for a crawler job
 +
* <tt>DataConnectionID</tt> – describes which agent or crawler should be used
 +
** <tt>Crawler</tt> – implementation class of a Crawler
 +
** <tt>Agent</tt> – implementation class of an Agent
 +
* <tt>CompoundHandling</tt> – specifies whether packed data (such as a ZIP archive containing files) should be unpacked and the files within crawled (YES or NO).
 +
* <tt>Attributes</tt> – lists all attributes that describe a file.
 +
** <tt>Attribute</tt>
 +
*** attributes:
 +
**** <tt>Type</tt> (required) – the data type (String, Integer or Date).
 +
**** <tt>Name</tt> (required) – the attribute's name.
 +
**** <tt>HashAttribute</tt> – specifies whether the attribute is included in the hash used for delta indexing (''true'' or ''false''). Must be ''true'' for at least one attribute that always has a value. The attribute containing the ''LastModifiedDate'' is usually a good candidate.
 +
**** <tt>KeyAttribute</tt> – specifies whether the attribute is used for creating the record ID (''true'' or ''false''). Must be ''true'' for at least one attribute. Together, the key attributes must identify the file uniquely, so usually you will set it to ''true'' for the attribute containing the ''Path'' file attribute.
 +
**** <tt>Attachment</tt> – specifies whether the attribute's data is returned as an attachment of the record.
 +
*** sub elements:
 +
**** <tt>FileAttributes</tt> - specifies the file attribute to write into the target attribute. The content of the element must be one of:
 +
***** ''Name'': name of file, without the directory path
 +
***** ''Path'': complete path including file name.
 +
***** ''Size'': size in bytes.
 +
***** ''LastModifiedDate'': Date of last modification
 +
***** ''Content'': Content of file. Unconverted binary if written to an attachment. Else the crawler tries to detect the encoding and converts the content to a string (with fallbacks to UTF-8 or default encoding of the operating system).
 +
***** ''FileExtension'': The part of the filename after the last "." character (without the dot). An empty string if the filename does not contain a dot.
 +
* <tt>Process</tt> – contains parameters for gathering data.
 +
** <tt>BaseDir</tt> – the directory where the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a CrawlerCriticalException is thrown).
 +
** <tt>Filter</tt> – selects the file types and the crawling mode.
 +
*** <tt>Recursive</tt> – attribute; true or false.
 +
*** <tt>CaseSensitive</tt> – attribute; true or false.
 +
*** <tt>Include</tt> – files to crawl.
 +
**** <tt>Name</tt> – string, e.g. <tt>"*.txt"</tt> (crawl all text files). Everything that is not included is excluded automatically. A star (*) can be used as a wildcard.
 +
*** <tt>Exclude</tt> – files to leave out while crawling.
 +
**** <tt>Name</tt> – string, e.g. <tt>"*test*"</tt> (leave out all files that have <tt>test</tt> in the file name).
 +
 
 +
== Crawling configuration example ==
 +
 
 
<source lang="xml">
 
<source lang="xml">
<IndexOrderConfiguration
+
<DataSourceConnectionConfig
 
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/filesystemIndexOrder.xsd"
+
   xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd">
>
+
 
   <DataSourceID>file</DataSourceID>
 
   <DataSourceID>file</DataSourceID>
 
   <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
 
   <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
Line 25: Line 81:
 
       <FileAttributes>Content</FileAttributes>
 
       <FileAttributes>Content</FileAttributes>
 
     </Attribute>
 
     </Attribute>
     <Attribute Type="String" Name="Extension">
+
     <Attribute Type="String" Name="Extension">
 
       <FileAttributes>FileExtension</FileAttributes>
 
       <FileAttributes>FileExtension</FileAttributes>
 
     </Attribute>
 
     </Attribute>
Line 31: Line 87:
 
       <FileAttributes>Size</FileAttributes>
 
       <FileAttributes>Size</FileAttributes>
 
     </Attribute>     
 
     </Attribute>     
    <Attribute Type="String" Name="AccessTreeNotExpanded">
 
      <AccessTree ExpandAccounts="false"/>
 
    </Attribute>
 
    <Attribute Type="String" Name="AccessTreeExpanded">
 
      <AccessTree ExpandAccounts="true"/>
 
    </Attribute>
 
    <Attribute Type="String" Name="AccessListNotExpanded">
 
      <AccessList ExpandAccounts="false" Mask=" W "/>
 
    </Attribute>
 
    <Attribute Type="String" Name="AccessListExpanded">
 
      <AccessList ExpandAccounts="true" Mask=" W "/>
 
    </Attribute>
 
 
   </Attributes>
 
   </Attributes>
 
   <Process>
 
   <Process>
Line 50: Line 94:
 
       <Include Name="*.htm"/>
 
       <Include Name="*.htm"/>
 
       <Include Name="*.html"/>
 
       <Include Name="*.html"/>
       <Include Name="*.xml"/>    
+
       <Include Name="*.xml"/>    
 
     </Filter>
 
     </Filter>
 
   </Process>
 
   </Process>
</IndexOrderConfiguration>
+
</DataSourceConnectionConfig>
 
</source>
 
</source>
  
== XSD Schema used for Filesystem Crawler ==
+
== Output example for default configuration ==  
<source lang="xml">
+
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
+
  <xs:redefine schemaLocation="../../org.eclipse.smila.connectivity.framework.indexorder/schemas/RootIndexOrderConfiguration.xsd">
+
    <xs:complexType name="Process">
+
      <xs:annotation>
+
        <xs:documentation>Process Specification</xs:documentation>
+
      </xs:annotation>
+
      <xs:complexContent>
+
        <xs:extension base="Process">
+
          <xs:sequence maxOccurs="unbounded">
+
            <xs:element name="BaseDir" type="xs:string"/>
+
            <xs:element name="Filter">
+
              <xs:complexType>
+
                <xs:sequence>
+
                  <xs:element name="Include" minOccurs="0" maxOccurs="unbounded">
+
                    <xs:complexType>
+
                      <xs:attribute name="Name" type="xs:string" use="required"/>
+
                      <xs:attribute name="DateFrom" type="xs:dateTime" use="optional"/>
+
                      <xs:attribute name="DateTo" type="xs:dateTime" use="optional"/>
+
                    </xs:complexType>
+
                  </xs:element>
+
                  <xs:element name="Exclude" minOccurs="0" maxOccurs="unbounded">
+
                    <xs:complexType>
+
                      <xs:attribute name="Name" type="xs:string" use="required"/>
+
                    </xs:complexType>
+
                  </xs:element>
+
                </xs:sequence>
+
                <xs:attribute name="CaseSensitive" type="xs:boolean" use="optional" default="false"/>
+
                <xs:attribute name="Recursive" type="xs:boolean" use="optional" default="true"/>
+
              </xs:complexType>
+
            </xs:element>
+
          </xs:sequence>
+
        </xs:extension>
+
      </xs:complexContent>
+
    </xs:complexType>
+
    <xs:complexType name="Attribute">
+
      <xs:complexContent>
+
        <xs:extension base="Attribute">
+
          <xs:choice>
+
            <xs:element name="FileAttributes" type="FileAttributesType" />
+
            <xs:element name="AccessTree" type="AccessTreeType" />
+
            <xs:element name="AccessList" type="AccessListType" />
+
          </xs:choice>
+
        </xs:extension>
+
      </xs:complexContent>
+
    </xs:complexType>
+
  </xs:redefine>
+
 
+
 
+
  <!-- simple types -->
+
  <xs:simpleType name="FileAttributesType">
+
    <xs:restriction base="xs:string">
+
      <xs:enumeration value="Name"/>
+
      <xs:enumeration value="Path"/>
+
      <xs:enumeration value="Size"/>
+
      <xs:enumeration value="LastModifiedDate"/>
+
      <xs:enumeration value="Content"/>
+
      <xs:enumeration value="FileExtension"/>
+
    </xs:restriction>
+
  </xs:simpleType>
+
  <xs:simpleType name="AuthorityType">
+
    <xs:restriction base="xs:string">
+
      <xs:enumeration value="USERS"/>
+
      <xs:enumeration value="GROUPS"/>
+
    </xs:restriction>
+
  </xs:simpleType>
+
  <xs:simpleType name="MaskType">
+
    <xs:restriction base="xs:string">
+
      <xs:pattern value="(R|\s)(W|\s)(X|\s)" />
+
    </xs:restriction>
+
  </xs:simpleType>
+
  
 +
For a text file named <tt>crawler.txt</tt> located in <tt>c:\data</tt>, the crawler will create the following record:
  
  <!-- complex types -->
 
  <xs:complexType name="AccessTreeType">
 
    <xs:attribute name="ExpandAccounts" type="xs:boolean" use="required"/>
 
  </xs:complexType>
 
 
 
  <xs:complexType name="AccessListType">
 
    <xs:complexContent>
 
      <xs:extension base="AccessTreeType">
 
        <xs:attribute name="Mask" type="MaskType" use="required"/>
 
        <xs:attribute name="AuthorityFilter" type="AuthorityType" use="optional"/>
 
      </xs:extension>
 
    </xs:complexContent>
 
  </xs:complexType>
 
 
</xs:schema>
 
</source>
 
 
== Attribute element ==
 
'''FileAttributes.'''
 
The FileAttributes element specifies which basic file information should be crawled. The following options are available:
 
# Name: the file name.
 
# Path: the file complete path.
 
# FileExtension: the file extension.
 
# Size: the file size.
 
# LastModifiedDate: the file last modification date.
 
# Content: the content of the file is emitted as byte[].
 
 
<source lang="xml">
 
<source lang="xml">
<Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
+
<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
   <FileAttributes>LastModifiedDate</FileAttributes>
+
   <Val key="_recordid">file:&lt;Path=c:\data\crawler.txt&gt;</Val>
</Attribute>
+
  <Val key="_source">file</Val>
<Attribute Type="String" Name="Filename">
+
  <Val key="LastModifiedDate" type="datetime">2009-02-25T17:44:46+0100</Val>
  <FileAttributes>Name</FileAttributes>
+
  <Val key="Path">c:\data\crawler.txt</Val>
</Attribute>
+
   <Val key="Filename">crawler.txt</Val>
<Attribute Type="String" Name="Path" KeyAttribute="true">
+
  <Val key="Extension">txt</Val>
   <FileAttributes>Path</FileAttributes>
+
  <Val key="Size" type="long">36</Val>
</Attribute>
+
  <Val key="_HASH_TOKEN">66f373e6f13498a65c7f5f1cf185611e94ab45630c825cc2028dda38e8245c7</Val>
<Attribute Type="String" Name="Content" Attachment="true">
+
   <Attachment>Content</Attachment>
  <FileAttributes>Content</FileAttributes>
+
</Record>
</Attribute>
+
<Attribute Type="String" Name="Extension">
+
  <FileAttributes>FileExtension</FileAttributes>
+
</Attribute>
+
<Attribute Type="String" Name="Size">
+
   <FileAttributes>Size</FileAttributes>
+
</Attribute>  
+
 
</source>
 
</source>
  
'''Security information'''
+
== Additional performance counters ==
  
'''AccessTree.'''
+
The FileSystemCrawler adds some specific counters to the common counters:
The AccessTree element is used to extract raw access control list (ACL) information from a file. The security information is separated into access rights MObjects (read/write/execute, allow/deny type) and security account MObjects (SID, domain/computer, authentication name). There is only one boolean parameter to configure: '''ExpandAccounts'''. If it is set to true, security account groups are expanded, that is, their sub-accounts are extracted as well, as sub-MObjects.
+
* files: number of files visited
 +
* folders: number of directories visited
 +
* producerExceptions: number of filesystem related errors
  
For example, assume a file is accessible only by the Administrators group.
+
== See also ==
  
Configuration sample with ExpandAccounts="false"
+
* [[SMILA/Documentation/Crawler|Crawler]]
<source lang="xml">
+
* [[SMILA/Documentation/Web Crawler|Web Crawler]]
<Attribute Type="String" Name="AccessTreeNotExpanded">
+
* [[SMILA/Documentation/JDBC Crawler|JDBC Crawler]]
  <AccessTree ExpandAccounts="false"/>
+
</Attribute>
+
</source>
+
  
Extracted attribute sample with ExpandAccounts="false"
+
__FORCETOC__
<source lang="xml">
+
<A n="AccessTreeNotExpanded">
+
  <O>
+
    <A n="type">
+
      <L>
+
        <V>ALLOW</V>
+
      </L>
+
    </A>
+
    <A n="mask">
+
      <L>
+
        <V>RWX</V>
+
      </L>
+
    </A>
+
    <A n="account">
+
      <O>
+
        <A n="sid">
+
          <L>
+
            <V>S-1-5-32-544</V>
+
          </L>
+
        </A>
+
        <A n="type">
+
          <L>
+
            <V>ALIAS</V>
+
          </L>
+
        </A>
+
        <A n="domain">
+
          <L>
+
            <V>BUILTIN</V>
+
          </L>
+
        </A>
+
        <A n="auth">
+
          <L>
+
            <V>Administrators</V>
+
          </L>
+
        </A>
+
      </O>
+
    </A>
+
  </O>
+
</A>
+
</source>
+
  
The top MObject corresponds to the ACL object.
+
[[Category:SMILA]]
 
+
There are three attributes:
+
'''type''' - ACL rule type, may be ALLOW or DENY
+
'''mask''' - ACL rule mask, R - Read, W - Write , X - eXecute
+
'''account''' - reference to security account MObject
+
 
+
Security account MObject attributes:
+
'''sid'''  - security identifier.
+
'''type''' - account type
+
'''domain''' - account domain/computer name ( 1st level authentication's name )
+
'''auth''' - account name ( 2nd level authentication's name )
+
 
+
 
+
Configuration sample with ExpandAccounts="true"
+
<source lang="xml">
+
<Attribute Type="String" Name="AccessTreeExpanded">
+
  <AccessTree ExpandAccounts="true"/>
+
</Attribute>
+
</source>
+
 
+
Extracted attribute sample with ExpandAccounts="true"
+
<source lang="xml">
+
<A n="AccessTreeExpanded">
+
  <O>
+
    <A n="type">
+
      <L>
+
        <V>ALLOW</V>
+
      </L>
+
    </A>
+
    <A n="mask">
+
      <L>
+
        <V>RWX</V>
+
      </L>
+
    </A>
+
    <A n="account">
+
      <O>
+
        <A n="sid">
+
          <L>
+
            <V>S-1-5-32-544</V>
+
          </L>
+
        </A>
+
        <A n="type">
+
          <L>
+
            <V>ALIAS</V>
+
          </L>
+
        </A>
+
        <A n="domain">
+
          <L>
+
            <V>BUILTIN</V>
+
          </L>
+
        </A>
+
        <A n="auth">
+
          <L>
+
            <V>Administrators</V>
+
          </L>
+
        </A>
+
        <A n="sub">
+
          <O>
+
            <A n="sid">
+
              <L>
+
                <V>S-1-5-21-2105471877-1027867990-1527921536-500</V>
+
              </L>
+
            </A>
+
            <A n="type">
+
              <L>
+
                <V>USER</V>
+
              </L>
+
            </A>
+
            <A n="domain">
+
              <L>
+
                <V>Ivan</V>
+
              </L>
+
            </A>
+
            <A n="auth">
+
              <L>
+
                <V>Administrator</V>
+
              </L>
+
            </A>
+
          </O>
+
          <O>
+
            <A n="sid">
+
              <L>
+
                <V>S-1-5-21-2105471877-1027867990-1527921536-1000</V>
+
              </L>
+
            </A>
+
            <A n="type">
+
              <L>
+
                <V>USER</V>
+
              </L>
+
            </A>
+
            <A n="domain">
+
              <L>
+
                <V>Ivan</V>
+
              </L>
+
            </A>
+
            <A n="auth">
+
              <L>
+
                <V>Ivanhoe</V>
+
              </L>
+
            </A>
+
          </O>
+
        </A>
+
      </O>
+
    </A>
+
  </O>
+
</A>
+
</source>
+
Information about two accounts is extracted for the group Administrators.
+
The group '''Administrators''' account now has an additional attribute
+
'''sub''' that holds the group's sub-accounts.
+
 
+
 
+
'''AccessList.'''
+
AccessList attribute configuration. This attribute is used to extract a ready-made, flat list of the accounts corresponding to the ACL. Additional parameters filter the required accounts.
+
 
+
'''ExpandAccounts''' - whether sub-accounts are processed as well.
+
'''AuthorityFilter''' - optional; GROUPS or USERS restricts the result to groups only or users only.
+
'''Mask''' - rights filter: Read, Write, eXecute (RWX)
+
 
+
For example, assume we want to extract all accounts that are allowed to execute this file.
+
 
+
Sample Configuration for accounts directly linked to file ACL:
+
<source lang="xml">
+
<Attribute Type="String" Name="itsAllowed2Write">
+
  <AccessList ExpandAccounts="false" Mask="  X"/>
+
</Attribute>
+
</source>
+
Sample Result:
+
<source lang="xml">
+
<A n="itsAllowed2Write">
+
  <O>
+
    <A n="sid">
+
      <L>
+
        <V>S-1-5-32-544</V>
+
      </L>
+
    </A>
+
    <A n="type">
+
      <L>
+
        <V>ALIAS</V>
+
      </L>
+
    </A>
+
    <A n="domain">
+
      <L>
+
        <V>BUILTIN</V>
+
      </L>
+
    </A>
+
    <A n="auth">
+
      <L>
+
        <V>Administrators</V>
+
      </L>
+
    </A>
+
  </O>
+
</A>
+
</source>
+
 
+
Sample Configuration for all accounts and sub/accounts
+
<source lang="xml">
+
<Attribute Type="String" Name="itsAllowed2Write_ALL">
+
  <AccessList ExpandAccounts="true" Mask="  X"/>
+
</Attribute>
+
</source>
+
Sample Result:
+
<source lang="xml">
+
<A n="itsAllowed2Write_ALL">
+
  <O>
+
    <A n="sid">
+
      <L>
+
        <V>S-1-5-32-544</V>
+
      </L>
+
    </A>
+
    <A n="type">
+
      <L>
+
        <V>ALIAS</V>
+
      </L>
+
    </A>
+
    <A n="domain">
+
      <L>
+
        <V>BUILTIN</V>
+
      </L>
+
    </A>
+
    <A n="auth">
+
      <L>
+
        <V>Administrators</V>
+
      </L>
+
    </A>
+
  </O>
+
  <O>
+
    <A n="sid">
+
      <L>
+
        <V>S-1-5-21-2105471877-1027867990-1527921536-500</V>
+
      </L>
+
    </A>
+
    <A n="type">
+
      <L>
+
        <V>USER</V>
+
      </L>
+
    </A>
+
    <A n="domain">
+
      <L>
+
        <V>Ivan</V>
+
      </L>
+
    </A>
+
    <A n="auth">
+
      <L>
+
        <V>Administrator</V>
+
      </L>
+
    </A>
+
  </O>
+
  <O>
+
    <A n="sid">
+
      <L>
+
        <V>S-1-5-21-2105471877-1027867990-1527921536-1000</V>
+
      </L>
+
    </A>
+
    <A n="type">
+
      <L>
+
        <V>USER</V>
+
      </L>
+
    </A>
+
    <A n="domain">
+
      <L>
+
        <V>Ivan</V>
+
      </L>
+
    </A>
+
    <A n="auth">
+
      <L>
+
        <V>Ivanhoe</V>
+
      </L>
+
    </A>
+
  </O>
+
</A>
+
</source>
+

Latest revision as of 05:38, 24 January 2012

Note: This is deprecated for SMILA 1.0. The connectivity framework is still functional but is intended to be replaced by scalable import based on SMILA's job management.


Overview

The file system crawler recursively fetches all files from a given directory. Besides providing the content of files, it may also gather any file's metadata from the following list:

  • full path
  • file name only
  • file size
  • last modified date
  • file content
  • file extension

Crawling configuration

The example configuration file is located at configuration/org.eclipse.smila.connectivity.framework/file.xml.

Defining Schema: org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd.

Crawling configuration explanation

See SMILA/Documentation/Crawler#Configuration for the generic parts of the configuration file.

The root element of the crawling configuration is DataSourceConnectionConfig. It contains the following sub elements:

  • DataSourceID – the identification of a data source
  • SchemaID – specifies the schema for a crawler job
  • DataConnectionID – describes which agent or crawler should be used
    • Crawler – implementation class of a Crawler
    • Agent – implementation class of an Agent
  • CompoundHandling – specifies whether packed data (such as a ZIP archive containing files) should be unpacked and the files within crawled (YES or NO).
  • Attributes – lists all attributes that describe a file.
    • Attribute
      • attributes:
        • Type (required) – the data type (String, Integer or Date).
        • Name (required) – the attribute's name.
        • HashAttribute – specifies whether the attribute is included in the hash used for delta indexing (true or false). Must be true for at least one attribute that always has a value. The attribute containing the LastModifiedDate is usually a good candidate.
        • KeyAttribute – specifies whether the attribute is used for creating the record ID (true or false). Must be true for at least one attribute. Together, the key attributes must identify the file uniquely, so usually you will set it to true for the attribute containing the Path file attribute.
        • Attachment – specifies whether the attribute's data is returned as an attachment of the record.
      • sub elements:
        • FileAttributes - specifies the file attribute to write into the target attribute. The content of the element must be one of:
          • Name: name of file, without the directory path
          • Path: complete path including file name.
          • Size: size in bytes.
          • LastModifiedDate: Date of last modification
          • Content: Content of file. Unconverted binary if written to an attachment. Else the crawler tries to detect the encoding and converts the content to a string (with fallbacks to UTF-8 or default encoding of the operating system).
          • FileExtension: The part of the filename after the last "." character (without the dot). An empty string if the filename does not contain a dot.
  • Process – contains parameters for gathering data.
    • BaseDir – the directory where the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a CrawlerCriticalException is thrown).
    • Filter – selects the file types and the crawling mode.
      • Recursive – attribute; true or false.
      • CaseSensitive – attribute; true or false.
      • Include – files to crawl.
        • Name – string, e.g. "*.txt" (crawl all text files). Everything that is not included is excluded automatically. A star (*) can be used as a wildcard.
      • Exclude – files to leave out while crawling.
        • Name – string, e.g. "*test*" (leave out all files that have test in the file name).

Crawling configuration example

<DataSourceConnectionConfig
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd">
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <DataConnectionID>
    <Crawler>FileSystemCrawlerDS</Crawler>
  </DataConnectionID>
  <CompoundHandling>Yes</CompoundHandling>
  <Attributes>
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
      <FileAttributes>LastModifiedDate</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Filename">
      <FileAttributes>Name</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Path" KeyAttribute="true">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Content" Attachment="true">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Extension">
      <FileAttributes>FileExtension</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Size">
      <FileAttributes>Size</FileAttributes>
    </Attribute>    
  </Attributes>
  <Process>
    <BaseDir>c:\data</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.htm"/>
      <Include Name="*.html"/>
      <Include Name="*.xml"/>     
    </Filter>
  </Process>
</DataSourceConnectionConfig>
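
As a variation on the Content attribute in the example above, the following sketch omits Attachment="true". According to the description of Content, the crawler then tries to detect the encoding and writes the content to the record as a string instead of an unconverted binary attachment. The attribute name TextContent is a hypothetical choice; the fragment is meant to be placed inside the Attributes element.

```xml
<!-- Hypothetical variant: without Attachment="true" the file content is
     expected to be converted to a string attribute of the record. -->
<Attribute Type="String" Name="TextContent">
  <FileAttributes>Content</FileAttributes>
</Attribute>
```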

Output example for default configuration

For a text file named crawler.txt located in c:\data, the crawler will create the following record:

<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">file:&lt;Path=c:\data\crawler.txt&gt;</Val>
  <Val key="_source">file</Val>
  <Val key="LastModifiedDate" type="datetime">2009-02-25T17:44:46+0100</Val>
  <Val key="Path">c:\data\crawler.txt</Val>
  <Val key="Filename">crawler.txt</Val>
  <Val key="Extension">txt</Val>
  <Val key="Size" type="long">36</Val>
  <Val key="_HASH_TOKEN">66f373e6f13498a65c7f5f1cf185611e94ab45630c825cc2028dda38e8245c7</Val>
  <Attachment>Content</Attachment>
</Record>

Additional performance counters

The FileSystemCrawler adds some specific counters to the common counters:

  • files: number of files visited
  • folders: number of directories visited
  • producerExceptions: number of filesystem related errors

See also