
Difference between revisions of "SMILA/Documentation/Filesystem Crawler"


Revision as of 02:35, 19 March 2009

What does FileSystemCrawler do

The FileSystemCrawler collects all files and folders recursively, starting from a given directory. Besides the content of the files, it can gather any file meta information from the following list:

  • full path
  • file name only
  • file size
  • last modified date
  • file content
  • file extension

Crawling configuration

The configuration file can be found at configuration/org.eclipse.smila.framework/file. The defining schema is org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/filesystemIndexOrder.xsd.

Crawling configuration explanation

The root element of the crawling configuration is CrawlJob. It contains the following sub-elements:

  • DataSourceID – the identification of a data source
  • SchemaID – specifies the schema for a crawler job
  • DataConnectionID – describes which agent or crawler should be used
    • Crawler – implementation class of a Crawler
    • Agent – implementation class of an Agent
  • CompoundHandling – specifies whether packed data (like a zip containing files) should be unpacked and the files within crawled (YES or NO).
  • Attributes – lists all attributes that describe a file (LastModifiedDate, Filename, Path, Content, Extension, Size).
    • Attribute
      • Type (required) – the data type (String, Integer or Date).
      • Name (required) – the attribute's name.
      • HashAttribute – specifies whether a hash should be created (true or false).
      • KeyAttribute – creates a key for this object, for example for the record ID (true or false).
      • Attachment – specifies whether the attribute returns its data as an attachment of the record.
  • Process – contains parameters for gathering data.
    • BaseDir – the directory in which the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a CrawlerCriticalException is thrown).
    • Filter – selects the file types and the crawling mode.
      • Recursive – (true or false).
      • CaseSensitive – (true or false).
      • Include – files to crawl.
        • Name – String, e.g. "*.txt" (crawl all text files). Everything that is not included is excluded automatically. You can use an asterisk (*) as a wildcard.
      • Exclude – files to leave out while crawling.
        • Name – String, e.g. "*test*" (leave out all files that have test in the filename).
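
As a sketch of how Include and Exclude might be combined, the following Process fragment crawls text and HTML files below a base directory but skips every file with test in its name. The base directory and the patterns are illustrative assumptions. The element and attribute names are the ones described in the list above, and the placement of Exclude inside Filter, next to Include, is assumed rather than confirmed by the schema.

```xml
<Process>
  <!-- Hypothetical start directory; it must exist and be readable -->
  <BaseDir>c:\data</BaseDir>
  <Filter Recursive="true" CaseSensitive="false">
    <!-- Only files matching an Include pattern are crawled -->
    <Include Name="*.txt"/>
    <Include Name="*.html"/>
    <!-- Assumption: Exclude removes matching files from the included set -->
    <Exclude Name="*test*"/>
  </Filter>
</Process>
```

With CaseSensitive set to false, a file such as Test_notes.TXT would presumably be matched by both patterns and therefore left out.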

Crawling configuration example

<CrawlJob
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/filesystemIndexOrder.xsd"
>
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <DataConnectionID>
    <Crawler>FileSystemCrawlerDS</Crawler>
  </DataConnectionID>
  <CompoundHandling>Yes</CompoundHandling>
  <Attributes>
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
      <FileAttributes>LastModifiedDate</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Filename">
      <FileAttributes>Name</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Path" KeyAttribute="true">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Content" Attachment="true">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Extension">
      <FileAttributes>FileExtension</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Size">
      <FileAttributes>Size</FileAttributes>
    </Attribute>    
  </Attributes>
  <Process>
    <BaseDir>c:\data</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.htm"/>
      <Include Name="*.html"/>
      <Include Name="*.xml"/>
      <!--
      <Include Name="*.pdf"/>
      <Include Name="*.doc"/>
      <Include Name="*.xls"/>
      <Include Name="*.ppt"/>
      <Include Name="*.rtf"/>
      -->
    </Filter>
  </Process>
</CrawlJob>
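
For comparison, here is a reduced variant that crawls only the text files directly inside the base directory, without descending into subfolders or unpacking archives. It is a sketch under several assumptions: that the schema allows the unused attributes to be omitted, that No is the accepted spelling of the negative CompoundHandling value (mirroring the Yes used in the example), and that the base directory exists.

```xml
<CrawlJob
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/filesystemIndexOrder.xsd"
>
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <DataConnectionID>
    <Crawler>FileSystemCrawlerDS</Crawler>
  </DataConnectionID>
  <!-- Assumed spelling of the negative value; archives are not unpacked -->
  <CompoundHandling>No</CompoundHandling>
  <Attributes>
    <!-- Hash over the modification date, as in the full example -->
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
      <FileAttributes>LastModifiedDate</FileAttributes>
    </Attribute>
    <!-- The path serves as the record key -->
    <Attribute Type="String" Name="Path" KeyAttribute="true">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <!-- The file content is stored as an attachment -->
    <Attribute Type="String" Name="Content" Attachment="true">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
  </Attributes>
  <Process>
    <!-- Hypothetical start directory -->
    <BaseDir>c:\data</BaseDir>
    <!-- Recursive="false": subfolders are not visited -->
    <Filter Recursive="false" CaseSensitive="false">
      <Include Name="*.txt"/>
    </Filter>
  </Process>
</CrawlJob>
```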