SMILA/Documentation/Filesystem Crawler


What does the FileSystemCrawler do?

The FileSystemCrawler collects all files and folders recursively, starting from a given directory. Besides the content of the files, it can gather any of the following file metadata (illustrated by the Java sketch after this list):

  • full path
  • file name only
  • file size
  • last modified date
  • file content
  • file extension
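
To illustrate what this metadata amounts to, here is a minimal Java sketch that walks a directory tree and prints the items listed above using only the standard java.io API. It is not SMILA's actual implementation; the class and method names (FileMetadataWalker, walk) are made up for this example:

  import java.io.File;
  import java.util.Date;

  public class FileMetadataWalker {

      // Walks a directory tree recursively and prints, for every file,
      // the metadata a filesystem crawler typically gathers.
      public static void walk(File dir) {
          File[] entries = dir.listFiles();
          if (entries == null) {
              return; // null if dir does not exist, is not a directory, or is not readable
          }
          for (File entry : entries) {
              if (entry.isDirectory()) {
                  walk(entry); // recurse into subfolders
              } else {
                  String name = entry.getName();
                  int dot = name.lastIndexOf('.');
                  String extension = (dot >= 0) ? name.substring(dot + 1) : "";
                  System.out.println("full path:          " + entry.getAbsolutePath());
                  System.out.println("file name only:     " + name);
                  System.out.println("file size:          " + entry.length() + " bytes");
                  System.out.println("last modified date: " + new Date(entry.lastModified()));
                  System.out.println("file extension:     " + extension);
                  // The file content itself would be read with a FileInputStream;
                  // in SMILA it is typically stored as an attachment of the record
                  // (see the Attachment option in the configuration below).
              }
          }
      }

      public static void main(String[] args) {
          walk(new File(args[0])); // e.g. java FileMetadataWalker /data/docs
      }
  }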

Crawling configuration

The configuration file can be found at configuration/org.eclipse.smila.framework/file.

Crawling Configuration explanation

The root element of the crawling configuration is CrawlJob, which contains the following sub-elements (a configuration sketch follows the list):

  • DataSourceID – the identification of a data source.
  • SchemaID – specifies the schema for a crawler job.
  • DataConnectionID – specifies which agent or crawler should be used.
    • Crawler – implementation class of a Crawler.
    • Agent – implementation class of an Agent.
  • CompoundHandling – specifies whether packed data (e.g. a ZIP archive containing files) should be unpacked and the contained files crawled (YES or NO).
  • Attributes – lists all attributes that describe a file (LastModifiedDate, Filename, Path, Content, Extension, Size).
    • Attribute
      • Type (required) – the data type (String, Integer or Date).
      • Name (required) – the attribute's name.
      • HashAttribute – specifies whether a hash should be created (true or false).
      • KeyAttribute – specifies whether the attribute is used as a key for the object, for example as the record ID (true or false).
      • Attachment – specifies whether the attribute returns its data as an attachment of the record.
  • Process – contains parameters for gathering data.
    • BaseDir – the directory where the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a CrawlerCriticalException is thrown).
      • Filter – selects the file types and the crawling mode.
      • Recursive – specifies whether subdirectories are crawled (true or false).
      • CaseSensitive – specifies whether the filter is case-sensitive (true or false).
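
Putting these elements together, a minimal configuration might look like the sketch below. All values are hypothetical placeholders, and some details are assumptions (for example, whether Recursive and CaseSensitive appear as XML attributes of the Filter element, as shown here); the schema referenced by SchemaID defines the exact format:

  <CrawlJob>
    <DataSourceID>file</DataSourceID>
    <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
    <DataConnectionID>
      <Crawler>FileSystemCrawler</Crawler>
    </DataConnectionID>
    <CompoundHandling>No</CompoundHandling>
    <Attributes>
      <!-- hypothetical attribute set; see the attribute list above -->
      <Attribute Type="String" Name="Path" KeyAttribute="true"/>
      <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true"/>
      <Attribute Type="String" Name="Content" Attachment="true"/>
    </Attributes>
    <Process>
      <BaseDir>/data/docs</BaseDir>
      <Filter Recursive="true" CaseSensitive="false"/>
    </Process>
  </CrawlJob>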
