Jump to: navigation, search

SMILA/Documentation/Filesystem Crawler


The file system crawler recursively fetches all files from a given directory. Besides providing the content of files, it may also gather any file's metadata from the following list:

  • full path
  • file name only
  • file size
  • last modified date
  • file content
  • file extension

Crawling configuration

The example configuration file is located at configuration/org.eclipse.smila.connectivity.framework/file.xml.

Defining Schema: org.eclipse.smila.connectivits.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd.

Crawling configuration explanation

The root element of crawling configuration is DataSourceConnectionConfig and contains the following sub elements:

  • DataSourceID – the identification of a data source
  • SchemaID – specifies the schema for a crawler job
  • DataConnectionID – describes which agent crawler should be used
    • Crawler – implementation class of a Crawler
    • Agent – implementation class of an Agent
  • CompoundHandling – specify if packed data (like a ZIP containing files) should be unpack and files within should be crawled (YES or NO).
  • Attributes – list all attributes which describe a file. (LastModifiedDate, Filename, Path, Content, Extension, Size)
    • Attribute
      • Type (required) – the data type (String, Integer or Date).
      • Name (required) – attributes name.
      • HashAttribute – specify if a hash should be created (true or false).
      • KeyAttribute – creates a key for this object, for example for record id (true or false).
      • Attachment – specify if the attribute return the data as attachment of record.
  • Process – contains parameters for gathering data.
    • BaseDir – the directory the crawling process begin (if is null, cannot be found/access or is not a directory a CrawlerCriticalException will be thrown).
      • Filter – select file type and crawling mode.
        • Recursive – (true or false).
        • CaseSensitive – true or false
      • Include – file to crawl.
        • Name - String e.g. "*.txt" (crawl all text files). Everything that is not included is excluded automatically. You could use a star * as wildcard.
      • Exclude – files to leave out while crawling.
        • Name – String e.g. "*test*" (leave out all text files which have test in the filename).

Crawling configuration example

    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
    <Attribute Type="String" Name="Filename">
    <Attribute Type="String" Name="Path" KeyAttribute="true">
    <Attribute Type="String" Name="Content" Attachment="true">
    <Attribute Type="String" Name="Extension"
    <Attribute Type="String" Name="Size">
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.htm"/>
      <Include Name="*.html"/>
      <Include Name="*.xml"/>     

Output example for default configuration

For a text file named crawler.txt located in c:/data the crawler will create the following record:

<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">file:&lt;Path=c:\data\crawler.txt&gt;</Val>
  <Val key="_source">file</Val>
  <Val key="LastModifiedDate" type="datetime">2009-02-25T17:44:46+0100</Val>
  <Val key="Path">c:\data\crawler.txt</Val>
  <Val key="Filename">crawler.txt</Val>
  <Val key="Extension">txt</Val>
  <Val key="Size" type="long">36</Val>
  <Val key="_HASH_TOKEN">66f373e6f13498a65c7f5f1cf185611e94ab45630c825cc2028dda38e8245c7</Val>

See also