SMILA/Documentation/Filesystem Crawler
Revision as of 10:33, 20 April 2011
Overview
The file system crawler recursively fetches all files from a given directory. Besides providing the content of files, it can also gather the following metadata for each file:
- full path
- file name only
- file size
- last modified date
- file content
- file extension
Crawling configuration
The example configuration file is located at configuration/org.eclipse.smila.connectivity.framework/file.xml.
Defining Schema: org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd.
Crawling configuration explanation
The root element of the crawling configuration is DataSourceConnectionConfig; it contains the following subelements:
- DataSourceID – the identification of a data source
- SchemaID – specifies the schema for a crawler job
- DataConnectionID – specifies which Agent or Crawler should be used
- Crawler – implementation class of a Crawler
- Agent – implementation class of an Agent
- CompoundHandling – specifies whether packed data (like a ZIP containing files) should be unpacked and the contained files crawled (YES or NO).
- Attributes – list all attributes which describe a file. (LastModifiedDate, Filename, Path, Content, Extension, Size)
- Attribute
- Type (required) – the data type (String, Integer, or Date).
- Name (required) – the attribute's name.
- HashAttribute – specifies whether a hash should be created for this attribute (true or false).
- KeyAttribute – specifies whether this attribute is used to build the key for this object, e.g. the record id (true or false).
- Attachment – specifies whether the attribute's data is returned as an attachment of the record.
- Process – contains parameters for gathering data.
- BaseDir – the directory where the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a CrawlerCriticalException is thrown).
- Filter – select file type and crawling mode.
- Recursive – whether subdirectories are crawled as well (true or false).
- CaseSensitive – whether the filter patterns are matched case-sensitively (true or false).
- Include – files to crawl.
- Name – String, e.g. "*.txt" (crawl all text files). Everything that is not included is excluded automatically. A star (*) can be used as a wildcard.
- Exclude – files to leave out while crawling.
- Name – String, e.g. "*test*" (leave out all files whose name contains "test").
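
The Include/Exclude semantics described above (everything not included is excluded, * as wildcard, optional case sensitivity) can be sketched as follows. This is an illustration of the filter rules, not the crawler's actual implementation:

```python
from fnmatch import fnmatchcase

def matches_filter(filename, includes, excludes, case_sensitive=False):
    """Return True if a filename passes Include/Exclude filter rules like the ones above."""
    if not case_sensitive:
        filename = filename.lower()
        includes = [p.lower() for p in includes]
        excludes = [p.lower() for p in excludes]
    # Everything that is not matched by an Include pattern is excluded automatically.
    if not any(fnmatchcase(filename, p) for p in includes):
        return False
    # An Exclude pattern overrides a matching Include.
    return not any(fnmatchcase(filename, p) for p in excludes)
```

For example, with Include "*.txt" and Exclude "*test*", the file crawler.txt passes the filter while mytest.txt does not.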
Crawling configuration example
<DataSourceConnectionConfig xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd">
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <DataConnectionID>
    <Crawler>FileSystemCrawlerDS</Crawler>
  </DataConnectionID>
  <CompoundHandling>Yes</CompoundHandling>
  <Attributes>
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
      <FileAttributes>LastModifiedDate</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Filename">
      <FileAttributes>Name</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Path" KeyAttribute="true">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Content" Attachment="true">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Extension">
      <FileAttributes>FileExtension</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Size">
      <FileAttributes>Size</FileAttributes>
    </Attribute>
  </Attributes>
  <Process>
    <BaseDir>c:\data</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.htm"/>
      <Include Name="*.html"/>
      <Include Name="*.xml"/>
    </Filter>
  </Process>
</DataSourceConnectionConfig>
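
Because the configuration is plain XML, its key settings can be read with any XML parser. A small Python sketch (illustrative, not part of SMILA) that extracts the crawl settings from a configuration shaped like the example above:

```python
import xml.etree.ElementTree as ET

# A trimmed-down configuration in the same shape as the example above.
CONFIG = """<DataSourceConnectionConfig>
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <CompoundHandling>Yes</CompoundHandling>
  <Process>
    <BaseDir>c:\\data</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.xml"/>
    </Filter>
  </Process>
</DataSourceConnectionConfig>"""

def read_config(xml_text):
    """Extract the main crawl settings from a file-system crawler configuration."""
    root = ET.fromstring(xml_text)
    filt = root.find("Process/Filter")
    return {
        "source": root.findtext("DataSourceID"),
        "base_dir": root.findtext("Process/BaseDir"),
        "recursive": filt.get("Recursive") == "true",
        "includes": [i.get("Name") for i in filt.findall("Include")],
    }
```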
Output example for default configuration
For a text file named crawler.txt located in c:\data, the crawler will create the following record:
<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">file:<Path=c:\data\crawler.txt></Val>
  <Val key="_source">file</Val>
  <Val key="LastModifiedDate" type="datetime">2009-02-25T17:44:46+0100</Val>
  <Val key="Path">c:\data\crawler.txt</Val>
  <Val key="Filename">crawler.txt</Val>
  <Val key="Extension">txt</Val>
  <Val key="Size" type="long">36</Val>
  <Val key="_HASH_TOKEN">66f373e6f13498a65c7f5f1cf185611e94ab45630c825cc2028dda38e8245c7</Val>
  <Attachment>Content</Attachment>
</Record>
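
The _HASH_TOKEN is built from the attributes flagged with HashAttribute="true" (LastModifiedDate in the default configuration), so the token changes whenever a hashed attribute changes. A hedged sketch of the idea in Python; the exact hashing scheme SMILA uses is not specified here:

```python
import hashlib

def hash_token(record, hash_attributes):
    """Build a change-detection token from the attributes flagged HashAttribute="true".
    Illustrative only; SMILA's actual token format may differ."""
    digest = hashlib.sha256()
    for name in hash_attributes:
        digest.update(str(record.get(name, "")).encode("utf-8"))
    return digest.hexdigest()
```

With this scheme, two crawl runs over an unchanged file yield the same token, while a modified LastModifiedDate yields a different one.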