SMILA/Documentation/Filesystem Crawler
Revision as of 10:33, 20 April 2011
Overview
The file system crawler recursively fetches all files from a given directory. Besides providing the content of files, it can also gather the following metadata for each file:
- full path
- file name only
- file size
- last modified date
- file content
- file extension
Crawling configuration
The example configuration file is located at configuration/org.eclipse.smila.connectivity.framework/file.xml.
Defining Schema: org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd.
Crawling configuration explanation
The root element of the crawling configuration is DataSourceConnectionConfig; it contains the following subelements:
- DataSourceID – the identification of a data source
- SchemaID – specifies the schema for a crawler job
- DataConnectionID – specifies which Agent or Crawler should be used
- Crawler – implementation class of a Crawler
- Agent – implementation class of an Agent
- CompoundHandling – specifies whether packed data (like a ZIP containing files) should be unpacked and the contained files crawled (YES or NO).
- Attributes – list all attributes which describe a file. (LastModifiedDate, Filename, Path, Content, Extension, Size)
- Attribute
- Type (required) – the data type (String, Integer, or Date).
- Name (required) – the attribute's name.
- HashAttribute – specifies whether a hash should be created for this attribute (true or false).
- KeyAttribute – specifies whether this attribute is used to build the key for this object, e.g. the record id (true or false).
- Attachment – specifies whether the attribute's data is returned as an attachment of the record.
- Process – contains parameters for gathering data.
- BaseDir – the directory where the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a CrawlerCriticalException is thrown).
- Filter – select file type and crawling mode.
- Recursive – whether subdirectories are crawled as well (true or false).
- CaseSensitive – whether the filter patterns are matched case-sensitively (true or false).
- Include – files to crawl.
- Name – String, e.g. "*.txt" (crawl all text files). Everything that is not included is excluded automatically. A star (*) can be used as a wildcard.
- Exclude – files to leave out while crawling.
- Name – String, e.g. "*test*" (leave out all files whose name contains "test").
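
The Include/Exclude semantics described above (everything not included is excluded, * as wildcard, optional case sensitivity) can be sketched as follows. This is an illustration of the filter rules, not the crawler's actual implementation:

```python
from fnmatch import fnmatchcase

def matches_filter(filename, includes, excludes, case_sensitive=False):
    """Return True if a filename passes Include/Exclude filter rules like the ones above."""
    if not case_sensitive:
        filename = filename.lower()
        includes = [p.lower() for p in includes]
        excludes = [p.lower() for p in excludes]
    # Everything that is not matched by an Include pattern is excluded automatically.
    if not any(fnmatchcase(filename, p) for p in includes):
        return False
    # An Exclude pattern overrides a matching Include.
    return not any(fnmatchcase(filename, p) for p in excludes)
```

For example, with Include "*.txt" and Exclude "*test*", the file crawler.txt passes the filter while mytest.txt does not.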
Crawling configuration example
<DataSourceConnectionConfig xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd">
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <DataConnectionID>
    <Crawler>FileSystemCrawlerDS</Crawler>
  </DataConnectionID>
  <CompoundHandling>Yes</CompoundHandling>
  <Attributes>
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
      <FileAttributes>LastModifiedDate</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Filename">
      <FileAttributes>Name</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Path" KeyAttribute="true">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Content" Attachment="true">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Extension">
      <FileAttributes>FileExtension</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Size">
      <FileAttributes>Size</FileAttributes>
    </Attribute>
  </Attributes>
  <Process>
    <BaseDir>c:\data</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.htm"/>
      <Include Name="*.html"/>
      <Include Name="*.xml"/>
    </Filter>
  </Process>
</DataSourceConnectionConfig>
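
Because the configuration is plain XML, its key settings can be read with any XML parser. A small Python sketch (illustrative, not part of SMILA) that extracts the crawl settings from a configuration shaped like the example above:

```python
import xml.etree.ElementTree as ET

# A trimmed-down configuration in the same shape as the example above.
CONFIG = """<DataSourceConnectionConfig>
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <CompoundHandling>Yes</CompoundHandling>
  <Process>
    <BaseDir>c:\\data</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.xml"/>
    </Filter>
  </Process>
</DataSourceConnectionConfig>"""

def read_config(xml_text):
    """Extract the main crawl settings from a file-system crawler configuration."""
    root = ET.fromstring(xml_text)
    filt = root.find("Process/Filter")
    return {
        "source": root.findtext("DataSourceID"),
        "base_dir": root.findtext("Process/BaseDir"),
        "recursive": filt.get("Recursive") == "true",
        "includes": [i.get("Name") for i in filt.findall("Include")],
    }
```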
Output example for default configuration
For a text file named crawler.txt located in c:\data, the crawler will create the following record:
<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">file:<Path=c:\data\crawler.txt></Val>
  <Val key="_source">file</Val>
  <Val key="LastModifiedDate" type="datetime">2009-02-25T17:44:46+0100</Val>
  <Val key="Path">c:\data\crawler.txt</Val>
  <Val key="Filename">crawler.txt</Val>
  <Val key="Extension">txt</Val>
  <Val key="Size" type="long">36</Val>
  <Val key="_HASH_TOKEN">66f373e6f13498a65c7f5f1cf185611e94ab45630c825cc2028dda38e8245c7</Val>
  <Attachment>Content</Attachment>
</Record>
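
The _HASH_TOKEN is built from the attributes flagged with HashAttribute="true" (LastModifiedDate in the default configuration), so the token changes whenever a hashed attribute changes. A hedged sketch of the idea in Python; the exact hashing scheme SMILA uses is not specified here:

```python
import hashlib

def hash_token(record, hash_attributes):
    """Build a change-detection token from the attributes flagged HashAttribute="true".
    Illustrative only; SMILA's actual token format may differ."""
    digest = hashlib.sha256()
    for name in hash_attributes:
        digest.update(str(record.get(name, "")).encode("utf-8"))
    return digest.hexdigest()
```

With this scheme, two crawl runs over an unchanged file yield the same token, while a modified LastModifiedDate yields a different one.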