Difference between revisions of "SMILA/Documentation/Filesystem Crawler"
Revision as of 03:35, 19 March 2009
What does FileSystemCrawler do
The FileSystemCrawler recursively collects all files and folders, starting from a given directory. For each file it can gather any of the following:
- full path
- file name only
- file size
- last modified date
- file content
- file extension
Crawling configuration
The configuration file can be found at configuration/org.eclipse.smila.framework/file. The defining schema is org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/filesystemIndexOrder.xsd.
Crawling configuration explanation
The root element of the crawling configuration is CrawlJob. It contains the following sub-elements:
- DataSourceID – the identifier of the data source.
- SchemaID – specifies the schema for the crawl job.
- DataConnectionID – describes which agent or crawler should be used.
  - Crawler – implementation class of a Crawler.
  - Agent – implementation class of an Agent.
- CompoundHandling – specifies whether packed data (such as a zip file containing files) should be unpacked and the files within it crawled (YES or NO).
- Attributes – lists all attributes that describe a file (LastModifiedDate, Filename, Path, Content, Extension, Size).
  - Attribute
    - Type (required) – the data type (String, Integer or Date).
    - Name (required) – the attribute's name.
    - HashAttribute – specifies whether a hash should be created (true or false).
    - KeyAttribute – creates a key for this object, for example for the record ID (true or false).
    - Attachment – specifies whether the attribute returns its data as an attachment of the record.
- Process – contains the parameters for gathering data.
  - BaseDir – the directory where the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a CrawlerCriticalException is thrown).
  - Filter – selects the file types and the crawling mode.
    - Recursive – true or false.
    - CaseSensitive – true or false.
    - Include – files to crawl.
      - Name – String, e.g. "*.txt" (crawl all text files). Everything that is not included is excluded automatically. You can use an asterisk * as a wildcard.
    - Exclude – files to leave out while crawling.
      - Name – String, e.g. "*test*" (leave out all text files which have test in the filename).
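The Include and Exclude entries described above can be combined in one Filter. The fragment below is only a sketch: it assumes that Exclude is placed inside Filter next to Include and takes the same Name attribute, which should be checked against filesystemIndexOrder.xsd before use.

```xml
<!-- Hypothetical Filter fragment; the placement of Exclude is an assumption. -->
<Filter Recursive="true" CaseSensitive="false">
  <!-- Crawl all text files ... -->
  <Include Name="*.txt"/>
  <!-- ... but leave out those that have "test" in the file name. -->
  <Exclude Name="*test*"/>
</Filter>
```

With patterns like these, a file named notes.txt would presumably be crawled, while mytest.txt would be left out.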
Crawling configuration example
<CrawlJob
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/filesystemIndexOrder.xsd"
>
  <DataSourceID>file</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
  <DataConnectionID>
    <Crawler>FileSystemCrawlerDS</Crawler>
  </DataConnectionID>
  <CompoundHandling>Yes</CompoundHandling>
  <Attributes>
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
      <FileAttributes>LastModifiedDate</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Filename">
      <FileAttributes>Name</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Path" KeyAttribute="true">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Content" Attachment="true">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Extension">
      <FileAttributes>FileExtension</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Size">
      <FileAttributes>Size</FileAttributes>
    </Attribute>
  </Attributes>
  <Process>
    <BaseDir>c:\data</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
      <Include Name="*.htm"/>
      <Include Name="*.html"/>
      <Include Name="*.xml"/>
      <!--
      <Include Name="*.pdf"/>
      <Include Name="*.doc"/>
      <Include Name="*.xls"/>
      <Include Name="*.ppt"/>
      <Include Name="*.rtf"/>
      -->
    </Filter>
  </Process>
</CrawlJob>