|
|
Line 1: |
Line 1: |
− | == Overview ==
| + | #REDIRECT [[SMILA/Documentation/Filesystem_Crawler]] |
− | | + | |
− | The File System crawler recursively fetches all files from a given directory. Besides providing the content of files, it may also gather any file's metadata from the following list:
| + | |
− | | + | |
− | * full path
| + | |
− | * file name only
| + | |
− | * file size
| + | |
− | * last modified date
| + | |
− | * file content
| + | |
− | * file extension
| + | |
− | | + | |
− | == Crawling configuration ==
| + | |
− | | + | |
− | The example configuration file is located at <tt>configuration/org.eclipse.smila.connectivity.framework/file.xml</tt>.
| + | |
− | | + | |
− | Defining schema: <tt>org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd</tt>.
| + | |
− | | + | |
− | == Crawling configuration explanation ==
| + | |
− | | + | |
− | The root element of the crawling configuration is <tt>DataSourceConnectionConfig</tt>. It contains the following sub-elements:
| + | |
− | | + | |
− | * <tt>DataSourceID</tt> – the identification of a data source
| + | |
− | * <tt>SchemaID</tt> – specifies the schema for a crawler job
| + | |
− | * <tt>DataConnectionID</tt> – describes which agent or crawler should be used
| + | |
− | ** <tt>Crawler</tt> – implementation class of a Crawler
| + | |
− | ** <tt>Agent</tt> – implementation class of an Agent
| + | |
− | * <tt>CompoundHandling</tt> – specifies whether packed data (such as a ZIP archive containing files) should be unpacked and the files within crawled (YES or NO).
| + | |
− | * <tt>Attributes</tt> – lists all attributes that describe a file (LastModifiedDate, Filename, Path, Content, Extension, Size).
| + | |
− | ** <tt>Attribute</tt>
| + | |
− | *** <tt>Type</tt> (required) – the data type (String, Integer or Date).
| + | |
− | *** <tt>Name</tt> (required) – the attribute's name.
| + | |
− | *** <tt>HashAttribute</tt> – specifies whether a hash should be created (true or false).
| + | |
− | *** <tt>KeyAttribute</tt> – creates a key for this object, for example for record id (true or false).
| + | |
− | *** <tt>Attachment</tt> – specifies whether the attribute returns its data as an attachment of the record.
| + | |
− | | + | |
− | * <tt>Process</tt> – contains parameters for gathering data.
| + | |
− | ** <tt>BaseDir</tt> – the directory where the crawling process begins (if it is null, cannot be found or accessed, or is not a directory, a CrawlerCriticalException is thrown).
| + | |
− | *** <tt>Filter</tt> – selects the file types and the crawling mode.
| + | |
− | **** <tt>Recursive</tt> – whether subdirectories are crawled as well (true or false).
| + | |
− | **** <tt>CaseSensitive</tt> – whether file names are matched case-sensitively (true or false).
| + | |
− | *** <tt>Include</tt> – files to crawl.
| + | |
− | **** <tt>Name</tt> – String, e.g. <tt>"*.txt"</tt> (crawl all text files). The star <tt>*</tt> can be used as a wildcard. Everything that is not included is excluded automatically.
| + | |
− | *** <tt>Exclude</tt> – files to leave out while crawling.
| + | |
− | **** <tt>Name</tt> – String, e.g. <tt>"*test*"</tt> (leave out all files that have <tt>test</tt> in the file name).
| + | |
− | | + | |
− | == Crawling configuration example ==
| + | |
− | | + | |
− | <source lang="xml">
| + | |
− | <DataSourceConnectionConfig
| + | |
− | xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
| + | |
− | xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.filesystem/schemas/FileSystemDataSourceConnectionConfigSchema.xsd">
| + | |
− | <DataSourceID>file</DataSourceID>
| + | |
− | <SchemaID>org.eclipse.smila.connectivity.framework.crawler.filesystem</SchemaID>
| + | |
− | <DataConnectionID>
| + | |
− | <Crawler>FileSystemCrawlerDS</Crawler>
| + | |
− | </DataConnectionID>
| + | |
− | <CompoundHandling>Yes</CompoundHandling>
| + | |
− | <Attributes>
| + | |
− | <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true">
| + | |
− | <FileAttributes>LastModifiedDate</FileAttributes>
| + | |
− | </Attribute>
| + | |
− | <Attribute Type="String" Name="Filename">
| + | |
− | <FileAttributes>Name</FileAttributes>
| + | |
− | </Attribute>
| + | |
− | <Attribute Type="String" Name="Path" KeyAttribute="true">
| + | |
− | <FileAttributes>Path</FileAttributes>
| + | |
− | </Attribute>
| + | |
− | <Attribute Type="String" Name="Content" Attachment="true">
| + | |
− | <FileAttributes>Content</FileAttributes>
| + | |
− | </Attribute>
| + | |
− | <Attribute Type="String" Name="Extension">
| + | |
− | <FileAttributes>FileExtension</FileAttributes>
| + | |
− | </Attribute>
| + | |
− | <Attribute Type="String" Name="Size">
| + | |
− | <FileAttributes>Size</FileAttributes>
| + | |
− | </Attribute>
| + | |
− | </Attributes>
| + | |
− | <Process>
| + | |
− | <BaseDir>c:\data</BaseDir>
| + | |
− | <Filter Recursive="true" CaseSensitive="false">
| + | |
− | <Include Name="*.txt"/>
| + | |
− | <Include Name="*.htm"/>
| + | |
− | <Include Name="*.html"/>
| + | |
− | <Include Name="*.xml"/>
| + | |
− | </Filter>
| + | |
− | </Process>
| + | |
− | </DataSourceConnectionConfig>
| + | |
− | </source>
| + | |
− | | + | |
− | == Output example for default configuration ==
| + | |
− | | + | |
− | For a text file named <tt>crawler.txt</tt> located in <tt>c:\data</tt>, the crawler creates the following record:
| + | |
− | | + | |
− | <source lang="xml">
| + | |
− | <Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
| + | |
− | <Val key="_recordid">file:<Path=c:\data\crawler.txt></Val>
| + | |
− | <Val key="_source">file</Val>
| + | |
− | <Val key="LastModifiedDate" type="datetime">2009-02-25T17:44:46+0100</Val>
| + | |
− | <Val key="Path">c:\data\crawler.txt</Val>
| + | |
− | <Val key="Filename">crawler.txt</Val>
| + | |
− | <Val key="Extension">txt</Val>
| + | |
− | <Val key="Size" type="long">36</Val>
| + | |
− | <Val key="_HASH_TOKEN">66f373e6f13498a65c7f5f1cf185611e94ab45630c825cc2028dda38e8245c7</Val>
| + | |
− | <Attachment>Content</Attachment>
| + | |
− | </Record>
| + | |
− | </source>
| + | |
− | | + | |
− | == See also ==
| + | |
− | | + | |
− | * [[SMILA/Documentation/Crawler|Crawler]]
| + | |
− | * [[SMILA/Documentation/Web Crawler|Web Crawler]]
| + | |
− | * [[SMILA/Documentation/JDBC Crawler|JDBC Crawler]]
| + | |
− | | + | |
− | __FORCETOC__
| + | |
− | | + | |
− | [[Category:SMILA]]
| + | |