Difference between revisions of "SMILA/Documentation/Importing/Crawler/File"

From Eclipsepedia

 

Revision as of 13:23, 29 November 2011

Currently, the file system workers are implemented in a very simple way so that the importing framework can be tested. A more sophisticated implementation will follow soon.

File Crawler

  • Worker name: fileCrawler
  • Parameters:
    • dataSource
    • rootFolder
  • Input slots:
    • directoriesToCrawl
  • Output slots:
    • directoriesToCrawl
    • filesToCrawl

The File Crawler starts crawling in the rootFolder. For each subdirectory, it writes one record to the bucket connected to directoriesToCrawl, with each record going into its own bulk. For each file in the folder, it writes one record to the bucket connected to filesToCrawl, starting a new bulk every 1000 files. The bucket in slot directoriesToCrawl should also be connected to the input slot of the File Crawler so that the subdirectories are crawled in follow-up tasks. The file records do not yet contain the file content, only metadata attributes:

  • file.name
  • file.folder
  • file.path (also set as record ID)
  • file.extension
  • file.size
  • file.last-modified (also written to attribute _deltaHash for delta checking)

The attribute _source is set from the task parameter dataSource, which currently has no further meaning.
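
The crawling step described above can be illustrated with a small Python sketch. This is not the SMILA implementation (the workers themselves are not shown on this page); it only mirrors the described behaviour for a single folder. The function name crawl_folder, the record key _recordid, the layout of the directory records, the timestamp format and the extension format (without the leading dot) are assumptions made for illustration.

```python
import os
from datetime import datetime, timezone

MAX_FILES_PER_BULK = 1000  # the text says a new bulk is started every 1000 files


def crawl_folder(folder, data_source, max_files_per_bulk=MAX_FILES_PER_BULK):
    """Crawl one folder (not recursively) and return (directory_bulks, file_bulks).

    Hypothetical helper: records are modelled as plain dicts, bulks as lists.
    """
    directory_bulks = []  # each subdirectory record goes into its own bulk
    file_records = []
    with os.scandir(folder) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                # Subdirectories are not crawled here; a follow-up task does that.
                # The directory record layout is an assumption.
                directory_bulks.append(
                    [{"_recordid": entry.path, "_source": data_source}]
                )
            elif entry.is_file(follow_symlinks=False):
                stat = entry.stat()
                # Timestamp format is an assumption (ISO 8601, UTC).
                modified = datetime.fromtimestamp(
                    stat.st_mtime, tz=timezone.utc
                ).isoformat()
                file_records.append({
                    "_recordid": entry.path,   # file.path is also the record ID
                    "_source": data_source,    # taken from the dataSource parameter
                    "_deltaHash": modified,    # last-modified is used for delta checking
                    "file.name": entry.name,
                    "file.folder": folder,
                    "file.path": entry.path,
                    "file.extension": os.path.splitext(entry.name)[1].lstrip("."),
                    "file.size": stat.st_size,
                    "file.last-modified": modified,
                })
    # Split the file records into bulks of at most max_files_per_bulk records.
    file_bulks = [
        file_records[i:i + max_files_per_bulk]
        for i in range(0, len(file_records), max_files_per_bulk)
    ]
    return directory_bulks, file_bulks
```

In this sketch, feeding each directory record back into crawl_folder corresponds to connecting the directoriesToCrawl output bucket to the crawler's input slot.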

File Fetcher

  • Worker name: fileFetcher
  • Parameters: none
  • Input slots:
    • filesToFetch
  • Output slots:
    • files

For each input record, the File Fetcher reads the file referenced in attribute file.path and adds its content as attachment file.content.
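
As a rough illustration of the fetching step, the following Python sketch reads the file named in file.path and attaches its bytes under file.content. It is a sketch under stated assumptions, not the SMILA worker: the function name fetch_file and the "attachments" key used to hold attachments in a plain dict are hypothetical.

```python
def fetch_file(record):
    """Return a copy of the record with the file content added as an attachment.

    Hypothetical helper: the record is a plain dict, and attachments are
    stored under an assumed "attachments" key as raw bytes.
    """
    with open(record["file.path"], "rb") as f:
        content = f.read()
    fetched = dict(record)  # leave the input record unchanged
    attachments = dict(record.get("attachments", {}))
    attachments["file.content"] = content
    fetched["attachments"] = attachments
    return fetched
```

Reading in binary mode keeps the content unchanged regardless of the file's encoding, which matches the idea of an attachment rather than a text attribute.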