Difference between revisions of "SMILA/Documentation/Importing/Crawler/File"

From Eclipsepedia

(File Extractor Worker)
Line 19:
 
The File Crawler starts crawling in the <tt>rootFolder</tt> and produces one record for each subdirectory in the bucket connected to <tt>directoriesToCrawl</tt> (and each record goes to an own bulk), and one record per file in the folder in bucket connected to <tt>filesToCrawl</tt> (a new bulk is started each 1000 files). The bucket in slot <tt>directoriesToCrawl</tt> should be connected to the input slot of the File Crawler so that the subdirectories are crawled in followup tasks. The file records do not yet contain the file content but only metadata attributes:
 
  
- * <tt>file.name</tt>
+ * <tt>fileName</tt>
- * <tt>file.folder</tt>
+ * <tt>fileFolder</tt>
- * <tt>file.path</tt> (also set as record ID)
+ * <tt>filePath</tt> (also set as record ID)
- * <tt>file.extension</tt>
+ * <tt>fileExtension</tt>
- * <tt>file.size</tt>
+ * <tt>fileSize</tt>
- * <tt>file.lastModified</tt> (also written to attribute <tt>_deltaHash</tt> for delta checking)
+ * <tt>fileLastModified</tt> (also written to attribute <tt>_deltaHash</tt> for delta checking)
  
 
The attribute <tt>_source</tt> is set from the task parameter <tt>dataSource</tt> which has no further meaning currently, but it is needed by the delta service.
 
Line 57:
 
For each input record, an input stream to the described file is created and fed into the CompoundExtractor service. The produced records are converted to look like records produced by the file crawler, with these attributes set (if provided by the extractor):
 
  
- * <tt>file.path</tt>: complete path in compound
+ * <tt>filePath</tt>: complete path in compound
- * <tt>file.folder</tt>: folder of element in compound
+ * <tt>fileFolder</tt>: folder of element in compound
- * <tt>file.name</tt>: filename part of path
+ * <tt>fileName</tt>: filename part of path
- * <tt>file.extension</tt>: extension part of filename
+ * <tt>fileExtension</tt>: extension part of filename
- * <tt>file.lastModified</tt>: last modification timestamp
+ * <tt>fileLastModified</tt>: last modification timestamp
- * <tt>file.size</tt>: uncompressed size
+ * <tt>fileSize</tt>: uncompressed size
 
* <tt>_deltaHash</tt>: computed as by the FileCrawler worker
 
 
* <tt>_compoundRecordId</tt>: record ID of top-level compound this element was extracted from
 

Revision as of 09:11, 18 May 2012

Currently, the file system workers are implemented very simply so that we can test the importing framework. A more sophisticated implementation will follow soon.

File Crawler

  • Worker name: fileCrawler
  • Parameters:
    • dataSource
    • rootFolder
  • Task generator: runOnceTrigger
  • Input slots:
    • directoriesToCrawl
  • Output slots:
    • directoriesToCrawl
    • filesToCrawl

The File Crawler starts crawling in the rootFolder. It writes one record for each subdirectory to the bucket connected to directoriesToCrawl (each record goes into its own bulk) and one record for each file in the folder to the bucket connected to filesToCrawl (a new bulk is started every 1000 files). The bucket in slot directoriesToCrawl should be connected to the input slot of the File Crawler so that the subdirectories are crawled in follow-up tasks. The file records do not yet contain the file content, only metadata attributes:

  • fileName
  • fileFolder
  • filePath (also set as record ID)
  • fileExtension
  • fileSize
  • fileLastModified (also written to attribute _deltaHash for delta checking)

The attribute _source is set from the task parameter dataSource. The parameter currently has no further meaning, but the delta service requires it.
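
The crawling step described above can be sketched in Python. This is only an illustration under stated assumptions, not the SMILA implementation: the function names, the `_recordid` key, the `bulk_size` parameter, the content of the directory records, and the timestamp format are all hypothetical. Only the attribute names and the bulk size of 1000 come from the text.

```python
import os
from typing import Dict, List, Tuple

Record = Dict[str, object]


def file_record(path: str, data_source: str) -> Record:
    """Build a metadata-only record for one file (no content yet)."""
    stat = os.stat(path)
    name = os.path.basename(path)
    return {
        "_recordid": path,          # assumption: key used to hold the record ID
        "_source": data_source,     # taken from the dataSource parameter
        "fileName": name,
        "fileFolder": os.path.dirname(path),
        "filePath": path,           # also used as record ID
        "fileExtension": os.path.splitext(name)[1].lstrip("."),
        "fileSize": stat.st_size,
        "fileLastModified": stat.st_mtime,  # assumption: raw epoch seconds
        "_deltaHash": str(stat.st_mtime),   # last-modified value reused for delta checking
    }


def crawl_folder(
    folder: str, data_source: str, bulk_size: int = 1000
) -> Tuple[List[List[Record]], List[List[Record]]]:
    """Crawl a single folder (no recursion).

    Returns (directory_bulks, file_bulks). Each subdirectory record gets a
    bulk of its own; file records are grouped into bulks of bulk_size.
    """
    directory_bulks: List[List[Record]] = []
    file_bulks: List[List[Record]] = []
    current: List[Record] = []
    for entry in sorted(os.listdir(folder)):
        path = os.path.join(folder, entry)
        if os.path.isdir(path):
            # Assumption: a directory record only needs its path and source.
            directory_bulks.append([{"_recordid": path, "_source": data_source}])
        elif os.path.isfile(path):
            current.append(file_record(path, data_source))
            if len(current) == bulk_size:
                file_bulks.append(current)
                current = []
    if current:
        file_bulks.append(current)
    return directory_bulks, file_bulks
```

Recursion is left out on purpose: in the described setup, each directory bulk is fed back into the crawler's input slot, so every subdirectory is handled by a follow-up task that performs the same single-folder crawl.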

If the running CompoundExtractor service identifies an object as an extractable compound, it is marked with the attribute _isCompound set to true.

The fileCrawler is usually the first worker in the workflow and the job is started in runOnce mode.

File Fetcher

  • Worker name: fileFetcher
  • Parameters: none
  • Input slots:
    • filesToFetch
  • Output slots:
    • files

For each input record, the File Fetcher reads the file referenced in attribute filePath and adds the content as attachment file.content.
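
As a rough illustration of the fetch step, the following Python sketch reads the referenced file and returns its bytes as an attachment. It is not the SMILA worker: `fetch_file` and the record and attachment dictionaries are hypothetical. The path attribute name follows the File Crawler's attribute list, and the attachment name is the one given above.

```python
from typing import Dict, Tuple

PATH_ATTRIBUTE = "filePath"          # path attribute as listed for the File Crawler
CONTENT_ATTACHMENT = "file.content"  # attachment name as given in the text


def fetch_file(record: Dict[str, object]) -> Tuple[Dict[str, object], Dict[str, bytes]]:
    """Return the unchanged record plus an attachments dict holding the file content."""
    with open(str(record[PATH_ATTRIBUTE]), "rb") as handle:
        content = handle.read()
    return record, {CONTENT_ATTACHMENT: content}
```

Keeping the content in a separate attachments dictionary mirrors the split between metadata attributes and binary attachments that the text describes.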

File Extractor Worker

  • Worker name: fileExtractor
  • Parameters: none
  • Input slots:
    • compounds
  • Output slots:
    • files

For each input record, an input stream to the described file is created and fed into the CompoundExtractor service. The produced records are converted to look like records produced by the file crawler, with these attributes set (if provided by the extractor):

  • filePath: complete path in compound
  • fileFolder: folder of element in compound
  • fileName: filename part of path
  • fileExtension: extension part of filename
  • fileLastModified: last modification timestamp
  • fileSize: uncompressed size
  • _deltaHash: computed in the same way as by the File Crawler worker
  • _compoundRecordId: record ID of top-level compound this element was extracted from
  • _isCompound: set to true for elements that are compounds themselves.
  • _compoundPath: sequence of filePath attribute values of the compound objects needed to navigate to the compound element.

If the element is not a compound itself, its content is added as attachment file.content.
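
The extraction step can be illustrated with Python's standard zipfile module standing in for the CompoundExtractor service. This is a sketch under assumptions, not the actual worker: `extract_compound` is a hypothetical name, only ZIP archives are handled, nested compounds are detected by a `.zip` extension and not unpacked, the timestamp format is invented, and `_compoundPath` is assumed to contain just the top-level compound's ID for first-level elements.

```python
import io
import posixpath
import zipfile
from typing import BinaryIO, Dict, List, Tuple

Record = Dict[str, object]


def extract_compound(
    compound_record_id: str, stream: BinaryIO
) -> List[Tuple[Record, Dict[str, bytes]]]:
    """Return one (record, attachments) pair per file element in a ZIP stream."""
    results: List[Tuple[Record, Dict[str, bytes]]] = []
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # only file elements become records
            name = posixpath.basename(info.filename)
            extension = posixpath.splitext(name)[1].lstrip(".")
            # Assumption: ISO-like string built from the ZIP entry timestamp.
            last_modified = "%04d-%02d-%02dT%02d:%02d:%02d" % info.date_time
            record: Record = {
                "filePath": info.filename,
                "fileFolder": posixpath.dirname(info.filename),
                "fileName": name,
                "fileExtension": extension,
                "fileLastModified": last_modified,
                "fileSize": info.file_size,  # uncompressed size
                "_deltaHash": last_modified,
                "_compoundRecordId": compound_record_id,
                "_compoundPath": [compound_record_id],  # assumption, see lead-in
            }
            attachments: Dict[str, bytes] = {}
            if extension.lower() == "zip":  # assumption: extension marks a compound
                record["_isCompound"] = True
            else:
                attachments["file.content"] = archive.read(info)
            results.append((record, attachments))
    return results
```

A caller would open the compound file in binary mode (or wrap bytes in `io.BytesIO`) and pass the stream together with the compound's record ID.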