Difference between revisions of "SMILA/Documentation/Importing/Crawler/File"

Revision as of 09:31, 20 February 2012

Currently, the file system workers are implemented in a very simplistic way so that the importing framework can be tested. A more sophisticated implementation will follow soon.

File Crawler

Worker name: fileCrawler
Parameters:
- dataSource
- rootFolder
Task generator: runOnceTrigger
Input slots:
- directoriesToCrawl

Output slots:
- directoriesToCrawl
- filesToCrawl

The File Crawler starts crawling in the rootFolder. For each subdirectory it produces one record in the bucket connected to directoriesToCrawl, and each of these records is written to its own bulk. For each file in the folder it produces one record in the bucket connected to filesToCrawl, starting a new bulk every 1000 files. The bucket in slot directoriesToCrawl should also be connected to the input slot of the File Crawler so that the subdirectories are crawled in follow-up tasks. The file records do not yet contain the file content, only metadata attributes:

file.name
file.folder
file.path (also set as record ID)
file.extension
file.size
file.lastModified (also written to attribute _deltaHash for delta checking)

The attribute _source is set from the task parameter dataSource, which currently has no further meaning.
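
The metadata attributes listed above can be illustrated with a small Python sketch. It is not SMILA's implementation: the function name `make_file_record`, the `id` key used for the record ID, the extension without a leading dot, and the use of a plain dictionary as a record are assumptions made for this example.

```python
import os


def make_file_record(path, data_source):
    """Build a metadata-only record for one file; the content is not read."""
    stat = os.stat(path)
    folder, name = os.path.split(path)
    # Assumption: the extension is stored without the leading dot.
    extension = os.path.splitext(name)[1].lstrip(".")
    last_modified = stat.st_mtime
    return {
        "id": path,  # hypothetical key; the text only says file.path is the record ID
        "_source": data_source,  # taken from the dataSource task parameter
        "_deltaHash": str(last_modified),  # same value as file.lastModified
        "file.name": name,
        "file.folder": folder,
        "file.path": path,
        "file.extension": extension,
        "file.size": stat.st_size,
        "file.lastModified": last_modified,
    }
```

Because `_deltaHash` is derived only from the modification time in this sketch, a file whose content changes without a new timestamp would look unchanged to a delta check.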

The fileCrawler is usually the first worker in the workflow and the job is started in runOnce mode.
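
The way one crawl task splits its output into bulks, and how subdirectories lead to follow-up tasks, can be sketched roughly as follows. This is only an illustration under stated assumptions: `crawl_directory`, `crawl`, the `directory` key, and the in-memory lists that stand in for buckets and bulks are hypothetical names, not part of SMILA.

```python
import os

MAX_FILES_PER_BULK = 1000  # bulk size stated in the text


def crawl_directory(folder, max_files_per_bulk=MAX_FILES_PER_BULK):
    """Simulate one crawl task: list a single folder, without recursion."""
    directory_bulks = []  # stands in for the directoriesToCrawl bucket
    file_bulks = []       # stands in for the filesToCrawl bucket
    current = []
    with os.scandir(folder) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # Each subdirectory record goes into its own bulk.
            directory_bulks.append([{"directory": entry.path}])  # hypothetical key
        elif entry.is_file(follow_symlinks=False):
            current.append({"file.path": entry.path})
            if len(current) == max_files_per_bulk:
                file_bulks.append(current)
                current = []
    if current:
        file_bulks.append(current)
    return directory_bulks, file_bulks


def crawl(root_folder, max_files_per_bulk=MAX_FILES_PER_BULK):
    """Feed directory bulks back into the crawler, like follow-up tasks."""
    pending = [root_folder]
    all_file_bulks = []
    while pending:
        folder = pending.pop()
        dir_bulks, file_bulks = crawl_directory(folder, max_files_per_bulk)
        all_file_bulks.extend(file_bulks)
        pending.extend(rec["directory"] for bulk in dir_bulks for rec in bulk)
    return all_file_bulks
```

The `pending` list plays the role of the loop created by connecting the directoriesToCrawl bucket back to the crawler's input slot: the crawl ends when no directory records remain.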

File Fetcher

Worker name: fileFetcher
Parameters: none
Input slots:
- filesToFetch

Output slots:
- files

For each input record, the File Fetcher reads the file referenced in attribute file.path and adds the content as attachment file.content.
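
A minimal Python sketch of this fetch step is shown here. It assumes that a record is a plain dictionary and models attachments under a hypothetical `_attachments` key; the function name `fetch_file` is likewise an assumption, not SMILA's API.

```python
def fetch_file(record):
    """Return a copy of the record with the file content attached."""
    with open(record["file.path"], "rb") as f:
        content = f.read()
    fetched = dict(record)  # leave the input record untouched
    # Hypothetical modeling: attachments are kept apart from the metadata.
    fetched["_attachments"] = {"file.content": content}
    return fetched
```

The file is read in binary mode, so the attachment holds the raw bytes regardless of the file's type or encoding.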

File Extractor Worker

Worker name: fileExtractor
Parameters: none
Input slots:
- compounds
Output slots:
- files

Dependency: CompoundExtractor service

For each input record, an input stream to the described file is created and fed into the CompoundExtractor service. The produced records are converted to look like records produced by the file crawler, with these attributes set (if provided by the extractor):

file.path: complete path in compound
file.folder: folder of element in compound
file.name: filename part of path
file.extension: extension part of filename
file.lastModified: last modification date
file.size: uncompressed size
_deltaHash: derived from file.lastModified, if set.

If the element is not a compound itself, its content is added as attachment file.content.
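
The conversion described above can be sketched with Python's standard `zipfile` module standing in for the CompoundExtractor service. This substitution is an assumption for illustration only: the real service may support other archive formats and behave differently. The function name `extract_compound`, the `_attachments` key, and the ISO timestamp format are also hypothetical.

```python
import io
import posixpath
import zipfile
from datetime import datetime


def extract_compound(stream):
    """Turn the elements of a ZIP stream into crawler-like records."""
    records = []
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # only file elements become records in this sketch
            folder, name = posixpath.split(info.filename)
            last_modified = datetime(*info.date_time).isoformat()
            record = {
                "file.path": info.filename,  # complete path in compound
                "file.folder": folder,
                "file.name": name,
                "file.extension": posixpath.splitext(name)[1].lstrip("."),
                "file.lastModified": last_modified,
                "file.size": info.file_size,  # uncompressed size
                "_deltaHash": last_modified,  # derived from file.lastModified
            }
            data = archive.read(info)
            # Attach content only if the element is not itself a compound.
            if not zipfile.is_zipfile(io.BytesIO(data)):
                record["_attachments"] = {"file.content": data}  # hypothetical key
            records.append(record)
    return records
```

In this sketch a nested archive yields a record without content, which mirrors the rule that only non-compound elements get the file.content attachment.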

