Difference between revisions of "SMILA/Documentation/Importing/Crawler/File"

Revision as of 05:34, 21 May 2012

Currently, the file system workers are implemented in a very simple way so that the importing framework can be tested. A more sophisticated implementation will follow soon.

File Crawler

The File Crawler imports files from a root folder and all subdirectories below.

Configuration

The File Crawler worker is usually the first worker in a workflow and the job is started in runOnce mode.

Worker name: fileCrawler
Parameters:
- dataSource: (req.) value for attribute _source, needed by the delta service
- rootFolder: (req.) crawl starting point
- filters: (opt.) conditions for including or excluding files and folders from the import
  - maxFileSize: maximum file size; larger files are filtered out
  - maxFolderDepth: starting from the root folder, this is the maximum depth to crawl into subdirectories
  - followSymbolicLinks: whether to follow symbolic links to files/folders or not
  - filePatterns: regex patterns for filtering crawled files on the basis of their file name
    - include: if include patterns are specified, at least one of them must match the file name. If no include patterns are specified, this is handled as if all file names are included.
    - exclude: if at least one exclude pattern matches the file name, the crawled file is filtered out
  - folderPatterns: regex patterns for filtering crawled folders and files on the basis of their file path
    - include: Only relevant for crawled files: If include patterns are specified, at least one of them must match the file path. If no include patterns are specified, this is handled as if all file paths are included.
    - exclude: Only relevant for crawled folders: If at least one exclude pattern matches the folder name, the folder (and its subdirectories) will not be crawled.
- mapping (req.) specifies how to map file properties to record attributes
  - filePath (opt.) mapping attribute for the complete file path
  - fileFolder (opt.) mapping attribute for the file folder
  - fileName (opt.) mapping attribute for the file name
  - fileExtension (opt.) mapping attribute for the file extension
  - fileSize (opt.) mapping attribute for the file size (in bytes)
  - fileLastModified (opt.) mapping attribute for the file's last modified date
  - fileContent (opt.) mapping attachment for the file content
Task generator: runOnceTrigger
Input slots:
- directoriesToCrawl
Output slots:
- directoriesToCrawl
- filesToCrawl
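
The include/exclude rules for filePatterns described above can be sketched as follows. This is a minimal illustration in Python, not the SMILA implementation: the function name passes_file_patterns is hypothetical, and the sketch assumes that a pattern must match the whole file name, which the description above does not state.

```python
import re


def passes_file_patterns(file_name, include=(), exclude=()):
    """Return True if file_name survives the include/exclude rules."""
    # No include patterns is handled as if all file names are included.
    included = not include or any(re.fullmatch(p, file_name) for p in include)
    # A single matching exclude pattern filters the file out.
    excluded = any(re.fullmatch(p, file_name) for p in exclude)
    return included and not excluded


if __name__ == "__main__":
    for name in ("report.pdf", "invalid.txt"):
        print(name, passes_file_patterns(name, [".*"], ["invalid.txt", "invalid.pdf"]))
```

Because the patterns are regular expressions, an unescaped dot as in "invalid.txt" matches any character, so a literal dot would have to be written as "invalid\\.txt" in JSON.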

Processing

The File Crawler starts crawling in the rootFolder. For each subdirectory it produces one record in the bucket connected to directoriesToCrawl (each record goes into a bulk of its own), and for each file in the folder it produces one record in the bucket connected to filesToCrawl (a new bulk is started every 1000 files). The bucket in slot directoriesToCrawl should be connected to the input slot of the File Crawler so that the subdirectories are crawled in follow-up tasks. The file records do not yet contain the file content, only metadata attributes:

fileName
fileFolder
filePath (also set as record ID)
fileExtension
fileSize
fileLastModified (also written to attribute _deltaHash for delta checking)

The attribute _source is set from the task parameter dataSource. It currently has no further meaning, but it is needed by the delta service.
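
The processing of a single folder described above can be sketched roughly as follows. This is an illustration under several assumptions, not the actual worker: records are plain dicts, the record ID is stored under a hypothetical key _recordid, the timestamp format and the extension without its leading dot are arbitrary choices, and the content of a directory record is invented for the example.

```python
import os
from datetime import datetime, timezone


def build_file_record(path, data_source, mapping):
    """Build a metadata-only record for one file (no file content yet)."""
    stat = os.stat(path)
    folder, name = os.path.split(path)
    # Assumption: ISO timestamp; the real attribute format may differ.
    last_modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat()
    properties = {
        "filePath": path,
        "fileFolder": folder,
        "fileName": name,
        "fileExtension": os.path.splitext(name)[1].lstrip("."),
        "fileSize": stat.st_size,
        "fileLastModified": last_modified,
    }
    # The file path is the record ID; the last modified date feeds _deltaHash.
    record = {"_recordid": path, "_source": data_source, "_deltaHash": last_modified}
    # Only properties named in the mapping are written, under the mapped attribute.
    for prop, attribute in mapping.items():
        if prop in properties:
            record[attribute] = properties[prop]
    return record


def crawl_folder(folder, data_source, mapping, bulk_size=1000):
    """Return (directory_bulks, file_bulks) for the direct children of folder."""
    directory_bulks, file_records = [], []
    with os.scandir(folder) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir():
                # Each subdirectory record goes into a bulk of its own.
                directory_bulks.append([{"_recordid": entry.path, "_source": data_source}])
            elif entry.is_file():
                file_records.append(build_file_record(entry.path, data_source, mapping))
    # A new file bulk is started every bulk_size records.
    file_bulks = [file_records[i:i + bulk_size]
                  for i in range(0, len(file_records), bulk_size)]
    return directory_bulks, file_bulks
```

In this sketch, each directory bulk would become the input of a follow-up crawl task, which mirrors connecting the directoriesToCrawl bucket back to the worker's input slot.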

Compounds

If the running CompoundExtractor service identifies an object as an extractable compound, the object is marked with attribute _isCompound set to true.

Dependency: CompoundExtractor service

Examples

Example of a file crawl job:

 {
   "name":"crawlFileJob",
   "workflow":"fileCrawling",
   "parameters":{
     "tempStore":"temp",
     "dataSource":"files",
     "rootFolder":"/temp",
     "jobToPushTo":"buildBulks",
     "mapping":{
       "fileContent":"fileContent",
       "filePath":"filePath",
       "fileFolder":"fileFolder",
       "fileName":"fileName",
       "fileSize":"fileSize",
       "fileExtension":"fileExtension",
       "fileLastModified":"fileLastModified"
     },
     "filters":{
       "maxFileSize":1000000000,
       "maxFolderDepth":10,
       "followSymbolicLinks":true,
       "filePatterns":{
         "include":[".*"],
         "exclude":["invalid.txt", "invalid.pdf"]  
       },          
       "folderPatterns":{
         "include":[".*"],
         "exclude":["invalid-dir"]  
       }
     }
   }
 }
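
Before submitting such a job definition, it can be useful to check that the parameters marked (req.) in the configuration list are present. The following is a small sketch of such a check; the helper name missing_crawler_parameters is hypothetical and not part of SMILA, and the check covers only the fileCrawler parameters, not those of other workers in the workflow.

```python
import json

# Parameters marked (req.) for the fileCrawler worker.
REQUIRED_PARAMETERS = ("dataSource", "rootFolder", "mapping")


def missing_crawler_parameters(job_definition):
    """Return the required fileCrawler parameters that the job definition lacks."""
    parameters = job_definition.get("parameters", {})
    return [name for name in REQUIRED_PARAMETERS if name not in parameters]


job = json.loads("""
{
  "name": "crawlFileJob",
  "workflow": "fileCrawling",
  "parameters": {
    "dataSource": "files",
    "rootFolder": "/temp",
    "mapping": {"filePath": "filePath", "fileContent": "fileContent"}
  }
}
""")

if __name__ == "__main__":
    print(missing_crawler_parameters(job))
```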

File Fetcher

Worker name: fileFetcher
Parameters: none
Input slots:
- filesToFetch

Output slots:
- files

For each input record, the worker reads the file referenced in attribute filePath and adds its content as attachment fileContent.
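
The fetch step can be pictured with the following sketch. It assumes that a record is a plain dict and that attachments are held in a separate dict of byte strings; both are simplifications chosen for the example, and fetch_file is a hypothetical name.

```python
def fetch_file(record, path_attribute="filePath", attachment_name="fileContent"):
    """Return the attachments for a record: the bytes of the file it references."""
    with open(record[path_attribute], "rb") as source:
        return {attachment_name: source.read()}
```

Reading in binary mode keeps the content unchanged, so that any decoding or text extraction is left to later workers in the workflow.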

File Extractor Worker

Worker name: fileExtractor
Parameters: none
Input slots:
- compounds
Output slots:
- files

Dependency: CompoundExtractor service

For each input record, an input stream to the described file is created and fed into the CompoundExtractor service. The produced records are converted to look like records produced by the file crawler, with these attributes set (if provided by the extractor):

filePath: complete path in compound
fileFolder: folder of element in compound
fileName: filename part of path
fileExtension: extension part of filename
fileLastModified: last modification timestamp
fileSize: uncompressed size
_deltaHash: computed in the same way as by the File Crawler worker
_compoundRecordId: record ID of top-level compound this element was extracted from
_isCompound: set to true for elements that are compounds themselves.
_compoundPath: sequence of filePath attribute values of the compound objects needed to navigate to the compound element.

If the element is not a compound itself, its content is added as attachment fileContent.
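
To illustrate the shape of the produced records, the sketch below uses Python's zipfile module as a stand-in for the CompoundExtractor service. It is not how SMILA extracts compounds: the function extract_compound, the _recordid key, the rule that an element is a compound if its name ends in .zip, and the timestamp format are all assumptions made for the example. Only one nesting level is handled, so _compoundPath contains just the top-level compound's filePath.

```python
import posixpath
import zipfile
from datetime import datetime


def extract_compound(compound_record, stream):
    """Return a list of (record, attachments) pairs for the elements of a ZIP stream."""
    results = []
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            folder, name = posixpath.split(info.filename)
            last_modified = datetime(*info.date_time).isoformat()
            # Assumption: nested compounds are recognized by extension only.
            is_compound = name.lower().endswith(".zip")
            record = {
                "filePath": info.filename,
                "fileFolder": folder,
                "fileName": name,
                "fileExtension": posixpath.splitext(name)[1].lstrip("."),
                "fileLastModified": last_modified,
                "fileSize": info.file_size,  # uncompressed size
                "_deltaHash": last_modified,
                "_compoundRecordId": compound_record["_recordid"],
                "_isCompound": is_compound,
                "_compoundPath": [compound_record["filePath"]],
            }
            # Content is attached only for elements that are not compounds themselves.
            attachments = {} if is_compound else {"fileContent": archive.read(info)}
            results.append((record, attachments))
    return results
```

An element marked with _isCompound could then be fed into the same function again, with its own filePath appended to _compoundPath.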

@@ Line 2: / Line 2: @@
 === File Crawler ===
+The File Crawler imports files from a root folder and all subdirectories below.
+===== Configuration =====
+The File Crawler worker is usually the first worker in a workflow and the job is started in <tt>runOnce</tt> mode.
 * Worker name: <tt>fileCrawler</tt>
 * Parameters:
-** <tt>dataSource</tt>
+** <tt>dataSource</tt>: ''(req.)'' value for attribute <tt>_source</tt>, needed by the delta service
-** <tt>rootFolder</tt>
+** <tt>rootFolder</tt>: ''(req.)'' crawl starting point
+** <tt>filters</tt> ''(opt.)'' filters with conditions to in- or exclude files and folders from import
+*** <tt>maxFileSize</tt>: maximum file size, files that are bigger are filtered out
+*** <tt>maxFolderDepth</tt>: starting from the root folder, this is the maximum depth to crawl into subdirectories
+*** <tt>followSymbolicLinks</tt>: whether to follow symbolic links to files/folders or not
+*** <tt>filePatterns</tt>: regex patterns for filtering crawled files on the basis of their file name
+**** <tt>include</tt>: if include patterns are specified, at least one of them must match the file name. If no include patterns are specified, this is handled as if all file names are included.
+**** <tt>exclude</tt>: if at least one exclude pattern matches the file name, the crawled file is filtered out
+*** <tt>folderPatterns</tt>: regex patterns for filtering crawled folders and files on the basis of their file path
+**** <tt>include</tt>: Only relevant for crawled files: If include patterns are specified, at least one of them must match the file path. If no include patterns are specified, this is handled as if all file paths are included.
+**** <tt>exclude</tt>: Only relevant for crawled folders: If at least one exclude pattern matches the folder name, the folder (and its subdirectories) will not be crawled.
+** <tt>mapping</tt> ''(req.)'' specifies how to map file properties to record attributes
+*** <tt>filePath</tt> ''(opt.)'' mapping attribute for the complete file path
+*** <tt>fileFolder</tt> ''(opt.)'' mapping attribute for the file folder
+*** <tt>fileName</tt> ''(opt.)'' mapping attribute for the file name
+*** <tt>fileExtension</tt> ''(opt.)'' mapping attribute for the file extension
+*** <tt>fileSize</tt> ''(opt.)'' mapping attribute for the file size (in bytes)
+*** <tt>fileLastModified</tt> ''(opt.)'' mapping attribute for the file's last modified date
+*** <tt>fileContent</tt> ''(opt.)'' mapping attachment for the file content
 * Task generator: <tt>[[SMILA/Documentation/TaskGenerators#RunOnceTriggerTaskGenerator|runOnceTrigger]]</tt>
 * Input slots:
 ** <tt>directoriesToCrawl</tt>
 * Output slots:
 ** <tt>directoriesToCrawl</tt>
 ** <tt>filesToCrawl</tt>
-* Dependency: [[SMILA/Documentation/Importing/SimpleCompoundExtractorService|CompoundExtractor service]]
+===== Processing =====
 The File Crawler starts crawling in the <tt>rootFolder</tt> and produces one record for each subdirectory in the bucket connected to <tt>directoriesToCrawl</tt> (and each record goes to an own bulk), and one record per file in the folder in bucket connected to <tt>filesToCrawl</tt> (a new bulk is started each 1000 files). The bucket in slot <tt>directoriesToCrawl</tt> should be connected to the input slot of the File Crawler so that the subdirectories are crawled in followup tasks. The file records do not yet contain the file content but only metadata attributes:
@@ Line 27: / Line 50: @@
 The attribute <tt>_source</tt> is set from the task parameter <tt>dataSource</tt> which has no further meaning currently, but it is needed by the delta service.
+===== Compounds =====
 If the running CompoundExtractor service identifies an object as an extractable compound, it is marked with attribute <tt>_isCompound</tt> set to <tt>true</tt>.
-The fileCrawler is usually the first worker in the workflow and the job is started in <tt>runOnce</tt> mode.
+* Dependency: [[SMILA/Documentation/Importing/SimpleCompoundExtractorService|CompoundExtractor service]]
+===== Examples =====
+''Example of a file crawl job:''
+<pre>
+ {
+   "name":"crawlFileJob",
+   "workflow":"fileCrawling",
+   "parameters":{
+     "tempStore":"temp",
+     "dataSource":"files",
+     "rootFolder":"/temp",
+     "jobToPushTo":"buildBulks",
+     "mapping":{
+       "fileContent":"fileContent",
+       "filePath":"filePath",
+       "fileFolder":"fileFolder",
+       "fileName":"fileName",
+       "fileSize":"fileSize",
+       "fileExtension":"fileExtension",
+       "fileLastModified":"fileLastModified"
+     },
+     "filters":{
+       "maxFileSize":1000000000,
+       "maxFolderDepth":10,
+       "followSymbolicLinks":true,
+       "filePatterns":{
+         "include":[".*"],
+         "exclude":["invalid.txt", "invalid.pdf"]
+       },
+       "folderPatterns":{
+         "include":[".*"],
+         "exclude":["invalid-dir"]
+       }
+     }
+   }
+ }
+</pre>
 === File Fetcher ===
