Difference between revisions of "SMILA/Documentation/Importing/Crawler/File"

From Eclipsepedia

Revision as of 07:25, 21 May 2012

The File Crawler, File Fetcher and File Extractor workers are used for importing files from a file system. For the big picture and how the workers interact, have a look at the Importing Concept.


File Crawler

The File Crawler crawls files from a root folder and the subdirectories below.

Configuration

The File Crawler worker is usually the first worker in a workflow and the job is started in runOnce mode.

  • Worker name: fileCrawler
  • Parameters:
    • dataSource: (req.) value for attribute _source, needed e.g. by the delta service
    • rootFolder: (req.) crawl starting point
    • filters (opt.) filters with conditions to include or exclude files and folders from the import
      • maxFileSize: maximum file size; files that are bigger are filtered out
      • maxFolderDepth: starting from the root folder, this is the maximum depth to crawl into subdirectories
      • followSymbolicLinks: whether to follow symbolic links to files/folders or not
      • filePatterns: regex patterns for filtering crawled files on the basis of their file name
        • include: if include patterns are specified, at least one of them must match the file name. If no include patterns are specified, this is handled as if all file names are included.
        • exclude: if at least one exclude pattern matches the file name, the crawled file is filtered out
      • folderPatterns: regex patterns for filtering crawled folders and files on the basis of their file path
        • include: Only relevant for crawled files: If include patterns are specified, at least one of them must match the file path. If no include patterns are specified, this is handled as if all file paths are included.
        • exclude: Only relevant for crawled folders: If at least one exclude pattern matches the folder name, the folder (and its subdirectories) will not be imported.
    • mapping (req.) specifies how to map file properties to record attributes
      • filePath (opt.) mapping attribute for the complete file path
      • fileFolder (opt.) mapping attribute for the file folder
      • fileName (opt.) mapping attribute for the file name
      • fileExtension (opt.) mapping attribute for the file extension
      • fileSize (opt.) mapping attribute for the file size (in bytes)
      • fileLastModified (opt.) mapping attribute for the file's last modified date
  • Task generator: runOnceTrigger
  • Input slots:
    • directoriesToCrawl
  • Output slots:
    • directoriesToCrawl
    • filesToCrawl
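
The filter rules listed under filters can be read as a simple decision procedure. The following is a minimal Python sketch, not SMILA's actual implementation: the function names file_passes_filters and folder_is_crawled are hypothetical, and the sketch assumes that a pattern has to match the whole file name, file path or folder name, which the description above does not state explicitly.

```python
import re


def _matches_any(patterns, text):
    # Assumption: a pattern has to match the whole string, not just a part.
    return any(re.fullmatch(p, text) for p in patterns)


def file_passes_filters(file_name, file_path, size, filters):
    """Return True if a crawled file is kept, False if it is filtered out."""
    max_size = filters.get("maxFileSize")
    if max_size is not None and size > max_size:
        return False  # files bigger than maxFileSize are filtered out

    file_patterns = filters.get("filePatterns", {})
    folder_patterns = filters.get("folderPatterns", {})

    # No include patterns means that all file names are included.
    file_includes = file_patterns.get("include", [])
    if file_includes and not _matches_any(file_includes, file_name):
        return False
    if _matches_any(file_patterns.get("exclude", []), file_name):
        return False

    # folderPatterns.include is checked against the file path of crawled files.
    path_includes = folder_patterns.get("include", [])
    if path_includes and not _matches_any(path_includes, file_path):
        return False
    return True


def folder_is_crawled(folder_name, filters):
    """Return False if an exclude pattern matches the folder name."""
    excludes = filters.get("folderPatterns", {}).get("exclude", [])
    return not _matches_any(excludes, folder_name)
```

With filePatterns.exclude set to ["invalid.txt", "invalid.pdf"] and folderPatterns.exclude set to ["invalid-dir"], this sketch drops a file named invalid.txt and does not descend into a folder named invalid-dir, while other files and folders pass.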
Processing

The File Crawler starts crawling in the rootFolder. It produces one record for each subdirectory in the bucket connected to directoriesToCrawl (each record goes into its own bulk), and one record per file in the folder in the bucket connected to filesToCrawl (a new bulk is started every 1000 files). The bucket in slot directoriesToCrawl should be connected to the input slot of the File Crawler so that the subdirectories are crawled in follow-up tasks. The resulting records do not yet contain the file content, only the metadata attributes configured in the mapping.
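
The bulk layout described above can be sketched as follows. This is only an illustration under stated assumptions: the function crawl_folder_once and the dictionary-based record shape are hypothetical and not part of SMILA, and the folder listing is passed in as plain (path, is_directory) tuples so that the example stays self-contained.

```python
def crawl_folder_once(entries, max_files_per_bulk=1000):
    """Split the entries of one folder into directory bulks and file bulks.

    entries: list of (path, is_directory) tuples for a single folder.
    Returns (directory_bulks, file_bulks), each a list of bulks,
    where a bulk is a list of records.
    """
    # One record per subdirectory, and each record goes into its own bulk.
    directory_bulks = [
        [{"filePath": path}] for path, is_directory in entries if is_directory
    ]

    # One record per file; a new bulk is started every 1000 files.
    file_records = [
        {"filePath": path} for path, is_directory in entries if not is_directory
    ]
    file_bulks = [
        file_records[i:i + max_files_per_bulk]
        for i in range(0, len(file_records), max_files_per_bulk)
    ]
    return directory_bulks, file_bulks
```

In this sketch, a folder with 2 subdirectories and 2500 files yields 2 directory bulks with one record each and 3 file bulks. Each directory bulk would then be picked up by a follow-up crawl task.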

The attribute _source is set from the task parameter dataSource. It currently has no further meaning, but it is needed by the delta service.

Compounds:

If the running CompoundExtractor service identifies an object as an extractable compound, it is marked with attribute _isCompound set to true.


File Fetcher

For each input record, reads the file referenced in attribute filePath and adds the content as attachment fileContent.

Configuration
  • Worker name: fileFetcher
  • Parameters:
    • mapping (req.) needed to get the file path and to add the fetched file content
      • filePath (req.) to read the attribute that contains the file path
      • fileContent (req.) attachment name where the file content is written to
  • Input slots:
    • filesToFetch
  • Output slots:
    • files
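
What the File Fetcher does with its two mapping entries can be sketched in a few lines. The record layout used here (a dictionary with "attributes" and "attachments") and the function name fetch_file are assumptions made for illustration only. They do not reflect SMILA's record API.

```python
from pathlib import Path


def fetch_file(record, mapping):
    """Return a copy of the record with the file content added as attachment."""
    path_attribute = mapping["filePath"]      # attribute that holds the file path
    attachment_name = mapping["fileContent"]  # attachment that receives the content

    content = Path(record["attributes"][path_attribute]).read_bytes()

    # Copy the record so that the input record stays unchanged.
    fetched = {
        "attributes": dict(record["attributes"]),
        "attachments": dict(record.get("attachments", {})),
    }
    fetched["attachments"][attachment_name] = content
    return fetched
```

Because the attribute and attachment names come from the mapping, the same sketch works when the crawler was configured to write the path to a differently named attribute.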


File Extractor Worker

Used for extracting compounds (zip, tgz, etc.) in file crawling.

Configuration
  • Worker name: fileExtractor
  • Parameters:
    • filters (opt., see File Crawler)
      • maxFileSize: (opt., see File Crawler)
      • filePatterns: (opt., see File Crawler)
        • include: (opt., see File Crawler)
        • exclude: (opt., see File Crawler)
      • folderPatterns: (opt., see File Crawler)
        • include: (opt., see File Crawler)
        • exclude: (opt.) The behaviour differs slightly from that of the File Crawler: if an exclude pattern matches the folder path of an extracted file, the file is filtered out. Depending on the pattern, however, files from subdirectories of that folder may still be imported.
    • mapping (req.)
      • filePath (req., see File Crawler): needed to get the file path of the compound file to extract
      • fileFolder (opt., see File Crawler)
      • fileName (opt., see File Crawler)
      • fileExtension (opt., see File Crawler)
      • fileSize (opt., see File Crawler)
      • fileLastModified (opt., see File Crawler)
      • fileContent (req., see File Fetcher)
  • Input slots:
    • compounds
  • Output slots:
    • files
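
The different handling of folderPatterns exclude in the File Extractor can be illustrated with a short sketch. It is a hedged example, not the worker's real code: the function extracted_file_is_kept is hypothetical, and it assumes that a pattern must match the whole folder path of the extracted file and that paths inside the compound use "/" as separator.

```python
import posixpath
import re


def extracted_file_is_kept(path_in_compound, folder_excludes):
    """Return False if an exclude pattern matches the file's folder path."""
    folder_path = posixpath.dirname(path_in_compound)
    # Assumption: the pattern has to match the complete folder path.
    return not any(re.fullmatch(p, folder_path) for p in folder_excludes)
```

Under these assumptions, the exclude pattern "docs" filters out docs/a.txt but keeps docs/sub/b.txt, because the folder path docs/sub is not matched. A pattern such as "docs.*" filters out both. The File Crawler, in contrast, never descends into an excluded folder, so its subdirectories are skipped as well.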
Processing

For each input record, an input stream to the described file is created and fed into the CompoundExtractor service. The produced records are converted to look like records produced by the File Crawler: the attributes specified in the mapping configuration are set, and additionally:

  • _deltaHash: computed in the same way as by the File Crawler worker
  • _compoundRecordId: record ID of top-level compound this element was extracted from
  • _isCompound: set to true for elements that are compounds themselves.
  • _compoundPath: sequence of filePath attribute values of the compound objects needed to navigate to the compound element.

TODO ??? If the element is not a compound itself, its content is added as attachment fileContent.

Examples

Example of a file crawl job:

 {
   "name":"crawlFileJob",
   "workflow":"fileCrawling",
   "parameters":{
     "tempStore":"temp",
     "dataSource":"files",
     "rootFolder":"/temp",
     "jobToPushTo":"buildBulks",
     "mapping":{
       "fileContent":"fileContent",
       "filePath":"filePath",
       "fileFolder":"fileFolder",
       "fileName":"fileName",
       "fileSize":"fileSize",
       "fileExtension":"fileExtension",
       "fileLastModified":"fileLastModified"
     },
     "filters":{
       "maxFileSize":1000000000,
       "maxFolderDepth":10,
       "followSymbolicLinks":true,
       "filePatterns":{
         "include":[".*"],
         "exclude":["invalid.txt", "invalid.pdf"]  
       },          
       "folderPatterns":{
         "include":[".*"],
         "exclude":["invalid-dir"]  
       }
     }
   }
 }