
Difference between revisions of "SMILA/Documentation/Importing/CompoundExtractorService"


Latest revision as of 08:14, 12 February 2013

CompoundExtractor Service

Interface: org.eclipse.smila.importing.CompoundExtractor

A CompoundExtractor service provides two kinds of methods:

  • check whether an object's filename, URL or mimetype identifies it as a compound object that can be extracted by the service.
  • extract the compound: given an InputStream with the compound content, produce records for the elements.

The element records can contain the following attributes:

  • fileName: the complete name of the entry in the compound object, usually something like a filesystem path
  • isCompound: set to true if the element is a supported compound object itself.
  • size: uncompressed size of the element
  • time: last modification timestamp, as a datetime value.
  • compressedSize: compressed size of the element
  • comment: a comment for the element in the compound
  • isRootCompound: set to true if the record describes the processed compound object itself.
  • compounds: a sequence of the compound files to look into to reach this element. For example, if the compound /data/compound.zip contains a file archived/subcompound.zip which contains a file x.html, the compounds list for x.html would be:
    [/data/compound.zip, archived/subcompound.zip]

Not all attributes need to be set for all compound types.
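
The way the compounds attribute grows with each nesting level can be illustrated with a minimal sketch. It is not SMILA code: the class CompoundsPathSketch and its methods are hypothetical, records are modelled as plain maps, only nested ZIP files (recognized by their .zip extension) are followed, and Java 9 or later is assumed for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical illustration class; not part of SMILA. */
class CompoundsPathSketch {

  /** Returns one attribute map for the root compound and one per contained file. */
  static List<Map<String, Object>> extract(String compoundName, byte[] zipBytes) {
    List<Map<String, Object>> records = new ArrayList<>();
    Map<String, Object> root = new LinkedHashMap<>();
    root.put("fileName", compoundName);
    root.put("isCompound", true);
    root.put("isRootCompound", true);
    records.add(root);
    List<String> compounds = new ArrayList<>();
    compounds.add(compoundName);
    collect(zipBytes, compounds, records);
    return records;
  }

  /** Walks one ZIP level; 'compounds' is the chain of archives opened so far. */
  private static void collect(byte[] zipBytes, List<String> compounds,
      List<Map<String, Object>> records) {
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        byte[] content = zip.readAllBytes(); // reads up to the end of the current entry
        boolean isZip = entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip");
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("fileName", entry.getName());
        record.put("isCompound", isZip);
        record.put("compounds", new ArrayList<>(compounds));
        records.add(record);
        if (isZip) {
          // Descend: the nested archive is appended to the chain for its elements.
          List<String> nested = new ArrayList<>(compounds);
          nested.add(entry.getName());
          collect(content, nested, records);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Builds a one-entry ZIP in memory, for trying the sketch out. */
  static byte[] zipOf(String entryName, byte[] content) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
      zip.putNextEntry(new ZipEntry(entryName));
      zip.write(content);
      zip.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

Calling extract("/data/compound.zip", ...) on an archive built with zipOf("archived/subcompound.zip", zipOf("x.html", ...)) yields three records, and the one for x.html carries the two-element compounds list from the example above.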

Implementations

SimpleCompoundExtractorService

Bundle: org.eclipse.smila.importing.compounds.simple

This extractor service uses the classes provided by the JDK's java.util.zip package to extract compound objects. This means that it can currently support ZIP files and GZ files (not TAR.GZ, though).

Supported Mimetypes:

  • application/zip
  • application/x-gunzip
  • application/x-gzip

If the caller does not provide a mimetype at all, or provides only application/octet-stream, the service uses the current MimeType Identifier service to recognize the real mimetype from the filename extension.
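
The fallback described above can be sketched as follows. This is an illustration only: the class MimeTypeFallbackSketch is hypothetical, and identifyByExtension is a hard-coded stand-in for the MimeType Identifier service, whose real API is not shown on this page. The extension mapping (.zip, .gz, and no mapping for .tar.gz) is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration class; not part of SMILA. */
class MimeTypeFallbackSketch {

  /** The mimetypes listed as supported by the simple extractor. */
  static final Set<String> SUPPORTED = new HashSet<>(
      Arrays.asList("application/zip", "application/x-gunzip", "application/x-gzip"));

  /** Stand-in for the MimeType Identifier service (assumed mapping). */
  static String identifyByExtension(String fileName) {
    String lower = fileName.toLowerCase(Locale.ROOT);
    if (lower.endsWith(".tar.gz")) {
      return null; // TAR.GZ is not supported, so the sketch does not map it
    }
    if (lower.endsWith(".zip")) {
      return "application/zip";
    }
    if (lower.endsWith(".gz")) {
      return "application/x-gzip";
    }
    return null;
  }

  /** Uses the given mimetype unless it is missing or only the generic octet-stream type. */
  static boolean canExtract(String fileName, String mimeType) {
    String effective = mimeType;
    if (effective == null || "application/octet-stream".equals(effective)) {
      effective = fileName == null ? null : identifyByExtension(fileName);
    }
    return effective != null && SUPPORTED.contains(effective);
  }
}
```

An explicit, specific mimetype always wins in this sketch; the filename is consulted only when the caller gives nothing better than application/octet-stream.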

The compound types are treated differently:

  • For ZIP files, it creates one record for the ZIP file itself and one record for each contained element.
  • For GZ files, it creates one record for the compressed file with the original filename of the GZ file but no content, and one record for the content of the uncompressed file.
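
The two-record result for GZ files can be illustrated with the JDK's java.util.zip.GZIPInputStream. The sketch below is not the service's implementation: the class GzRecordsSketch is hypothetical, records are modelled as plain maps, and both the name of the second record (the GZ filename without its .gz suffix) and the "content" key are assumptions, since this page does not specify them. Java 9 or later is assumed for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration class; not part of SMILA. */
class GzRecordsSketch {

  /** Returns two records: the GZ file itself (no content) and the uncompressed file. */
  static List<Map<String, Object>> extractGz(String gzFileName, byte[] gzBytes) {
    byte[] content;
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzBytes))) {
      content = in.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }

    Map<String, Object> compressed = new LinkedHashMap<>();
    compressed.put("fileName", gzFileName);
    compressed.put("isRootCompound", true);
    compressed.put("compressedSize", (long) gzBytes.length);

    Map<String, Object> element = new LinkedHashMap<>();
    // Assumption: the element is named after the GZ file without ".gz".
    String elementName = gzFileName.endsWith(".gz")
        ? gzFileName.substring(0, gzFileName.length() - ".gz".length())
        : gzFileName;
    element.put("fileName", elementName);
    element.put("size", (long) content.length);
    element.put("compounds", Collections.singletonList(gzFileName));
    element.put("content", content); // hypothetical key for the uncompressed bytes

    List<Map<String, Object>> records = new ArrayList<>();
    records.add(compressed);
    records.add(element);
    return records;
  }

  /** Compresses bytes with GZIP in memory, for trying the sketch out. */
  static byte[] gzipOf(byte[] content) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
      gz.write(content);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```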

Configuration

The simple extractor service can be configured by means of a properties file configuration/org.eclipse.smila.importing.compounds.simple/extractor.properties.

The configuration properties are as follows:

  • zip.encoding: the encoding to use when extracting ZIP files that do not use UTF-8 (default: UTF-8)
    • example: zip.encoding=CP850
    • This property only takes effect when SMILA runs on a JRE older than JRE 7. From JRE 7 on, reading non-UTF-8 ZIPs with special characters will almost always throw an IllegalArgumentException: JRE 7's ZIP implementation does not honor the properties used by the previous implementation, while software written for Java versions before 7 has no other means of configuring a different code page for such ZIP files.
  • tmp.dir: the temporary directory to extract compounds to. By default, a directory named org.eclipse.smila.importing.compounds.simple is created in the user's temporary folder (e.g. on Windows 7 something like C:\Users\<username>\AppData\Local\Temp\org.eclipse.smila.importing.compounds.simple).
    • example: tmp.dir=/temp/SMILA.compound.extractor/
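
How the two properties and their defaults fit together can be sketched with java.util.Properties. This is only an illustration under assumptions: the class ExtractorConfigSketch is hypothetical, it parses the properties from a string rather than from the configuration file, and it takes the user's temporary folder from the java.io.tmpdir system property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical illustration class; not part of SMILA. */
class ExtractorConfigSketch {

  static final String BUNDLE = "org.eclipse.smila.importing.compounds.simple";

  final Charset zipEncoding;
  final Path tmpDir;

  ExtractorConfigSketch(Properties props) {
    // zip.encoding falls back to UTF-8 when the property is absent.
    zipEncoding = Charset.forName(props.getProperty("zip.encoding", "UTF-8"));
    // tmp.dir falls back to a bundle-named directory below the temp folder.
    String configured = props.getProperty("tmp.dir");
    tmpDir = configured != null
        ? Paths.get(configured)
        : Paths.get(System.getProperty("java.io.tmpdir"), BUNDLE);
  }

  /** Parses properties text such as the content of extractor.properties. */
  static ExtractorConfigSketch parse(String propertiesText) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new ExtractorConfigSketch(props);
  }
}
```

Parsing an empty text gives UTF-8 and the bundle-named temporary directory; a text containing zip.encoding=ISO-8859-1 and tmp.dir=/temp/SMILA.compound.extractor/ overrides both values.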

CommonsCompressCompoundExtractorService

Bundle: org.eclipse.smila.importing.compounds.compress

Supported Compound Formats:

  • Archives
    • zip
    • tar
    • cpio
    • java-archive
  • Compressions
    • bzip2
    • gzip

Configuration

The Commons Compress extractor service can be configured by means of a properties file configuration/org.eclipse.smila.importing.compounds.compress/extractor.properties.

The configuration properties are as follows:

  • zip.encoding: the encoding to use when extracting ZIP files that do not use UTF-8 (default: UTF-8)
    • example: zip.encoding=CP850
    • This property only takes effect when SMILA runs on a JRE older than JRE 7. From JRE 7 on, reading non-UTF-8 ZIPs with special characters will almost always throw an IllegalArgumentException: JRE 7's ZIP implementation does not honor the properties used by the previous implementation, while software written for Java versions before 7 has no other means of configuring a different code page for such ZIP files.
  • tmp.dir: the temporary directory to extract compounds to. By default, a directory named org.eclipse.smila.importing.compounds.compress is created in the user's temporary folder (e.g. on Windows 7 something like C:\Users\<username>\AppData\Local\Temp\org.eclipse.smila.importing.compounds.compress).
    • example: tmp.dir=/temp/SMILA.compound.extractor/