
SMILA/Documentation/CompoundManagement

Revision as of 05:44, 24 January 2012 by Juergen.schumacher.attensity.com (Talk | contribs)

Note: This is deprecated for SMILA 1.0. The connectivity framework is still functional but is planned to be replaced by a scalable import based on SMILA's job management.


Overview

CompoundManagement in SMILA is an extendable set of components. The central component is the CompoundManager. It manages CompoundHandlers, each of which is capable of extracting the elements of certain types of files (like zip or chm). Each CompoundHandler registers itself at the CompoundManager, providing a list of supported mime types. The CompoundManager provides functionality to check whether a given record contains a compound. It uses a MimeTypeIdentifier to identify the mime type of the given record and checks whether any registered CompoundHandler is capable of processing records of this mime type. It then delegates the processing to that CompoundHandler, which in turn creates a CompoundCrawler over the extracted elements of the compound record and passes the CompoundCrawler back. CompoundCrawlers are just like regular Crawlers. The difference is that they work on the given compound record only and not on an external data source.
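The registration and lookup described above can be sketched as follows. This is a minimal illustration and not the SMILA implementation: SketchHandler and HandlerRegistrySketch are hypothetical stand-ins, records are reduced to their mime type string, and mime type detection is assumed to have happened already.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for CompoundHandler; only the mime type list is modelled. */
interface SketchHandler {
  Collection<String> getSupportedMimeTypes();
}

/** Hypothetical registry showing how a manager could map mime types to handlers. */
final class HandlerRegistrySketch {
  private final Map<String, SketchHandler> handlersByMimeType = new HashMap<>();

  /** Called when a handler registers itself; indexes it under each supported mime type. */
  void register(SketchHandler handler) {
    for (String mimeType : handler.getSupportedMimeTypes()) {
      handlersByMimeType.put(mimeType, handler);
    }
  }

  /** Counterpart to isCompound(): true if some registered handler supports the mime type. */
  boolean isCompound(String mimeType) {
    return handlersByMimeType.containsKey(mimeType);
  }

  /** Counterpart to the delegation step of extract(): finds the responsible handler, if any. */
  Optional<SketchHandler> findHandler(String mimeType) {
    return Optional.ofNullable(handlersByMimeType.get(mimeType));
  }

  public static void main(String[] args) {
    HandlerRegistrySketch registry = new HandlerRegistrySketch();
    SketchHandler zipHandler = () -> Arrays.asList("application/zip", "application/java-archive");
    registry.register(zipHandler);
    System.out.println(registry.isCompound("application/zip")); // prints true
    System.out.println(registry.isCompound("text/plain"));      // prints false
  }
}
```

In this sketch a later registration for the same mime type silently replaces an earlier one; how the real CompoundManager resolves such conflicts is not stated on this page.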

The following chart shows all CompoundManagement components:

CompoundManagement.png


API

/**
 * The Interface CompoundManager.
 */
public interface CompoundManager {
 
  /**
   * Checks if a record is a compound object.
   * 
   * @param record
   *          the Record
   * @param config
   *          the DataSourceConnectionConfig
   * @return true if the record is a compound object and is extractable by this CompoundManager, false otherwise
   * @throws CompoundException
   *           if any error occurs
   */
  boolean isCompound(final Record record, final DataSourceConnectionConfig config) throws CompoundException;
 
  /**
   * Extracts the elements of the given record and returns a Crawler over the extracted elements.
   * 
   * @param record
   *          the Record
   * @param config
   *          the DataSourceConnectionConfig
   * @return a Crawler interface over the extracted elements
   * @throws CompoundException
   *           if any error occurs
   */
  Crawler extract(final Record record, final DataSourceConnectionConfig config) throws CompoundException;
 
  /**
   * Adapts the input record according to the given configuration. The record may be left unmodified, modified or even
   * set to null.
   * 
   * @param record
   *          the Record
   * @param config
   *          the DataSourceConnectionConfig
   * @return the adapted record
   * @throws CompoundException
   *           if any error occurs
   */
  Record adaptCompoundRecord(final Record record, final DataSourceConnectionConfig config) throws CompoundException;
}
/**
 * The Interface CompoundHandler.
 */
public interface CompoundHandler {
 
  /**
   * Gets the mime types the CompoundHandler is capable of extracting.
   * @return a Collection of mime types the CompoundHandler is capable of extracting.
   */
  Collection<String> getSupportedMimeTypes();
 
  /**
   * Extracts the elements of the given record and returns a Crawler over the extracted elements.
   * @param record
   *          the Record
   * @param config
   *          the DataSourceConnectionConfig
   * @return a Crawler interface over the extracted elements
   * @throws CompoundException
   *           if any error occurs
   */
  Crawler extract(final Record record, final DataSourceConnectionConfig config) throws CompoundException;
}
/**
 * The Interface CompoundCrawler.
 */
public interface CompoundCrawler extends Crawler {
 
  /**
   * Sets the compound record to extract data from.
   * 
   * @param record
   *          the compound Record
   * @throws CrawlerException
   *           if parameter record is null
   */
  void setCompoundRecord(final Record record) throws CrawlerException;
 
  /**
   * Gets the compound record.
   * 
   * @return the compound record.
   */
  Record getCompoundRecord();
}
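
The interfaces above suggest a calling pattern: ask isCompound(), and if the answer is true, call extract() and process the returned elements, which may themselves be compounds. The following sketch shows how a caller might do this recursively. It does not use the SMILA types: MiniRecord and MiniCompoundManager are hypothetical simplifications, the DataSourceConnectionConfig parameter is omitted, and a plain List stands in for the Crawler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified record: an id plus contained elements (empty if not a compound). */
final class MiniRecord {
  final String id;
  final List<MiniRecord> elements;

  MiniRecord(String id, MiniRecord... elements) {
    this.id = id;
    this.elements = Arrays.asList(elements);
  }
}

/** Stand-in for the two CompoundManager calls used below; a List replaces the Crawler. */
interface MiniCompoundManager {
  boolean isCompound(MiniRecord record);

  List<MiniRecord> extract(MiniRecord record);
}

final class CompoundTraversalSketch {

  /** Collects the ids of all non-compound records, descending into compounds recursively. */
  static List<String> collectLeafIds(MiniRecord record, MiniCompoundManager manager) {
    List<String> ids = new ArrayList<>();
    if (manager.isCompound(record)) {
      for (MiniRecord element : manager.extract(record)) {
        ids.addAll(collectLeafIds(element, manager));
      }
    } else {
      ids.add(record.id);
    }
    return ids;
  }

  /** Toy manager for the demo: a record counts as a compound if it has elements. */
  static MiniCompoundManager toyManager() {
    return new MiniCompoundManager() {
      @Override
      public boolean isCompound(MiniRecord record) {
        return !record.elements.isEmpty();
      }

      @Override
      public List<MiniRecord> extract(MiniRecord record) {
        return record.elements;
      }
    };
  }

  public static void main(String[] args) {
    MiniRecord inner = new MiniRecord("second.zip", new MiniRecord("myfile.txt"));
    MiniRecord outer = new MiniRecord("compressed_data.zip", inner, new MiniRecord("readme.txt"));
    System.out.println(collectLeafIds(outer, toyManager())); // prints [myfile.txt, readme.txt]
  }
}
```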


Implementations

Different implementations can be provided for all components. Most importantly, CompoundHandling is easy to extend by providing new CompoundHandler implementations.

org.eclipse.smila.connectivity.framework.impl

This bundle contains the default implementation of the CompoundManager interface as well as some abstract base classes for CompoundHandlers and CompoundCrawlers.

The CrawlerController implements the general processing logic common to all types of Crawlers. Its interface is a pure management interface that can be accessed through its Java interface or its wrapping JMX interface. The CompoundManager implementation has references to the following OSGi services:

  • MimeTypeIdentifier (1..1)
  • CompoundHandler (0..n)

CompoundHandlers register themselves at the CompoundManager.

The method adaptCompoundRecord() is not implemented yet; it simply returns the unmodified input record.

Configuration

There are no configuration options available for this bundle.


org.eclipse.smila.connectivity.framework.compound.zip

This bundle contains an implementation to handle zip archives. It can handle the mime types

  • application/zip
  • application/java-archive

It provides the OSGi Declarative Services ZipCompoundHandler and ZipCompoundCrawler. As with regular Crawlers, the ZipCompoundCrawler is a ComponentFactory: each time the method extract(...) is called on the ZipCompoundHandler, a new instance of ZipCompoundCrawler is created. Neither service depends on any other service, except that ZipCompoundHandler references the ZipCompoundCrawler.

For Id creation the ElementAttribute Path is used; for hash creation, the ElementAttribute LastModifiedDate.

The generated records contain a metadata element called "_compounds" that holds the (ordered) path through the compounds down to the last compound in which the file is contained.

For example, consider the following scenario: the zip file /path/to/data/folder/compressed_data.zip contains another zip file, path within zip/second.zip, which in turn contains the file path within second zip/myfile.txt. The Record for that file would then contain (among others) the following metadata elements:

  <Val key="Path">path within second zip/myfile.txt</Val>
  <Val key="Filename">myfile.txt</Val>
  <Seq key="_compounds">
    <Val>/path/to/data/folder/compressed_data.zip</Val>
    <Val>path within zip/second.zip</Val>
  </Seq>

With that information an application could work its way through the compounds to the contained file.
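
As an illustration of how an application might follow such a path, the sketch below resolves a file through nested zip archives using only java.util.zip. It is not part of SMILA: CompoundPathSketch and its methods are hypothetical names, the outermost archive is passed in as a byte array rather than read from the file system, and the demo archives are built in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class CompoundPathSketch {

  /** Returns the bytes of the named entry inside the given zip content. */
  static byte[] readEntry(byte[] zipBytes, String entryName) {
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        if (entry.getName().equals(entryName)) {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          byte[] buffer = new byte[4096];
          int read;
          while ((read = zip.read(buffer)) != -1) {
            out.write(buffer, 0, read);
          }
          return out.toByteArray();
        }
      }
      throw new IllegalArgumentException("No such entry: " + entryName);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /**
   * Follows the compound path: outermostZip is the content of the first "_compounds" value,
   * innerCompounds are the remaining values in order, and path is the record's "Path".
   */
  static byte[] resolve(byte[] outermostZip, List<String> innerCompounds, String path) {
    byte[] current = outermostZip;
    for (String compound : innerCompounds) {
      current = readEntry(current, compound);
    }
    return readEntry(current, path);
  }

  /** Builds a zip with a single entry; only used to create demo data. */
  static byte[] zipOf(String entryName, byte[] content) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
      zip.putNextEntry(new ZipEntry(entryName));
      zip.write(content);
      zip.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    byte[] inner = zipOf("path within second zip/myfile.txt", "hello".getBytes(StandardCharsets.UTF_8));
    byte[] outer = zipOf("path within zip/second.zip", inner);
    byte[] file = resolve(outer, Arrays.asList("path within zip/second.zip"),
        "path within second zip/myfile.txt");
    System.out.println(new String(file, StandardCharsets.UTF_8)); // prints hello
  }
}
```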

Note: The extract functionality is implemented using standard JDK zip file handling. Therefore, archives must contain only filenames in UTF-8 encoding. Many zip tools do not use UTF-8 but the platform default encoding, which leads to errors for some characters (e.g. German umlauts).
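
For background, the JDK itself has accepted an explicit charset for entry names since Java 7, through the ZipInputStream(InputStream, Charset) and ZipOutputStream(OutputStream, Charset) constructors. The sketch below demonstrates only that JDK behaviour; it does not imply that the SMILA bundle offers such an option (this page lists none). ZipCharsetSketch and its methods are hypothetical names, and ISO-8859-1 is chosen merely as an example of a non-UTF-8 encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipCharsetSketch {

  /** Lists the entry names of a zip, decoding them with the given charset. */
  static List<String> entryNames(byte[] zipBytes, Charset nameCharset) {
    List<String> names = new ArrayList<>();
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        names.add(entry.getName());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return names;
  }

  /** Creates demo data: a zip whose single (empty) entry has its name encoded with the given charset. */
  static byte[] zipWithEntry(String entryName, Charset nameCharset) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes, nameCharset)) {
      zip.putNextEntry(new ZipEntry(entryName));
      zip.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    // "\u00fc" is the German u-umlaut; in ISO-8859-1 it is a single byte that is not valid UTF-8.
    byte[] archive = zipWithEntry("M\u00fcller.txt", StandardCharsets.ISO_8859_1);
    // Reading with the same charset that was used for writing recovers the original name.
    System.out.println(entryNames(archive, StandardCharsets.ISO_8859_1));
  }
}
```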

Configuration

There are no configuration options available for this bundle.

Configuration

Whether and how CompoundHandling works is configured within each DataSourceConnectionConfig. A special element, CompoundHandling, contains this configuration. If this element is omitted, no CompoundHandling is done (compound records are processed as single documents). In contrast to regular Crawlers, the CompoundHandling configuration cannot be overridden by individual CompoundCrawlers; they all share the same configuration. In addition, it is not configurable how the keys and hashes of compound elements are created. This is determined by each CompoundCrawler implementation.

CompoundHandling configuration contains the following sub elements:

MimeTypeAttribute
The name of the attribute of the compound record containing the mime type of the ContentAttachment. If no mime type is set, any mime type detected by CompoundHandling is stored in an attribute with this name. This parameter is optional, but if it is not specified, ExtensionAttribute must be set.
ExtensionAttribute
The name of the attribute of the compound record containing the file extension. This parameter is optional, but if it is not specified, MimeTypeAttribute must be set.
ContentAttachment (required)
The name of the attachment of the compound record containing the content of the compound.
CompoundAttributes
A list of CompoundAttribute elements to be set on extracted compound elements.
CompoundAttribute
  • Type (required) – the data type (String, Integer or Date)
  • Name (required) – the attribute's name
  • Attachment – specifies whether the data is returned as an attachment instead of an attribute
  • ElementAttribute – the supported ElementAttribute types are LastModifiedDate, Path, Content, Size, FileExtension, Name
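
The constraints on the first three settings (ContentAttachment is required, and at least one of MimeTypeAttribute and ExtensionAttribute must be set) can be expressed as a small check. This is only a sketch of the rules as stated above, with a hypothetical class name and plain strings in place of the real configuration object; where and how SMILA enforces these rules is not described here.

```java
import java.util.Optional;

final class CompoundHandlingCheckSketch {

  /** Returns a description of the first violated rule, or an empty Optional if the settings are valid. */
  static Optional<String> validate(String mimeTypeAttribute, String extensionAttribute,
      String contentAttachment) {
    if (isBlank(contentAttachment)) {
      return Optional.of("ContentAttachment is required");
    }
    if (isBlank(mimeTypeAttribute) && isBlank(extensionAttribute)) {
      return Optional.of("Either MimeTypeAttribute or ExtensionAttribute must be set");
    }
    return Optional.empty();
  }

  /** Treats null, empty and whitespace-only values as "not set". */
  private static boolean isBlank(String value) {
    return value == null || value.trim().isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(validate("MimeType", null, "Content").isPresent()); // prints false
    System.out.println(validate(null, null, "Content").isPresent());       // prints true
  }
}
```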


Configuration example

Here is a sample snippet of a CompoundHandling configuration:

<CompoundHandling>
    <MimeTypeAttribute>MimeType</MimeTypeAttribute>
    <ExtensionAttribute>Extension</ExtensionAttribute>
    <ContentAttachment>Content</ContentAttachment>
    <CompoundAttributes>
        <CompoundAttribute Type="Date" Name="LastModifiedDate">
            <ElementAttribute>LastModifiedDate</ElementAttribute>
        </CompoundAttribute>
        <CompoundAttribute Type="String" Name="Path">
            <ElementAttribute>Path</ElementAttribute>
        </CompoundAttribute>
        <CompoundAttribute Type="String" Name="Content" Attachment="true">
            <ElementAttribute>Content</ElementAttribute>
        </CompoundAttribute>
        <CompoundAttribute Type="String" Name="Size">
            <ElementAttribute>Size</ElementAttribute>
        </CompoundAttribute>  		
        <CompoundAttribute Type="String" Name="Extension">
            <ElementAttribute>FileExtension</ElementAttribute>
        </CompoundAttribute>
        <CompoundAttribute Type="String" Name="Filename">
            <ElementAttribute>Name</ElementAttribute>
        </CompoundAttribute>  		
    </CompoundAttributes>
</CompoundHandling>

For details about the integration of this configuration part in some crawler's configuration please see Crawler documentation.
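
To make the structure of the sample concrete, the following standalone sketch reads a CompoundHandling snippet with the JDK's DOM parser and lists which ElementAttribute feeds each CompoundAttribute. It is an illustration only and not how SMILA loads its configuration; the class and method names are hypothetical, and the snippet is assumed to carry no XML namespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class CompoundAttributeMappingSketch {

  /** Maps each CompoundAttribute's Name to the ElementAttribute it is filled from, in document order. */
  static Map<String, String> readMappings(String compoundHandlingXml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(compoundHandlingXml.getBytes(StandardCharsets.UTF_8)));
      Map<String, String> mappings = new LinkedHashMap<>();
      // Tag names are matched exactly, so the CompoundAttributes wrapper is not selected here.
      NodeList attributes = doc.getElementsByTagName("CompoundAttribute");
      for (int i = 0; i < attributes.getLength(); i++) {
        Element attribute = (Element) attributes.item(i);
        Element source = (Element) attribute.getElementsByTagName("ElementAttribute").item(0);
        mappings.put(attribute.getAttribute("Name"), source.getTextContent().trim());
      }
      return mappings;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse CompoundHandling snippet", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<CompoundHandling>"
        + "<ContentAttachment>Content</ContentAttachment>"
        + "<CompoundAttributes>"
        + "<CompoundAttribute Type=\"String\" Name=\"Path\">"
        + "<ElementAttribute>Path</ElementAttribute></CompoundAttribute>"
        + "<CompoundAttribute Type=\"String\" Name=\"Filename\">"
        + "<ElementAttribute>Name</ElementAttribute></CompoundAttribute>"
        + "</CompoundAttributes>"
        + "</CompoundHandling>";
    System.out.println(readMappings(xml)); // prints {Path=Path, Filename=Name}
  }
}
```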