Difference between revisions of "SMILA/Documentation/CompoundManagement"

Revision as of 05:08, 28 May 2009

Overview

CompoundManagement in SMILA is an extensible set of components. The central component is the CompoundManager. It manages CompoundHandlers, each of which can extract the elements of certain file types (such as zip or chm). Each CompoundHandler registers itself with the CompoundManager, providing a list of supported mime types. The CompoundManager provides functionality to check whether a given record contains a compound. It uses a MimeTypeIdentifier to identify the mime type of the given record and checks whether any registered CompoundHandler can process records of this mime type. It then delegates the processing to that CompoundHandler, which in turn creates a CompoundCrawler over the extracted elements of the compound record and passes the CompoundCrawler back. CompoundCrawlers are just like regular Crawlers; the difference is that they work only on the given compound record and not on an external data source.
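The registration and delegation flow described above can be sketched roughly as follows. This is a minimal illustration, not SMILA code: SketchHandler, SketchCompoundManager and CompoundDispatchDemo are hypothetical stand-ins, the mime type is passed in directly instead of being detected by a MimeTypeIdentifier, and extraction returns a plain list of element names instead of a CompoundCrawler.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a CompoundHandler (the real one works on Records and returns a Crawler). */
interface SketchHandler {
  Collection<String> getSupportedMimeTypes();

  List<String> extract(String compoundName);
}

/** Hypothetical stand-in for the CompoundManager: a handler registry keyed by mime type. */
class SketchCompoundManager {
  private final Map<String, SketchHandler> handlersByMimeType = new HashMap<>();

  /** Registers the handler for every mime type it reports as supported. */
  void register(SketchHandler handler) {
    for (String mimeType : handler.getSupportedMimeTypes()) {
      handlersByMimeType.put(mimeType, handler);
    }
  }

  /** A record counts as a compound if some registered handler supports its mime type. */
  boolean isCompound(String mimeType) {
    return handlersByMimeType.containsKey(mimeType);
  }

  /** Delegates the extraction to the handler registered for the mime type. */
  List<String> extract(String mimeType, String compoundName) {
    SketchHandler handler = handlersByMimeType.get(mimeType);
    if (handler == null) {
      throw new IllegalArgumentException("No handler registered for " + mimeType);
    }
    return handler.extract(compoundName);
  }
}

class CompoundDispatchDemo {
  static SketchCompoundManager newManagerWithZipHandler() {
    SketchCompoundManager manager = new SketchCompoundManager();
    manager.register(new SketchHandler() {
      @Override
      public Collection<String> getSupportedMimeTypes() {
        return Arrays.asList("application/zip", "application/java-archive");
      }

      @Override
      public List<String> extract(String compoundName) {
        // Placeholder result: a real handler would unpack the archive here.
        return Arrays.asList(compoundName + "/a.txt", compoundName + "/b.txt");
      }
    });
    return manager;
  }

  public static void main(String[] args) {
    SketchCompoundManager manager = newManagerWithZipHandler();
    System.out.println(manager.isCompound("application/zip"));
    System.out.println(manager.isCompound("text/plain"));
    System.out.println(manager.extract("application/zip", "docs.zip"));
  }
}
```

Keying the registry by mime type keeps the lookup in isCompound and extract trivial; how the actual CompoundManager stores its handlers is not specified on this page.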

The following chart shows all CompoundManagement components:

CompoundManagement.png

Note: DeltaIndexing does not yet support the handling of compound elements. A second run on an unmodified data source containing compounds will lead to the deletion of all compound elements. This feature will be added in M3.


API

/**
 * The Interface CompoundManager.
 */
public interface CompoundManager {
 
  /**
   * Checks if a record is a compound object.
   * 
   * @param record
   *          the Record
   * @param config
   *          the DataSourceConnectionConfig
   * @return true if the record is a compound object and is extractable by this CompoundManager, false otherwise
   * @throws CompoundException
   *           if any error occurs
   */
  boolean isCompound(final Record record, final DataSourceConnectionConfig config) throws CompoundException;
 
  /**
   * Extracts the elements of the given record and returns a Crawler over the extracted elements.
   * 
   * @param record
   *          the Record
   * @param config
   *          the DataSourceConnectionConfig
   * @return a Crawler interface over the extracted elements
   * @throws CompoundException
   *           if any error occurs
   */
  Crawler extract(final Record record, final DataSourceConnectionConfig config) throws CompoundException;
 
  /**
   * Adopts the input record according to the given configuration. The record may be left unmodified, modified or even
   * set to null.
   * 
   * @param record
   *          the Record
   * @param config
   *          the DataSourceConnectionConfig
   * @return the adopted record
   * @throws CompoundException
   *           if any error occurs
   */
  Record adoptCompoundRecord(final Record record, final DataSourceConnectionConfig config) throws CompoundException;
}
/**
 * The Interface CompoundHandler.
 */
public interface CompoundHandler {
 
  /**
   * Gets the mime types the CompoundHandler is able to extract.
   * @return a Collection of mime types the CompoundHandler is able to extract.
   */
  Collection<String> getSupportedMimeTypes();
 
  /**
   * Extracts the elements of the given record and returns a Crawler over the extracted elements.
   * @param record
   *          the Record
   * @param config
   *          the DataSourceConnectionConfig
   * @return a Crawler interface over the extracted elements
   * @throws CompoundException
   *           if any error occurs
   */
  Crawler extract(final Record record, final DataSourceConnectionConfig config) throws CompoundException;
}
/**
 * The Interface CompoundCrawler.
 */
public interface CompoundCrawler extends Crawler {
 
  /**
   * Sets the compound record to extract data from.
   * 
   * @param record
   *          the compound Record
   * @throws CrawlerException
   *           if parameter record is null
   */
  void setCompoundRecord(final Record record) throws CrawlerException;
 
  /**
   * Gets the compound record.
   * 
   * @return the compound record.
   */
  Record getCompoundRecord();
}


Implementations

All components can be replaced by different implementations. Most importantly, CompoundHandling is easy to extend by providing new CompoundHandler implementations.
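To give an idea of what a new handler implementation involves, here is a minimal sketch that treats a gzip stream as a compound with exactly one element. It is only an illustration under simplifying assumptions: ElementExtractor is a hypothetical contract that takes raw bytes and returns the extracted elements, whereas a real CompoundHandler receives a Record plus a DataSourceConnectionConfig and returns a Crawler. The mime type application/gzip is also just an example and not one the page lists as supported.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical, simplified handler contract: compound bytes in, extracted elements out. */
interface ElementExtractor {
  Collection<String> getSupportedMimeTypes();

  List<byte[]> extract(byte[] compoundContent);
}

/** Sketch of a handler that unpacks the single element contained in a gzip stream. */
class GzipElementExtractor implements ElementExtractor {

  @Override
  public Collection<String> getSupportedMimeTypes() {
    return Collections.singletonList("application/gzip");
  }

  @Override
  public List<byte[]> extract(byte[] compoundContent) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compoundContent));
        ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buffer = new byte[4096];
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
      return Collections.singletonList(out.toByteArray());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Demo helper: compresses a string in memory. */
  static byte[] gzip(String text) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Demo helper: compresses the text, extracts it again and decodes the single element. */
  static String roundTrip(String text) {
    byte[] element = new GzipElementExtractor().extract(gzip(text)).get(0);
    return new String(element, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("hello compound"));
  }
}
```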

org.eclipse.smila.connectivity.framework.impl

This bundle contains the default implementation of the CompoundManager interface as well as some abstract base classes for CompoundHandlers and CompoundCrawlers.

The CrawlerController implements the general processing logic common to all types of Crawlers. Its interface is a pure management interface that can be accessed through its Java interface or its wrapping JMX interface. The default CompoundManager implementation has references to the following OSGi services:

  • MimeTypeIdentifier (1..1)
  • CompoundHandler (0..n)

CompoundHandlers register themselves at the CompoundManager.

The method adoptCompoundRecord() is not implemented yet. It currently returns the input record unmodified.

Configuration

There are no configuration options available for this bundle.


org.eclipse.smila.connectivity.framework.compound.zip

This bundle contains an implementation to handle zip archives. It can handle the mime types

  • application/zip
  • application/java-archive

It provides the OSGi Declarative Services ZipCompoundHandler and ZipCompoundCrawler. As with regular Crawlers, the ZipCompoundCrawler is a ComponentFactory: each time extract(...) is called on the ZipCompoundHandler, a new ZipCompoundCrawler instance is created. Neither service depends on other services, except that ZipCompoundHandler references the ZipCompoundCrawler.

The ElementAttribute Path is used for Id creation, and the ElementAttribute LastModifiedDate is used for hash creation.

Note: The extract functionality is implemented using standard JDK zip file handling. Therefore archives must contain only filenames in UTF-8 encoding. Many zip tools do not use UTF-8 but the platform default encoding, which leads to errors for some characters (e.g. German umlauts).
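The encoding issue can be reproduced with the JDK classes alone. The sketch below writes an in-memory zip whose entry name is encoded in ISO-8859-1 and reads it back by telling the JDK which charset the names use. It relies on the ZipOutputStream and ZipInputStream constructors that accept a Charset, which exist since Java 7 and therefore postdate this page; whether the ZipCompoundCrawler uses them is not stated here, and the class ZipNameEncodingDemo is purely illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipNameEncodingDemo {

  /** Builds an in-memory zip with one entry whose name is encoded in the given charset. */
  static byte[] zipWithEntry(String entryName, Charset nameCharset) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes, nameCharset)) {
      zip.putNextEntry(new ZipEntry(entryName));
      zip.write("content".getBytes(StandardCharsets.UTF_8));
      zip.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Lists the entry names of a zip, decoding them with the given charset. */
  static List<String> entryNames(byte[] zipBytes, Charset nameCharset) {
    List<String> names = new ArrayList<>();
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
      for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
        names.add(entry.getName());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return names;
  }

  public static void main(String[] args) {
    // "K\u00e4se.txt" contains a German umlaut; the escape keeps this source file ASCII-only.
    byte[] archive = zipWithEntry("K\u00e4se.txt", StandardCharsets.ISO_8859_1);
    // Decoding the names with the charset that was used for writing recovers the umlaut.
    System.out.println(entryNames(archive, StandardCharsets.ISO_8859_1));
  }
}
```

Reading the same archive with the default UTF-8 decoding is the failure case the note describes, because the single byte used for the umlaut in ISO-8859-1 is not valid UTF-8.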

Configuration

There are no configuration options available for this bundle.

Configuration

Whether and how CompoundHandling works is configured within each DataSourceConnectionConfig, in a special element named CompoundHandling. If this element is omitted, no CompoundHandling is done and compound records are processed as single documents. In contrast to regular Crawlers, the CompoundHandling configuration cannot be overridden by individual CompoundCrawlers; they all share the same configuration. In addition, it is not configurable how the keys and hashes of compound elements are created. This is determined by each CompoundCrawler implementation.

The CompoundHandling configuration contains the following sub-elements:

MimeTypeAttribute (optional)
    The name of the attribute of the compound record containing the mime type of the ContentAttachment. If the record has no mime type set, the mime type detected by CompoundHandling is stored in an attribute with this name. If this element is not specified, ExtensionAttribute must be set.
ExtensionAttribute (optional)
    The name of the attribute of the compound record containing the file extension. If this element is not specified, MimeTypeAttribute must be set.
ContentAttachment (required)
    The name of the attachment of the compound record containing the content of the compound.
CompoundAttributes
    A list of CompoundAttribute elements to be set on extracted compound elements.
CompoundAttribute
      • Type (required): the data type (String, Integer or Date)
      • Name (required): the attribute's name
      • Attachment: specifies whether the data is returned as an attachment instead of an attribute
ElementAttribute
    The supported ElementAttribute types are LastModifiedDate, Path, Content, Size, FileExtension and Name.
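
The two constraints on the top-level elements (ContentAttachment is required, and at least one of MimeTypeAttribute and ExtensionAttribute must be set) can be expressed as a small check. The following is a hedged sketch only: CompoundHandlingSettings is a hypothetical holder class invented for this example, and it does not reflect how SMILA validates its configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the three top-level settings; not a SMILA class. */
class CompoundHandlingSettings {
  private final String mimeTypeAttribute;  // optional, may be null
  private final String extensionAttribute; // optional, may be null
  private final String contentAttachment;  // required

  CompoundHandlingSettings(String mimeTypeAttribute, String extensionAttribute,
      String contentAttachment) {
    this.mimeTypeAttribute = mimeTypeAttribute;
    this.extensionAttribute = extensionAttribute;
    this.contentAttachment = contentAttachment;
  }

  private static boolean isBlank(String value) {
    return value == null || value.trim().isEmpty();
  }

  /** Returns a list of problems; an empty list means the settings satisfy both rules. */
  List<String> validate() {
    List<String> problems = new ArrayList<>();
    if (isBlank(contentAttachment)) {
      problems.add("ContentAttachment is required");
    }
    if (isBlank(mimeTypeAttribute) && isBlank(extensionAttribute)) {
      problems.add("Either MimeTypeAttribute or ExtensionAttribute must be set");
    }
    return problems;
  }

  public static void main(String[] args) {
    System.out.println(new CompoundHandlingSettings("MimeType", null, "Content").validate());
    System.out.println(new CompoundHandlingSettings(null, null, "Content").validate());
  }
}
```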


Configuration example

Here is a sample snippet of a CompoundHandling configuration:

<CompoundHandling>
    <MimeTypeAttribute>MimeType</MimeTypeAttribute>
    <ExtensionAttribute>Extension</ExtensionAttribute>
    <ContentAttachment>Content</ContentAttachment>
    <CompoundAttributes>
        <CompoundAttribute Type="Date" Name="LastModifiedDate">
            <ElementAttribute>LastModifiedDate</ElementAttribute>
        </CompoundAttribute>
        <CompoundAttribute Type="String" Name="Path">
            <ElementAttribute>Path</ElementAttribute>
        </CompoundAttribute>
        <CompoundAttribute Type="String" Name="Content" Attachment="true">
            <ElementAttribute>Content</ElementAttribute>
        </CompoundAttribute>
        <CompoundAttribute Type="String" Name="Size">
            <ElementAttribute>Size</ElementAttribute>
        </CompoundAttribute>  		
        <CompoundAttribute Type="String" Name="Extension">
            <ElementAttribute>FileExtension</ElementAttribute>
        </CompoundAttribute>
        <CompoundAttribute Type="String" Name="Filename">
            <ElementAttribute>Name</ElementAttribute>
        </CompoundAttribute>  		
    </CompoundAttributes>
</CompoundHandling>
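
To illustrate how the snippet maps compound attribute names to element attributes, the following sketch reads such a fragment with the JDK's DOM parser and collects the Name to ElementAttribute pairs. It is an assumption-laden illustration: CompoundHandlingConfigReader is a hypothetical class, the fragment is parsed standalone without namespaces or schema validation, and SMILA's own configuration loading may work quite differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CompoundHandlingConfigReader {

  /** Returns the mapping from each CompoundAttribute's Name to its ElementAttribute, in document order. */
  static Map<String, String> attributeMappings(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Map<String, String> mappings = new LinkedHashMap<>();
      // Matches the exact tag name, so the enclosing CompoundAttributes element is not included.
      NodeList attributes = doc.getElementsByTagName("CompoundAttribute");
      for (int i = 0; i < attributes.getLength(); i++) {
        Element attribute = (Element) attributes.item(i);
        Element source = (Element) attribute.getElementsByTagName("ElementAttribute").item(0);
        mappings.put(attribute.getAttribute("Name"), source.getTextContent().trim());
      }
      return mappings;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse CompoundHandling configuration", e);
    }
  }

  public static void main(String[] args) {
    // Shortened version of the sample configuration above.
    String xml = "<CompoundHandling>"
        + "<ContentAttachment>Content</ContentAttachment>"
        + "<CompoundAttributes>"
        + "<CompoundAttribute Type=\"String\" Name=\"Path\">"
        + "<ElementAttribute>Path</ElementAttribute></CompoundAttribute>"
        + "<CompoundAttribute Type=\"String\" Name=\"Filename\">"
        + "<ElementAttribute>Name</ElementAttribute></CompoundAttribute>"
        + "</CompoundAttributes>"
        + "</CompoundHandling>";
    System.out.println(attributeMappings(xml));
  }
}
```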
