SMILA/Project Concepts/CompoundManagement

Description

Work out a concept to handle compound objects (objects that contain or can be split up into multiple objects).

Technical proposal

Overview

The CompoundManagement is responsible for extracting elements from compound objects of various mimetypes (like zip archives, Windows Help Files (hlp), etc.). The CompoundManagement provides a Crawler interface over the extracted elements. This interface is identical to the ones provided by CrawlerFactories, so it also provides delta indexing support. The processing of the various types of compound objects and the creation of "CompoundCrawlers" is delegated to so-called CompoundHandlers. Each CompoundHandler implementation is associated with specific mimetypes.

This chart shows the architecture of the CompoundManagement:

Note: The component CompoundHandlerRegistry is most likely obsolete, as its functionality (registration of CompoundHandlers) can be achieved by using OSGi technologies.

Configuration

CompoundHandlerRegistry Configuration

First, we need a configuration for the CompoundHandlerRegistry that associates each mimetype with a CompoundHandler implementation (there may be multiple implementations supporting the same mimetype). It could look like this:

<CompoundHandlerRegistry>
    <CompoundHandler mimetype="application/zip" class="org.eclipse.smila.irm.compoundmanagement.ZipCompoundHandler"/>
    <CompoundHandler mimetype="application/mshelp" class="org.eclipse.smila.irm.compoundmanagement.HlpCompoundHandler"/>
    <CompoundHandler mimetype="application/java-archive" class="org.eclipse.smila.irm.compoundmanagement.ZipCompoundHandler"/>
    ...
</CompoundHandlerRegistry>

Sebastian Voigt: This configuration could be omitted. The CompoundManagement should automatically resolve which compound "handlers" are installed. I would call them Compound Bundles or something similar, because each handler is deployed with a bundle. The CompoundManagement can use a defined extension point to find "installed" compound bundles. The extension point can be called org.eccenca.irm.compound.

This Extension Point should offer the following Interface:

{
    String getMimeType();
    String getCompoundHandlerName(); // returns a description of the Compound Handler (used for logging)
}

Before each indexing job the Compound Manager should retrieve all installed bundles that implement the Compound Extension point and should warn the user if there are bundles installed that address the same mimetype.
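The duplicate check described above could be sketched as follows. This is only an illustration under stated assumptions: the interface name CompoundHandlerExtension, the helper classes NamedHandler and MimeTypeConflicts, and the idea of passing the installed handlers in as a plain list are all hypothetical. How the list is actually obtained from the extension point is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical name for the extension point interface sketched above.
interface CompoundHandlerExtension {
    String getMimeType();
    String getCompoundHandlerName();
}

// Hypothetical value class, only used to build sample data.
final class NamedHandler implements CompoundHandlerExtension {
    private final String mimeType;
    private final String name;

    NamedHandler(String mimeType, String name) {
        this.mimeType = mimeType;
        this.name = name;
    }

    @Override public String getMimeType() { return mimeType; }
    @Override public String getCompoundHandlerName() { return name; }
}

final class MimeTypeConflicts {
    private MimeTypeConflicts() {}

    /**
     * Groups handler names by mimetype and returns only those mimetypes
     * that are claimed by more than one installed handler.
     */
    static Map<String, List<String>> find(List<CompoundHandlerExtension> installed) {
        Map<String, List<String>> byMimeType = new LinkedHashMap<>();
        for (CompoundHandlerExtension handler : installed) {
            byMimeType.computeIfAbsent(handler.getMimeType(), k -> new ArrayList<>())
                      .add(handler.getCompoundHandlerName());
        }
        // Drop mimetypes with a single handler; what remains are the conflicts.
        byMimeType.values().removeIf(names -> names.size() < 2);
        return byMimeType;
    }
}
```

A Compound Manager could call MimeTypeConflicts.find(...) before each indexing job and log one warning per returned entry, using the handler names for the message.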

CompoundHandler Runtime Configuration

At runtime, we then have to provide a configuration to the CompoundManagement, which passes it on to the CompoundHandler implementations. It contains information about how to process extracted data. This could/should contain:

information about the working directories to extract the data to
information about attributes that should be inherited from the compound object. During inheritance there may be special actions required, like
- replace existing values
- append to existing values
- set value, if no value exists

<inheritedAttributes>
    <Attribute name="accessRights" action="replace"/>
    <Attribute name="lastModified" action="replace"/>
    <Attribute name="abc" action="append"/>
    <Attribute name="xyz" action="set"/>
    ...
</inheritedAttributes>
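To make the three actions concrete, here is a minimal sketch of how they could be applied when an element inherits attributes from its compound object. It assumes, purely for illustration, that attributes are held as a map from attribute name to a list of values. The names InheritAction and AttributeInheritance are hypothetical and not part of any existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical enum mirroring the "action" values of the sample configuration.
enum InheritAction { REPLACE, APPEND, SET }

final class AttributeInheritance {
    private AttributeInheritance() {}

    /**
     * Copies attributes from the compound (parent) into an extracted element.
     * Assumption: attributes are modelled as name -> list of values.
     */
    static void inherit(Map<String, List<String>> parent,
                        Map<String, List<String>> element,
                        Map<String, InheritAction> rules) {
        for (Map.Entry<String, InheritAction> rule : rules.entrySet()) {
            String name = rule.getKey();
            List<String> parentValues = parent.get(name);
            if (parentValues == null) {
                continue; // the compound has no such attribute, nothing to inherit
            }
            switch (rule.getValue()) {
                case REPLACE: // replace existing values
                    element.put(name, new ArrayList<>(parentValues));
                    break;
                case APPEND:  // append to existing values
                    element.computeIfAbsent(name, k -> new ArrayList<>()).addAll(parentValues);
                    break;
                case SET:     // set value only if no value exists
                    element.putIfAbsent(name, new ArrayList<>(parentValues));
                    break;
            }
        }
    }
}
```

With the sample configuration, an element that already carries its own "accessRights" would end up with the compound's access rights, while an existing "xyz" value would be left untouched.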

information about filters. It would be great if the filters of an IRM configuration could be applied to CompoundHandlers (e.g. a filesystem is crawled and .log files are excluded, so we also want to exclude .log files contained in zips.)
- another option would be to let the Agent/Crawler Controller apply filtering logic on the Records returned by the CompoundCrawler by delegating it back to the Agent/Crawler. So filtering logic has to be part of the Crawler interface.
information on how to create Record IDs? Or is this logic up to the implementation?
information on how to create the Delta Indexing hash key (which attributes to use)
information on which attribute contains the content to be extracted

As different CompoundHandler implementations may need different configurations, we should make the configuration schema extensible, as done in the IRM configuration. Some configurations will be needed in all cases (like inheritance of attributes and the delta indexing hash), while others may be optional or different (like the configuration of the working environment and filters).

Alternative Compound Configuration

Sebastian Voigt: A CompoundHandler behaves like a Crawler and has the same workflow. Therefore it should use the same configuration file. The Compound Manager defines a Compound config schema, and each CompoundHandler can redefine the Attributes and the Process tags (as in the workflow for the IRM configuration). Process can be used to define behavior like filtering etc. for the extraction job. HashAttributes and KeyAttributes are used to build the Record and ID (built by the Controller).

The Compound configuration should additionally contain a description of an Index Job. The configuration is only used for this index job. Thus, for each CompoundHandler and each Index Job configuration there could be a separate config for the CompoundHandler (different behavior for different index jobs).

I would not add action tags like the ones defined above. The IRM Framework should not change attributes or their information; it is only responsible for returning information from a specific data source. Therefore I think attributes should not be joined or replaced. Usually the compound's contents do not fit the data source, e.g. Sharepoint and zips: Sharepoint objects have no path in a file system, and zips have only a sub path. There is no need to join or replace any information. The purpose of a CompoundHandler is to return additional attributes that further describe the object within a compound.

Daniel Stucky: We agreed that adaptation of attributes in compound elements is needed and that we should do it as early as possible, inside a CompoundHandler (the alternative was during BPEL, but as the data of the parent object is needed, we would have to store both the element's and the parent's attributes in the object's EILRecord). This "attribute inheritance" should be implemented once (abstract base class). Whether we really need different actions will be seen during implementation.

Sebastian Voigt: OK. We can adopt an IRM configuration for this job. We need the following information:

1) which attributes have to be gathered from the compound
2) where they should be stored
3) which operation is used when a value is stored in an attribute that has been inherited

--> CompoundConfiguration
Attribute: Which information should be gathered from the compound (the compound itself defines with its schema what is possible).
Name: The attribute of the Record in which the information should be stored (if this attribute already exists in the Record, it will be overwritten).
Attributes that should not be gathered but inherited are marked with an <Inherited/> tag.

<CompoundConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="CompoundHandlerZip.xsd">
  <IndexJob>
    FileSystemIndexJob
  </IndexJob>
  <Attributes>
    <Attribute Type="Date" Name="Date">
      <FileAttributes>FileDate</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Filename">
      <FileAttributes>Name</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Path">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="PermissionUsers">
      <Inherited/>
    </Attribute>
    <Attribute Type="StringCollection" Name="PermissionGroup">
      <Inherited/>
    </Attribute>
    <Attribute Type="String" Name="Content" MimeTypeAttribute="Content">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Extension" MimeTypeAttribute="FileExtension">
      <FileAttributes>FileExtension</FileAttributes>
    </Attribute>
  </Attributes>
  <Process>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
    </Filter>
  </Process>
</CompoundConfiguration>

In this example, the date, the filename, the path, the content and the extension are gathered from the compound, and the permissions are inherited from the compound itself.

Interfaces

interface CompoundManagement
{
    Crawler extract( Record compound, CMConfig config, String mimetype );
}

interface CompoundHandler
{
    Crawler extract( Record compound, CMConfig config );
}

interface CompoundHandlerRegistry
{
    CompoundHandler getCompoundHandler( String mimetype );
    void register( String mimetype, String clazz );
    void unregister( String mimetype );
}
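The way these three interfaces could cooperate is sketched below. This is not the SMILA implementation, only one possible reading of the proposal: Record, CMConfig and Crawler are empty placeholder types standing in for the real ones, and the classes SimpleCompoundHandlerRegistry, SimpleCompoundManagement and EmptyCrawlerHandler are hypothetical. The sketch assumes that each registered class has a no-argument constructor and is instantiated by reflection on lookup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Placeholder types; the real Record, CMConfig and Crawler are defined elsewhere.
interface Record {}
interface CMConfig {}
interface Crawler {}

interface CompoundHandler {
    Crawler extract(Record compound, CMConfig config);
}

interface CompoundHandlerRegistry {
    CompoundHandler getCompoundHandler(String mimetype);
    void register(String mimetype, String clazz);
    void unregister(String mimetype);
}

interface CompoundManagement {
    Crawler extract(Record compound, CMConfig config, String mimetype);
}

// Hypothetical registry: stores class names and instantiates handlers on lookup.
final class SimpleCompoundHandlerRegistry implements CompoundHandlerRegistry {
    private final Map<String, String> classByMimeType = new ConcurrentHashMap<>();

    @Override public void register(String mimetype, String clazz) {
        classByMimeType.put(mimetype, clazz);
    }

    @Override public void unregister(String mimetype) {
        classByMimeType.remove(mimetype);
    }

    @Override public CompoundHandler getCompoundHandler(String mimetype) {
        String clazz = classByMimeType.get(mimetype);
        if (clazz == null) {
            return null; // no handler registered for this mimetype
        }
        try {
            // Assumption: handlers offer a no-argument constructor.
            return (CompoundHandler) Class.forName(clazz).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz, e);
        }
    }
}

// Hypothetical CompoundManagement: looks up the handler and delegates to it.
final class SimpleCompoundManagement implements CompoundManagement {
    private final CompoundHandlerRegistry registry;

    SimpleCompoundManagement(CompoundHandlerRegistry registry) {
        this.registry = registry;
    }

    @Override public Crawler extract(Record compound, CMConfig config, String mimetype) {
        CompoundHandler handler = registry.getCompoundHandler(mimetype);
        if (handler == null) {
            throw new IllegalArgumentException("No CompoundHandler registered for " + mimetype);
        }
        return handler.extract(compound, config);
    }
}

// Trivial stand-in handler so the sketch can be exercised.
final class EmptyCrawlerHandler implements CompoundHandler {
    @Override public Crawler extract(Record compound, CMConfig config) {
        return new Crawler() {};
    }
}
```

If handler discovery is done through OSGi instead, as the note in the overview suggests, only the registry part would change; the delegation in the CompoundManagement could stay the same.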

Implementation

CompoundManagement and CompoundHandlerRegistry are more or less fixed components that do not need to be reimplemented by SMILA users. CompoundHandler implementations do the real work, and contributions are expected here. We should provide one or two sample implementations (I suggest one for zip files). Each CompoundHandler implementation is free in how it implements its functionality. It can be done in-process using Java libs or in external processes (like executing unzip.exe). There are no restrictions on these implementations.
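As an illustration of the in-process variant for zip files, the following sketch reads the elements of a zip archive using the standard java.util.zip classes. It only covers the extraction step: the class name ZipElements and the choice to return entry names mapped to their raw bytes are assumptions made for this example, and wrapping the result in a Crawler, building Records and inheriting attributes are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: extracts the file entries of a zip compound in memory.
final class ZipElements {
    private ZipElements() {}

    /** Returns entry name -> content for all non-directory entries, in archive order. */
    static Map<String, byte[]> read(InputStream compound) throws IOException {
        Map<String, byte[]> elements = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(compound)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content of their own
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                elements.put(entry.getName(), content.toByteArray());
            }
        }
        return elements;
    }
}
```

A real handler would probably extract large archives to a configured working directory instead of keeping everything in memory.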

The CompoundHandler interface could support SCA, but except for the technology independence I do not see a big gain here. CompoundHandlers should not be executed remotely!

CompoundManagement vs. Splitter

CompoundManagement and Splitter basically offer the same functionality:

input: one object
output: N objects

The usage of both is slightly different:

CompoundManagement
- is used in the IRM (in general "near" the data source)
- multiple types of compounds must be processed dynamically
Splitter
- is used in BPEL to provide Chapter or Page wise indexing
- usually only a single object type is split, because splitting is most likely based on INSO output and not done on the raw data

Therefore we should provide a BPEL service for splitting. This service should be configurable to support splitting of one concrete type. Internally we can reuse the concept for CompoundManagement, registering just a single CompoundHandler.

Notice: This Wiki is now read only and edits are no longer possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
