Difference between revisions of "SMILA/Specifications/CompoundManagementDiscussion"

Revision as of 05:22, 28 May 2009

WARNING: This page is under construction by Daniel Stucky

CompoundManagement improvements

The current CompoundManagement implementation is by no means finished or final. Below are some ideas and known issues that could, or in some cases must, be addressed in the future:

Integration in DeltaIndexingManager

Priority: SHOWSTOPPER

The current implementations of the DeltaIndexingManager do not handle compound elements correctly. We have to store dependencies between compound records and their elements. If a compound record is checked for update and DeltaIndexing determines that it needs no update, then all elements of the compound record have to be marked as visited, as well as the compound record itself. This has to be done recursively for nested compounds (zip in zip in ...).
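The recursive marking described above could look roughly like the following. This is only a minimal sketch under assumptions: the class CompoundVisitSketch, its methods and the use of plain String ids are hypothetical and are not part of the DeltaIndexingManager API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not the SMILA DeltaIndexingManager API. */
class CompoundVisitSketch {
  /** Maps a compound record id to the ids of its direct elements. */
  private final Map<String, List<String>> elementsByCompound = new HashMap<>();
  /** Ids of all records marked as visited in the current run. */
  private final Set<String> visited = new HashSet<>();

  /** Stores the dependency between a compound record and one of its elements. */
  void addElement(String compoundId, String elementId) {
    elementsByCompound.computeIfAbsent(compoundId, k -> new ArrayList<>()).add(elementId);
  }

  /** Marks the record and, recursively, all elements of nested compounds as visited. */
  void markVisited(String recordId) {
    if (!visited.add(recordId)) {
      return; // already marked; this also guards against cyclic dependencies
    }
    List<String> elements = elementsByCompound.getOrDefault(recordId, Collections.emptyList());
    for (String elementId : elements) {
      markVisited(elementId);
    }
  }

  boolean isVisited(String recordId) {
    return visited.contains(recordId);
  }
}
```

Calling markVisited on an unchanged outer zip would then also mark an inner zip and its files, while records of unrelated compounds stay unmarked.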

Adapting Compound Records

Priority: LOW

CompoundManager offers the method adaptCompoundRecord(...) to adapt the compound record after its elements were extracted. This is useful for the following scenarios:

if we do not want SMILA to process and index the compound records themselves we could delete the record
if we want to index the compound record (its metadata, the content makes no sense for any search engine) we can do so but we may want to remove the big content object before sending it to the workflow engine
anything else ...

At the moment this method is not implemented; it returns the unmodified record. Of course the adaptation should be configurable. Both of the options described above should be easy to implement.
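A configurable adaptation covering the two options above might be sketched as follows. The names AdaptSketch, Mode and SimpleRecord are hypothetical stand-ins for illustration; the real signature of adaptCompoundRecord(...) and the SMILA record type are not shown on this page and are not reproduced here.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a configurable compound record adaptation. */
class AdaptSketch {
  /** Assumed configuration values; not an existing SMILA setting. */
  enum Mode { KEEP, DROP_RECORD, STRIP_CONTENT }

  /** Simplified stand-in for a record: id, metadata and an optional content object. */
  static final class SimpleRecord {
    final String id;
    final Map<String, String> metadata;
    final byte[] content; // may be null if the content was removed

    SimpleRecord(String id, Map<String, String> metadata, byte[] content) {
      this.id = id;
      this.metadata = metadata;
      this.content = content;
    }
  }

  /**
   * Returns the adapted record, or an empty Optional if the compound record
   * itself should not be processed and indexed.
   */
  static Optional<SimpleRecord> adapt(SimpleRecord record, Mode mode) {
    switch (mode) {
      case DROP_RECORD:
        return Optional.empty();
      case STRIP_CONTENT:
        // keep id and metadata, drop the big content object
        return Optional.of(new SimpleRecord(record.id, record.metadata, null));
      default:
        return Optional.of(record); // KEEP: current behaviour, record is unmodified
    }
  }
}
```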

Compound Inheritance

Priority: LOW

It should be possible to "inherit" attributes, attachments and annotations from a compound record to its elements. A good example is access rights, which are associated with the compound record but are lost when the elements are processed. The inheritance should be configurable:

what attributes/attachments/annotations are inherited
how are they inherited (execution mode)
- add: the compound record values are added to existing element values
- replace: the element's values (if any exist) are replaced by the compound record's values
- setIfEmpty: the values from the compound record are set on the element if no values exist
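The three execution modes listed above can be illustrated with a small sketch that works on plain value lists. The class InheritanceSketch and its method are hypothetical; how values are actually stored on SMILA records is not assumed here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the inheritance execution modes. */
class InheritanceSketch {
  enum Mode { ADD, REPLACE, SET_IF_EMPTY }

  /** Returns the values the element carries after inheriting from the compound record. */
  static List<String> inherit(List<String> compoundValues, List<String> elementValues, Mode mode) {
    switch (mode) {
      case ADD: {
        // compound record values are added to the existing element values
        List<String> merged = new ArrayList<>(elementValues);
        merged.addAll(compoundValues);
        return merged;
      }
      case REPLACE:
        // element values (if any) are replaced by the compound record values
        return new ArrayList<>(compoundValues);
      case SET_IF_EMPTY:
        // compound record values are only used if the element has no values
        return elementValues.isEmpty()
            ? new ArrayList<>(compoundValues)
            : new ArrayList<>(elementValues);
      default:
        throw new IllegalArgumentException("unknown mode: " + mode);
    }
  }
}
```

For access rights, for example, replace would give every element exactly the rights of the compound record, whereas setIfEmpty would keep any rights that an element already carries.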

Filtering of compound elements

Priority: LOW

It should be possible to configure filters for compound elements so that certain elements of a compound record are ignored, just as within regular crawlers. It would be great if the filters of the data source the compound record originates from could be reused, but I guess that Crawler/Agent configurations may get too heterogeneous. So a separate filter mechanism could be applied that works only on the commonly defined CompoundAttributes (those are the only available attributes anyway).
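Such a separate filter mechanism could be sketched as a list of predicates over the attributes of an element. Everything in this sketch is an assumption made for illustration: the class ElementFilterSketch, the representation of attributes as a String map, and the attribute name "fileName" used in the usage note are not taken from the CompoundAttributes definition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical sketch of a filter mechanism for compound elements. */
class ElementFilterSketch {
  /** Builds a filter that rejects elements whose attribute value matches the regex. */
  static Predicate<Map<String, String>> excludeByPattern(String attribute, String regex) {
    Pattern pattern = Pattern.compile(regex);
    return attributes -> {
      String value = attributes.get(attribute);
      // elements without the attribute are accepted
      return value == null || !pattern.matcher(value).matches();
    };
  }

  /** Returns only those elements that are accepted by all filters. */
  static List<Map<String, String>> apply(
      List<Map<String, String>> elements, List<Predicate<Map<String, String>>> filters) {
    List<Map<String, String>> kept = new ArrayList<>();
    for (Map<String, String> element : elements) {
      boolean accepted = true;
      for (Predicate<Map<String, String>> filter : filters) {
        if (!filter.test(element)) {
          accepted = false;
          break;
        }
      }
      if (accepted) {
        kept.add(element);
      }
    }
    return kept;
  }
}
```

A filter created with excludeByPattern("fileName", ".*\\.tmp") would, under these assumptions, drop temporary files from a zip while keeping all other elements.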

CompoundHandling for Agents

Priority: LOW

At the moment CompoundHandling is only used in the CrawlerController (to be precise, in the class CrawlThread). It should also be available in the AgentController processing logic, although we currently do not have an Agent that provides compound records. Perhaps we can enhance the mock agent to send compound records, if so desired, to allow testing.

Dependency to MimeTypeIdentifier

Priority: HIGH

The CompoundManager needs a MimeTypeIdentifier service to be able to identify the mime type of an incoming object and to decide whether it is a compound or not. This already works fine. However, the MimeTypeIdentifier interface and the SimpleMimeTypeIdentifier service are located in bundle org.eclipse.smila.processing.pipelets.mimetype, which entails dependencies on org.eclipse.smila.processing and some of its sub-bundles. We should move the MimeTypeIdentifier interface and the SimpleMimeTypeIdentifier into different packages outside of processing. Perhaps utils is a good place, but we have to separate interface and implementation to allow for other implementations (ApertureMimetypeidentifier will definitely come). Then the SimpleMimeTypeIdentifier should also be separated into a pure service and a ProcessingService. The ProcessingService should be located in org.eclipse.smila.processing.pipelets and it should be independent of the MimeTypeIdentifier service used. It should work with any MimeTypeIdentifier and contain appropriate logic to find mime type and/or extension information about the files (e.g. the web crawler metadata).
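The proposed separation of interface and implementation could be sketched as below. This is an illustration only: the nested types Identifier, ExtensionIdentifier and CompoundDetector are hypothetical, and the single method identify(String) is an assumed simplification that does not claim to match the existing MimeTypeIdentifier interface.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of separating the identifier interface from its implementations. */
class MimeTypeSketch {
  /** Assumed interface; it would live in a bundle outside of processing. */
  interface Identifier {
    /** Returns the mime type for a file extension, or null if it is unknown. */
    String identify(String fileExtension);
  }

  /** Assumed simple implementation; other implementations could be swapped in. */
  static final class ExtensionIdentifier implements Identifier {
    private final Map<String, String> byExtension = new HashMap<>();

    ExtensionIdentifier() {
      byExtension.put("zip", "application/zip");
      byExtension.put("txt", "text/plain");
    }

    @Override
    public String identify(String fileExtension) {
      return fileExtension == null ? null : byExtension.get(fileExtension.toLowerCase(Locale.ROOT));
    }
  }

  /** A consumer that depends only on the interface, never on a concrete implementation. */
  static final class CompoundDetector {
    private final Identifier identifier;
    private final Set<String> compoundTypes = new HashSet<>();

    CompoundDetector(Identifier identifier) {
      this.identifier = identifier;
      compoundTypes.add("application/zip");
    }

    boolean isCompound(String fileExtension) {
      String mimeType = identifier.identify(fileExtension);
      return mimeType != null && compoundTypes.contains(mimeType);
    }
  }
}
```

Because CompoundDetector only sees the interface, a different identifier implementation can be injected without touching the consumer, which is the point of moving the interface out of the processing bundles.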

Tutorial

Priority: HIGH

We should add a Tutorial on "How to implement a CompoundHandler and CompoundCrawler", since this is a common extension point where contributors add their own functionality to SMILA.


Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
