Difference between revisions of "SMILA/Specifications/CompoundManagementDiscussion"


Revision as of 05:22, 28 May 2009


WARNING: This page is under construction by Daniel Stucky
 


CompoundManagement improvements

The current CompoundManagement implementation is by no means finished or final. Below are some ideas and known issues that could, or even have to, be addressed in the future:

Integration in DeltaIndexingManager

Priority
SHOWSTOPPER

The current implementations of the DeltaIndexingManager do not handle compound elements correctly. We have to store the dependencies between compound records and their elements. If a compound record is checked for update and DeltaIndexing determines that it needs no update, then all elements of the compound record have to be marked as visited, as well as the compound record itself. This has to be done recursively for nested compounds (zip in zip in ...).
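
The recursive marking described above can be sketched as follows. This is only an illustration, not the DeltaIndexingManager API: the class `CompoundVisitTracker`, its methods, and the use of plain string ids are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: tracks compound/element dependencies and visited flags. */
final class CompoundVisitTracker {
  // compound record id -> ids of its direct elements (assumed to be stored at extraction time)
  private final Map<String, List<String>> elementsByCompound = new HashMap<>();
  private final Set<String> visited = new HashSet<>();

  /** Stores the dependency between a compound record and one of its elements. */
  void registerElement(String compoundId, String elementId) {
    elementsByCompound.computeIfAbsent(compoundId, k -> new ArrayList<>()).add(elementId);
  }

  /** Marks the record and, transitively, all nested elements as visited. */
  void markVisitedRecursively(String recordId) {
    Deque<String> pending = new ArrayDeque<>();
    pending.push(recordId);
    while (!pending.isEmpty()) {
      String current = pending.pop();
      if (!visited.add(current)) {
        continue; // already marked, avoids repeated work
      }
      List<String> children =
          elementsByCompound.getOrDefault(current, Collections.emptyList());
      for (String child : children) {
        pending.push(child); // nested compounds are handled by the same loop
      }
    }
  }

  boolean isVisited(String id) {
    return visited.contains(id);
  }
}
```

An explicit stack is used instead of method recursion so that deeply nested compounds cannot exhaust the call stack.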


Adapting Compound Records

Priority
LOW

CompoundManager offers the method adaptCompoundRecord(...) to adapt the compound record after its elements have been extracted. This is useful for the following scenarios:

  • if we do not want SMILA to process and index the compound records themselves we could delete the record
  • if we want to index the compound record (only its metadata, since the content itself makes no sense for any search engine) we can do so, but we may want to remove the big content object before sending the record to the workflow engine
  • anything else ...

At the moment this method is not implemented; it returns the unmodified record. Of course the adaptation should be configurable. Both of the options described above should be easy to implement.
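
A minimal sketch of how a configurable adaptation could look is given below. It does not reflect the actual signature of adaptCompoundRecord(...): the record is modelled as a plain map, and the names `CompoundRecordAdapter`, `Policy`, and the attachment key `Content` are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a configurable compound record adaptation. */
final class CompoundRecordAdapter {
  /** Assumed configuration options, one per scenario listed above. */
  enum Policy { KEEP_UNMODIFIED, DROP_RECORD, STRIP_CONTENT }

  // Assumed name of the attachment that holds the compound's binary content.
  static final String CONTENT_KEY = "Content";

  private final Policy policy;

  CompoundRecordAdapter(Policy policy) {
    this.policy = policy;
  }

  /** Returns the adapted record, or an empty Optional if the record should not be processed. */
  Optional<Map<String, Object>> adapt(Map<String, Object> record) {
    switch (policy) {
      case DROP_RECORD:
        return Optional.empty();
      case STRIP_CONTENT: {
        // Copy first so the caller's record is left untouched.
        Map<String, Object> copy = new HashMap<>(record);
        copy.remove(CONTENT_KEY);
        return Optional.of(copy);
      }
      default:
        return Optional.of(record); // current behaviour: unmodified record
    }
  }
}
```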

Compound Inheritance

Priority
LOW

It should be possible to "inherit" attributes, attachments and annotations from a compound record to its elements. A good example is access rights, which are associated with the compound record but are lost when the elements are processed. The inheritance should be configurable:

  • what attributes/attachments/annotations are inherited
  • how are they inherited (execution mode)
    • add: the compound record values are added to existing element values
    • replace: the element's values (if any exist) are replaced by the compound record's values
    • setIfEmpty: the values from the compound record are set on the element if no values exist
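
The three execution modes can be sketched as follows. This is an illustration under simplifying assumptions: attributes are modelled as maps from names to string value lists, the names `CompoundInheritance` and `Mode` are hypothetical, and a rule is skipped when the compound record has no values for the attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of configurable inheritance from a compound record to an element. */
final class CompoundInheritance {
  enum Mode { ADD, REPLACE, SET_IF_EMPTY }

  /**
   * Applies each rule (attribute name -> mode) to the element.
   * Assumption: attributes without values on the compound record are ignored.
   */
  static void inherit(Map<String, List<String>> compound,
                      Map<String, List<String>> element,
                      Map<String, Mode> rules) {
    for (Map.Entry<String, Mode> rule : rules.entrySet()) {
      String name = rule.getKey();
      List<String> parentValues = compound.get(name);
      if (parentValues == null || parentValues.isEmpty()) {
        continue;
      }
      List<String> own = element.get(name);
      boolean elementEmpty = own == null || own.isEmpty();
      switch (rule.getValue()) {
        case ADD: {
          // Keep the element's values and append the compound's values.
          List<String> merged = new ArrayList<>();
          if (!elementEmpty) {
            merged.addAll(own);
          }
          merged.addAll(parentValues);
          element.put(name, merged);
          break;
        }
        case REPLACE:
          element.put(name, new ArrayList<>(parentValues));
          break;
        case SET_IF_EMPTY:
          if (elementEmpty) {
            element.put(name, new ArrayList<>(parentValues));
          }
          break;
      }
    }
  }
}
```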


Filtering of compound elements

Priority
LOW

It should be possible to configure filters for compound elements so that certain elements of a compound record are ignored, just as within regular crawlers. It would be great if the filters of the data source the compound record originates from could be reused, but I guess that the Crawler/Agent configurations may get too heterogeneous. So a separate filter mechanism could be applied that works only on the commonly defined CompoundAttributes (those are the only available attributes anyway).
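
Such a separate filter mechanism could be sketched with composable predicates, as shown below. The attributes used here (element name and size) and the names `CompoundElementFilter` and `ElementInfo` are assumptions for the example; the actual set of CompoundAttributes is not listed on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical sketch of a filter that works only on common compound attributes. */
final class CompoundElementFilter {
  /** Assumed minimal attribute set of a compound element. */
  static final class ElementInfo {
    final String name;
    final long size;

    ElementInfo(String name, long size) {
      this.name = name;
      this.size = size;
    }
  }

  /** Includes elements whose full name matches the regular expression. */
  static Predicate<ElementInfo> nameMatches(String regex) {
    Pattern pattern = Pattern.compile(regex);
    return e -> pattern.matcher(e.name).matches();
  }

  /** Includes elements that are not larger than the given number of bytes. */
  static Predicate<ElementInfo> maxSize(long bytes) {
    return e -> e.size <= bytes;
  }

  /** Returns only the elements accepted by the include filter, preserving order. */
  static List<ElementInfo> apply(List<ElementInfo> elements, Predicate<ElementInfo> include) {
    List<ElementInfo> kept = new ArrayList<>();
    for (ElementInfo e : elements) {
      if (include.test(e)) {
        kept.add(e);
      }
    }
    return kept;
  }
}
```

Filters can be combined with the standard `Predicate` methods, for example `nameMatches(".*\\.txt").and(maxSize(1000))`.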


CompoundHandling for Agents

Priority
LOW

At the moment CompoundHandling is only used in the CrawlerController (to be precise, in the class CrawlThread). It should also be available in the AgentController processing logic, although we currently do not have an Agent that provides compound records. Perhaps we can enhance the mock agent to send compound records on request to allow testing.

Dependency to MimeTypeIdentifier

Priority
HIGH

The CompoundManager needs a MimeTypeIdentifier service to be able to identify the mime type of an incoming object and to decide whether it's a compound or not. This already works fine. However, the MimeTypeIdentifier interface and the SimpleMimeTypeIdentifier service are located in bundle org.eclipse.smila.processing.pipelets.mimetype, which entails dependencies on org.eclipse.smila.processing and some of its sub-bundles. We should move the MimeTypeIdentifier interface and the SimpleMimeTypeIdentifier into different packages outside of processing. Perhaps utils is a good place, but we have to separate interface and implementation to allow for other implementations (ApertureMimetypeidentifier will definitely come).

Then the SimpleMimeTypeIdentifier should also be separated into a pure service and a ProcessingService. The ProcessingService should be located in org.eclipse.smila.processing.pipelets and it should be independent of the MimeTypeIdentifier service used. It should work with any MimeTypeIdentifier and contain appropriate logic to find mime type and/or extension information about the files (e.g. the web crawler metadata).
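
The proposed separation can be sketched as follows. The type and method names are hypothetical and deliberately differ from the existing SMILA classes, whose signatures are not shown on this page. The rule that metadata takes precedence over identification is likewise an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical pure service interface without any dependency on processing. */
interface MimeTypeIdentifierSketch {
  /** Returns the mime type, or null if it cannot be determined. */
  String identify(byte[] data, String extension);
}

/** Hypothetical simple implementation that only looks at the file extension. */
final class ExtensionMimeTypeIdentifier implements MimeTypeIdentifierSketch {
  private final Map<String, String> byExtension = new HashMap<>();

  ExtensionMimeTypeIdentifier() {
    byExtension.put("zip", "application/zip");
    byExtension.put("txt", "text/plain");
  }

  @Override
  public String identify(byte[] data, String extension) {
    if (extension == null) {
      return null;
    }
    return byExtension.get(extension.toLowerCase(Locale.ROOT));
  }
}

/** Hypothetical processing-side wrapper that works with any identifier implementation. */
final class MimeTypeProcessingStep {
  private final MimeTypeIdentifierSketch identifier;

  MimeTypeProcessingStep(MimeTypeIdentifierSketch identifier) {
    this.identifier = identifier;
  }

  /**
   * Assumption: a mime type already present in the metadata (e.g. from the web crawler)
   * is used as is; otherwise the injected identifier is asked.
   */
  String resolve(String metadataMimeType, byte[] content, String extension) {
    if (metadataMimeType != null && !metadataMimeType.isEmpty()) {
      return metadataMimeType;
    }
    return identifier.identify(content, extension);
  }
}
```

Because the wrapper depends only on the interface, a different implementation can be swapped in without touching the processing code.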

Tutorial

Priority
HIGH

We should add a Tutorial on "How to implement a CompoundHandler and CompoundCrawler" as it is a common place for contributors to extend SMILA with their own functionality.
