Difference between revisions of "SMILA/Specifications/CompoundManagementDiscussion"

m (Dependency to MimeTypeIdentifier)
 
(5 intermediate revisions by 2 users not shown)

Latest revision as of 10:58, 2 June 2009

CompoundManagement improvements

The current CompoundManagement implementation is by no means finished or final. Below are some ideas and known issues that could, or in some cases must, be addressed in the future:

Integration in DeltaIndexingManager

Priority: SHOWSTOPPER
STATUS: DONE, see https://bugs.eclipse.org/bugs/show_bug.cgi?id=278360

The current implementations of the DeltaIndexingManager do not handle compound elements correctly. We have to store dependencies between compound records and their elements. If a compound record is checked for update and DeltaIndexing determines that it needs no update then all elements of the compound record have to be marked as visited as well as the compound record. This has to be done recursively for nested compounds (zip in zip in ...).

We have to store two additional pieces of information in the DeltaIndexingManager:

  • boolean isCompound: a flag that specifies whether an entry is a compound record (true) or not (false)
  • String parentIdHash: the hash of the parentId. This is only set for compound elements, which reference their direct parent compound record. For top-level compounds and non-compound records it is set to NULL.

To speed up DeltaIndexing we should not set the VISITED flag for all elements of a compound record, but only for those that are containers themselves. This saves many modifications of existing entries (especially useful for the JPA implementation). We also need an additional flag MODIFIED to differentiate between unchanged and changed compound objects. As a consequence, determining the records for DeltaIndexing Delete needs some additional logic: we have to select only those records whose VISITED flag is false and that either have no parentId (they are not part of a compound hierarchy) or have a parent whose MODIFIED flag is set to true.
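The selection rule for DeltaIndexing Delete can be sketched in Java as follows. This is only an illustration under assumptions: the classes DeltaEntry and DeltaDeleteSelector are hypothetical and not part of SMILA, only the field names follow the text above, and entries whose parent cannot be found are simply left alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical delta indexing entry; only the field names are taken from the text. */
final class DeltaEntry {
  final String idHash;
  final String parentIdHash; // null for top-level compounds and non-compound records
  final boolean isCompound;
  boolean visited;           // set during the current run
  boolean modified;          // set when a compound was found to have changed

  DeltaEntry(String idHash, String parentIdHash, boolean isCompound) {
    this.idHash = idHash;
    this.parentIdHash = parentIdHash;
    this.isCompound = isCompound;
  }
}

/** Hypothetical helper that applies the delete selection rule. */
final class DeltaDeleteSelector {
  private DeltaDeleteSelector() { }

  /** Returns the id hashes of all entries that are obsolete after a run. */
  static List<String> selectObsolete(List<DeltaEntry> entries) {
    Map<String, DeltaEntry> byId = new HashMap<>();
    for (DeltaEntry entry : entries) {
      byId.put(entry.idHash, entry);
    }
    List<String> obsolete = new ArrayList<>();
    for (DeltaEntry entry : entries) {
      if (entry.visited) {
        continue; // seen in this run, keep it
      }
      if (entry.parentIdHash == null) {
        obsolete.add(entry.idHash); // not part of a compound hierarchy
        continue;
      }
      DeltaEntry parent = byId.get(entry.parentIdHash);
      // Assumption: an entry with an unknown parent is not deleted here.
      if (parent != null && parent.modified) {
        obsolete.add(entry.idHash); // parent changed and no longer contains this element
      }
    }
    return obsolete;
  }
}
```

With this rule, the elements of an unchanged compound are kept although their own VISITED flag was never set, because their parent's MODIFIED flag is false.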


Adapting Compound Records

Priority: LOW
STATUS: OPEN

CompoundManager offers the method adaptCompoundRecord(...) to adapt the compound record after its elements have been extracted. This is useful for the following scenarios:

  • if we do not want SMILA to process and index the compound records themselves, we could delete the record
  • if we want to index the compound record (its metadata; the content makes no sense for any search engine), we can do so, but we may want to remove the big content object before sending it to the workflow engine
  • anything else ...

At the moment this method is not implemented; it simply returns the unmodified record. The adaptation should of course be configurable. Both options described above should be easy to implement.
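A configurable adaptation along these lines might look like the following sketch. It does not use the SMILA record API: SketchRecord, AdaptationMode and CompoundAdapter are hypothetical names introduced only for illustration, and the three modes correspond to keeping the record, removing its content, and dropping the record entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a record: some metadata plus a large content object. */
final class SketchRecord {
  final Map<String, String> metadata = new HashMap<>();
  byte[] content;
}

/** Hypothetical configuration values for the adaptation. */
enum AdaptationMode { KEEP, STRIP_CONTENT, DISCARD }

/** Hypothetical adapter that would be called after the elements were extracted. */
final class CompoundAdapter {
  private final AdaptationMode mode;

  CompoundAdapter(AdaptationMode mode) {
    this.mode = mode;
  }

  /** Returns the adapted record, or an empty Optional if it should not be processed. */
  Optional<SketchRecord> adapt(SketchRecord compound) {
    switch (mode) {
      case DISCARD:
        return Optional.empty();
      case STRIP_CONTENT:
        compound.content = null; // keep the metadata, drop the big content object
        return Optional.of(compound);
      default:
        return Optional.of(compound); // current behaviour: unmodified record
    }
  }
}
```

Returning an empty Optional for the "delete" case is a design choice of this sketch; the real method could just as well return null or signal the decision in another way.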

Compound Inheritance

Priority: LOW
STATUS: OPEN

It should be possible to "inherit" attributes, attachments and annotations from a compound record to its elements. A good example is access rights, which are associated with the compound record but are lost when the elements are processed. The inheritance should be configurable:

  • which attributes/attachments/annotations are inherited
  • how they are inherited (execution mode)
    • add: the compound record's values are added to existing element values
    • replace: the element's values (if any exist) are replaced by the compound record's values
    • setIfEmpty: the values from the compound record are set on the element if no values exist
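The three execution modes can be illustrated with a small Java sketch. It models attributes as plain maps from names to value lists, which is an assumption and not the SMILA record model; InheritMode and CompoundInheritance are hypothetical names. The sketch also assumes that a name without values on the compound record leaves the element untouched in every mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical execution modes, named after the list above. */
enum InheritMode { ADD, REPLACE, SET_IF_EMPTY }

/** Hypothetical helper that copies selected values from a compound to one element. */
final class CompoundInheritance {
  private CompoundInheritance() { }

  static void inherit(Map<String, List<String>> compound, Map<String, List<String>> element,
      Set<String> names, InheritMode mode) {
    for (String name : names) {
      List<String> inherited = compound.get(name);
      if (inherited == null || inherited.isEmpty()) {
        continue; // assumption: nothing to inherit leaves the element untouched
      }
      List<String> own = element.get(name);
      boolean elementEmpty = own == null || own.isEmpty();
      if (mode == InheritMode.ADD && !elementEmpty) {
        // add: keep the element's values and append the compound's values
        List<String> merged = new ArrayList<>(own);
        merged.addAll(inherited);
        element.put(name, merged);
      } else if (mode == InheritMode.SET_IF_EMPTY && !elementEmpty) {
        // setIfEmpty: the element already has values, so they are kept
      } else {
        // replace, or any mode on an element without values
        element.put(name, new ArrayList<>(inherited));
      }
    }
  }
}
```

For the access rights example, add would combine the rights of the archive and of the element, while replace would make the archive's rights authoritative.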


Filtering of compound elements

Priority: LOW
STATUS: OPEN

It should be possible to configure filters for compound elements so that certain elements of a compound record are ignored, just as in regular crawlers. It would be great if the filters of the data source the compound record originates from could be reused, but the Crawler/Agent configurations are probably too heterogeneous for that. A separate filter mechanism could therefore be applied that works only on the commonly defined CompoundAttributes (which are the only attributes available anyway).
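Such a separate filter could be as simple as an include/exclude pattern on one attribute. The following Java sketch assumes that the element's attributes are available as a string map; the class CompoundElementFilter and the attribute name used in the usage note are hypothetical, and how missing attributes are treated is an assumption of the sketch.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical filter that accepts or rejects a compound element by one attribute. */
final class CompoundElementFilter {
  private final String attribute;
  private final Pattern include; // null means: include everything
  private final Pattern exclude; // null means: exclude nothing

  CompoundElementFilter(String attribute, String includeRegex, String excludeRegex) {
    this.attribute = attribute;
    this.include = includeRegex == null ? null : Pattern.compile(includeRegex);
    this.exclude = excludeRegex == null ? null : Pattern.compile(excludeRegex);
  }

  /** Returns true if the element should be processed, false if it should be ignored. */
  boolean accepts(Map<String, String> elementAttributes) {
    String value = elementAttributes.get(attribute);
    if (value == null) {
      // Assumption: an element without the attribute passes only if no include rule exists.
      return include == null;
    }
    if (exclude != null && exclude.matcher(value).matches()) {
      return false; // exclude rules win over include rules
    }
    return include == null || include.matcher(value).matches();
  }
}
```

For example, new CompoundElementFilter("name", ".*\\.txt", "tmp.*") would accept an element named readme.txt and reject tmp-notes.txt and image.png.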


CompoundHandling for Agents

Priority: LOW
STATUS: OPEN

At the moment CompoundHandling is only used in the CrawlerController (to be precise, in the class CrawlThread). It should also be available in the AgentController processing logic, although we currently do not have an Agent that provides compound records. Perhaps we can enhance the mock agent to send compound records on request to allow testing.

Dependency to MimeTypeIdentifier

Priority: HIGH
STATUS: OPEN

The CompoundManager needs a MimeTypeIdentifier service to identify the mime type of an incoming object and to decide whether it is a compound or not. This already works fine. However, the MimeTypeIdentifier interface and the SimpleMimeTypeIdentifier service are located in the bundle org.eclipse.smila.processing.pipelets.mimetype, which entails dependencies on org.eclipse.smila.processing and some of its sub-bundles. We should move the MimeTypeIdentifier interface and the SimpleMimeTypeIdentifier into different packages outside of processing. Perhaps utils is a good place, but we have to separate interface and implementation to allow for other implementations (ApertureMimetypeidentifier will definitely come).

Then the SimpleMimeTypeIdentifier should also be separated into a pure service and a ProcessingService. The ProcessingService should be located in org.eclipse.smila.processing.pipelets and should be independent of the MimeTypeIdentifier service used. It should work with any MimeTypeIdentifier and contain appropriate logic to find mime type and/or extension information about the files (e.g. the web crawler metadata).
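The intended separation of interface and implementation can be sketched as follows. All names and signatures here are assumptions made for illustration: SketchMimeTypeIdentifier is not the existing SMILA interface, ExtensionMimeTypeIdentifier is a toy implementation based on file extensions only, and CompoundDetector stands in for a consumer such as the CompoundManager that depends on the interface alone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical service interface, free of any dependency on processing bundles. */
interface SketchMimeTypeIdentifier {
  /** Returns the mime type for a file extension, or null if it is unknown. */
  String identify(String extension);
}

/** Hypothetical simple implementation that would live in its own bundle. */
final class ExtensionMimeTypeIdentifier implements SketchMimeTypeIdentifier {
  private final Map<String, String> byExtension = new HashMap<>();

  ExtensionMimeTypeIdentifier() {
    byExtension.put("zip", "application/zip");
    byExtension.put("txt", "text/plain");
  }

  @Override
  public String identify(String extension) {
    return extension == null ? null : byExtension.get(extension.toLowerCase(Locale.ROOT));
  }
}

/** Hypothetical consumer that works with any identifier implementation. */
final class CompoundDetector {
  private final SketchMimeTypeIdentifier identifier;
  private final Set<String> compoundTypes = new HashSet<>();

  CompoundDetector(SketchMimeTypeIdentifier identifier) {
    this.identifier = identifier;
    compoundTypes.add("application/zip");
  }

  boolean isCompound(String extension) {
    String mimeType = identifier.identify(extension);
    return mimeType != null && compoundTypes.contains(mimeType);
  }
}
```

Because CompoundDetector only knows the interface, a different implementation can be passed in without touching the consumer or pulling in processing dependencies.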

Tutorial

Priority: HIGH
STATUS: OPEN

We should add a tutorial on "How to implement a CompoundHandler and CompoundCrawler", as this is a common way for contributors to extend SMILA with their own functionality.
