Difference between revisions of "SMILA/Specifications/Partitioning Storages"

From Eclipsepedia


Revision as of 07:08, 14 November 2008

Partitioning both Storages for Backup and Reuse/Recrawling

Use case.

After each DFP, it should be possible to store records in a separate partition that can be accessed later.

Changes in XML and Bin storage.

To support partitioning, both the XML and Bin storages must be able to store data to partitions and load data from partitions. The XML and Bin storage APIs should therefore be extended so that partition information is accepted as an additional parameter when saving or loading data. In binstorage, binary attachments can be quite large, so if an attachment has not changed from one partition to the next, it is better not to copy the attachment's data for each partition but to store only a reference to the actual attachment. For example, methods that manage attachments as streams should be extended to:

store(Id, InputStream, Partition);
fetchAsStream(Id, Partition);

It should also be possible to obtain data from the default (latest) partition when no partition information is provided, so fetchAsStream(Id) should return the attachment saved to the latest partition.
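
The storage behaviour described above can be sketched as follows. This is only a minimal in-memory illustration, not the SMILA binstorage implementation: the class name PartitionedBinStorage, the use of String for Id and Partition, the helper sharesContent, and the rule that "latest" means the most recently created partition are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory sketch of a partition-aware attachment store. */
class PartitionedBinStorage {
    // partition name -> (record id -> attachment bytes)
    private final Map<String, Map<String, byte[]>> partitions = new HashMap<>();
    // record id -> bytes most recently stored for that id, in any partition
    private final Map<String, byte[]> lastStored = new HashMap<>();
    // Assumption: the default partition is the one created most recently.
    private String latestPartition;

    void store(String id, InputStream data, String partition) {
        byte[] bytes;
        try {
            bytes = data.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] previous = lastStored.get(id);
        if (previous != null && Arrays.equals(previous, bytes)) {
            bytes = previous; // unchanged attachment: keep a reference, not a copy
        }
        if (!partitions.containsKey(partition)) {
            latestPartition = partition;
        }
        partitions.computeIfAbsent(partition, p -> new HashMap<>()).put(id, bytes);
        lastStored.put(id, bytes);
    }

    /** Returns the attachment from the given partition, or null if absent. */
    InputStream fetchAsStream(String id, String partition) {
        Map<String, byte[]> records = partitions.get(partition);
        byte[] bytes = records == null ? null : records.get(id);
        return bytes == null ? null : new ByteArrayInputStream(bytes);
    }

    /** No partition given: fall back to the default (latest) partition. */
    InputStream fetchAsStream(String id) {
        return latestPartition == null ? null : fetchAsStream(id, latestPartition);
    }

    /** True if both partitions point at the same stored bytes for the id. */
    boolean sharesContent(String id, String first, String second) {
        Map<String, byte[]> a = partitions.get(first);
        Map<String, byte[]> b = partitions.get(second);
        return a != null && b != null && a.get(id) != null && a.get(id) == b.get(id);
    }
}
```

A real implementation would persist the references and would need its own rule for detecting unchanged attachments (for example a checksum); the byte-by-byte comparison here is used only to keep the sketch short.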

Partitioning configuration and changes in Blackboard API.

Partition information (the partition name) can be configured in the Listener Rule. There are two ways to pass partition information:

1. Partition information is passed as a record Id property.

The Listener gets the record from the queue and sets the partition information as a property of the record's Id. The blackboard uses the partition information in load and commit operations. In this case no significant changes to the blackboard API are required, because the partition information is encapsulated in the record Id. The blackboard only needs to be extended to support loading and committing records to particular partitions:

load(Id, Partition)
commit(Id, Partition)
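
A rough sketch of this first option, under stated assumptions: RecordId and SketchBlackboard are hypothetical stand-ins (not the SMILA Id or Blackboard classes), the property name "partition" is invented, and records are reduced to plain strings. The point is only that existing callers keep using load(Id) and commit(Id), while the partition travels inside the Id.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical record Id that can carry properties such as the partition. */
final class RecordId {
    final String key;
    private final Map<String, String> properties = new HashMap<>();

    RecordId(String key) {
        this.key = key;
    }

    void setProperty(String name, String value) {
        properties.put(name, value);
    }

    String getProperty(String name) {
        return properties.get(name);
    }
}

/** Hypothetical blackboard reduced to a cache plus a partitioned store. */
class SketchBlackboard {
    static final String PARTITION_PROPERTY = "partition"; // assumed property name

    private final Map<String, String> storage = new HashMap<>(); // "partition/key" -> record
    private final Map<String, String> cache = new HashMap<>();   // key -> record

    void put(RecordId id, String record) {
        cache.put(id.key, record);
    }

    String get(RecordId id) {
        return cache.get(id.key);
    }

    // Existing-style calls: the partition is read from the Id property.
    void load(RecordId id) {
        load(id, id.getProperty(PARTITION_PROPERTY));
    }

    void commit(RecordId id) {
        commit(id, id.getProperty(PARTITION_PROPERTY));
    }

    // The two partition-aware operations named in the text.
    void load(RecordId id, String partition) {
        String record = storage.get(partition + "/" + id.key);
        if (record != null) {
            cache.put(id.key, record);
        }
    }

    void commit(RecordId id, String partition) {
        String record = cache.remove(id.key); // committed records leave the cache
        if (record != null) {
            storage.put(partition + "/" + id.key, record);
        }
    }
}
```

In this sketch the Listener would call id.setProperty(SketchBlackboard.PARTITION_PROPERTY, name) once after taking the record from the queue; every later component only passes the Id around.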

2. Partition information is passed separately from record as a JMS property.

The Listener reads the JMS property from the queue and makes it available to other components that use the blackboard (such as processing). In this case there are two possibilities:

- The record is loaded from (and committed to) the storages using the partition information. After that, the record can be accessed by Id from the internal blackboard cache until commit is invoked. Implementing this variant first requires solving the existing problems that occur when a process wants to access an already committed record. Once those problems are solved, the only change required to the blackboard API is to implement load and commit operations that handle partitions.

- All methods of the blackboard API are duplicated to accept partition information as a second parameter. In this case, if a record is missing from the internal blackboard cache, it can first be loaded from the storages using the provided partition information. This variant eliminates the problems mentioned above but requires many modifications to the blackboard API, since every method must handle partition information:

createLiteral(Id) 
createLiteral(Id, Partition) 
createAnnotation(Id)
createAnnotation(Id, Partition)
...
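
The duplicated-method variant listed above might look roughly like this. The sketch is illustrative only: PartitionAwareBlackboard, seedStorage, storageLoads, the String-based signatures, and the returned placeholder value are assumptions, and the real createLiteral has a different shape. It shows one method pair, where the partition-aware overload loads the record on a cache miss and then delegates to the original method.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of duplicating a blackboard method with a partition parameter. */
class PartitionAwareBlackboard {
    private final Map<String, Map<String, String>> storage = new HashMap<>(); // partition -> id -> record
    private final Map<String, String> cache = new HashMap<>();               // id -> record
    private int storageLoads = 0;

    /** Test helper: puts a record directly into a partition of the store. */
    void seedStorage(String partition, String id, String record) {
        storage.computeIfAbsent(partition, p -> new HashMap<>()).put(id, record);
    }

    /** Original form: works only while the record is in the cache. */
    String createLiteral(String id) {
        if (!cache.containsKey(id)) {
            throw new IllegalStateException("record not on blackboard: " + id);
        }
        return "literal@" + id; // placeholder result for the sketch
    }

    /** Duplicated form: loads from the given partition on a cache miss. */
    String createLiteral(String id, String partition) {
        ensureLoaded(id, partition);
        return createLiteral(id);
    }

    private void ensureLoaded(String id, String partition) {
        if (cache.containsKey(id)) {
            return; // already cached, no storage access needed
        }
        Map<String, String> records = storage.get(partition);
        String record = records == null ? null : records.get(id);
        if (record == null) {
            throw new IllegalStateException("no record " + id + " in partition " + partition);
        }
        cache.put(id, record);
        storageLoads++;
    }

    int storageLoads() {
        return storageLoads;
    }
}
```

Every other method (createAnnotation and so on) would need the same pair of overloads, which is the API cost the text refers to.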


The first option (passing partition information as a record Id property) seems more convenient and easier to implement: it keeps the partition information directly in the record, so it can easily be passed between distributed components, and it requires almost no blackboard API modifications.