SMILA/Project Concepts/Controlling Tasks Order Concept
Controlling Order of Tasks in the Workflow / Race Conditions
Our current workflow allows more than one record for the same Data Source Entry to be processed by the framework at the same time.
This open structure leads to several problem cases. Can the current system run correctly with different records that belong to the same Data Source Entry?
If yes, a newer record could be processed faster than an older one, because it passes through fewer GFPs (BPEL processes) or fewer queues.
Simple scenarios:
- An add record is sent to the framework, and shortly afterwards a delete record arrives. Because the delete record does not go through BPEL processing, it would be executed before the result of the add record is ready to be put into the index.
- Two add records are sent to the framework. We would do the processing twice without any advantage; the first record could be purged.
The problem of “two records with the same ID but different data or initial operation”.
The current workflow process is designed for exclusive and sequential record processing. It is assumed that there is a “start of processing” (record crawled), some business processes that are executed one after another for the record, and a “process finish” (record stored into the index). Between processing steps the record is stored in the Blackboard cache (and finally it is also stored in XmlStorage and BinStorage). On the other hand, the business processes are executed asynchronously (via a queue listener). The Blackboard-based workflow scheme cannot work properly with asynchronous processes, because it “assumes” that we keep processing the same record until “process finish”.
Possible solutions:
- Block a new record from processing until processing of the previously submitted record with the same ID has finished.
  - This requires some special additional storage for delayed records.
  - It is not clear when the previous record has “finished processing”.
- Avoid using the Blackboard and put the complete record into the queue.
- Stop/reject processing of a record if its timestamp is older than the last one.
  - Requires only minimal changes to the current workflow.
  - Requires an additional but simple service for generating/validating timestamps.
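The timestamp-based option could be sketched roughly as follows. This is only a minimal illustration: `TimestampService`, `stamp` and `isCurrent` are hypothetical names and not part of the existing SMILA API, and a simple in-memory counter stands in for whatever clock the real service would use.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical timestamp service (assumption, not an existing SMILA class). */
final class TimestampService {
    // A monotonic counter is used instead of wall-clock time so that two
    // records created in the same millisecond still get ordered stamps.
    private final AtomicLong clock = new AtomicLong();

    // Newest timestamp issued so far for each record ID.
    private final ConcurrentMap<String, Long> latest = new ConcurrentHashMap<>();

    /** Issues a timestamp for a record that enters the framework. */
    long stamp(String recordId) {
        long ts = clock.incrementAndGet();
        latest.merge(recordId, ts, Long::max);
        return ts;
    }

    /** Returns true if no newer record with the same ID has been stamped. */
    boolean isCurrent(String recordId, long timestamp) {
        Long newest = latest.get(recordId);
        return newest != null && timestamp >= newest;
    }
}
```

In the add/delete scenario, the add record would be stamped first and the delete record second. When the add record finally reaches the index step, `isCurrent` returns false for its timestamp and the step can be skipped, while the delete record still passes the check.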
The main advantage of the first solution is that every record modification will be processed. The main disadvantage is that it makes record processing synchronous. There is also the problem that if processing of some record fails, it may completely stop future processing of records with this ID.
For the current functionality I prefer the last one (stop/reject record processing by timestamp), because it is more efficient (asynchronous) and safe. Unfortunately, some record changes may be lost. We do not need them now, but we can easily imagine a new pipelet that stores/tracks record changes.
It is suggested to add a "timestamp" field to the Id and to compare Ids with two operations: equals and equivalent.
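A possible shape for such an Id is sketched below. The class `RecordId` and its fields are simplified assumptions (the real SMILA Id has more parts), and the split of responsibilities is also an assumption: here `equals` compares the timestamp as well, while `equivalent` ignores it and only checks that both Ids refer to the same Data Source Entry.

```java
import java.util.Objects;

/** Hypothetical, simplified Id with a timestamp (assumption for illustration). */
final class RecordId {
    private final String source;   // data source name
    private final String key;      // key of the entry within the source
    private final long timestamp;  // version marker issued when the record was created

    RecordId(String source, String key, long timestamp) {
        this.source = Objects.requireNonNull(source);
        this.key = Objects.requireNonNull(key);
        this.timestamp = timestamp;
    }

    /** Same Data Source Entry, regardless of the timestamp. */
    boolean equivalent(RecordId other) {
        return other != null && source.equals(other.source) && key.equals(other.key);
    }

    /** True if both Ids are equivalent and this one carries the later timestamp. */
    boolean isNewerThan(RecordId other) {
        return equivalent(other) && timestamp > other.timestamp;
    }

    /** Same Data Source Entry and same timestamp, i.e. the very same version. */
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof RecordId)) {
            return false;
        }
        RecordId other = (RecordId) o;
        return equivalent(other) && timestamp == other.timestamp;
    }

    @Override
    public int hashCode() {
        return Objects.hash(source, key, timestamp);
    }
}
```

With this split, a processing step can use `equivalent` to find competing records for the same entry and `isNewerThan` to decide which of them to reject, while collections keyed by Id still distinguish the individual versions.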
More complex solution
I may try to suggest a basis for a more complex solution. The main idea is to adapt the Blackboard for editing multiple record versions. The following list of requirements outlines the idea, but I'm not sure that it is required now.
- A “timestamp service” is used for generating/validating record timestamps.
- The Blackboard supports editing of records with multiple versions (separated by timestamp).
- Attachments are saved into BinStorage with a timestamp during processing.
- When a process wants to commit a record (from the Blackboard into XmlStorage), the commit only happens if the record is the latest version.
- An alternative behavior is to store all record versions with timestamps in XmlStorage.
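The "commit only if it is the last one" rule from the list above might look like the following sketch. `VersionedStore` and `commitIfLatest` are hypothetical names, the in-memory map only stands in for XmlStorage, and "latest" is interpreted here as "newer than the version already committed", which is an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for a version-aware commit into XmlStorage. */
final class VersionedStore {

    /** One committed record version: timestamp plus serialized content. */
    private static final class Version {
        final long timestamp;
        final String content;

        Version(long timestamp, String content) {
            this.timestamp = timestamp;
            this.content = content;
        }
    }

    // Currently committed version for each record ID.
    private final ConcurrentMap<String, Version> committed = new ConcurrentHashMap<>();

    /**
     * Commits the given version unless a version with the same or a later
     * timestamp is already stored. Returns true if the commit took place.
     */
    boolean commitIfLatest(String recordId, long timestamp, String content) {
        Version candidate = new Version(timestamp, content);
        // merge() applies the comparison atomically per key, so two processes
        // committing the same ID cannot overwrite a newer version.
        Version winner = committed.merge(recordId, candidate,
                (old, neu) -> neu.timestamp > old.timestamp ? neu : old);
        return winner == candidate;
    }

    /** Returns the committed content, or null if nothing was committed. */
    String read(String recordId) {
        Version v = committed.get(recordId);
        return v == null ? null : v.content;
    }
}
```

If a slow process tries to commit timestamp 1 after a faster process has already committed timestamp 2, `commitIfLatest` returns false and the stored content stays unchanged. Storing all versions instead would mean keeping a list per ID rather than a single entry.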
Last thoughts about
There may be two types of solutions, depending on one key statement, which can be summarized in one question:
When a Record object is passed into a "Processor", does it contain the complete Record data, or may it be partial?
Partial data can be explained with the following example.
Two agents collect data from database tables for one Record:
- table [person] (id, name): trigger on update linked with Agent A
- table [person_address] (id, person_id, address): trigger on update linked with Agent B
Agents A and B collect table changes and send them to processing; both of them collect data for one object, "Person". Each Record then contains only partial data for Person.
I'm not sure that support for partial records is required.
If it is not required and a Record contains complete data, then it is possible to use the timestamp to reject old records.
Otherwise, records for one ID should be processed synchronously, one by one. Organizing locks for synchronous one-by-one processing will be a performance blocker and may cause deadlocks on Records. And, imho, almost all benefits of asynchronous MQ processing will be lost.
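The difference between the two cases can be illustrated with a small sketch. Records are reduced to plain attribute maps here, and `PartialRecordDemo`, `keepNewest` and `applyInOrder` are hypothetical helpers, not SMILA code; the attribute values are made up for the example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical demo of complete versus partial records for one "Person". */
final class PartialRecordDemo {

    /** Timestamp rejection: only the newest record survives. */
    static Map<String, String> keepNewest(Map<String, String> older,
                                          Map<String, String> newer) {
        return new LinkedHashMap<>(newer);
    }

    /** One-by-one processing: records are applied in order, newer values win. */
    static Map<String, String> applyInOrder(Map<String, String> older,
                                            Map<String, String> newer) {
        Map<String, String> merged = new LinkedHashMap<>(older);
        merged.putAll(newer);
        return merged;
    }

    public static void main(String[] args) {
        // Agent A reports a change in table [person].
        Map<String, String> fromAgentA = Collections.singletonMap("name", "Alice");
        // Agent B reports a later change in table [person_address].
        Map<String, String> fromAgentB = Collections.singletonMap("address", "Main Street 1");

        System.out.println("keepNewest:   " + keepNewest(fromAgentA, fromAgentB));
        System.out.println("applyInOrder: " + applyInOrder(fromAgentA, fromAgentB));
    }
}
```

With partial records, `keepNewest` drops the name reported by Agent A, so rejecting by timestamp loses data. `applyInOrder` keeps both attributes, but it only works if the records for one ID are processed in sequence, which is exactly the synchronous case described above.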
Any ideas, opinions?