SMILA/Specifications/ProcessingMessageResequencer
Revision as of 10:05, 2 October 2009
The Core Problem
MQs are inherently asynchronous, but in the end we need to be sure that the final processing target reflects the correct state of the data source at any given time.
Different processing targets may have different requirements. At this time we will only treat the case of a full-text retrieval engine such as Lucene.
Indexing Requirements
- the order only needs to be maintained on a per-record basis
- older messages are always superseded by newer messages for a given resource and can therefore be removed from the MQ w/o further processing.
Operation N | Operation N+1 | Expected index state after N+1
ADD A,t1 | ADD A,t2 | A,t2
ADD A,t1 | DELETE A,t2 | A doesn't exist
DELETE A,t1 | ADD A,t2 | A exists
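The supersede rule behind this table can be sketched with a dict-based stand-in for the index (the function and names are illustrative, not SMILA API):

```python
def apply_op(index, op, resource, timestamp):
    """Apply one ADD/DELETE operation to a dict-based index model.
    A newer operation for a resource always supersedes an older one."""
    if op == "ADD":
        index[resource] = timestamp
    elif op == "DELETE":
        index.pop(resource, None)  # deleting a missing resource is a no-op
    return index

# The three rows of the table above:
idx = apply_op(apply_op({}, "ADD", "A", "t1"), "ADD", "A", "t2")
# -> {'A': 't2'}
idx = apply_op(apply_op({}, "ADD", "A", "t1"), "DELETE", "A", "t2")
# -> {} : A doesn't exist
idx = apply_op(apply_op({}, "DELETE", "A", "t1"), "ADD", "A", "t2")
# -> {'A': 't2'} : A exists
```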
(Special) Cases to be Covered
This section names the cases that don't come to mind immediately but still need to be considered nonetheless:
- splitting of records
  This is more complicated; see the code comments of the SRS class for its treatment.
- overflow of the SN
  A reset signal must be sent to the SRS.
- what to do with records that miss needed config data?
  Handling depends on the process mode:
  - register: error
  - resequence: error for now, but the outcome could be configurable
- handling of unknown IDs
Idea 1 - Handle Resequencing in Connectivity
There was an idea to handle this case directly in Connectivity with the help of a buffer:
- each incoming operation for a requested resource is buffered for a period of time X.
  X is at minimum as long as the longest processing path takes for any given record. In the beginning this value is certainly chosen manually, but by evaluating performance counters it should be possible to determine or adjust X automatically.
- during the time in the buffer, additional processing requests are consolidated to reduce load
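The buffer idea above can be sketched as follows; the class and method names are assumptions for illustration, not part of Connectivity's actual API:

```python
class ConsolidatingBuffer:
    """Sketch of the Connectivity buffer idea: each operation is held for a
    delay window X, and newer requests for the same resource replace the
    buffered one (consolidation). Names are illustrative only."""

    def __init__(self, delay_x):
        self.delay_x = delay_x
        self._pending = {}  # resource id -> (operation, time of first arrival)

    def submit(self, resource, operation, now):
        # Consolidate: keep only the newest operation, but preserve the
        # original arrival time so the entry is not delayed indefinitely.
        arrival = self._pending.get(resource, (None, now))[1]
        self._pending[resource] = (operation, arrival)

    def flush(self, now):
        """Release every entry whose delay X has expired."""
        ready = [(r, op) for r, (op, t) in self._pending.items()
                 if now - t >= self.delay_x]
        for resource, _ in ready:
            del self._pending[resource]
        return ready
```

For example, an ADD for resource A followed by a DELETE within the window leaves only the DELETE in the buffer, and it is released once X has elapsed since the first arrival.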
CON
- lag
  All messages will have a lag of length X before the index is updated. For mass crawling this might be acceptable, but in combination with agents it is in most cases not acceptable; it will depend on the nature of the application SMILA is used for.
- no guarantee
  Delaying processing will reduce the chance of mishaps, but there is no guarantee, especially when the processing chain is more complex, such as in a cluster setup that spreads the processing load over several nodes. In such a scenario we also need to take into account that some nodes may be down temporarily while retaining the records that were assigned to them.
PRO
- simple to implement and has no effect on the API or other logic
Idea 2 - Full Resequencer
Synopsis: The Resequencer will update the Lucene index in the exact order in which the crawler or agent adds records to the queue.
The processing will be like so:
- the router will feed Q1 with ADD and DELETE ops. For the resequencer to know the order, a new piece of meta information needs to be added -- the sequence number. It must be generated in the agent or by the AgentController.
- the processing pipelines are as normal but
  - w/o the step of calling the Lucene service
  - they add the result to a new queue, Q2
- the Resequencer will listen on Q2 and pick up all messages
- starting with the first record, feed consecutive chunks of IDs to the Lucene service
- when messages get lost in the DLQ, a timeout can solve this problem
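The reordering step of the Full Resequencer can be sketched like this; `index_service` is a hypothetical stand-in for the Lucene service call, not the actual SMILA interface:

```python
class FullResequencer:
    """Sketch of the Full Resequencer: records arrive from Q2 in any order
    and are handed to the index service strictly by sequence number.
    index_service is a stand-in callable for the Lucene service."""

    def __init__(self, index_service, first_seq=1):
        self.index_service = index_service
        self.next_seq = first_seq   # next sequence number to release
        self._pending = {}          # seq# -> record, waiting for its turn

    def on_message(self, seq, record):
        self._pending[seq] = record
        # Starting with the expected record, release consecutive chunks.
        while self.next_seq in self._pending:
            self.index_service(self._pending.pop(self.next_seq))
            self.next_seq += 1
```

With out-of-order delivery (seq# 2, 3, then 1), nothing is indexed until seq# 1 arrives, after which all three are released in order.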
PRO
- no processing target can ask for more; a correct result is always possible
- it is possible to add a note into the index for records ending up in the DLQ, i.e. "record was not indexed due to a processing error"
CON
- overkill, b/c there is no need to resequence all messages, only those with the same ID
Idea 3 - Smart Resequencer
Synopsis: The Resequencer will process only the most recent operation per resource added by the agent, suppressing the older ones.
The processing will be similar to the Full Resequencer and is then like so:
- the router will feed Q1 with ADD and DELETE ops. For the resequencer to know the order, a new piece of meta information needs to be added -- the sequence number. It must be generated in the agent or by the AgentController.
- the Resequencer will
  - subscribe to/peek into Q1 w/o taking messages from the Q. It remembers the ID and seq# in a map.
  - if the map already contained an entry for the ID, it may remove the older messages for this ID from the Qs, stopping them from being processed further. This however requires the hash to be present as a JMS property.
- the processing pipelines are as normal but
  - w/o the step of calling the Lucene service
  - they add the result to a new queue, Q2
- the Resequencer will listen on Q2 and pick up all messages. For each message:
  IF its seq# matches the one in the map THEN
      call the index service for the record
      remove the ID from the map
  ELSE
      ignore and skip the operation
  FI
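A minimal sketch of this map-based check (all names are illustrative; `index_service` stands in for the Lucene/index service):

```python
class SmartResequencer:
    """Sketch of the Smart Resequencer. peek() mirrors peeking into Q1
    without consuming: it remembers the newest seq# per record ID. When a
    processed record arrives from Q2, only the one whose seq# still matches
    the map reaches the index service; superseded ones are skipped."""

    def __init__(self, index_service):
        self.index_service = index_service
        self._latest = {}  # record ID -> newest seq# seen on Q1

    def peek(self, record_id, seq):
        # called while peeking into Q1, w/o taking the message from the Q
        self._latest[record_id] = seq

    def on_processed(self, record_id, seq, record):
        if self._latest.get(record_id) == seq:   # IF seq# matches THEN
            self.index_service(record)           # call the index service
            del self._latest[record_id]          # remove ID from map
            return True
        return False                             # ELSE ignore and skip
```

For example, if two operations for record A were peeked (seq# 1, then 2), the processed result with seq# 1 is skipped and only seq# 2 is indexed.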
PRO
- smarter ;)
CON
- oscillating resources (that constantly change) will never make it into the index
Appendix
Abbreviations
Abbrev | Meaning
SN | Sequence Number
SRS | Smart Resequencer Service
Ideas
- replace the SN with a more general ComparableObject