SMILA/Specifications/ProcessingMessageResequencer

Revision as of 17:27, 1 October 2009


Status
This page is very much a work in progress, and discussion is still happening on the dev list (http://dev.eclipse.org/mhonarc/lists/smila-dev/msg00608.html).

As the concept matures during the discussion, this page will be updated at intervals.

This enhancement is tracked through bug 289995 (https://bugs.eclipse.org/bugs/show_bug.cgi?id=289995).

For the development I opened a new branch at https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/branches/2009-09-23_r608_resequencer


The Core Problem

MQs are inherently asynchronous, but in the end we need to be sure that the final processing target reflects the correct state of the data source at any given time.

Final processing targets may differ in their requirements. For now we only treat the case of a full-text retrieval engine such as Lucene.

Indexing Requirements

- the order needs to be maintained only on a per-record basis: for a given resource, older messages are superseded by newer ones and can be removed from the MQ without further processing


Operation N    Operation N+1    Expected index state after N+1
ADD A,t1       ADD A,t2         A,t2
ADD A,t1       DELETE A,t2      A doesn't exist
DELETE A,t1    ADD A,t2         A exists
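
The table can be read as a "newest operation per record wins" rule. The following is a minimal sketch of that rule, not SMILA code: the `Op` type and `apply_ops` function are hypothetical names, and the integer `timestamp` simply stands in for t1, t2 in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str        # "ADD" or "DELETE"
    record_id: str   # e.g. "A"
    timestamp: int   # stands in for t1, t2, ... in the table


def apply_ops(ops):
    """Return {record_id: timestamp} for every record that exists in the
    index once, per record, only the newest operation has been applied."""
    newest = {}
    for op in ops:
        current = newest.get(op.record_id)
        # A newer op supersedes an older one for the same record,
        # regardless of the order in which the two arrive.
        if current is None or op.timestamp > current.timestamp:
            newest[op.record_id] = op
    return {rid: op.timestamp for rid, op in newest.items() if op.kind == "ADD"}
```

Because the arrival order does not matter in this sketch, feeding the operations of any table row in reverse yields the same expected index state.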


Idea 1 - Handle Resequencing in Connectivity

The original idea was to handle this case directly in connectivity with the help of a buffer.

I am not pursuing this idea any further: we would need some configuration to tell connectivity where to remove messages, and we would have to provide the necessary API in the modules further down the processing chain, which seems very messy to me.


Idea 2 - Full Resequencer

Synopsis: The Resequencer will update the Lucene index in exactly the order in which the crawler or agent adds records to the queue.

The processing will be like so:

  1. the router will feed Q1 with ADD and DELETE ops. For the resequencer to know the order, a new piece of meta information needs to be added: the sequence number. It must be generated in the agent or by the agent controller
  2. the processing pipelines are as normal but
    1. without the step of calling the Lucene service
    2. they add the result to a new queue, Q2
  3. the Resequencer will listen on Q2 and pick up all messages
  4. starting with the first record, it feeds consecutive chunks of IDs to the Lucene service
  5. when messages get lost in the DLQ, a timeout can resolve the resulting gap
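
Steps 3 to 5 can be sketched roughly as follows. This is an illustration under assumptions, not the SMILA implementation: the class name `FullResequencer`, the `index_service` callable, and the `on_message`/`on_timeout` hooks are hypothetical, sequence numbers are assumed to be consecutive integers, and the timer that decides when a message counts as lost is left to the caller.

```python
class FullResequencer:
    """Releases messages from Q2 to the index strictly in sequence order."""

    def __init__(self, index_service, first_seq=1):
        self._index_service = index_service  # hypothetical callable, one message per call
        self._next_seq = first_seq           # next sequence number to release
        self._pending = {}                   # seq -> message, buffered out-of-order arrivals

    def on_message(self, seq, message):
        """Called for every message picked up from Q2."""
        if seq < self._next_seq:
            return  # arrived after its slot was released or skipped; drop it
        self._pending[seq] = message
        self._flush()

    def on_timeout(self):
        """Called when the next expected message is presumed lost (e.g. in the DLQ)."""
        if self._pending:
            # Skip the gap and continue with the earliest buffered message.
            self._next_seq = min(self._pending)
            self._flush()

    def _flush(self):
        # Release the longest consecutive run starting at the expected seq.
        while self._next_seq in self._pending:
            self._index_service(self._pending.pop(self._next_seq))
            self._next_seq += 1
```

For example, if messages 2 and then 1 arrive, nothing is released until 1 is seen, after which both are indexed in order. If 3 never arrives while 4 and 5 are buffered, a call to `on_timeout()` releases 4 and 5.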


PRO

  • no processing target can ask for more, and a correct result is always possible
  • it is possible to add a note to the index for records ending up in the DLQ, i.e. that the record was not indexed due to a processing error.

CON

  • overkill, because there is no need to resequence all messages, only those with the same ID


Idea 3 - Smart Resequencer

Synopsis: The Resequencer will process only the most recent operation per resource added by the agent, suppressing the older ones.

The processing is similar to that of the Full Resequencer:

  1. the router will feed Q1 with ADD and DELETE ops. For the resequencer to know the order, a new piece of meta information needs to be added: the sequence number. It must be generated in the agent or by the agent controller
  2. the Resequencer will
    1. subscribe to/peek into Q1 without taking messages from the queue. It remembers the ID and seq# in a map.
    2. when the map already contains an entry for the ID, it may remove messages for this ID from the queues, stopping them from being processed further. This, however, requires the hash to be present as a JMX property.
  3. the processing pipelines are as normal but
    1. without the step of calling the Lucene service
    2. they add the result to a new queue, Q2
  4. the Resequencer will listen on Q2 and pick up all messages. For each message:
    1. IF its seq# matches the one in the map THEN
      call the index service for the record
      ELSE
      ignore and skip operation
      FI
    2. remove ID from map
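
The map handling in steps 2 and 4 could look roughly like the sketch below. It is only an illustration: `SmartResequencer`, `on_q1`, `on_q2` and the `index_service` callable are hypothetical names, and actually removing superseded messages from the queues is left out. One interpretation is assumed here: the map entry is removed only after a matching message has been indexed, so that a stale result arriving first does not cause the newest one to be skipped as well.

```python
class SmartResequencer:
    """Indexes only the most recent operation per record ID."""

    def __init__(self, index_service):
        self._index_service = index_service  # hypothetical callable, one message per call
        self._latest = {}                    # record ID -> highest seq# seen on Q1

    def on_q1(self, record_id, seq):
        """Peek at a message entering Q1.

        Returns True if an older operation for this ID is now superseded
        and could be removed from the queues."""
        previous = self._latest.get(record_id)
        if previous is None or seq > previous:
            self._latest[record_id] = seq
        return previous is not None and seq > previous

    def on_q2(self, record_id, seq, message):
        """Handle a processed message from Q2. Returns True if it was indexed."""
        if self._latest.get(record_id) != seq:
            return False  # superseded (or unknown) operation: ignore and skip
        self._index_service(message)
        del self._latest[record_id]  # assumption: remove only after a match
        return True
```

With this sketch, if A is seen on Q1 with seq# 1 and later with seq# 3, the processed result for seq# 1 is skipped when it arrives on Q2 and only the result for seq# 3 reaches the index.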

PRO

- smarter ;)

CON

- oscillating resources (those that change constantly) will never make it into the index
