SMILA/Specifications/Processing Message Resequencer/Smart Resequencer

From Eclipsepedia

Smart Resequencer (SRS)

Synopsis: The Smart Resequencer resequences operations only on a per-record basis.
Rationale: in most cases records are independent of each other, so there is no need to order all records.

working principle

  1. the SRS follows the registry and resequencer pattern
  2. the SRS defines boundaries between which, processing of multiple PRs for one given resource can be done concurrently and in any order
  3. it guarantees that when leaving its boundary the PRs are consolidated or at least in proper order
  4. the first thing that needs to happen for a PR is to be REGISTERED with the SRS. by this, the PR crosses the first boundary and enters the realm of the SRS.
  5. for the SRS to determine the correct order or consolidation of PRs, a PR must declare on REGISTRATION a sequence number (SN) that reflects the order of PRs as created by the agent/crawler
  6. before a PR is passed to a PT it must be RESEQUENCED by the SRS, which just checks if the PR's SN equals the SN of the latest REGISTERED PR. if so, it is added to the output; if not, it is not added to the output. a RESEQUENCED PR crosses the 2nd boundary and thus leaves the realm of the SRS.
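
the steps above can be sketched as a small in-memory registry. this is only a sketch under assumptions: the class and method names are hypothetical, "latest REGISTERED PR" is read as "highest SN registered so far for that resource", and a plain dict guarded by a lock stands in for whatever map is eventually chosen.

```python
import threading


class SmartResequencer:
    """Hypothetical in-memory sketch of the SRS registry."""

    def __init__(self):
        self._latest = {}              # resource ID -> highest registered SN
        self._lock = threading.Lock()  # REGISTER and RESEQUENCE may be called concurrently

    def register(self, resource_id, sn):
        """REGISTER: the PR enters the SRS realm; remember the newest SN."""
        with self._lock:
            current = self._latest.get(resource_id)
            if current is None or sn > current:
                self._latest[resource_id] = sn

    def resequence(self, resource_id, sn):
        """RESEQUENCE: True if the PR may go on to the PT, False if it is superseded."""
        with self._lock:
            return self._latest.get(resource_id) == sn


srs = SmartResequencer()
srs.register("doc-1", 1)
srs.register("doc-1", 2)            # a newer PR for the same resource
print(srs.resequence("doc-1", 1))   # False: superseded by SN 2
print(srs.resequence("doc-1", 2))   # True: latest registered SN
```

a PR whose ID was never registered is simply not passed on in this sketch; how such records should really be handled is an open point.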

non-concurrent REGISTERING setup

  1. the router will deliver all PRs to Q1 where the SRS has a sole listener that registers all incoming messages
  2. the output of the SRS can then be processed as normal by multiple pipelines
  3. each pipeline that calls the PT must inject a call to the SRS with the RESEQUENCE command before the call to the PT.

a downside to this setup is that REGISTERING adds execution time to the critical path for all records and can only be processed by one thread. question: is there a better setup that does not introduce an error?

concurrent REGISTERING setup with call in pipelines

instead of creating a sole pipeline for the SRS, this idea proposes to add the REGISTER call to each pipeline at the very beginning. by this, concurrent processing of all PRs is fostered from the start.

this will work safely:

  1. listener L1 and L2 work concurrently on the same item but with diff. SNs. L1 on PR1.SN1 and L2 on PR2.SN2.
  2. the critical case is when the SRS doesn't know of the more recent PR2 when being asked to RESEQUENCE PR1, like so:
    1. if L1 calls the SRS with RESEQUENCE before L2 gets to call REGISTER, the SRS doesn't know that an SN2 is out there and lets the PR with SN1 pass thru.
    2. this is not an error b/c the order of processing is maintained. it has a drawback however, which is that the SRS is unable to suppress PR1.
  3. i cannot think of another case where this setup leads to an error. b/c all PRs are registered before any further processing may take place, this setup works in all cases IMO. (please prove me wrong)
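
the two interleavings discussed above can be replayed deterministically. this is a single-threaded sketch with hypothetical helper names; it only mimics the order in which L1 and L2 call the SRS and says nothing about real thread scheduling.

```python
def make_srs():
    """Return (register, resequence) closures over a fresh, hypothetical SN map."""
    latest = {}  # item ID -> highest registered SN

    def register(item_id, sn):
        latest[item_id] = max(sn, latest.get(item_id, sn))

    def resequence(item_id, sn):
        return latest.get(item_id) == sn

    return register, resequence


# critical case: L1 resequences PR1 before L2 has registered PR2
register, resequence = make_srs()
register("item", 1)                  # L1: REGISTER PR1.SN1
late_pr1 = resequence("item", 1)     # L1: RESEQUENCE PR1, SN2 still unknown
register("item", 2)                  # L2: REGISTER PR2.SN2
late_pr2 = resequence("item", 2)     # L2: RESEQUENCE PR2

# other case: both PRs are registered before either is resequenced
register, resequence = make_srs()
register("item", 1)
register("item", 2)
early_pr1 = resequence("item", 1)    # PR1 is suppressed
early_pr2 = resequence("item", 2)

print(late_pr1, late_pr2)    # True True  (both pass, in order; PR1 not suppressed)
print(early_pr1, early_pr2)  # False True (only the newest PR passes)
```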

pro: reduced overhead due to one less Q and pipeline

con: it might be easy to forget to call the SRS in each pipeline.


concurrent REGISTERING setup with mirror Queue

in this setup a copy of the PR with its SN is sent to an additional Q2 in parallel to Q1 (two send tasks in router). SRS is the only listener on Q2 and REGISTERS all the PRs.

this will introduce an error in case a PR is RESEQUENCED while a newer PR for the same item is still waiting in Q2.

the problem here is generally that REGISTERing happens asynchronously to processing and hence cannot be safe.

fix: the error introduced by this setup can be fixed by demanding that the SRS first processes all PRs on Q2 before RESEQUENCING PRs. however, this may cause PRs not to be added to the index for as long as the agent keeps producing PRs. in that case it won't be better than the non-concurrent setup.
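
one way to read the failure and the fix is sketched below. assumptions: Q2 is modelled as a local deque, the class name and the drain_first flag are made up for illustration, and "error" is interpreted as "RESEQUENCE answers from a stale registry".

```python
from collections import deque


class MirrorQueueSrs:
    """Hypothetical SRS that REGISTERS from a mirror queue Q2."""

    def __init__(self):
        self.q2 = deque()   # copies of (item ID, SN) sent by the router
        self.latest = {}    # item ID -> highest registered SN

    def _drain(self):
        # register everything that is still waiting in Q2
        while self.q2:
            item_id, sn = self.q2.popleft()
            self.latest[item_id] = max(sn, self.latest.get(item_id, sn))

    def resequence(self, item_id, sn, drain_first=True):
        if drain_first:      # the proposed fix: empty Q2 before answering
            self._drain()
        return self.latest.get(item_id) == sn


srs = MirrorQueueSrs()
srs.q2.extend([("item", 1), ("item", 2)])
item_id, sn = srs.q2.popleft()   # the SRS has only got to PR1 so far
srs.latest[item_id] = sn

stale_pr2 = srs.resequence("item", 2, drain_first=False)  # False: newest PR is dropped
stale_pr1 = srs.resequence("item", 1, drain_first=False)  # True: outdated PR passes
safe_pr1 = srs.resequence("item", 1)                      # False once Q2 is drained
safe_pr2 = srs.resequence("item", 2)                      # True
```

the drain is what makes the answer safe, and it is also what can stall RESEQUENCING while Q2 keeps filling up.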

Basics Impl. Ideas

  • implemented as ProcessingService
  • records are sent to it with the command/process mode REGISTER and SEQUENCE
  • SN and process mode are given as an annotation on the record, called the Config Annotation (CA). this is the same way as with the Lucene service. (i first wanted to do this as JMS props but they are not accessible in a ProcessingService)
  • map may be in memory or a persisting solution may be implemented/chosen.
    IMO the amount of records held in memory should be relatively small, only the amount of what is in the processing chain. (hm, that can be a lot, since connectivity is not pausing crawlers and agents (yet) if there is much in the MQ)
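
a rough sketch of how a service could dispatch on the Config Annotation. the record layout (a dict with "id" and "annotations"), the key names "mode" and "sn", and the function name are all assumptions for illustration and not the actual SMILA record or ProcessingService API.

```python
def process(record, registry):
    """Dispatch on the (hypothetical) Config Annotation of a record.

    Returns True if the record may continue in the pipeline.
    """
    ca = record["annotations"]["ConfigAnnotation"]
    record_id, mode, sn = record["id"], ca["mode"], ca["sn"]
    if mode == "REGISTER":
        registry[record_id] = max(sn, registry.get(record_id, sn))
        return True
    if mode == "SEQUENCE":
        return registry.get(record_id) == sn
    raise ValueError(f"unknown process mode: {mode!r}")


def make_record(record_id, mode, sn):
    # helper for the example only
    return {"id": record_id, "annotations": {"ConfigAnnotation": {"mode": mode, "sn": sn}}}


registry = {}   # in-memory map; a persistent map could be swapped in
process(make_record("doc-1", "REGISTER", 1), registry)
process(make_record("doc-1", "REGISTER", 2), registry)
old_passes = process(make_record("doc-1", "SEQUENCE", 1), registry)   # False
new_passes = process(make_record("doc-1", "SEQUENCE", 2), registry)   # True
```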


Meeting Requirements

The general ones should be sufficiently clear from the functional description of the SRS. here are the further ones:

split records

compound and aggregation are handled the same way, like so:

the processing step splitting the record is responsible for the following:

  • all descendants inherit the SN from their root
  • if internal ordering:
    order of PRs for descendants is noted in their respective ConfigAnnotation (e.g. a link to the ID of the preceding or succeeding resource or such)
  • register the split records with the SRS
  • possibly deregister the root and/or intermediate PRs if these are not processed further

the SRS will

  • collect all PRs belonging to the tree of split PRs until it is complete
    • missing PRs:
      • timeout
      • config on how to continue with non-complete trees: {all or nothing, sequence incomplete}
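
the collecting behaviour could look roughly like this. it is a sketch under assumptions: the set of expected descendant IDs is known when the tree is created, the policy names mirror the two config options above, and the internal ordering via the ConfigAnnotation is left out (PRs are released in arrival order).

```python
import time


class SplitTree:
    """Hypothetical collector for the PRs of one split record."""

    def __init__(self, expected_ids, timeout_s, policy="all_or_nothing",
                 clock=time.monotonic):
        self.expected = set(expected_ids)
        self.arrived = {}                 # PR ID -> PR, in arrival order
        self.policy = policy              # "all_or_nothing" or "sequence_incomplete"
        self.clock = clock
        self.deadline = clock() + timeout_s

    def add(self, pr_id, pr):
        if pr_id in self.expected:
            self.arrived[pr_id] = pr

    def release(self):
        """Return the PRs to forward, or None while still waiting."""
        if self.expected == set(self.arrived):
            return list(self.arrived.values())      # tree is complete
        if self.clock() < self.deadline:
            return None                             # keep waiting for missing PRs
        if self.policy == "sequence_incomplete":
            return list(self.arrived.values())      # timed out: forward what we have
        return []                                   # timed out: all or nothing


now = [0.0]                                         # fake clock for the example
tree = SplitTree(["a", "b", "c"], timeout_s=10,
                 policy="sequence_incomplete", clock=lambda: now[0])
tree.add("a", "PR-a")
tree.add("b", "PR-b")
waiting = tree.release()       # None: "c" is missing and the timeout has not expired
now[0] = 11.0
released = tree.release()      # ["PR-a", "PR-b"]
```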

>1 Processing Targets

this can be supported in diff. ways. both have in common:

  1. sending the PR to any of the PTs is done thru the SRS by calling it with the RESEQUENCE command (this is just a generalization of the basic concept and repeated here for clarity)
  2. the SRS needs to know how many (potential) PTs there are for a resource (determined by the ID) and when processing has really finished for a given ID.
  3. each RESEQUENCE and UNREGISTER command will reduce the count, when it reaches 0 all PRs have reached their PT and the ID can be removed from the map.
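
the counting in point 3 can be sketched as follows. class and method names are hypothetical, and the SN comparison is left out to keep the focus on the counter.

```python
class TargetCounter:
    """Hypothetical per-ID counter of PTs that still have to be reached."""

    def __init__(self):
        self._count = {}   # resource ID -> number of outstanding PTs

    def register(self, resource_id, targets=1):
        self._count[resource_id] = self._count.get(resource_id, 0) + targets

    def _decrement(self, resource_id):
        remaining = self._count[resource_id] - 1   # KeyError for unknown IDs
        if remaining == 0:
            del self._count[resource_id]           # all PTs reached: forget the ID
        else:
            self._count[resource_id] = remaining
        return remaining

    def resequence(self, resource_id):
        return self._decrement(resource_id)

    def unregister(self, resource_id):
        return self._decrement(resource_id)

    def is_tracked(self, resource_id):
        return resource_id in self._count


counter = TargetCounter()
counter.register("doc-1", targets=2)   # two PTs known at REGISTRATION
counter.register("doc-1")              # a third PT is added later
counter.unregister("doc-1")            # one PT turns out to be obsolete
counter.resequence("doc-1")
counter.resequence("doc-1")
print(counter.is_tracked("doc-1"))     # False: count reached 0, ID removed
```
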

idea 1 - SRS ID rules

  • the config of the SRS contains rules or conditions that determine the count.
  • it starts with that count, which is computed on REGISTRATION.

idea 2 - processing steps control counter

  • processing steps take care of incrementing and decrementing the counter in the normal processing chain by using the REGISTER and UNREGISTER commands to reflect additional or obsolete PTs

Opinion

i like idea 2 better b/c it puts the config of the SRS in the same place that also controls the flow of PRs anyhow. it is just a matter of including an SRS call with the respective command.
in contrast, idea 1 would mean that we have to:

  • duplicate the processing chain logic in some other place
  • implement a rule/condition engine and config.

clustering, complex processing chain

  • complex processing chains are possible as already described in other places. the SRS just needs to be placed in front of the PT and called in the flow of things
  • resequencing in a cluster scenario works OOB, just the setup/config changes.
    • SRS is run on several nodes and shares a cluster-capable map OR
    • if the router sends PRs for the same item always to the same processing node, then the SRS can be local to the processing nodes and set up as normal OR
    • SRS runs on just one node. then
      • all messages from the router need to be sent to the SRS node first for REGISTERING
        it makes sense to have the SRS and router on the same node to avoid changing nodes for the first step.
      • all processing nodes don't call the PT directly in their pipeline but send their result to the SRS node
      • the SRS RESEQUENCE pipeline will call the PT.

Note
i think the SRS will even work if the assumption does not hold that each PT has only one instance and one node that solely accesses the PT. such a setup will then segment all PRs by some scheme that depends on the ID, and then each SRS only has to resequence its one segment and thus: all is well.
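
one possible segmenting scheme is a stable hash of the ID, so that all PRs for the same item end up on the same node or segment. the function name and the choice of SHA-1 are assumptions; any hash that is stable across nodes and restarts would do.

```python
import hashlib


def segment_for(resource_id, segments):
    """Map a resource ID to a segment index in range(segments), stably."""
    digest = hashlib.sha1(resource_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % segments


# every PR for the same ID lands in the same segment, whatever its SN
first = segment_for("file:///docs/a.txt", 4)
second = segment_for("file:///docs/a.txt", 4)
print(first == second)   # True
```

Python's built-in hash() is avoided here on purpose, since it is randomized per process for strings and so would not agree between nodes.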

single point of failure

hm. this is a tough one as the singleness is inherent. i have no clue yet how to solve this, other than to use a fail-over solution. i guess, just like the PT itself, it needs to be monitored closely to detect malfunctions.

scalability and performance

there is some performance degradation to be expected because

  1. SRS increases the number of threads
  2. SRS is inserted at least 2 times into the processing flow, namely at the beginning and end.
    1. when it registers an item
    2. when it resequences it

the internal workings of these steps are fairly simple and should not take much time compared to the rest of the processing, albeit in a highly concurrent scenario the synchronization will take its toll.

Cases introduced thru this solution

this section lists cases and problems introduced by the solution itself that need to be covered. some of the items listed here will also apply to the FRS!

  • handling of unregistered records
    what happens when SEQUENCING a PR that is already at count 0 or does not exist?
  • what to do with records that are missing needed config data?
    handling depends on the process mode:
    • SEQUENCE: error as default, but the outcome could be configurable, such as: DLQ, any other Q
    • REGISTER: error
  • overflow of the SN
    a reset signal must be sent to SRS
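
the handling of missing config data could be sketched like this. the annotation keys, the exception type and the on_sequence_error parameter are assumptions; the returned string only names where the record would be routed.

```python
class MissingConfigError(Exception):
    """Raised when a record lacks the config data the SRS needs."""


def check_config(record, on_sequence_error="error"):
    """Return "ok", or the name of a fallback queue for broken SEQUENCE records."""
    ca = record.get("ConfigAnnotation") or {}
    mode = ca.get("mode")
    if mode in ("REGISTER", "SEQUENCE") and "sn" in ca:
        return "ok"
    if mode == "SEQUENCE" and on_sequence_error != "error":
        return on_sequence_error          # e.g. "DLQ" or any other queue name
    # REGISTER always fails; so do SEQUENCE by default and records without a mode
    raise MissingConfigError(f"missing config data for mode {mode!r}")


good = check_config({"ConfigAnnotation": {"mode": "SEQUENCE", "sn": 3}})
rerouted = check_config({"ConfigAnnotation": {"mode": "SEQUENCE"}},
                        on_sequence_error="DLQ")
print(good, rerouted)   # ok DLQ
```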


PRO

  • smarter ;) than FRS
  • no changes to APIs are needed. the implementation of agents/crawlers (controller) needs to add the SN as an annotation to each record, which can be turned on or off via config.
  • unobtrusive: the SRS can be used or not.

CON

  • see also almost all CONs of the FRS
  • oscillating resources (that constantly change) will never make it into the index.