SMILA/Documentation/Importing/UpdatePusher


Revision as of 09:55, 3 April 2013

Worker description

  • Name: updatePusher
  • Parameters:
    • jobToPushTo: The job to push the crawled records to. If set, a job with this worker will not be started unless the jobToPushTo job is running in the same SMILA instance.
    • remote: A description of a SMILA REST API to push the records to. This parameter must hold a map that contains:
      • endpoints: A list of strings with the hosts and ports of the SMILA servers to push to. For each task, the UpdatePusher sends the records to the first endpoint until it cannot be reached anymore, then fails over to the second endpoint, and so on. It does not return to the first host as long as the failover host is working. If none of the endpoint hosts can be reached for a record, the task fails with a recoverable error so it can be retried later, until the maximum number of retries has been reached. The UpdatePusher uses the SMILA FailoverRestClient for sending the records and handling the failover.
      • urlPath: The REST API to talk to; usually this will be something like /smila/job/indexUpdate/record, i.e. the BulkBuilder REST API. However, it is possible to send added and updated records to any REST API that accepts POST requests with a JSON record in the request body. Deleted records are sent as DELETE requests to <urlPath>?_recordid=<recordid>, so to support delta importing including deletes, the target resource must accept such requests, too. The BulkBuilder REST API is currently the only API that accepts such requests. There is no check whether the URL path specifies a valid resource or a running job, so you can start the crawl job with an invalid URI and it will fail in an unspecified way during execution.
Note: You need to specify either the jobToPushTo parameter or the remote section, otherwise the job definition will be rejected. For example, each of these two fragments would be OK for a valid job definition:
{
  "parameters": {
    ...
    "jobToPushTo": "indexUpdate",
    ...
  }
}

OR

{
  "parameters": {
    ...
    "remote": {
      "endpoints": [ "smila-host-1:8080", "smila-host-2:8080" ],
      "urlPath": "/smila/job/indexUpdate/record"
    },
    ...
  }
}
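The endpoint and URL-path handling described above can be sketched as follows. This is an illustration based on the description, not SMILA's actual FailoverRestClient implementation; the class and method names are assumptions.

```python
# Sketch: turning the "remote" parameters into request URLs with failover.
# Illustration only -- names and structure are assumed, not SMILA's code.
from urllib.parse import quote

class RemoteTarget:
    def __init__(self, endpoints, url_path):
        self.endpoints = endpoints              # e.g. ["smila-host-1:8080", ...]
        self.url_path = "/" + url_path.lstrip("/")
        self.current = 0                        # endpoint currently in use

    def post_url(self):
        # Added/updated records go as POST requests to <endpoint><urlPath>.
        return "http://" + self.endpoints[self.current] + self.url_path

    def delete_url(self, record_id):
        # Deleted records go as DELETE requests to <urlPath>?_recordid=<recordid>.
        return self.post_url() + "?_recordid=" + quote(record_id, safe="")

    def fail_over(self):
        # Switch to the next endpoint and stay there while it works; when all
        # endpoints are exhausted, the task fails with a recoverable error.
        if self.current + 1 >= len(self.endpoints):
            raise ConnectionError("no endpoint reachable")
        self.current += 1
```

Note that the worker stays on a failover endpoint as long as it works, which is why `current` only ever moves forward within a task.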

    • deltaImportStrategy: Configures usage of the DeltaService by this worker. There are four possible values, two of which have the same effect on this worker (see DeltaDelete for an overview):
      • none: do not record delta information, do not perform delta-delete in the completion phase of the job.
      • initial or additive: record delta information, but do not perform delta-delete in the completion phase of the job.
      • full: default mode, record delta information and perform delta-delete in the completion phase of the job.
  • Input Slots:
    • recordToPush: a bucket of type recordBulks containing the records produced by the crawl workflow.
  • Output Slots:
    • pushedRecords: (optional) the records that could be successfully submitted to the destination job. Usually not set, but may be used to trigger further actions on submitted records.
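The effect of the four deltaImportStrategy values listed above can be summarized as two flags, sketched here for clarity (a reading of the list above, not SMILA source code):

```python
# Sketch: what each deltaImportStrategy value means for the UpdatePusher,
# expressed as two flags (derived from the parameter description above).

def delta_flags(strategy):
    """Return (record_delta_info, delta_delete_in_completion_phase)."""
    if strategy == "none":
        return (False, False)
    if strategy in ("initial", "additive"):
        return (True, False)
    if strategy == "full":          # also the default if the parameter is not set
        return (True, True)
    raise ValueError("unknown deltaImportStrategy: %r" % strategy)
```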


The UpdatePusher takes each record from the input and sends it to a bulkbuilder service. If an output bucket is connected, the record is also written to it. If the record contains a _deltaHash attribute value, the worker checks with the DeltaService whether the record has already been pushed, to prevent duplicates, and marks it as updated afterwards (if enabled, see above). If the _deltaHash attribute is empty, the record is always pushed and not marked as updated in the DeltaService.
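The per-record logic just described can be sketched roughly like this. The stand-in classes and method names are assumptions for illustration, not the SMILA API:

```python
# Sketch of the UpdatePusher's per-record logic (illustration only; the
# FakeDeltaService and its method names are assumed, not SMILA's API).

class FakeDeltaService:
    """Minimal stand-in remembering the last pushed _deltaHash per record id."""
    def __init__(self):
        self.hashes = {}

    def needs_push(self, record_id, delta_hash):
        return self.hashes.get(record_id) != delta_hash

    def mark_updated(self, record_id, delta_hash):
        self.hashes[record_id] = delta_hash

def push_record(record, delta, pushed):
    """Push one record; `pushed` stands in for the bulkbuilder POST target."""
    delta_hash = record.get("_deltaHash")
    if delta_hash:
        # Only push if this hash was not pushed before (prevents duplicates),
        # then mark the record as updated in the delta service.
        if not delta.needs_push(record["_recordid"], delta_hash):
            return False
        pushed.append(record)
        delta.mark_updated(record["_recordid"], delta_hash)
    else:
        pushed.append(record)   # empty _deltaHash: always push, no delta marking
    return True
```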

Exception handling of bulkbuilder errors:

  • If an InvalidRecordException is thrown by the Bulkbuilder, it is logged and the record is skipped (and is also not added to the output bulk, if set).
  • Other BulkbuilderExceptions are not caught. If they are marked as recoverable, they should lead to a retry of the task; otherwise the task will fail fatally.
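This error handling can be sketched as follows. The exception names come from the text above; the surrounding code is an assumed illustration, not the actual SMILA worker:

```python
# Sketch of the bulkbuilder error handling described above (illustration only).

class BulkbuilderException(Exception):
    def __init__(self, message, recoverable=False):
        super().__init__(message)
        self.recoverable = recoverable

class InvalidRecordException(BulkbuilderException):
    pass

def push_all(records, push, log):
    # Invalid records are logged and skipped; any other BulkbuilderException
    # propagates, so the task is retried (if recoverable) or fails fatally.
    for record in records:
        try:
            push(record)
        except InvalidRecordException as e:
            log.append("skipped invalid record: %s" % e)
```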

If enabled (parameter deltaImportStrategy="full" or not set), the worker scans the DeltaService in the completion phase of the job run for records that must be sent to the BulkBuilder as "deleted records" and removes these entries from the DeltaService afterwards. See DeltaDelete for details.
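The completion-phase scan can be sketched like this: every record the DeltaService knows about but that was not touched in this job run is sent as a "deleted record", then forgotten. The function and argument names are assumptions for illustration:

```python
# Sketch of the "full" delta-delete completion phase (illustration only;
# `known` stands in for the DeltaService's stored entries).

def complete_delta_delete(known, touched, send_delete):
    """Delete every known record id that this job run did not touch."""
    deleted = []
    for record_id in list(known):
        if record_id not in touched:
            send_delete(record_id)   # DELETE <urlPath>?_recordid=<record_id>
            del known[record_id]     # forget the entry once the delete is sent
            deleted.append(record_id)
    return deleted
```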