SMILA/Documentation/TaskManager

Revision as of 05:31, 11 July 2011 by Drazen.cindric.attensity.com (Talk | contribs) (Configuration)

Job Controlled Taskmanager

Configuration

The Taskmanager is configured through a file named taskmanager.properties. By default it looks like this:

# time before an in-progress task is rolled back when no keepAlives are sent anymore.
task.inprogress.timeToLive=120

task.inprogress.timeToLive: Time in seconds before an in-progress task is finished with a recoverable error result if the worker no longer keeps it alive. The default is 120 seconds. Whether the task is then repeated, committed, or fails completely depends on the worker description and the job manager configuration.
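As a rough illustration of the timeout rule, the following Python sketch parses the property file content and checks whether an in-progress task has outlived its keep-alive window. The function names parse_properties and is_timed_out, the idea of tracking a "last keep-alive" timestamp, and the strict "greater than" comparison are assumptions made for this example; they are not taken from the SMILA implementation.

```python
DEFAULT_TIME_TO_LIVE = 120  # seconds, the documented default


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def is_timed_out(last_keep_alive, now, props):
    """Return True if more than timeToLive seconds passed since the last keepAlive.

    Assumption: timestamps are in seconds and the comparison is strict.
    """
    ttl = int(props.get("task.inprogress.timeToLive", DEFAULT_TIME_TO_LIVE))
    return (now - last_keep_alive) > ttl


config = parse_properties("""
# time before an in-progress task is rolled back when no keepAlives are sent anymore.
task.inprogress.timeToLive=120
""")
```

With the default configuration, a task whose last keep-alive is older than 120 seconds would be treated as timed out and finished with a recoverable error.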

Tasks

JSON format of task

{
  "taskId": "...",
  "workerName": "...",
  <"qualifier": "...">,
  "properties": {
    "name": "value", ...
  },
  "parameters": {
    "name": "value", ...
  },
  "input": {
    "slot1": [ { "bucket": "...", "store": "...", "id": "..." } , ... ],
    "slot2": [ { "bucket": "...", "store": "...", "id": "..." } , ... ],
    ...
  },
  "output": {
    "slot1": [ { "bucket": "...", "store": "...", "id": "..." } , ... ],
    "slot2": [ { "bucket": "...", "store": "...", "id": "..." } , ... ],
    ...
  }
}

A task consists of

  • taskId, workerName, qualifier: needed by the taskmanager to manage the task.
  • properties: used by the jobmanager to associate the task with its job/workflow run. The taskmanager may also add properties:
    • "recoverable": if set to 'false', this task will not be retried after a timeout or after a recoverable worker error.
    • "workerHost": the host on which the worker that requested the task is running (only set if the worker sends its host as a request parameter).
    • "createdTime", "startTime", "taskAge": the time the task was created, the time it was retrieved by the worker, and the difference between the two in milliseconds. Apart from statistical purposes, this can be used, for example, by a worker to decide that a task must be processed instead of being postponed because some age limit has been reached.
  • parameters: worker parameters as defined in the workflow and job definition.
  • input: lists of bulk infos associated with the worker's input slots, describing which data has to be processed to complete this task. May be empty for "initial tasks" (e.g. BulkBuilder).
  • output: lists of bulk infos associated with the worker's output slots, describing where to put the results of the task (currently it is always a single bulk info per slot). May be completely empty (e.g. for the HSSI record deleter worker).
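The age-based decision mentioned for "taskAge" could look roughly like the following sketch in a worker. The age limit, the ready flag, and the function name should_process_now are hypothetical; only the "taskAge" property (in milliseconds) comes from this page. The sketch also assumes that property values may arrive as strings, as the "name": "value" notation above suggests.

```python
MAX_TASK_AGE_MS = 60_000  # hypothetical age limit chosen for this example


def should_process_now(task, ready, max_age_ms=MAX_TASK_AGE_MS):
    """Process if the worker is ready, or if the task is too old to postpone again."""
    # taskAge may be missing (treated as 0) or given as a string; int() handles both.
    age_ms = int(task.get("properties", {}).get("taskAge", 0))
    return ready or age_ms >= max_age_ms


old_task = {"taskId": "t1", "workerName": "w", "properties": {"taskAge": "75000"}}
young_task = {"taskId": "t2", "workerName": "w", "properties": {"taskAge": "10"}}
```

A worker that is not yet ready would process old_task anyway because it exceeds the limit, while young_task could be postponed.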

The "bucket" name in the bulk info is usually irrelevant to the worker; it only needs to read the "store" and "id" to be able to find and create data objects.
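To make the role of the bulk info fields concrete, here is a small Python sketch that extracts only the "store" and "id" values per slot and ignores "bucket". The helper name bulk_locations and all values in example_task are made up for illustration.

```python
def bulk_locations(task, direction="input"):
    """Map each slot name to the (store, id) pairs a worker needs; "bucket" is ignored."""
    slots = task.get(direction, {})  # "input" or "output" may be absent or empty
    return {
        slot: [(bulk["store"], bulk["id"]) for bulk in bulks]
        for slot, bulks in slots.items()
    }


# Illustrative task; all names and IDs are invented for this example.
example_task = {
    "taskId": "example-task",
    "workerName": "exampleWorker",
    "input": {"slot1": [{"bucket": "b-in", "store": "s1", "id": "obj/1"}]},
    "output": {"slot1": [{"bucket": "b-out", "store": "s2", "id": "obj/2"}]},
}
```

Calling bulk_locations(example_task) yields the objects to read, and bulk_locations(example_task, "output") yields the objects to create.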

JSON format of qualifier condition

{ 
  "qualifier": [ "parts/abcd", "parts/1234", ... ]
  <, "workerHost": "...">
}

The qualifier condition is sent in POST requests to /taskmanager/<worker> to fetch only tasks whose "qualifier" field matches one of the given values. This is used, for example, by the HSSI record delete worker to receive only tasks for certain partitions. A task gets its qualifier when the mode "qualifier" is added to an input slot of a worker, in which case the "id" value of the bulk in this slot is used as the qualifier. Note that the qualifier is the complete object ID path, not only the UUID part. Scaling workers must set workerHost so that the taskmanager can control the maximum number of tasks delivered to such workers.
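The following Python sketch shows how such a request body might be assembled and how the matching rule can be read. It does not perform any HTTP request; the function names qualifier_condition and matches, and the host name used below, are hypothetical, and the matching function only mimics the behavior described above rather than reproducing the taskmanager's code.

```python
import json


def qualifier_condition(qualifiers, worker_host=None):
    """Build the JSON body for a POST request to /taskmanager/<worker>."""
    body = {"qualifier": list(qualifiers)}
    if worker_host is not None:  # optional field, required for scaling workers
        body["workerHost"] = worker_host
    return json.dumps(body)


def matches(task, condition_json):
    """Mimic the filter: the task's qualifier must equal one of the requested values."""
    wanted = json.loads(condition_json)["qualifier"]
    return task.get("qualifier") in wanted


# "host-1" is a placeholder host name for this example.
condition = qualifier_condition(["parts/abcd", "parts/1234"], worker_host="host-1")
```

Because the qualifier is the complete object ID path, the values in the list have to be full paths such as "parts/abcd", not just the trailing ID part.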

JSON format of result description

{
  "status": < "SUCCESSFUL", "FATAL_ERROR", "RECOVERABLE_ERROR", "POSTPONE" >,
  "errorCode": "...", 
  "errorMessage": "...",
  "counters": {
    "name": numbervalue,
    ...
  }
}

The result description is added to the "task finished" request so that the JobManager can decide on what to do next based on the result of the task. It consists of:

  • status: one of
    • SUCCESSFUL: the task has been processed completely and successfully. Follow-up tasks can be generated.
    • FATAL_ERROR: the task cannot be processed and should not be repeated, e.g. because the input data is corrupt. This cancels the complete associated workflow run as failed, so no further tasks can be created in this workflow run. (The job run as a whole continues.)
    • RECOVERABLE_ERROR: the task could not be processed now, e.g. because input data was temporarily unavailable. Usually the job manager will repeat this task until a configured retry limit is reached. However, special workers may specify that the task should not be repeated in this case but that follow-up tasks should be created nevertheless.
    • POSTPONE: the worker cannot yet perform this task for some reason (no error, just waiting for a condition in the system), but it should be processed later. The task will be re-added to the todo queue for this worker and redelivered later (usually very soon). There is no limit to retrying a task after postponing it.
  • errorCode/errorMessage: currently only logged. They could be merged into the job run result data in later versions to give an overview of errors that happened during the run.
  • counters: Integer or floating point numbers giving statistical information about the task processing. Not evaluated by task/job manager currently, but may be aggregated in the job run data in a later version.
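As a sketch of how a worker might assemble a result description and how the statuses above map to reactions, consider the following Python example. The function names, the returned action strings, and the choice to omit unused optional fields are assumptions for illustration. The next_action function is a simplified reading of the list above: it ignores the special-worker exception for RECOVERABLE_ERROR, and what happens once the retry limit is reached is not specified on this page.

```python
STATUSES = {"SUCCESSFUL", "FATAL_ERROR", "RECOVERABLE_ERROR", "POSTPONE"}


def result_description(status, error_code=None, error_message=None, counters=None):
    """Build a result description dict; rejects statuses not listed on this page."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    result = {"status": status}
    if error_code is not None:
        result["errorCode"] = error_code
    if error_message is not None:
        result["errorMessage"] = error_message
    if counters:
        result["counters"] = dict(counters)  # integer or floating point values
    return result


def next_action(result, retries_so_far, retry_limit):
    """Simplified view of the reaction to each status (labels are hypothetical)."""
    status = result["status"]
    if status == "SUCCESSFUL":
        return "create follow-up tasks"
    if status == "FATAL_ERROR":
        return "fail workflow run"
    if status == "POSTPONE":
        return "requeue task"  # postponing is not subject to the retry limit
    # RECOVERABLE_ERROR: repeat until the configured retry limit is reached
    return "retry task" if retries_so_far < retry_limit else "retry limit reached"
```

For example, result_description("RECOVERABLE_ERROR", error_message="input not available") combined with next_action(..., retries_so_far=0, retry_limit=3) would suggest repeating the task.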

Internal REST API
