SMILA/Documentation/JobRuns

Revision as of 10:54, 23 January 2012

Note: Available since SMILA 0.9!


Job runs

With a job definition alone, the system is not yet doing anything. First, the job must be started to get a so-called job run. How the actual processing is then triggered depends on the mode of the job run.

Job modes

There are two different modes in which job runs can be operated:

  • Job runs in "standard" mode are triggered with every new object that is dropped into the bucket connected to the start action of the respective workflow (or that the start worker produces from API calls if it has no input bucket, like the Bulkbuilder). They continue until they are finished manually. Once in the FINISHING state, no new workflow runs are accepted, but the active ones continue until completed.
  • Job runs in "runOnce" mode require that the connected workflow has exactly one input bucket in its start action. Unlike job runs in standard mode, they do not react to new objects but process all objects that are currently contained in the respective input bucket and then finish automatically. Once in the FINISHING state, they do not react to further changes and go to SUCCEEDED when all tasks have been processed successfully. If something goes wrong while creating the tasks, the job run goes to state FAILED immediately and no task should be processed at all. All tasks are executed in a single workflow run. The details of the task creation depend on the start action worker, because the task creation is actually done by the worker's task generator. Usually it will create one task for each object in the input bucket.

Start job run

Use a POST request to start a job run in "standard" mode. To start a job run in "runOnce" mode, add the following simple JSON object to the request body:

{
  "mode": "runOnce"
}

Supported operations:

  • POST: Start job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/
  • Allowed methods:
    • POST (with optional request body specifying the mode)
  • Response status codes:
    • 200 OK: Upon successful execution. A JSON object with jobId and url will be returned.

Example:

To start the job named "myJob" in "standard" mode:

POST /smila/jobmanager/jobs/myJob/

The result would be:

HTTP/1.x 200 OK

{
  "jobId" : "20110712-184509666721",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/myJob/20110712-184509666721/"
}

To start the job named "myJob" in "runOnce" mode:

POST /smila/jobmanager/jobs/myJob/

{
  "mode": "runOnce"
}

The result object has the same form as in "standard" mode.
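
The two start requests above could be built from Python roughly as follows. This is a minimal sketch using only the standard library. The helper names `build_start_request` and `start_job_run`, the default base URL, and the `Content-Type` header are assumptions made for illustration and are not part of SMILA.

```python
import json
import urllib.request

# Assumed base URL; adjust host and port to your installation.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"


def build_start_request(job_name, mode=None, base_url=BASE_URL):
    """Build (but do not send) the POST request that starts a job run.

    Without a mode the body is empty, which starts the run in "standard" mode.
    """
    url = f"{base_url}/{job_name}/"
    if mode is None:
        return urllib.request.Request(url, data=b"", method="POST")
    body = json.dumps({"mode": mode}).encode("utf-8")
    # Sending a JSON content type is an assumption, not a documented requirement.
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def start_job_run(job_name, mode=None):
    """Send the request and return the parsed result object (jobId and url)."""
    with urllib.request.urlopen(build_start_request(job_name, mode)) as resp:
        return json.load(resp)
```

Calling `start_job_run("myJob", mode="runOnce")` against a running instance would then return the result object with `jobId` and `url`.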

Monitor a job run or delete job run data

Use a GET request to view job run data of a specific job run. Use DELETE to delete the data of a specific job run.

Job run data:

The following parameters are contained in the job run data:

  • jobId: The ID of the job run.
  • runMode: The mode of the job run, i.e. either STANDARD or RUNONCE.
  • state: The current status of the job run, see Job run life cycle. May be one of the following:
    • PREPARING: started but not running yet
    • RUNNING: running
    • FINISHING: finished but not all tasks processed yet
    • COMPLETING: finished, all tasks processed but job run not completed (e.g. not persisted) yet
    • SUCCEEDED: successfully completed
    • FAILED: failed
    • CANCELING: canceled, but clean-up is not yet completed.
    • CANCELED: canceling done.
  • workflowRuns: Describes the workflow runs which are part of this job run. Note: startedWorkflowRunCount == activeWorkflowRunCount + successfulWorkflowRunCount + failedWorkflowRunCount + canceledWorkflowRunCount
    • startedWorkflowRunCount: The number of started workflow runs.
    • activeWorkflowRunCount: The number of active workflow runs.
    • successfulWorkflowRunCount: The number of successfully finished workflow runs.
    • failedWorkflowRunCount: The number of failed workflow runs.
    • canceledWorkflowRunCount: The number of canceled workflow runs.
  • tasks: Describes the tasks which are part of this job run. After the job run has finished, the following should hold: createdTaskCount == successfulTaskCount + retriedAfterErrorTaskCount + retriedAfterTimeoutTaskCount + failedAfterRetryTaskCount + failedWithoutRetryTaskCount + canceledTaskCount + obsoleteTaskCount. However, this cannot be strictly guaranteed: under very high load, a task may not be counted correctly.
    • createdTaskCount: The number of tasks created in this run. This includes tasks created due to retry.
    • successfulTaskCount: The number of tasks that were finished successfully by a worker.
    • retriedAfterErrorTaskCount: The number of tasks that were retried because a worker finished the task with a recoverable error (e.g. IOError while reading the input or writing the output).
    • retriedAfterTimeoutTaskCount: The number of tasks that were retried because a worker stopped sending the "keepAlive" signal.
    • failedAfterRetryTaskCount: The number of tasks that finally failed after reaching the configured maximum number of retries.
    • failedWithoutRetryTaskCount: The number of tasks that finally failed because the worker finished the task with a fatal error (e.g. due to corrupt input data).
    • canceledTaskCount: The number of tasks that were canceled because a workflow run was canceled or failed due to another task in the workflow run having finally failed. They may have produced their result successfully, but they did not trigger follow-up tasks.
    • obsoleteTaskCount: The number of tasks that became obsolete for some reason. The difference from 'canceledTaskCount' is that becoming obsolete is not triggered by an error in the workflow run or in another task. Instead, the cause or precondition for processing this task is gone.
  • startTime: The timestamp when the job run was started (DateTime format ISO).
  • finishTime: The timestamp when the finish command was called for this job run (DateTime format ISO).
  • endTime: The timestamp when the job status changed to SUCCEEDED, FAILED or CANCELED.
  • worker: Contains accumulated job run data for all workers that have contributed to this job run. It contains:
    • The number of successful, failed, and retried tasks for each worker in this job run (same counter names and meanings as in the global section above).
    • startTime: The timestamp when the first task for a worker of this type was started in the job run (DateTime format ISO).
    • finishTime: The timestamp when the latest task for a worker of this type was finished in the job run (DateTime format ISO). This timestamp is updated with every finished task.
    • The accumulated counters as reported by the workers in their result descriptions.
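
The two counter relations stated for workflowRuns and tasks can be checked on a job run data object with a few lines of Python. This is a sketch under the assumption that the counters appear as nested JSON objects under the keys "workflowRuns" and "tasks", named exactly as listed above. The function names are hypothetical.

```python
# Counters that together should account for every created task.
TASK_OUTCOME_COUNTERS = (
    "successfulTaskCount",
    "retriedAfterErrorTaskCount",
    "retriedAfterTimeoutTaskCount",
    "failedAfterRetryTaskCount",
    "failedWithoutRetryTaskCount",
    "canceledTaskCount",
    "obsoleteTaskCount",
)


def workflow_run_counts_consistent(workflow_runs):
    """Return True if the started count equals the sum of the other counts."""
    return workflow_runs["startedWorkflowRunCount"] == (
        workflow_runs["activeWorkflowRunCount"]
        + workflow_runs["successfulWorkflowRunCount"]
        + workflow_runs["failedWorkflowRunCount"]
        + workflow_runs["canceledWorkflowRunCount"]
    )


def task_count_difference(tasks):
    """Return createdTaskCount minus the sum of all outcome counters.

    A result of 0 is expected after the job run has finished. A non-zero
    result is possible under very high load, so treat it as a hint only.
    Missing counters are treated as 0.
    """
    accounted = sum(tasks.get(name, 0) for name in TASK_OUTCOME_COUNTERS)
    return tasks["createdTaskCount"] - accounted
```

Because the task relation is not strictly guaranteed, a monitoring script would more likely log a non-zero difference than fail on it.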

Supported operations:

  • GET: To monitor the job run.
  • DELETE: To delete job run data.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/
  • Allowed methods:
    • GET
    • DELETE
  • Response status codes:
    • 200 OK: Upon successful execution (GET/DELETE). If no job run with the given job name and job ID exists, DELETE does not return an error.
    • 500 Server Error: If the job run is still running (DELETE).
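
Monitoring usually means polling the job run data until the state reaches SUCCEEDED, FAILED or CANCELED. The following is a minimal sketch of such a loop. It assumes the job run URL has the form returned when the job run was started (.../jobs/<job-name>/<job-id>/). The function names, the poll interval and the poll limit are hypothetical choices.

```python
import json
import time
import urllib.request

# Assumed base URL; adjust host and port to your installation.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"

# States after which the job run data no longer changes state.
END_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def fetch_job_run(job_name, job_id, base_url=BASE_URL):
    """GET the job run data and return it as a dict."""
    with urllib.request.urlopen(f"{base_url}/{job_name}/{job_id}/") as resp:
        return json.load(resp)


def wait_for_end_state(fetch, poll_seconds=2.0, max_polls=100, sleep=time.sleep):
    """Call fetch() repeatedly until the returned state is an end state.

    fetch is any zero-argument callable returning job run data, which keeps
    the loop independent of the HTTP layer. Raises TimeoutError if no end
    state is seen within max_polls attempts.
    """
    for _ in range(max_polls):
        data = fetch()
        if data["state"] in END_STATES:
            return data
        sleep(poll_seconds)
    raise TimeoutError(f"no end state after {max_polls} polls")
```

A typical call would be `wait_for_end_state(lambda: fetch_job_run("myJob", job_id))`. Once an end state is reported, the job run is no longer running, so a DELETE of its data should not hit the 500 response.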

Monitor a job run with details

It is possible to update existing jobs: you can update the job definition, the workflow definition, and the bucket definition. To see which definitions were used during a job run, request additional information with returnDetails=true.

GET /smila/jobmanager/jobs/<job-name>/<job-id>/?returnDetails=true
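
A small helper can build the job run URL with or without the returnDetails parameter. This is only a sketch: the function name and the default base URL are assumptions, and the percent-encoding of the path segments is a defensive choice rather than something the API is documented to require.

```python
from urllib.parse import quote, urlencode

# Assumed base URL; adjust host and port to your installation.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"


def job_run_url(job_name, job_id, return_details=False, base_url=BASE_URL):
    """Return the URL of a job run, optionally asking for the used definitions."""
    # Encode each path segment so that unusual job names cannot break the URL.
    url = f"{base_url}/{quote(job_name, safe='')}/{quote(job_id, safe='')}/"
    if return_details:
        url += "?" + urlencode({"returnDetails": "true"})
    return url
```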

Finish job run

Use a POST request to finish a job run.

Supported operations:

  • POST: finish job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/finish/
  • Allowed methods:
    • POST
  • Response status codes:
    • 202 ACCEPTED: Finishes the job run (asynchronous call)
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors

Cancel job run

Use a POST request to cancel a job run.

Supported operations:

  • POST: cancel job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/cancel/
  • Allowed methods:
    • POST
  • Response status codes:
    • 200 OK: Upon successful execution. The job run is canceled.
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors
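
The finish and cancel operations share the same URL pattern and error codes, so a client can handle both with one helper. The sketch below assumes the requests are sent with an empty body. The names `build_command_request` and `send_command` and the wording of the messages are hypothetical.

```python
import urllib.error
import urllib.request

# Assumed base URL; adjust host and port to your installation.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"

# Error codes documented for both the finish and the cancel operation.
ERROR_MEANINGS = {
    400: "wrong URL pattern",
    404: "job run not found",
    405: "wrong HTTP method used, only POST is accepted",
    410: "job run was finished before and has been moved to the history",
    500: "other error",
}


def build_command_request(job_name, job_id, command, base_url=BASE_URL):
    """Build the POST request for the "finish" or "cancel" command."""
    if command not in ("finish", "cancel"):
        raise ValueError(f"unsupported command: {command}")
    url = f"{base_url}/{job_name}/{job_id}/{command}/"
    return urllib.request.Request(url, data=b"", method="POST")


def send_command(job_name, job_id, command, opener=urllib.request.urlopen):
    """Send the command and return (status code, short description).

    opener is injectable so the function can be exercised without a server.
    """
    request = build_command_request(job_name, job_id, command)
    try:
        with opener(request) as resp:
            # 202 for finish (asynchronous call), 200 for cancel.
            return resp.status, "request accepted"
    except urllib.error.HTTPError as err:
        return err.code, ERROR_MEANINGS.get(err.code, "unexpected status")
```

Note that a 202 response to finish only means the request was accepted. The job run may stay in the FINISHING state until its active workflow runs have completed.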

Job run life cycle

[Figure: job run life cycle diagram (JobLifecycle.png)]

Monitor a workflow run

Use a GET request to monitor a workflow run.

Supported operations:

  • GET: monitor workflow run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/workflowrun/<workflowRun-id>/
  • Allowed methods:
    • GET
  • Response status codes:
    • 200 OK: Upon successful execution.
    • 404 NOT FOUND: If the specified workflow run does not exist. This can mean either that the workflow run existed but has already finished, or that it never existed at all. The two cases cannot be distinguished unless you know that the ID existed before.

Examples:

To monitor a workflow run:

GET /smila/jobmanager/jobs/myJob/20110527_175314695579/workflowrun/1/

If it is still running, the result would be:

HTTP/1.x 200 OK

{
  "activeTaskCount": 1,
  "transientBulkCount": 1
}

If not, the result would be:

HTTP/1.x 404 NOT FOUND
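
Since a 404 response alone does not tell whether a workflow run has finished or never existed, a client can remember which IDs it has already seen as active. The class below is one hypothetical way to do that. It is a sketch of the idea, not part of SMILA, and it only knows about runs it has observed itself.

```python
class WorkflowRunTracker:
    """Interpret workflow run status codes using previously seen IDs."""

    def __init__(self):
        # IDs for which a 200 response has been observed at least once.
        self._seen_active = set()

    def observe(self, run_id, status_code):
        """Record one GET result and return a short interpretation."""
        if status_code == 200:
            self._seen_active.add(run_id)
            return "active"
        if status_code == 404:
            if run_id in self._seen_active:
                # The run existed before, so 404 now means it has finished.
                return "finished"
            return "finished or never existed"
        raise ValueError(f"unexpected status code: {status_code}")
```

For example, a run that was first observed with 200 and later with 404 is reported as "finished", while an ID that was never observed as active stays ambiguous.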