Jump to: navigation, search

SMILA/Documentation/JobRuns

Note.png
Available since SMILA 0.9!


Job runs

With a job definition alone, the system is not yet doing anything. First, the job must be started to get a so called job run. How the actual processing is then triggered, depends on the mode of the job run.

Job modes

There are two different modes in which job runs can be operated:

  • Job runs in "standard" mode are triggered with every new object that is dropped into the bucket connected to the start action of the respective workflow (or, if it has no input bucket, which the start worker produces from other sources like the Bulkbuilder). They continue until they are finished manually. Once in the FINISHING state, no new workflow runs are accepted anymore, but the active ones continue until completed.
  • Job runs in "runOnce" start, create some initial tasks and go immediately to the FINISHING state. They do not react on further input. If something goes wrong while creating the initial tasks, the job run goes to state FAILED immediately and no task will be processed at all. All tasks are executed in a single workflow run, all follow-up tasks of the initial tasks are also part of this workflow run. A consequence of this is that the complete job run will fail, if one task fails fatally, so workers in runOnce jobs should finish tasks with fatal errors only if this is really a critical problem that should cancel the job run. The details of the task creation depend on the start action worker, because the task creation is actually done by the worker's task generator. Currently we have two varieties:
    • Either the workflow has exactly one persistent input bucket in its start action, and for each existing object in this bucket, one task for the startAction worker is generated. Unlike job runs in standard mode, this job does not react on new objects but process all objects that are currently contained in the respective input bucket and then finish automatically.
    • As another possibility the workflow starts with a worker using the runOnceTrigger task generator. In runOnce mode, this task creates one initial task without input data objects, afterwards it behaves just like the default task generator. The worker can then create output bulks at will. An application of this is the importing use case, where the starting crawler worker creates the first bulks from data extracted from an external data source. The input bucket of the start action worker does not need to be persistent in this use case.

Start job run

Use a POST request to start a job run in "standard" mode. To start a job run in "runOnce" mode, add the following simple JSON object to the request body:

{
  "mode": "runOnce"
}

Supported operations:

  • POST: Start job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/
  • Allowed methods:
    • POST (with optional request body specifying the mode)
  • Response status codes:
    • 200 OK: Upon successful execution. A JSON object with jobId and url will be returned.

Example:

To start the job named "myJob" in "standard" mode:

POST /smila/jobmanager/jobs/myJob/

The result would be:

HTTP/1.x 200 OK

{
  "jobId" : "20110712-184509666721",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/myJob/20110712-184509666721/"
}

To start the job named "myJob" in "runOnce" mode:

POST /smila/jobmanager/jobs/myJob/

{
  "mode": "runOnce"
}

The result object would be equal to "standard" mode.

Monitor a job run or delete job run data

Use a GET request to view job run data of a specific job run. Use DELETE to delete the data of a specific job run.

Job run data:

The following parameters are contained in the job run data:

  • jobId: The ID of the job run.
  • mode: The mode of the job run, i.e. either STANDARD or RUNONCE.
  • state: The current status of the job run, see Job run life cycle. May be one of the following:
    • PREPARING: started but not running yet
    • RUNNING: running
    • FINISHING: finished but not all tasks processed yet
    • COMPLETING: finished, all tasks processed but job run not completed (e.g. not persisted) yet
    • SUCCEEDED: successfully completed
    • FAILED: failed
    • CANCELING: canceled, but clean-up is not yet completed.
    • CANCELED: canceling done.
  • workflowRuns: Describes the workflow runs which are part of this job run. Note: startedWorkflowRunCount == activeWorkflowRunCount + successfulWorkflowRunCount + failedWorkflowRunCount + canceledWorkflowRuns
    • startedWorkflowRunCount: The number of started workflow runs.
    • activeWorkflowRunCount: The number of active workflow runs.
    • successfulWorkflowRunCount: The number of successfully finished workflow runs.
    • failedWorkflowRunCount: The number of failed workflow runs.
    • canceledWorkflowRunCount: The number of canceled workflow runs.
  • tasks: Describes the tasks which are part of this job run. After the job has finished it should be createdTaskCount == successfulTaskCount + retriedAfterErrorTaskCount + retriedAfterTimeoutTaskCount + failedAfterRetryTaskCount + failedWithoutRetryTaskCount + canceledTaskCount + obsoleteTaskCount. However, we cannot strictly guarantee this, under very high load it's possible that a task is not counted correctly.
    • createdTaskCount: The number of tasks created in this run. This includes tasks created due to retry.
    • successfulTaskCount: The number of tasks that were finished successfully by a worker.
    • retriedAfterErrorTaskCount: The number of tasks that were retried because a worker finished the task with a recoverable error (e.g. IOError while reading the input or writing the output).
    • retriedAfterTimeoutTaskCount: The number of tasks that were retried because a worker did not send the "keepAlive" signal anymore.
    • failedAfterRetryTaskCount: The number of tasks that finally failed after reaching the configured maximum number of retries.
    • failedWithoutRetryTaskCount: The number of tasks that finally failed because the worker finished the task with a fatal error (e.g. due to corrupt input data).
    • canceledTaskCount: The number of tasks that were canceled because a workflow run was canceled or failed due to another task in the workflow run having finally failed. They may have produced their result successfully, but they did not trigger follow-up tasks.
    • obsoleteTaskCount: The number of tasks that became obsolete for some reason. The difference to 'canceledTaskCount' is that becoming obsolete is not triggered by an error on the workflow run resp. another task. It's just that the cause / precondition to process this task is gone.
  • startTime: The timestamp when the job run was started (DateTime format ISO).
  • finishTime: The timestamp when the finish command was called for this job run (DateTime format ISO).
  • endTime: The timestamp when the job status changed to SUCCEEDED, FAILED or CANCELED.
  • worker: Contains accumulated job run data for all workers that have contributed to this job run. It contains:
    • The number of successful, failed, and retried tasks for each worker in this job run (same counter names and meanings as in the global section above).
    • startTime: The timestamp when the first task for a worker of this type was started in the job run (DateTime format ISO).
    • finishTime: The timestamp when the latest task for a worker of this type was finished in the job run (DateTime format ISO). This timestamp is updated with every finished task.
    • The accumulated counters as reported by the workers in their result descriptions.

Supported operations:

  • GET: To monitor the job run.
  • DELETE: To delete job run data.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-id>/
  • Allowed methods:
    • GET
    • DELETE
  • Response status codes:
    • 200 OK: Upon successful execution (GET/DELETE). If the job run with the given job name and job id does not exist, no error will occur during DELETE.
    • 500 Server Error: If the job run is still running (DELETE).

Monitor a job run with details

It is possible to update existings jobs. You can update the job definition, workflow definition and bucket definition. To see which definitions have been used during a job run you can display additional information with returnDetails=true.

GET /smila/jobmanager/jobs/<job-name>/<job-id>/?returnDetails=true

Finish job run

Use a POST request to finish a job run.

Supported operations:

  • POST: finish job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/finish/
  • Allowed methods:
    • POST
  • Response status codes:
    • 202 ACCEPTED: Finishes the job run (asynchronous call)
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors

Cancel job run

Use a POST request to cancel a job run.

Supported operations:

  • POST: cancel job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/cancel/
  • Allowed methods:
    • POST
  • Response status codes:
    • 200 OK: Upon successful execution. Cancel the job run
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors

Job run life cycle

JobLifecycle.png

Monitor a workflow run

Use a GET request to monitor a workflow run.

Supported operations:

  • GET: monitor workflow run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/workflowrun/<workflowRun-id>/
  • Allowed methods:
    • GET
  • Response status codes:
    • 200 OK: Upon successful execution.
    • 404 NOT FOUND: If the workflow run specified does not exist. This can either mean that the workflow run existed but has already been finished, or that it never existed all. You cannot differentiate both cases without further information unless you can make sure that the ID existed before.

Examples:

To monitor a workflow run:

GET /smila/jobmanager/jobs/myJob/20110527_175314695579/workflowrun/1/

If it is still running, the result would be:

HTTP/1.x 200 OK

{
  "activeTaskCount": 1
  "transientBulkCount": 1
}

If not, the result would be:

HTTP/1.x 404 NOT FOUND