


SMILA/Documentation/JobRuns

Revision as of 09:32, 13 February 2012 by Juergen.schumacher.attensity.com (Talk | contribs) (Job modes)

Job runs

A job definition alone does not make the system do anything. First, the job must be started to get a so-called job run. How the actual processing is then triggered depends on the mode of the job run.


Job run life cycle

[Figure: JobLifecycle-1.1.png (job run life cycle diagram)]

A job run starts in the PREPARING state. In this phase the job run is instantiated, necessary structures in the runtime storage (ZooKeeper) are created and the initial tasks for runOnce jobs are computed. If this has been done successfully, the job run goes to the RUNNING state and the real work of the job begins. When a finish command is sent (either by the user or the jobmanager itself for runOnce jobs) the job run is moved to the FINISHING state, in which no new workflow runs can be created, but only existing tasks and their follow-up tasks are completed. If all workflow runs have been done, a completion phase can follow (state COMPLETING, see below) and finally there is a clean-up phase in which statistics are persisted and the job run structures are removed from the runtime storage. If everything is OK, the job run ends in state SUCCEEDED.

A job run can be canceled if it is in the state PREPARING, RUNNING or FINISHING. Canceling tries to remove or abort all current tasks of this job and immediately finishes the job run. No further tasks will be created in this job, and no data in transient buckets will be removed. It will often also cause many log entries about tasks that could not be finished anymore. Canceling should be used as an emergency exit if a job is running wild or has stopped making progress. The final state of such a job run is CANCELED.

If fatal errors occur in any non-final state, the job run is finished and stored with state FAILED. In particular, this happens if none of the started workflow runs could be completed successfully.

Job Completion Phase

Workers can request to get extra tasks after all the "standard" tasks of a job have been done. The purpose of this is to do clean-up or consolidation work that cannot be triggered before the actual work of the job has been done successfully. An example is the Delta Delete part of the importing jobs: Only after a data source has been crawled completely, it can be determined which elements of the data source have been deleted since a previous import and must be deleted in the import target, too.

A completion job run is initiated if

  • at least one worker used in the workflow requests it: this is done by setting mode requestsCompletion in the worker definition.
  • at least one of the standard workflow runs has been completed successfully, i.e. without the completion run, the job run would have been finished as SUCCEEDED.

Then the taskgenerator of every worker that requests completion is invoked to create completion tasks. The default task generator creates one such task without any input or output bulks attached. The worker can recognize such tasks because the task property isCompletingTask is set to true. These tasks are added to the worker task queue in the TaskManager just like normal tasks. No follow-up tasks are created after these completion tasks have finished, whether successfully or with a fatal error. The retry behaviour on recoverable errors is the same as for normal tasks.

During the completion workflow run, the job run is in state COMPLETING. When all completion tasks are done, the job run is cleaned up (in a short intermediate state named CLEANINGUP) and finally finished (as SUCCEEDED or FAILED). If no worker requests completion, the job run leaves the COMPLETING state immediately and (via CLEANINGUP) moves to SUCCEEDED or FAILED.
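The life cycle described above can be summarized as a small state machine. The following Python sketch is only an illustration derived from the text on this page, not SMILA code: the names JobRunState, TRANSITIONS and can_transition are hypothetical, the CANCELING state is taken from the state list further down this page, and allowing a move to FAILED from every non-final state is a literal reading of the sentence about fatal errors.

```python
from enum import Enum


class JobRunState(Enum):
    PREPARING = "PREPARING"
    RUNNING = "RUNNING"
    FINISHING = "FINISHING"
    COMPLETING = "COMPLETING"
    CLEANINGUP = "CLEANINGUP"
    CANCELING = "CANCELING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


S = JobRunState
FINAL_STATES = {S.SUCCEEDED, S.FAILED, S.CANCELED}

# Regular forward path as described in the text (assumed, not taken from SMILA code).
TRANSITIONS = {
    S.PREPARING: {S.RUNNING},
    S.RUNNING: {S.FINISHING},
    S.FINISHING: {S.COMPLETING},
    S.COMPLETING: {S.CLEANINGUP},
    S.CLEANINGUP: {S.SUCCEEDED, S.FAILED},
    S.CANCELING: {S.CANCELED},
}

# Canceling is only possible in PREPARING, RUNNING or FINISHING.
for state in (S.PREPARING, S.RUNNING, S.FINISHING):
    TRANSITIONS[state].add(S.CANCELING)

# A fatal error in any non-final state ends the job run as FAILED.
for state in S:
    if state not in FINAL_STATES:
        TRANSITIONS[state].add(S.FAILED)


def can_transition(src, dst):
    """Return True if the sketch allows moving from src to dst."""
    return dst in TRANSITIONS.get(src, set())
```

For example, can_transition(S.RUNNING, S.CANCELING) is True in this sketch, while can_transition(S.COMPLETING, S.CANCELING) is False, because the text only allows canceling up to the FINISHING state.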

Start job run

You can use a POST request without a request body to start a job run in default mode (see Job modes to find out which mode is the default mode of the job). To start a job run in a different mode than the default mode, add the following simple JSON object to the request body (in this example the job should be started in runOnce mode):

{
  "mode": "runOnce"
}

The job cannot be started if it requires another job that is currently not running. A required job is a job that is referenced by a parameter that is marked with range="jobName" in one of the workers used in this job.

Supported operations:

  • POST: Start job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/
  • Allowed methods:
    • POST (with optional request body specifying the mode)
  • Response status codes:
    • 200 OK: Upon successful execution. A JSON object with jobId and url will be returned.
    • 400 BAD REQUEST:
      • The taskgenerator, job or workflow definition do not allow the requested mode.
      • The job uses a worker that requires another job to be running, but that job is currently not running.

Example:

To start the job named "myJob" in "default" mode (see above) you don't have to explicitly set a job run mode:

POST /smila/jobmanager/jobs/myJob/

The result would be:

HTTP/1.x 200 OK

{
  "jobId" : "20110712-184509666721",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/myJob/20110712-184509666721/"
}


To start the job named "myJob" in "standard" mode, if that mode is allowed by job and workflow definition:

POST /smila/jobmanager/jobs/myJob/

{
  "mode": "standard"
}

To start the job named "myJob" in "runOnce" mode, if that mode is allowed by job and workflow definition:

POST /smila/jobmanager/jobs/myJob/

{
  "mode": "runOnce"
}

The result object has the same structure as in "standard" mode, i.e. a JSON object with jobId and url.
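The start requests above could be issued from a script roughly as follows. This is a minimal sketch using only the Python standard library. The helper names build_start_request and start_job_run are hypothetical, the default host localhost and port 8080 are taken from the example response above, and sending a Content-Type header of application/json with the body is an assumption that this page does not confirm.

```python
import json
import urllib.request


def build_start_request(job_name, mode=None, host="localhost", port=8080):
    """Build the POST request that starts a job run.

    With mode=None no body is sent, so the job starts in its default mode.
    """
    url = f"http://{host}:{port}/smila/jobmanager/jobs/{job_name}/"
    body = None
    headers = {}
    if mode is not None:
        body = json.dumps({"mode": mode}).encode("utf-8")
        # Assumption: the server accepts JSON bodies declared like this.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def start_job_run(request):
    """Send the request and return the parsed result (jobId and url)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as start_job_run(build_start_request("myJob", "runOnce")) would then correspond to the runOnce example above. A 400 BAD REQUEST answer surfaces as urllib.error.HTTPError and is not handled in this sketch.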

Monitor a job run or delete job run data

Use a GET request to view job run data of a specific job run. Use DELETE to delete the data of a specific job run.

Job run data:

The following parameters are contained in the job run data:

  • jobId: The ID of the job run.
  • mode: The mode of the job run, i.e. either STANDARD or RUNONCE.
  • state: The current status of the job run, see Job run life cycle. May be one of the following:
    • PREPARING: started but not running yet
    • RUNNING: running
    • FINISHING: finish command received, but not all tasks processed yet
    • COMPLETING: finished, all tasks processed but job run not completed (e.g. not persisted) yet
    • SUCCEEDED: successfully completed
    • FAILED: failed
    • CANCELING: canceled, but clean-up is not yet completed.
    • CANCELED: canceling done.
  • workflowRuns: Describes the workflow runs which are part of this job run. Note: startedWorkflowRunCount == activeWorkflowRunCount + successfulWorkflowRunCount + failedWorkflowRunCount + canceledWorkflowRunCount
    • startedWorkflowRunCount: The number of started workflow runs.
    • activeWorkflowRunCount: The number of active workflow runs.
    • successfulWorkflowRunCount: The number of successfully finished workflow runs.
    • failedWorkflowRunCount: The number of failed workflow runs.
    • canceledWorkflowRunCount: The number of canceled workflow runs.
  • tasks: Describes the tasks which are part of this job run. After the job run has finished, the following should hold: createdTaskCount == successfulTaskCount + retriedAfterErrorTaskCount + retriedAfterTimeoutTaskCount + failedAfterRetryTaskCount + failedWithoutRetryTaskCount + canceledTaskCount + obsoleteTaskCount. However, this cannot be strictly guaranteed: under very high load it is possible that a task is not counted correctly.
    • createdTaskCount: The number of tasks created in this run. This includes tasks created due to retry.
    • successfulTaskCount: The number of tasks that were finished successfully by a worker.
    • retriedAfterErrorTaskCount: The number of tasks that were retried because a worker finished the task with a recoverable error (e.g. IOError while reading the input or writing the output).
    • retriedAfterTimeoutTaskCount: The number of tasks that were retried because a worker did not send the "keepAlive" signal anymore.
    • failedAfterRetryTaskCount: The number of tasks that finally failed after reaching the configured maximum number of retries.
    • failedWithoutRetryTaskCount: The number of tasks that finally failed because the worker finished the task with a fatal error (e.g. due to corrupt input data).
    • canceledTaskCount: The number of tasks that were canceled because a workflow run was canceled or failed due to another task in the workflow run having finally failed. They may have produced their result successfully, but they did not trigger follow-up tasks.
    • obsoleteTaskCount: The number of tasks that became obsolete for some reason. The difference to canceledTaskCount is that becoming obsolete is not triggered by an error in the workflow run or in another task. Instead, the cause or precondition for processing this task no longer exists.
  • startTime: The timestamp when the job run was started (DateTime format ISO).
  • finishTime: The timestamp when the finish command was called for this job run (DateTime format ISO).
  • endTime: The timestamp when the job status changed to SUCCEEDED, FAILED or CANCELED.
  • worker: Contains accumulated job run data for all workers that have contributed to this job run. It contains:
    • The number of successful, failed, and retried tasks for each worker in this job run (same counter names and meanings as in the global section above).
    • startTime: The timestamp when the first task for a worker of this type was started in the job run (DateTime format ISO).
    • finishTime: The timestamp when the latest task for a worker of this type was finished in the job run (DateTime format ISO). This timestamp is updated with every finished task.
    • The accumulated counters as reported by the workers in their result descriptions.
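The two counter relations stated for workflowRuns and tasks can be checked mechanically on a job run data object. The sketch below assumes that the counters are nested under the keys workflowRuns and tasks, as the list above suggests. The function name check_counters and all sample values are made up for illustration.

```python
WORKFLOW_RUN_PARTS = (
    "activeWorkflowRunCount",
    "successfulWorkflowRunCount",
    "failedWorkflowRunCount",
    "canceledWorkflowRunCount",
)

TASK_PARTS = (
    "successfulTaskCount",
    "retriedAfterErrorTaskCount",
    "retriedAfterTimeoutTaskCount",
    "failedAfterRetryTaskCount",
    "failedWithoutRetryTaskCount",
    "canceledTaskCount",
    "obsoleteTaskCount",
)


def check_counters(job_run):
    """Return which of the two documented counter relations hold.

    Missing counters are treated as 0. The task relation is only expected
    to hold after the job run has finished, and even then not strictly.
    """
    workflow_runs = job_run.get("workflowRuns", {})
    tasks = job_run.get("tasks", {})
    workflow_sum = sum(int(workflow_runs.get(key, 0)) for key in WORKFLOW_RUN_PARTS)
    task_sum = sum(int(tasks.get(key, 0)) for key in TASK_PARTS)
    return {
        "workflowRuns": int(workflow_runs.get("startedWorkflowRunCount", 0)) == workflow_sum,
        "tasks": int(tasks.get("createdTaskCount", 0)) == task_sum,
    }


# Invented sample data, not output of a real job run.
sample = {
    "workflowRuns": {
        "startedWorkflowRunCount": 5,
        "activeWorkflowRunCount": 1,
        "successfulWorkflowRunCount": 3,
        "failedWorkflowRunCount": 1,
        "canceledWorkflowRunCount": 0,
    },
    "tasks": {
        "createdTaskCount": 10,
        "successfulTaskCount": 7,
        "retriedAfterErrorTaskCount": 1,
        "retriedAfterTimeoutTaskCount": 1,
        "failedAfterRetryTaskCount": 0,
        "failedWithoutRetryTaskCount": 1,
        "canceledTaskCount": 0,
        "obsoleteTaskCount": 0,
    },
}
```

For the sample object, check_counters(sample) reports both relations as satisfied. A False value for tasks on a finished job run is not necessarily an error, since the counts are not strictly guaranteed under high load.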

Supported operations:

  • GET: To monitor the job run.
  • DELETE: To delete job run data.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/
  • Allowed methods:
    • GET
    • DELETE
  • Response status codes:
    • 200 OK: Upon successful execution (GET/DELETE). If the job run with the given job name and job id does not exist, no error will occur during DELETE.
    • 500 Server Error: If the job run is still running (DELETE).

Monitor a job run with details

It is possible to update existing jobs. You can update the job definition, workflow definition and bucket definition. To see which definitions were used during a job run, you can display additional information with returnDetails=true.

GET /smila/jobmanager/jobs/<job-name>/<job-id>/?returnDetails=true
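A client that waits for a job run to end could poll the monitoring URL until the state is one of the final states SUCCEEDED, FAILED or CANCELED. The following sketch is one possible way to do this and is not part of SMILA. The names job_run_url, fetch_job_run and wait_for_final_state, the default host and port, and the polling interval are all assumptions.

```python
import json
import time
import urllib.request

FINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def job_run_url(job_name, job_id, host="localhost", port=8080, return_details=False):
    """Build the monitoring URL, optionally with returnDetails=true."""
    url = f"http://{host}:{port}/smila/jobmanager/jobs/{job_name}/{job_id}/"
    if return_details:
        url += "?returnDetails=true"
    return url


def fetch_job_run(url):
    """GET the job run data and parse the JSON result."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_final_state(fetch, interval=5.0, max_polls=120, sleep=time.sleep):
    """Call fetch() repeatedly until the job run reaches a final state.

    fetch is any callable returning the job run data as a dict, which keeps
    the loop independent of the HTTP client that is actually used.
    """
    for _ in range(max_polls):
        job_run = fetch()
        if job_run.get("state") in FINAL_STATES:
            return job_run
        sleep(interval)
    raise TimeoutError("job run did not reach a final state in time")
```

With a real server this could be used as wait_for_final_state(lambda: fetch_job_run(job_run_url("myJob", job_id))), where job_id is the value returned when the job run was started.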

Finish job run

Use a POST request to finish a job run.

A job run cannot be finished while a dependent job is running. A dependent job is one using a worker that has a parameter marked with range="jobName" referencing this job.

Supported operations:

  • POST: finish job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/finish/
  • Allowed methods:
    • POST
  • Response status codes:
    • 202 ACCEPTED: Finishes the job run (asynchronous call)
    • 400 BAD REQUEST: wrong URL pattern, or another running job requires this job to be running.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors

Cancel job run

Use a POST request to cancel a job run.

Supported operations:

  • POST: cancel job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/cancel/
  • Allowed methods:
    • POST
  • Response status codes:
    • 200 OK: Upon successful execution. The job run is canceled.
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors
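The finish and cancel operations differ only in the last URL segment and in the success code (202 for finish, 200 for cancel). A client could capture the documented status codes in a small lookup table, as in the following sketch. The names action_url and describe_response and the default host and port are hypothetical. The descriptions are paraphrased from the two lists above.

```python
# Status codes shared by the finish and cancel operations.
COMMON_STATUS = {
    404: "job run not found",
    405: "wrong HTTP method used, only POST is accepted",
    410: "job run was finished before and has been moved to the history",
    500: "other error",
}

STATUS_BY_ACTION = {
    "finish": {
        202: "finish accepted, job run is finished asynchronously",
        400: "wrong URL pattern, or another job requires this job to be running",
        **COMMON_STATUS,
    },
    "cancel": {
        200: "job run canceled",
        400: "wrong URL pattern",
        **COMMON_STATUS,
    },
}


def action_url(job_name, job_id, action, host="localhost", port=8080):
    """Build the URL to POST to for 'finish' or 'cancel'."""
    if action not in STATUS_BY_ACTION:
        raise ValueError(f"unsupported action: {action}")
    return f"http://{host}:{port}/smila/jobmanager/jobs/{job_name}/{job_id}/{action}/"


def describe_response(action, status):
    """Translate a response status code into the documented meaning."""
    return STATUS_BY_ACTION[action].get(status, "undocumented status code")
```

Because finish is asynchronous, a 202 answer only means that the command was accepted. The job run state still has to be monitored to see when the job run actually ends.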


Monitor a workflow run

Use a GET request to monitor a workflow run.

Supported operations:

  • GET: monitor workflow run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/workflowrun/<workflowRun-id>/
  • Allowed methods:
    • GET
  • Response status codes:
    • 200 OK: Upon successful execution.
    • 404 NOT FOUND: If the specified workflow run does not exist. This can mean either that the workflow run existed but has already been finished, or that it never existed at all. The two cases cannot be distinguished unless you know from other information that the ID existed before.

Examples:

To monitor a workflow run:

GET /smila/jobmanager/jobs/myJob/20110527_175314695579/workflowrun/1/

If it is still running, the result would be:

HTTP/1.x 200 OK

{
  "activeTaskCount": 1,
  "transientBulkCount": 1
}

If not, the result would be:

HTTP/1.x 404 NOT FOUND
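Since a 404 answer is an expected outcome here (the workflow run has already finished or never existed), a client may want to map it to "no data" instead of treating it as a failure. The sketch below shows one way to do that with the Python standard library. The names interpret_workflow_run and get_workflow_run are hypothetical.

```python
import json
import urllib.error
import urllib.request


def interpret_workflow_run(status, body):
    """Map a monitoring response to a dict, or to None for 404."""
    if status == 200:
        return json.loads(body)
    if status == 404:
        # Finished or unknown workflow run; the two cases look the same.
        return None
    raise ValueError(f"unexpected status code: {status}")


def get_workflow_run(url):
    """GET the workflow run data; urllib reports 404 as HTTPError."""
    try:
        with urllib.request.urlopen(url) as response:
            return interpret_workflow_run(response.status, response.read())
    except urllib.error.HTTPError as error:
        return interpret_workflow_run(error.code, error.read())
```

A caller can then loop while get_workflow_run(url) returns a dict and stop as soon as it returns None.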
