Revision as of 13:53, 12 July 2011

Available since SMILA 0.9!

Job runs

This page is work in progress.

With a job definition alone, the system does not do anything yet. First, the job must be started to get a so-called job run. How the actual processing is then triggered depends on the mode of the job run.

Job modes

There are two different modes in which job runs can be operated:

Job runs in "standard" mode are triggered with every new object that is dropped into the bucket connected to the start action of the respective workflow. They continue until they are finished manually. Once in the FINISHING state, no new workflow runs are accepted anymore, but the active ones continue until completed.
Job runs in "runOnce" mode require that the connected workflow has exactly one input bucket in its start action. Unlike job runs in standard mode, they do not react to new objects but process all objects that are currently contained in the respective input bucket and then finish automatically. Once in the FINISHING state, they do not react to further changes, and they go to SUCCEEDED when all tasks have been processed successfully. If something goes wrong while creating the tasks, the job run goes to state FAILED immediately and no task should be processed at all. All tasks are executed in a single workflow run. The details of the task creation depend on the start action worker, because the task creation is actually done by the worker's task generator. Usually it will create one task for each object in the input bucket.

Start job run

Use a POST request to start a job run in "standard" mode. To start a job run in "runOnce" mode, add the following simple JSON object to the request body:

{
  "mode": "runOnce"
}

Supported operations:

POST: Start job run.

Usage:

URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/
Allowed methods:
- POST (with optional request body specifying the mode)
Response status codes:
- 200 OK: Upon successful execution. A JSON object with jobId and url will be returned.

Example:

To start the job named "myJob" in "standard" mode:

POST /smila/jobmanager/jobs/myJob/

The result would be:

HTTP/1.x 200 OK

{
  "jobId" : "20110712-184509666721",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/myJob/20110712-184509666721/"
}

To start the job named "myJob" in "runOnce" mode:

POST /smila/jobmanager/jobs/myJob/

{
  "mode": "runOnce"
}

The result object has the same structure as in "standard" mode.
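The two start requests above can be sketched in Python with the standard library only. The helper names `build_start_request` and `start_job_run`, the default base URL, and the `Content-Type` header are assumptions made for illustration; only the URL pattern, the optional `mode` body, and the `jobId`/`url` result fields are taken from this page.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; replace host and port with those of your SMILA instance.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"


def build_start_request(job_name, mode=None, base_url=BASE_URL):
    """Build (but do not send) the POST request that starts a job run.

    mode=None sends an empty body (standard mode);
    mode="runOnce" sends the JSON object {"mode": "runOnce"}.
    """
    url = "{}/{}/".format(base_url, urllib.parse.quote(job_name, safe=""))
    if mode is None:
        body = b""
        headers = {}
    else:
        body = json.dumps({"mode": mode}).encode("utf-8")
        # Assumption: the server expects the body to be declared as JSON.
        headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def start_job_run(job_name, mode=None, base_url=BASE_URL):
    """Send the start request and return the parsed result (jobId and url)."""
    request = build_start_request(job_name, mode, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `start_job_run("myJob", mode="runOnce")` against a running instance would return the parsed result object; keeping the request construction in a separate function makes it possible to inspect the request without a server.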

Monitoring a job run or deleting job run data

Use a GET request to view job run data of a specific job run. Use DELETE to delete the data of a specific job run.

Supported operations:

GET: To monitor the job run.
DELETE: To delete job run data.

Usage:

URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/
Allowed methods:
- GET
- DELETE
Response status codes:
- 200 OK: Upon successful execution (GET/DELETE). If the job run with the given job name and job id does not exist, no error will occur during DELETE.
- 500 Server Error: If the job run is still running (DELETE).

The following parameters are contained in the job run data:

jobId: The ID of the job run.
runMode: The mode of the job run, i.e. either STANDARD or RUNONCE.
state: The current status of the job run. May be one of the following:
- PREPARING: started but not running yet
- RUNNING: running
- FINISHING: finish has been requested; no new workflow runs are accepted, but the active ones are not completed yet
- SUCCEEDED: successfully completed
- FAILED: failed
- CANCELING: canceled, but clean-up is not yet completed.
- CANCELED: canceling done.
workflowRuns: Describes the workflow runs which are part of this job run. Note: startedWorkflowRunCount == activeWorkflowRunCount + successfulWorkflowRunCount + failedWorkflowRunCount + canceledWorkflowRunCount
- startedWorkflowRunCount: The number of started workflow runs.
- activeWorkflowRunCount: The number of active workflow runs.
- successfulWorkflowRunCount: The number of successfully finished workflow runs.
- failedWorkflowRunCount: The number of failed workflow runs.
- canceledWorkflowRunCount: The number of canceled workflow runs.
tasks: Describes the tasks which are part of this job run. After the job run has finished, the following should hold: createdTaskCount == successfulTaskCount + retriedAfterErrorTaskCount + retriedAfterTimeoutTaskCount + failedAfterRetryTaskCount + failedWithoutRetryTaskCount + canceledTaskCount + obsoleteTaskCount. However, this cannot be strictly guaranteed: under very high load, a task may not be counted correctly.
- createdTaskCount: The number of tasks created in this run. This includes tasks created due to retry.
- successfulTaskCount: The number of tasks that were finished successfully by a worker.
- retriedAfterErrorTaskCount: The number of tasks that were retried because a worker finished the task with a recoverable error (e.g. IOError while reading the input or writing the output).
- retriedAfterTimeoutTaskCount: The number of tasks that were retried because a worker did not send the "keepAlive" signal anymore.
- failedAfterRetryTaskCount: The number of tasks that finally failed after reaching the configured maximum number of retries.
- failedWithoutRetryTaskCount: The number of tasks that finally failed because the worker finished the task with a fatal error (e.g. due to corrupt input data).
- canceledTaskCount: The number of tasks that were canceled because a workflow run was canceled or failed due to another task in the workflow run having finally failed. They may have produced their result successfully, but they did not trigger follow-up tasks.
- obsoleteTaskCount: The number of tasks that became obsolete for some reason. The difference from 'canceledTaskCount' is that becoming obsolete is not triggered by an error in the workflow run or in another task; the cause or precondition for processing the task simply no longer exists.
startTime: The timestamp when the job run was started (DateTime format ISO).
finishTime: The timestamp when the finish command was called for this job run (DateTime format ISO).
endTime: The timestamp when the job status changed to SUCCEEDED, FAILED or CANCELED.
worker: Contains accumulated job run data for all workers that have contributed to this job run. It contains:
- The number of successful, failed, and retried tasks for each worker in this job run (same counter names and meanings as in the global section above).
- startTime: The timestamp when the first task for a worker of this type was started in the job run (DateTime format ISO).
- finishTime: The timestamp when the latest task for a worker of this type was finished in the job run (DateTime format ISO). This timestamp is updated with every finished task.
- The accumulated counters as reported by the workers in their result descriptions.
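The state values and the two counter equations above lend themselves to a small client-side check. The following Python sketch assumes that `workflowRuns` and `tasks` are nested JSON objects holding the counters listed above and that missing counters count as 0; the function names and the polling parameters are hypothetical and not part of SMILA.

```python
import time

# Final states as listed above; a job run in one of them has ended.
END_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}

WORKFLOW_RUN_PARTS = (
    "activeWorkflowRunCount",
    "successfulWorkflowRunCount",
    "failedWorkflowRunCount",
    "canceledWorkflowRunCount",
)

TASK_PARTS = (
    "successfulTaskCount",
    "retriedAfterErrorTaskCount",
    "retriedAfterTimeoutTaskCount",
    "failedAfterRetryTaskCount",
    "failedWithoutRetryTaskCount",
    "canceledTaskCount",
    "obsoleteTaskCount",
)


def counter_gap(section, total_key, part_keys):
    """Return total minus the sum of its parts; 0 means the counters add up.

    int() is used so that counters delivered as strings are tolerated
    (an assumption of this sketch).
    """
    total = int(section.get(total_key, 0))
    return total - sum(int(section.get(key, 0)) for key in part_keys)


def check_job_run(job_run):
    """Return (workflow_run_gap, task_gap) for parsed job run data."""
    workflow_gap = counter_gap(
        job_run.get("workflowRuns", {}), "startedWorkflowRunCount", WORKFLOW_RUN_PARTS
    )
    task_gap = counter_gap(job_run.get("tasks", {}), "createdTaskCount", TASK_PARTS)
    return workflow_gap, task_gap


def wait_until_ended(fetch_job_run, interval_seconds=5.0, max_polls=120):
    """Poll fetch_job_run() until the state is a final one, then return the data.

    fetch_job_run is any callable that returns the parsed job run data (a dict),
    for example a function that issues the GET request described above.
    """
    for _ in range(max_polls):
        job_run = fetch_job_run()
        if job_run.get("state") in END_STATES:
            return job_run
        time.sleep(interval_seconds)
    raise TimeoutError("job run did not reach a final state in time")
```

A non-zero task gap after the job run has ended is not necessarily an error, since the page states that the task equation cannot be strictly guaranteed under very high load.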

Monitoring a job run with more details

It is possible to update existing jobs. You can update the job definition, the workflow definition, and the bucket definition. To see which definitions were used during a job run, you can display additional information with returnDetails=true.

GET /smila/jobmanager/jobs/<job-name>/<job-id>/?returnDetails=true
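As a minimal sketch, the URL for this request could be assembled in Python as follows. The function name `job_run_url` and the default base URL are assumptions for illustration; only the path pattern and the `returnDetails=true` parameter come from this page.

```python
import urllib.parse

# Assumed base URL; replace host and port with those of your SMILA instance.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"


def job_run_url(job_name, job_id, return_details=False, base_url=BASE_URL):
    """Return the URL of a job run, optionally asking for the used definitions."""
    url = "{}/{}/{}/".format(
        base_url,
        urllib.parse.quote(job_name, safe=""),
        urllib.parse.quote(job_id, safe=""),
    )
    if return_details:
        url += "?" + urllib.parse.urlencode({"returnDetails": "true"})
    return url
```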

Finish job run

Use a POST request to finish a job run.

Supported operations:

POST: finish job run.

Usage:

URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/finish/
Allowed methods:
- POST
Response status codes:
- 202 ACCEPTED: Finishes the job run (asynchronous call)
- 400 BAD REQUEST: wrong URL pattern.
- 404 NOT FOUND: job run not found
- 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
- 410 GONE: job run was finished before and has already been moved to the history of job runs
- 500 INTERNAL SERVER ERROR: other errors

Cancel job run

Use a POST request to cancel a job run.

Supported operations:

POST: cancel job run.

Usage:

URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/cancel/
Allowed methods:
- POST
Response status codes:
- 200 OK: Upon successful execution. The job run is canceled.
- 400 BAD REQUEST: wrong URL pattern.
- 404 NOT FOUND: job run not found
- 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
- 410 GONE: job run was finished before and has already been moved to the history of job runs
- 500 INTERNAL SERVER ERROR: other errors
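Because the finish and cancel requests differ only in the last path segment and in their success code, a client could handle both with one helper. The following Python sketch assumes a default base URL and uses hypothetical function names; the status code meanings are the ones listed above.

```python
import urllib.error
import urllib.parse
import urllib.request

# Assumed base URL; replace host and port with those of your SMILA instance.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"

# Documented success code per action.
SUCCESS_STATUS = {"finish": 202, "cancel": 200}

# Documented error codes, identical for both actions.
ERROR_MEANING = {
    400: "wrong URL pattern",
    404: "job run not found",
    405: "wrong HTTP method used",
    410: "job run already finished and moved to history",
    500: "other error",
}


def build_action_request(job_name, job_id, action, base_url=BASE_URL):
    """Build the POST request for the 'finish' or 'cancel' action of a job run."""
    if action not in SUCCESS_STATUS:
        raise ValueError("action must be 'finish' or 'cancel'")
    url = "{}/{}/{}/{}/".format(
        base_url,
        urllib.parse.quote(job_name, safe=""),
        urllib.parse.quote(job_id, safe=""),
        action,
    )
    return urllib.request.Request(url, data=b"", method="POST")


def describe_status(action, status):
    """Map a status code to 'ok' or to the documented error meaning."""
    if status == SUCCESS_STATUS[action]:
        return "ok"
    return ERROR_MEANING.get(status, "undocumented status")


def send_action(job_name, job_id, action, base_url=BASE_URL):
    """Send the request and return (status, meaning) instead of raising on 4xx/5xx."""
    request = build_action_request(job_name, job_id, action, base_url)
    try:
        with urllib.request.urlopen(request) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code
    return status, describe_status(action, status)
```

Note that a 202 response to a finish request only means the request was accepted; since the call is asynchronous, the job run state has to be checked separately.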

Monitor a workflow run

Use a GET request to monitor a workflow run.

Supported operations:

GET: monitor workflow run.

Usage:

URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/workflowrun/<workflowRun-id>/
Allowed methods:
- GET
Response status codes:
- 200 OK: Upon successful execution.
- 404 NOT FOUND: If the specified workflow run does not exist. This can mean either that the workflow run existed but has already finished, or that it never existed at all. You cannot distinguish the two cases without further information, unless you can make sure that the ID existed before.

Examples:

To monitor a workflow run:

GET /smila/jobmanager/jobs/myJob/20110527_175314695579/workflowrun/1/

If it is still running, the result would be:

HTTP/1.x 200 OK

{
  "activeTaskCount": 1,
  "transientBulkCount": 1
}

If not, the result would be:

HTTP/1.x 404 NOT FOUND
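A client has to treat the 404 response as a regular outcome rather than as a failure. The following Python sketch shows one way to interpret the two documented responses; the function name and the shape of the returned dictionary are assumptions for illustration.

```python
import json


def interpret_workflow_run(status, body=""):
    """Turn the HTTP status and body of a workflow run GET into a small dict."""
    if status == 200:
        data = json.loads(body)
        return {
            "running": True,
            "activeTaskCount": data.get("activeTaskCount"),
            "transientBulkCount": data.get("transientBulkCount"),
        }
    if status == 404:
        # Either the workflow run has finished already or it never existed;
        # the response alone cannot tell the two cases apart.
        return {"running": False}
    raise ValueError("unexpected status {}".format(status))
```

If the workflow run ID is known to have existed before, a result of `{"running": False}` can be read as "finished".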
