SMILA/Documentation/JobRuns

Jobs

Job Runs

Start Job Run

Standard Mode

Use a POST request to start a job run.

Supported operations:

  • POST: start job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>.
  • Allowed methods:
    • POST
  • Response status codes:
    • 200 OK: Upon successful execution. A JSON object with jobId and url is returned.

A job run started this way waits for input: its start action is either a bulkSource worker that listens on some API and requests tasks when it needs one (e.g. BulkBuilder), or a worker that waits for changes in the input buckets of the start action. The job run keeps waiting until someone calls the "finish" command for it (see below). After that, no new workflow runs are started; only the currently active workflow runs are allowed to complete. This is called the "standard" mode of a job run.
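The start call described above can be sketched in Python as follows. This is only a minimal illustration: the host name localhost, the job name myJob, the helper names and the sample response body are assumptions made for the example, and the request is built but not sent.

```python
import json
import urllib.request
from urllib.parse import quote


def build_start_request(hostname: str, job_name: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a job run."""
    url = f"http://{hostname}:8080/smila/jobmanager/jobs/{quote(job_name, safe='')}"
    # No body is attached: without a "mode" entry the job run starts in standard mode.
    return urllib.request.Request(url, method="POST")


def parse_start_response(body: str) -> tuple:
    """Extract jobId and url from the JSON object returned with 200 OK."""
    result = json.loads(body)
    return result["jobId"], result["url"]


request = build_start_request("localhost", "myJob")
print(request.get_method(), request.full_url)

# Illustrative response body only; the real jobId format may differ.
sample = '{"jobId": "run-1", "url": "http://localhost:8080/smila/jobmanager/jobs/myJob/run-1/"}'
print(parse_start_response(sample))
```

To actually send the request against a running instance, the Request object can be passed to urllib.request.urlopen and the response body handed to parse_start_response.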

Starting a job run in a non-standard mode: RUNONCE

If you want to start a job in a different mode than "standard", you must state this in a JSON body of the POST request above:

POST /smila/jobmanager/jobs/<job-name>
{
  "mode": "runOnce"
}

This mode is currently supported:

  • runOnce: Allowed only for jobs whose start action worker has exactly one input bucket. The job manager creates tasks to process the complete content of the input bucket and then immediately sets the new job run to state FINISHING. The job run therefore does not react to further changes in the input bucket, and it goes to SUCCEEDED after all initially created tasks have been processed successfully. If something goes wrong while creating the tasks, the job run goes to state FAILED immediately and no task should be processed. All tasks are executed in a single workflow run. The details of the task creation depend on the start action worker, because the task creation is actually done by the worker's task generator. Usually it creates one task for each object in the input bucket.
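A runOnce start differs from the standard start only in the JSON body. The following Python sketch builds such a request without sending it; the helper name, the host localhost, the job name myJob and the Content-Type header are assumptions for illustration, since the text above only prescribes the body.

```python
import json
import urllib.request
from urllib.parse import quote


def build_run_once_request(hostname: str, job_name: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts a job run in runOnce mode."""
    url = f"http://{hostname}:8080/smila/jobmanager/jobs/{quote(job_name, safe='')}"
    body = json.dumps({"mode": "runOnce"}).encode("utf-8")
    # Assumption: the server expects the body to be declared as JSON.
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, method="POST", headers=headers)


run_once = build_run_once_request("localhost", "myJob")
print(run_once.get_method(), run_once.full_url)
print(run_once.data.decode("utf-8"))
```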

Monitoring a job run or deleting job run data

Use a GET request to view job run data or a DELETE request to delete job run data.

Supported operations:

  • GET: monitor job run.
  • DELETE: delete job run data.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>.
  • Allowed methods:
    • GET
    • DELETE
  • Response status codes:
    • 200 OK: Upon successful execution (GET/DELETE). If no job run with the given job name and job id exists, DELETE does not report an error.
    • 500 Server Error: If the job run is still active, DELETE fails with an error.

The following parameters are returned:

  • jobId: the job id.
  • runMode: the job run mode, currently this can be STANDARD or RUNONCE.
  • state: current state of the job run, may be:
    • PREPARING: started but not running yet
    • RUNNING: running
    • FINISHING: finish has been called, but the job run is not yet completed
    • SUCCEEDED: successfully completed
    • FAILED: failed
    • CANCELING: canceled, but clean-up is not yet completed.
    • CANCELED: canceling done.
  • workflowRuns: Numbers of workflow runs in this job run:
    • startedWorkflowRunCount: describes the number of started workflow runs
    • activeWorkflowRunCount: describes the number of currently active workflow runs.
    • successfulWorkflowRunCount: describes the number of successfully finished workflow runs
    • failedWorkflowRunCount: describes the number of failed workflow runs
    • canceledWorkflowRunCount: describes the number of canceled workflow runs (active runs when a job run was canceled)
    • Hint: startedWorkflowRunCount == activeWorkflowRunCount + successfulWorkflowRunCount + failedWorkflowRunCount + canceledWorkflowRunCount
  • tasks: Numbers of tasks in this job run:
    • createdTaskCount: Number of tasks created in this run. This includes tasks created for retry.
    • successfulTaskCount: Number of tasks that were finished successfully by a worker.
    • retriedAfterErrorTaskCount: Number of tasks that were retried because a worker finished the task with a recoverable error (e.g. IOError reading the input or writing the output)
    • retriedAfterTimeoutTaskCount: Number of tasks that were retried because a worker stopped sending the keepAlive.
    • failedAfterRetryTaskCount: Number of tasks that finally failed after reaching the configured maximum number of retries
    • failedWithoutRetryTaskCount: Number of tasks that finally failed because the worker finished the task with a fatal error (e.g. corrupted input data)
    • canceledTaskCount: Number of tasks that were canceled because a workflow run was canceled, or because it failed after another task in the workflow run finally failed. They may have produced their result successfully, but they did not trigger follow-up tasks.
    • obsoleteTaskCount: Number of tasks that became obsolete. The difference to canceledTaskCount is that becoming obsolete is not triggered by an error in the workflow run or in another task. Instead, the cause or precondition for processing the task is gone.
      • Example: After a partition merge the old partitions do not exist anymore, so delete tasks for the old partitions are now obsolete.
    • Hint: After the job has finished, the following should hold: createdTaskCount == successfulTaskCount + retriedAfterErrorTaskCount + retriedAfterTimeoutTaskCount + failedAfterRetryTaskCount + failedWithoutRetryTaskCount + canceledTaskCount + obsoleteTaskCount. However, this cannot be strictly guaranteed: under very high load it is possible that a task is not counted correctly.
  • startTime: describes when the job run has been started (DateTime format ISO)
  • finishTime: describes when finish job run has been called (DateTime format ISO)
  • endTime: describes when the job status changed to SUCCEEDED, FAILED or CANCELED
  • worker: describes job run information for all workers that have contributed to this job run. It contains
    • the number of successful, failed and retried tasks (if any) for each worker in this job run (same counter names and meanings as in the global section above)
    • startTime: describes when the first task for a worker of this type has been started for this workflow (DateTime format ISO)
    • finishTime: describes when the most recent task for a worker of this type has ended for this workflow (DateTime format ISO)
    • the sums of counters reported by the workers in their result descriptions.
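The two counter hints above can be checked mechanically once the job run data has been parsed. The Python sketch below assumes that the counters are nested under workflowRuns and tasks as the listing suggests and that their values are numbers or numeric strings; the helper names and the sample data are made up for illustration.

```python
END_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}

TASK_OUTCOME_COUNTERS = (
    "successfulTaskCount",
    "retriedAfterErrorTaskCount",
    "retriedAfterTimeoutTaskCount",
    "failedAfterRetryTaskCount",
    "failedWithoutRetryTaskCount",
    "canceledTaskCount",
    "obsoleteTaskCount",
)


def is_end_state(state: str) -> bool:
    """True once the job run has reached SUCCEEDED, FAILED or CANCELED."""
    return state in END_STATES


def workflow_runs_consistent(workflow_runs: dict) -> bool:
    """Check: started == active + successful + failed + canceled."""
    parts = ("activeWorkflowRunCount", "successfulWorkflowRunCount",
             "failedWorkflowRunCount", "canceledWorkflowRunCount")
    started = int(workflow_runs.get("startedWorkflowRunCount", 0))
    return started == sum(int(workflow_runs.get(name, 0)) for name in parts)


def uncounted_tasks(tasks: dict) -> int:
    """Created tasks minus all counted outcomes; 0 is expected after the job has finished."""
    created = int(tasks.get("createdTaskCount", 0))
    return created - sum(int(tasks.get(name, 0)) for name in TASK_OUTCOME_COUNTERS)


# Made-up job run data in the assumed layout.
sample_run = {
    "state": "SUCCEEDED",
    "workflowRuns": {"startedWorkflowRunCount": 5, "activeWorkflowRunCount": 1,
                     "successfulWorkflowRunCount": 3, "failedWorkflowRunCount": 1,
                     "canceledWorkflowRunCount": 0},
    "tasks": {"createdTaskCount": 10, "successfulTaskCount": 7,
              "retriedAfterErrorTaskCount": 1, "failedAfterRetryTaskCount": 1,
              "canceledTaskCount": 1},
}
print(is_end_state(sample_run["state"]))
print(workflow_runs_consistent(sample_run["workflowRuns"]))
print(uncounted_tasks(sample_run["tasks"]))
```

Because the task hint is not strictly guaranteed under very high load, a non-zero result from uncounted_tasks is better treated as a warning than as an error.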

Monitoring a job run with more details

It is possible to update existing jobs: you can update the job definition, the workflow definition and the bucket definition. To see which definitions were used during a job run, request additional information with returnDetails=true.

GET /smila/jobmanager/jobs/<job-name>/<job-id>/?returnDetails=true
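A small Python sketch for assembling this monitoring URL is shown below. The helper name, the host localhost, the job name myJob and the job id run-1 are placeholders chosen for the example, not values taken from a real installation.

```python
from urllib.parse import quote, urlencode


def job_run_url(hostname: str, job_name: str, job_id: str, return_details: bool = False) -> str:
    """Build the URL used to monitor (GET) or delete (DELETE) a job run."""
    url = (f"http://{hostname}:8080/smila/jobmanager/jobs/"
           f"{quote(job_name, safe='')}/{quote(job_id, safe='')}/")
    if return_details:
        # Adds the query parameter that requests the definitions used by the job run.
        url += "?" + urlencode({"returnDetails": "true"})
    return url


print(job_run_url("localhost", "myJob", "run-1"))
print(job_run_url("localhost", "myJob", "run-1", return_details=True))
```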

Finish Job Run

Use a POST request to finish a job run.

Supported operations:

  • POST: finish job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/finish.
  • Allowed methods:
    • POST
  • Response status codes:
    • 202 ACCEPTED: Finishes the job run (asynchronous call)
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run had been finished before and was already moved to job run history
    • 500 INTERNAL SERVER ERROR: other error
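Since finishing is asynchronous, a client usually wants to tell "accepted" apart from the error cases listed above. The following Python sketch maps the documented status codes to short messages; the helper names and the message wording are illustrative assumptions, and no request is sent.

```python
from urllib.parse import quote

# Status codes as listed above for the finish command.
FINISH_STATUS = {
    202: "accepted: the job run is being finished asynchronously",
    400: "bad request: wrong URL pattern",
    404: "not found: job run not found",
    405: "method not allowed: only POST is accepted",
    410: "gone: job run was finished before and moved to the job run history",
    500: "internal server error: other error",
}


def finish_url(hostname: str, job_name: str, job_id: str) -> str:
    """Build the URL of the finish command for one job run."""
    return (f"http://{hostname}:8080/smila/jobmanager/jobs/"
            f"{quote(job_name, safe='')}/{quote(job_id, safe='')}/finish")


def describe_finish_status(status: int) -> str:
    """Translate an HTTP status code of the finish call into a short message."""
    return FINISH_STATUS.get(status, f"unexpected status {status}")


print(finish_url("localhost", "myJob", "run-1"))
print(describe_finish_status(202))
print(describe_finish_status(410))
```

A 202 response only means the finish request was accepted; the job run state still has to be monitored until it reaches SUCCEEDED or FAILED.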


Cancel Job Run

Use a POST request to cancel a job run.

Supported operations:

  • POST: cancel job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/cancel.
  • Allowed methods:
    • POST
  • Response status codes:
    • 200 OK: Upon successful execution. Cancels the job run.
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run had been finished before and was already moved to job run history
    • 500 INTERNAL SERVER ERROR: other error
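The cancel call can be sketched in Python in the same way. The helper names, the host localhost, the job name myJob and the job id run-1 are assumptions for illustration; the request is built but not sent. The state check relies on the CANCELING and CANCELED states described in the monitoring section.

```python
import urllib.request
from urllib.parse import quote


def build_cancel_request(hostname: str, job_name: str, job_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that cancels a job run."""
    url = (f"http://{hostname}:8080/smila/jobmanager/jobs/"
           f"{quote(job_name, safe='')}/{quote(job_id, safe='')}/cancel")
    return urllib.request.Request(url, method="POST")


def cancel_succeeded(status: int) -> bool:
    """Cancel reports success with 200 OK."""
    return status == 200


def cancel_complete(state: str) -> bool:
    """CANCELING means clean-up is still in progress; CANCELED means it is done."""
    return state == "CANCELED"


cancel = build_cancel_request("localhost", "myJob", "run-1")
print(cancel.get_method(), cancel.full_url)
print(cancel_succeeded(200), cancel_complete("CANCELING"))
```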
