Workers and Workflows

Please note that job manager element names (like workers and workflows) must conform to the job manager naming convention:

  • names must only consist of the following characters: a-zA-Z0-9._-

If they do not conform, they won't be accessible in SMILA.

  • Pushing elements with invalid names will result in a 400 Bad Request.
  • Predefined elements with invalid names won't be loaded; a warning will be logged to the SMILA.log file.

E.g.

... WARN  ...  internal.DefinitionPersistenceImpl            - Error parsing predefined worker definitions from configuration area
  org.eclipse.smila.common.exceptions.InvalidDefinitionException: Value 'worker#1' in field 'name' is not valid:
  A name must match pattern ^[a-zA-Z0-9-_\.]+$.
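
For illustration, a hypothetical (and otherwise incomplete) workflow definition with an invalid name would be rejected with 400 Bad Request when pushed via the workflow API described below, because '#' is not an allowed character:

{
  "name":"my#workflow",
  "startAction":{
    "worker":"worker1"
  }
}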

Workers

Worker definition

A worker definition describes the input and output behavior as well as the required parameters of a worker. The definitions are provided with the software and must be known in the system before a worker can be added as an action to a workflow. They cannot be added or edited at runtime and are therefore not intended to be manipulated by the user.

Typically, a worker definition consists of the following parts:

  1. A parameter section declaring the worker's parameters: These parameters must be set either in the workflow or in the job definition when using this worker.
  2. An input slot describing the type of input objects that the worker is able to consume: All input slots must be connected to buckets in a workflow definition that wants to use this worker.
  3. An output slot describing the type of output objects that the worker generates: All output slots must be connected to buckets in a workflow definition that wants to use this worker. An exception to this rule are output slots that were marked as optional in the worker definition or output slots that belong to another slot group (see below).

Slot groups

As an advanced feature, output slots can be associated with a group label. Slots having the same group label then belong to the same group. Grouping is used to define which slots can be used together in the same workflow and which not. Whereas slots that were not associated with a group label can be combined freely because they belong to each group implicitly, it is not possible to use slots from different groups in the same workflow. When using groups, the rules concerning optional and mandatory output slots are as follows:

  • A mandatory slot without a group label must always be connected to a bucket.
  • An optional slot without a group label is allowed in any combination with slots of any group.
  • If a particular group shall be used, all mandatory slots of the group must be connected to a bucket.
  • If each group contains at least one mandatory slot, at least one group must be connected; it is not possible to connect only the slots without a group label.
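
A minimal sketch of what a grouped output section in a worker definition could look like (the slot, type, and group names are made up for illustration; the group label is assumed to be given via the group attribute described below):

  "output" : [ {
    "name" : "records",
    "type" : "recordBulks"
  }, {
    "name" : "addedRecords",
    "type" : "recordBulks",
    "group" : "groupA"
  }, {
    "name" : "deletedRecords",
    "type" : "recordBulks",
    "group" : "groupB"
  } ]

In this sketch, records is a mandatory groupless slot and must always be connected, whereas addedRecords and deletedRecords belong to different groups, so a workflow using this worker connects either of them but never both.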

Worker properties in detail

  • name: Required. Defines the name of the worker. Can be used in a workflow to add the worker as an action.
  • modes: Optional. Sets a mode in the worker, controlling a special behavior. Currently available:
    • bulkSource: describes workers like the Bulkbuilder that get data from somewhere not under JobManager control (e.g. an external API). Such workers are needed to start jobs in standard mode without an input bucket. They are not waiting for tasks to appear in the TaskManager queues for them, but require the JobManager to create "initial tasks" for them on-demand.
    • autoCommit: If this worker fails to process a task, the Job Manager will not retry that task but finishes it as successful and creates follow-up tasks based on what the worker has produced already. An example is the Bulkbuilder again: If a client sends a record to the Bulkbuilder, it assumes that it will be processed, so if the Bulkbuilder fails to finish the task correctly, the records already added to a bulk must be processed by the job nonetheless.
    • runAlways: Task delivery to this worker is not limited by scale-up control: If tasks are available, the worker will be allowed to process as many tasks as the scaleUp limit for this worker specifies (by default 1), even if the global task scale-up limit for this node is already reached. This mode should therefore be used only for workers that perform very important tasks that should not be delayed too much (the internal "_finishingTasks" worker is an example). Be careful about increasing the scaleUp limit for such workers, because this can easily result in an overload on a node, and be especially cautious about runAlways workers that have long-running, computationally intensive tasks.
    • requestsCompletion: describes a worker that wants to add completion tasks after the normal job tasks have finished. See Job Run Life Cycle for details and Import Delta Delete for an example.
    • barrier: Tasks for this worker will not be generated until all tasks for the workers that occur before it in the same workflow run have been finished. Then, all bulks created so far in this workflow run in all buckets connected to the slot will be given to the task generator of this worker in a single call. The purpose of this mode is to support MapReduce workflows. Using more than one barrier worker in a single workflow is supported. Cycles with barriers have not yet been tested ;-)
  • parameters: Optional. Contains a description of the worker's parameters, i.e. the supported parameters, the possible types, cardinalities and values, etc. See SMILA/Documentation/ParameterDefinition for details.
  • taskGenerator: Optional. Defines the name of the OSGi service which should be used to create the tasks whenever there are changes in the respective input buckets. If the taskGenerator is not set, the default task generator is used.
  • input: Optional. Describes the input slots:
    • name: Gives the name of a slot. Has to be bound as a parameter key to an existing bucket in a workflow.
    • type: Gives the required data object type of the input slot. The bucket bound in an actual workflow must comply with this type.
    • modes: Sets the mode(s) of the respective input slot, controlling a special behavior.
      • qualifier: When set, the worker uses "Conditional GET" to select tasks with certain objects for this input slot.
      • optional: When set, no error will occur when adding a workflow that uses this worker as start worker without an input bucket.
        • the worker has to check whether the input slot is connected instead of accessing it directly.
  • output: Optional. Describes the output slots:
    • name: Gives the name of the slot. Has to be bound as a parameter key to an existing bucket in a workflow.
    • type: Gives the required data object type of the output slot. The bucket bound in an actual workflow must comply with this type.
    • group: Gives the group label of this slot (see above).
    • modes: Sets the mode(s) of the respective output slot, controlling a special behavior.
      • optional: When set, no error will occur when adding a workflow that does not bind the output slot to a bucket.
      • multiple: When set, the number of output bulks is not predefined by job management. Instead, the worker can create multiple output bulks based on an object id prefix given by job management.
      • maybeEmpty: When set, no error will occur when a worker doesn't create an output bulk for a processed task.
      • appendable: When set, the bulk has to be created for failsafe append.

Worker definitions can include additional information (e.g. comments or layout information for graphical design tools, etc.), but a GET request will return only relevant information (i.e. the above attributes). If you want to retrieve the additional info that is present in the json file, add returnDetails=true as request parameter.

Example

An exemplary worker definition:

{
  "name" : "exampleWorker",
  "readOnly": true,
  "parameters":[
    { "name": "parameter1" , "optional": "true"},
    { "name": "parameter2" }            
  ],
  "input" : [ {
    "name" : "inputRecords",
    "type" : "recordBulks"
  } ],
  "output" : [ {
    "name" : "outputRecords",
    "type" : "recordBulks"
  }]
}

As workers currently can be defined in the system configuration only, they are all marked as "readOnly" (see SMILA/Documentation/JobManagerConfiguration).

A more complex sample:

{
   "workers":[
      {
         "name":"worker",
         "parameters":[
            {
               "name":"stringParam",
               "optional":true,
               "description":"optional string parameter with default type 'string'"
            },
            {
               "name":"booleanParam",
               "type":"boolean",
               "description":"boolean parameter"
            },
            {
               "name":"enumParam",
               "type":"string",
               "values":[
                  "val1",
                  "val2"
               ],
               "optional":true,
               "description":"optional enum parameter with values 'val1' or 'val2'"
            },
            {
               "name":"mapParam",
               "type":"map",
               "entries":[
                  {
                     "name":"key1",
                     "type":"string"
                  },
                  {
                     "name":"key2",
                     "type":"string"
                  }
               ],
               "description":"map parameter with two entries of type string and keys 'key1' and 'key2'"
            },
            {
               "name":"sequenceOfStringsParam",
               "type":"string",
               "multi":true,
               "description":"a sequence of string parameters"
            },
            {
               "name":"<something>",
               "type":"string",
               "description":"additional parameter with unspecified name"
            },
            {
               "name":"anyParam",
               "type":"any",
               "optional":true,
               "description":"optional parameter with an 'Any' value"
            }
         ]
      }
   ]
}

List workers

All workers

Use a GET request to list all worker definitions.

Supported operations:

  • GET: Returns a list of all worker definitions. If you want to retrieve the additional information (if present), add returnDetails=true as request parameter.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/workers/
  • Allowed methods:
    • GET
  • Response status codes:
    • 200 OK: Upon successful execution.
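
For example, GET http://<hostname>:8080/smila/jobmanager/workers/?returnDetails=true lists all worker definitions including any additional information contained in their configuration files.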

Specific worker

Use a GET request to list the definition of a specific worker.

Supported operations:

  • GET: Returns the definition of the given worker. Optional parameter: returnDetails: true or false (default)

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/workers/<worker-name>/
  • Allowed methods:
    • GET
  • Response status codes:
    • 200 OK: Upon successful execution.
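
For example, GET http://<hostname>:8080/smila/jobmanager/workers/exampleWorker/?returnDetails=true would return the full definition of the exampleWorker shown above, including any additional information.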

Workflows

Workflow definition

A workflow definition describes the individual actions of an asynchronous workflow by connecting workers to input and output slots. Which slots have to be connected depends on the workers you are using and is defined by the worker definition. Typically, all input and output slots of a used worker must be associated to buckets. And, the type of the connected bucket must match that defined in the worker's definition.

A workflow run starts with the start-action. The order of the other actions is determined by their inputs and outputs.

Connecting a workflow to another workflow

A workflow can be linked to another workflow when both share the same persistent bucket. To give an example, let's assume a workflow named A and a workflow named B sharing the same bucket. If the first workflow A then adds an object into the shared bucket, the second workflow B is triggered to process this data. To be able to connect workflow A and B, the following prerequisites must be fulfilled:

  • The shared bucket must be a persistent one.
  • The definition of workflow A must define the shared bucket as an output bucket of an action. This can be any action in the workflow chain, hence, not necessarily the first or the last one.
  • The definition of workflow B must state the shared bucket as the input bucket of its start action. Other positions in the workflow definition will not do.
  • Individual jobs must be created for both the triggering (A) and the triggered workflow (B).
  • The parameters used for the store and object name in the data object type definition of the shared bucket must be identical in both job definitions.
  • The job runs must fulfill the following conditions to allow for the triggering of a connected workflow:
    • The status of the job run using workflow A must be RUNNING or FINISHING.
    • The status of the job run using workflow B must be RUNNING.

Warning: As there is no explicit chaining of workflows, you have to be very careful when using the same bucket name in multiple workflow definitions. This might result in the triggering of jobs which were not meant to be triggered at all.
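
A minimal sketch of two workflow definitions connected via a shared persistent bucket (all worker, slot, and bucket names here are made up for illustration; sharedBucket is assumed to be defined as a persistent bucket):

{
  "name":"workflowA",
  "startAction":{
    "worker":"workerA",
    "input":{
      "inSlotA":"inputBucket"
    },
    "output":{
      "outSlotA":"sharedBucket"
    }
  }
}

{
  "name":"workflowB",
  "startAction":{
    "worker":"workerB",
    "input":{
      "inSlotB":"sharedBucket"
    },
    "output":{
      "outSlotB":"resultBucket"
    }
  }
}

Once separate jobs have been defined and started for both workflows (the job for workflowA in status RUNNING or FINISHING, the job for workflowB in status RUNNING), every object that workflowA writes to sharedBucket triggers processing in the job running workflowB.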

Workflow properties in detail

Description of a workflow:

  • name: Required. Gives the name of the workflow.
  • modes (LIST): Optional. Restricts the modes a job referring to this workflow can be started in and defines the default mode. Possible modes are standard and runOnce.
    • If no modes are given here and none are set in the job definition, all modes can be used to start this workflow in a job; the default mode will be standard if no mode is explicitly provided at job start.
    • The first mode in this list will be used as the default job run mode (i.e. if no mode is provided during job start).
    • A modes section in the workflow can be restricted (or the default mode can be changed) by a modes section in the job definition, but it cannot be expanded in the job definition. See Job modes for more information.
  • parameters (MAP): Optional. Sets the global workflow parameters. They apply to all actions in the workflow as well as to the buckets used by these workers.
  • startAction (MAP): Required. Defines the starting action of the workflow. There can be only one starting action within the workflow.
  • actions (LIST of MAPs): Optional. Defines the follow-up actions of the workflow.
  • timestamp: The (read-only) timestamp that is created by the system when the workflow has been pushed to the system (initial creation or last update). Read-only workflows (i.e. workflows initially loaded from the workflow.json file) have no timestamp property. The value cannot be set manually, it is system defined.
  • Additional properties can be provided, but will only be listed when returnDetails is set to true. This could be used by a designer tool to add layout information or comments.

Description of startAction and actions:

  • worker: Gives the name of a worker. This name must comply with the name given in the worker definition.
  • parameters: Sets the local worker parameters. They apply to the referenced worker but not to the buckets used by this worker.
  • input (MAP): Maps the worker's named input slot(s) to an existing bucket definition. The name of an input slot must be the key and the name of the bucket must be the value of that key. All of the worker's named input slots have to be resolved against existing buckets of the expected type.
  • output (MAP): Maps the worker's named output slot(s) to an existing bucket definition. The name of an output slot must be the key and the name of the bucket must be the value of that key. All of the worker's named output slots have to be resolved against existing buckets of the expected type.

Workflow definitions can include additional information (e.g. comments or layout information for graphical design tools, etc.), but a GET request will return only relevant information (i.e. the above attributes). If you want to retrieve the additional info that is present in the json file or has been posted with the definition, add returnDetails=true as request parameter.

Non-forking workflows

Workflows are called "non-forking" if no two workers in the workflow share the same input bucket. This has an impact on the cleanup of temporary objects during a job run: For non-forking workflows, an input object is removed from a transient bucket directly after the worker has successfully finished its task. For forking workflows, this cleanup is not done before the whole workflow run has finished.
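
For illustration (reusing the bucket and slot names from the example below, plus a hypothetical worker4), the example workflow would become a forking one if a second action consumed the same bucket myBucketB that worker2 already reads:

      {
         "worker":"worker2",
         "input":{
            "slotC":"myBucketB"
         },
         "output":{
            "slotD":"myBucketC"
         }
      },
      {
         "worker":"worker4",
         "input":{
            "slotX":"myBucketB"
         },
         "output":{
            "slotY":"myBucketE"
         }
      }

With such a fork, objects in the transient bucket myBucketB would only be cleaned up after the whole workflow run has finished.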

Example

An exemplary workflow definition:

{
   "name":"myWorkflow",
   "modes": ["runOnce","standard"],
   "parameters":{
      "paramKey2":"paramValue2",
      "paramKey1":"paramValue2"
   },
   "startAction":{
      "worker":"worker1",
      "input":{
         "slotA":"myBucketA"
      },
      "output":{
         "slotB":"myBucketB"
      }
   },
   "actions":[
      {
         "worker":"worker2",
         "parameters":{
            "paramKey3":"paramValue3"
         },
         "input":{
            "slotC":"myBucketB"
         },
         "output":{
            "slotD":"myBucketC"
         }
      },
      {
         "worker":"worker3",
         "input":{
            "slotE":"myBucketC"
         },
         "output":{
            "slotF":"myBucketD"
         }
      }
   ],
   "timestamp" : "2011-07-25T08:57:47.628+0200"
}

List, create, and modify workflows

All workflows

Use a GET request to list the definitions of all workflows. If the timestamps (if present) or any other additional information contained in the definition should also be displayed, the request parameter returnDetails must be set to true. Use POST for adding or updating a workflow definition.

Supported operations:

  • GET: Returns a list of all workflow definitions. If there are no workflows defined, you will get an empty list. Optional request parameter: returnDetails: true or false (default).
  • POST: Create a new workflow definition or update an existing one. If the workflow already exists, it will be updated after successful validation. However, the changes will not apply until the next job run, i.e. the current job run is not influenced by the changes. Only workers for which worker definitions exist can be added to the workflow definition as actions. When adding a worker, all parameters defined in the worker's definition have to be satisfied; if they are not set in the global or local sections of the workflow definition itself, they must be set later in the job definition. Also, all input and output slots have to be connected to existing buckets if they are persistent ones, or at least a bucket name must be provided in case of transient ones. Exceptions to this rule are optional slots, which need not be connected to buckets, and slots of other slot groups, which must not be connected. An error will be thrown:
    • If a required slot is not connected to a bucket.
    • If a referenced bucket, defined as persistent one, does not exist.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/workflows.
  • Allowed methods:
    • GET
    • POST
  • Response status codes:
    • 200 OK: Upon successful execution (GET).
    • 201 CREATED: Upon successful execution (POST).
    • 400 Bad Request: name, startAction are mandatory fields. If they are not set or the name is invalid, an HTTP 400 Bad Request including an error message in the response body will be returned. If a workflow update is requested but results in an error during job validation, then the update will fail with response status 400, as well.
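
For example, a new workflow could be created by POSTing a definition like the following to http://<hostname>:8080/smila/jobmanager/workflows (a reduced variant of the exemplary workflow above; the timestamp is set by the system and therefore not part of the request body):

{
  "name":"myWorkflow",
  "parameters":{
    "paramKey1":"paramValue1",
    "paramKey2":"paramValue2"
  },
  "startAction":{
    "worker":"worker1",
    "input":{
      "slotA":"myBucketA"
    },
    "output":{
      "slotB":"myBucketB"
    }
  }
}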

Specific workflow

Use a GET request to retrieve the definition of a specific workflow. Use DELETE to delete a specific workflow.

Supported operations:

  • GET: Returns the definition of the given workflow.
    • You can set the URL parameter returnDetails to true to return additional information that might have been provided when creating the workflow. If the parameter is omitted or set to false, only the relevant information (name, parameters, startAction, actions, timestamp) is returned.
  • DELETE: Deletes the given workflow.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/workflows/
  • Allowed methods:
    • GET
    • DELETE
  • Response status codes:
    • 200 OK: Upon successful execution (GET, DELETE). If the workflow to be deleted does not exist, you will get 200 anyway (DELETE).
    • 404 Not Found: If the workflow does not exist (GET).
    • 400 Bad Request: If the workflow to be deleted is still referenced by a job definition (DELETE).
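
For example (assuming that, analogously to the worker resource, the workflow name is appended to the URL), GET http://<hostname>:8080/smila/jobmanager/workflows/myWorkflow/?returnDetails=true would return the myWorkflow definition from the example above including all details, and DELETE on http://<hostname>:8080/smila/jobmanager/workflows/myWorkflow/ would remove it, provided it is no longer referenced by a job definition.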
