{{note| Available since SMILA 0.9!}}
 
 
 
= Job Manager =
 
The Job Manager controls the processing logic of [[SMILA/Glossary#W|asynchronous workflows]] in SMILA by regulating the Task Manager, which in turn generates tasks and decides which task should be processed by which [[SMILA/Glossary#W|worker]] and when.
=== What are Asynchronous Workflows? ===

An ''asynchronous workflow'' consists of a set of ''[[SMILA/Glossary#A|actions]]''. Each action connects the input and output ''[[SMILA/Glossary#S|slots]]'' of a ''[[SMILA/Glossary#W|worker]]'' to appropriate ''[[SMILA/Glossary#B|buckets]]''. A bucket is a virtual container of ''data objects'' of the same type. The most common data object type in SMILA is the ''record bulk'', which is simply a concatenated sequence of records (including attachments) stored in the [[SMILA/Documentation/ObjectStore/Bundle_org.eclipse.smila.objectstore|ObjectStore service]]. When a new data object arrives in a bucket connected to the input slot of a worker (usually created by another worker that has this bucket connected to its output slot), a task is created for the worker to process this object and to produce data objects with the results in the buckets connected to its output slots. Thus the workflow (consisting of actions reading from and writing to buckets) describes a data flow of the data objects through the workers. A workflow usually starts with a worker that creates data objects from data sent to a SMILA API (e.g. the [[SMILA/Documentation/Bulkbuilder|Bulkbuilder]] creates bulks of records sent by external or internal clients) or from data extracted from an external data source (e.g. a [[SMILA/Documentation/Importing/Concept|Crawler]] worker). The workflow ends at workers that either have no output buckets or whose output buckets are not connected to input slots of other workers. Afterwards, all temporary data objects created during the workflow are deleted and only the data objects in buckets marked as ''persistent'' remain.
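
To make this more concrete, here is a minimal sketch of what a workflow definition could look like. All worker, bucket, and workflow names (<tt>myProcessingWorker</tt>, <tt>recordBulks</tt>, ...) are illustrative assumptions, not part of a standard SMILA configuration; see [[SMILA/Documentation/WorkerAndWorkflows|Modeling workflows]] for the actual format and the available workers.

<pre>
{
  "name": "myIndexWorkflow",
  "startAction": {
    "worker": "bulkbuilder",
    "output": {
      "insertedRecords": "recordBulks"
    }
  },
  "actions": [
    {
      "worker": "myProcessingWorker",
      "input": {
        "input": "recordBulks"
      },
      "output": {
        "output": "processedBulks"
      }
    }
  ]
}
</pre>

Each entry maps a slot name (left-hand side) to a bucket name (right-hand side): in this sketch, the bulkbuilder writes record bulks into the <tt>recordBulks</tt> bucket, and each new object arriving there triggers a task for <tt>myProcessingWorker</tt>.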
  
A workflow definition is usually still somewhat generic because it does not define all the parameters needed by the workers (e.g. the name of an index to build) and buckets (e.g. the name of the store for temporary data objects) used in the actions. To execute a workflow, a ''[[SMILA/Glossary#J|job]]'' must be defined that sets all these parameters to appropriate values. The job can then be started, which initiates a ''[[SMILA/Glossary#J|job run]]''. As long as the job run is active, new data can be submitted to it and the JobManager will take care that it is processed by the workflow. After receiving the finish command, the job run will not accept any new data, but it will finish processing the already submitted data (''[[SMILA/Glossary#W|workflow runs]]''). Afterwards, the job can be started again, an arbitrary number of times. At any time it is possible to monitor the job run to see how much data has been processed by each worker in a given time period, how many errors have occurred, and how much work is still to be done. After the job run has finished, the monitoring data is persisted for later inspection.
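
As a sketch building on the hypothetical workflow above: a job definition mainly references a workflow and supplies the missing parameter values. The job name and parameters below are again illustrative assumptions; which parameters are actually required depends on the workers and buckets used (see [[SMILA/Documentation/JobDefinitions|Creating jobs]]).

<pre>
{
  "name": "myIndexJob",
  "workflow": "myIndexWorkflow",
  "parameters": {
    "store": "myStore",
    "tempStore": "temp"
  }
}
</pre>

Starting and finishing such a job is done via the JobManager's HTTP API; the request patterns below follow the job run documentation and should be verified against [[SMILA/Documentation/JobRuns|Running and monitoring jobs]]:

<pre>
POST /smila/jobmanager/jobs/myIndexJob/            (start a job run, returns the job run id)
POST /smila/jobmanager/jobs/myIndexJob/<id>/finish (stop accepting new data, let the run complete)
</pre>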
  
Two main components are responsible for making this work: the JobManager knows the workflow and job definitions; it controls the creation of initial and follow-up tasks and accumulates the monitoring data measured with each finished task. The TaskManager knows which tasks are to be done by which worker and which tasks are currently in progress; it delivers tasks to the currently available workers and ensures that a task is repeated if a worker dies while working on it. All of this also works in a cluster of SMILA nodes, so the work can easily and reliably be distributed and parallelized across all nodes.
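
The monitoring data accumulated by the JobManager can be read while a job run is active and after it has finished. Purely as an illustrative sketch of such job run data (the field names and numbers here are assumptions; the real structure is described under [[SMILA/Documentation/JobRuns|Running and monitoring jobs]]):

<pre>
GET /smila/jobmanager/jobs/myIndexJob/<id>/

{
  "jobId": "<id>",
  "state": "RUNNING",
  "workflowRuns": {
    "activeWorkflowRunCount": 2,
    "finishedWorkflowRunCount": 117
  },
  "tasks": {
    "successfulTaskCount": 231,
    "failedTaskCount": 2
  }
}
</pre>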
  
Check out this [[SMILA/Documentation/JobManagerFirstExample|very simple first example]] to see all of this in action.
  
== Common Behavior of JobManager definition APIs ==

SMILA provides APIs to read and write JobManager configuration elements (currently, only buckets, workflows, and job definitions can be written). The pages linked below describe the specific APIs to do this. However, they have some common properties:
* Elements can be defined either in the system configuration or via the APIs. System-defined elements cannot be changed by API calls. Therefore, when read via the API, such elements contain a <code>readOnly</code> flag set to <code>true</code>, and requests to update them will result in an error. You cannot set this flag yourself to protect your own elements from being overwritten; the API will remove it. (See the example response after this list.)
* User-defined elements, on the other hand, contain a <code>timestamp</code> attribute recording when the element was last changed. This can be used by modeling tools to ensure that they do not overwrite changes made by other users. You cannot set this timestamp yourself in an update request; it will be overwritten by the API.
* Additionally, when an update request for an element is performed successfully, the response object will also contain the <code>timestamp</code> attribute generated for this update.
* Apart from the required and optional structure and content of the JobManager elements as specified in the pages linked below, each element can contain additional information as needed by the user. This makes it possible to add comments, descriptions, author information, etc. However, the read APIs include this additional information in the result objects only if invoked with <code>...?returnDetails=true</code>. Otherwise the results will show only the basic information.
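
For illustration, reading a user-defined element with details could look as follows. This is a sketch with assumed names and values; a system-defined element would carry <code>"readOnly": true</code> instead of the <code>timestamp</code> attribute:

<pre>
GET /smila/jobmanager/workflows/myIndexWorkflow?returnDetails=true

{
  "name": "myIndexWorkflow",
  "timestamp": "2013-01-08T07:52:00.000+0100",
  "comment": "a user-supplied field, returned only because returnDetails=true",
  "startAction": {
    "worker": "bulkbuilder",
    "output": { "insertedRecords": "recordBulks" }
  }
}
</pre>
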
See the following pages for examples of all this behavior.
  
 
== Using the Job Manager ==
 
*[[SMILA/Documentation/DataObjectTypesAndBuckets|Creating and managing buckets]]
*[[SMILA/Documentation/WorkerAndWorkflows|Modeling workflows]]
*[[SMILA/Documentation/TaskGenerators|Task Generators]]
*[[SMILA/Documentation/JobDefinitions|Creating jobs]]
*[[SMILA/Documentation/JobRuns|Running and monitoring jobs]]
*[[SMILA/Documentation/JobParameters|Setting parameters]]
*[[SMILA/Documentation/JobManagerFirstExample|A first example]]
