
Note: Available since SMILA 0.9!


Job Manager

The Job Manager controls the processing logic of asynchronous workflows in SMILA by regulating the Task Manager, which in turn generates tasks and decides which task should be processed by which worker and when.

What are Asynchronous Workflows?

Asynchronous workflows consist of a set of actions. Each action connects the input and output slots of a worker to appropriate buckets. A bucket is a virtual container of data objects of the same type. The most common data object type in SMILA is the record bulk, which is just a concatenated sequence of records (including attachments) stored in the ObjectStore service. When a new data object arrives in a bucket connected to the input slot of a worker (usually created by a worker that has the bucket connected to its output slot), a task is created for the worker to process this object and to produce data objects with the results in the buckets connected to its output slots. Thus the workflow (consisting of actions reading from and writing to buckets) describes the flow of data objects through the workers. A workflow usually starts with a worker that creates data objects from data sent to a SMILA API (e.g. the Bulkbuilder creates bulks of records sent by external or internal clients) or from data extracted from an external data source (e.g. a Crawler worker). The workflow ends either when workers have no output buckets or when the output buckets are not connected to input slots of other workers. All temporary data objects created during the workflow are then deleted; only data objects in buckets marked as persistent remain.
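
For illustration, a workflow definition in SMILA is written as a JSON document that wires worker slots to buckets. The following is only a minimal sketch: the bucket names and the second worker are invented for this example, and the exact schema (e.g. the role of the startAction element) is described on the workflow and worker documentation pages.

  {
    "name": "exampleWorkflow",
    "startAction": {
      "worker": "bulkbuilder",
      "output": {
        "insertedRecords": "newRecordsBucket"
      }
    },
    "actions": [
      {
        "worker": "exampleProcessingWorker",
        "input": {
          "recordsToProcess": "newRecordsBucket"
        },
        "output": {
          "processedRecords": "processedRecordsBucket"
        }
      }
    ]
  }

Each entry in an input or output object maps a slot name of the worker to a bucket name: here, every record bulk the Bulkbuilder writes to newRecordsBucket would trigger a task for the hypothetical exampleProcessingWorker, whose results would end up in processedRecordsBucket.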

A workflow definition is usually still somewhat generic because it does not define all the parameters needed by the workers (e.g. the name of an index to build) and buckets (e.g. the name of the store for temporary data objects) used in the actions. To execute a workflow, a job must be defined that sets all these parameters to appropriate values. Then the job can be started, which initiates a job run. As long as the job run is active, new data can be submitted to it, and the JobManager takes care that it is processed by the workflow. Finally, the job run is finished: from then on no new data is accepted, but the job still completes processing of the data that was already submitted (workflow runs). Afterwards the job can be started again. At any time it is possible to monitor the job run and see which worker processes how much data in what time, how many errors have occurred, and how much work is still to be done. After the job run has finished, the monitoring data is persisted for later retrieval.
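
Continuing the sketch above, a job definition references a workflow by name and supplies the missing parameter values. All names and values below are illustrative:

  {
    "name": "exampleJob",
    "workflow": "exampleWorkflow",
    "parameters": {
      "tempStore": "temp",
      "store": "exampleStore"
    }
  }

Starting, monitoring, finishing, and canceling the resulting job run is done via the job REST API described on the pages linked below; in a default setup, for example, a POST to http://localhost:8080/smila/jobmanager/jobs/exampleJob/ would start a job run and return its ID.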

Two main components are responsible for making this work: The JobManager knows the workflow and job definitions, controls the creation of initial and follow-up tasks, and accumulates the monitoring data measured with each finished task. The TaskManager knows which tasks are to be done by which worker and which tasks are currently in progress; it delivers tasks to currently available workers and ensures that a task is repeated if a worker has died while working on it. All of this also works in a cluster of SMILA nodes, so the work can easily be distributed and parallelized.
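
The monitoring data mentioned above can be retrieved as job run data while the job is running or after it has finished. The following response sketch is purely illustrative; the actual counters and field names are defined on the job run data documentation pages:

  {
    "jobId": "20120123-101200-12345",
    "state": "RUNNING",
    "workflowRuns": {
      "activeWorkflowRunCount": 2,
      "successfulWorkflowRunCount": 40
    },
    "tasks": {
      "createdTaskCount": 120,
      "successfulTaskCount": 117,
      "failedTaskCount": 1
    }
  }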

Common Behaviour of JobManager definition APIs

SMILA provides APIs to read and write the JobManager configuration elements (currently you can write only buckets, workflows and job definitions). The pages linked below describe the specific APIs to do this. However, they have some common properties:

  • Elements can be defined either in the system configuration or using the APIs. System-defined elements cannot be changed by API calls. Therefore, when reading such system-defined elements using the API, they contain a readOnly flag set to true, and requests to update them result in an error. You cannot set this flag on elements you create yourself in order to protect them from being overwritten; the API will remove it.
  • User-defined elements, on the other hand, contain a timestamp attribute describing when the element was last changed. This can be used by modelling tools to ensure that they do not overwrite changes made by other users. You cannot set this timestamp yourself in an update request; it will be overwritten by the API.
  • Additionally, when an update request for an element is performed successfully, the response object also contains the timestamp attribute generated for this update.
  • Apart from the required and optional structure and content of the job manager elements specified in the pages linked below, each element can contain additional information as needed by the user. This makes it possible to add comments, descriptions, author information, etc. However, the read APIs include this additional information in the result objects only if invoked with ...?returnDetails=true (see the example below). Otherwise, the results show only the basic information.
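
For example, reading a user-defined workflow with details enabled might return something like the following. The values are illustrative: readOnly and timestamp are set by the API, while the comment field is user-provided additional information.

  GET .../smila/jobmanager/workflows/exampleWorkflow/?returnDetails=true

  {
    "name": "exampleWorkflow",
    "readOnly": false,
    "timestamp": "2012-01-23T10:12:00.000+0100",
    "comment": "workflow used by the example job",
    "startAction": {
      "worker": "bulkbuilder",
      "output": {
        "insertedRecords": "newRecordsBucket"
      }
    }
  }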

See the following pages for examples of all this behaviour.

Using the Job Manager
