
SMILA/Documentation/JobManager


Job Manager

The Job Manager controls the processing logic of asynchronous workflows in SMILA. Together with the Task Manager it generates tasks and decides which task should be processed by which worker and when.

What are Asynchronous Workflows?

An asynchronous workflow consists of a set of actions. Each action connects the input and output slots of a worker to appropriate buckets. A bucket is a virtual container of data objects of the same type. The most common data object type in SMILA is the record bulk, which is simply a concatenated sequence of records (including attachments) stored in the ObjectStore service. When a new data object arrives in a bucket connected to the input slot of a worker (usually created by a worker that has the bucket connected to its output slot), a task is created for that worker to process the object and to produce data objects with the results in the buckets connected to its output slots. Thus the workflow (consisting of actions reading from and writing to buckets) describes the flow of data objects through the workers.

The workflow usually starts with a worker that creates data objects from data sent to a SMILA API (e.g. the Bulkbuilder creates bulks from records sent by external or internal clients) or from data extracted from an external data source (e.g. a Crawler worker). The workflow ends with workers that either have no output buckets or whose output buckets are not connected to input slots of other workers. Afterwards, all temporary data objects created during the workflow are deleted and only the data objects in buckets marked as persistent remain.
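To make this more concrete, here is a minimal sketch of what a workflow definition might look like in the JobManager's JSON format. The worker names (bulkbuilder, pipelineProcessor), slot names, and bucket names are purely illustrative assumptions; the exact schema is described on the workflow API page linked below.

  {
    "name": "exampleWorkflow",
    "startAction": {
      "worker": "bulkbuilder",
      "output": { "insertedRecords": "importBucket" }
    },
    "actions": [
      {
        "worker": "pipelineProcessor",
        "input": { "input": "importBucket" },
        "output": { "output": "processedBucket" }
      }
    ]
  }

In this sketch the bulkbuilder writes record bulks into importBucket, each new bulk triggers a task for the pipelineProcessor, and since no further action reads processedBucket, the workflow ends there (its contents are deleted unless the bucket is marked as persistent).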

A workflow definition is usually still generic because it does not define all the parameters needed by the workers (e.g. the name of an index to build) and buckets (e.g. the name of the store for temporary data objects) used in the actions. To execute a workflow, a job must be defined that sets all these parameters to appropriate values. The job can then be started, which initiates a job run. As long as the job run is active, new data can be submitted to it and the JobManager will take care that it is processed by the workflow. After receiving a finish command, the job run will not accept any new data, but it will finish processing the data that has already been submitted (workflow runs). Afterwards the job can be started again, an arbitrary number of times. At any time it is possible to monitor the job run to see how much data has been processed by each worker during some time period, how many errors have occurred, and how much work is still to be done. After the job run has finished, the monitoring data is persisted for later inspection.
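As a sketch of this lifecycle: a job definition simply binds a workflow to concrete parameter values, and the run is then controlled through the JobManager's REST API. The parameter names and resource paths below are illustrative assumptions; the authoritative payloads and URLs are documented on the JobManager API pages linked below.

  {
    "name": "exampleJob",
    "workflow": "exampleWorkflow",
    "parameters": {
      "tempStore": "temp",
      "store": "persistentStore"
    }
  }

  POST /smila/jobmanager/jobs/exampleJob/             starts a job run and returns its job run ID
  POST /smila/jobmanager/jobs/exampleJob/<id>/finish  stops acceptance of new data; remaining workflow runs are completed
  GET  /smila/jobmanager/jobs/exampleJob/<id>/        returns the monitoring data of the run (counters, errors, state)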

Two main components are responsible for making this work: The JobManager knows the workflow and job definitions; it controls the creation of initial and follow-up tasks and accumulates the monitoring data measured with each finished task. The TaskManager knows which tasks are to be done by which worker and which tasks are currently in progress; it delivers tasks to workers that are currently available and ensures that a task is repeated if a worker dies while working on it. All of this works in a cluster of SMILA nodes as well, so the work can easily and reliably be distributed and parallelized across all nodes.

Check out this very simple first example to see all of this in action.

Common Behavior of JobManager Definition APIs

SMILA provides APIs to read and write JobManager configuration elements. (Currently you can only write buckets, workflows and job definitions). The pages linked below describe the specific APIs to do this. However, they have some common properties:

  • Elements can be defined either in the system configuration or via the APIs. System-defined elements cannot be changed by API calls. Therefore, when such system-defined elements are read via the API, they contain a readOnly flag set to true, and requests to update them result in an error. You cannot set this flag yourself when creating your own elements to protect them from being overwritten; the API will remove it.
  • User-defined elements, on the other hand, contain a timestamp attribute holding the time when the element was last changed. This can be used by modeling tools to ensure that they do not overwrite changes made by other users. You cannot set this timestamp yourself in an update request; it will be overwritten by the API.
  • Additionally, when an update request for an element is performed successfully, the response object also contains the timestamp attribute generated for this update.
  • Apart from the required and optional structure and content of the JobManager elements as specified in the pages linked below, each element can contain additional information as needed by the user. This makes it possible to add comments, descriptions, author information, etc. However, the read APIs include this additional information in the result objects only if invoked with ...?returnDetails=true; otherwise the response contains only the basic information (see the sketch after this list).
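For illustration, reading a user-defined workflow with the details flag might look like the following sketch. The resource path, the element name, the comment field, and the timestamp format are assumptions, and the workflow's startAction and actions are omitted for brevity; a system-defined element would carry "readOnly": true instead of the timestamp.

  GET /smila/jobmanager/workflows/exampleWorkflow/?returnDetails=true

  {
    "name": "exampleWorkflow",
    "timestamp": "2013-01-08T07:52:00.000+0100",
    "comment": "user-provided description, returned only because returnDetails=true was set"
  }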

See the following pages for examples of all this behavior.

Using the Job Manager
