Difference between revisions of "SMILA/Documentation/Bulkbuilder"

 
Latest revision as of 11:50, 23 January 2012

Bulkbuilder (bundle org.eclipse.smila.bulkbuilder)

The Bulkbuilder is the standard entry worker for data to an asynchronous workflow in SMILA.

The Bulk Builder receives single records or micro bulks and combines them into a single bulk for further processing in an asynchronous workflow. New bulks are created from the incoming records based on time and/or bulk size limits (specified either in a configuration file or via job parameters), as follows:

  • a record or micro bulk is sent to the Bulkbuilder, either directly by another Java component in the same VM or by an external client via the HTTP API.
  • if the Bulk Builder has no current task for this job, a new initial task is requested from the Taskmanager.
  • the record or micro bulk is appended to the bulk file specified by the worker's current task.
  • attachments are added to the record in the bulk, not written separately to binary storage.
  • if the bulk exceeds the configured bulk size, the bulk is committed (i.e. the task is finished and the bulk can be processed by follow-up workers). Otherwise, the task remains active and the bulk remains in progress in the Bulk Builder worker.
  • open bulks are checked regularly to see whether they exceed their time constraints, i.e. whether the bulk's age exceeds the configured maximum age. If so, the current bulk is committed and the task finished.
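The commit rule described in the list above can be illustrated with a small sketch. This is not the Bulkbuilder's actual implementation (which is a Java component); the names OpenBulk, append_record and should_commit are hypothetical and only model the size and age checks in memory.

```python
from dataclasses import dataclass, field


@dataclass
class OpenBulk:
    """Hypothetical in-memory stand-in for a bulk that is still being filled."""
    created_at: float          # time (in seconds) at which the bulk was opened
    size_bytes: int = 0        # accumulated size of all appended records
    records: list = field(default_factory=list)


def append_record(bulk: OpenBulk, record_json: str) -> None:
    """Append one serialized record and keep track of the bulk's size."""
    bulk.records.append(record_json)
    bulk.size_bytes += len(record_json.encode("utf-8"))


def should_commit(bulk: OpenBulk, limit_size: int, limit_time: float, now: float) -> bool:
    """Return True if the bulk exceeds its size limit or its maximum age."""
    too_big = bulk.size_bytes > limit_size
    too_old = (now - bulk.created_at) > limit_time
    return too_big or too_old
```

In this model the size check would run after every append, while the age check would be called periodically for all open bulks.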

JavaDoc

This page gives only a rough overview of the service. Please refer to the Bulkbuilder JavaDoc for detailed information about the Java components.

Configuration

The Bulk Builder can be configured via a configuration file named bulkbuilder.properties.

The file looks as follows:

# configuration of Bulk Builder

# maximum number of micro bulks that can be processed in parallel.
# Default is -1 (unlimited)
maxParallelMicroBulks=-1

# maximum size after which to close a pending bulk
# Default is 10m = 10 Mebibytes
bulkLimitSize=10m
# maximum time after which to close a pending bulk in seconds
# Default is 120.
bulkLimitTime=120

Description of parameters:

  • maxParallelMicroBulks
    • the maximum number of micro bulks that may be processed in parallel. Unlimited (-1) by default.
      • Since micro bulks are parsed in memory, a large number of micro bulks can cause OutOfMemoryErrors, so the user should be careful about the amount of data contained in a micro bulk as well as the number of parallel clients pushing micro bulks.
  • bulkLimitSize
    • the size limit for bulks
    • if the bulk size exceeds this limit after a record or micro bulk has been appended to this bulk, the task is finished and the next request will cause a new bulk to be created
    • default size is 10 Mebibytes
  • bulkLimitTime
    • if the age of a bulk exceeds bulkLimitTime seconds, the bulk will be committed, the task will be finished and any future request will cause a new task and bulk to be generated.
    • default age of a bulk is 120 seconds.

The configured bulkLimitSize and bulkLimitTime values can be overridden by job properties, so the limits can be fine-tuned to the expected record sizes or frequencies of each job and can therefore differ from job to job.
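As a rough illustration of how the defaults and per-job overrides might combine, the following sketch reads the properties shown above and resolves the effective limits for one job. It rests on several assumptions: the helper names are hypothetical, the job properties are assumed to use the same keys as the configuration file, and only the "m" (Mebibytes) suffix is confirmed by the example file, so the "k" and "g" suffixes are guesses.

```python
def parse_size(value: str) -> int:
    """Parse a size such as "10m" into bytes.

    Only the "m" suffix appears in the configuration example;
    "k" and "g" are assumed here for illustration.
    """
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = value.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def parse_properties(text: str) -> dict:
    """Read simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def effective_limits(config: dict, job_properties: dict) -> tuple:
    """Return (size limit in bytes, time limit in seconds) for one job.

    Job properties take precedence over the configuration file values.
    """
    merged = {**config, **job_properties}
    return parse_size(merged["bulkLimitSize"]), int(merged["bulkLimitTime"])
```

With the default file and no job properties this yields 10 Mebibytes and 120 seconds; a job that sets its own values would get those instead.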

Bulk Builder definition in workers.json

{
  "name" : "bulkbuilder",
  "modes" : ["bulkSource", "autoCommit"],
  "output" : [{
      "name" : "insertedRecords",
      "type" : "recordBulks",
      "group" : "recordBulks",
      "modes" : ["optional"]
    }, {
      "name" : "deletedRecords",
      "type" : "indexDeletes",
      "group" : "recordBulks",
      "modes" : ["optional"]
    }
  ]
}

After flushing a bulk (either automatically via timing or sizing constraints or triggered by the user), the task is finished and tasks will be generated for the workers connected to the insertedRecords and deletedRecords slots, so they can process the created bulk of records and/or delete requests pushed into the system.

See JobManager for more information on job processing.

Record push REST API

Note that records will only be processed for active jobs. If a job is not in the state "RUNNING" it will not accept new records or micro bulks.

Notes:

  • a record must contain a _recordid metadata attribute.

push a single record or push a request to delete a single record

Use a POST request to push a record to a specific job. Use a DELETE request to request deletion of a specific record in a specific job.

Adding records with attachments is supported by using Multipart POST requests, see SMILA/Documentation/JettyHttpServer#Attachments for details and code example.

Supported operations:

  • POST: push a single record or commit the current bulk.
    • if a request body is present, this is interpreted as the JSON representation of a record and pushed into the system.
    • if no request body is present, the current bulks (i.e. the records and delete requests added to the system) will be flushed and the current task finished.
  • DELETE: request deletion of a single record
    • if the _recordid request parameter is present, then a delete request for the record with this id is appended to the DELETE bulk of the Bulk Builder
    • if the _recordid request parameter is not present, the current bulks will be flushed and the current task will be finished.

Usage:

  • URL: http://<hostname>:8080/smila/job/<job-name>/record/
  • Allowed methods:
    • POST
    • DELETE
  • Response status codes:
    • 202 ACCEPTED: Upon successful execution.
    • 404 NOT FOUND + JSON Body with error message: If the specified job cannot be found or does not have the status RUNNING.
    • 400 BAD REQUEST + JSON Body with error message: If the pushed record/delete request has no _recordid or the record is invalid in another way (e.g. invalid JSON syntax).
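A client for the record API could be sketched as follows, using only the Python standard library. The function names and the job name used in any call are hypothetical, and the Content-Type header is an assumption that the page does not confirm. The functions only build the requests; nothing is sent until a request is passed to urllib.request.urlopen.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def record_url(host: str, job_name: str) -> str:
    """Build the record URL documented above for the given job."""
    return f"http://{host}:8080/smila/job/{quote(job_name, safe='')}/record/"


def push_record_request(host: str, job_name: str, record: dict) -> urllib.request.Request:
    """POST with a JSON body: push a single record."""
    if "_recordid" not in record:
        raise ValueError("a record must contain a _recordid attribute")
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        record_url(host, job_name),
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed header
    )


def delete_record_request(host: str, job_name: str, record_id: str) -> urllib.request.Request:
    """DELETE with a _recordid request parameter: request deletion of one record."""
    query = urlencode({"_recordid": record_id})
    return urllib.request.Request(f"{record_url(host, job_name)}?{query}", method="DELETE")


def commit_request(host: str, job_name: str) -> urllib.request.Request:
    """POST without a request body: flush the current bulks."""
    return urllib.request.Request(record_url(host, job_name), method="POST")
```

When such a request is sent with urllib.request.urlopen, a 202 response is returned normally, whereas 400 and 404 responses are raised as urllib.error.HTTPError, so the JSON error message would be read from the exception object.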

push a micro bulk

Note: a micro bulk consists of a JSON record per line and is thus in itself not valid JSON. Record attachments are not supported in micro bulks.

E.g.

{"_recordid": "id1", "attribute1": "attribute1", ...}
{"_recordid": "id2", "attribute1": "attribute2", ...}
{"_recordid": "id3", "attribute1": "attribute3", ...}

Use a POST request to push a micro bulk to a specific job.

Supported operations:

  • POST: push a micro bulk.
    • if a request body is present, it is interpreted as a micro bulk, i.e. one JSON representation of a record per line, and the records are pushed into the system.
    • if no request body is present, an error will be generated. A micro bulk can be committed using the record API (see above).

Usage:

  • URL: http://<hostname>:8080/smila/job/<job-name>/bulk/
  • Allowed methods:
    • POST
  • Response status codes:
    • 202 ACCEPTED: Upon successful execution.
    • 404 NOT FOUND + JSON Body with error message: If the specified job cannot be found or does not have the status RUNNING.
    • 400 BAD REQUEST + JSON Body with error message: If one of the pushed records in the micro bulk has no _recordid or the micro bulk is invalid in another way (e.g. invalid JSON syntax of a single record, or a record that spans multiple lines of the input).
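Because each record must occupy exactly one line, a client has to serialize the records without embedded line breaks. The sketch below shows one way to build such a body in Python; the function names are hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_micro_bulk(records: list) -> bytes:
    """Serialize records as one JSON object per line."""
    lines = []
    for record in records:
        if "_recordid" not in record:
            raise ValueError("every record in a micro bulk needs a _recordid")
        # json.dumps without indent emits a single line; newlines inside
        # string values are escaped, so a record never spans several lines.
        lines.append(json.dumps(record))
    return ("\n".join(lines) + "\n").encode("utf-8")


def micro_bulk_request(host: str, job_name: str, records: list) -> urllib.request.Request:
    """Build the POST request for the bulk URL documented above."""
    if not records:
        # A POST without a body is answered with an error by this handler.
        raise ValueError("a micro bulk must contain at least one record")
    url = f"http://{host}:8080/smila/job/{job_name}/bulk/"
    return urllib.request.Request(url, data=build_micro_bulk(records), method="POST")
```

Keeping the number of records per micro bulk moderate is advisable, since micro bulks are parsed in memory on the receiving side.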
