Revision as of 05:17, 21 November 2011

Available since SMILA 0.9.0!


BulkBuilder (bundle org.eclipse.smila.bulkbuilder)

The Bulkbuilder is a worker designed to increase throughput of record processing in SMILA in an asynchronous workflow.

The bulk builder receives single records or micro bulks and combines them into one single bulk for further processing in an asynchronous workflow. The bulks are created on a time and/or size basis (specified either by a configuration file or by job parameters) from the incoming records as follows:

  • a record or microbulk is pushed into the system via the bulk builder's handlers
  • if the Bulkbuilder has no current task for this job, a new initial task is requested from the Taskmanager
  • the bulks are created using org.eclipse.smila.objectstore
  • a record or a microbulk is appended to the bulk file specified by the worker's current task
  • attachments are added to the record in the bulks, not written separately to binary storage
  • if the bulk exceeds the configured bulk size, the bulk is committed (i.e. the task is finished and the bulk can be processed by follow-up workers); if not, the task remains active and the bulk remains in progress with the Bulkbuilder worker
  • open bulks are examined regularly to check whether they exceed their time constraint, i.e. whether the bulk's age exceeds the configured maximum age. If so, the current bulk is committed and the task finished.
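The size- and time-based commit logic described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the actual SMILA implementation:

```python
import time

class BulkSketch:
    """Rough sketch of the Bulkbuilder's commit conditions."""

    def __init__(self, limit_size, limit_time):
        self.limit_size = limit_size      # bytes, cf. bulkLimitSize
        self.limit_time = limit_time      # seconds, cf. bulkLimitTime
        self.size = 0
        self.created = time.monotonic()
        self.committed = False

    def append(self, record_bytes):
        # a record or microbulk is appended to the open bulk
        self.size += len(record_bytes)
        if self.size > self.limit_size:
            self.commit()                 # size limit exceeded -> commit

    def check_age(self):
        # called regularly by the worker for open bulks
        if time.monotonic() - self.created > self.limit_time:
            self.commit()                 # age limit exceeded -> commit

    def commit(self):
        # finish the task; follow-up workers can process the bulk
        self.committed = True
```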

JavaDoc

This page gives only a rough overview of the service. Please refer to the Bulkbuilder JavaDoc for detailed information about the Java components.

Configuration

The Bulkbuilder can be configured via a configuration file named bulkbuilder.properties.

The file looks as follows:

# configuration of BulkBuilder

# maximum number of micro bulks that can be processed in parallel.
# Default is -1 (unlimited)
maxParallelMicroBulks=-1

# maximum size after which to close a pending bulk
# Default is 10m = 10 Mebibytes
bulkLimitSize=10m
# maximum time after which to close a pending bulk in seconds
# Default is 120.
bulkLimitTime=120

Description of parameters:

  • maxParallelMicroBulks
    • the maximum number of parallel microbulks allowed. Unlimited (-1) by default.
      • Since microbulks are parsed in memory, a large number of microbulks can cause OutOfMemoryExceptions, so be careful about the amount of data contained in a microbulk as well as the number of clients pushing microbulks in parallel.
  • bulkLimitSize
    • the size limit for bulks
    • if the bulk size exceeds this limit after a record or microbulk has been appended to this bulk, the task is finished and the next request will cause a new bulk to be created
    • default size is 10 Mebibytes
  • bulkLimitTime
    • if the age of a bulk exceeds bulkLimitTime seconds, the bulk will be committed, the task will be finished and any future request will cause a new task and bulk to be generated.
    • default age of a bulk is 120 seconds.

The configured bulkLimitSize and bulkLimitTime values can be overridden by job parameters, so the limits can be fine-tuned to the expected record sizes or frequencies of the different jobs and thus differ from job to job.
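As a sketch, such an override could be placed in the job definition's parameters. The job and workflow names below are assumptions for illustration:

```json
{
  "name": "indexUpdateJob",
  "workflow": "indexUpdateWorkflow",
  "parameters": {
    "bulkLimitSize": "50m",
    "bulkLimitTime": 300
  }
}
```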

Bulkbuilder definition in workers.json

{
  "name" : "bulkbuilder",
  "modes" : ["bulkSource", "autoCommit"],
  "output" : [{
      "name" : "insertedRecords",
      "type" : "recordBulks",
      "group" : "recordBulks",
      "modes" : ["optional"]
    }, {
      "name" : "deletedRecords",
      "type" : "indexDeletes",
      "group" : "recordBulks",
      "modes" : ["optional"]
    }
  ]
}

After flushing a bulk (either automatically via timing or sizing constraints or triggered by the user), the task is finished and tasks will be generated for the workers connected to the insertedRecords and deletedRecords slots, so they can process the created bulk of records and/or delete requests pushed into the system.

See JobManager for more information on job processing.

Record push REST API

Note that records will only be processed for active jobs. If a job is not in the state "RUNNING" it will not accept new records or micro bulks.

Notes:

  • a record must contain a _recordid metadata attribute.
  • attachments are not yet supported by the REST API.

push a single record or push a request to delete a single record

Use a POST request to push a record to a specific job. Use a DELETE request to request deletion of a specific record in a specific job.

Supported operations:

  • POST: push a single record or commit the current bulk.
    • if a request body is present, this is interpreted as the JSON representation of a record and pushed into the system.
    • if no request body is present, the current bulks (i.e. the records and delete requests added to the system) will be flushed and the current task finished.
  • DELETE: request deletion of a single record
    • if the _recordid request parameter is present, then a delete request for the record with this id is appended to the DELETE bulk of the bulkbuilder
    • if the _recordid request parameter is not present, the current bulks will be flushed and the current task will be finished.

Usage:

  • URL: http://<hostname>:8080/smila/job/<job-name>/record/.
  • Allowed methods:
    • POST
    • DELETE
  • Response status codes:
    • 202 ACCEPTED: Upon successful execution.
    • 404 NOT FOUND + JSON body with error message: If the specified job cannot be found or does not have the status RUNNING.
    • 400 BAD REQUEST + JSON Body with error message: If the pushed record/delete request has no _recordid or the record is invalid in another way (e.g. invalid JSON syntax).
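A record push can be exercised with Python's standard library as below. The host, port, and job name indexUpdate are assumptions; the final urlopen call is commented out so the snippet does not require a running SMILA instance:

```python
import json
import urllib.request

# Hypothetical host and job name -- adjust to your setup.
url = "http://localhost:8080/smila/job/indexUpdate/record/"

record = {"_recordid": "id1", "attribute1": "value1"}  # _recordid is required

req = urllib.request.Request(
    url,
    data=json.dumps(record).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # expects HTTP 202 ACCEPTED on success
```

A DELETE request for a single record would instead use method="DELETE" with the _recordid passed as a request parameter.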

push a micro bulk

Note: a microbulk consists of a JSON record per line and is thus in itself not valid JSON.

E.g.

{"_recordid": "id1", "attribute1": "attribute1", ...}
{"_recordid": "id2", "attribute1": "attribute2", ...}
{"_recordid": "id3", "attribute1": "attribute3", ...}

Use a POST request to push a microbulk to a specific job.

Supported operations:

  • POST: push a microbulk.
    • if a request body is present, it is interpreted as a microbulk (one JSON record per line) and pushed into the system.
    • if no request body is present, an error will be generated. The current bulk can be committed using the record API (see above).

Usage:

  • URL: http://<hostname>:8080/smila/job/<job-name>/bulk/.
  • Allowed methods:
    • POST
  • Response status codes:
    • 202 ACCEPTED: Upon successful execution.
    • 404 NOT FOUND + JSON body with error message: If the specified job cannot be found or does not have the status RUNNING.
    • 400 BAD REQUEST + JSON body with error message: If one of the pushed records in the microbulk has no _recordid or the microbulk is invalid in another way (e.g. invalid JSON syntax of a single record, or a record spanning multiple lines of the input).
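Building and posting a microbulk can be sketched the same way. The job name indexUpdate and host are assumptions; note that the body is newline-delimited JSON, so it is intentionally not a single valid JSON document:

```python
import json
import urllib.request

records = [
    {"_recordid": "id1", "attribute1": "attribute1"},
    {"_recordid": "id2", "attribute1": "attribute2"},
]

# One JSON record per line -- the microbulk format described above.
microbulk = "\n".join(json.dumps(r) for r in records)

req = urllib.request.Request(
    "http://localhost:8080/smila/job/indexUpdate/bulk/",  # hypothetical job name
    data=microbulk.encode("utf-8"),
    method="POST",
)
# urllib.request.urlopen(req)  # expects HTTP 202 ACCEPTED on success
```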
