SMILA/Documentation/Bulkbuilder

Bulkbuilder (bundle org.eclipse.smila.bulkbuilder)

The Bulkbuilder is the standard entry worker for data to an asynchronous workflow in SMILA.

The Bulk Builder receives single records or micro bulks and combines them into a single bulk for further processing in an asynchronous workflow. Bulks are closed based on time and/or size limits, which are specified either in a configuration file or via job parameters. New bulks are created from the incoming records as follows:

  • a record or micro bulk is sent to the Bulk Builder, either directly by another Java component in the same VM or by an external client via the HTTP API.
  • if the Bulk Builder has no current task for this job, a new initial task is requested from the Taskmanager.
  • the record or micro bulk is appended to the bulk file specified by the worker's current task.
  • attachments are added to the record in the bulk, not written separately to binary storage.
  • if the bulk exceeds the configured bulk size, the bulk is committed, i.e. the task is finished and the bulk can be processed by follow-up workers. Otherwise the task remains active and the bulk stays in progress in the Bulk Builder worker.
  • open bulks are checked regularly to see whether they exceed their time constraints, i.e. whether the bulk's age exceeds the configured maximum age. If so, the current bulk is committed and the task finished.
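The steps above can be modelled roughly as follows. This is a simplified Python sketch, not SMILA's implementation: the class BulkModel and its method names are hypothetical, task handling is reduced to opening and closing an in-memory list, and the clock is injectable only so the behaviour is easy to try out.

```python
import time


class BulkModel:
    """Toy model of one job's open bulk (hypothetical, for illustration only)."""

    def __init__(self, limit_size=10 * 1024 * 1024, limit_time=120.0,
                 clock=time.monotonic):
        self.limit_size = limit_size    # bytes, mirrors the bulkLimitSize default
        self.limit_time = limit_time    # seconds, mirrors the bulkLimitTime default
        self.clock = clock
        self.committed = []             # finished bulks, each a list of records
        self._records = []
        self._size = 0
        self._opened_at = None          # None means "no current task"

    def append(self, *records: bytes) -> None:
        """Append one record, or several records forming a micro bulk."""
        if self._opened_at is None:
            # Stands in for requesting a new initial task.
            self._opened_at = self.clock()
        for record in records:
            self._records.append(record)
            self._size += len(record)
        # The size limit is checked after the append, so a bulk may exceed it.
        if self._size > self.limit_size:
            self.commit()

    def check_age(self) -> None:
        """Periodic check: commit the open bulk if it is older than limit_time."""
        if self._opened_at is not None:
            if self.clock() - self._opened_at > self.limit_time:
                self.commit()

    def commit(self) -> None:
        """Finish the current task; the next append opens a new bulk."""
        if self._opened_at is None:
            return
        self.committed.append(self._records)
        self._records, self._size, self._opened_at = [], 0, None
```

Because the size check happens after appending, a committed bulk can be somewhat larger than the limit, which matches the description of bulkLimitSize in the configuration section.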

JavaDoc

This page gives only a rough overview of the service. Please refer to the Bulkbuilder JavaDoc for detailed information about the Java components.

Configuration

The Bulk Builder can be configured via a configuration file named bulkbuilder.properties.

The file looks as follows:

# configuration of Bulk Builder

# maximum number of micro bulks that can be processed in parallel.
# Default is -1 (unlimited)
maxParallelMicroBulks=-1

# maximum size after which to close a pending bulk
# Default is 10m = 10 Mebibytes
bulkLimitSize=10m
# maximum time after which to close a pending bulk in seconds
# Default is 120.
bulkLimitTime=120

Description of parameters:

  • maxParallelMicroBulks
    • the maximum number of micro bulks that may be processed in parallel. Unlimited (-1) by default.
      • Since micro bulks are parsed in memory, a large number of micro bulks can cause out-of-memory errors, so the user should be careful about the amount of data contained in a micro bulk as well as the number of parallel clients pushing micro bulks.
  • bulkLimitSize
    • the size limit for bulks
    • if the bulk size exceeds this limit after a record or micro bulk has been appended to this bulk, the task is finished and the next request will cause a new bulk to be created
    • default size is 10 Mebibytes
  • bulkLimitTime
    • if the age of a bulk exceeds bulkLimitTime seconds, the bulk will be committed, the task will be finished and any future request will cause a new task and bulk to be generated.
    • default maximum age of a bulk is 120 seconds.

The configured bulkLimitSize and bulkLimitTime values can be overridden by job properties, so the limits can be fine-tuned to the expected record sizes or frequencies of the different jobs and thus behave differently for each job.
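As an illustration of how the file defaults and per-job overrides combine, here is a minimal Python sketch. It is not SMILA code, and it rests on several assumptions: that the job properties use the same names as the keys in bulkbuilder.properties, that a value without a suffix is a byte count, and that the suffixes "k" and "g" exist alongside the documented "m". The function names are hypothetical.

```python
# Only the "m" suffix appears in the documentation; "k" and "g" are assumptions.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

# Defaults as documented for bulkbuilder.properties.
DEFAULTS = {"maxParallelMicroBulks": "-1", "bulkLimitSize": "10m", "bulkLimitTime": "120"}


def parse_properties(text: str) -> dict:
    """Read simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_size(value: str) -> int:
    """Convert a size such as '10m' into bytes (10m = 10 mebibytes)."""
    value = value.strip().lower()
    if value and value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)  # assumption: no suffix means plain bytes


def effective_limits(properties_text: str, job_parameters: dict) -> dict:
    """Merge defaults, file values and job-level overrides, in that order."""
    merged = {**DEFAULTS, **parse_properties(properties_text)}
    # Only the two bulk limits are described as overridable per job.
    for key in ("bulkLimitSize", "bulkLimitTime"):
        if key in job_parameters:
            merged[key] = str(job_parameters[key])
    return {
        "maxParallelMicroBulks": int(merged["maxParallelMicroBulks"]),
        "bulkLimitSize": parse_size(merged["bulkLimitSize"]),
        "bulkLimitTime": int(merged["bulkLimitTime"]),
    }
```

For example, a job that receives few but urgent records could pass a smaller bulkLimitTime so that its bulks are committed sooner, while other jobs keep the file-wide value.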

Bulk Builder definition in workers.json

{
  "name" : "bulkbuilder",
  "modes" : ["bulkSource", "autoCommit"],
  "output" : [{
      "name" : "insertedRecords",
      "type" : "recordBulks",
      "group" : "recordBulks",
      "modes" : ["optional"]
    }, {
      "name" : "deletedRecords",
      "type" : "indexDeletes",
      "group" : "recordBulks",
      "modes" : ["optional"]
    }
  ]
}

After flushing a bulk (either automatically via timing or sizing constraints or triggered by the user), the task is finished and tasks will be generated for the workers connected to the insertedRecords and deletedRecords slots, so they can process the created bulk of records and/or delete requests pushed into the system.

See JobManager for more information on job processing.

Record push REST API

Note that records will only be processed for active jobs. If a job is not in the state "RUNNING" it will not accept new records or micro bulks.

Notes:

  • a record must contain a _recordid metadata attribute.

Push a single record or push a request to delete a single record

Use a POST request to push a record to a specific job. Use a DELETE request to request deletion of a specific record in a specific job.

Adding records with attachments is supported by using multipart POST requests; see SMILA/Documentation/JettyHttpServer#Attachments for details and a code example.

Supported operations:

  • POST: push a single record or commit the current bulk.
    • if a request body is present, this is interpreted as the JSON representation of a record and pushed into the system.
    • if no request body is present, the current bulks (i.e. the records and delete requests added to the system) will be flushed and the current task finished.
  • DELETE: request deletion of a single record
    • if the _recordid request parameter is present, then a delete request for the record with this id is appended to the DELETE bulk of the Bulk Builder
    • if _recordid request parameter is not present, the current bulks will be flushed and the current task will be finished.

Usage:

  • URL: http://<hostname>:8080/smila/job/<job-name>/record/
  • Allowed methods:
    • POST
    • DELETE
  • Response status codes:
    • 202 ACCEPTED: Upon successful execution.
    • 404 NOT FOUND + JSON Body with error message: If the specified job cannot be found or does not have the status RUNNING.
    • 400 BAD REQUEST + JSON Body with error message: If the pushed record/delete request has no _recordid or the record is invalid in another way (e.g. invalid JSON syntax).
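The three uses of the record resource (push, delete, commit) can be sketched in Python with the standard library. The helper names, the job name in the usage note and the Content-Type header value are assumptions made for this example; only the URL pattern, the HTTP methods and the _recordid parameter come from the description above. The functions only build the requests, so nothing is sent until one is passed to urllib.request.urlopen.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def record_url(host: str, job_name: str, port: int = 8080) -> str:
    """Build the record resource URL for a job."""
    return f"http://{host}:{port}/smila/job/{quote(job_name, safe='')}/record/"


def push_record_request(host: str, job_name: str, record: dict) -> Request:
    """POST with a body: the body is the JSON representation of one record."""
    if "_recordid" not in record:
        raise ValueError("a record must contain a _recordid attribute")
    body = json.dumps(record).encode("utf-8")
    # The Content-Type value is an assumption, not taken from the documentation.
    return Request(record_url(host, job_name), data=body, method="POST",
                   headers={"Content-Type": "application/json"})


def delete_record_request(host: str, job_name: str, record_id: str) -> Request:
    """DELETE with a _recordid request parameter: request deletion of one record."""
    url = record_url(host, job_name) + "?" + urlencode({"_recordid": record_id})
    return Request(url, method="DELETE")


def commit_request(host: str, job_name: str) -> Request:
    """POST without a body: flush the current bulks and finish the current task."""
    return Request(record_url(host, job_name), method="POST")
```

With a running job called "indexUpdate" (a hypothetical name), urllib.request.urlopen(push_record_request("localhost", "indexUpdate", {"_recordid": "id1"})) would be expected to answer with 202 ACCEPTED. urlopen raises urllib.error.HTTPError for the 400 and 404 cases.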

Push a micro bulk

Note: a micro bulk consists of a JSON record per line and is thus in itself not valid JSON. Record attachments are not supported in micro bulks.

E.g.

{"_recordid": "id1", "attribute1": "attribute1", ...}
{"_recordid": "id2", "attribute1": "attribute2", ...}
{"_recordid": "id3", "attribute1": "attribute3", ...}

Use a POST request to push a micro bulk to a specific job.

Supported operations:

  • POST: push a micro bulk.
    • if a request body is present, it is interpreted as a micro bulk, i.e. one JSON representation of a record per line, and its records are pushed into the system.
    • if no request body is present, an error will be generated. The current bulk can be committed using the record API (see above).

Usage:

  • URL: http://<hostname>:8080/smila/job/<job-name>/bulk/
  • Allowed methods:
    • POST
  • Response status codes:
    • 202 ACCEPTED: Upon successful execution.
    • 404 NOT FOUND + JSON Body with error message: If the specified job cannot be found or does not have the status RUNNING.
    • 400 BAD REQUEST + JSON Body with error message: If one of the pushed records in the micro bulk has no _recordid or the micro bulk is invalid in another way (e.g. invalid JSON syntax of a single record, or a record that spans multiple lines of the input).
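A client-side sketch in Python for pushing a micro bulk and interpreting the documented status codes could look like this. The function names and the Content-Type header value are assumptions for illustration; the URL pattern and status codes are the ones listed above. The request is only constructed here, and sending it is left to urllib.request.urlopen.

```python
from urllib.parse import quote
from urllib.request import Request


def micro_bulk_request(host: str, job_name: str, micro_bulk: str,
                       port: int = 8080) -> Request:
    """Build a POST request carrying a micro bulk (one JSON record per line)."""
    if not micro_bulk.strip():
        # The service reports an error for an empty body, so fail early.
        raise ValueError("a micro bulk request needs a non-empty body")
    url = f"http://{host}:{port}/smila/job/{quote(job_name, safe='')}/bulk/"
    # The Content-Type value is an assumption, not taken from the documentation.
    return Request(url, data=micro_bulk.encode("utf-8"), method="POST",
                   headers={"Content-Type": "application/json"})


def describe_status(status: int) -> str:
    """Map the documented response codes to a short explanation."""
    if status == 202:
        return "accepted"
    if status == 404:
        return "job not found or not in status RUNNING"
    if status == 400:
        return "invalid micro bulk (missing _recordid or invalid JSON line)"
    return "status not described in this documentation"
```

A caller would typically retry or start the job on 404 and fix the input on 400, since resending an invalid micro bulk unchanged cannot succeed.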