SMILA/Documentation/Importing/Crawler/JDBC
The JDBC Crawler and JDBC Fetcher workers are used for importing data from a database via JDBC. For the big picture and how the workers interact, have a look at the Importing Concept.

JDBC Crawler

The JDBC Crawler executes an SQL statement and crawls the result set, producing a record for each row of the result set.

Configuration

The JDBC Crawler worker is usually the first worker in a workflow and the job is started in runOnce mode.

  • Worker name: jdbcCrawler
  • Parameters:
    • dataSource: (req.) value for attribute _source, needed e.g. by the delta service
    • dbUrl: (req.) database URL to connect to
    • dbProps: (opt.) properties used when connecting to the database.
      • user (opt.) user name
      • password (opt.) user password
      • ... (opt.) any property supported by the used JDBC driver
    • crawlSql: (req.) the SQL statement to execute to get the records
    • splitLimitsSql: (opt.) the SQL statement to determine limits for splitting the result set into smaller parts based on an integer column. See below for details.
    • splitIncrement: (opt.) the increment to use for splitting the integer ragen determined by splitLimitsSql See below for details.
    • mapping (req.) specifies how to map database column names to record attributes or attachments.
      • COLUMN-1 (opt.) mapping of the first column to an attribute/attachment
      • COLUMN-2 (opt.) mapping of the second column to an attribute/attachment
      • COLUMN-N (opt.) mapping of the last column to an attribute/attachment
    • idColumns (req.) a list of database column names used to generate the record ID from
    • deltaColumns (opt.) a list of database column names (upper-case) used to generate the value for attribute _deltaHash
    • maxAttachmentSize (opt.) maximum accepted size of BLOB/CLOB column values (in bytes/characters) for attachment creation. Default is "1000000000" (1 billion). Larger values are skipped and a warning is written to the log.
    • parameters to control size of output bulks, see below for details
      • maxRecordsPerBulk (opt.) maximum number of records in one bulk. (default: 1000)
  • Task generator: runOnceTrigger
  • Output slots:
    • crawledRecords
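
As a sketch of how these parameters fit together, a crawler-only job definition might look like the following (the table PERSONS and its columns are invented for illustration; the workflow name "jdbcCrawling" and the parameters "jobToPushTo" and "tempStore" are used as in the sample job further below):

 {
   "name": "crawlPersonsJob",
   "workflow": "jdbcCrawling",
   "parameters": {
     "dataSource": "persons",
     "dbUrl": "jdbc:derby:memory:PersonDB",
     "dbProps": {
       "user": "admin",
       "password": "topsecret"
     },
     "crawlSql": "SELECT * FROM PERSONS",
     "idColumns": ["ID"],
     "deltaColumns": ["LAST_MODIFIED"],
     "mapping": {
       "ID": "personId",
       "NAME": "Name",
       "LAST_MODIFIED": "lastModified"
     },
     "jobToPushTo": "indexUpdateJob",
     "tempStore": "temp"
   }
 }
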
Processing

The JDBC Crawler executes the crawlSql statement and produces one record per result row in the bucket connected to crawledRecords. Please note that database column names are internally normalized to upper-case; in the configuration, however, any casing can be used. The resulting records contain only the values of the columns configured in the mapping. Whether a column is represented as an attribute or as an attachment depends on the type of the database column. The following table summarizes the supported database types, how they are mapped to Java types, and whether they are represented as attributes or attachments:

Database Type  | Java Type                   | Represented as
BIT            | Boolean                     | attribute
BOOLEAN        | Boolean                     | attribute
BIGINT         | Long                        | attribute
INTEGER        | Long                        | attribute
SMALLINT       | Long                        | attribute
TINYINT        | Long                        | attribute
DOUBLE         | Double                      | attribute
FLOAT          | Double                      | attribute
REAL           | Double                      | attribute
DECIMAL        | Double (scale > 0) or Long  | attribute
NUMERIC        | Double (scale > 0) or Long  | attribute
DATE           | Date                        | attribute
TIME           | DateTime                    | attribute
TIMESTAMP      | DateTime                    | attribute
CHAR           | String                      | attribute
VARCHAR        | String                      | attribute
NCHAR          | String                      | attribute
NVARCHAR       | String                      | attribute
ROWID          | String                      | attribute
BINARY         | byte[]                      | attachment
VARBINARY      | byte[]                      | attachment
LONGVARBINARY  | byte[]                      | attachment
BLOB           | byte[]                      | attachment
CLOB           | byte[]                      | attachment
NCLOB          | byte[]                      | attachment
LONGNVARCHAR   | byte[] (UTF-8 encoded)      | attachment (CLOB-like types are often reported by JDBC drivers as this type)
LONGVARCHAR    | byte[] (UTF-8 encoded)      | attachment (CLOB-like types are often reported by JDBC drivers as this type)
NULL           | -                           | no entry is generated

The following types are not fully supported:

  • ARRAY
  • DATALINK
  • DISTINCT
  • JAVA_OBJECT
  • OTHER
  • REF
  • SQLXML
  • STRUCT

The crawler tries to automatically convert any values into attributes of an appropriate data type. If this is not possible, an attachment containing the raw bytes is generated.
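
To make this concrete, here is a small sketch (the table DOCS and its columns are invented): assume a table DOCS with an INTEGER column ID, a VARCHAR column TITLE, and a BLOB column CONTENT. With the mapping below, ID and TITLE become record attributes while CONTENT becomes an attachment:

 "mapping": {
   "ID": "docId",        // INTEGER -> Long attribute
   "TITLE": "Title",     // VARCHAR -> String attribute
   "CONTENT": "Content"  // BLOB -> byte[] attachment
 }

A row (42, 'Hello', <binary data>) would then roughly produce a record with the attributes docId=42 and Title="Hello", plus an attachment named Content holding the BLOB bytes.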

The records are collected in bulks, whose size can be configured via the parameter maxRecordsPerBulk:

  • maxRecordsPerBulk determines when a new output bulk is started:
    • not configured: a new crawledRecords bulk is started after 1000 records.
    • configured: a new crawledRecords bulk is started when the configured number of records is reached.

Please note that maxRecordsPerBulk must be > 0. Otherwise your job will fail.
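
For example, to close a bulk after 500 records instead of the default 1000, the parameter could be set like this (sketch):

 "parameters": {
   ...
   "maxRecordsPerBulk": 500,
   ...
 }
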

Source: The attribute _source is set from the task parameter dataSource. It currently has no further meaning, but it is needed by the delta service.

Splitting

If the table (view, result set) to crawl gets very large, just getting all rows in a single task may not be feasible for two reasons:

  • Performance: Reading the complete result set sequentially may take too long, so we want to parallelize it.
  • Memory consumption: JDBC may use more memory for managing the result set than is available if the number of rows gets too large.

Both can be addressed by using the "splitting" feature of the JDBC crawler worker. Basically, it works like this:

  • Instead of getting the rows of the table immediately, an extra SQL statement (parameter "splitLimitsSql") must be specified that selects the minimum and maximum value from some INTEGER column. This statement is executed in the initial crawl task of the job.
  • Instead of producing crawled records, the first task produces "splitsToCrawl" records that each describe one partition of the complete range determined by this limits statement. Each of these records is written to a separate bulk, so each split will be crawled in its own follow-up task. The size of these partitions is specified by the "splitIncrement" parameter.
  • In these follow-up tasks the "crawlSql" is used to get the actual rows of the current split. Therefore the crawlSql must be extended to contain two parameters, the lower and upper bound (inclusive) of the INTEGER column used in the first task. The output of these tasks is then crawledRecords, as in normal "one-step" crawling; no further follow-up tasks for the crawler will be triggered.
Example

WORK IN PROGRESS

Assume we want to crawl a very large table that has an INTEGER key column named ID. The "splitLimitsSql" should then determine the minimum and maximum value in this column, and the "crawlSql" restricts the result set for each task based on given lower and upper bounds for this column:

{
  ...
  "parameters": {
    ...
    "splitLimitsSql": "SELECT min(ID) as MIN, max(ID) as MAX FROM VERY_LARGE_TABLE",
    "crawlSql": "SELECT * FROM VERY_LARGE_TABLE WHERE ID >= ? and ID <= ?",
    "splitIncrement": 10000,
    ...
    "mapping": {
      // as usual
    }
  }
}

The "splitLimitsSql" must deliver the limits in a single row using column names MIN and MAX. Lets assume it yields MIN=1 and MAX=1000000. Then, using the "splitIncrement" value of 10,000, the first task creates 1,00 splitToCrawl records with these bounds:

  • MIN=1, MAX=10,000
  • MIN=10,001, MAX=20,000
  • ...
  • MIN=990,001, MAX=1,000,000

Each of these records is then used in a separate task.

TODO finish this.

JDBC Fetcher

For each input record, the JDBC Fetcher executes the fetchSql statement and adds the result rows to the record.

Configuration
  • Worker name: jdbcFetcher
  • Parameters:
    • dbUrl: (req.) database URL to connect to
    • dbProps: (opt.) properties used when connecting to the database.
      • user (opt.) user name
      • password (opt.) user password
      • ... (opt.) any property supported by the used JDBC driver
    • fetchSql: (opt.) the SQL statement executed to fetch the data which is used to enrich the crawled input record. It may contain one or more '?' as parameter placeholders, see PreparedStatement. If the fetchSql parameter isn't set, the originally crawled record is written unchanged to the output.
    • fetchParameterAttributes: (opt.) a list of record attribute names whose values are used, in the given order, as parameters (substitutes for the '?') in the fetchSql statement. If fetchParameterAttributes isn't set, the record attributes which were mapped from the idColumns are used as fetchSql statement parameters.
    • mapping (req.) specifies how to map database column names to record attributes or attachments. Please note that database column names are normalized to upper-case.
      • COLUMN-1 (opt.) mapping of the first column to an attribute/attachment
      • COLUMN-2 (opt.) mapping of the second column to an attribute/attachment
      • COLUMN-N (opt.) mapping of the last column to an attribute/attachment
    • idColumns (opt.) If fetchParameterAttributes isn't set, the record attributes which were mapped from the idColumns DB columns are used as parameters (substitutes for the '?') in the fetchSql statement. In this case the id columns must also be specified in the mapping.
    • maxAttachmentSize (opt.) maximum accepted size of BLOB/CLOB column values (in bytes/characters) for attachment creation. Default is "1000000000" (1 billion). Larger values are skipped and a warning is written to the log.
  • Input slots:
    • recordsToFetch
  • Output slots:
    • fetchedRecords
Processing

The JDBC Fetcher is used to enrich the input records with the data selected by the fetchSql statement. The fetchSql is executed as PreparedStatement. So it can have parameters ('?') which will be replaced by either the values of the record attributes specified by the fetchParameterAttributes or (per default) by the values of the record attributes which were mapped from the idColumns.

All columns that are selected by the fetchSql query and are mapped in the mapping section will enrich the crawled record. That means, the mapped attributes are added to the record's metadata, binary data (BLOB, CLOB) will be added as attachments. Have a look at the table above for a description of the mapping from database types to Java types.

In general, the fetchSql query will return exactly one row. If it returns no row at all, the originally crawled record is written unchanged to the output. If it returns more than one row, the values of the rows are merged (e.g. as lists) before being added to the input record.

If one of the attributes used as fetch parameters is not set, or does not have a simple value but a sequence or a map, the record is skipped and no additional data is fetched for it; it is just written unchanged to the output.
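
As an illustration of the enrichment (mirroring the sample job below, and assuming Body is a VARCHAR column that is additionally mapped to an attribute "Content"; a CLOB would become an attachment instead), a crawled record could be enriched like this (values invented):

 // crawled input record (metadata sketch):
 { "Title": "Meeting notes", "bodyReference": 4711 }

 // fetchSql "SELECT Body FROM EmailBodies WHERE BodyId=?" with
 // fetchParameterAttributes ["bodyReference"] -> parameter value 4711
 // enriched output record:
 { "Title": "Meeting notes", "bodyReference": 4711, "Content": "Hi all, please find the agenda below." }
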

Sample JDBC crawl job

This example uses the following fictitious tables of database EmailDB:

Table Emails, columns: Sender, Receiver, Subject, SendDate, BodyId
Table EmailBodies, columns: BodyId, Body


  {
   "name":"crawlJdbcJob",
   "workflow":"jdbcCrawling",
   "parameters":{
     "dataSource":"emails",
     "dbUrl":"jdbc:derby:memory:EmailDB",
     "dbProps":
     {
        "user": "admin",
        "password": "topsecret"
     },
     "crawlSql":"SELECT * FROM Emails",
     "fetchSql":"SELECT Body FROM EmailBodies WHERE BodyId=?",
     "fetchParameterAttributes": "bodyReference",
     "idColumns": ["Sender", "Receiver"],
     "deltaColumns": "SendDate",
     "mapping":{
       "Sender":"From",
       "Receiver":"To",       
       "Subject":"Title",       
       "SendDate":"lastModified",   
       "BodyId":"bodyReference"
     },
     "jobToPushTo":"indexUpdateJob",
     "tempStore": "temp"
   }
 }

Adding JDBC Drivers

By default, SMILA includes only JDBC drivers for Derby. If you want to access other databases, you have to provide the corresponding JDBC drivers. To add a JDBC driver, perform the following steps:

  • Copy your JDBC driver jars into folder SMILA/plugins/org.eclipse.smila.jdbc/lib
  • Edit SMILA/plugins/org.eclipse.smila.jdbc/META-INF/MANIFEST.MF and add the jars to the bundle classpath, e.g.
Bundle-ClassPath: .,
 lib/postgresql-9.1-902.jdbc4.jar,
 lib/mysql-connector-java-5.1.20-bin.jar
  • Do not forget the whitespace character in front of the path.
  • JDBC-4-compliant drivers should be found automatically after a restart. For other drivers you must add the driver class name to the SMILA-JDBC header in META-INF/MANIFEST.MF, e.g.
SMILA-JDBC: org.apache.derby.jdbc.EmbeddedDriver,
 org.apache.derby.jdbc.ClientDriver
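
Once a driver has been added, the database is addressed in the job parameters via its JDBC URL. For example, with the PostgreSQL driver from above, the connection part of a job could look like this sketch (host, database name, and credentials are placeholders):

 "dbUrl": "jdbc:postgresql://dbhost:5432/emaildb",
 "dbProps": {
   "user": "crawler",
   "password": "secret"
 }
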
