SMILA/Documentation/Importing/Crawler/JDBC
Revision as of 09:34, 18 June 2012

The JDBC Crawler and JDBC Fetcher workers are used to import data from a database via JDBC. For the big picture and how the workers interact, see the Importing Concept.

JDBC Crawler

The JDBC Crawler executes an SQL statement and crawls the result set, producing a record for each row of the result set.

Configuration

The JDBC Crawler worker is usually the first worker in a workflow and the job is started in runOnce mode.

  • Worker name: jdbcCrawler
  • Parameters:
    • dataSource: (req.) value for attribute _source, needed e.g. by the delta service
    • dbUrl: (req.) database URL to connect to
    • dbProps: (opt.) properties used when connecting to the database.
      • user (opt.) user name
      • password (opt.) user password
      • ... (opt.) any property supported by the used JDBC driver
    • crawlSql: (req.) the SQL statement to execute
    • mapping (req.) specifies how to map database column names to record attributes or attachments.
      • COLUMN-1 (opt.) mapping of the first column to an attribute/attachment
      • COLUMN-2 (opt.) mapping of the second column to an attribute/attachment
      • COLUMN-N (opt.) mapping of the last column to an attribute/attachment
    • idColumns (req.) a list of database column names whose values are used to generate the record ID
    • deltaColumns (opt.) a list of database column names (upper-case) used to generate the value for attribute _deltaHash
    • parameters to control size of output bulks, see below for details
      • maxRecordsPerBulk (opt.) maximum number of records in one bulk. (default: 1000)
  • Task generator: runOnceTrigger
  • Output slots:
    • crawledRecords
Processing

The JDBC Crawler executes the crawlSql statement and produces one record per result row in the bucket connected to crawledRecords. Please note that database column names are internally normalized to upper-case; in the configuration, however, any casing can be used. The resulting records contain only the values of the columns configured in the mapping. Whether a column is represented as an attribute or as an attachment depends on the type of the database column. The table below summarizes the supported database types, the Java types they are mapped to, and whether they are represented as attributes or attachments:

  Database Type    Java Type                     Represented as
  -------------    ---------                     --------------
  BIT              Boolean                       attribute
  BOOLEAN          Boolean                       attribute
  BIGINT           Long                          attribute
  INTEGER          Long                          attribute
  SMALLINT         Long                          attribute
  TINYINT          Long                          attribute
  DOUBLE           Double                        attribute
  FLOAT            Double                        attribute
  REAL             Double                        attribute
  DECIMAL          Double (scale > 0) or Long    attribute
  NUMERIC          Double (scale > 0) or Long    attribute
  DATE             Date                          attribute
  TIME             DateTime                      attribute
  TIMESTAMP        DateTime                      attribute
  CHAR             String                        attribute
  LONGNVARCHAR     String                        attribute
  LONGVARCHAR      String                        attribute
  VARCHAR          String                        attribute
  NCHAR            String                        attribute
  NVARCHAR         String                        attribute
  ROWID            String                        attribute
  BINARY           byte[]                        attachment
  VARBINARY        byte[]                        attachment
  LONGVARBINARY    byte[]                        attachment
  BLOB             byte[]                        attachment
  CLOB             byte[]                        attachment
  NCLOB            byte[]                        attachment
  NULL             -                             no entry is generated

The following types are not fully supported:

  • ARRAY
  • DATALINK
  • DISTINCT
  • JAVA_OBJECT
  • OTHER
  • REF
  • SQLXML
  • STRUCT

The crawler tries to automatically convert all values into attributes of an appropriate data type. If this is not possible, an attachment containing the raw bytes is generated.
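The decision table above can be sketched as a small classifier over java.sql.Types constants. This is only an illustration of the mapping, not SMILA's actual code; the class and method names are invented for this example. Note that DECIMAL/NUMERIC depend on the column scale, and NULL produces no record entry at all.

```java
import java.sql.Types;

public class TypeMappingSketch {

    /** True for column types the crawler stores as attachments (byte[]). */
    static boolean isAttachmentType(int sqlType) {
        switch (sqlType) {
            case Types.BINARY:
            case Types.VARBINARY:
            case Types.LONGVARBINARY:
            case Types.BLOB:
            case Types.CLOB:
            case Types.NCLOB:
                return true;
            default:
                return false;
        }
    }

    /** Names the Java type an attribute column is mapped to, per the table above. */
    static String attributeJavaType(int sqlType) {
        switch (sqlType) {
            case Types.BIT:
            case Types.BOOLEAN:
                return "Boolean";
            case Types.BIGINT:
            case Types.INTEGER:
            case Types.SMALLINT:
            case Types.TINYINT:
                return "Long";
            case Types.DOUBLE:
            case Types.FLOAT:
            case Types.REAL:
                return "Double";
            case Types.DECIMAL:
            case Types.NUMERIC:
                return "Double (scale > 0) or Long"; // depends on the column scale
            case Types.DATE:
                return "Date";
            case Types.TIME:
            case Types.TIMESTAMP:
                return "DateTime";
            default:
                return "String"; // CHAR, VARCHAR, NCHAR, ROWID, ...
        }
    }
}
```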

The records are collected in bulks, whose size can be configured via the parameter maxRecordsPerBulk:

  • maxRecordsPerBulk behaves the same way whether it is set explicitly or not:
    • not configured: a new crawledRecords bulk is started after 1000 records.
    • configured: a new crawledRecords bulk is started after the configured number of records.

Please note that maxRecordsPerBulk must be > 0. Otherwise your job will fail.
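The bulk handling described above can be pictured with a standalone sketch (records are simplified to strings here; the class and method names are invented for illustration, not taken from SMILA's code):

```java
import java.util.ArrayList;
import java.util.List;

public class BulkSplitSketch {

    /** Splits crawled records into output bulks of at most maxRecordsPerBulk entries. */
    static List<List<String>> toBulks(List<String> records, int maxRecordsPerBulk) {
        if (maxRecordsPerBulk <= 0) {
            // mirrors the documented behavior: a non-positive value fails the job
            throw new IllegalArgumentException("maxRecordsPerBulk must be > 0");
        }
        List<List<String>> bulks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String record : records) {
            if (current.size() == maxRecordsPerBulk) {
                bulks.add(current);          // close the full bulk ...
                current = new ArrayList<>(); // ... and start a new one
            }
            current.add(record);
        }
        if (!current.isEmpty()) {
            bulks.add(current);
        }
        return bulks;
    }
}
```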

Source: The attribute _source is set from the task parameter dataSource. It currently has no further meaning, but it is needed by the delta service.


JDBC Fetcher

For each input record, the JDBC Fetcher executes the fetchSql statement and adds the result rows to the record.

Configuration
  • Worker name: jdbcFetcher
  • Parameters:
    • dbUrl: (req.) database URL to connect to
    • dbProps: (opt.) properties used when connecting to the database.
      • user (opt.) user name
      • password (opt.) user password
      • ... (opt.) any property supported by the used JDBC driver
    • fetchSql: (opt.) the SQL statement executed to fetch the data which is used to enrich the crawled input record. It may contain one or more '?' as parameter placeholders, see PreparedStatement. If the fetchSql parameter isn't set, the originally crawled record is written unchanged to the output.
    • fetchParameterAttributes: (opt.) a list of record attribute names whose values are used, in the given order, as parameters (substitutes for the '?') in the fetchSql statement. If fetchParameterAttributes isn't set, the record attributes which were mapped from the idColumns are used as fetchSql statement parameters.
    • mapping (req.) specifies how to map database column names to record attributes or attachments. Please note that database column names are normalized to upper-case.
      • COLUMN-1 (opt.) mapping of the first column to an attribute/attachment
      • COLUMN-2 (opt.) mapping of the second column to an attribute/attachment
      • COLUMN-N (opt.) mapping of the last column to an attribute/attachment
      • idColumns (opt.) If fetchParameterAttributes isn't set, the record attributes which were mapped from the idColumns database columns are used as parameters (substitutes for the '?') in the fetchSql statement.
  • Input slots:
    • recordsToFetch
  • Output slots:
    • fetchedRecords
Processing

// TODO: describe how fetching works, how WHERE clause is generated, how (multi)result rows are set in the record
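Until that description is filled in, one detail already follows from the configuration section and can be sketched: selecting the record values that are bound to the '?' placeholders of fetchSql. This is a hypothetical helper, not SMILA's actual code; records are simplified to plain maps. The chosen values would then be bound in order via PreparedStatement.setObject before executing the statement.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FetchParamsSketch {

    /**
     * Picks the record values to bind to the '?' placeholders of fetchSql:
     * the attributes named in fetchParameterAttributes if that parameter is
     * configured, otherwise the attributes that were mapped from idColumns.
     */
    static List<Object> fetchParameters(Map<String, Object> record,
                                        List<String> fetchParameterAttributes,
                                        List<String> idColumnAttributes) {
        List<String> names =
            (fetchParameterAttributes != null && !fetchParameterAttributes.isEmpty())
                ? fetchParameterAttributes
                : idColumnAttributes; // documented fallback
        return names.stream().map(record::get).collect(Collectors.toList());
    }
}
```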

Sample JDBC crawl job

This example uses the following fictitious tables of database EmailDB:

  Emails:      Sender, Receiver, Subject, SendDate, BodyId
  EmailBodies: BodyId, Body


  {
   "name":"crawlJdbcJob",
   "workflow":"jdbcCrawling",
   "parameters":{
     "dataSource":"emails",
     "dbUrl":"jdbc:derby:memory:EmailDB",
     "dbProps":
     {
        "user": "admin",
        "password": "topsecret"
     },
     "crawlSQL":"SELECT * FROM Emails",
     "fetchSQL":"SELECT Body FROM EmailBodies WHERE BodyId=?",
     "fetchParameterAttributes": "bodyReference",
     "idColumns": ["Sender", "Receiver"],
     "deltaColumns": "SendDate",
     "mapping":{
       "Sender":"From",
       "Receiver":"To",       
       "Subject":"Title",       
       "SendDate":"lastModified",   
       "BodyId":"bodyReference"
     },
     "jobToPushTo":"indexUpdateJob"     
   }
 }

Copyright © Eclipse Foundation, Inc. All Rights Reserved.