
SMILA/Documentation/JDBC Crawler

Note: This is deprecated for SMILA 1.0. The connectivity framework is still functional, but it is intended to be replaced by a scalable import based on SMILA's job management.


Overview

The JDBC crawler collects data from a JDBC-accessible database based on a SELECT statement given in the DataSourceConnectionConfig. Any column of the database records returned by the SELECT statement may be mapped to an attribute of the SMILA record in the Attributes section of the DataSourceConnectionConfig.

To overcome certain shortcomings of some JDBC drivers concerning the handling of very large result sets, the crawler features an optional grouping mechanism that enables it to retrieve data from the database in well-defined chunks and thus avoid OutOfMemoryErrors.

To use the JDBC crawler with your custom JDBC driver, the JDBC crawler bundle's class loader has to have access to the driver class, i.e. you need a bundle that exports the driver class.
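One common way to make the driver class visible is a small wrapper bundle whose manifest exports the driver's package. A minimal sketch of such a manifest might look like this (all bundle, jar, and package names are purely illustrative):

```
Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-SymbolicName: com.example.jdbc.driver.wrapper
Bundle-Version: 1.0.0
Bundle-ClassPath: mydriver.jar
Export-Package: com.example.jdbc
```

The jar containing the driver is placed inside the wrapper bundle and listed in Bundle-ClassPath, while Export-Package makes the driver's package visible to other bundles.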

Crawling configuration

The configuration file has to be located at configuration/org.eclipse.smila.connectivity.framework.

Defining Schema: org.eclipse.smila.connectivity.framework.crawler.jdbc/schemas/JdbcDataSourceConnectionConfigSchema.xsd

Crawling configuration explanation

See SMILA/Documentation/Crawler#Configuration for the generic parts of the configuration file.

The root element of the crawling configuration is DataSourceConnectionConfig, which contains the following sub-elements:

  • DataSourceID – the identification of the data source.
  • SchemaID – specifies the schema for the crawler job.
  • DataConnectionID – specifies which crawler should be used.
    • Crawler – implementation class of the crawler.
  • DeltaIndexing – specifies the delta indexing mode.
  • Attributes – lists all attributes you would like to use from a database row.
    • Attribute
      • Type (required) – the data type to use in the SMILA record.
      • Name (required) – the name of the attribute to create in the SMILA record.
      • HashAttribute (required) – specifies whether a hash should be created (true or false).
      • KeyAttribute (required) – specifies whether the attribute is part of the record's key, for example for the record ID (true or false).
      • Attachment (optional) – specifies whether the attribute's data should be stored as an attachment, e.g. for blob or clob fields (true or false).
      • ColumnName (required) – the name of the column in the database row to use for this attribute.
      • SqlType (required) – the SQL type of the column in the database row (one of: "string", "long", "date", "double", "blob", "clob", "boolean", "byte[]", "timestamp").
  • Process
    • Selections – specifies which data is to be selected, and how.
      • Grouping (optional) – adds support for chunk-wise retrieval of data from the database in order to conserve memory. See "Using the Grouping element" below.
        • Stepping – how many rows should be retrieved at a time (integer).
        • SQL – an SQL statement that returns an ordered list of database keys from which the retrieval partitions are created.
      • SQL – the SQL statement that selects the actual data to be retrieved.
    • Database – JDBC connection information.
      • Connection – the JDBC connection URL to use for connecting to the database.
      • User – the user name to use when connecting to the database (may be left blank if anonymous access is possible).
      • Password – the password to use when connecting to the database (may be left blank if anonymous access is possible).
      • FetchSize – the fetch size to set on the JDBC statement (cf. java.sql.Statement#setFetchSize); this value is passed through to the JDBC driver and must not be confused with the crawler's Stepping functionality.
      • JdbcDriver – the fully qualified class name of the JDBC driver to use (must be accessible to the bundle's class loader).

Crawling configuration example

A typical configuration for the JDBC crawler looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<DataSourceConnectionConfig
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.jdbc/schemas/JdbcDataSourceConnectionConfigSchema.xsd">
	<DataSourceID>jdbc</DataSourceID>
	<SchemaID>org.eclipse.smila.connectivity.framework.crawler.jdbc</SchemaID>
	<DataConnectionID>
		<Crawler>JdbcCrawler</Crawler>
	</DataConnectionID>
        <DeltaIndexing>full</DeltaIndexing>
	<Attributes>
		<Attribute Name="Person_ID" HashAttribute="true"
			KeyAttribute="true" Type="Long">
			<ColumnName>id</ColumnName>
			<SqlType>long</SqlType>
		</Attribute>
		<Attribute Name="BMI" HashAttribute="true" KeyAttribute="false"
			Type="double">
			<ColumnName>body_mass_index</ColumnName>
			<SqlType>double</SqlType>
		</Attribute>
		<Attribute Name="VacationDays" HashAttribute="false"
			KeyAttribute="false" Type="Long">
			<ColumnName>vacationdays</ColumnName>
			<SqlType>long</SqlType>
		</Attribute>
		<Attribute Name="BirthDay" HashAttribute="true"
			KeyAttribute="true" Type="date">
			<ColumnName>birthday</ColumnName>
			<SqlType>date</SqlType>
		</Attribute>
		<Attribute Name="DownSizeCandidate" HashAttribute="false"
			KeyAttribute="false" Type="Boolean">
			<ColumnName>scheduled_for_downsizing</ColumnName>
			<SqlType>boolean</SqlType>
		</Attribute>
		<Attribute Name="DownSizedOn" HashAttribute="false"
			KeyAttribute="false" Type="Timestamp">
			<ColumnName>downsized</ColumnName>
			<SqlType>timestamp</SqlType>
		</Attribute>
		<Attribute Name="Photo" HashAttribute="false" Attachment="true"
			KeyAttribute="false" Type="Blob">
			<ColumnName>photo</ColumnName>
			<SqlType>blob</SqlType>
		</Attribute>
		<Attribute Name="Resume" HashAttribute="false"
			KeyAttribute="false" Attachment="true" Type="Clob">
			<ColumnName>cv</ColumnName>
			<SqlType>clob</SqlType>
		</Attribute>
	</Attributes>
	<Process>
		<Selections>
			<Grouping>
				<Stepping>13</Stepping>
				<SQL>SELECT id FROM person ORDER BY id ASC</SQL>
			</Grouping>
			<SQL>
				SELECT id, vorname, name, body_mass_index, vacationdays,
				birthday, scheduled_for_downsizing, downsized, photo, cv
				FROM person WHERE id BETWEEN %min01 AND %max01
			</SQL>
		</Selections>
		<Database Connection="jdbc:derby:crawlerTestDerbyDB" User=""
			Password="" FetchSize="100000"
			JdbcDriver="org.apache.derby.jdbc.EmbeddedDriver" />
	</Process>
</DataSourceConnectionConfig>

Output example

<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
  <Val key="_recordid">jdbc:&lt;BirthDay=2009-04-02T00:00:00+0100;Person_ID=1&gt;</Val>
  <Val key="_source">jdbc</Val>
  <Val key="Person_ID" type="long">1</Val>
  <Val key="BMI" type="double">0.5497346110141528</Val>
  <Val key="VacationDays" type="long">23</Val>
  <Val key="BirthDay" type="datetime">2009-04-02T00:00:00+0100</Val>
  <Val key="DownSizeCandidate" type="long">0</Val>
  <Val key="DownSizedOn" type="datetime">2009-04-02T00:00:00+0100</Val>
  <Val key="_HASH_TOKEN">69d132fab2fd88cf9ccc17e57f68394ac3fed97ec8bab1c89bf764a6fa662</Val>
  <Attachment>Photo</Attachment>
  <Attachment>Resume</Attachment>
</Record>

Using the Grouping element

If used, the Grouping element must contain its own SQL element with a SELECT statement and an additional Stepping element containing a positive integer value.

Use of the Grouping element is best explained by means of an example. Imagine running the retrieval statement
SELECT * FROM CUSTOMER
on your CUSTOMER table, and further imagine that you have a very healthy customer base, so your CUSTOMER table contains 750,000 rows. You would invariably run into OutOfMemoryErrors during the crawl run, as some JDBC result set implementations do not release the resources of already retrieved data rows. After calling
resultSet.next()
a few hundred thousand times, an OutOfMemoryError is more or less inevitable.

Grouping to the rescue! The Grouping element enables you to break down your query results into blocks of a custom size (specified by the Stepping value), which are retrieved sequentially while the result set's resources are released after each block. Make sure that the SELECT statement in the SQL element of Grouping returns a sorted list of key values by which the results can be grouped. This could be, for instance, the primary key of the table or any other suitable discriminator. The crawler in turn uses the Stepping value to form groups of the specified size from these keys.

If the statement
 SELECT primaryKey FROM CUSTOMER ORDER BY primaryKey ASC
returned for example the following values:
{1, 2, 4, 5, 6, 8, 12, 13, 21, 34, 56, 67}

a stepping value of 5 would result in the creation of the following groups:

{1, 2, 4, 5, 6}
{8, 12, 13, 21, 34} and
{56, 67}

whereas with a stepping value of 9 the following groups would have been formed:

{1, 2, 4, 5, 6, 8, 12, 13, 21} and
{34, 56, 67}
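The partitioning scheme described above can be reproduced in a few lines of Java. This is only an illustration of how the groups are formed, not the crawler's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingDemo {

    /** Splits a sorted list of keys into consecutive groups of at most 'stepping' keys. */
    static List<List<Long>> group(List<Long> sortedKeys, int stepping) {
        List<List<Long>> groups = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i += stepping) {
            // each group covers the next 'stepping' keys (the last group may be smaller)
            groups.add(sortedKeys.subList(i, Math.min(i + stepping, sortedKeys.size())));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Long> keys = List.of(1L, 2L, 4L, 5L, 6L, 8L, 12L, 13L, 21L, 34L, 56L, 67L);
        System.out.println(group(keys, 5)); // [[1, 2, 4, 5, 6], [8, 12, 13, 21, 34], [56, 67]]
        System.out.println(group(keys, 9)); // [[1, 2, 4, 5, 6, 8, 12, 13, 21], [34, 56, 67]]
    }
}
```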

You also have to apply a slight modification to your original retrieval SQL statement (the one that is the direct child of the Selections element) in order for the grouping feature to do its magic:

SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN %min01 AND %max01

The crawler will replace the %min01 and %max01 tokens with the respective minimum and maximum values of each of the formed groups, thus creating the following three SQL statements for data retrieval (assuming a stepping value of 5 was used):

SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 1 AND 6
SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 8 AND 34
SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 56 AND 67

which are subsequently submitted to the database one after another.
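The token substitution step can be sketched as well. The following snippet builds the per-group statements from the template and the groups formed above; again, this is an illustration of the mechanism, not the crawler's own code:

```java
import java.util.ArrayList;
import java.util.List;

public class StatementDemo {

    /** Replaces the %min01/%max01 tokens with each group's first and last key. */
    static List<String> buildStatements(String template, List<List<Long>> groups) {
        List<String> statements = new ArrayList<>();
        for (List<Long> group : groups) {
            // groups are sorted, so the first element is the minimum and the last the maximum
            statements.add(template
                .replace("%min01", String.valueOf(group.get(0)))
                .replace("%max01", String.valueOf(group.get(group.size() - 1))));
        }
        return statements;
    }

    public static void main(String[] args) {
        String template = "SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN %min01 AND %max01";
        List<List<Long>> groups = List.of(
            List.of(1L, 2L, 4L, 5L, 6L),
            List.of(8L, 12L, 13L, 21L, 34L),
            List.of(56L, 67L));
        buildStatements(template, groups).forEach(System.out::println);
        // SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 1 AND 6
        // SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 8 AND 34
        // SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 56 AND 67
    }
}
```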