Skip to main content

Notice: This Wiki is now read only and edits are no longer possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Jump to: navigation, search

Difference between revisions of "SMILA/Documentation/2011.Simplification/JDBC Crawler"

m (Overview)
(For SMILA 1.0: Simplification pages are obsolete, redirect to SMILA/Documentation/JDBC_Crawler)
 
Line 1: Line 1:
== Overview ==
+
#REDIRECT [[SMILA/Documentation/JDBC_Crawler]]
 
+
The JDBC crawler collects data from a JDBC-accessible databases based on a SELECT statement given in <tt>DataSourceConnectionConfig</tt>.
+
Any of the columns of the database record returned by the SELECT statement may be mapped to an attribute of the SMILA record in the <tt>Attributes</tt> section of the <tt>DataSourceConnectionConfig</tt>.
+
 
+
To overcome certain shortcomings of some JDBCd drivers concerning the handling of really large data sets, the crawler features an optional grouping mechanism enabling it to retrieve data in well defined frames from the database and thus avoid OutOfMemoryExceptions.
+
 
+
To use the JDBC crawler with your custom JDBC driver, the JDBC crawler bundle's ClassLoader has to have access to the driver class, i.e. you have to have a bundle that exports the driver class.
+
 
+
== Crawling configuration ==
+
 
+
The configuration file has to be located at <tt>configuration/org.eclipse.smila.connectivity.framework</tt>.
+
 
+
Defining Schema: <tt>org.eclipse.smila.connectivitiy.framework.crawler.jdbc/schemas/JdbcDataSourceConnectionConfigSchema.xsd</tt>
+
 
+
== Crawling configuration explanation ==
+
 
+
The root element of the crawling configuration is <tt>DataSourceConnectionConfig</tt> and contains the following sub elements:
+
 
+
* <tt>DataSourceID</tt> – the identification of a data source.
+
* <tt>SchemaID</tt> – specify the schema for a crawler job.
+
* <tt>DataConnectionID</tt> – describes which agent crawler should be used.
+
** <tt>Crawler</tt> – implementation class of a crawler.
+
* <tt>DeltaIndexing</tt> – specifies deltaindexing mode.
+
* <tt>Attributes</tt> – list all attributes you would like to use from a database row
+
** <tt>Attribute</tt>
+
*** <tt>Type</tt> (required) – the data type to use in the SMILA record.
+
*** <tt>Name</tt> (required) – the name of the attribute to create in the SMILA record.
+
*** <tt>HashAttribute</tt> (required) – specify if a hash should be created (true or false).
+
*** <tt>KeyAttribute</tt> (required) – creates a key for this object, for example for record id (true or false).
+
*** <tt>Attachment</tt> (required) – specify if the attribute's data should be stored as an attachment e.g. for blob or clob fields (true or false)
+
*** <tt>ColumnName</tt> (required) - the name of the column in the database row to use for this attribute
+
*** <tt>SqlType</tt> (required)- specify the SQL Type of the column in the database row (one of: "string", "long", "date", "double", "blob", "clob", "boolean", "byte[]", "timestamp")
+
 
+
* <tt>Process</tt>
+
** <tt>Selections</tt> - Which data is to be selected (and how)
+
*** <tt>Grouping</tt> (optional) - adds support for "chunk"-wise retrieving of data from the database in order to preserve memory resources. See "Using the Grouping Element" below.
+
**** <tt>Stepping</tt> - how many rows should be retrieved at a time (integer)
+
**** <tt>SQL</tt> - specify an SQL statement that returns an ordered list of database keys which can be used to create the retrieval partitions
+
*** <tt>SQL</tt> - the SQL statement that selects the actual data to be retrieved.
+
** <tt>Database</tt> - JDBC Connection information
+
*** <tt>Connection</tt> - The JDBC connection URL to use for connecting to the database
+
*** <tt>User</tt> - The username to use when connecting to the database (can be left blank if anonymous access is possible)
+
*** <tt>Password</tt> - The password to use when connecting to the database (can be left blank if anonymous access is possible)
+
*** <tt>FetchSize</tt> - The FetchSize to set when creating the JDBC-Connection. This is mapped to the JDBC-property and must not be confused with the Stepping functionality of the crawler.
+
*** <tt>JdbcDriver</tt> - Specify the fully qualified class name of the Jdbc-Driver to use (must be accessible to the bundle's class loader)
+
 
+
== Crawling configuration example ==
+
A typical configuration for the JDBC crawler looks like this:
+
 
+
<source lang="xml">
+
<?xml version="1.0" encoding="UTF-8"?>
+
<DataSourceConnectionConfig
+
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.jdbc/schemas/JdbcDataSourceConnectionConfigSchema.xsd">
+
<DataSourceID>jdbc</DataSourceID>
+
<SchemaID>org.eclipse.smila.connectivity.framework.crawler.jdbc</SchemaID>
+
<DataConnectionID>
+
<Crawler>JdbcCrawler</Crawler>
+
</DataConnectionID>
+
        <DeltaIndexing>full</DeltaIndexing>
+
<Attributes>
+
<Attribute Name="Person_ID" HashAttribute="true"
+
KeyAttribute="true" Type="Long">
+
<ColumnName>id</ColumnName>
+
<SqlType>long</SqlType>
+
</Attribute>
+
<Attribute Name="BMI" HashAttribute="true" KeyAttribute="false"
+
Type="double">
+
<ColumnName>body_mass_index</ColumnName>
+
<SqlType>double</SqlType>
+
</Attribute>
+
<Attribute Name="VacationDays" HashAttribute="false"
+
KeyAttribute="false" Type="Long">
+
<ColumnName>vacationdays</ColumnName>
+
<SqlType>long</SqlType>
+
</Attribute>
+
<Attribute Name="BirthDay" HashAttribute="true"
+
KeyAttribute="true" Type="date">
+
<ColumnName>birthday</ColumnName>
+
<SqlType>date</SqlType>
+
</Attribute>
+
<Attribute Name="DownSizeCandidate" HashAttribute="false"
+
KeyAttribute="false" Type="Boolean">
+
<ColumnName>scheduled_for_downsizing</ColumnName>
+
<SqlType>boolean</SqlType>
+
</Attribute>
+
<Attribute Name="DownSizedOn" HashAttribute="false"
+
KeyAttribute="false" Type="Timestamp">
+
<ColumnName>downsized</ColumnName>
+
<SqlType>timestamp</SqlType>
+
</Attribute>
+
<Attribute Name="Photo" HashAttribute="false" Attachment="true"
+
KeyAttribute="false" Type="Blob">
+
<ColumnName>photo</ColumnName>
+
<SqlType>blob</SqlType>
+
</Attribute>
+
<Attribute Name="Resume" HashAttribute="false"
+
KeyAttribute="false" Attachment="true" Type="Clob">
+
<ColumnName>cv</ColumnName>
+
<SqlType>clob</SqlType>
+
</Attribute>
+
</Attributes>
+
<Process>
+
<Selections>
+
<Grouping>
+
<Stepping>13</Stepping>
+
<SQL>SELECT id FROM person ORDER BY id ASC</SQL>
+
</Grouping>
+
<SQL>
+
SELECT id, vorname, name, body_mass_index, vacationdays,
+
birthday, scheduled_for_downsizing, downsized, photo, cv
+
FROM person where id BETWEEN %01min AND %01max
+
</SQL>
+
</Selections>
+
<Database Connection="jdbc:derby:crawlerTestDerbyDB" User=""
+
Password="" FetchSize="100000"
+
JdbcDriver="org.apache.derby.jdbc.EmbeddedDriver" />
+
</Process>
+
</DataSourceConnectionConfig>
+
</source>
+
 
+
== Output example ==
+
<source lang="xml">
+
<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
+
  <Val key="_recordid">jdbc:&lt;Birthday=2009-04-02T00:00:00+0100;Person_ID=1&gt;</Val>
+
  <Val key="_source">jdbc</Val>
+
  <Val key="Person_ID" type="long">1</Val>
+
  <Val key="BMI" type="long">0.5497346110141528</Val>
+
  <Val key="VacationDays" type="long">23</Val>
+
  <Val key="BirthDay" type="datetime">2009-04-02T00:00:00+0100</Val>
+
  <Val key="DownSizeCandidate" type="long">0</Val>
+
  <Val key="DownSizedOn" type="datetime">2009-04-02 00:00:00+0100</Val>
+
  <Val key="_HASH_TOKEN">69d132fab2fd88cf9ccc17e57f68394ac3fed97ec8bab1c89bf764a6fa662</Val>
+
  <Attachment>Photo</Attachment>
+
  <Attachment>Resume</Attachment>
+
</Record>
+
</source>
+
 
+
== Using the Grouping element ==
+
 
+
If used, the '''Grouping''' element must contain its own '''SQL''' element with a SELECT statement and an additional '''Stepping''' element containing a non-negative integer value.
+
 
+
Use of the '''Grouping''' element is best explained by means of an example. Imagine running the retrieval statement <source lang="SQL">SELECT * FROM CUSTOMER</source> on your CUSTOMER table and further imagine that you have a very healthy customer base, so your CUSTOMER table contains 750.000 rows. You would invariably come down with OutOfMemoryExceptions during the crawling run, as the JDBC-Resultset does not release the resources for retrieved data rows. After calling <source lang="java">resultSet.next()</source> a few hundred thousand times an OutOfMemoryException is more or less inevitable.
+
 
+
Grouping to the rescue! The grouping element enables you to break down your query results into blocks of custom size (specified by the '''Stepping''' value) which can be retrieved sequentially while releasing the resultset's resources after each block. You have to be sure to provide a SELECT statement in the '''SQL''' element of '''Grouping''' which returns a '''''sorted''''' list of key values by which the results can be grouped. This could be the primary key of the table for instance or any other suitable discriminator. The stepping value is used by the crawler in turn to form groups of the specified size from the keys.
+
 
+
If the statement <source lang="sql"> SELECT primaryKey FROM CUSTOMER ORDER BY primaryKey ASC</source> returned for example the following values:
+
::'''{1, 2, 4, 5, 6, 8, 12, 13, 21, 34, 56, 67}'''
+
a stepping value of '''5''' would result in the creation of the following groups:
+
::'''{1, 2, 4, 5, 6}'''
+
::'''{8, 12, 13, 21, 34}''' and
+
::'''{56, 67}'''
+
whereas with a stepping value of '''9''' the following groups would have been formed:
+
::'''{1, 2, 4, 5, 6, 8, 12, 13, 21}''' and
+
::'''{34, 56, 67}'''
+
 
+
You also have to apply a slight modification to your original retrieval SQL statement (the one directly descending the '''Selections''' element) in order for the grouping feature to do its magic:
+
<source lang="sql">SELECT * FROM CUSTOMER WHERE primaryKey between %min01 AND %max01</source>
+
 
+
The crawler will replace the '''%min01''' and '''%max01''' tokens with the respective minimum and maximum values of each of the formed groups thus creating the following three SQL statements for data retrieval (assuming a stepping value of '''5''' was used:
+
 
+
<source lang="sql">SELECT * FROM CUSTOMER WHERE primaryKey between 1 AND 6</source>
+
<source lang="sql">SELECT * FROM CUSTOMER WHERE primaryKey between 8 AND 34</source>
+
<source lang="sql">SELECT * FROM CUSTOMER WHERE primaryKey between 56 AND 67</source>
+
 
+
which will subsequently be submitted sequentially to the database.
+

Latest revision as of 07:34, 19 January 2012

Back to the top