SMILA/Documentation/JDBC Crawler

Revision as of 11:10, 2 April 2009

What does the JDBC Crawler do?

The JDBC Crawler collects data from a JDBC-accessible database based on a SELECT statement given in the DataSourceConnectionConfig. Any column of the database records returned by the SELECT statement may be mapped to an attribute of the SMILA record in the Attributes section of the DataSourceConnectionConfig.

To work around shortcomings of some JDBC implementations in handling very large amounts of data, the crawler features an optional grouping mechanism that lets it retrieve data from the database in well-defined chunks and thus avoid OutOfMemoryErrors.

To use the JDBC Crawler with a custom JDBC driver, the JDBC Crawler bundle's ClassLoader must have access to the driver class, i.e. you need a bundle that exports the driver's package.
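Such a driver bundle can be as simple as an OSGi wrapper around the driver JAR. A minimal sketch of its MANIFEST.MF, assuming the Derby embedded driver (the bundle name, version, and JAR path are invented; the exported packages depend on your driver):

<source lang="text">
Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-SymbolicName: my.custom.jdbc.driver
Bundle-Version: 1.0.0
Bundle-ClassPath: lib/derby.jar
Export-Package: org.apache.derby.jdbc
</source>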

Crawling configuration

The configuration is found at configuration/org.eclipse.smila.connectivity.framework/jdbc

Defining Schema: org.eclipse.smila.connectivity.framework.crawler.jdbc/schemas/JdbcDataSourceConnectionConfigSchema.xsd


Crawling configuration explanation

The root element of the crawling configuration is DataSourceConnectionConfig; it contains the following sub-elements:

  • DataSourceID – the identification of the data source.
  • SchemaID – specifies the schema for the crawler job.
  • DataConnectionID – describes which agent or crawler should be used.
    • Crawler – the implementation class of the Crawler.
  • CompoundHandling – specifies whether packed data should be unpacked and the data within crawled (should always be No for the JDBC Crawler).
  • Attributes – lists all attributes you would like to use from a database row.
    • Attribute
      • Type (required) – the data type to use in the SMILA record.
      • Name (required) – the name of the attribute to create in the SMILA record.
      • HashAttribute (required) – specifies whether a hash should be created (true or false).
      • KeyAttribute (required) – specifies whether the attribute is part of the key for this object, e.g. for the record id (true or false).
      • Attachment (required) – specifies whether the attribute's data should be stored as an attachment, e.g. for blob or clob fields (true or false).
      • ColumnName (required) – the name of the column in the database row to use for this attribute.
      • SqlType (required) – the SQL type of the column in the database row (one of: "string", "long", "date", "double", "blob", "clob", "boolean", "byte[]", "timestamp").
  • Process
    • Selections – defines which data is to be selected (and how).
      • Grouping (optional) – adds support for chunk-wise retrieval of data from the database in order to preserve memory resources. See "Using the Grouping element" below.
        • Stepping – the number of rows to retrieve at a time (integer).
        • SQL – an SQL statement that returns an ordered list of database keys from which the retrieval partitions are created.
      • SQL – the SQL statement that selects the actual data to be retrieved.
    • Database – the JDBC connection information.
      • Connection – the JDBC connection URL to use for connecting to the database.
      • User – the username to use when connecting to the database (can be left blank if anonymous access is possible).
      • Password – the password to use when connecting to the database (can be left blank if anonymous access is possible).
      • FetchSize – the fetch size to set on the JDBC connection. This maps to the corresponding JDBC property and must not be confused with the Stepping functionality of the crawler.
      • JdbcDriver – the fully qualified class name of the JDBC driver to use (must be accessible to the bundle's class loader).


Crawling configuration example

A typical configuration for the JDBC Crawler looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<DataSourceConnectionConfig
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
	<DataSourceID>jdbc</DataSourceID>
	<SchemaID>org.eclipse.smila.connectivity.framework.crawler.jdbc</SchemaID>
	<DataConnectionID>
		<Crawler>JdbcCrawler</Crawler>
	</DataConnectionID>
	<CompoundHandling>No</CompoundHandling>
	<Attributes>
		<Attribute Name="Person_ID" HashAttribute="true"
			KeyAttribute="true" Type="Long">
			<ColumnName>id</ColumnName>
			<SqlType>long</SqlType>
		</Attribute>
		<Attribute Name="BMI" HashAttribute="true" KeyAttribute="false"
			Type="double">
			<ColumnName>body_mass_index</ColumnName>
			<SqlType>double</SqlType>
		</Attribute>
		<Attribute Name="VacationDays" HashAttribute="false"
			KeyAttribute="false" Type="Long">
			<ColumnName>vacationdays</ColumnName>
			<SqlType>long</SqlType>
		</Attribute>
		<Attribute Name="BirthDay" HashAttribute="true"
			KeyAttribute="true" Type="date">
			<ColumnName>birthday</ColumnName>
			<SqlType>date</SqlType>
		</Attribute>
		<Attribute Name="DownSizeCandidate" HashAttribute="false"
			KeyAttribute="false" Type="Boolean">
			<ColumnName>scheduled_for_downsizing</ColumnName>
			<SqlType>boolean</SqlType>
		</Attribute>
		<Attribute Name="DownSizedOn" HashAttribute="false"
			KeyAttribute="false" Type="Timestamp">
			<ColumnName>downsized</ColumnName>
			<SqlType>timestamp</SqlType>
		</Attribute>
		<Attribute Name="Photo" HashAttribute="false" Attachment="true"
			KeyAttribute="false" Type="Blob">
			<ColumnName>photo</ColumnName>
			<SqlType>blob</SqlType>
		</Attribute>
		<Attribute Name="Resume" HashAttribute="false"
			KeyAttribute="false" Attachment="true" Type="Clob">
			<ColumnName>cv</ColumnName>
			<SqlType>clob</SqlType>
		</Attribute>
	</Attributes>
	<Process>
		<Selections>
			<Grouping>
				<Stepping>13</Stepping>
				<SQL>SELECT id FROM person ORDER BY id ASC</SQL>
			</Grouping>
			<SQL>
				SELECT id, vorname, name, body_mass_index, vacationdays,
				birthday, scheduled_for_downsizing, downsized, photo, cv
				FROM person where id BETWEEN %01min AND %01max
			</SQL>
		</Selections>
		<Database Connection="jdbc:derby:crawlerTestDerbyDB" User=""
			Password="" FetchSize="100000"
			JdbcDriver="org.apache.derby.jdbc.EmbeddedDriver" />
	</Process>
</DataSourceConnectionConfig>


Output example

<Record xmlns="http://www.eclipse.org/smila/record" version="1.0">
	<Id xmlns="http://www.eclipse.org/smila/id" version="1.0">
		<Source>jdbc</Source>
		<Key name="BirthDay">2009-04-02 00:00:00.000</Key>
		<Key name="Person_ID">1</Key>
	</Id>
	<A n="Person_ID">
		<L>
			<V t="int">1</V>
		</L>
	</A>
	<A n="BMI">
		<L>
			<V t="fp">0.5497346110141528</V>
		</L>
	</A>
	<A n="VacationDays">
		<L>
			<V t="int">23</V>
		</L>
	</A>
	<A n="BirthDay">
		<L>
			<V t="datetime">2009-04-02 00:00:00.000</V>
		</L>
	</A>
	<A n="DownSizeCandidate">
		<L>
			<V t="int">0</V>
		</L>
	</A>
	<A n="DownSizedOn">
		<L>
			<V t="datetime">2009-04-02 00:00:00.000</V>
		</L>
	</A>
	<A n="_HASH_TOKEN">
		<L>
			<V>69d132fab2fd88cf9ccc17e57f68394ac3fed97ec8bab1c89bf764a6fa662</V>
		</L>
	</A>
	<Attachment>Photo</Attachment>
	<Attachment>Resume</Attachment>
</Record>

Using the Grouping element

If used, the Grouping element must contain its own SQL element with a SELECT statement and an additional Stepping element containing a positive integer value.

Use of the Grouping element is best explained by means of an example. Imagine running the retrieval statement
SELECT * FROM CUSTOMER
on your CUSTOMER table, and further imagine that you have a very healthy customer base, so the table contains 750,000 rows. You would invariably run into OutOfMemoryErrors during the crawling run, because the JDBC result set does not release the resources for already retrieved data rows. After calling
resultSet.next()
a few hundred thousand times, an OutOfMemoryError is more or less inevitable.

Grouping to the rescue! The Grouping element enables you to break your query results down into blocks of a custom size (specified by the Stepping value), which are retrieved sequentially, releasing the result set's resources after each block. Make sure the SELECT statement in the SQL element of Grouping returns a sorted list of key values by which the results can be grouped; this could be the table's primary key, for instance, or any other suitable discriminator. The crawler in turn uses the stepping value to form groups of the specified size from these keys.

If the statement
 SELECT primaryKey FROM CUSTOMER ORDER BY primaryKey ASC
returned for example the following values:
{1, 2, 4, 5, 6, 8, 12, 13, 21, 34, 56, 67}

a stepping value of 5 would result in the creation of the following groups:

{1, 2, 4, 5, 6}
{8, 12, 13, 21, 34} and
{56, 67}

whereas with a stepping value of 9 the following groups would have been formed:

{1, 2, 4, 5, 6, 8, 12, 13, 21} and
{34, 56, 67}
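The key-partitioning step described above can be sketched in a few lines (Python here purely for illustration; the crawler itself is implemented in Java):

```python
def make_groups(keys, stepping):
    """Split a sorted list of key values into chunks of at most `stepping` keys."""
    return [keys[i:i + stepping] for i in range(0, len(keys), stepping)]

# The key list from the example above.
keys = [1, 2, 4, 5, 6, 8, 12, 13, 21, 34, 56, 67]

print(make_groups(keys, 5))  # [[1, 2, 4, 5, 6], [8, 12, 13, 21, 34], [56, 67]]
print(make_groups(keys, 9))  # [[1, 2, 4, 5, 6, 8, 12, 13, 21], [34, 56, 67]]
```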

You also have to apply a slight modification to your original retrieval SQL statement (the one directly below the Selections element) in order for the grouping feature to do its magic:

SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN %01min AND %01max

The crawler will replace the %01min and %01max tokens with the respective minimum and maximum values of each of the formed groups, thus creating the following three SQL statements for data retrieval (assuming a stepping value of 5 was used):

SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 1 AND 6
SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 8 AND 34
SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 56 AND 67

which will subsequently be submitted sequentially to the database.
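The whole mechanism can be illustrated end-to-end with an in-memory SQLite database (a sketch only: the table and data are invented, and where the crawler substitutes each group's minimum and maximum key into the SQL text, the sketch uses bound parameters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (primaryKey INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(k, f"cust-{k}") for k in (1, 2, 4, 5, 6, 8, 12, 13, 21, 34, 56, 67)])

stepping = 5
# Grouping SQL: fetch the ordered list of keys.
keys = [row[0] for row in
        conn.execute("SELECT primaryKey FROM customer ORDER BY primaryKey ASC")]
groups = [keys[i:i + stepping] for i in range(0, len(keys), stepping)]

# Retrieval SQL: one bounded query per group instead of one huge result set.
rows = []
for group in groups:
    rows += conn.execute(
        "SELECT * FROM customer WHERE primaryKey BETWEEN ? AND ?",
        (group[0], group[-1])).fetchall()

assert len(rows) == 12  # every customer retrieved exactly once
```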
