Skip to main content

Notice: This Wiki is now read only and edits are no longer possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Jump to: navigation, search

SMILA/Documentation/JDBC Crawler

< SMILA‎ | Documentation
Revision as of 11:41, 28 November 2008 by Unnamed Poltroon (Talk) (New page: == The JDBC Crawler's XML Index Order == A typical IndexOrder for the JDBC Crawler looks like this: <source lang="xml"> <?xml version="1.0" encoding="UTF-8"?> <IndexOrderConfiguration xm...)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

The JDBC Crawler's XML Index Order

A typical IndexOrder for the JDBC Crawler looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<IndexOrderConfiguration
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:noNamespaceSchemaLocation="JdbcIndexOrder.xsd">
	<DataSourceID>jdbc</DataSourceID>
	<SchemaID>org.eclipse.smila.connectivity.framework.crawler.jdbc</SchemaID>
	<DataConnectionID>
		<Crawler>JdbcCrawlerDS</Crawler>
	</DataConnectionID>
	<CompoundHandling>No</CompoundHandling>
	<Attributes>
		<Attribute Name="Person_ID" HashAttribute="true"
			KeyAttribute="true" Type="Long">
			<ColumnName>id</ColumnName>
			<SqlType>long</SqlType>
		</Attribute>
		<Attribute Name="BMI" HashAttribute="true" KeyAttribute="false"
			Type="double">
			<ColumnName>body_mass_index</ColumnName>
			<SqlType>double</SqlType>
		</Attribute>
		<Attribute Name="VacationDays" HashAttribute="false"
			KeyAttribute="false" Type="Long">
			<ColumnName>vacationdays</ColumnName>
			<SqlType>long</SqlType>
		</Attribute>
		<Attribute Name="BirthDay" HashAttribute="true"
			KeyAttribute="true" Type="date">
			<ColumnName>birthday</ColumnName>
			<SqlType>date</SqlType>
		</Attribute>
		<Attribute Name="DownSizeCandidate" HashAttribute="false"
			KeyAttribute="false" Type="Boolean">
			<ColumnName>scheduled_for_downsizing</ColumnName>
			<SqlType>boolean</SqlType>
		</Attribute>
		<Attribute Name="DownSizedOn" HashAttribute="false"
			KeyAttribute="false" Type="Timestamp">
			<ColumnName>downsized</ColumnName>
			<SqlType>timestamp</SqlType>
		</Attribute>
		<Attribute Name="Photo" HashAttribute="false" Attachment="true"
			KeyAttribute="false" Type="Blob">
			<ColumnName>photo</ColumnName>
			<SqlType>blob</SqlType>
		</Attribute>
		<Attribute Name="Resume" HashAttribute="false"
			KeyAttribute="false" Attachment="true" Type="Clob">
			<ColumnName>cv</ColumnName>
			<SqlType>clob</SqlType>
		</Attribute>
	</Attributes>
	<Process>
		<Selections>
			<Grouping>
				<Stepping>13</Stepping>
				<SQL>SELECT id FROM person ORDER BY id ASC</SQL>
			</Grouping>
			<SQL>
				SELECT id, vorname, name, body_mass_index, vacationdays,
				birthday, scheduled_for_downsizing, downsized, photo, cv
				FROM person where id BETWEEN %01min AND %01max
			</SQL>
		</Selections>
		<Database Connection="jdbc:derby:crawlerTestDerbyDB" User=""
			Password="" FetchSize="100000"
			JdbcDriver="org.apache.derby.jdbc.EmbeddedDriver" />
	</Process>
</IndexOrderConfiguration>


The XSD Schema

The XSD schema file which defines the structure of the JDBC Crawler's index order is rather simple:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified"
  attributeFormDefault="unqualified"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:redefine schemaLocation="../../org.eclipse.smila.connectivity.framework.indexorder/schemas/RootIndexOrderConfiguration.xsd">
    <xs:complexType name="Process">
      <xs:annotation>
        <xs:documentation>
          Process Specification
        </xs:documentation>
      </xs:annotation>
      <xs:complexContent>
        <xs:extension base="Process">
          <xs:sequence>
            <xs:element name="Selections">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="Grouping"
                    minOccurs="0">
                    <xs:complexType>
                      <xs:sequence>
                        <xs:element
                          name="Stepping" type="xs:positiveInteger" />
                        <xs:element name="SQL"
                          type="xs:string" />
                      </xs:sequence>
                    </xs:complexType>
                  </xs:element>
                  <xs:element name="SQL"
                    type="xs:string" />
                </xs:sequence>
              </xs:complexType>
            </xs:element>
            <xs:element name="Database">
              <xs:annotation>
                <xs:documentation>
                  Database connection information
                </xs:documentation>
              </xs:annotation>
              <xs:complexType>
                <xs:attribute name="Connection"
                  use="required">
                  <xs:simpleType>
                    <xs:restriction
                      base="xs:string">
                      <xs:pattern
                        value="jdbc:oracle:thin:@[\w\.\-]+:\d+:\w+" />
                      <xs:pattern
                        value="jdbc:microsoft:sqlserver://[\w\.\-]+:\d+(;(DatabaseName|HostProcess|NetAddress|Password|PortNumber|ProgramName|SelectMethod|SendStringParametersAsUnicode|ServerName|User)=[\w\i]+)*" />
                      <xs:pattern
                        value="jdbc:sqlserver://[\w\.\-]+:\d+(;(DatabaseName|HostProcess|NetAddress|Password|PortNumber|ProgramName|SelectMethod|SendStringParametersAsUnicode|ServerName|User)=[\w\i]+)*" />
                      <xs:pattern
                        value="jdbc:odbc:[\w\.\-]+" />
                      <xs:pattern
                        value="jdbc:derby:[\w\.\-\\:/]+" />
                      <!-- please modify the connection string restriction in case of custom jdbc drivers -->
                    </xs:restriction>
                  </xs:simpleType>
                </xs:attribute>
                <xs:attribute name="User"
                  type="xs:string" use="required" />
                <xs:attribute name="Password"
                  type="xs:string" use="required" />
                <xs:attribute name="FetchSize"
                  type="xs:int" use="required" />
                <xs:attribute name="JdbcDriver"
                  type="xs:string" use="optional" />
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
    <xs:complexType name="Attribute">
      <xs:complexContent>
        <xs:extension base="Attribute">
          <xs:sequence>
            <xs:element name="ColumnName" type="xs:string" />
            <xs:element name="SqlType">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:enumeration value="string" />
                  <xs:enumeration value="long" />
                  <xs:enumeration value="date" />
                  <xs:enumeration value="double" />
                  <xs:enumeration value="blob" />
                  <xs:enumeration value="clob" />
                  <xs:enumeration value="boolean" />
                  <xs:enumeration value="byte[]" />
                  <xs:enumeration value="timestamp" />
                </xs:restriction>
              </xs:simpleType>
            </xs:element>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>

Basically it defines two main sections:

  1. A Process element defining which data is selected from which database.
  2. An Attributes element describing the rows of the selection's result which should be used for record creation.

(These occur in reverse order in the schema, of course, but shall be explained vice versa as this approach is somewhat more straightforward).

The Process element

TODO: Include imagery here


The Selections element

The Selections element MUST contain a SQL element which contains the SELECT statement to use for data retrieval as PCDATA ,e.g.
 SELECT * FROM CUSTOMER

Furthermore it optionally may contain a Grouping element. The grouping mechanism is one of the key features of the JDBC crawler enabling it to deal with really vast amounts of data.

The Grouping element

If used, the Grouping element must contain its own SQL element with a SELECT statement and an additional Stepping element containing a non-negative integer value.

Use of the Grouping element is best explained by means of an example. Imagine running the retrieval statement
SELECT * FROM CUSTOMER
on your CUSTOMER table and further imagine that you have a very healthy customer base, so your CUSTOMER table contains 750.000 rows. You would invariably come down with OutOfMemoryExceptions during the crawling run, as the JDBC-Resultset does not release the resources for retrieved data rows. After calling
resultSet.next()
a few hundred thousand times an OutOfMemoryException is more or less inevitable.

Grouping to the rescue! The grouping element enables you to break down your query results into blocks of custom size (specified by the Stepping value) which can be retrieved sequentially while releasing the resultset's resources after each block. You have to be sure to provide a SELECT statement in the SQL element of Grouping which returns a sorted list of key values by which the results can be grouped. This could be the primary key of the table for instance or any other suitable discriminator. The stepping value is used by the crawler in turn to form groups of the specified size from the keys.

If the statement
 SELECT primaryKey FROM CUSTOMER ORDER BY primaryKey ASC
returned for example the following values:
{1, 2, 4, 5, 6, 8, 12, 13, 21, 34, 56, 67}

a stepping value of 5 would result in the creation of the following groups:

{1, 2, 4, 5, 6}
{8, 12, 13, 21, 34} and
{56, 67}

whereas with a stepping value of 9 the following groups would have been formed:

{1, 2, 4, 5, 6, 8, 12, 13, 21} and
{34, 56, 67}

You also have to apply a slight modification to your original retrieval SQL statement (the one directly descending the Selections element) in order for the grouping feature to do its magic:

SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN %min01 AND %max01

The crawler will replace the %min01 and %max01 tokens with the respective minimum and maximum values of each of the formed groups thus creating the following three SQL statements for data retrieval (assuming a stepping value of 5 was used:

SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 1 AND 6
SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 8 AND 34
SELECT * FROM CUSTOMER WHERE primaryKey BETWEEN 56 AND 67

which will subsequently be submitted sequentially to the database.

The Database element

Database is the second required element within the Process element. Its attributes contain information about the database to connect to:

  • Connection: the Connection-String to use for the JDBC-Driver (e.g. jdbc:derby:crawlerTestDerbyDB)
  • Username: the username to use for the logging into the database
  • Password: the password to use when logging into the database
  • Fetchsize: the fetchsize to use for the queries
  • JdbcDriver: the fully qualified classname of the JDBC-Driver to use

The Attributes element

The Attributes element decides which of the selected data is put into the index. For each resultset column to include an Attribute element has to be added to the Attributes element.

The Attribute element

The Attribute element has the basic set of attributes as defined in the RootIndexOrder:

  • Name: the Name of the Attribute in the SMILA record.
  • Type: the type to use
  • HashAttribute: is this attribute part of the record's hash
  • KeyAttribute: is this attribute part of the record's key
  • Attachment: is this attribute an attachment (e.g. a BLOB field)

It also features two further child elements:

  • ColumnName: the Name of the column in the JDBC resultset whose value is to be mapped to this attribute
  • SqlType: the data type of the column, must be one of the following:
    • "long"
    • "date"
    • "double"
    • "blob"
    • "clob"
    • "boolean"
    • "byte[]"
    • "timestamp"

Custom JDBC Drivers

For using custom JDBC-Drivers the XSD schema has to be modified. Currently the following drivers are allowed by the patterns defined in the schema:

  • jdbc:microsoft:sqlserver:
  • jdbc:sqlserver:
  • jdbc:odbc:
  • jdbc:derby:
  • jdbc:oracle:thin:

Back to the top