SMILA/Documentation/JobFile Agent


Latest revision as of 05:44, 24 January 2012

Note: This is deprecated for SMILA 1.0. The connectivity framework is still functional, but it is planned to be replaced by scalable import based on SMILA's job management.


Overview

The Job File agent offers the functionality to execute ADD and DELETE jobs. A job file is an XML file using the SMILA datamodel XML representation of Records and Ids to describe the data and special ADD and DELETE tags to specify the action to take.

Agent configuration

The example configuration file is located at configuration/org.eclipse.smila.connectivity.framework/jobfile.xml.

Defining Schema: org.eclipse.smila.connectivity.framework.agent.jobfile/schemas/JobFileDataSourceConnectionConfigSchema.xsd.

Agent configuration explanation

See SMILA/Documentation/Agent#Configuration for the generic parts of the configuration file.

The root element of the configuration is DataSourceConnectionConfig and contains the following sub elements:

  • DataSourceID – the identification of a data source
  • SchemaID – specifies the schema for the data source
  • DataConnectionID – describes which agent or crawler should be used
    • Crawler – service ID of a crawler
    • Agent – service ID of an agent
  • CompoundHandling – specifies whether packed data (such as a ZIP file containing other files) should be unpacked and the contained files processed (YES or NO).
  • Attributes – lists all attributes provided by the data source
    • Attribute
      • Type (required) – the data type (String, Integer or Date).
      • Name (required) – the attribute's name.
      • HashAttribute – specifies whether the attribute is used for hash creation (true or false).
      • KeyAttribute – specifies whether the attribute is used to create a key for the object, for example the record Id (true or false).
      • Attachment – specifies whether the attribute returns its data as an attachment of the record.
  • Process – contains parameters for the agent business logic.
    • UpdateInterval – the number of seconds to wait before reloading the job files specified by JobFileUrl.
    • JobFileUrl – the URL of the job file to load. Protocols file:// and http:// are supported. You may specify multiple URLs.
    • AttachmentSeparator – the separator used to separate attachment names and attachment URLs


The Job File agent offers no attributes of its own. It only creates the attributes that are part of each record in the job file. However, you have to specify the names of the attributes that should be used for hash creation (the hash is not part of the record) and, optionally, for Id creation (it is also possible to provide an Id for each record directly in the job file).

Configuration example

<DataSourceConnectionConfig
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.agent.jobfile/schemas/FeedDataSourceConnectionConfigSchema.xsd"
>
  <DataSourceID>jobfile</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.agent.jobfile</SchemaID>
  <DataConnectionID>
    <Agent>JobFileAgent</Agent>
  </DataConnectionID>
  <DeltaIndexing>full</DeltaIndexing>
  <Attributes>
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true" />
    <Attribute Type="String" Name="Path" KeyAttribute="true" />
    <Attribute Type="String" Name="Url" KeyAttribute="true" />
  </Attributes>
  <Process>
    <UpdateInterval>300</UpdateInterval>
    <AttachmentSeparator>####</AttachmentSeparator>
    <JobFileUrl>file://samplejobfile.xml</JobFileUrl>
  </Process>
</DataSourceConnectionConfig>
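
To make the roles of the individual elements more concrete, the following Python sketch reads a trimmed copy of the configuration above and extracts the key attributes, the hash attributes and the Process parameters. It is only an illustration and not part of SMILA: the function name summarize_config and the returned dictionary layout are assumptions made for this example, and the schema attributes of the root element are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration example (schema attributes omitted).
CONFIG_XML = """
<DataSourceConnectionConfig>
  <DataSourceID>jobfile</DataSourceID>
  <Attributes>
    <Attribute Type="Date" Name="LastModifiedDate" HashAttribute="true" />
    <Attribute Type="String" Name="Path" KeyAttribute="true" />
    <Attribute Type="String" Name="Url" KeyAttribute="true" />
  </Attributes>
  <Process>
    <UpdateInterval>300</UpdateInterval>
    <AttachmentSeparator>####</AttachmentSeparator>
    <JobFileUrl>file://samplejobfile.xml</JobFileUrl>
  </Process>
</DataSourceConnectionConfig>
"""


def summarize_config(xml_text):
    """Hypothetical helper: collect the settings relevant to the Job File agent."""
    root = ET.fromstring(xml_text)
    attributes = root.findall("./Attributes/Attribute")
    return {
        "data_source_id": root.findtext("DataSourceID"),
        # Attributes flagged for Id (key) creation.
        "key_attributes": [a.get("Name") for a in attributes
                           if a.get("KeyAttribute") == "true"],
        # Attributes flagged for hash creation.
        "hash_attributes": [a.get("Name") for a in attributes
                            if a.get("HashAttribute") == "true"],
        "update_interval": int(root.findtext("./Process/UpdateInterval")),
        "separator": root.findtext("./Process/AttachmentSeparator"),
        # JobFileUrl may occur several times, so collect all of them.
        "job_file_urls": [e.text for e in root.findall("./Process/JobFileUrl")],
    }


summary = summarize_config(CONFIG_XML)
print(summary)
```

In this configuration, Path and Url are the attributes available for Id creation, while LastModifiedDate is the only attribute that contributes to the hash.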

The format of job files

An example job file called "samplejobfile.xml" is located at configuration/org.eclipse.smila.connectivity.framework.

Defining Schema: org.eclipse.smila.connectivity.framework.agent.jobfile/schemas/jobfile.xsd.

A job file can contain an ADD section, a DELETE section, or both. An ADD section can contain one or more Record sections. A Record section need not contain an Id. If it contains none, an Id object is created according to the Job File agent configuration. A DELETE section can contain one or more Id sections. In all other respects the content of ADD and DELETE sections adheres to the datamodel XML schemas org.eclipse.smila.datamodel/xml/id.xsd and org.eclipse.smila.datamodel/xml/record.xsd.

Attachments are handled slightly differently. Normally the XML datamodel contains only the name of an attachment, but during an import the attachment has to be filled with a value. The XML must therefore include not only the attachment name but also a URL where the actual attachment value is located. The two pieces of information are separated by the AttachmentSeparator configured in the Job File agent configuration.

For example, suppose the attachment named Content should be filled with the document referenced by http://www.eclipse.org, and the string #### is used as the AttachmentSeparator. The XML then looks like this:

...
    <Attachment>Content####http://www.eclipse.org</Attachment>
...
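
The split of such an entry into attachment name and attachment URL can be sketched in a few lines of Python. This is a minimal illustration of the convention described above, not the agent's actual code. The function name split_attachment is hypothetical, and splitting at the first occurrence of the separator is an assumption.

```python
def split_attachment(entry, separator="####"):
    """Hypothetical helper: split '<name><separator><url>' into (name, url)."""
    # partition() splits at the first occurrence of the separator only,
    # so a URL that happens to contain the separator stays intact.
    name, found, url = entry.partition(separator)
    if not found or not name or not url:
        raise ValueError("expected '<name>%s<url>', got %r" % (separator, entry))
    return name, url


name, url = split_attachment("Content####http://www.eclipse.org")
print(name)
print(url)
```

Choosing a separator that cannot occur in attachment names keeps the split unambiguous.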

Note: If you set the "_source" attribute for records in the job file, the value must match the DataSourceID in the Job File agent configuration! Otherwise the record is skipped.
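
The rule in this note can be expressed as a simple check. The Python sketch below is an illustration only. The function name is_accepted is hypothetical, and it assumes that a record without a "_source" attribute is accepted, which the note implies but does not state explicitly.

```python
def is_accepted(record_values, data_source_id):
    """Hypothetical check: does the record's _source match the DataSourceID?

    record_values maps attribute names to values, for example
    {"_source": "jobfile", "Path": "epl-v10.html"}.
    """
    source = record_values.get("_source")
    # Assumption: a record that does not set _source is not rejected.
    return source is None or source == data_source_id


print(is_accepted({"_source": "jobfile"}, "jobfile"))
print(is_accepted({"_source": "other"}, "jobfile"))
```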

Example of a job file

Here is an example of a job file with both an ADD and a DELETE section. It shows the different options of

  • creating Id objects from attribute values
  • providing Ids within the XML
  • loading data into attachments
  • providing text or markup data in attributes
<?xml version="1.0" encoding="UTF-8"?>
<JobFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.agent.jobfile/schemas/jobfile.xsd">
    <Add>
        <!-- sample record where id is created and content is loaded into attachment from file url //-->
        <Record version="2.0">
          <Val key="MimeType">text/html</Val>
          <Val key="Size" type="long">16536</Val>
          <Val key="Extension">html</Val>
          <Val key="LastModifiedDate" type="datetime">2009-03-13T10:42:00+0100</Val>
          <Val key="Filename">epl-v10.html</Val>
          <Val key="Path">epl-v10.html</Val>
          <Attachment>Content####epl-v10.html</Attachment>
        </Record>     
 
        <!-- sample record where id is created and content is loaded into attachment from http url //-->
        <Record version="1.0">
          <Val key="MimeType">text/html</Val>
          <Val key="Size" type="long">11765</Val>
          <Val key="Extension">html</Val>
          <Val key="LastModifiedDate" type="date">2009-07-09</Val>
          <Val key="Url">http://www.eclipse.org/smila/</Val>
          <Attachment>Content####http://www.eclipse.org/smila/</Attachment>
        </Record>     
 
        <!-- sample record where id is provided and txt content is provided in attribute //-->
        <Record version="2.0">
          <Val key="_recordid">jobfile:C:/sample folder/sample filename.txt</Val>
          <Val key="_source">jobfile</Val>
          <Val key="MimeType">text/plain</Val>
          <Val key="Size" type="long">16384</Val>
          <Val key="Extension">txt</Val>
          <Val key="LastModifiedDate" type="datetime">2009-07-09T14:53:16+0100</Val>
          <Val key="Filename">sample filename.txt</Val>
          <Val key="Path">C:/sample folder/sample filename.txt</Val>
          <Val key="Content">This is just some imaginary text content. Used to show how SMILA JobFileAgent works.</Val>   
        </Record>  
 
        <!-- sample record where id is provided and html content is provided in attribute //-->
        <Record version="2.0">
          <Val key="_recordid">jobfile:C:/sample folder/sample filename.html</Val>
          <Val key="_source">jobfile</Val>
          <Val key="MimeType">text/html</Val>
          <Val key="Size" type="long">16384</Val>
          <Val key="Extension">html</Val>
          <Val key="LastModifiedDate" type="datetime">2009-07-09T14:53:16+0100</Val>
          <Val key="Filename">sample filename.html</Val>
          <Val key="Path">C:/sample folder/sample filename.html</Val>
          <Val key="Content">
                 <![CDATA[
                    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
                    <HTML>
                     <HEAD>
                      <TITLE> A sample test document </TITLE>
                      <META NAME="Author" CONTENT="Daniel Stucky">
                      <META NAME="Keywords" CONTENT="SMILA eclipse">
                      <META NAME="Description" CONTENT="sample test document">
                     </HEAD>
                     <BODY>
                      This is just some imaginary text content. Used to show how SMILA's Job File agent works. It even contains a <a href="http://www.eclipse.org">link</a>.
                     </BODY>
                    </HTML>
                ]]>              
              </Val>   
        </Record> 
    </Add>
 
</JobFile>
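
To check what a job file like this contains before handing it to the agent, it can be read with a short script. The Python sketch below is an illustration and not part of SMILA. The function name read_records, the returned structure and the shortened sample data are assumptions. It relies only on the Add, Record, Val and Attachment elements shown above.

```python
import xml.etree.ElementTree as ET

# Shortened job file in the format shown above (two records only).
JOB_FILE = """<JobFile>
  <Add>
    <Record version="2.0">
      <Val key="Path">epl-v10.html</Val>
      <Attachment>Content####epl-v10.html</Attachment>
    </Record>
    <Record version="2.0">
      <Val key="_recordid">jobfile:C:/sample folder/sample filename.txt</Val>
      <Val key="_source">jobfile</Val>
      <Val key="Content">This is just some imaginary text content.</Val>
    </Record>
  </Add>
</JobFile>
"""


def read_records(xml_text, separator="####"):
    """Hypothetical helper: list the records of the Add section."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall("./Add/Record"):
        # Attribute values, keyed by the 'key' attribute of each Val element.
        values = {val.get("key"): (val.text or "").strip()
                  for val in record.findall("Val")}
        # Attachment entries have the form '<name><separator><url>'.
        attachments = {}
        for attachment in record.findall("Attachment"):
            name, _, url = (attachment.text or "").partition(separator)
            attachments[name] = url
        records.append({
            "values": values,
            "attachments": attachments,
            # Records without _recordid get their Id from the key attributes.
            "has_id": "_recordid" in values,
        })
    return records


records = read_records(JOB_FILE)
for record in records:
    print(record["has_id"], sorted(record["values"]), record["attachments"])
```

The first record carries no Id and names an attachment source, while the second provides its own Id and keeps its content in an attribute.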

See also