Difference between revisions of "SMILA/Documentation/Agent"

From Eclipsepedia


Latest revision as of 03:56, 21 April 2011

Overview

An Agent monitors a data source for changes, sending both content and metadata of interest about new/modified resources and Ids of deleted resources.


SMILA currently comes with two types of Agents, each for a different data source type: MockAgent (a sample implementation of an agent) and FeedAgent, which enables monitoring of RSS and Atom feeds. Furthermore, the Connectivity Framework provides an API for developers to create their own Agents.

API

An Agent has to implement the interface Agent, which extends the interface Runnable. The easiest way to achieve this is to extend the abstract base class AbstractAgent located in bundle org.eclipse.smila.connectivity.framework. This class already handles the Agent's Id, provides an OSGi service activate method, and offers default implementations of the start() and stop() methods that create a new Thread for the Agent to run in. So the only method that has to be implemented is run() of the Runnable interface, which contains the processing logic of the agent.

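Javadoc: org.eclipse.smila.connectivity.framework.Agent (http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/connectivity/framework/Agent.html)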

Architecture

Agents are managed and instantiated by the AgentController. The AgentController communicates with the Agent via the interface Agent to start or stop it. While the Agent is running, it communicates with the AgentController via the callback interface AgentControllerCallback to send add and delete events. The Agent itself has no reference to the DeltaIndexingManager; only the AgentController, which initializes the delta indexing session, has one. To identify the session, the parameter sessionId is passed in method start(final AgentControllerCallback controllerCallback, final AgentState agentState, final DataSourceConnectionConfig config, final String sessionId) so that the Agent can send it back to the AgentController via the interface AgentControllerCallback.

Agents extend the Runnable interface and must implement method run(). The abstract base class AbstractAgent already includes some functionality for thread handling: in the start() method a new Thread is created for the Agent and stored in a private member variable. The class also contains a private boolean flag _stopThread. The run() method should watch this flag using method isStopThread() to check when processing should end. Here is some skeleton code showing what the implementation could look like:

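  /**
   * Skeleton code for the run() method.
   * @see java.lang.Runnable#run()
   */
  public void run() {
    try {
      // keep working until stop() has set the _stopThread flag
      while (!isStopThread()) {
        try {

            // here goes the agent business logic
            // (it is expected to contain a blocking call, e.g. Thread.sleep(),
            // that can throw InterruptedException)

        } catch (InterruptedException e) {
          // an interrupt is not an error: log it and re-check the stop flag
          if (_log.isTraceEnabled()) {
            _log.trace("agent thread was interrupted ", e);
          }
        }
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    } catch (Throwable t) {
      throw new RuntimeException(t);
    } finally {
      // always stop the agent when the loop ends, normally or with an error
      try {
        stop();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }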
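
The interplay of start(), the stop flag and the callback can be tried out without SMILA on the classpath. The following is only a minimal sketch under stated assumptions: DemoCallback, DemoAbstractAgent, CountingAgent and AgentDemo are hypothetical stand-ins invented for this illustration. They are not the real AgentControllerCallback or AbstractAgent, whose signatures differ.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for AgentControllerCallback (the real interface differs).
interface DemoCallback {
  void add(String sessionId, String recordId);
}

// Hypothetical stand-in for AbstractAgent: owns the thread and the stop flag.
abstract class DemoAbstractAgent implements Runnable {
  private volatile boolean stopThread;
  private Thread thread;
  protected DemoCallback callback;
  protected String sessionId;

  // Stores callback and session id, then runs the agent in a new thread.
  void start(DemoCallback callback, String sessionId) {
    this.callback = callback;
    this.sessionId = sessionId;
    stopThread = false;
    thread = new Thread(this, "demo-agent");
    thread.start();
  }

  // Sets the flag, wakes the agent if it is sleeping, and waits for it to end.
  void stop() throws InterruptedException {
    stopThread = true;
    thread.interrupt();
    if (Thread.currentThread() != thread) {
      thread.join();
    }
  }

  boolean isStopThread() {
    return stopThread;
  }
}

// A toy agent that reports one new record every 10 ms until it is stopped.
class CountingAgent extends DemoAbstractAgent {
  @Override
  public void run() {
    int n = 0;
    while (!isStopThread()) {
      // send the session id back with every event, as described above
      callback.add(sessionId, "record-" + n++);
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        // stop() interrupted the sleep; the loop condition ends the run
      }
    }
  }
}

class AgentDemo {
  // Starts the agent, waits for at least one event, then stops it.
  static List<String> runBriefly() {
    List<String> seen = new CopyOnWriteArrayList<>();
    CountingAgent agent = new CountingAgent();
    try {
      agent.start((session, id) -> seen.add(session + ":" + id), "session-1");
      while (seen.isEmpty()) {
        Thread.sleep(5);
      }
      agent.stop();
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(runBriefly());
  }
}
```

The flag is declared volatile so that a change made by the stopping thread is visible to the agent thread. The interrupt only serves to cut a sleep short.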


Package org.eclipse.smila.connectivity.framework.util provides some factory classes for Agents to create Ids, hashes and DataReference objects.

Configuration

An Agent is started with a specific, named configuration that defines what information is to be sent (e.g. content, kinds of metadata) and where to find that data (e.g. file system path, JDBC connection string). See each Agent's documentation for details on its configuration options.

Each Agent can define its own configuration, because Agents need different information to monitor different data sources. For example, a JDBC Agent needs to know which database and which table should be monitored and which columns should be returned.

Therefore the Agent developer defines a schema that contains all the relevant information. This schema is based on a root schema that is shared between Agents and Crawlers. It declares the generic frame that has to be used to send DataSourceConnectionConfigs to the SMILA framework. The root schema can be found in: configuration\org.eclipse.smila.connectivity.framework.schema/schemas/RootDataSourceConnectionConfigSchema.xsd.

The root schema looks as follows:

[Figure: RootdatasourceConnectionConfig.png, diagram of the root schema]

DataSourceID
  A description string that is used throughout the framework to separate and address information that applies to the same agent.
SchemaID
  The SchemaID contains the whole bundle name of the Agent (e.g. FeedAgent: org.eclipse.smila.connectivity.framework.agent.feed).
  The SMILA Framework uses this information to find the schema for validating the DataSourceConnectionConfig that should be executed.
DataConnectionID
  This tag describes whether an Agent or a Crawler should be used. It contains one of the following tags:
    • Agent
    • Crawler
  The name that is used in these tags is the Service name of the Agent/Crawler.
RecordBuffer
  Here you can specify settings to optimize record transfer to ConnectivityManager. These settings are not applicable to Agents!
    • Size - the number of records to be sent to ConnectivityManager in one block. Default is 1.
    • FlushInterval - a time interval in milliseconds after which the current elements of the RecordBuffer are sent to ConnectivityManager. Default is 1000.
DeltaIndexing
  Configuration options for delta indexing that are to be interpreted by the AgentController. The following values are supported:
    • full - delta indexing is fully activated. Records are checked to see whether they need to be updated, entries for new/updated records are added to the DeltaIndexingManager, and delta-delete is executed if no error occurred.
    • additive - same as full, but delta-delete is not executed.
    • initial - for an initial import into an empty index, or for a new source in an existing index, performance can be optimized by NOT checking whether a record needs to be updated (all records are known to be new) while still adding an entry to the DeltaIndexingManager for each Record. This allows later runs using full or additive to make use of the DeltaIndexing information.
    • disabled - delta indexing is fully disabled. No checks are done, no entries are created/updated, and no delta-delete is executed. Later runs cannot benefit from DeltaIndexing.
CompoundHandling
  Configuration options for CompoundHandling. See CompoundManagement for details.
Attributes
  Placeholder for each Agent's attribute definition.
  Here the Agent defines which Attributes it returns. An attribute is a specific piece of information about an entry in the data source that is crawled by the Agent (e.g. in a file system an entry is a file, and attributes of a file are Size, Content, etc.).
Process
  This element is meant to be extended by the Agent developer in a derived schema and may be used to define anything that is pertinent to getting the Agent's job done.
  This information may include connection information for the data source to monitor, or filters such as queries, wildcards, includes, excludes, etc.
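
The four DeltaIndexing values listed above differ only in which steps of a run they enable. The following sketch tabulates them as a Java enum. It is an illustration only: DeltaIndexingMode and its methods are hypothetical names, not a SMILA type, and the text above does not say whether delta-delete runs in initial mode, so the value false for that case is an assumption.

```java
import java.util.Locale;

// Hypothetical illustration of the DeltaIndexing values; not part of SMILA.
enum DeltaIndexingMode {
  //       check for update, record entries, run delta-delete
  FULL(true, true, true),
  ADDITIVE(true, true, false),
  INITIAL(false, true, false), // delta-delete = false is an assumption
  DISABLED(false, false, false);

  private final boolean checksForUpdate;
  private final boolean recordsEntries;
  private final boolean runsDeltaDelete;

  DeltaIndexingMode(boolean checksForUpdate, boolean recordsEntries, boolean runsDeltaDelete) {
    this.checksForUpdate = checksForUpdate;
    this.recordsEntries = recordsEntries;
    this.runsDeltaDelete = runsDeltaDelete;
  }

  // Are records compared against stored delta indexing information?
  boolean checksForUpdate() {
    return checksForUpdate;
  }

  // Is an entry written for each new or updated record?
  boolean recordsEntries() {
    return recordsEntries;
  }

  // Is delta-delete executed at the end of an error-free run?
  boolean runsDeltaDelete() {
    return runsDeltaDelete;
  }

  // Maps a configuration value such as "full" to the matching constant.
  static DeltaIndexingMode fromConfig(String value) {
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}

class DeltaIndexingModeDemo {
  public static void main(String[] args) {
    for (DeltaIndexingMode mode : DeltaIndexingMode.values()) {
      System.out.println(mode + ": check=" + mode.checksForUpdate()
          + ", entries=" + mode.recordsEntries()
          + ", deltaDelete=" + mode.runsDeltaDelete());
    }
  }
}
```

Read this way, a later full or additive run can build on an initial run because both write entries, whereas a disabled run leaves nothing behind for later runs to use.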


Further Information:

  1. See each Agent's documentation for its Attributes and Process tags
  2. How to implement an Agent

Agent lifecycle

The AgentController manages the life cycle of the agent (e.g. start, stop, abort) and may instantiate multiple agents concurrently, even of the same type. This is realised by using OSGi ComponentFactories. An agent does not automatically start an OSGi service; it only registers an Agent ComponentFactory with the AgentController. Via the ComponentFactory the AgentController can instantiate agents on demand.

Here is a template for an agent OSGi component definition:

<component name="%AGENT_TYPE%" immediate="false" factory="AgentFactory">
    <implementation class="%AGENT_IMPLEMENTATION_CLASS%" />
    <service>
        <provide interface="org.eclipse.smila.connectivity.framework.Agent"/>
    </service>
</component>
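
The idea of registering only a factory and creating agents on demand can be pictured in plain Java. The sketch below is an assumption-laden illustration, not SMILA code: DemoAgentController and DemoAgent are hypothetical, and a java.util.function.Supplier stands in for the OSGi ComponentFactory, which in a real OSGi runtime creates instances through its newInstance method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical agent with no behaviour; it only remembers its type.
class DemoAgent implements Runnable {
  final String type;

  DemoAgent(String type) {
    this.type = type;
  }

  @Override
  public void run() {
    // the agent logic would go here
  }
}

// Hypothetical controller; a Supplier stands in for an OSGi ComponentFactory.
class DemoAgentController {
  private final Map<String, Supplier<DemoAgent>> factories = new HashMap<>();

  // Called once per agent type, analogous to a factory registering itself.
  void registerFactory(String agentType, Supplier<DemoAgent> factory) {
    factories.put(agentType, factory);
  }

  // Creates a fresh agent on demand; every call yields a separate instance.
  DemoAgent newAgent(String agentType) {
    Supplier<DemoAgent> factory = factories.get(agentType);
    if (factory == null) {
      throw new IllegalArgumentException("no factory registered for " + agentType);
    }
    return factory.get();
  }
}

class FactoryDemo {
  public static void main(String[] args) {
    DemoAgentController controller = new DemoAgentController();
    controller.registerFactory("FeedAgent", () -> new DemoAgent("FeedAgent"));
    DemoAgent first = controller.newAgent("FeedAgent");
    DemoAgent second = controller.newAgent("FeedAgent");
    // two agents of the same type exist side by side
    System.out.println(first != second);
  }
}
```

Because the controller holds factories rather than instances, nothing runs until an agent is requested, and several agents of one type can run concurrently.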

See also

More information about the different Agents can be found here: