
Difference between revisions of "SMILA/Documentation/HowTo/How to implement a crawler"

m
 
(4 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{note| Outdated | This needs to be updated for v0.8. For now, look at the code of existing crawlers}}
+
{{note|This is deprecated for SMILA 1.0, the connectivity framework is still functional but will aimed to be replaced by scalable import based on SMILAs job management.}}
  
 +
Explains how to implement an [[SMILA/Glossary#C|Crawler]] and [[SMILA/Howto integrate a component in SMILA|add its functionality]] to SMILA.
  
This page explains how to implement a [[SMILA/Glossary#C|crawler]] and [[SMILA/Howto_integrate_a_component_in_SMILA|add its functionality]] to SMILA.
+
== Prepare bundle and manifest  ==
  
== Prepare bundle and manifest ==
+
*Create a new bundle that will contain your crawler. Follow the instructions on [[SMILA/Development Guidelines/Create a bundle (plug-in)|How to create a bundle]]. In this sample we use the prefix <tt>myplugin.crawler.mock</tt> for the name of project.
 +
*For crawler JXB code generation we need to import SMILA.builder project into our workspace.
  
* Create a new bundle that will contain your crawler. Follow the instructions on [[SMILA/Development_Guidelines/Create_a_bundle_%28plug-in%29|How to create a bundle]] and name the project  <tt>org.eclipse.smila.connectivity.framework.crawler</tt>.
+
*Edit the manifest file and add at least the following packages to the ''Import-Package'' section.  
* Edit the manifest file and add the following packages to the ''Import-Package'' section.
+
**<tt>org.eclipse.smila.connectivity;version="1.0.0"</tt>
** <tt>org.eclipse.smila.connectivity.framework;version="0.5.0"</tt>
+
**<tt>org.eclipse.smila.connectivity.framework;version="1.0.0"</tt>
** <tt>org.eclipse.smila.connectivity.framework.indexorder;version="0.5.0"</tt>
+
**<tt>org.eclipse.smila.connectivity.framework.performancecounters;version="1.0.0"</tt>
** <tt>org.eclipse.smila.connectivity.framework.indexorder.messages.interfaces;version="0.5.0"</tt>
+
**<tt>org.eclipse.smila.connectivity.framework.schema;version="1.0.0"</tt>
** <tt>org.eclipse.smila.connectivity.framework.indexorder.schematools;version="0.5.0"</tt>
+
**<tt>org.eclipse.smila.connectivity.framework.schema.config;version="1.0.0"</tt>
** <tt>org.eclipse.smila.connectivity.framework.utils;version="0.5.0"</tt>
+
**<tt>org.eclipse.smila.connectivity.framework.schema.config.interfaces;version="1.0.0"</tt>
** <tt>org.eclipse.smila.datamodel.record;version="0.5.0"</tt>
+
**<tt>org.eclipse.smila.connectivity.framework.util;version="1.0.0"</tt>
** <tt>com.sun.xml.bind.v2;version="2.1.6"</tt>
+
**<tt>org.eclipse.smila.datamodel;version="1.0.0"</tt>
** <tt>javax.xml.bind;version="2.1.0"</tt>
+
** <tt>javax.xml.bind.annotation;version="2.1.0"</tt>
+
** <tt>javax.xml.bind.annotation.adapters;version="2.1.0"</tt>
+
** <tt>javax.xml.stream;version="1.0.0"</tt>
+
** <tt>org.apache.commons.logging;version="1.1.1"</tt>
+
  
* The manifest should now look like this:
+
*you will have to add additional packages to fill you crawler with business logic&nbsp;!
<pre>
+
 
 +
*Now your MANIFEST.MF file should be like
 +
<source lang="text">
 
Manifest-Version: 1.0
 
Manifest-Version: 1.0
 
Bundle-ManifestVersion: 2
 
Bundle-ManifestVersion: 2
Bundle-Name: Filesystem Crawler Plug-in (Incubation)
+
Bundle-Name: Mock Crawler
Bundle-SymbolicName: org.eclipse.smila.connectivity.framework.crawler.filesystem;singleton:=true
+
Bundle-SymbolicName: myplugin.crawler.mock
Bundle-Version: 0.5.0
+
Bundle-Version: 1.0.0
Bundle-Vendor: empolis GmbH and brox IT Solutions GmbH
+
Bundle-RequiredExecutionEnvironment: JavaSE-1.6
Import-Package: com.sun.xml.bind.v2;version="2.1.6",
+
Import-Package:  
javax.xml.bind;version="2.1.0",
+
  org.eclipse.smila.connectivity;version="1.0.0",
javax.xml.bind.annotation;version="2.1.0",
+
  org.eclipse.smila.connectivity.framework;version="1.0.0",
javax.xml.bind.annotation.adapters;version="2.1.0",
+
  org.eclipse.smila.connectivity.framework.performancecounters;version="1.0.0",
javax.xml.stream;version="1.0.1",
+
  org.eclipse.smila.connectivity.framework.schema;version="1.0.0",
org.apache.commons.io;version="1.4.0",
+
  org.eclipse.smila.connectivity.framework.schema.config;version="1.0.0",
org.apache.commons.logging;version="1.1.1",
+
  org.eclipse.smila.connectivity.framework.schema.config.interfaces;version="1.0.0",
  org.eclipse.smila.connectivity.framework;version="0.5.0",
+
  org.eclipse.smila.connectivity.framework.util;version="1.0.0",
  org.eclipse.smila.connectivity.framework.indexorder;version="0.5.0",
+
  org.eclipse.smila.datamodel;version="1.0.0"
  org.eclipse.smila.connectivity.framework.indexorder.messages;version="0.5.0",
+
</source>
  org.eclipse.smila.connectivity.framework.indexorder.messages.interfaces;version="0.5.0",
+
  org.eclipse.smila.connectivity.framework.indexorder.schematools;version="0.5.0",
+
  org.eclipse.smila.connectivity.framework.utils;version="0.5.0",
+
org.eclipse.smila.datamodel.record;version="0.5.0",
+
  org.eclipse.smila.utils.digest;version="0.5.0"
+
Export-Package: org.eclipse.smila.connectivity.framework.crawler.filesystem;version="0.5.0",
+
  org.eclipse.smila.connectivity.framework.crawler.filesystem.messages;version="0.5.0"
+
Service-Component: OSGI-INF/filesystemcrawler.xml
+
Eclipse-LazyStart: false
+
</pre>
+
  
== Prepare Indexorder schema and classes ==
+
== Prepare DataSourceConnect schema and classes ==
  
'''NOTE''' The information in this section is not completely up-to-date. For details about schema definition and compilaton best have a look at the source code of the crawlers that come with SMILA (e.g. the filesystem crawler in bundle <tt>org.eclipse.smila.connectivity.framework.crawler.filesystem</tt>) and use that as a template.
+
*create an additional source folder <tt>code/gen</tt> to contain the generated schema sources
 +
**Right-click your bundle and click ''New &gt; Source Folder''.  
 +
**Enter "code/gen" as the folder name.  
 +
**edit build.properties and add folder <tt>code/gen</tt> to the source folders.
 +
 
 +
<source lang="text">
 +
source.. = code/src/,\
 +
          code/gen/
 +
output.. = code/bin/
 +
</source>
 +
 
 +
<br>
 +
 
 +
*create schema definition
 +
**create a folder <tt>schema</tt> in your bundle
 +
**create file <tt>schemas\MockCrawlerSchema.xsd</tt> to contain the XSD schema for the crawler configuration based on the abstract XSD schema "RootDataSourceConnectionConfigSchema"
 +
**therin you have to provide definitions of "Process" and "Attribute" nodes for crawler specific information
 +
**the following code snippet can be used as a template
  
* Add the <tt>code/gen</tt> folder to the source folders (build.properties : <tt>source.. = code/src/,code/gen/</tt>):
 
** Right-click your bundle and click ''New > Source Folder''.
 
** Enter "code/gen" as the folder name.
 
* Copy the content of the folder <tt>[[Media:SMILA-template-crawler.zip|template-crawler]]</tt> into your crawler bundle folder.
 
* Compile schema into JAXB classes by running <tt>ant</tt>.
 
** Launch <tt>ant</tt> from a cmd console to see the result or error messages.
 
** See [[SMILA/Development_Guidelines/Setup for JAXB code generation]] for instruction on how to setup the JAXB generation tools.
 
* Implement XSD schema for the crawler configuration using the template <tt>schemas\TemplateIndexOrder.xsd</tt>.
 
** Index Order configuration based on XSD schema redefinition of abstract "RootIndexOrderConfiguration" schema
 
** Developer should define redefinition of "Process" and "Attribute" nodes for crawler specific information.
 
 
<source lang="xml">
 
<source lang="xml">
 +
<?xml version="1.0" encoding="UTF-8"?>
 
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:redefine schemaLocation="../../org.eclipse.smila.connectivity.framework.indexorder/schemas/RootIndexOrderConfiguration.xsd">
+
   <xs:redefine schemaLocation="../../org.eclipse.smila.connectivity.framework.schema/schemas/RootDataSourceConnectionConfigSchema.xsd">
 
     <xs:complexType name="Process">
 
     <xs:complexType name="Process">
 
       <xs:annotation>
 
       <xs:annotation>
Line 73: Line 70:
 
       <xs:complexContent>
 
       <xs:complexContent>
 
         <xs:extension base="Process">
 
         <xs:extension base="Process">
  <\!--define process here -->
+
 
 +
      <\!--define crawler specific process here -->
 +
 
 
         </xs:extension>
 
         </xs:extension>
 
       </xs:complexContent>
 
       </xs:complexContent>
Line 80: Line 79:
 
       <xs:complexContent>
 
       <xs:complexContent>
 
         <xs:extension base="Attribute">
 
         <xs:extension base="Attribute">
  <\!--define attribute here -->
+
 
 +
      <\!--define crawler specific attributes here -->
 +
 
 
         </xs:extension>
 
         </xs:extension>
 
       </xs:complexContent>
 
       </xs:complexContent>
Line 86: Line 87:
 
   </xs:redefine>
 
   </xs:redefine>
 
</xs:schema>
 
</xs:schema>
</source>
+
</source>  
* Rename and edit JAXB mapping file used for generating configuration classes (TemplateIndexOrder.jxb).
+
 
** Update package name
+
*create JAXB mapping  
<source lang="xml">
+
**create file <tt>schemas\MockCrawlerSchema.jxb</tt> to contain the JAXB mappings used for generating configuration classes.  
  <jxb:package name="org.eclipse.smila.connectivity.framework.crawler.MYCRAWLER.messages"/>
+
**Here is an example for the <tt>MockCrawler</tt> JXB file you can use as a template, just rename the "schemaLocation" and "package name":
</source>
+
 
:* Update schema location
+
<source lang="xml">
+
<jxb:bindings schemaLocation="TemplateIndexOrder.xsd"
+
</source>
+
* Add a schema location reference in the plug-in implementation (return "schemas/TemplateIndexOrder.xsd").
+
** Create a new class (<tt>IndexOrderSchemaImpl</tt>) which implements the interface <tt>IndexOrderSchema</tt>.
+
** Use the method <tt>String getSchemaLocation()</tt> to return "schemas/TemplateIndexOrder.xsd".
+
** Use the method <tt>String getMessagesPackage()</tt> to return package name"org.eclipse.smila.connectivity.framework.crawler.MYCRAWLER.messages".
+
 
<source lang="xml">
 
<source lang="xml">
package org.eclipse.smila.connectivity.framework.crawler.MYCRAWLER;
+
<jxb:bindings version="1.0"
 +
  xmlns:jxb="http://java.sun.com/xml/ns/jaxb"
 +
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
 +
 +
  <jxb:bindings schemaLocation="MockCrawlerSchema.xsd" node="/xs:schema">
 +
    <jxb:schemaBindings>
 +
      <jxb:package name="mypackage.crawler.mock.messages"/>
 +
    </jxb:schemaBindings>   
 +
    <jxb:globalBindings>
 +
      <jxb:javaType name="java.util.Date" xmlType="xs:dateTime" printMethod="org.eclipse.smila.connectivity.framework.schema.tools.SimpleDateFormatter.print" parseMethod="org.eclipse.smila.connectivity.framework.schema.tools.SimpleDateFormatter.parse"/>
 +
      <jxb:javaType name="org.eclipse.smila.connectivity.framework.schema.config.MimeTypeAttributeType" xmlType="MimeTypeAttributeType" parseMethod="org.eclipse.smila.connectivity.framework.schema.config.MimeTypeAttributeType.fromValue" printMethod="org.eclipse.smila.connectivity.framework.schema.config.MimeTypeAttributeType.toValue"/>
 +
      <jxb:serializable uid="1"/>
 +
    </jxb:globalBindings>
 +
  </jxb:bindings>
 +
</jxb:bindings>
 +
</source>
  
import org.eclipse.smila.connectivity.framework.indexorder.IndexOrderSchema;
+
<br>
 +
 
 +
*Add a schema location reference in the plug-in implementation
 +
**Create a new class (<tt>DataSourceConnectionConfigPluginImpl</tt>) which implements the interface <tt>DataSourceConnectionConfigPlugin</tt>.
 +
**Use the method <tt>String getSchemaLocation()</tt> to return "schemas/MockCrawlerSchema.xsd".
 +
**Use the method <tt>String getMessagesPackage()</tt> to return package name"mypackage.crawler.mock.messages".
 +
 
 +
Here is an example implementation for the <tt>MockCrawler</tt> you can use as a template: <source lang="java">
 +
package mypackage.crawler.mock;
 +
 
 +
import org.eclipse.smila.connectivity.framework.schema.DataSourceConnectionConfigPlugin;
  
 
/**
 
/**
  * The Class IndexOrderSchemaImpl.
+
  * The Class DataSourceConnectionConfigPluginImpl.
 
  */
 
  */
public class IndexOrderSchemaImpl implements IndexOrderSchema {
+
public class DataSourceConnectionConfigPluginImpl implements DataSourceConnectionConfigPlugin {
  
 
   /**
 
   /**
 
   * {@inheritDoc}
 
   * {@inheritDoc}
 
   *  
 
   *  
   * @see org.eclipse.smila.connectivity.framework.indexorder.IndexOrderSchema#getSchemaLocation()
+
   * @see org.eclipse.smila.connectivity.framework.schema.DataSourceConnectionConfigPlugin#getSchemaLocation()
 
   */
 
   */
 
   public String getSchemaLocation() {
 
   public String getSchemaLocation() {
     return "schemas/MYCRAWLER_IndexOrder.xsd";
+
     return "schemas/MockCrawlerSchema.xsd";
 
   }
 
   }
  
Line 122: Line 140:
 
   * {@inheritDoc}
 
   * {@inheritDoc}
 
   *  
 
   *  
   * @see org.eclipse.smila.connectivity.framework.indexorder.IndexOrderSchema#getMessagesPackage()
+
   * @see org.eclipse.smila.connectivity.framework.schema.DataSourceConnectionConfigPlugin#getMessagesPackage()
 
   */
 
   */
 
   public String getMessagesPackage() {
 
   public String getMessagesPackage() {
     return "org.eclipse.smila.connectivity.framework.crawler.MYCRAWLER.messages";
+
     return "mypackage.crawler.mock.messages";
 
   }
 
   }
 
  
 
}
 
}
</source>
+
</source>  
 +
 
 +
*create new file <tt>plugin.xml</tt>
 +
**define the extension for <tt>org.eclipse.smila.connectivity.framework.schema.extension</tt>, using the bundle name as ID and NAME.
 +
**set the schema class to your implmenetation of interface <tt>DataSourceConnectionConfigPlugin</tt>
 +
**Here is an example for the <tt>MockCrawler</tt> <tt>plugin.xml</tt> file you can use as a template:
  
* Implement the extension for <tt>org.eclipse.smila.connectivity.framework.indexorder.schema</tt> with the bundle name used as ID and NAME.
 
* Check the schema classes in the file <tt>plugin.xml</tt> and change if necessary.
 
 
<source lang="java">
 
<source lang="java">
 
<plugin>
 
<plugin>
 
   <extension
 
   <extension
         id="org.eclipse.smila.connectivity.framework.crawler.filesystem"
+
         id="myplugin.crawler.mock"
         name="org.eclipse.smila.connectivity.framework.crawler.filesystem"
+
         name="myplugin.crawler.mock"
         point="org.eclipse.smila.connectivity.framework.indexorder.schema">
+
         point="org.eclipse.smila.connectivity.framework.schema.extension">
 
       <schema
 
       <schema
             class="org.eclipse.smila.connectivity.framework.crawler.filesystem.IndexOrderSchemaImpl">
+
             class="mypackage.crawler.mock.DataSourceConnectionConfigPluginImpl">
 
       </schema>
 
       </schema>
 
   </extension>
 
   </extension>
 
</plugin>
 
</plugin>
</source>
+
</source>  
  
'''Note:''' If you rename the schema file name, make sure to update the following locations:
+
<br>  
* Plug-in implementation classes
+
* <tt>TemplateIndexOrder.jxb</tt> (it also should be renamed with the same name as schema)
+
* <tt>build.xml</tt>
+
  
== OSGi and Declarative Service requirements ==
+
*Compile schema into JAXB classes by using <tt>ant</tt>
 +
**See [[SMILA/Development Guidelines/Setup for JAXB code generation]] for instruction on how to setup the JAXB generation tools. It is advised to let lib outside the workspace, for example in a lower level folder. (my -Dlib.dir=../../
 +
**create a new file <tt>build.xml</tt> to contain JXB build information. Use the following template as the content for file <tt>build.xml</tt> and rename the property value accordingly:
  
* It is not required to implement a BundleActivator.  
+
<source lang="xml">
* Create the top level folder <tt>OSGI-INF</tt>.
+
<project name="sub-build" default="compile-schema-and-decorate" basedir=".">
* Create a Component Description file in <tt>OSGI-INF</tt>. You can name the file as you like, but it is good practice to name it like the crawler. Therein you have to provide a unique component name, it should be the same as the crawler's class name followed by DS (for Declarative Service). Then you have to provide your implementation class and the service interface class, which is always <tt>org.eclipse.smila.connectivity.framework.Crawler</tt>. Here is an example for the <tt>FileSystemCrawler</tt>:
+
 
 +
  <property name="schema.name"  value="MockCrawlerSchema" />
 +
 
 +
  <import file="../SMILA.builder/xjc/build.xml" />
 +
 
 +
</project>
 +
</source>
 +
**Launch <tt>ant -Dlib.dir=../lib</tt> from a cmd console to create the java files or to see any error messages.
 +
 
 +
<br> '''Note:''' If you rename the schema file name, make sure to update the following locations:
 +
*Plug-in implementation classes
 +
*<tt>MockCrawlerSchema.jxb</tt> (it also should be renamed with the same name as schema)
 +
*<tt>build.xml</tt>
 +
 
 +
== OSGi and Declarative Service requirements  ==
 +
 
 +
*It is not required to implement a BundleActivator.  
 +
*Create the top level folder <tt>OSGI-INF</tt>.  
 +
*Create a Component Description file in <tt>OSGI-INF</tt>. You can name the file as you like, but it is good practice to name it like the crawler. Therein you have to provide a unique component name, it should be the same as the crawler's class name. Then you have to provide your implementation class and the service interface class, which is always <tt>org.eclipse.smila.connectivity.framework.Crawler</tt>. Here is an example for the <tt>MockCrawler</tt> component description file you can use as a template:
  
 
<source lang="xml">
 
<source lang="xml">
<?xml version="1.0" encoding="UTF-8"?>
+
<component name="MockCrawler" immediate="false" factory="CrawlerFactory">
<component name="FileSystemCrawlerDS" immediate="true">
+
     <implementation class="mypackage.crawer.mock.MockCrawler" />
     <implementation class="org.eclipse.smila.connectivity.framework.crawler.filesystem.FileSystemCrawler" />
+
 
     <service>
 
     <service>
 
         <provide interface="org.eclipse.smila.connectivity.framework.Crawler"/>
 
         <provide interface="org.eclipse.smila.connectivity.framework.Crawler"/>
     </service>
+
     </service>  
    <property name="org.eclipse.smila.connectivity.framework.crawler.type" value="filesystemcrawler"/>
+
 
</component>
 
</component>
</source>
+
</source>  
  
* Add a ''Service-Component'' entry to your manifest file, e.g.:
+
*Add a ''Service-Component'' entry to your manifest file, e.g.:
<pre>
+
<pre>Service-Component: OSGI-INF/mockcrawler.xml
Service-Component: OSGI-INF/filesystemcrawler.xml
+
</pre>  
</pre>
+
*Open <tt>build.properties</tt> and change the binary build: Add the folders <tt>OSGI-INF</tt> and <tt>schemas</tt> as well as the file <tt>plugin.xml</tt>.
  
* Open <tt>build.properties</tt> and change the binary build: Add the folders <tt>OSGI-INF</tt> and <tt>schemas</tt> as well as the file <tt>plugin.xml</tt>.
 
 
<source lang="xml">
 
<source lang="xml">
source.. = code/src/,\
 
          code/gen/
 
output.. = code/bin/
 
 
bin.includes = META-INF/,\
 
bin.includes = META-INF/,\
 
               .,\
 
               .,\
Line 184: Line 215:
 
               schemas/,\
 
               schemas/,\
 
               OSGI-INF/
 
               OSGI-INF/
</source>
+
</source>  
 +
 
 +
<br>
 +
 
 +
== Implement your crwler  ==
 +
 
 +
*Implement your crawler in a new class extending <tt>org.eclipse.smila.connectivity.framework.AbstractCrawler</tt>.
 +
 
 +
*Integrate your new agent bundle into the build process: Refer to the page [[SMILA/Development Guidelines/How to integrate new bundle into build process|How to integrate new bundle into build process]] for further instructions.
 +
 
 +
* Follow the example of FileSystemCrawler
 +
 
 +
[optional]
 +
 
 +
*Create a JUnit test bundle for this crawler e.g. <tt>myplugin.crawler.mock.test</tt>.
 +
*Integrate your test bundle into the build process: Refer to the page [[SMILA/Development Guidelines/How to integrate test bundle into build process|How to integrate test bundle into build process]]) for further instructions.
 +
 
 +
== Activate your crawler  ==
 +
 
 +
=== Activation SMILA in eclipse  ===
  
== Develop your crawler ==
+
*Open the ''Run'' dialog, switch to the configuration page of ''Bundles'', select your bundle and set the parameter ''Default Auto-Start'' to ''true''.
 +
*Launch <tt>SMILA.launch</tt>.
  
* Implement your crawler in a new class extending <tt>org.eclipse.smila.connectivity.framework.AbstractCrawler</tt>.
+
=== Activation SMILA application  ===
* Create a JUnit test bundle for this crawler e.g. <tt>org.eclipse.smila.connectivity.framework.crawler.filesystem.test</tt>.
+
* Integrate your new crawler bundle into the build process: Refer to the page [[SMILA/Development_Guidelines/How to integrate new bundle into build process|How to integrate new bundle into build process]] for further instructions.
+
* Integrate your test bundle into the build process: Refer to the page [[SMILA/Development_Guidelines/How to integrate test bundle into build process|How to integrate test bundle into build process]]) for further instructions.
+
  
== Run your crawler ==
+
*Insert your bundle , e.g. <tt>myplugin.crawler.mock@4:start</tt>, to the <tt>config.ini</tt> file.
 +
*Launch SMILA by calling either <tt>SMILA.exe</tt> or <tt>eclipse.exe -console</tt>
  
=== Running SMILA in eclipse ===
+
== Run your crawler ==
* Open the ''Run'' dialog, switch to the configuration page of your bundle, set the parameter ''Default Start level'' to ''4'', and the parameter ''Default Auto-Start'' to ''true''.
+
* Launch <tt>SMILA.launch</tt>.
+
  
=== Running SMILA application  ===
+
Information on how to start and run an Crawler can be found in the [[SMILA/Documentation/CrawlerController|CrawlerController]] documentation.  
* Insert <tt>org.eclipse.smila.connectivity.framework.crawler.filesystem@4:start, \</tt> to the <tt>config.ini</tt> file.
+
* Launch SMILA by calling either <tt>SMILA.exe</tt> or <tt>eclipse.exe -console</tt>
+
  
 
[[Category:SMILA]]
 
[[Category:SMILA]]

Latest revision as of 09:29, 24 January 2012

This is deprecated for SMILA 1.0. The connectivity framework is still functional, but it is intended to be replaced by scalable import based on SMILA's job management.


Explains how to implement a crawler and add its functionality to SMILA.

Prepare bundle and manifest

  • Create a new bundle that will contain your crawler. Follow the instructions on How to create a bundle. In this sample we use the prefix myplugin.crawler.mock for the project name.
  • For the crawler's JAXB code generation, import the SMILA.builder project into your workspace.
  • Edit the manifest file and add at least the following packages to the Import-Package section.
    • org.eclipse.smila.connectivity;version="1.0.0"
    • org.eclipse.smila.connectivity.framework;version="1.0.0"
    • org.eclipse.smila.connectivity.framework.performancecounters;version="1.0.0"
    • org.eclipse.smila.connectivity.framework.schema;version="1.0.0"
    • org.eclipse.smila.connectivity.framework.schema.config;version="1.0.0"
    • org.eclipse.smila.connectivity.framework.schema.config.interfaces;version="1.0.0"
    • org.eclipse.smila.connectivity.framework.util;version="1.0.0"
    • org.eclipse.smila.datamodel;version="1.0.0"
  • You will have to add further packages to fill your crawler with business logic.
  • Your MANIFEST.MF file should now look like this:
Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-Name: Mock Crawler
Bundle-SymbolicName: myplugin.crawler.mock
Bundle-Version: 1.0.0
Bundle-RequiredExecutionEnvironment: JavaSE-1.6
Import-Package: 
 org.eclipse.smila.connectivity;version="1.0.0",
 org.eclipse.smila.connectivity.framework;version="1.0.0",
 org.eclipse.smila.connectivity.framework.performancecounters;version="1.0.0",
 org.eclipse.smila.connectivity.framework.schema;version="1.0.0",
 org.eclipse.smila.connectivity.framework.schema.config;version="1.0.0",
 org.eclipse.smila.connectivity.framework.schema.config.interfaces;version="1.0.0",
 org.eclipse.smila.connectivity.framework.util;version="1.0.0",
 org.eclipse.smila.datamodel;version="1.0.0"
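
The import list above can be checked mechanically. The following is a minimal sketch, not part of SMILA: the class name ManifestImportCheck and its method are hypothetical, and it only uses the JDK's java.util.jar.Manifest parser to report which of the packages listed above are missing from the Import-Package header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.jar.Manifest;

/** Hypothetical helper (not part of SMILA): reports required packages missing from Import-Package. */
final class ManifestImportCheck {

  /** The packages listed above as the minimum imports for a crawler bundle. */
  static final List<String> REQUIRED = Arrays.asList(
      "org.eclipse.smila.connectivity",
      "org.eclipse.smila.connectivity.framework",
      "org.eclipse.smila.connectivity.framework.performancecounters",
      "org.eclipse.smila.connectivity.framework.schema",
      "org.eclipse.smila.connectivity.framework.schema.config",
      "org.eclipse.smila.connectivity.framework.schema.config.interfaces",
      "org.eclipse.smila.connectivity.framework.util",
      "org.eclipse.smila.datamodel");

  /** Parses the manifest text and returns the required packages it does not import. */
  static List<String> missingImports(String manifestText) throws IOException {
    Manifest manifest = new Manifest(
        new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
    String header = manifest.getMainAttributes().getValue("Import-Package");
    List<String> declared = new ArrayList<>();
    if (header != null) {
      // Naive split: sufficient for plain versions such as version="1.0.0",
      // but not for version ranges, which contain commas themselves.
      for (String clause : header.split(",")) {
        // Keep only the package name and drop attributes such as ;version=...
        declared.add(clause.split(";")[0].trim());
      }
    }
    List<String> missing = new ArrayList<>(REQUIRED);
    missing.removeAll(declared);
    return missing;
  }

  public static void main(String[] args) throws IOException {
    // Deliberately incomplete sample; continuation lines start with one space.
    String manifestText = "Manifest-Version: 1.0\n"
        + "Bundle-SymbolicName: myplugin.crawler.mock\n"
        + "Import-Package: org.eclipse.smila.connectivity;version=\"1.0.0\",\n"
        + " org.eclipse.smila.datamodel;version=\"1.0.0\"\n";
    System.out.println("Missing imports: " + missingImports(manifestText));
  }
}
```

In a real bundle the text would be read from META-INF/MANIFEST.MF instead of the inline sample. The manifest text must end with a line break, otherwise the JDK parser rejects the last line.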

Prepare DataSourceConnect schema and classes

  • Create an additional source folder code/gen to contain the generated schema sources.
    • Right-click your bundle and click New > Source Folder.
    • Enter "code/gen" as the folder name.
    • Edit build.properties and add the folder code/gen to the source folders.
source.. = code/src/,\
           code/gen/
output.. = code/bin/


  • Create the schema definition.
    • Create a folder schemas in your bundle.
    • Create the file schemas\MockCrawlerSchema.xsd to contain the XSD schema for the crawler configuration, based on the abstract XSD schema "RootDataSourceConnectionConfigSchema".
    • Therein, provide definitions of the "Process" and "Attribute" nodes for crawler-specific information.
    • The following code snippet can be used as a template:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:redefine schemaLocation="../../org.eclipse.smila.connectivity.framework.schema/schemas/RootDataSourceConnectionConfigSchema.xsd">
    <xs:complexType name="Process">
      <xs:annotation>
        <xs:documentation>Process Specification</xs:documentation>
      </xs:annotation>
      <xs:complexContent>
        <xs:extension base="Process">
 
          <!--define crawler specific process here -->
 
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
    <xs:complexType name="Attribute">
      <xs:complexContent>
        <xs:extension base="Attribute">
 
          <!--define crawler specific attributes here -->
 
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>
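
Before running the code generation, it can help to confirm that the schema file is well-formed XML and that it redefines both "Process" and "Attribute". The following is a minimal sketch using only the JDK's DOM parser; the class name SchemaRedefinitionCheck is hypothetical and not part of SMILA. It does not resolve the schemaLocation or validate against the root schema, it only inspects the file's structure.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper (not part of SMILA): lists the complex types redefined in a crawler schema. */
final class SchemaRedefinitionCheck {

  static final String XS = "http://www.w3.org/2001/XMLSchema";

  /** Returns the names of all xs:complexType elements inside xs:redefine; throws if the XML is malformed. */
  static List<String> redefinedTypes(String xsdText)
      throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // needed to match the xs: prefix by namespace
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xsdText)));

    List<String> names = new ArrayList<>();
    NodeList redefines = doc.getElementsByTagNameNS(XS, "redefine");
    for (int i = 0; i < redefines.getLength(); i++) {
      Element redefine = (Element) redefines.item(i);
      NodeList types = redefine.getElementsByTagNameNS(XS, "complexType");
      for (int j = 0; j < types.getLength(); j++) {
        names.add(((Element) types.item(j)).getAttribute("name"));
      }
    }
    return names;
  }

  public static void main(String[] args) throws Exception {
    // Shortened stand-in for schemas/MockCrawlerSchema.xsd.
    String xsd = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">\n"
        + "  <xs:redefine schemaLocation=\"RootDataSourceConnectionConfigSchema.xsd\">\n"
        + "    <xs:complexType name=\"Process\"/>\n"
        + "    <xs:complexType name=\"Attribute\"/>\n"
        + "  </xs:redefine>\n"
        + "</xs:schema>\n";
    System.out.println("Redefined types: " + redefinedTypes(xsd));
  }
}
```

A check like this also catches malformed comment markers in the template, because the parser reports a SAXException for anything that is not valid XML.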
  • Create the JAXB mapping.
    • Create the file schemas\MockCrawlerSchema.jxb to contain the JAXB mappings used for generating configuration classes.
    • Here is an example of the MockCrawler JXB file that you can use as a template; just change the "schemaLocation" and "package name" values:
<jxb:bindings version="1.0" 
  xmlns:jxb="http://java.sun.com/xml/ns/jaxb" 
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
>  
  <jxb:bindings schemaLocation="MockCrawlerSchema.xsd" node="/xs:schema">
    <jxb:schemaBindings>
      <jxb:package name="mypackage.crawler.mock.messages"/>
    </jxb:schemaBindings>    
    <jxb:globalBindings>
      <jxb:javaType name="java.util.Date" xmlType="xs:dateTime" printMethod="org.eclipse.smila.connectivity.framework.schema.tools.SimpleDateFormatter.print" parseMethod="org.eclipse.smila.connectivity.framework.schema.tools.SimpleDateFormatter.parse"/>
      <jxb:javaType name="org.eclipse.smila.connectivity.framework.schema.config.MimeTypeAttributeType" xmlType="MimeTypeAttributeType" parseMethod="org.eclipse.smila.connectivity.framework.schema.config.MimeTypeAttributeType.fromValue" printMethod="org.eclipse.smila.connectivity.framework.schema.config.MimeTypeAttributeType.toValue"/>
      <jxb:serializable uid="1"/>
    </jxb:globalBindings>
  </jxb:bindings>
</jxb:bindings>


  • Add a schema location reference in the plug-in implementation
    • Create a new class (DataSourceConnectionConfigPluginImpl) which implements the interface DataSourceConnectionConfigPlugin.
    • Use the method String getSchemaLocation() to return "schemas/MockCrawlerSchema.xsd".
    • Use the method String getMessagesPackage() to return the package name "mypackage.crawler.mock.messages".
Here is an example implementation for the MockCrawler you can use as a template:
package mypackage.crawler.mock;
 
import org.eclipse.smila.connectivity.framework.schema.DataSourceConnectionConfigPlugin;
 
/**
 * The Class DataSourceConnectionConfigPluginImpl.
 */
public class DataSourceConnectionConfigPluginImpl implements DataSourceConnectionConfigPlugin {
 
  /**
   * {@inheritDoc}
   * 
   * @see org.eclipse.smila.connectivity.framework.schema.DataSourceConnectionConfigPlugin#getSchemaLocation()
   */
  public String getSchemaLocation() {
    return "schemas/MockCrawlerSchema.xsd";
  }
 
  /**
   * {@inheritDoc}
   * 
   * @see org.eclipse.smila.connectivity.framework.schema.DataSourceConnectionConfigPlugin#getMessagesPackage()
   */
  public String getMessagesPackage() {
    return "mypackage.crawler.mock.messages";
  }
 
}
  • Create a new file plugin.xml.
    • Define the extension for org.eclipse.smila.connectivity.framework.schema.extension, using the bundle name as ID and NAME.
    • Set the schema class to your implementation of the interface DataSourceConnectionConfigPlugin.
    • Here is an example of the MockCrawler plugin.xml file that you can use as a template:
<plugin>
   <extension
         id="myplugin.crawler.mock"
         name="myplugin.crawler.mock"
         point="org.eclipse.smila.connectivity.framework.schema.extension">
      <schema
            class="mypackage.crawler.mock.DataSourceConnectionConfigPluginImpl">
      </schema>
   </extension>
</plugin>


  • Compile the schema into JAXB classes by using ant.
    • See SMILA/Development Guidelines/Setup for JAXB code generation for instructions on how to set up the JAXB generation tools. It is advisable to keep the lib folder outside the workspace, for example in a lower level folder. (my -Dlib.dir=../../
    • Create a new file build.xml to contain the JAXB build information. Use the following template as the content of build.xml and change the property value accordingly:
<project name="sub-build" default="compile-schema-and-decorate" basedir=".">
 
  <property name="schema.name"  value="MockCrawlerSchema" />
 
  <import file="../SMILA.builder/xjc/build.xml" />
 
</project>
    • Launch ant -Dlib.dir=../lib from a command console to generate the Java files or to see any error messages.


Note: If you rename the schema file, make sure to update the following locations:

  • Plug-in implementation classes
  • MockCrawlerSchema.jxb (it should also be renamed to match the schema file name)
  • build.xml

OSGi and Declarative Service requirements

  • It is not required to implement a BundleActivator.
  • Create the top level folder OSGI-INF.
  • Create a component description file in OSGI-INF. You can name the file as you like, but it is good practice to name it after the crawler. In this file, provide a unique component name, which should be the same as the crawler's class name. Then provide your implementation class and the service interface class, which is always org.eclipse.smila.connectivity.framework.Crawler. Here is an example of the MockCrawler component description file that you can use as a template:
<component name="MockCrawler" immediate="false" factory="CrawlerFactory">
    <implementation class="mypackage.crawler.mock.MockCrawler" />
    <service>
         <provide interface="org.eclipse.smila.connectivity.framework.Crawler"/>
    </service>    
</component>
  • Add a Service-Component entry to your manifest file, e.g.:
Service-Component: OSGI-INF/mockcrawler.xml
  • Open build.properties and change the binary build: Add the folders OSGI-INF and schemas as well as the file plugin.xml.
bin.includes = META-INF/,\
               .,\
               plugin.xml,\
               schemas/,\
               OSGI-INF/
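
A forgotten bin.includes entry typically only shows up when the built bundle is missing plugin.xml or the schema at runtime. As a minimal sketch, the entries can be checked with the JDK's java.util.Properties, which understands the backslash line continuations used in build.properties. The class name BuildPropertiesCheck is hypothetical and not part of SMILA or PDE.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper (not part of SMILA): reports entries missing from bin.includes. */
final class BuildPropertiesCheck {

  /** The entries this page asks you to add to the binary build. */
  static final List<String> REQUIRED_INCLUDES =
      Arrays.asList("plugin.xml", "schemas/", "OSGI-INF/");

  /** Parses build.properties text and returns the required entries it does not list. */
  static List<String> missingIncludes(String buildPropertiesText) throws IOException {
    Properties props = new Properties();
    // Properties.load joins lines ending in a backslash and drops the
    // leading whitespace of each continuation line.
    props.load(new StringReader(buildPropertiesText));
    String value = props.getProperty("bin.includes", "");

    List<String> declared = new ArrayList<>();
    for (String entry : value.split(",")) {
      declared.add(entry.trim());
    }
    List<String> missing = new ArrayList<>(REQUIRED_INCLUDES);
    missing.removeAll(declared);
    return missing;
  }

  public static void main(String[] args) throws IOException {
    // Sample that still lacks the schemas and OSGI-INF folders.
    String text = "bin.includes = META-INF/,\\\n"
        + "               .,\\\n"
        + "               plugin.xml\n";
    System.out.println("Missing from bin.includes: " + missingIncludes(text));
  }
}
```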


Implement your crawler

  • Implement your crawler in a new class extending org.eclipse.smila.connectivity.framework.AbstractCrawler.
  • Follow the example of FileSystemCrawler

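[optional]

  • Create a JUnit test bundle for this crawler, e.g. myplugin.crawler.mock.test.
  • Integrate your test bundle into the build process: refer to the page How to integrate test bundle into build process for further instructions.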

Activate your crawler

Activation when running SMILA in Eclipse

  • Open the Run dialog, switch to the configuration page of Bundles, select your bundle and set the parameter Default Auto-Start to true.
  • Launch SMILA.launch.

Activation in the SMILA application

  • Add your bundle, e.g. myplugin.crawler.mock@4:start, to the config.ini file.
  • Launch SMILA by calling either SMILA.exe or eclipse.exe -console
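
The entry myplugin.crawler.mock@4:start combines three pieces of information: the bundle location, the start level (4) and the auto-start flag. The following minimal sketch splits such an entry, assuming it follows the usual Equinox osgi.bundles syntax of location[@[start-level][:start]]. The class name BundleEntry is hypothetical, and the parsing is simplified: list separators such as a trailing comma or backslash must be removed beforehand.

```java
/** Hypothetical helper (not part of SMILA or Equinox): one parsed bundle entry from config.ini. */
final class BundleEntry {

  final String location;
  /** Start level, or -1 when the entry does not specify one. */
  final int startLevel;
  final boolean autoStart;

  private BundleEntry(String location, int startLevel, boolean autoStart) {
    this.location = location;
    this.startLevel = startLevel;
    this.autoStart = autoStart;
  }

  /** Parses entries such as "some.bundle", "some.bundle@start" or "some.bundle@4:start". */
  static BundleEntry parse(String entry) {
    String text = entry.trim();
    int at = text.lastIndexOf('@');
    if (at < 0) {
      return new BundleEntry(text, -1, false); // no options given
    }
    String location = text.substring(0, at);
    String options = text.substring(at + 1);
    boolean autoStart = false;
    if (options.equals("start")) {
      autoStart = true;
      options = "";
    } else if (options.endsWith(":start")) {
      autoStart = true;
      options = options.substring(0, options.length() - ":start".length());
    }
    // A non-numeric start level raises NumberFormatException.
    int startLevel = options.isEmpty() ? -1 : Integer.parseInt(options);
    return new BundleEntry(location, startLevel, autoStart);
  }

  public static void main(String[] args) {
    BundleEntry entry = parse("myplugin.crawler.mock@4:start");
    System.out.println(entry.location + " level=" + entry.startLevel
        + " autoStart=" + entry.autoStart);
  }
}
```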

Run your crawler

Information on how to start and run a crawler can be found in the CrawlerController documentation.