
Difference between revisions of "SMILA/Documentation/HowTo/How to implement a crawler"

(General review on content and style)

Revision as of 11:50, 6 October 2008

This page explains how to implement a crawler and add its functionality to SMILA.

Prepare bundle and manifest

  • Create a new bundle that will contain your crawler. Follow the instructions on How to create a bundle and name the project org.eclipse.smila.connectivity.framework.crawler.
  • Edit the manifest file and add the following packages to the Import-Package section.
    • org.eclipse.smila.connectivity.framework
    • org.eclipse.smila.connectivity.framework.utils
    • org.eclipse.smila.datamodel.record
    • org.osoa.sca.annotations
    • com.sun.xml.bind.v2;version="2.1.6"
    • javax.xml.bind;version="2.1.0"
    • javax.xml.bind.annotation;version="2.1.0"
    • javax.xml.bind.annotation.adapters;version="2.1.0"
    • javax.xml.stream;version="1.0.0"
    • org.apache.commons.logging;version="1.1.1"
  • Edit the manifest file and add the following bundle to the Require-Bundle section. (Note that this bundle must be referenced using Require-Bundle, not Import-Package.)
    • org.eclipse.smila.connectivity.framework.indexorder
  • The manifest should now look like this (this example additionally imports org.apache.commons.io):
Require-Bundle: org.eclipse.smila.connectivity.framework.indexorder
Import-Package: com.sun.xml.bind.v2;version="2.1.6",
 javax.xml.bind;version="2.1.0",
 javax.xml.bind.annotation;version="2.1.0",
 javax.xml.bind.annotation.adapters;version="2.1.0",
 javax.xml.stream;version="1.0.0",
 org.apache.commons.io;version="1.4.0",
 org.apache.commons.logging;version="1.1.1",
 org.eclipse.smila.connectivity.framework,
 org.eclipse.smila.connectivity.framework.utils,
 org.eclipse.smila.datamodel.record,
 org.osoa.sca.annotations
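
The manifest requirements above can be checked mechanically. The following is a minimal sketch, not part of SMILA: the class ManifestCheck and its method names are hypothetical, and it only uses java.util.jar.Manifest from the JDK. It assumes the three SMILA packages from the list above are the ones you want to verify.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.jar.Manifest;

/** Hypothetical helper (not a SMILA class): checks a bundle manifest for the headers described above. */
final class ManifestCheck {

  /** SMILA packages taken from the Import-Package list above. */
  static final List<String> REQUIRED_IMPORTS = Arrays.asList(
      "org.eclipse.smila.connectivity.framework",
      "org.eclipse.smila.connectivity.framework.utils",
      "org.eclipse.smila.datamodel.record");

  static final String INDEXORDER_BUNDLE = "org.eclipse.smila.connectivity.framework.indexorder";

  /** Returns the required packages that the Import-Package header does not list. */
  static List<String> missingImports(String manifestText) {
    List<String> missing = new ArrayList<>(REQUIRED_IMPORTS);
    missing.removeAll(names(header(manifestText, "Import-Package")));
    return missing;
  }

  /** Returns true if the index order bundle is referenced via Require-Bundle. */
  static boolean requiresIndexOrderBundle(String manifestText) {
    return names(header(manifestText, "Require-Bundle")).contains(INDEXORDER_BUNDLE);
  }

  /** Reads one main-section header; continuation lines are joined by the JDK parser. */
  private static String header(String manifestText, String name) {
    try {
      Manifest manifest = new Manifest(
          new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
      return manifest.getMainAttributes().getValue(name);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /**
   * Splits a header into clause names and drops attributes such as ;version="2.1.0".
   * Simplification: version ranges containing commas are not handled.
   */
  private static List<String> names(String headerValue) {
    List<String> result = new ArrayList<>();
    if (headerValue != null) {
      for (String clause : headerValue.split(",")) {
        result.add(clause.split(";")[0].trim());
      }
    }
    return result;
  }
}
```

Pass the content of META-INF/MANIFEST.MF as a string (ending with a line break); missingImports then lists the packages that still need to be added.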

Prepare Indexorder schema and classes

  • Add the code/gen folder to the source folders (build.properties : source.. = code/src/,code/gen/):
    • Right-click your bundle and click New > Source Folder.
    • Enter "code/gen" as the folder name.
  • Copy the content of the folder schema-compile-runtime\schema-pattern\ into your crawler bundle folder.
  • Compile the schema into JAXB classes by running schema.cmd. The schema-compile-runtime folder must be located at the projects level because it is used for compilation.
    • Launch the file schema.cmd from a cmd console to see the result or error messages.
  • Implement the XSD schema for the crawler configuration using the template schemas\TemplateIndexOrder.xsd.
    • The index order configuration is based on an XSD redefinition of the abstract "RootIndexOrderConfiguration" schema.
    • Redefine the "Process" and "Attribute" nodes to hold the crawler-specific information.
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:redefine schemaLocation="RootIndexOrderConfiguration.xsd">
    <xs:complexType name="Process">
      <xs:annotation>
        <xs:documentation>Process Specification</xs:documentation>
      </xs:annotation>
      <xs:complexContent>
        <xs:extension base="Process">
          <!-- define process here -->
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
    <xs:complexType name="Attribute">
      <xs:complexContent>
        <xs:extension base="Attribute">
          <!-- define attribute here -->
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>
  • Add a schema location reference in the plug-in implementation.
    • Create a new class (IndexOrderSchemaImpl) which implements the interface IndexOrderSchema.
    • Implement the method String getSchemaLocation() so that it returns the path of your schema file, which is "schemas/TemplateIndexOrder.xsd" if you kept the template name. The following example from the file system crawler returns a renamed schema file:
package org.eclipse.smila.connectivity.framework.crawler.filesystem;
 
import org.eclipse.smila.connectivity.framework.indexorder.IndexOrderSchema;
 
/**
 * The Class IndexOrderSchemaImpl.
 */
public class IndexOrderSchemaImpl implements IndexOrderSchema {
 
  /**
   * {@inheritDoc}
   * 
   * @see org.eclipse.smila.connectivity.framework.indexorder.IndexOrderSchema#getSchemaLocation()
   */
  public String getSchemaLocation() {
    return "schemas/filesystemIndexOrder.xsd";
  }
}
  • Implement the extension for org.eclipse.smila.connectivity.framework.indexorder.schema with the bundle name used as ID and NAME.
  • Check the schema class referenced in the file plugin.xml and change it if necessary.
<plugin>
   <extension
         id="org.eclipse.smila.connectivity.framework.crawler.filesystem"
         name="org.eclipse.smila.connectivity.framework.crawler.filesystem"
         point="org.eclipse.smila.connectivity.framework.indexorder.schema">
      <schema
            class="org.eclipse.smila.connectivity.framework.crawler.filesystem.IndexOrderSchemaImpl">
      </schema>
   </extension>
</plugin>
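
To double-check which schema class a plugin.xml registers, the file can be parsed with the JDK's DOM parser. This is a minimal sketch under stated assumptions: PluginXmlCheck is a hypothetical helper and not part of SMILA, and it only looks at the extension point named in the steps above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not a SMILA class): reads the schema class from a plugin.xml. */
final class PluginXmlCheck {

  /** Extension point named in the steps above. */
  static final String EXTENSION_POINT =
      "org.eclipse.smila.connectivity.framework.indexorder.schema";

  /** Returns the class attribute of the first schema element for the extension point, or null. */
  static String schemaClass(String pluginXml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(pluginXml)));
      NodeList extensions = doc.getElementsByTagName("extension");
      for (int i = 0; i < extensions.getLength(); i++) {
        Element extension = (Element) extensions.item(i);
        if (!EXTENSION_POINT.equals(extension.getAttribute("point"))) {
          continue; // some other extension, not the index order schema
        }
        NodeList schemas = extension.getElementsByTagName("schema");
        if (schemas.getLength() > 0) {
          return ((Element) schemas.item(0)).getAttribute("class");
        }
      }
      return null;
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot parse plugin.xml", e);
    }
  }
}
```

The returned name should match the fully qualified name of your IndexOrderSchemaImpl class; a null result means the extension is missing from the file.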

Note: If you rename the schema file, make sure to update the following locations:

  • Plug-in implementation classes
  • TemplateIndexOrder.jxb (rename this file as well so that it matches the schema file name)
  • schema.cmd

OSGi and Declarative Service requirements

  • It is not required to implement a BundleActivator. (This may change if SCA Nodes are used; then it may be required to register the crawler.)
  • Create the top level folder OSGI-INF.
  • Create a Component Description file in OSGI-INF. You can name the file as you like, but it is good practice to name it after the crawler. In this file you have to provide a unique component name, which should be the crawler's class name followed by DS (for Declarative Service). You also have to provide your implementation class and the service interface class, which is always org.eclipse.smila.connectivity.framework.Crawler. Here is an example for the FileSystemCrawler:
<?xml version="1.0" encoding="UTF-8"?>
<component name="FileSystemCrawlerDS" immediate="true">
    <implementation class="org.eclipse.smila.connectivity.framework.crawler.filesystem.FileSystemCrawler" />
    <service>
         <provide interface="org.eclipse.smila.connectivity.framework.Crawler"/>
    </service>
    <property name="org.eclipse.smila.connectivity.framework.crawler.type" value="filesystemcrawler"/>
</component>
  • Add a Service-Component entry to your manifest file, e.g.:
Service-Component: OSGI-INF/filesystemcrawler.xml
  • Open build.properties and change the binary build: Add the folders OSGI-INF and schemas as well as the file plugin.xml.
source.. = code/src/,\
           code/gen/
output.. = code/bin/
bin.includes = META-INF/,\
               .,\
               plugin.xml,\
               schemas/,\
               OSGI-INF/
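
A forgotten entry in the binary build only shows up once the bundle is deployed, so it can help to check build.properties up front. The following is a minimal sketch using java.util.Properties; BuildPropertiesCheck is a hypothetical helper, not part of SMILA, and the list of required entries is simply taken from the step above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper (not a SMILA class): checks bin.includes in build.properties. */
final class BuildPropertiesCheck {

  /** Entries the step above asks you to add to the binary build. */
  static final List<String> REQUIRED = Arrays.asList("plugin.xml", "schemas/", "OSGI-INF/");

  /** Returns the required entries that bin.includes does not contain. */
  static List<String> missingFromBinaryBuild(String buildProperties) {
    Properties props = new Properties();
    try {
      // Properties.load joins lines ending in a backslash and skips
      // the leading whitespace of each continuation line.
      props.load(new StringReader(buildProperties));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    List<String> included = new ArrayList<>();
    for (String entry : props.getProperty("bin.includes", "").split(",")) {
      included.add(entry.trim());
    }
    List<String> missing = new ArrayList<>(REQUIRED);
    missing.removeAll(included);
    return missing;
  }
}
```

For the build.properties shown above, all three entries are present, so the returned list is empty.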

SCA requirements

Note: SCA is currently not supported by SMILA! This section can be skipped.

Most requirements for SCA are already handled by the base class org.eclipse.smila.connectivity.framework.AbstractCrawler. You should annotate your implementation with @AllowsPassByReference to allow SCA to pass service parameters by reference when service interactions take place within the same address space. Here is an example for the FileSystemCrawler:

@AllowsPassByReference
public class FileSystemCrawler extends AbstractCrawler {
// implementation goes here ...
}

For SCA, make sure to set the attribute immediate of the element <component> to "false" and the attribute servicefactory of the element <service> to "true" in your XML file. This is required to let SCA dynamically create ServiceReferences. Finally, you have to provide a value for the property org.eclipse.smila.connectivity.framework.crawler.type. This is used by SCA to find the correct OSGi Declarative Service during runtime. The value has to be unique (it is used in SCA contribution files) and should be named like the crawler. Here is an example for the FileSystemCrawler:

<?xml version="1.0" encoding="UTF-8"?>
<component name="FileSystemCrawlerDS" immediate="false">
    <implementation class="org.eclipse.smila.connectivity.framework.crawler.filesystem.FileSystemCrawler" />
    <service servicefactory="true">
         <provide interface="org.eclipse.smila.connectivity.framework.Crawler"/>
    </service>
    <property name="org.eclipse.smila.connectivity.framework.crawler.type" value="filesystemcrawler"/>
</component>
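
The component descriptions on this page follow the same conventions: the component name is the crawler's class name followed by DS, the provided interface is always org.eclipse.smila.connectivity.framework.Crawler, and the crawler type property has a value. As a minimal sketch, these conventions could be checked as follows. ComponentDescriptionCheck is a hypothetical helper and not part of SMILA; it does not validate the file against the Declarative Services schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not a SMILA class): checks the naming conventions described above. */
final class ComponentDescriptionCheck {

  static final String CRAWLER_INTERFACE = "org.eclipse.smila.connectivity.framework.Crawler";
  static final String TYPE_PROPERTY = "org.eclipse.smila.connectivity.framework.crawler.type";

  /** Returns a list of convention violations; an empty list means none were found. */
  static List<String> problems(String componentXml) {
    Element component;
    try {
      component = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(componentXml))).getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot parse component description", e);
    }
    List<String> problems = new ArrayList<>();

    // Component name should be the simple class name followed by "DS".
    String implClass = firstAttribute(component, "implementation", "class");
    String simpleName = implClass.substring(implClass.lastIndexOf('.') + 1);
    if (!component.getAttribute("name").equals(simpleName + "DS")) {
      problems.add("component name should be " + simpleName + "DS");
    }

    // The service interface is always the Crawler interface.
    if (!CRAWLER_INTERFACE.equals(firstAttribute(component, "provide", "interface"))) {
      problems.add("service must provide " + CRAWLER_INTERFACE);
    }

    // The crawler type property needs a non-empty value.
    boolean hasType = false;
    NodeList properties = component.getElementsByTagName("property");
    for (int i = 0; i < properties.getLength(); i++) {
      Element property = (Element) properties.item(i);
      if (TYPE_PROPERTY.equals(property.getAttribute("name"))
          && !property.getAttribute("value").isEmpty()) {
        hasType = true;
      }
    }
    if (!hasType) {
      problems.add("property " + TYPE_PROPERTY + " is missing or empty");
    }
    return problems;
  }

  /** Returns the attribute of the first descendant with the given tag, or "" if absent. */
  private static String firstAttribute(Element parent, String tag, String attribute) {
    NodeList nodes = parent.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? "" : ((Element) nodes.item(0)).getAttribute(attribute);
  }
}
```

Uniqueness of the component name and of the crawler type value across bundles cannot be checked from a single file and still has to be verified by hand.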

Develop your crawler

  • Implement your crawler in a new class extending org.eclipse.smila.connectivity.framework.AbstractCrawler.
  • Create a JUnit test bundle for this crawler, e.g. org.eclipse.smila.connectivity.framework.crawler.filesystem.test.
  • Integrate your new crawler bundle into the build process: Refer to the page How to integrate new bundle into build process for further instructions.
  • Integrate your test bundle into the build process: Refer to the page How to integrate test bundle into build process for further instructions.


Run your crawler

Running SMILA in Eclipse

  • Open the Run dialog, switch to the configuration page of your bundle, set the parameter Default Start level to 4, and the parameter Default Auto-Start to true.
  • Launch SMILA.launch.

Running SMILA application

  • Add the entry org.eclipse.smila.connectivity.framework.crawler.filesystem@4:start, \ to the config.ini file.
  • Launch SMILA by calling either eclipse.exe -console or launch.cmd.
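
The config.ini entry above combines the bundle name, the start level (4) and the auto-start flag (start). The following minimal sketch splits such an entry into its parts to make the format explicit. BundleEntry is a hypothetical class, not part of SMILA or Equinox, and it covers only the name@level:start form used on this page.

```java
/** Hypothetical value class (not part of SMILA or Equinox): one bundle entry from config.ini. */
final class BundleEntry {

  final String name;
  /** Start level, or -1 if the entry does not specify one. */
  final int startLevel;
  final boolean autoStart;

  private BundleEntry(String name, int startLevel, boolean autoStart) {
    this.name = name;
    this.startLevel = startLevel;
    this.autoStart = autoStart;
  }

  /** Parses an entry such as "some.bundle@4:start, \". */
  static BundleEntry parse(String raw) {
    String entry = raw.trim();
    // Drop the trailing list separator and line-continuation backslash.
    while (entry.endsWith("\\") || entry.endsWith(",")) {
      entry = entry.substring(0, entry.length() - 1).trim();
    }
    int at = entry.indexOf('@');
    if (at < 0) {
      return new BundleEntry(entry, -1, false);
    }
    int level = -1;
    boolean start = false;
    // Options after '@' are separated by ':', e.g. "4:start".
    for (String part : entry.substring(at + 1).split(":")) {
      if (part.equals("start")) {
        start = true;
      } else if (!part.isEmpty()) {
        level = Integer.parseInt(part);
      }
    }
    return new BundleEntry(entry.substring(0, at), level, start);
  }
}
```

Parsing the entry from the step above yields the bundle name org.eclipse.smila.connectivity.framework.crawler.filesystem, start level 4, and auto-start enabled, which matches the values set in the Run dialog when running SMILA in Eclipse.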