Difference between revisions of "SMILA/5 Minutes Tutorial"

Revision as of 09:06, 2 September 2011


This page contains installation instructions for the SMILA application which will help you take your first steps with SMILA.

Download and unpack SMILA

Download the SMILA package and unpack it to an arbitrary folder. This will result in the following folder structure:

/<SMILA>
  /about_files
  /configuration
  /features
  /jmxclient
  /plugins
  /workspace
  .eclipseproduct
  ...
  SMILA
  SMILA.ini

Check the preconditions

To be able to follow the steps below, check the following preconditions:

  • You will have to provide a JRE executable to be able to run SMILA. The JVM version should be at least Java 5.
    Either:
    • add the path of your local JRE executable to the PATH environment variable
      or
    • add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini.
      Make sure that -vm is indeed the first argument in the file and that there is a line break after it. It should look similar to the following:

-vm
d:/java/jre6/bin/java
...

  • Since we are going to use JConsole as the JMX client later in this tutorial, it is recommended to install and use a Java SE Development Kit (JDK) and not just a Java SE Runtime Environment (JRE) because the latter does not include this application.
  • You need a REST/HTTP client to access the SMILA REST API, e.g. the "RESTClient" or "Poster" add-on for the Firefox browser.
  • When using the Linux distributable of SMILA, make sure that the files SMILA and jmxclient/run.sh have executable permissions. If not, set the permission by running the following commands in a console:

chmod +x ./SMILA
chmod +x ./jmxclient/run.sh
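
The -vm requirement from the list above can be checked mechanically. The following Python sketch is only an illustration: the helper name vm_argument_ok is hypothetical, and it assumes that SMILA.ini holds one argument per line, as in the example shown.

```python
def vm_argument_ok(ini_text: str) -> bool:
    """Return True if -vm is the first argument and a value line follows it."""
    # Ignore blank lines; every remaining line is treated as one argument.
    lines = [line.strip() for line in ini_text.splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    # -vm must come first, and the next line must be a value, not another option.
    return lines[0] == "-vm" and not lines[1].startswith("-")
```

To use it, read the contents of SMILA.ini into a string and pass that string to the function. It only checks the position of -vm, not whether the path points to a real JRE.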

Start SMILA

To start the SMILA engine, simply double-click the SMILA executable. Alternatively, open a command line, navigate to the directory where you extracted the files to, and call the SMILA executable. Wait until the engine has been fully started. If everything works fine, you should see output similar to that on the following screenshot:

Smila-console-0.9.0.png

Check the log file

Open the SMILA log file in an editor of your choice to find out what is happening in the background. This file is named SMILA.log and can be found in the same directory as the SMILA executable:

/<SMILA>
  /about_files
  /configuration
  /features
  /jmxclient
  /plugins
  /workspace
  .eclipseproduct
  ...
  SMILA
  SMILA.ini
  -> SMILA.log <-

You should see no stack traces in the log ;) and it should end with an entry like this:

INFO  ... internal.HttpServiceImpl    - HTTP server started successfully on port 8080.
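
If you prefer to check the log with a script rather than in an editor, a minimal Python sketch could look like this. The function name check_startup_log is hypothetical, and the stack trace detection is a simple heuristic (indented lines starting with "at "), not something defined by SMILA.

```python
import re

# Marker taken from the log entry shown above; the port is read from the line.
STARTED = re.compile(r"HTTP server started successfully on port (\d+)")
# Heuristic for Java stack trace frames: an indented line beginning with "at ".
STACK_FRAME = re.compile(r"^\s+at \S+\(.*\)\s*$")


def check_startup_log(log_text: str) -> dict:
    """Summarise whether the log shows a clean start."""
    lines = log_text.splitlines()
    ports = [int(m.group(1)) for line in lines if (m := STARTED.search(line))]
    frames = sum(1 for line in lines if STACK_FRAME.match(line))
    # "port" is None if the start message was not found at all.
    return {"port": ports[-1] if ports else None, "stack_frames": frames}
```

A clean start should yield the port number and a stack frame count of zero.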

Prepare indexing job

We define an indexing job based on the predefined asynchronous "indexUpdate" workflow (see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json). This indexing job will process the imported data.

For more information about job management please check the JobManager documentation.

The "indexUpdate" workflow contains a PipelineProcessingWorker worker which executes the synchronous "AddPipeline" BPEL workflow. So, the synchronous "AddPipeline" BPEL workflow is embedded in the asynchronous "indexUpdate" workflow:

...
    {
      "name": "indexUpdate",
      ...
      "actions":
      [
          {
              "worker": "pipelineProcessingWorker",
              "parameters": 
              {
                  "pipelineName": "AddPipeline"
              },
      ...
    }
...

Use your favourite REST Client to create a job definition based on the "indexUpdate" workflow:

POST http://localhost:8080/smila/jobmanager/jobs/
  {
    "name":"indexUpdateJob",
    "parameters":{      
      "tempStore": "temp"
     },
    "workflow":"indexUpdate"
  }
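
If you would rather script this call than use a browser add-on, the same request can be built with Python's standard library. This is a sketch under assumptions: the helper name create_job_request is hypothetical, and setting a JSON Content-Type header is an assumption that the tutorial itself does not state.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/smila"  # default address used in this tutorial


def create_job_request(name, workflow, parameters, base_url=BASE_URL):
    """Build (but do not send) the POST request that defines a job."""
    body = {"name": name, "parameters": parameters, "workflow": workflow}
    return urllib.request.Request(
        url=f"{base_url}/jobmanager/jobs/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="POST",
    )


request = create_job_request("indexUpdateJob", "indexUpdate", {"tempStore": "temp"})
# To send it against a running SMILA instance:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the server.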

Afterwards, start a job run for the defined job:

POST http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob

Your REST client will show a result like this:

{
  "jobId" : "20110901-121343613053",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob/20110901-121343613053/"
}

You will need the "jobId" later on to finish the job run. The job runs can also be found via the monitoring API for the job:

http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob

In the SMILA.log file you will see a message like this:

INFO ... internal.JobManagerImpl    - started job run '20110901-121343613053' for job 'indexUpdateJob'
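
Starting the job run and keeping the returned "jobId" can also be scripted. The Python sketch below uses hypothetical helper names and assumes that an empty request body is accepted for the start call, since the tutorial shows none.

```python
import json
import urllib.request


def start_job_run_request(job_name, base_url="http://localhost:8080/smila"):
    """Build the body-less POST request that starts a run of an existing job."""
    return urllib.request.Request(
        url=f"{base_url}/jobmanager/jobs/{job_name}", data=b"", method="POST"
    )


def parse_job_run(response_text):
    """Extract the job run id and its monitoring URL from the JSON reply."""
    reply = json.loads(response_text)
    return reply["jobId"], reply["url"]
```

Store the returned job run id somewhere convenient, because it is needed again when the job run is finished.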

Configure the File System crawler

Prepare a local folder on your system whose contents we are going to index in the following steps. Add some text and HTML files to it. The result could look similar to the following:

/home
  /johndoe
    /mydata
      myfile.txt
      someothertxtfile.txt
      myfile.html
      someotherhtmlfile.html

Note: Currently, only plain text and HTML files can be crawled and indexed properly.

Open the configuration file at configuration/org.eclipse.smila.connectivity.framework/file.xml and adapt the BaseDir element to point to this folder. Make sure to set an absolute path:

 <Process>
  <BaseDir>/home/johndoe/mydata</BaseDir>
  ...      
 </Process>
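
The edit can also be scripted. The following Python sketch sets every BaseDir element it finds and rejects relative paths. The function names are hypothetical, and because the full structure of file.xml is not shown here, the sketch matches elements by local name only. Note that ElementTree drops XML comments and may rewrite namespace prefixes when it serialises the result.

```python
import pathlib
import xml.etree.ElementTree as ET


def is_absolute(path: str) -> bool:
    """Accept both POSIX (/home/...) and Windows (c:\\...) absolute paths."""
    return (pathlib.PurePosixPath(path).is_absolute()
            or pathlib.PureWindowsPath(path).is_absolute())


def set_base_dir(config_xml: str, folder: str) -> str:
    """Return the configuration with every BaseDir element set to folder."""
    if not is_absolute(folder):
        raise ValueError(f"BaseDir must be an absolute path: {folder!r}")
    root = ET.fromstring(config_xml)
    for element in root.iter():
        # Compare local names so the sketch works with or without a namespace.
        if element.tag.rsplit("}", 1)[-1] == "BaseDir":
            element.text = folder
    return ET.tostring(root, encoding="unicode")
```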

Manage CrawlerController via JConsole

The next step is to start a File System crawler run and let SMILA index the configured folder. Crawler runs are managed via the JMX protocol, so you can connect to SMILA using any JMX client you like. We are going to use JConsole in the following because it is included in the Java SE Development Kit.

Start the JConsole executable in your JDK distribution (<JAVA_HOME>/bin/jconsole). Once the client is up and running, connect to localhost:9004.

Jconsole.png-0.8.0.png

Next, switch to the MBeans tab, expand the SMILA node in the MBeans tree on the left-hand side, and click the CrawlerController node. This node is used to manage and monitor all crawling activities.

Mbeans-overview-0.8.0.png


Start the File System crawler

To start the File System crawler, select SMILA > CrawlerController > Operations on the left-hand side, enter "file" into the first text field next to the startCrawlerTask button and "indexUpdateJob" into the second text field. Then click the button:

Start-file-crawl-0.9.0.png

You should receive a message similar to the following, indicating that the crawler has been successfully started:

Start-crawl-file-result-0.9.0.png

The following entries will appear in the SMILA.log file:

...
INFO  [Thread-21  ]  filesystem.FileSystemCrawler        - Initializing FileSystemCrawler...
...
INFO  [Thread-21  ]  filesystem.FileSystemCrawler        - Closing FileSystemCrawler...
...

Error handling:

  • If you specified a non-existing data source in the JConsole, you will see an appropriate error dialog message.
  • If you specified a non-existing or non-started job name, JConsole will show you a success dialog because this error can only be discovered when the first imported data is pushed to the job. So you will see error messages in the log file similar to:
 ...
WARN   bulkbuilder.ConnectivityManagerImpl  - Error while adding record with id 'file:<Path=c:\data\epl-v10.html>' to bulkbuilder 
       org.eclipse.smila.bulkbuilder.InvalidJobException: Error getting initial task for job 'noJob'.
         ...
       Caused by: org.eclipse.smila.jobmanager.IllegalJobStateException: Job with name 'noJob' is not running or is already finishing.
ERROR  bulkbuilder.ConnectivityManagerImpl  - Error while adding 2 records to bulkbuilder
       org.eclipse.smila.bulkbuilder.InvalidJobException: No job run info for job 'noJob', job not defined or not active.
...
  • If the File System crawler cannot find the folder to index, the log file also shows an error:
...
INFO  [Thread-24  ]  filesystem.FileSystemCrawler        - Initializing FileSystemCrawler...
WARN  [Thread-24  ]  performancecounters.CrawlerControllerPerformanceCounterHelper - Agent location [Crawlers/FileSystem/file - 1491048155] is not found
WARN  [Thread-24  ]  performancecounters.CrawlerControllerPerformanceCounterHelper - Instance agent agent is null
ERROR [Thread-24  ]  impl.CrawlThread                    - 
org.eclipse.smila.connectivity.framework.CrawlerCriticalException: Folder "/home/doe01/doesnotexist" is not found
  at org.eclipse.smila.connectivity.framework.crawler.filesystem.FileSystemCrawler.checkFolders(FileSystemCrawler.java:347)
  at org.eclipse.smila.connectivity.framework.crawler.filesystem.FileSystemCrawler.initialize(FileSystemCrawler.java:176)
  at org.eclipse.smila.connectivity.framework.impl.CrawlThread.run(CrawlThread.java:214)
INFO  [Thread-24  ]  filesystem.FileSystemCrawler        - Closing FileSystemCrawler...
...

The error message above states that the crawler tried to index a folder at /home/doe01/doesnotexist but was not able to find it. To solve this, provide data at the mentioned folder or adapt the configuration of the File System crawler accordingly.
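
To spot this problem without reading the whole log, the folder name can be pulled out of the exception message. This is a small Python sketch with a hypothetical function name. It relies only on the message text shown in the excerpt above.

```python
import re

# Pattern for the message text shown in the log excerpt above.
MISSING_FOLDER = re.compile(r'CrawlerCriticalException: Folder "([^"]+)" is not found')


def missing_folders(log_text: str) -> list:
    """Return the folders the File System crawler reported as not found."""
    return MISSING_FOLDER.findall(log_text)
```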

Manage CrawlerController using the JMX Client

Instead of managing the crawler runs using JConsole it is also possible to use the JMX Client from the SMILA distribution for the same purpose. The JMX Client is a console application that allows managing crawler runs and creating scripts intended for batch crawler execution. It can be found in the jmxclient directory of the SMILA distribution. Use the appropriate run script for your platform (i.e. run.bat or run.sh) to start the application. For example, to start the File System crawler - as described with JConsole above - use the following command:

 run crawl file indexUpdateJob

For more information please check the JMX Client documentation.
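
As an illustration of batch execution, the following Python sketch builds one JMX Client command per data source. The helper name crawl_command is hypothetical. The command form follows the "run crawl file indexUpdateJob" example above, and using "web" as a second data source is an assumption based on the data source IDs in this tutorial.

```python
import platform


def crawl_command(data_source: str, job_name: str, windows: bool) -> list:
    """Build the JMX Client command line for one crawler run."""
    script = "run.bat" if windows else "./run.sh"
    return [script, "crawl", data_source, job_name]


# One command per data source, all feeding the same job.
commands = [
    crawl_command(source, "indexUpdateJob", platform.system() == "Windows")
    for source in ("file", "web")
]
# To execute them from the jmxclient directory:
#   import subprocess
#   for command in commands:
#       subprocess.run(command, check=True)
```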

Search the index

To search the index which was created by the crawlers, point your browser to http://localhost:8080/SMILA/search. There are currently two stylesheets from which you can select by clicking the respective links in the upper left corner of the header bar: The Default stylesheet shows a reduced search form with text fields like Query, Result Size, and Index Name, adequate to query the full-text content of the indexed documents. The Advanced stylesheet in turn provides a more detailed search form with text fields for meta-data search like for example Path, MimeType, Filename, and other document attributes.

Smila-search-form.png

Now, let's try the Default stylesheet and enter our first simple search using a word that you expect to be contained in your dummy files. In this tutorial, we assume that there is a match for the term "data" in the indexed documents. First, select the index on which you want to search from the Indexlist column on the left-hand side. Currently, there should be only one in the list, namely an index called "test_index". Note that the selected index name will appear in the Index Name text field of the search form. Then enter the desired term into the Query text field. And finally, click OK to send your query to SMILA. Your result could be similar to the following:

Searching-for-text-in-file.png

Now, let's use the Advanced stylesheet and search for the name of one of the files contained in the indexed folder to check whether it was properly indexed. In our example, we are going to search for a file named smila-glossary.html. Click Advanced to switch to the detailed search form, enter the desired file name into the Filename text field, then click OK to submit your search. Your result could be similar to the following:

Searching-by-filename.png


Configure and run the Web crawler

Now that we already know how to configure and start the File System crawler and how to search indices, configuring and running the Web crawler is rather straightforward.

There's no need to define a new job or start a new job run here: we want to use the same asynchronous workflow for indexing as before, so we can reuse the job run that is still running.

Let's have a look at the configuration file of the Web crawler which you can find at configuration/org.eclipse.smila.connectivity.framework/web.xml:

<DataSourceConnectionConfig  ...>
  <DataSourceID>web</DataSourceID>
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.web</SchemaID>
  <DataConnectionID>
    <Crawler>WebCrawler</Crawler>
  </DataConnectionID>
  <RecordBuffer Size="20" FlushInterval="3000" />
  <DeltaIndexing>full</DeltaIndexing>
  <Attributes>
    ....
  </Attributes>
  <Process>
    <WebSite ProjectName="Example Crawler Configuration" Header="Accept-Encoding: gzip,deflate; Via: myProxy" Referer="http://myReferer">
      <UserAgent Name="Crawler" Version="1.0" Description="teddy crawler" Url="http://www.teddy.com" Email="crawler@teddy.com"/>
      <CrawlingModel Type="MaxDepth" Value="1000"/>
      <CrawlScope Type="Path" />
      <CrawlLimits>
        ...
      </CrawlLimits>
      <Seeds FollowLinks="NoFollow">
        <Seed>http://wiki.eclipse.org/SMILA</Seed>
      </Seeds>
      <Filters>
        <Filter Type="RegExp" Value=".*action=edit.*" WorkType="Unselect"/>
        <Filter Type="RegExp" Value="^((?!/SMILA).)*$" WorkType="Unselect"/>
      </Filters>
      <MetaTagFilters>
        <MetaTagFilter Type="Name" Name="robots" Content="noindex,nofollow" WorkType="Unselect"/>
      </MetaTagFilters>      
    </WebSite>
  </Process>
</DataSourceConnectionConfig>
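
The two RegExp filters in the configuration above can be tried out in isolation. The Python sketch below assumes that a filter with WorkType="Unselect" excludes a URL when its expression matches the whole URL. That matching mode and the function name is_unselected are assumptions of this sketch, not statements about the crawler's implementation.

```python
import re

# The two RegExp filter values from the configuration above.
EDIT_PAGES = re.compile(r".*action=edit.*")
NOT_SMILA = re.compile(r"^((?!/SMILA).)*$")


def is_unselected(url: str) -> bool:
    """True if either filter matches the whole URL (assumed matching mode)."""
    return bool(EDIT_PAGES.fullmatch(url) or NOT_SMILA.fullmatch(url))
```

Under these assumptions, the second expression matches every URL that does not contain "/SMILA". This means a seed outside the SMILA wiki pages would be excluded as long as that filter stays in place.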

By default, the Web crawler is configured to index the URL http://wiki.eclipse.org/SMILA. To change this, set the content of the <Seed> element to the desired web address and adapt the <Filters> section accordingly. If you require further help on this configuration file refer to the Web crawler documentation. For example, in the following we changed the web address to the main page of Wikipedia and removed one of the <Filter> elements:

 ...
 <Seeds FollowLinks="NoFollow">
   <Seed>http://en.wikipedia.org/wiki/Main_Page</Seed>
 </Seeds>
 <Filters>
   <Filter Type="RegExp" Value=".*action=edit.*" WorkType="Unselect"/>
 </Filters>
 ...
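
The same change can be made programmatically. This Python sketch replaces the seed and removes filters by their Value attribute. The function name retarget_crawler is hypothetical, and elements are matched by local name because the namespace declarations of web.xml are not shown in this tutorial.

```python
import xml.etree.ElementTree as ET


def retarget_crawler(config_xml: str, seed_url: str, drop_filter_values) -> str:
    """Set every Seed to seed_url and drop Filter elements with the given values."""
    root = ET.fromstring(config_xml)
    # Take a snapshot of all elements first so removals cannot disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            name = child.tag.rsplit("}", 1)[-1]  # local name without namespace
            if name == "Seed":
                child.text = seed_url
            elif name == "Filter" and child.get("Value") in drop_filter_values:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```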

To start the crawling process, save the configuration file, go back or reconnect to JConsole, navigate to SMILA > CrawlerController > Operations, type "web" and "indexUpdateJob" into the text fields next to the startCrawlerTask button, then click the button.

Start-web-crawl-0.9.0.png

A message like the following will pop up after a successful start:

Start-crawl-web-result-0.9.0.png

Although the default limit for spidered web sites is set to 1,000 in the Web crawler configuration file, it may take a while for the web crawling run to finish. To find out when it has finished, click the getCrawlerTasksState button to monitor the crawl. This will produce output similar to the following:

SMILA-One-active-crawl-found-0.8.0.png

If you do not want to wait, you may as well stop the crawling run manually. In order to do this, type "web" into the text field next to the stopCrawlerTask button, then click this button.

As soon as the Web crawler's run has finished, go back to the search form to search the generated index:

Smila-search-form-web.png


Stop the indexing job run

Although there's no need for it, we can now finish our previously started indexing job run via the REST client (replace <job-id> with the job ID you received when you started the job run):

POST http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob/<job-id>/finish  

You can monitor the job run in your browser to see that it finished successfully:

http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob/<job-id>
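
Both URLs can be derived from the job name and the job run ID. The following Python sketch uses hypothetical helper names and again assumes that the finish call accepts an empty body.

```python
import urllib.request


def job_run_urls(job_name, job_id, base_url="http://localhost:8080/smila"):
    """Return the monitoring URL and the finish URL for one job run."""
    run_url = f"{base_url}/jobmanager/jobs/{job_name}/{job_id}"
    return {"monitor": run_url, "finish": f"{run_url}/finish"}


def finish_request(job_name, job_id):
    """Build (but do not send) the POST request that finishes the job run."""
    return urllib.request.Request(
        url=job_run_urls(job_name, job_id)["finish"], data=b"", method="POST"
    )
```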

In the SMILA.log file you will see messages like these:

 INFO ... internal.JobManagerImpl   - finish called for job 'indexUpdateJob', run '20110901-141457584011'
 ...
 INFO ... internal.JobManagerImpl   - Completing job run '20110901-141457584011' for job 'indexUpdateJob' with final state SUCCEEDED

Just another 5 minutes to change the workflow

In previous sections all data collected by crawlers was processed with the same asynchronous "indexUpdate" workflow using the BPEL pipeline "AddPipeline". All data was indexed into the same index named "test_index". It is possible, however, to configure SMILA so that data from different data sources will go through different workflows and pipelines and will be indexed into different indices. This requires somewhat more advanced configuration than before, but it is still quite simple.

In the following sections we are going to use the generic asynchronous "importToPipeline" workflow, which lets you specify the BPEL pipeline that processes the data. We create an additional BPEL pipeline for Web crawler records so that Web crawler data will be indexed into a separate index named "web_index".

Configure LuceneIndexService

It is very important to shut down and restart the SMILA engine after making the following configuration changes, because modified configurations are loaded during startup only.

For more information about the LuceneIndexService, please see LuceneIndexService.

Let's configure the structure and search template of our "web_index" index. Add the following code to the end of the configuration/org.eclipse.smila.search.datadictionary/DataDictionary.xml file, before the closing </AnyFinderDataDictionary> tag:

  <Index Name="web_index">
    <Connection xmlns="http://www.anyfinder.de/DataDictionary/Connection" MaxConnections="5"/>
    <IndexStructure xmlns="http://www.anyfinder.de/IndexStructure" Name="web_index">
      <Analyzer ClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <IndexField FieldNo="8" IndexValue="true" Name="MimeType" StoreText="true" Tokenize="true" Type="Text"/>
      <IndexField FieldNo="7" IndexValue="true" Name="Size" StoreText="true" Tokenize="true" Type="Text"/>
      <IndexField FieldNo="6" IndexValue="true" Name="Extension" StoreText="true" Tokenize="true" Type="Text"/>
      <IndexField FieldNo="5" IndexValue="true" Name="Title" StoreText="true" Tokenize="true" Type="Text"/>
      <IndexField FieldNo="4" IndexValue="true" Name="Url" StoreText="true" Tokenize="false" Type="Text">
        <Analyzer ClassName="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
      </IndexField>
      <IndexField FieldNo="3" IndexValue="true" Name="LastModifiedDate" StoreText="true" Tokenize="false" Type="Text"/>
      <IndexField FieldNo="2" IndexValue="true" Name="Path" StoreText="true" Tokenize="true" Type="Text"/>
      <IndexField FieldNo="1" IndexValue="true" Name="Filename" StoreText="true" Tokenize="true" Type="Text"/>
      <IndexField FieldNo="0" IndexValue="true" Name="Content" StoreText="true" Tokenize="true" Type="Text"/>
    </IndexStructure>
    <Configuration xmlns="http://www.anyfinder.de/DataDictionary/Configuration" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://www.anyfinder.de/DataDictionary/Configuration ../xml/DataDictionaryConfiguration.xsd">
      <DefaultConfig>
        <Field FieldNo="8">
          <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
          </FieldConfig>
        </Field>
        <Field FieldNo="7">
          <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
          </FieldConfig>
        </Field>
        <Field FieldNo="6">
          <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
          </FieldConfig>
        </Field>        
        <Field FieldNo="5">
          <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
          </FieldConfig>
        </Field>
        <Field FieldNo="4">
          <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
          </FieldConfig>
        </Field>
        <Field FieldNo="3">
          <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
          </FieldConfig>
        </Field>
        <Field FieldNo="2">
          <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
          </FieldConfig>
        </Field>
        <Field FieldNo="1">
          <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
          </FieldConfig>
        </Field>
        <Field FieldNo="0">
          <FieldConfig Constraint="required" Weight="1" xsi:type="FTText">
            <NodeTransformer xmlns="http://www.anyfinder.de/Search/ParameterObjects" Name="urn:ExtendedNodeTransformer">
              <ParameterSet xmlns="http://www.brox.de/ParameterSet"/>
            </NodeTransformer>
            <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="AND" Tolerance="exact"/>
          </FieldConfig>
        </Field>
      </DefaultConfig>
   </Configuration>
  </Index>

Now we need to map the attribute and attachment names to the Lucene "FieldNo" values defined in DataDictionary.xml. Open the configuration/org.eclipse.smila.lucene/Mappings.xml file and add the following code to the end of the file, before the closing </Mappings> tag:

  <Mapping indexName="web_index">
    <Attributes>
      <Attribute name="Filename" fieldNo="1" />
      <Attribute name="Path" fieldNo="2" />    
      <Attribute name="LastModifiedDate" fieldNo="3" />
      <Attribute name="Url" fieldNo="4" />
      <Attribute name="Title" fieldNo="5" />    
      <Attribute name="Extension" fieldNo="6" />
      <Attribute name="Size" fieldNo="7" />
      <Attribute name="MimeType" fieldNo="8" />           
    </Attributes>
    <Attachments>
      <Attachment name="Content" fieldNo="0" />      
    </Attachments>
  </Mapping>
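
Because the field numbers have to agree between the two files, a quick consistency check can save a restart. The Python sketch below compares the name-to-number pairs of both snippets. All function names are hypothetical, and elements are matched by local name so that the namespaces used in DataDictionary.xml do not matter.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def index_fields(data_dictionary_xml: str, index_name: str) -> dict:
    """Map field name -> FieldNo for one Index element of DataDictionary.xml."""
    root = ET.fromstring(data_dictionary_xml)
    for index in root.iter():
        if _local(index.tag) == "Index" and index.get("Name") == index_name:
            return {f.get("Name"): int(f.get("FieldNo"))
                    for f in index.iter() if _local(f.tag) == "IndexField"}
    return {}


def mapped_fields(mappings_xml: str, index_name: str) -> dict:
    """Map attribute/attachment name -> fieldNo for one Mapping element."""
    root = ET.fromstring(mappings_xml)
    for mapping in root.iter():
        if _local(mapping.tag) == "Mapping" and mapping.get("indexName") == index_name:
            return {e.get("name"): int(e.get("fieldNo"))
                    for e in mapping.iter()
                    if _local(e.tag) in ("Attribute", "Attachment")}
    return {}


def mismatches(data_dictionary_xml: str, mappings_xml: str, index_name: str) -> dict:
    """Return {name: (FieldNo, fieldNo)} for every name whose numbers differ."""
    fields = index_fields(data_dictionary_xml, index_name)
    mapped = mapped_fields(mappings_xml, index_name)
    return {name: (fields.get(name), mapped.get(name))
            for name in set(fields) | set(mapped)
            if fields.get(name) != mapped.get(name)}
```

An empty result means that every field defined for the index has a mapping with the same number, and the other way around.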

Create a new BPEL pipeline

We need to add the AddWebPipeline pipeline to the BPEL WorkflowProcessor. For more information about BPEL WorkflowProcessor please check the BPEL WorkflowProcessor documentation. Predefined BPEL WorkflowProcessor configuration files are contained in the configuration/org.eclipse.smila.processing.bpel/pipelines directory. However, we can add new BPEL pipelines with the SMILA REST API.

Start SMILA if it's not yet running, and use your favourite REST client to add the "AddWebPipeline" BPEL pipeline. (The BPEL XML is hard to read because it has to be escaped to be valid JSON content.)

POST http://localhost:8080/smila/pipeline
  {
    "name":"AddWebPipeline",
    "definition":"<?xml version=\"1.0\" encoding=\"utf-8\" ?>\r\n<process name=\"AddWebPipeline\" targetNamespace=\"http://www.eclipse.org/smila/processor\"\r\n    xmlns=\"http://docs.oasis-open.org/wsbpel/2.0/process/executable\" \r\n    xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"\r\n    xmlns:proc=\"http://www.eclipse.org/smila/processor\" \r\n    xmlns:rec=\"http://www.eclipse.org/smila/record\">\r\n  <import location=\"processor.wsdl\" namespace=\"http://www.eclipse.org/smila/processor\"\r\n      importType=\"http://schemas.xmlsoap.org/wsdl/\" />\r\n  <partnerLinks>\r\n    <partnerLink name=\"Pipeline\" partnerLinkType=\"proc:ProcessorPartnerLinkType\" myRole=\"service\" />\r\n  </partnerLinks>\r\n  <extensions>\r\n    <extension namespace=\"http://www.eclipse.org/smila/processor\" mustUnderstand=\"no\" />\r\n  </extensions>\r\n  <variables>\r\n    <variable name=\"request\" messageType=\"proc:ProcessorMessage\" />\r\n  </variables>\r\n  <sequence>\r\n    <receive name=\"start\" partnerLink=\"Pipeline\" portType=\"proc:ProcessorPortType\" operation=\"process\"\r\n        variable=\"request\" createInstance=\"yes\" />     \r\n    <if name=\"conditionIsText\">\r\n      <condition>starts-with($request.records/rec:Record[1]/rec:Val[@key=\"MimeType\"],\"text/\")</condition>\r\n      <sequence name=\"processTextBasedContent\">           \r\n        <if name=\"conditionIsHtml\">\r\n          <condition>starts-with($request.records/rec:Record[1]/rec:Val[@key=\"MimeType\"],\"text/html\")\r\n            or starts-with($request.records/rec:Record[1]/rec:Val[@key=\"MimeType\"],\"text/xml\")\r\n          </condition>\r\n        </if>        \r\n        <extensionActivity>\r\n          <proc:invokePipelet name=\"invokeHtml2Txt\">\r\n            <proc:pipelet class=\"org.eclipse.smila.processing.pipelets.HtmlToTextPipelet\" />\r\n            <proc:variables input=\"request\" output=\"request\" />\r\n            <proc:configuration>\r\n              <rec:Val key=\"inputType\">ATTACHMENT</rec:Val>\r\n              <rec:Val key=\"outputType\">ATTACHMENT</rec:Val>\r\n              <rec:Val key=\"inputName\">Content</rec:Val>\r\n              <rec:Val key=\"outputName\">Content</rec:Val>\r\n              <rec:Val key=\"meta:title\">Title</rec:Val>      \r\n            </proc:configuration>                      \r\n          </proc:invokePipelet>\r\n        </extensionActivity> \r\n        <extensionActivity>\r\n          <proc:invokePipelet name=\"invokeLucenePipelet\">\r\n            <proc:pipelet class=\"org.eclipse.smila.lucene.pipelets.LuceneIndexPipelet\" />\r\n            <proc:variables input=\"request\" output=\"request\" />\r\n            <proc:configuration>\r\n              <rec:Val key=\"indexname\">web_index</rec:Val>\r\n                <rec:Val key=\"executionMode\">ADD</rec:Val>\r\n              </proc:configuration>\r\n          </proc:invokePipelet>\r\n        </extensionActivity>\r\n      </sequence>        \r\n    </if>    \r\n    <reply name=\"end\" partnerLink=\"Pipeline\" portType=\"proc:ProcessorPortType\" operation=\"process\" variable=\"request\" />\r\n    <exit />\r\n  </sequence>\r\n</process>\r\n"
  }
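
Escaping the BPEL by hand is error-prone, so it can help to keep the pipeline as a readable XML file and let a JSON library do the escaping. A minimal Python sketch, in which the function name pipeline_payload is hypothetical:

```python
import json


def pipeline_payload(name: str, bpel_xml: str) -> str:
    """Wrap readable BPEL XML in a JSON body with "name" and "definition"."""
    # json.dumps escapes quotes and line breaks, so the XML can stay readable on disk.
    return json.dumps({"name": name, "definition": bpel_xml})
```

The returned string can be used as the body of the POST request shown above. Read the XML from wherever you keep the pipeline file and pass its contents as the second argument.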

Note that we used the index name "web_index" for the LuceneIndexPipelet in the BPEL above:

...
<proc:configuration>
  <rec:Val key="indexname">web_index</rec:Val>
  ...
</proc:configuration>
...

You can monitor the defined BPEL pipelines via browser, so you should find your new pipeline there:

http://localhost:8080/smila/pipeline

Create and start a new indexing job

We define an indexing job based on the predefined asynchronous workflow "importToPipeline" (see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json). This indexing job will process the imported data by using our new BPEL pipeline "AddWebPipeline".

The "importToPipeline" workflow contains a PipelineProcessingWorker worker which is not configured for dedicated BPEL pipelines, so the BPEL pipelines handling adds and deletes have to be set via job parameter.

Use your favourite REST Client to create an appropriate job definition:

POST http://localhost:8080/smila/jobmanager/jobs/
  {
    "name":"indexWebJob",
    "parameters":{      
      "tempStore": "temp",
      "addPipeline": "AddWebPipeline",
      "deletePipeline": "DeletePipeline" 
     },
    "workflow":"importToPipeline"
  }

Note that the "DeletePipeline" is not needed for our test scenario here, but we must supply a value for every workflow parameter that is still undefined.
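
A job definition can be checked for missing parameters before it is posted. In the Python sketch below, the function name is hypothetical. Treating exactly the three parameters from the job definition above as the ones required by "importToPipeline" is an assumption, and workflows.json remains the authoritative source.

```python
def missing_parameters(required, job_definition: dict) -> list:
    """Return required parameter names that the job definition leaves unset."""
    provided = job_definition.get("parameters", {})
    return sorted(name for name in required if name not in provided)


# Parameter names taken from the job definition above (assumed to be complete).
REQUIRED = ("tempStore", "addPipeline", "deletePipeline")

job = {
    "name": "indexWebJob",
    "parameters": {
        "tempStore": "temp",
        "addPipeline": "AddWebPipeline",
        "deletePipeline": "DeletePipeline",
    },
    "workflow": "importToPipeline",
}
```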

Afterwards, start a job run for the defined job:

POST http://localhost:8080/smila/jobmanager/jobs/indexWebJob

Put it all together

OK, we have now finished configuring SMILA to use separate BPEL pipelines for file system and web crawling and to index the data from these crawlers into different indices. Here is what we have done so far:

  1. We added the web_index index to the Lucene configuration.
  2. We created a new BPEL pipeline for Web crawler data referencing the new Lucene index.
  3. We used a separate job for web indexing that references the new BPEL pipeline.

Now, run the Web crawler again. Remember to use "indexWebJob" as the job name parameter!

Go back to your browser at http://localhost:8080/SMILA/search, select the new index "web_index" and run a search:

Web index-search.png

Configuration overview

SMILA configuration files are located in the configuration directory of the SMILA application. The following lists the configuration files and documentation links relevant to this tutorial, regarding SMILA components:

Crawler

  • configuration folder: org.eclipse.smila.connectivity.framework
    • file.xml (FileSystem Crawler)
    • web.xml (Web Crawler)
  • Documentation

Jobmanager

BPEL Pipelines

Lucene Index Service

  • DataDictionary
    • configuration folder: org.eclipse.smila.search.datadictionary
      • DataDictionary.xml
  • Lucene Mappings
    • configuration folder: org.eclipse.smila.lucene
      • Mappings.xml
  • Documentation
