


Difference between revisions of "SMILA/Documentation/2011.Simplification/Documentation for 5 Minutes to Success"

m (Configure and run the web crawler.)
(For SMILA 1.0: Simplification pages are obsolete, redirect to SMILA/Documentation_for_5_Minutes_to_Success)
 
(44 intermediate revisions by the same user not shown)
[[Category:SMILA]]
+
#REDIRECT [[SMILA/Documentation_for_5_Minutes_to_Success]]
[[Category:HowTo]]
+
 
+
This page contains installation instructions for the SMILA application and helps you with your first steps in SMILA.
+
 
+
== Download and unpack the SMILA application. ==
+
 
+
After [http://www.eclipse.org/smila/downloads.php downloading] and unpacking you should have the following folder structure.
+
 
+
[[Image:Installation.png]]
+
 
+
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
|
+
If you use the Linux version of SMILA, make sure that the file SMILA has executable permissions. If it does not, set them by running ''chmod +x ./SMILA'' in a console.
+
|}
+
 
+
== Start the SMILA engine. ==
+
 
+
To start the SMILA engine, double-click SMILA.exe, or open a command line, navigate to the directory that contains the extracted files, and run the SMILA executable. Wait until the engine has fully started. If everything is OK, you should see output similar to the following screenshot:
+
 
+
[[Image:Smila-console-0.8.0.png]]
+
 
+
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
|
+
Note: To run SMILA, the JRE executable must be on your PATH environment variable. The JVM version should be at least Java 5.
+
 
+
Alternatively, you can configure the JVM in the SMILA.ini file instead of the environment variable. Put the argument ''-vm <path/to/jre/executable>'' at the beginning of SMILA.ini, for example:
+
 
+
-vm
+
d:/java/jre1.5.0.16/bin/java
+
...
+
|}
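Before starting the engine, it can help to confirm that a Java launcher is actually visible on PATH. The following is a minimal Python sketch of such a check; the function names are illustrative and not part of SMILA, and the script only reports what it finds.

```python
import shutil
import subprocess


def find_java(executable="java"):
    """Return the full path of the Java launcher found on PATH, or None."""
    return shutil.which(executable)


def java_version_banner(java_path):
    """Return the text printed by `java -version` (usually written to stderr)."""
    result = subprocess.run([java_path, "-version"],
                            capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    path = find_java()
    if path is None:
        # No launcher on PATH: either extend PATH or use -vm in SMILA.ini.
        print("No java executable found on PATH.")
    else:
        print(path)
        print(java_version_banner(path))
```

If the check finds nothing, either extend PATH or use the ''-vm'' entry in SMILA.ini as described in the note above.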
+
 
+
== Check the log file. ==
+
You can check what's happening in the background by opening the SMILA log file in an editor. This file is named <tt>SMILA.log</tt> and can be found in the same directory as the SMILA executable.
+
 
+
[[Image:Smila-log.png]]
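Instead of reading the whole log in an editor, you can filter it for problems. The sketch below assumes that log lines carry the level as a separate word (as in the <tt>ERROR</tt> line quoted later on this page); the helper name is hypothetical.

```python
from pathlib import Path


def find_error_lines(log_path, level="ERROR"):
    """Return all lines of the log file that contain the given level as a word."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    marker = f" {level} "
    return [line for line in text.splitlines() if marker in line]
```

Calling <tt>find_error_lines("SMILA.log")</tt> from the directory of the SMILA executable returns the matching lines as a list, which is empty if nothing was logged at that level.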
+
 
+
== Configure crawling jobs. ==
+
Now that the SMILA engine is up and running, we can start the crawling jobs. Crawling jobs are managed over the JMX protocol, which means that you can connect to SMILA with a JMX client of your choice. We will use JConsole for this purpose because it ships with the Sun Java distribution.
+
 
+
Start the JConsole executable in your JDK distribution (<tt><JAVA_HOME>/bin/jconsole</tt>). Once the client is up and running, select the PID of the SMILA process in the ''Connect'' window and click ''Connect''.
+
 
+
[[Image:Jconsole.png]]
+
 
+
Next, switch to the ''MBeans'' tab, expand the SMILA node in the ''MBeans'' tree on the left side of the window, and click the <tt>CrawlerController</tt> node. This node is used to manage and monitor all crawling activities.
+
 
+
[[Image:Mbeans-overview-0.8.0.png]]
+
 
+
== Start the file system crawler. ==
+
To start the file system crawler, open the ''Operations'' tab in the right pane, type "file" into the text field next to the <tt>startCrawl</tt> button, and click the button.
+
 
+
[[Image:Start-file-crawl-0.8.0.png]]
+
 
+
You should receive a message similar to the following, indicating that the crawler has been successfully started:
+
 
+
[[Image:Start-crawl-file-result-0.8.0.png]]
+
 
+
Now we can check the log file to see what happened:
+
 
+
[[Image:File-crawl-log.png]]
+
 
+
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
|
+
By default, the file system crawler is configured to crawl the folder c:\data.
+
 
+
<Process>
+
    <BaseDir>c:\data</BaseDir>
+
    ...     
+
</Process>
+
|}
+
 
+
== Configuring the filesystem crawler. ==
+
 
+
A possible error message could be that the folder c:\data is not found:
+
 
+
<code>
+
'' 2009-01-27 18:25:59,592 [Thread-12] ERROR impl.CrawlThread - org.eclipse.smila.connectivity.framework.CrawlerCriticalException: Folder "c:\data" is not found ''
+
</code>
+
 
+
The error message above states that the crawler tried to index the folder <tt>c:\data</tt> but was not able to find it. To solve this, let's create a folder with sample data, say <tt>c:\data</tt>, put some dummy text files into it, and configure the file system crawler to index it.
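The sample folder can be created by hand or with a few lines of script. The following Python sketch writes one plain text file and one HTML file that both contain the word "data", which matches the searches used later in this tutorial. The file name <tt>notes.txt</tt> and the file contents are made up for illustration.

```python
from pathlib import Path


def create_sample_data(folder):
    """Create the folder if needed and write two small files containing 'data'."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    files = {
        "notes.txt": "This plain text file holds some sample data for the crawler.\n",
        "glossary.html": (
            "<html><head><title>Glossary</title></head>"
            "<body><p>Sample data for the crawler.</p></body></html>\n"
        ),
    }
    for name, content in files.items():
        (target / name).write_text(content, encoding="utf-8")
    # Return the created file names in alphabetical order.
    return sorted(files)
```

For example, <tt>create_sample_data(r"c:\data")</tt> creates the folder used by the default configuration.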
+
To configure the crawler to index other directories, open the configuration file of the crawler at <tt>configuration/org.eclipse.smila.connectivity.framework/file.xml</tt>. Modify the ''BaseDir'' element by setting its content to an absolute path that points to your new directory. Don't forget to save the file.
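If you prefer to script this change, the following Python sketch replaces the content of every ''BaseDir'' element in an XML text. It only assumes the <tt>Process</tt>/<tt>BaseDir</tt> fragment shown above; the rest of <tt>file.xml</tt> is not reproduced here, and the target path is an example. Note that Python's ElementTree drops XML comments and may rewrite namespace prefixes when it serializes, so keep a backup of the original file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def set_base_dir(xml_text, new_dir):
    """Return the XML text with the content of every BaseDir element replaced."""
    root = ET.fromstring(xml_text)
    changed = 0
    for element in root.iter():
        if isinstance(element.tag, str) and local_name(element.tag) == "BaseDir":
            element.text = new_dir
            changed += 1
    if changed == 0:
        raise ValueError("no BaseDir element found")
    return ET.tostring(root, encoding="unicode")


# Example with the fragment from this page (the new path is hypothetical).
sample = "<Process><BaseDir>c:\\data</BaseDir></Process>"
print(set_base_dir(sample, "d:/smila/sample-data"))
```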
+
 
+
{|width="100%" style="background-color:#ffcccc; padding-left:30px;"
+
|
+
Note: Currently only plain text and html files are crawled and indexed correctly by SMILA crawlers.
+
|}
+
 
+
== Searching on the indices. ==
+
To search the indices that were created by the crawlers, point your browser to <tt>http://localhost:8080/SMILA/search</tt>. There are currently two stylesheets (SMILASearchDefault and SMILASearchAdvanced) for the search form. If you choose SMILASearchDefault, you can only search the content of the files (field Query). In the left column, below the ''Indexlist'' header, you find the names of all available indices. Currently, there should be only one index in the list (test_index).
+
 
+
[[Image:Smila-search-form.png]]
+
 
+
Now let's try to search for a word that occurs in your dummy files. In this tutorial we assume that there was a word "data" in both text files.
+
 
+
[[Image:Searching-for-text-in-file.png]]
+
 
+
Now pick the name of one of the files in your data folder. For example, suppose there was a file named <tt>glossary.html</tt> in the sample folder. Let's check whether it was indexed. Switch to SMILASearchAdvanced, type "glossary.html" (or the name of a file in your data folder) into the ''Filename'' field, and click the Submit button again:
+
 
+
[[Image:Searching-by-filename.png]]
+
 
+
== Configure and run the web crawler. ==
+
Now that we know how to start and configure the file system crawler and how to search the indices, configuring and running the web crawler is straightforward.
+
The configuration file of the web crawler is located in the <tt>configuration/org.eclipse.smila.connectivity.framework</tt> directory and is named <tt>web.xml</tt>:
+
 
+
[[Image:Webcrawler-config.png]]
+
 
+
By default the web crawler is configured to index the URL ''http://wiki.eclipse.org/SMILA''. To change this, open the file in an editor of your choice and set the content of the <tt>&lt;Seed&gt;</tt> element to the desired web site. Detailed information on the configuration of the web crawler is also available at the [[SMILA/Documentation/Web_Crawler|Web crawler]] configuration page.
+
 
+
To start the crawling process, save the configuration file, open the ''Operations'' tab in JConsole again, type "web" into the text field next to the <tt>startCrawl</tt> button, and click the button.
+
 
+
[[Image:Starting-web-crawler-0.8.0.png]]
+
 
+
 
+
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
|
+
Note: The ''Operations'' tab in JConsole also provides buttons to stop a crawler, get the list of active crawlers and the current status of a particular crawling job.
+
The default limit for downloaded documents is set to 1000 in the web crawler configuration, so it can take a while for the web crawling job to finish. You can stop the crawling job manually by typing "web" next to the <tt>stopCrawl</tt> button and then clicking that button.
+
As an example the following screenshot shows the result after the <tt>getActiveCrawlsStatus</tt> button has been clicked while the web crawler is running:
+
|}
+
[[Image:One-active-crawl-found.png]]
+
 
+
When the web crawler's job is finished, you can search the generated index just as described above for the file system crawler (see the section "Searching on the indices").
+
 
+
 
+
+
 
+
== Managing CrawlerController with jmxclient. ==
+
 
+
In addition to managing crawling jobs with JConsole, it is also possible to use jmxclient from the SMILA distribution. Jmxclient is a console application that lets you manage crawl jobs and create scripts for batch crawler execution. For more information, see the [[SMILA/Documentation/Management#JMX_Client|jmxclient documentation]]. The jmxclient application is located in the <tt>jmxclient</tt> directory. Use the appropriate run script (run.bat or run.sh) to start the application.
+
For example, to start the file system crawler, use the following command:
+
 
+
<code>
+
''run crawl file''
+
</code>
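For batch execution, the command lines can be generated from a list of data source IDs. The Python sketch below only builds the commands and does not run them. It assumes that the <tt>crawl</tt> command accepts other IDs such as "web" in the same way as "file"; this page only shows the "file" case, so check the jmxclient documentation before relying on it.

```python
import os


def run_script_name():
    """Pick the run script in the jmxclient directory for this platform."""
    return "run.bat" if os.name == "nt" else "./run.sh"


def crawl_commands(data_source_ids, script=None):
    """Build one 'run crawl <id>' command line per data source ID."""
    script = script or run_script_name()
    return [[script, "crawl", source_id] for source_id in data_source_ids]


# "web" is an assumed ID here; only "file" appears in the example above.
for command in crawl_commands(["file", "web"]):
    print(" ".join(command))
```

Each resulting list can be passed to a process launcher of your choice, started from within the <tt>jmxclient</tt> directory.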
+
 
+
== 5 Minutes for Changing Workflow  ==
+
In previous sections all data collected by crawlers was processed with the same workflow and indexed into the same index, test_index.
+
It's possible to configure SMILA so that data from different data sources goes through different workflows and is indexed into different indices. This requires more advanced configuration than before but is still quite simple.
+
Let's create an additional workflow for webcrawler records so that webcrawler data is indexed into a separate index, say web_index.
+
 
+
=== Modify Listener rules. ===
+
 
+
First, let's modify the default ADD rule in the Listener and add another rule so that webcrawler records are processed by a separate BPEL workflow.
+
For more information about Listener, please see the section [[SMILA/Documentation/QueueWorker/Listener|Listener]] of the [[SMILA/Documentation/QueueWorker|QueueWorker]] documentation.
+
The Listener configuration is located at
+
<tt>configuration/org.eclipse.smila.connectivity.queue.worker.jms/QueueWorkerListenerConfig.xml</tt>
+
Open that file and edit the <tt><Condition></tt> tag of the default ADD Rule. The result should be as follows:
+
<source lang="xml">
+
<Rule Name="ADD Rule" WaitMessageTimeout="10" Threads="2">
+
  <Source BrokerId="broker1" Queue="SMILA.connectivity"/>
+
  <Condition>Operation='ADD' and NOT(DataSourceID LIKE 'web%')</Condition>
+
  <Task>
+
    <Process Workflow="AddPipeline"/>
+
  </Task>
+
</Rule>
+
</source>
+
Now add the following new rule to this file:
+
<source lang="xml">
+
<Rule Name="Web ADD Rule" WaitMessageTimeout="10" Threads="2">
+
  <Source BrokerId="broker1" Queue="SMILA.connectivity"/>
+
  <Condition>Operation='ADD' and DataSourceID LIKE 'web%'</Condition>
+
  <Task>
+
    <Process Workflow="AddWebPipeline"/>
+
  </Task>
+
</Rule>
+
</source>
+
Notice that we modified the condition in the ADD Rule to skip webcrawler data. Webcrawler data will be processed by the new Web ADD Rule.
+
The Web ADD Rule defines that webcrawler data is processed by the AddWebPipeline workflow, so next we need to create the AddWebPipeline workflow.
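To see that the two conditions never match the same record, here is a small Python model of the routing. It is only an illustration of the two <tt>Condition</tt> expressions above, not the way SMILA evaluates them; <tt>LIKE 'web%'</tt> is modelled as a simple prefix test.

```python
def like_prefix(value, pattern):
    """Model the form LIKE 'prefix%' used in the two rules as a prefix test."""
    if not pattern.endswith("%") or "%" in pattern[:-1] or "_" in pattern:
        raise ValueError("this sketch supports only simple 'prefix%' patterns")
    return value.startswith(pattern[:-1])


def select_workflow(operation, data_source_id):
    """Return the workflow named by the matching rule, or None if neither matches."""
    is_web = like_prefix(data_source_id, "web%")
    if operation == "ADD" and not is_web:
        return "AddPipeline"      # ADD Rule
    if operation == "ADD" and is_web:
        return "AddWebPipeline"   # Web ADD Rule
    return None                   # not covered by these two rules
```

With this model, an ADD record from the data source "file" goes to AddPipeline, while any data source ID that starts with "web" goes to AddWebPipeline.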
+
 
+
=== Create workflow for the BPEL WorkflowProcessor ===
+
We need to add the AddWebPipeline workflow to the BPEL WorkflowProcessor. For more information about the BPEL WorkflowProcessor, see the [[SMILA/Documentation/BPEL_Workflow_Processor|BPEL WorkflowProcessor]] documentation.
+
BPEL WorkflowProcessor configuration files are located in the <tt>configuration/org.eclipse.smila.processing.bpel/pipelines</tt> directory.
+
There is a file <tt>addpipeline.bpel</tt> that defines the AddPipeline process. Let's create the file <tt>addwebpipeline.bpel</tt>, which will define the AddWebPipeline process, and put the following code into it:
+
<source lang="xml">
+
<?xml version="1.0" encoding="utf-8" ?>
+
<process name="AddWebPipeline" targetNamespace="http://www.eclipse.org/smila/processor"
+
    xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable"
+
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
+
    xmlns:proc="http://www.eclipse.org/smila/processor"
+
    xmlns:rec="http://www.eclipse.org/smila/record">
+
 
+
  <import location="processor.wsdl" namespace="http://www.eclipse.org/smila/processor"
+
      importType="http://schemas.xmlsoap.org/wsdl/" />
+
 
+
  <partnerLinks>
+
    <partnerLink name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" myRole="service" />
+
  </partnerLinks>
+
 
+
  <extensions>
+
    <extension namespace="http://www.eclipse.org/smila/processor" mustUnderstand="no" />
+
  </extensions>
+
 
+
  <variables>
+
    <variable name="request" messageType="proc:ProcessorMessage" />
+
  </variables>
+
 
+
  <sequence>
+
    <receive name="start" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process"
+
        variable="request" createInstance="yes" />
+
 
+
    <!-- only process text based content, skip everything else -->
+
    <if name="conditionIsText">
+
      <condition>
+
        contains($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V, "text/")
+
      </condition>
+
      <sequence name="processTextBasedContent">  
+
+
        <!-- extract txt from html files -->
+
        <if name="conditionIsHtml">
+
          <condition>
+
          ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V[contains(., "text/html")])
+
          or
+
          ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V[contains(., "text/xml")])
+
          </condition>
+
+
          <extensionActivity name="invokeHtml2Txt">
+
            <proc:invokePipelet>
+
              <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
+
              <proc:variables input="request" output="request" />
+
              <proc:PipeletConfiguration>
+
                <proc:Property name="inputType">
+
                  <proc:Value>ATTACHMENT</proc:Value>
+
                </proc:Property>
+
                <proc:Property name="outputType">
+
                  <proc:Value>ATTACHMENT</proc:Value>
+
                </proc:Property>
+
                <proc:Property name="inputName">
+
                  <proc:Value>Content</proc:Value>
+
                </proc:Property>
+
                <proc:Property name="outputName">
+
                  <proc:Value>Content</proc:Value>
+
                </proc:Property>
+
                <proc:Property name="meta:title">
+
                  <proc:Value>Title</proc:Value>
+
                </proc:Property>
+
              </proc:PipeletConfiguration>
+
            </proc:invokePipelet>
+
          </extensionActivity>
+
        </if>
+
 
+
        <extensionActivity name="invokeLucenePipelet">
+
          <proc:invokePipelet>
+
            <proc:pipelet class="org.eclipse.smila.lucene.pipelets.LuceneIndexPipelet" />
+
            <proc:variables input="request" output="request" />
+
            <proc:setAnnotations>
+
              <rec:An n="org.eclipse.smila.lucene.LuceneIndexService">
+
                <rec:V n="indexName">web_index</rec:V>
+
                <rec:V n="executionMode">ADD</rec:V>
+
              </rec:An>
+
            </proc:setAnnotations>
+
          </proc:invokePipelet>
+
        </extensionActivity>
+
+
      </sequence>
+
    </if>
+
 
+
    <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType"
+
        operation="process" variable="request" />
+
    <exit />
+
  </sequence>
+
</process>
+
</source>
+
 
+
Note that we use the index name "web_index" in the LuceneIndexService annotation in the code above:
+
<source lang="xml">
+
<rec:An n="org.eclipse.smila.lucene.LuceneIndexService">
+
  <rec:V n="indexName">web_index</rec:V>
+
  <rec:V n="executionMode">ADD</rec:V>
+
</rec:An>
+
</source>
+
 
+
We need to add our pipeline description to the <tt>deploy.xml</tt> file located in the same directory. Add the following code at the end of <tt>deploy.xml</tt>, before the closing <tt></deploy></tt> tag:
+
<source lang="xml">
+
<process name="proc:AddWebPipeline">
+
  <in-memory>true</in-memory>
+
  <provide partnerLink="Pipeline">
+
    <service name="proc:AddWebPipeline" port="ProcessorPort" />
+
  </provide>   
+
</process>
+
</source>
+
 
+
Now we need to add our web_index to the LuceneIndexService configuration.
+
 
+
=== LuceneIndexService configuration ===
+
For more information about the LuceneIndexService, please see [[SMILA/Documentation/LuceneIndexService|LuceneIndexService]].
+
 
+
Let's configure the index structure and search template for our web_index. Add the following code at the end of the <tt>configuration/org.eclipse.smila.search.datadictionary/DataDictionary.xml</tt> file, before the closing <tt></AnyFinderDataDictionary></tt> tag:
+
<source lang="xml">
+
<Index Name="web_index">
+
  <Connection xmlns="http://www.anyfinder.de/DataDictionary/Connection" MaxConnections="5"/>
+
  <IndexStructure xmlns="http://www.anyfinder.de/IndexStructure" Name="web_index">
+
    <Analyzer ClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
+
    <IndexField FieldNo="8" IndexValue="true" Name="MimeType" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="7" IndexValue="true" Name="Size" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="6" IndexValue="true" Name="Extension" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="5" IndexValue="true" Name="Title" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="4" IndexValue="true" Name="Url" StoreText="true" Tokenize="false" Type="Text">
+
      <Analyzer ClassName="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
+
    </IndexField>
+
    <IndexField FieldNo="3" IndexValue="true" Name="LastModifiedDate" StoreText="true" Tokenize="false" Type="Text"/>
+
    <IndexField FieldNo="2" IndexValue="true" Name="Path" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="1" IndexValue="true" Name="Filename" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="0" IndexValue="true" Name="Content" StoreText="true" Tokenize="true" Type="Text"/>
+
  </IndexStructure>
+
  <Configuration xmlns="http://www.anyfinder.de/DataDictionary/Configuration" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+
xsi:schemaLocation="http://www.anyfinder.de/DataDictionary/Configuration ../xml/DataDictionaryConfiguration.xsd">
+
    <DefaultConfig>
+
      <Field FieldNo="8">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="7">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="6">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>       
+
      <Field FieldNo="5">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="4">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="3">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="2">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="1">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="0">
+
        <FieldConfig Constraint="required" Weight="1" xsi:type="FTText">
+
          <NodeTransformer xmlns="http://www.anyfinder.de/Search/ParameterObjects" Name="urn:ExtendedNodeTransformer">
+
            <ParameterSet xmlns="http://www.brox.de/ParameterSet"/>
+
          </NodeTransformer>
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="AND" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
    </DefaultConfig>
+
  </Configuration>
+
</Index>
+
</source>
+
Now we need to map the attribute and attachment names to the Lucene "FieldNo" values defined in <tt>DataDictionary.xml</tt>. Open the <tt>configuration/org.eclipse.smila.lucene/Mappings.xml</tt> file and add the following code at the end of the file, before the closing <tt></Mappings></tt> tag:
+
<source lang="xml">
+
<Mapping indexName="web_index">
+
  <Attributes>
+
    <Attribute name="Filename" fieldNo="1" />
+
    <Attribute name="Path" fieldNo="2" />   
+
    <Attribute name="LastModifiedDate" fieldNo="3" />
+
    <Attribute name="Url" fieldNo="4" />
+
    <Attribute name="Title" fieldNo="5" />
+
    <Attribute name="Extension" fieldNo="6" />
+
    <Attribute name="Size" fieldNo="7" />
+
    <Attribute name="MimeType" fieldNo="8" />
+
  </Attributes>
+
  <Attachments>
+
    <Attachment name="Content" fieldNo="0" />     
+
  </Attachments>
+
</Mapping>
+
</source>
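Because the field numbers have to agree between <tt>DataDictionary.xml</tt> and <tt>Mappings.xml</tt>, a quick cross-check can catch copy-and-paste slips. The following Python sketch compares the two for one index. It relies only on the element and attribute names visible in the listings above (<tt>Index</tt>, <tt>IndexField</tt>, <tt>Mapping</tt>, <tt>Attribute</tt>, <tt>Attachment</tt>) and ignores namespaces; the function names are illustrative.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def index_fields(data_dictionary_xml, index_name):
    """Map field name -> FieldNo for one Index element of the data dictionary."""
    fields = {}
    for index in ET.fromstring(data_dictionary_xml).iter():
        if local_name(index.tag) == "Index" and index.get("Name") == index_name:
            for element in index.iter():
                if local_name(element.tag) == "IndexField":
                    fields[element.get("Name")] = element.get("FieldNo")
    return fields


def mapped_fields(mappings_xml, index_name):
    """Map attribute/attachment name -> fieldNo for one Mapping element."""
    fields = {}
    for mapping in ET.fromstring(mappings_xml).iter():
        if local_name(mapping.tag) == "Mapping" and mapping.get("indexName") == index_name:
            for element in mapping.iter():
                if local_name(element.tag) in ("Attribute", "Attachment"):
                    fields[element.get("name")] = element.get("fieldNo")
    return fields


def mismatches(data_dictionary_xml, mappings_xml, index_name):
    """Return the names whose field number differs or is missing on one side."""
    defined = index_fields(data_dictionary_xml, index_name)
    mapped = mapped_fields(mappings_xml, index_name)
    names = set(defined) | set(mapped)
    return sorted(name for name in names if defined.get(name) != mapped.get(name))
```

Pass the contents of both files and the index name, for example <tt>mismatches(dd_text, mappings_text, "web_index")</tt>; an empty list means that every field name has the same number on both sides.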
+
 
+
=== Put it all together ===
+
We have now finished configuring SMILA to use separate workflows for the file system and web crawlers and to index the data from these crawlers into different indices.
+
Here is what we have done so far:
+
# Modified Listener rules to use different workflows for Filesystem and Web crawlers
+
# Created new BPEL workflow for Web crawler
+
# Added the webcrawler index to the Lucene configuration.
+
Now we can start SMILA again and see what happens when we start the web crawler.
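Since several XML files were edited by hand, it is worth checking that they are still well-formed before restarting. The Python sketch below parses each file touched in this tutorial and reports parse errors, such as a mismatched closing tag, as well as missing files. The relative paths are the ones named in the sections above; the function name is hypothetical, and the check does not validate the files against any schema.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Files edited in this tutorial, relative to the SMILA installation directory.
EDITED_FILES = [
    "configuration/org.eclipse.smila.connectivity.queue.worker.jms/QueueWorkerListenerConfig.xml",
    "configuration/org.eclipse.smila.processing.bpel/pipelines/addwebpipeline.bpel",
    "configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml",
    "configuration/org.eclipse.smila.search.datadictionary/DataDictionary.xml",
    "configuration/org.eclipse.smila.lucene/Mappings.xml",
]


def malformed_xml_files(smila_home, relative_paths=EDITED_FILES):
    """Return (relative path, message) pairs for files that are missing or not well-formed."""
    problems = []
    for relative in relative_paths:
        path = Path(smila_home) / relative
        try:
            ET.parse(str(path))
        except (ET.ParseError, OSError) as error:
            problems.append((relative, str(error)))
    return problems
```

Run from the SMILA directory, <tt>malformed_xml_files(".")</tt> returns an empty list when all five files parse cleanly.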
+
 
+
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
|
+
It is very important to shut down the SMILA engine and restart it afterwards, because modified configurations are loaded only on startup.
+
|}
+
 
+
Now we can also search the web_index from the browser:
+
 
+
[[Image:Web_index-search.png]]
+
 
+
== Configuration overview ==
+
 
+
SMILA configuration files are located in the <tt>configuration</tt> directory of the SMILA application.
+
The following figure shows the configuration files relevant to this tutorial in relation to the SMILA components and the data lifecycle. SMILA component names are shown in black; directories containing configuration files and file names are shown in blue.
+
 
+
[[Image:Smila-configuration-overview.jpg]]
+

Latest revision as of 06:15, 19 January 2012
