Difference between revisions of "SMILA/5 Minutes Tutorial"

m (1. Download and unpack the SMILA application.)
m (Create your Job)
(206 intermediate revisions by 13 users not shown)
Line 2: Line 2:
 
[[Category:HowTo]]
 
  
This page contains installation instructions for the SMILA application and helps you with your first steps in SMILA.
+
On this page we describe the necessary steps to install and run SMILA in order to create a search index on the [[SMILA]] Eclipsepedia pages and search them.
  
== Download and unpack the SMILA application. ==
+
If you have any troubles or the results differ from what is described here, check the [[SMILA/FAQ|FAQ]].
  
After downloading and unpacking you should have the following folder structure.
+
== Supported Platforms ==
 +
The following platforms are supported:
 +
*Linux 32 Bit
 +
*Linux 64 Bit
 +
*Mac OS X 64 Bit (Cocoa)
 +
*Windows 32 Bit
 +
*Windows 64 Bit
  
[[Image:Installation.png]]
+
== Download and start SMILA ==
  
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
[http://www.eclipse.org/smila/downloads.php Download] the SMILA package matching your [[#Supported_Platforms|operating system]] and unpack it to an arbitrary folder. This will result in the following folder structure:
|
+
If you use the Linux version of SMILA, make sure that the file SMILA has executable permissions. If not, set them by running ''chmod +x ./SMILA'' in a console.
+
|}
+
  
== 2. Start the SMILA engine. ==
+
<pre>
 +
/<SMILA>
 +
  /configuration
 +
  /features
 +
  /jmxclient
 +
  /plugins
 +
  /workspace
 +
  .eclipseproduct
 +
  ...
 +
  SMILA
 +
  SMILA.ini
 +
</pre>
  
To start the SMILA engine, double-click SMILA.exe, or open a command line, navigate to the directory that contains the extracted files, and run the SMILA executable. Wait until the engine is fully started. If everything is OK, you should see output similar to the following screenshot:
+
=== Preconditions ===
 +
To be able to start SMILA, check the following preconditions first:
  
[[Image:Smila-console.png]]
+
==== JRE ====
 +
You will have to provide a JRE executable to be able to run SMILA. The JVM version should be Java 7. You may either:
 +
* add the path of your local JRE executable to the PATH environment variable <br>or<br>
 +
* add the argument <tt>-vm <path/to/jre/executable></tt> right at the top of the file <tt>SMILA.ini</tt>. <br>Make sure that <tt>-vm</tt> is indeed the first argument in the file, that there is a line break after it and that there are no leading or trailing blanks. It should look similar to the following:
 +
<div style="margin-left: 1.5em;">
 +
<source lang="text">
 +
-vm
 +
d:/java/jre7/bin/java
 +
...
 +
</source>
 +
</div>
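The rules for the <tt>-vm</tt> entry can be checked mechanically before starting SMILA. The following Python sketch is only an illustration and not part of the SMILA distribution; the function name <tt>check_vm_argument</tt> and the wording of its messages are made up for this example. It takes the text of <tt>SMILA.ini</tt> and reports violations of the rules stated above.

```python
def check_vm_argument(ini_text):
    """Return a list of problems with the -vm entry; an empty list means none were found."""
    lines = ini_text.splitlines()
    problems = []

    # -vm must be the very first argument in the file.
    if not lines or lines[0].strip() != "-vm":
        problems.append("-vm is not the first argument in the file")
        return problems

    # No leading or trailing blanks are allowed on the -vm line.
    if lines[0] != "-vm":
        problems.append("the -vm line has leading or trailing blanks")

    # The JRE path has to follow on its own line, again without surrounding blanks.
    if len(lines) < 2 or not lines[1].strip():
        problems.append("no JRE path on the line after -vm")
    elif lines[1] != lines[1].strip():
        problems.append("the JRE path has leading or trailing blanks")

    return problems


if __name__ == "__main__":
    # Example content taken from the snippet above.
    print(check_vm_argument("-vm\nd:/java/jre7/bin/java\n"))
```

To check a real installation, read the file first, for example with <tt>open("SMILA.ini").read()</tt>, and pass the text to the function.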
  
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
==== Linux ====
|
+
When using the Linux distributable of SMILA, make sure that the files <tt>SMILA</tt> and <tt>jmxclient/run.sh</tt> have executable permissions. If not, set the permission by running the following commands in a console:
Note: To run SMILA, the JRE executable must be on your PATH environment variable. The JVM version should be at least Java 5.
+
<tt>
 +
chmod +x ./SMILA
 +
chmod +x ./jmxclient/run.sh
 +
</tt>
  
Optionally, you can configure the JVM in the SMILA.ini file instead of using the environment variable. Put the argument ''-vm <path/to/jre/executable>'' at the beginning of SMILA.ini, for example:
+
==== MacOS ====
 +
When using a Mac, switch to <tt>SMILA.app/Contents/MacOS/</tt> and set the permission by running the following command in a console:
 +
<tt>
 +
chmod a+x ./SMILA
 +
</tt>
  
-vm
+
=== Start SMILA ===
d:/java/jre1.5.0.16/bin/java
+
To start the SMILA engine, simply double-click the <tt>SMILA</tt> executable. Alternatively, open a command line, navigate to the directory where you extracted the files to, and execute the <tt>SMILA</tt> executable. Wait until the engine has been fully started.
...
+
|}
+
  
== 3. Check the log file. ==
+
SMILA has fully started when the following line is printed on the OSGI console: <tt>HTTP server started successfully on port 8080</tt> and you can access SMILA's REST API at [http://localhost:8080/smila/ http://localhost:8080/smila/].
You can check what's happening in the background by opening the SMILA log file in an editor. This file is named <tt>SMILA.log</tt> and can be found in the same directory as the SMILA executable.
+
  
[[Image:Smila-log.png]]
+
When using a Mac, navigate to <tt>SMILA.app/Contents/MacOS/</tt> in a terminal, then start SMILA with <tt>./SMILA</tt>.
  
== 4. Configure crawling jobs. ==
+
Before continuing, [[SMILA/FAQ#How_can_I_see_that_SMILA_started_correctly.3F|check the log file]] for possible errors.
Now that the SMILA engine is up and running, we can start the crawling jobs. Crawling jobs are managed over the JMX protocol, which means that you can connect to SMILA with a JMX client of your choice. We will use JConsole for that purpose, since this JMX client ships by default with the Sun Java distribution.
+
  
Start the JConsole executable in your JDK distribution (<tt><JAVA_HOME>/bin/jconsole</tt>). If the client is up and running, select the PID in the ''Connect'' window and click ''Connect''.
+
=== Stop SMILA ===
  
[[Image:Jconsole.png]]
+
To stop the SMILA engine, type <tt>close</tt> into the OSGI console and press ''Enter'':
  
Next, switch to the ''MBeans'' tab, expand the SMILA node in the ''MBeans'' tree on the left side of the window, and click the <tt>CrawlerController</tt> node. This node is used to manage and monitor all crawling activities.
+
<source lang="text">
 +
osgi> close
 +
</source>
  
[[Image:Mbeans-overview.png]]
+
For further OSGI console commands, enter <tt>help</tt>:
  
== 5. Start the file system crawler. ==
+
<source lang="text">
To start the file system crawler, open the ''Operations'' tab in the right pane, type "file" into the text field next to the <tt>startCrawl</tt> button, and click the <tt>startCrawl</tt> button.
+
osgi> help
 +
</source>
  
[[Image:Start-file-crawl.png]]
+
== Install a REST client ==
  
You should receive a message similar to the following, indicating that the crawler has been successfully started:
+
We're going to use SMILA's REST API to start and stop jobs, so you need a REST client. In [[SMILA/Documentation/Using_The_ReST_API#Interactive_Tools|REST Tools]] you find a selection of recommended browser plugins if you haven't got a suitable REST client yet.
  
[[Image:Start-crawl-file-result.png]]
+
== Start Indexing Job and Crawl Import ==
  
Now we can check the log file to see what happened:
+
Now we're going to crawl the SMILA Eclipsepedia pages and index them using the embedded [[SMILA/Documentation/Solr|Solr integration]].
  
[[Image:File-crawl-log.png]]
+
=== Start indexing job run ===
  
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
We are going to start the predefined indexing job "indexUpdate" based on the predefined asynchronous "indexUpdate" workflow. This indexing job will process the imported data.
|
+
By default, the file system crawler is configured to crawl the folder c:\data.
+
  
<Process>
+
Use your favorite REST Client to start a job run for the job "indexUpdate":
    <BaseDir>c:\data</BaseDir>
+
    ...     
+
</Process>
+
|}
+
  
== 6. Configuring the filesystem crawler. ==
+
<source lang="javascript">
 +
#Request
 +
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
 +
</source>
  
A possible error message could be that the folder c:\data is not found:
+
Your REST client will show a result like this:
  
<code>
+
<source lang="javascript">
'' 2009-01-27 18:25:59,592 [Thread-12] ERROR impl.CrawlThread - org.eclipse.smila.connectivity.framework.CrawlerCriticalException: Folder "c:\data" is not found ''
+
#Response
</code>
+
{
 +
  "jobId" : "20110901-121343613053",
 +
  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/20110901-121343613053/"
 +
}
 +
</source>
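If you prefer a script over a browser plugin, the same request can be sent with Python's standard library. This is a minimal sketch under a few assumptions: the helper names (<tt>build_start_request</tt>, <tt>extract_job_id</tt>, <tt>start_job_run</tt>) are invented for this example, the POST is sent with an empty body as in the request shown above, and the response is assumed to be UTF-8 encoded JSON of the form shown above.

```python
import json
import urllib.request

# Base URL of the job manager API as used throughout this tutorial.
JOBS_URL = "http://localhost:8080/smila/jobmanager/jobs/"


def build_start_request(job_name, jobs_url=JOBS_URL):
    """Build (but do not send) the POST request that starts a job run."""
    url = f"{jobs_url}{job_name}/"
    # An empty body is an assumption; the tutorial shows the POST without any payload.
    return urllib.request.Request(url, data=b"", method="POST")


def extract_job_id(response_body):
    """Pull the "jobId" value out of the JSON response text."""
    return json.loads(response_body)["jobId"]


def start_job_run(job_name):
    """Send the request to a running SMILA instance and return the job run id."""
    with urllib.request.urlopen(build_start_request(job_name)) as response:
        return extract_job_id(response.read().decode("utf-8"))
```

With SMILA running locally, <tt>start_job_run("indexUpdate")</tt> would return the id of the new job run, which you should keep for finishing the run later.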
  
The error message above states that the crawler tried to index folder at <tt>c:\data</tt> but was not able to find it. To solve this, let's create a folder with sample data, say <tt>c:\data</tt>, put some dummy text files into it, and configure the file system crawler to index it.
+
You will need the "jobId" later on to finish the job run. The job run Id can also be found via the monitoring API for the job:
To configure the crawler to index other directories, open the configuration file of the crawler at <tt>configuration/org.eclipse.smila.connectivity.framework/file.xml</tt>. Modify the ''BaseDir'' attribute by setting its value to an absolute path that points to your new directory. Don't forget to save the file.
+
  
{|width="100%" style="background-color:#ffcccc; padding-left:30px;"
+
<source lang="javascript">
|
+
#Request
Note: Currently only plain text and html files are crawled and indexed correctly by SMILA crawlers.
+
GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
|}
+
</source>
  
== 7. Searching on the indices. ==
+
In the <tt>SMILA.log</tt> file you will see a message like this:
To search the indices that were created by the crawlers, point your browser to <tt>http://localhost:8080/SMILA/search</tt>. There are currently two stylesheets for the search form: SMILASearchDefault and SMILASearchAdvanced. If you choose SMILASearchDefault, you can only search the content of the files (field ''Query''). In the left column below the ''Indexlist'' header you will find the names of all available indices. Currently, there should be only one index in the list (test_index).
+
<pre>
 +
INFO ... internal.JobRunEngineImpl  - started job run '20110901-121343613053' for job 'indexUpdate'
 +
</pre>
  
[[Image:Smila-search-form.png]]
+
'''Further information''': The "indexUpdate" workflow uses the [[SMILA/Documentation/Worker/PipelineProcessorWorker|PipelineProcessorWorker]] that executes the synchronous "AddPipeline" BPEL workflow. So, the synchronous "AddPipeline" BPEL workflow is embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" workflow and "indexUpdate" job definitions, see <tt>SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json</tt> and <tt>jobs.json</tt>. For more information about job management in general, please check the [[SMILA/Documentation/JobManager|JobManager documentation]].
  
Now let's try to search for a word that occurs in your dummy files. In this tutorial we assume that there was a word "data" in both text files.
+
=== Start the crawler ===
  
[[Image:Searching-for-text-in-file.png]]
+
Now that the indexing job is running we need to push some data to it. There is a predefined job for indexing the SMILA Eclipsepedia pages which we are going to start right now.  For more information about crawl jobs please see [[SMILA/Documentation/Importing/Concept|Importing Concept]]. For more information on jobs and tasks in general visit the [[SMILA/Documentation/JobManager|JobManager manual]].
  
There was a file named <tt>glossary.html</tt> in the sample folder. Let's check whether it was indexed. Switch to SMILASearchAdvanced and type "glossary.html" in the ''Filename'' field and click on Submit button again:
+
To start the job run, POST the following JSON fragment with your REST client to SMILA:
 +
<source lang="javascript">
 +
#Request
 +
POST http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki/
 +
</source>
  
[[Image:Searching-by-filename.png]]
+
This starts the job <tt>crawlSmilaWiki</tt>, which crawls the [[SMILA|SMILA Eclipsepedia]] starting with <tt>http://wiki.eclipse.org/SMILA</tt> and (by applying the configured filters) following only links that have the same prefix. All pages crawled matching this prefix will be pushed to the import job.
  
== 8. Configure and run the web crawler. ==
+
If you like, you can monitor both job runs with your REST client at the following URIs:
Now that we know how to start and configure the file system crawler and how to search the indices, configuring and running the web crawler is straightforward:
+
* Crawl job: [http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki]
The configuration file of the web crawler is located in the <tt>configuration/org.eclipse.smila.connectivity.framework</tt> directory and is named <tt>web.xml</tt>:
+
* Import job: [http://localhost:8080/smila/jobmanager/jobs/indexUpdate http://localhost:8080/smila/jobmanager/jobs/indexUpdate]
 +
Or both in one overview at
 +
* [http://localhost:8080/smila/jobmanager/jobs/ http://localhost:8080/smila/jobmanager/jobs/]
  
[[Image:Webcrawler-config.png]]
+
Crawling the wiki pages takes some time. Once all pages have been processed, the status of the [http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki crawlSmilaWiki] job run will change to {{code|SUCCEEDED}}. You can have a look at SMILA's search page to find out whether some of the pages have already made their way into the Solr index.
  
By default the web crawler is configured to index the URL ''http://wiki.eclipse.org/SMILA''. To change this, open the file in an editor of your choice and set the content of the <tt>&lt;Seed&gt;</tt> element to the desired web site. Detailed information on the configuration of the web crawler is also available at the [[SMILA/Documentation/Web_Crawler|Web crawler]] configuration page.
+
'''Further information:''' You can find details about the relevant [[SMILA/Manual#Importing|Import concepts here]].
  
To start the crawling process, save the configuration file, open the ''Operations'' tab in JConsole again, type "web" into the text field next to the <tt>startCrawl</tt> button, and click the button.
+
== Search the index ==
  
[[Image:Starting-web-crawler.png]]
+
{{note|Since SMILA uses [[SMILA/Documentation/Solr#solrconfig.xml|Solr's autocommit feature]] (which is configured in <tt>solrconfig.xml</tt> to a period of 30 seconds or 1000 documents, whichever comes first) it might take some time until you retrieve results.}}
  
 +
To search the index which was created by the crawlers, point your browser to <tt>http://localhost:8080/SMILA/search</tt>. There are currently two stylesheets from which you can select by clicking the respective links in the upper left corner of the header bar: The ''Default'' stylesheet shows a reduced search form with text fields like ''Query'', ''Result Size'', and ''Index'', adequate to query the full-text content of the indexed documents. The ''Advanced'' stylesheet in turn provides a more detailed search form with text fields for meta-data search like for example ''Path'', ''MimeType'', ''Filename'', and other document attributes.
  
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
'''To use the ''Default'' Stylesheet''':
|
+
#Point your browser to <tt>http://localhost:8080/SMILA/search</tt>.
Note: The ''Operations'' tab in JConsole also provides buttons to stop a crawler, get the list of active crawlers and the current status of a particular crawling job.
+
#Enter a word that you expect to be contained in your dummy files into the ''Query'' text field.
The default limit for downloaded documents is set to 1000 in the web crawler configuration, so it can take a while for the web crawling job to finish. You can stop the crawling job manually by typing "web" next to the <tt>stopCrawl</tt> button and then clicking this button.
+
# Click ''OK'' to send your query to SMILA.  
As an example the following screenshot shows the result after the <tt>getActiveCrawlsStatus</tt> button has been clicked while the web crawler is running:
+
|}
+
[[Image:One-active-crawl-found.png]]
+
  
When the web crawler's job is finished, you can search on the generated index just like described above with the file system crawler (see step 7).
+
'''To use the ''Advanced'' Stylesheet''':
 +
#Point your browser to <tt>http://localhost:8080/SMILA/search</tt>.
 +
#Click ''Advanced'' to switch to the detailed search form.
 +
#For example, to find a file by its name, enter the file name into the ''Filename'' text field, then click ''OK'' to submit your search.
  
 +
== Stop indexing job run ==
  
[[Category:SMILA]]
+
Although there's no need for it, we can finish our previously started indexing job run via REST client now:
 +
(replace <job-id> with the job-id you got before when [[#Start_indexing_job_run|you started the job run]]).
  
== 9. Managing CrawlerController with jmxclient. ==
+
<source lang="javascript">
 +
#Request
 +
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish 
 +
</source>
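The finish request can be scripted in the same way as the start request. The sketch below uses only Python's standard library; the function names are invented for this example, and sending the POST with an empty body is an assumption based on the request shown above. The job run id is the value you noted when you started the job run.

```python
import urllib.request

# Base URL of the job manager API as used throughout this tutorial.
JOBS_URL = "http://localhost:8080/smila/jobmanager/jobs/"


def monitor_url(job_name, job_id):
    """URL for monitoring one job run."""
    return f"{JOBS_URL}{job_name}/{job_id}"


def finish_url(job_name, job_id):
    """URL that finishes one job run when it receives a POST."""
    return f"{monitor_url(job_name, job_id)}/finish"


def finish_job_run(job_name, job_id):
    """Send the finish request to a running SMILA instance and return the HTTP status code."""
    request = urllib.request.Request(finish_url(job_name, job_id), data=b"", method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, <tt>finish_job_run("indexUpdate", "20110901-121343613053")</tt> would finish the run with that id, provided SMILA is running locally and the run exists.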
  
In addition to managing crawling jobs with JConsole, you can also use the jmxclient from the SMILA distribution. Jmxclient is a console application that lets you manage crawl jobs and create scripts for batch crawler execution. For more information, see the [[SMILA/Documentation/Management#JMX_Client|jmxclient documentation]]. The jmxclient application is located in the <tt>jmxclient</tt> directory. Use the appropriate run script (<tt>run.bat</tt> or <tt>run.sh</tt>) to start the application.
+
You can monitor the job run via your browser to see that it has finished successfully:
For example, to start the file system crawler, use the following command:
+
<source lang="javascript">
 
+
#Request
<code>
+
GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>
''run crawl file
+
</code>''
+
 
+
== 5 Minutes for Changing Workflow  ==
+
In previous sections all data collected by crawlers was processed with the same workflow and indexed into the same index, test_index.
+
It's possible to configure SMILA so that data from different data sources will go through different workflows and will be indexed into different  indices. This will require more advanced configuration than before but still is quite simple.
+
Let's create an additional workflow for web crawler records so that web crawler data will be indexed into a separate index, say web_index.
+
 
+
=== 1. Modify Listener rules. ===
+
 
+
First, let's modify the default ADD rule in the Listener and add another rule that causes web crawler records to be processed by a separate BPEL workflow.
+
For more information about Listener, please see the section [[SMILA/Documentation/QueueWorker/Listener|Listener]] of the [[SMILA/Documentation/QueueWorker|QueueWorker]] documentation.
+
Listener configuration is placed at the
+
<tt>configuration/org.eclipse.smila.connectivity.queue.worker/QueueWorkerListenerConfig.xml</tt>
+
Open that file and edit the <tt><Condition></tt> tag of the Default ADD Rule. The result should be as follows:
+
<source lang="xml">
+
<Rule Name="ADD Rule" WaitMessageTimeout="10" Threads="2">
+
  <Source BrokerId="broker1" Queue="SMILA.connectivity"/>
+
  <Condition>Operation='ADD' and NOT(DataSourceID LIKE 'web%')</Condition>
+
  <Task>
+
    <Process Workflow="AddPipeline"/>
+
  </Task>
+
</Rule>
+
 
</source>
 
</source>
Now add the following new rule to this file:
 
<source lang="xml">
 
<Rule Name="Web ADD Rule" WaitMessageTimeout="10" Threads="2">
 
  <Source BrokerId="broker1" Queue="SMILA.connectivity"/>
 
  <Condition>Operation='ADD' and DataSourceID LIKE 'web%'</Condition>
 
  <Task>
 
    <Process Workflow="AddWebPipeline"/>
 
  </Task>
 
</Rule>
 
</source>
 
Notice that we modified the condition in the ADD Rule to skip web crawler data. Web crawler data will be processed by the new Web ADD Rule.
 
The Web ADD Rule defines that web crawler data will be processed by the AddWebPipeline workflow, so next we need to create the AddWebPipeline workflow.
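To see why the two conditions never match the same record, it can help to model them in a few lines of Python. This is only an illustration of the selector logic and not how the Listener evaluates its rules; the function names are made up, and <tt>matches_like</tt> handles just the trailing <tt>%</tt> wildcard used in the rules above.

```python
def matches_like(value, pattern):
    """Simplified LIKE: supports only a single trailing % wildcard."""
    if pattern.endswith("%"):
        return value.startswith(pattern[:-1])
    return value == pattern


def select_workflow(operation, data_source_id):
    """Return the workflow the two rules above would route a record to, or None."""
    is_web = matches_like(data_source_id, "web%")
    # ADD Rule: Operation='ADD' and NOT(DataSourceID LIKE 'web%')
    if operation == "ADD" and not is_web:
        return "AddPipeline"
    # Web ADD Rule: Operation='ADD' and DataSourceID LIKE 'web%'
    if operation == "ADD" and is_web:
        return "AddWebPipeline"
    # Any other operation is outside the scope of these two rules.
    return None


if __name__ == "__main__":
    for source in ("file", "web"):
        print(source, "->", select_workflow("ADD", source))
```

Because the second condition is the exact negation of the <tt>LIKE</tt> part of the first, every ADD record is picked up by exactly one of the two rules.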
 
  
=== 2. Create workflow for the BPEL WorkflowProcessor ===
+
In the <tt>SMILA.log</tt> file you will see messages like this:
We need to add the AddWebPipeline workflow to BPEL WorkflowProcessor. For more information about BPEL WorkflowProcessor please check the [[SMILA/Documentation/BPEL_Workflow_Processor|BPEL WorkflowProcessor]] documentation.
+
<pre>
BPEL WorkflowProcessor configuration files are placed at the <tt>configuration/org.eclipse.smila.processing.bpel/pipelines</tt> directory.
+
INFO ... internal.JobRunEngineImpl  - finish called for job 'indexUpdate', run '20110901-141457584011'
There is a file <tt>addpipeline.bpel</tt> that defines the AddPipeline process. Let's create the <tt>addwebpipeline.bpel</tt> file that will define the AddWebPipeline process and put the following code into it:
+
...
<source lang="xml">
+
INFO ... internal.JobRunEngineImpl  - Completing job run '20110901-141457584011' for job 'indexUpdate' with final state SUCCEEDED
<?xml version="1.0" encoding="utf-8" ?>
+
</pre>
<process name="AddWebPipeline" targetNamespace="http://www.eclipse.org/smila/processor"
+
    xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable"
+
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
+
    xmlns:proc="http://www.eclipse.org/smila/processor"
+
    xmlns:rec="http://www.eclipse.org/smila/record">
+
  
  <import location="processor.wsdl" namespace="http://www.eclipse.org/smila/processor"
+
Congratulations, you've just crawled the SMILA Eclipsepedia, indexed the pages and searched through them. For more, just visit [[SMILA/Manual|SMILA Manual]].
      importType="http://schemas.xmlsoap.org/wsdl/" />
+
  
  <partnerLinks>
+
== Further steps ==
    <partnerLink name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" myRole="service" />
+
  </partnerLinks>
+
  
  <extensions>
+
=== Crawl the filesystem ===
    <extension namespace="http://www.eclipse.org/smila/processor" mustUnderstand="no" />
+
  </extensions>
+
  
  <variables>
+
SMILA also has a predefined job to crawl the file system ("crawlFilesystem"), but you will have to either adapt the predefined job to point it to a valid folder in your filesystem or create your own job.
    <variable name="request" messageType="proc:ProcessorMessage" />
+
  </variables>
+
  
  <sequence>
+
We will settle for the second option, because it does not require stopping and restarting SMILA.
    <receive name="start" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process"
+
        variable="request" createInstance="yes" />
+
  
    <!-- only process text based content, skip everything else -->
+
==== Create your Job ====
    <if name="conditionIsText">
+
POST the following job description to [[SMILA/Documentation/JobDefinitions#List.2C_create.2C_modify_jobs|SMILA's Job API]] at <tt>http://localhost:8080/smila/jobmanager/jobs</tt>. Adapt the <tt>rootFolder</tt> parameter to point to an existing folder on your machine where you have placed some files (e.g. plain text, office docs or HTML files). If your path includes backslashes, escape them with an additional backslash, e.g. <tt>c:\\data\\files</tt>.
      <condition>
+
<source lang="javascript">
        contains($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V, "text/")
+
#Request
      </condition>
+
POST http://localhost:8080/smila/jobmanager/jobs/
      <sequence name="processTextBasedContent">  
+
{
+
  "name":"crawlFilesAtData",
        <!-- extract txt from html files -->
+
  "workflow":"fileCrawling",
        <if name="conditionIsHtml">
+
  "parameters":{
          <condition>
+
    "tempStore":"temp",
          ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V[contains(., "text/html")])
+
    "dataSource":"file",
          or
+
    "rootFolder":"/data",
          ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V[contains(., "text/xml")])
+
    "jobToPushTo":"indexUpdate",
          </condition>
+
    "mapping":{
        </if>
+
      "fileContent":"Content",
+
      "filePath":"Path",     
        <extensionActivity name="invokeHtml2Txt">
+
       "fileName":"Filename",     
          <proc:invokePipelet>
+
      "fileExtension":"Extension",
            <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
+
      "fileLastModified":"LastModifiedDate"
            <proc:variables input="request" output="request" />
+
     }
            <proc:PipeletConfiguration>
+
   }
              <proc:Property name="inputType">
+
}
                <proc:Value>ATTACHMENT</proc:Value>
+
              </proc:Property>      
+
              <proc:Property name="outputType">
+
                <proc:Value>ATTACHMENT</proc:Value>
+
              </proc:Property>
+
              <proc:Property name="inputName">
+
                <proc:Value>Content</proc:Value>
+
              </proc:Property>
+
              <proc:Property name="outputName">
+
                <proc:Value>Content</proc:Value>
+
              </proc:Property>
+
              <proc:Property name="meta:title">
+
                <proc:Value>Title</proc:Value>
+
              </proc:Property>
+
            </proc:PipeletConfiguration>     
+
          </proc:invokePipelet>
+
        </extensionActivity>
+
 
+
        <extensionActivity name="invokeLuceneService">
+
          <proc:invokeService>
+
            <proc:service name="LuceneIndexService" />
+
            <proc:variables input="request" output="request" />
+
            <proc:setAnnotations>
+
              <rec:An n="org.eclipse.smila.lucene.LuceneIndexService">
+
                <rec:V n="indexName">web_index</rec:V>
+
                <rec:V n="executionMode">ADD</rec:V>
+
              </rec:An>
+
            </proc:setAnnotations>
+
          </proc:invokeService>
+
        </extensionActivity>
+
+
       </sequence>
+
    </if>
+
 
+
    <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType"  
+
operation="process" variable="request" />
+
     <exit />
+
   </sequence>
+
</process>
+
 
</source>
 
</source>
  
Note that we use "web_index" index name for the LuceneService in the code above:
+
''Hint: Not all file formats are supported by SMILA out-of-the-box. Have a look [[SMILA/Documentation/TikaPipelet#Supported_document_types | here]] for details.''
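The backslash escaping mentioned for the <tt>rootFolder</tt> parameter is easy to get wrong when typing JSON by hand. One way to avoid this is to let a JSON library produce the request body. The following Python sketch rebuilds the <tt>crawlFilesAtData</tt> job description from this section; the function name <tt>build_file_crawl_job</tt> is invented for this example, and the Windows path is only a placeholder.

```python
import json


def build_file_crawl_job(name, root_folder):
    """Return the job description from this section as a Python dictionary."""
    return {
        "name": name,
        "workflow": "fileCrawling",
        "parameters": {
            "tempStore": "temp",
            "dataSource": "file",
            "rootFolder": root_folder,
            "jobToPushTo": "indexUpdate",
            "mapping": {
                "fileContent": "Content",
                "filePath": "Path",
                "fileName": "Filename",
                "fileExtension": "Extension",
                "fileLastModified": "LastModifiedDate",
            },
        },
    }


# A raw string keeps the Windows path readable; json.dumps adds the
# extra backslashes that JSON requires.
body = json.dumps(build_file_crawl_job("crawlFilesAtData", r"c:\data\files"), indent=2)

if __name__ == "__main__":
    print(body)
```

The resulting text can be pasted into your REST client as the body of the POST request to <tt>http://localhost:8080/smila/jobmanager/jobs/</tt>.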
<source lang="xml">
+
<rec:An n="org.eclipse.smila.lucene.LuceneIndexService">
+
  <rec:V n="indexName">web_index</rec:V>
+
  <rec:V n="executionMode">ADD</rec:V>
+
</rec:An>
+
</source>
+
  
We need to add our pipeline description to the <tt>deploy.xml</tt> file placed in the same directory. Add the following code to the end of <tt>deploy.xml</tt> before the closing <tt></deploy></tt> tag:
+
==== Start your jobs ====
<source lang="xml">
+
<process name="proc:AddWebPipeline">
+
  <in-memory>true</in-memory>
+
  <provide partnerLink="Pipeline">
+
    <service name="proc:AddWebPipeline" port="ProcessorPort" />
+
  </provide>   
+
</process>
+
</source>
+
  
Now we need to add our web_index to LuceneIndexService configuration.
+
*Start the <tt>indexUpdate</tt> job (see [[#Start_indexing_job_run|Start indexing job run]]) if you have already stopped it. If it is still running, that's fine:
 
+
<div style="margin-left: 1.5em;">
=== 3. LuceneIndexService configuration ===
+
<source lang="javascript">
For more information about LuceneIndexService, please see [[SMILA/Documentation/LuceneIndexService|LuceneIndexService]]
+
#Request
 
+
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
Let's configure our web_index index structure and search template. Add the following code to the end of <tt>configuration/org.eclipse.smila.search.datadictionary/DataDictionary.xml</tt> file before the closing <tt></AnyFinderDataDictionary></tt> tag:
+
<source lang="xml">
+
<Index Name="web_index">
+
  <Connection xmlns="http://www.anyfinder.de/DataDictionary/Connection" MaxConnections="5"/>
+
  <IndexStructure xmlns="http://www.anyfinder.de/IndexStructure" Name="web_index">
+
    <Analyzer ClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
+
    <IndexField FieldNo="8" IndexValue="true" Name="MimeType" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="7" IndexValue="true" Name="Size" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="6" IndexValue="true" Name="Extension" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="5" IndexValue="true" Name="Title" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="4" IndexValue="true" Name="Url" StoreText="true" Tokenize="false" Type="Text">
+
      <Analyzer ClassName="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
+
    </IndexField>
+
    <IndexField FieldNo="3" IndexValue="true" Name="LastModifiedDate" StoreText="true" Tokenize="false" Type="Text"/>
+
    <IndexField FieldNo="2" IndexValue="true" Name="Path" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="1" IndexValue="true" Name="Filename" StoreText="true" Tokenize="true" Type="Text"/>
+
    <IndexField FieldNo="0" IndexValue="true" Name="Content" StoreText="true" Tokenize="true" Type="Text"/>
+
  </IndexStructure>
+
  <Result>
+
    <Field FieldNo="0" Name="ID"/>
+
  </Result>
+
  <Configuration xmlns="http://www.anyfinder.de/DataDictionary/Configuration"
+
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+
xsi:schemaLocation="http://www.anyfinder.de/DataDictionary/Configuration ../xml/DataDictionaryConfiguration.xsd">
+
    <DefaultConfig>
+
      <Field FieldNo="8">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="7">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="6">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>       
+
      <Field FieldNo="5">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="4">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="3">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="2">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="1">
+
        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
      <Field FieldNo="0">
+
        <FieldConfig Constraint="required" Weight="1" xsi:type="FTText">
+
          <NodeTransformer xmlns="http://www.anyfinder.de/Search/ParameterObjects"
+
Name="urn:ExtendedNodeTransformer">
+
            <ParameterSet xmlns="http://www.brox.de/ParameterSet"/>
+
          </NodeTransformer>
+
          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="AND" Tolerance="exact"/>
+
        </FieldConfig>
+
      </Field>
+
    </DefaultConfig>
+
    <Result Name="">
+
      <ResultField FieldNo="8" Name="MimeType"/>
+
      <ResultField FieldNo="7" Name="Size"/>
+
      <ResultField FieldNo="6" Name="Extension"/>
+
      <ResultField FieldNo="5" Name="Title"/>
+
      <ResultField FieldNo="4" Name="Url"/>
+
      <ResultField FieldNo="3" Name="LastModifiedDate"/>
+
      <ResultField FieldNo="2" Name="Path"/>
+
      <ResultField FieldNo="1" Name="Filename"/>
+
    </Result>
+
    <HighlightingResult Name="">
+
      <HighlightingResultField FieldNo="0" Name="Content" xsi:type="HLTextField">
+
        <HighlightingTransformer Name="urn:Sentence">
+
          <ParameterSet xmlns="http://www.brox.de/ParameterSet">
+
            <Parameter Name="MaxLength" xsi:type="Integer">
+
              <Value>300</Value>
+
            </Parameter>
+
            <Parameter Name="MaxHLElements" xsi:type="Integer">
+
              <Value>999</Value>
+
            </Parameter>
+
            <Parameter Name="MaxSucceedingCharacters" xsi:type="Integer">
+
              <Value>30</Value>
+
            </Parameter>
+
            <Parameter Name="SucceedingCharacters" xsi:type="String">
+
              <Value>...</Value>
+
            </Parameter>
+
            <Parameter Name="SortAlgorithm" xsi:type="String">
+
              <Value>Occurrence</Value>
+
            </Parameter>
+
            <Parameter Name="TextHandling" xsi:type="String">
+
              <Value>ReturnSnipplet</Value>
+
            </Parameter>
+
          </ParameterSet>
+
        </HighlightingTransformer>
+
        <HighlightingParameter xmlns="http://www.anyfinder.de/DataDictionary/Configuration/TextHighlighting"/>
+
      </HighlightingResultField>
+
    </HighlightingResult>
+
  </Configuration>
+
</Index>
+
 
</source>
 
</source>
Now we need to add mapping of attribute and attachment names to Lucene "FieldNo" defined in <tt>DataDictionary.xml</tt>. Open <tt>configuration/org.eclipse.smila.lucene/Mappings.xml</tt> file and add the following code to the end of file before closing <tt></Mappings></tt> tag:
+
</div>
<source lang="xml">
+
*Start your <tt>crawlFilesAtData</tt> job similar to [[#Start_the_crawler|Start the crawler]] but now use the job name <tt>crawlFilesAtData</tt> instead of <tt>crawlSmilaWiki</tt>. This new job behaves just like the web crawling job, but its run time might be shorter, depending on how much data actually is at your {{code|rootFolder}}.
<Mapping indexName="web_index">
+
<div style="margin-left: 1.5em;">
  <Attributes>
+
<source lang="javascript">
    <Attribute name="Filename" fieldNo="1" />
+
#Request
    <Attribute name="Path" fieldNo="2" />  
+
POST http://localhost:8080/smila/jobmanager/jobs/crawlFilesAtData/
  <Attribute name="LastModifiedDate" fieldNo="3" />
+
  <Attribute name="Url" fieldNo="4" />
+
  <Attribute name="Title" fieldNo="5" />   
+
  <Attribute name="Extension" fieldNo="6" />
+
  <Attribute name="Size" fieldNo="7" />
+
  <Attribute name="MimeType" fieldNo="8" />         
+
  </Attributes>
+
  <Attachments>
+
    <Attachment name="Content" fieldNo="0" />     
+
  </Attachments>
+
</Mapping>
+
 
</source>
 
</source>
 +
</div>
  
=== 4. Put it  all together ===
+
==== Search for your new data ====
Ok, now it seems that we finally finished configuring SMILA for using separate workflows for Filesystem and Web crawlers and index data from these crawlers into different indices.
+
#After the job run's finished, wait a bit, then check whether the data has been indexed (see [[#Search_the_index|Search the index]] for help).
Here is what we have done so far:
+
#It is also a good idea to check the log file for errors.
# Modified Listener rules to use different workflows for Filesystem and Web crawlers
+
# Created new BPEL workflow for Web crawler
+
# Added the web crawler index to the Lucene configuration.
+
Now we can start SMILA again and see what happens when we start the Web crawler:
+
 
+
It's very important to shutdown SMILA engine and restart afterwards because modified configurations will load only on startup.
+
 
+
 
+
[[Image:Web_index.png]]
+
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
+
|
+
To get output like on screenshot above you need to add line <tt>"log4j.logger.org.eclipse.smila.xmlstorage.internal.impl=DEBUG, file"</tt>  to the <tt>log4j.properties</tt> file from the root of the SMILA directory.
+
 
+
It will enable logging of records that are committed to xml storage.
+
|}
+
 
+
 
+
 
+
Web crawler records now contain web_index attribute as expected. Now we can also search on the web_index from browser:
+
 
+
 
+
[[Image:Web_index-search.png]]
+
 
+
== Configuration overview ==
+
  
SMILA configuration files are placed into <tt>configuration</tt> directory of the SMILA application.
+
=== 5 more minutes to change the workflow ===
Following figure shows configuration files relevant to this tutorial, regarding SMILA components and data lifecycle. SMILA components names are black-colored, directories containing configuration files and filenames are blue-colored.
+
  
[[Image:Smila-configuration-overview.jpg]]
+
The [[SMILA/Documentation/5 more minutes to change the workflow|5 more minutes to change the workflow]] show how you can configure the system so that data from different data sources will go through different workflows and pipelines and will be indexed into different indices.
 +
(see [[#Start_indexing_job_run|Start indexing job run]]), if you have already stopped it. If it is still running, that's fine:

Revision as of 03:47, 9 April 2013


On this page we describe the necessary steps to install and run SMILA in order to create a search index on the SMILA Eclipsepedia pages and search them.

If you have any troubles or the results differ from what is described here, check the FAQ.

Supported Platforms

The following platforms are supported:

  • Linux 32 Bit
  • Linux 64 Bit
  • Mac OS X 64 Bit (Cocoa)
  • Windows 32 Bit
  • Windows 64 Bit

Download and start SMILA

Download the SMILA package matching your operating system and unpack it to an arbitrary folder. This will result in the following folder structure:

/<SMILA>
  /configuration
  /features
  /jmxclient
  /plugins
  /workspace
  .eclipseproduct
  ...
  SMILA
  SMILA.ini

Preconditions

To be able to start SMILA, check the following preconditions first:

JRE

You will have to provide a JRE executable to be able to run SMILA. The JVM version should be Java 7. You may either:

  • add the path of your local JRE executable to the PATH environment variable
    or
  • add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini.
    Make sure that -vm is indeed the first argument in the file, that there is a line break after it and that there are no leading or trailing blanks. It should look similar to the following:
-vm
d:/java/jre7/bin/java
...
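
The rules for the -vm entry can be checked mechanically. The following is a minimal sketch, not part of SMILA: the function name vm_argument_problems and its messages are made up for illustration, and it only tests the conditions listed above (first line, line break, no surrounding blanks).

```python
def vm_argument_problems(ini_text):
    """Return a list of problems found with the -vm entry of a SMILA.ini text.

    Hypothetical helper: it checks only the rules stated in this tutorial.
    """
    lines = ini_text.splitlines()
    # -vm must be the very first argument and stand alone on its line.
    if not lines or lines[0].strip() != "-vm":
        return ["first line is not -vm"]
    problems = []
    if lines[0] != "-vm":
        problems.append("-vm line has leading or trailing blanks")
    # The JRE path has to follow on the next line, again without blanks.
    if len(lines) < 2 or not lines[1].strip():
        problems.append("no JRE path on the line after -vm")
    elif lines[1] != lines[1].strip():
        problems.append("JRE path has leading or trailing blanks")
    return problems


if __name__ == "__main__":
    sample = "-vm\nd:/java/jre7/bin/java\n"
    print(vm_argument_problems(sample))  # prints []
```

To check a real installation, read the file first, for example with open("SMILA.ini").read(), and pass the text to the function.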

Linux

When using the Linux distributable of SMILA, make sure that the files SMILA and jmxclient/run.sh have executable permissions. If not, set the permission by running the following commands in a console:

chmod +x ./SMILA
chmod +x ./jmxclient/run.sh

MacOS

On a Mac, change to SMILA.app/Contents/MacOS/ and set the permission by running the following command in a console:

chmod a+x ./SMILA

Start SMILA

To start the SMILA engine, simply double-click the SMILA executable. Alternatively, open a command line, navigate to the directory where you extracted the files to, and execute the SMILA executable. Wait until the engine has been fully started.

SMILA has fully started when the following line is printed on the OSGI console: HTTP server started successfully on port 8080. You can then access SMILA's REST API at http://localhost:8080/smila/.
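
If you start SMILA from a script, you can poll the REST API instead of watching the console. This is only a sketch under assumptions: the helper names is_reachable and wait_until_started are hypothetical, and it treats any 2xx answer from http://localhost:8080/smila/ as "started".

```python
import time
import urllib.error
import urllib.request


def is_reachable(url, timeout=2.0):
    """Return True if an HTTP GET on url is answered with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_started(url, attempts=30, delay=1.0,
                       probe=is_reachable, sleep=time.sleep):
    """Probe url up to `attempts` times, pausing `delay` seconds in between."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False

# Usage (needs a running SMILA):
#   wait_until_started("http://localhost:8080/smila/")
```

The probe and sleep parameters exist only so that the loop can be tried out without a running server.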

On a Mac, navigate to SMILA.app/Contents/MacOS/ in a terminal, then start SMILA with ./SMILA

Before continuing, check the log file for possible errors.

Stop SMILA

To stop the SMILA engine, type close into the OSGI console and press Enter:

osgi> close

For further OSGI console commands, enter help:

osgi> help

Install a REST client

We're going to use SMILA's REST API to start and stop jobs, so you need a REST client. If you haven't got a suitable REST client yet, REST Tools lists a selection of recommended browser plugins.

Start Indexing Job and Crawl Import

Now we're going to crawl the SMILA Eclipsepedia pages and index them using the embedded Solr integration.

Start indexing job run

We are going to start the predefined indexing job "indexUpdate" based on the predefined asynchronous "indexUpdate" workflow. This indexing job will process the imported data.

Use your favorite REST Client to start a job run for the job "indexUpdate":

#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/

Your REST client will show a result like this:

#Response
{
  "jobId" : "20110901-121343613053",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/20110901-121343613053/"
}

You will need the "jobId" later on to finish the job run. The job run Id can also be found via the monitoring API for the job:

#Request
GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
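
If you prefer a script to a browser plugin, the start request and the extraction of the "jobId" might look like the following sketch. It uses only the Python standard library; the function names are hypothetical, and the response layout is taken from the example response shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/smila/jobmanager/jobs/"


def start_request(job_name, base_url=BASE_URL):
    """Build the body-less POST request that starts a run of job_name."""
    return urllib.request.Request(base_url + job_name + "/", method="POST")


def job_id_from(response_text):
    """Extract the job run id from the JSON answer of a start request."""
    return json.loads(response_text)["jobId"]


def start_job_run(job_name):
    """Send the start request to a running SMILA and return the job run id."""
    with urllib.request.urlopen(start_request(job_name)) as response:
        return job_id_from(response.read().decode("utf-8"))

# Usage (needs a running SMILA):
#   job_id = start_job_run("indexUpdate")
```

Keep the returned id; it is the value you need later to finish the job run.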

In the SMILA.log file you will see a message like this:

INFO ... internal.JobRunEngineImpl   - started job run '20110901-121343613053' for job 'indexUpdate'

Further information: The "indexUpdate" workflow uses the PipelineProcessorWorker that executes the synchronous "AddPipeline" BPEL workflow. So, the synchronous "AddPipeline" BPEL workflow is embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" workflow and "indexUpdate" job definitions, see workflows.json and jobs.json in SMILA/configuration/org.eclipse.smila.jobmanager/. For more information about job management in general, please check the JobManager documentation.

Start the crawler

Now that the indexing job is running we need to push some data to it. There is a predefined job for indexing the SMILA Eclipsepedia pages which we are going to start right now. For more information about crawl jobs please see Importing Concept. For more information on jobs and tasks in general visit the JobManager manual.

To start the job run, send the following POST request with your REST client to SMILA:

#Request
POST http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki/

This starts the job crawlSmilaWiki, which crawls the SMILA Eclipsepedia starting with http://wiki.eclipse.org/SMILA and (by applying the configured filters) following only links that have the same prefix. All pages crawled matching this prefix will be pushed to the import job.

If you like, you can monitor both job runs with your REST client at the following URIs:

Or both in one overview at

Crawling the wiki pages will take some time. Once all pages have been processed, the status of the crawlSmilaWiki job run changes to SUCCEEDED. You can have a look at SMILA's search page to find out if some of the pages have already made their way into the Solr index.

Further information: You can find details about the relevant Import concepts here.

Search the index

Since SMILA uses Solr's autocommit feature (which is configured in solrconfig.xml to a period of 30 seconds or 1000 documents, whichever comes first) it might take some time until you retrieve results.


To search the index which was created by the crawlers, point your browser to http://localhost:8080/SMILA/search. There are currently two stylesheets, which you can select by clicking the respective links in the upper left corner of the header bar. The Default stylesheet shows a reduced search form with text fields such as Query, Result Size, and Index, which is adequate for querying the full-text content of the indexed documents. The Advanced stylesheet provides a more detailed search form with text fields for metadata search, for example Path, MimeType, Filename, and other document attributes.

To use the Default Stylesheet:

  1. Point your browser to http://localhost:8080/SMILA/search.
  2. Enter a word that you expect to be contained in the indexed documents into the Query text field.
  3. Click OK to send your query to SMILA.

To use the Advanced Stylesheet:

  1. Point your browser to http://localhost:8080/SMILA/search.
  2. Click Advanced to switch to the detailed search form.
  3. For example, to find a file by its name, enter the file name into the Filename text field, then click OK to submit your search.

Stop indexing job run

Although there's no need for it, we can now finish the indexing job run we started earlier via the REST client (replace <job-id> with the job run ID you got when you started the job run):

#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish

You can monitor the job run via your browser to see that it has finished successfully:

#Request
GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>
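
The two URIs above differ only in the job name, the job run id and the /finish suffix, so they can be built from the same parts. A small sketch with hypothetical helper names, assuming the default host and port used throughout this tutorial:

```python
import urllib.request

BASE_URL = "http://localhost:8080/smila/jobmanager/jobs/"


def job_run_url(job_name, job_id, base_url=BASE_URL):
    """Monitoring URL of a single job run (use it with GET)."""
    return f"{base_url}{job_name}/{job_id}"


def finish_request(job_name, job_id, base_url=BASE_URL):
    """Build the body-less POST request that finishes the given job run."""
    url = job_run_url(job_name, job_id, base_url) + "/finish"
    return urllib.request.Request(url, method="POST")

# Usage (needs a running SMILA and a real job run id):
#   urllib.request.urlopen(finish_request("indexUpdate", job_id)).close()
```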

In the SMILA.log file you will see messages like this:

 INFO ... internal.JobRunEngineImpl   - finish called for job 'indexUpdate', run '20110901-141457584011'
 ...
 INFO ... internal.JobRunEngineImpl   - Completing job run '20110901-141457584011' for job 'indexUpdate' with final state SUCCEEDED

Congratulations, you've just crawled the SMILA Eclipsepedia, indexed the pages and searched through them. For more, just visit SMILA Manual.

Further steps

Crawl the filesystem

SMILA also has a predefined job to crawl the file system ("crawlFilesystem"), but you will have to either adapt the predefined job to point it to a valid folder in your filesystem or create your own job.

We will settle for the second option, because it does not require you to stop and restart SMILA.

Create your Job

POST the following job description to SMILA's Job API at http://localhost:8080/smila/jobmanager/jobs. Adapt the rootFolder parameter to point to an existing folder on your machine where you have placed some files (e.g. plain text, office docs or HTML files). If your path includes backslashes, escape them with an additional backslash, e.g. c:\\data\\files.

#Request
POST http://localhost:8080/smila/jobmanager/jobs/
{
  "name":"crawlFilesAtData",
  "workflow":"fileCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"file",
    "rootFolder":"/data",
    "jobToPushTo":"indexUpdate",
    "mapping":{
      "fileContent":"Content",
      "filePath":"Path",       
      "fileName":"Filename",       
      "fileExtension":"Extension",
      "fileLastModified":"LastModifiedDate"
    }
  }
}
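
When the job description is generated by a script, the backslash escaping mentioned above does not have to be done by hand, because a JSON serializer takes care of it. The sketch below rebuilds the job description from this section with Python's json module. The function names are hypothetical, and the Content-Type header is an assumption that this tutorial does not confirm.

```python
import json
import urllib.request

JOBS_URL = "http://localhost:8080/smila/jobmanager/jobs/"


def file_crawl_job(name, root_folder):
    """Return the job description from this section for the given folder."""
    return {
        "name": name,
        "workflow": "fileCrawling",
        "parameters": {
            "tempStore": "temp",
            "dataSource": "file",
            "rootFolder": root_folder,
            "jobToPushTo": "indexUpdate",
            "mapping": {
                "fileContent": "Content",
                "filePath": "Path",
                "fileName": "Filename",
                "fileExtension": "Extension",
                "fileLastModified": "LastModifiedDate",
            },
        },
    }


def create_job_request(job, jobs_url=JOBS_URL):
    """Build the POST request that creates the job; json.dumps escapes backslashes."""
    body = json.dumps(job).encode("utf-8")
    # Assumption: the header below is not taken from the SMILA documentation.
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(jobs_url, data=body, headers=headers, method="POST")

# Usage (needs a running SMILA):
#   job = file_crawl_job("crawlFilesAtData", r"c:\data\files")
#   urllib.request.urlopen(create_job_request(job)).close()
```

A Windows path is passed with single backslashes as a raw string; the serialized request body then contains the doubled backslashes that JSON requires.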

Hint: Not all file formats are supported by SMILA out-of-the-box. Have a look here for details.

Start your jobs

  • If you have already stopped the indexUpdate job run, start it again (see Start indexing job run). If it is still running, you can skip this request:
#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
  • Start your crawlFilesAtData job similar to Start the crawler but now use the job name crawlFilesAtData instead of crawlSmilaWiki. This new job behaves just like the web crawling job, but its run time might be shorter, depending on how much data actually is at your rootFolder.
#Request
POST http://localhost:8080/smila/jobmanager/jobs/crawlFilesAtData/

Search for your new data

  1. After the job run has finished, wait a bit, then check whether the data has been indexed (see Search the index for help).
  2. It is also a good idea to check the log file for errors.

5 more minutes to change the workflow

The tutorial "5 more minutes to change the workflow" shows how you can configure the system so that data from different data sources goes through different workflows and pipelines and is indexed into different indices.

