SMILA/5 Minutes Tutorial

This page contains installation instructions for the SMILA application and will help you take your first steps with SMILA.

If you have trouble or the results differ from what is described here, check the FAQ.

Download and start SMILA

Download the SMILA package and unpack it to an arbitrary folder. This will result in the following folder structure:

/<SMILA>
  /about_files
  /configuration
  /features
  /jmxclient
  /plugins
  /workspace
  .eclipseproduct
  ...
  SMILA
  SMILA.ini

Preconditions

To be able to start SMILA, check the following preconditions:

JRE

  • You will have to provide a JRE executable to be able to run SMILA. The JVM version should be at least Java 5.
    Either:
    • add the path of your local JRE executable to the PATH environment variable
      or
    • add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini.
      Make sure that -vm is indeed the first argument in the file and that there is a line break after it. It should look similar to the following:

-vm
d:/java/jre6/bin/java
...
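
To check which Java runtime is picked up from your PATH, and whether it meets the minimum version requirement, you can, for example, run:

java -version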

Linux

  • When using the Linux distributable of SMILA, make sure that the files SMILA and jmxclient/run.sh have executable permissions. If not, set the permission by running the following commands in a console:

chmod +x ./SMILA
chmod +x ./jmxclient/run.sh

MacOS

  • When using Mac OS, switch to SMILA.app/Contents/MacOS/ and set the permission by running the following command in a console:

chmod a+x ./SMILA

Start SMILA

To start the SMILA engine, simply double-click the SMILA executable. Alternatively, open a command line, navigate to the directory where you extracted the files to, and execute the SMILA executable. Wait until the engine has been fully started. You can tell that SMILA has fully started when the following line is printed in the console window: HTTP server started successfully on port 8080. You can then access SMILA's REST API at http://localhost:8080/smila/.

When using Mac OS, switch in a terminal to SMILA.app/Contents/MacOS/ and then start SMILA with ./SMILA.
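
If you prefer the command line over a browser, a quick way to verify that the REST API is up is to request its entry point with an HTTP client such as curl (a sketch; adjust the port if you changed the default configuration):

curl http://localhost:8080/smila/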

Now you should check the log file for errors that might have occurred before moving on.

Install a REST client

We will use SMILA's REST API to start and stop jobs, so you need a REST client. The REST Tools page lists a selection of browser plugins in case you do not already have a suitable REST client.

Start Indexing Job and Crawl Import

Now we're going to crawl the SMILA Eclipsepedia pages and index them using the embedded Solr.

Start indexing job run

We are going to start the predefined indexing job "indexUpdateJob" based on the predefined asynchronous "indexUpdate" workflow. This indexing job will process the imported data.

The "indexUpdate" workflow contains a PipelineProcessorWorker worker which executes the synchronous "AddPipeline" BPEL workflow. So, the synchronous "AddPipeline" BPEL workflow is embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" and "indexUpdateJob" definitions see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json and jobs.json). For more information about job management in general please check the JobManager documentation.

Use your favourite REST Client to start a job run for the job "indexUpdateJob":

POST http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob
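
If you use a command-line client instead of a browser plugin, the same request could be sent, for example, like this (a sketch, assuming the default host and port):

curl -X POST http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob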

Your REST client will show a result like this:

{
  "jobId" : "20110901-121343613053",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob/20110901-121343613053/"
}

You will need the "jobId" later on to finish the job run. The job run ID can also be found via the monitoring API for the job:

http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob

In the SMILA.log file you will see a message like this:

INFO ... internal.JobManagerImpl    - started job run '20110901-121343613053' for job 'indexUpdateJob'

Start the crawler

Now that the indexing job is running, we need to push some data to it. There is a predefined job for crawling the SMILA Eclipsepedia pages, which we are going to start now.

We need to start this job in the so-called runOnce mode, a special mode in which tasks are generated by the system rather than by an input trigger and the job finishes automatically. For more information on why this is the case, please see Importing Concept. For more information on jobs and tasks, visit the JobManager manual.

Please POST the following JSON fragment with your REST client to the SMILA job REST API at http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki:

{
  "mode": "runOnce"
}
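
As a sketch, the same request with curl might look like this (the explicit Content-Type header is an assumption; the URL is the job resource named above):

# assumption: the explicit JSON content type header is accepted by the server
curl -X POST -H "Content-Type: application/json" \
     -d '{"mode":"runOnce"}' \
     http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki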

This will now start the job crawlSmilaWiki, which crawls the SMILA Eclipsepedia starting from http://wiki.eclipse.org/SMILA and follows only links with the same prefix.

All pages that have the specified prefix will be pushed to the importing job.

If you like, you can monitor these two jobs with your REST client at the following URIs:

  • crawl job: http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki
  • import job: http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob

The crawling of the wiki pages will take some time. When all pages have been processed, the status of the crawlSmilaWiki job run will change to SUCCEEDED. You can then have a look at SMILA's search page to find out whether some of the pages have already made their way into the Solr index.

You can find more information in the Importing section of the SMILA manual.

Search the index

Note: Since SMILA uses Solr's autocommit (which is configured in solrconfig.xml to a period of 60 seconds or 1000 documents, whichever comes first), it might take some time until you retrieve results.


To search the index which was created by the crawlers, point your browser to http://localhost:8080/SMILA/search. There are currently two stylesheets, which you can select by clicking the respective links in the upper left corner of the header bar: the Default stylesheet shows a reduced search form with text fields like Query, Result Size, and Index Name, adequate for querying the full-text content of the indexed documents. The Advanced stylesheet provides a more detailed search form with text fields for metadata searches, for example Path, MimeType, Filename, and other document attributes.

Now, let's try the Default stylesheet and run a first simple search using a word that you expect to occur in the crawled pages. Enter the desired term into the Query text field and click OK to send your query to SMILA. You should see some results.

Now, let's use the Advanced stylesheet and search for the name of one of the indexed documents to check whether it was properly indexed. Click Advanced to switch to the detailed search form, enter the desired name into the Filename text field, then click OK to submit your search.

Stop indexing job run

Although there's no need for it, we can finish our previously started indexing job run via the REST client now (please replace <job-id> with the job ID you got earlier when starting the job run):

POST http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob/<job-id>/finish  
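
With curl, finishing the job run could look like this (a sketch; <job-id> is the ID returned when you started the run):

curl -X POST http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob/<job-id>/finish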

You can monitor the job run via browser to see that it finished successfully:

http://localhost:8080/smila/jobmanager/jobs/indexUpdateJob/<job-id>

In the SMILA.log file you will see messages like the following:

 INFO ... internal.JobManagerImpl   - finish called for job 'indexUpdateJob', run '20110901-141457584011'
 ...
 INFO ... internal.JobManagerImpl   - Completing job run '20110901-141457584011' for job 'indexUpdateJob' with final state SUCCEEDED

Congratulations, you've just crawled the SMILA Eclipsepedia, indexed the pages and searched through them. For more, just visit SMILA#Documentation.

Further steps

Crawl the filesystem

SMILA also has a predefined job to crawl the file system, but you will have to either adapt the predefined job to point it to a valid folder in your file system or create your own job.

We will go with the second option, because it does not require you to stop and restart SMILA.

Create your Job

POST the following job description to SMILA's job API at http://localhost:8080/smila/jobmanager/jobs. The name is just an example, as is the rootFolder, which you should set to an existing folder on your machine where some data files (e.g. text, PDF, etc.) reside.

    {
      "name":"crawlFilesAtData",
      "workflow":"fileCrawling",
      "parameters":{
        "tempStore":"temp",
        "dataSource":"file",
        "rootFolder":"/data",
        "jobToPushTo":"indexUpdate"
      }
    }
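
As a sketch, you could save the job description above to a local file (the file name crawlFilesAtData.json is just an example) and POST it with curl:

# the file name is hypothetical; it should contain the JSON job description shown above
curl -X POST -H "Content-Type: application/json" \
     -d @crawlFilesAtData.json \
     http://localhost:8080/smila/jobmanager/jobs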

Start your jobs

If you have already stopped the indexUpdateJob, start it again (see Start indexing job run). If it is still running, that's fine.

Now start your crawlFilesAtData job in the same way as described in Start the crawler, but use the new job name crawlFilesAtData instead of crawlSmilaWiki, as sketched below.
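
A curl sketch of starting the new job in runOnce mode, analogous to the crawlSmilaWiki job above (the Content-Type header is an assumption):

curl -X POST -H "Content-Type: application/json" \
     -d '{"mode":"runOnce"}' \
     http://localhost:8080/smila/jobmanager/jobs/crawlFilesAtData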

This new job behaves just like the web crawl job, but its run time might be shorter, depending on how much data actually resides in your rootFolder.

Search for your new data

After the job run has finished, wait a bit and then check whether the data has been indexed (see Search the index).

It is also a good idea to check the log file again for errors.

5 more minutes to change the workflow

The tutorial 5 more minutes to change the workflow shows how you can change the workflow with which the crawled data is processed.
