Difference between revisions of "SMILA/5 Minutes Tutorial"

Revision as of 03:47, 9 April 2013

On this page we describe the necessary steps to install and run SMILA in order to create a search index on the SMILA Eclipsepedia pages and search them.

If you run into trouble or your results differ from what is described here, check the FAQ.

Supported Platforms

The following platforms are supported:

Linux 32 Bit
Linux 64 Bit
Mac OS X 64 Bit (Cocoa)
Windows 32 Bit
Windows 64 Bit

Download and start SMILA

Download the SMILA package matching your operating system and unpack it to an arbitrary folder. This will result in the following folder structure:

/<SMILA>
  /configuration
  /features
  /jmxclient
  /plugins
  /workspace
  .eclipseproduct
  ...
  SMILA
  SMILA.ini

Preconditions

To be able to start SMILA, check the following preconditions first:

JRE

You will have to provide a JRE executable to be able to run SMILA. The JVM version should be Java 7. You may either:

add the path of your local JRE executable to the PATH environment variable
or
add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini.
Make sure that -vm is indeed the first argument in the file, that there is a line break after it and that there are no leading or trailing blanks. It should look similar to the following:

-vm
d:/java/jre7/bin/java
...
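
The placement rules for -vm can be checked mechanically. The following is a minimal Python sketch and not part of SMILA: the function name check_vm_argument is hypothetical, and it only tests the conditions stated above (first argument, path on the following line, no leading or trailing blanks).

```python
def check_vm_argument(ini_text):
    """Return a list of problems found with the -vm entry in SMILA.ini content.

    Hypothetical helper for illustration; an empty list means no problem found.
    """
    problems = []
    lines = ini_text.splitlines()

    # -vm has to be the very first argument in the file.
    if not lines or lines[0].strip() != "-vm":
        problems.append("-vm is not the first argument")
        return problems

    # The path to the JRE executable goes on the line right after -vm.
    if len(lines) < 2 or not lines[1].strip():
        problems.append("no path on the line after -vm")
        return problems

    # Neither of the two lines may carry leading or trailing blanks.
    for number, line in enumerate(lines[:2], start=1):
        if line != line.strip():
            problems.append(f"line {number} has leading or trailing blanks")
    return problems
```

To use it, pass in the contents of the file, for example check_vm_argument(open("SMILA.ini").read()).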

Linux

When using the Linux distributable of SMILA, make sure that the files SMILA and jmxclient/run.sh have executable permissions. If not, set the permission by running the following commands in a console:

chmod +x ./SMILA
chmod +x ./jmxclient/run.sh

MacOS

On Mac OS X, switch to SMILA.app/Contents/MacOS/ and set the permission by running the following command in a console:

chmod a+x ./SMILA

Start SMILA

To start the SMILA engine, simply double-click the SMILA executable. Alternatively, open a command line, navigate to the directory where you extracted the files to, and execute the SMILA executable. Wait until the engine has been fully started.

SMILA has fully started when the line HTTP server started successfully on port 8080 is printed on the OSGI console and you can access SMILA's REST API at http://localhost:8080/smila/.

On Mac OS X, navigate to SMILA.app/Contents/MacOS/ in a terminal, then start SMILA with ./SMILA

Before continuing, check the log file for possible errors.
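
The readiness check can also be scripted instead of watching the console. The Python sketch below polls the REST API root until it answers. It is an illustration under assumptions: the helper names is_up and wait_until_up, the number of attempts and the delay are not part of SMILA, and only the URL is taken from this tutorial.

```python
import time
import urllib.error
import urllib.request

SMILA_URL = "http://localhost:8080/smila/"  # REST API root named in this tutorial


def is_up(url=SMILA_URL, timeout=2.0):
    """Return True if an HTTP GET on url is answered with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status: treat as "not up yet".
        return False


def wait_until_up(probe=is_up, attempts=30, delay=1.0):
    """Call probe up to `attempts` times, sleeping `delay` seconds in between."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_until_up() returns True as soon as the API responds and False if it never does within the given number of attempts. A successful response does not replace checking the log file for errors.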

Stop SMILA

To stop the SMILA engine, type close into the OSGI console and press Enter:

osgi> close

For further OSGI console commands, enter help:

osgi> help

Install a REST client

We're going to use SMILA's REST API to start and stop jobs, so you need a REST client. If you don't have a suitable one yet, REST Tools lists a selection of recommended browser plugins.

Start Indexing Job and Crawl Import

Now we're going to crawl the SMILA Eclipsepedia pages and index them using the embedded Solr integration.

Start indexing job run

We are going to start the predefined indexing job "indexUpdate" based on the predefined asynchronous "indexUpdate" workflow. This indexing job will process the imported data.

Use your favorite REST Client to start a job run for the job "indexUpdate":

#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/

Your REST client will show a result like this:

#Response
{
  "jobId" : "20110901-121343613053",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/20110901-121343613053/"
}
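
If you prefer a script over a browser plugin, the request and response above can be handled with the Python standard library. This is a sketch under assumptions: the function names are hypothetical, and it assumes that SMILA accepts a POST with an empty body for starting a job run, as the request above suggests.

```python
import json
import urllib.request

JOBS_URL = "http://localhost:8080/smila/jobmanager/jobs/"  # from this tutorial


def start_request(job_name, base=JOBS_URL):
    """Build the POST request that starts a run of the given job."""
    # Assumption: an empty request body is sufficient to start a job run.
    return urllib.request.Request(base + job_name + "/", data=b"", method="POST")


def job_id_from_response(body):
    """Extract the "jobId" field from the JSON text of the response."""
    return json.loads(body)["jobId"]


def start_job_run(job_name):
    """Send the request to a running SMILA instance and return the job run ID."""
    with urllib.request.urlopen(start_request(job_name)) as response:
        return job_id_from_response(response.read().decode("utf-8"))
```

With SMILA running, start_job_run("indexUpdate") returns the ID that is needed later to finish the job run.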

You will need the "jobId" later on to finish the job run. The job run Id can also be found via the monitoring API for the job:

#Request
GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/

In the SMILA.log file you will see a message like this:

INFO ... internal.JobRunEngineImpl   - started job run '20110901-121343613053' for job 'indexUpdate'

Further information: The "indexUpdate" workflow uses the PipelineProcessorWorker, which executes the synchronous "AddPipeline" BPEL workflow. The synchronous "AddPipeline" BPEL workflow is thus embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" workflow and job definitions, see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json and jobs.json. For more information about job management in general, please check the JobManager documentation.

Start the crawler

Now that the indexing job is running we need to push some data to it. There is a predefined job for indexing the SMILA Eclipsepedia pages which we are going to start right now. For more information about crawl jobs please see Importing Concept. For more information on jobs and tasks in general visit the JobManager manual.

To start the job run, send the following POST request to SMILA with your REST client:

#Request
POST http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki/

This starts the job crawlSmilaWiki, which crawls the SMILA Eclipsepedia starting with http://wiki.eclipse.org/SMILA and (by applying the configured filters) following only links that have the same prefix. All pages crawled matching this prefix will be pushed to the import job.

If you like, you can monitor both job runs with your REST client at the following URIs:

Crawl job: http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki
Import job: http://localhost:8080/smila/jobmanager/jobs/indexUpdate

Or both in one overview at

http://localhost:8080/smila/jobmanager/jobs/

Crawling the wiki pages will take some time. Once all pages have been processed, the status of the crawlSmilaWiki job run will change to SUCCEEDED. You can have a look at SMILA's search page to find out whether some of the pages have already made their way into the Solr index.

Further information: You can find details about the relevant Import concepts here.

Search the index

Since SMILA uses Solr's autocommit feature (which is configured in solrconfig.xml to a period of 30 seconds or 1000 documents, whichever comes first) it might take some time until you retrieve results.

To search the index which was created by the crawlers, point your browser to http://localhost:8080/SMILA/search. There are currently two stylesheets, which you can select by clicking the respective links in the upper left corner of the header bar. The Default stylesheet shows a reduced search form with text fields such as Query, Result Size, and Index, adequate for querying the full-text content of the indexed documents. The Advanced stylesheet provides a more detailed search form with text fields for metadata search, such as Path, MimeType, Filename, and other document attributes.

To use the Default Stylesheet:

Point your browser to http://localhost:8080/SMILA/search.
Enter a word that you expect to be contained in the indexed pages into the Query text field.
Click OK to send your query to SMILA.

To use the Advanced Stylesheet:

Point your browser to http://localhost:8080/SMILA/search.
Click Advanced to switch to the detailed search form.
For example, to find a file by its name, enter the file name into the Filename text field, then click OK to submit your search.

Stop indexing job run

Although there's no need for it, we can now finish our previously started indexing job run via the REST client (replace <job-id> with the job ID you received when you started the job run):

#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish

You can monitor the job run via your browser to see that it has finished successfully:

#Request
GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>
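
The two URLs above differ only in the job run ID and the /finish suffix, so they are easy to build in a script. The following Python sketch uses hypothetical helper names and assumes, as above, that the finish request needs no body. It also rejects the unreplaced <job-id> placeholder, which is an easy mistake when copying the request.

```python
import urllib.request

JOBS_URL = "http://localhost:8080/smila/jobmanager/jobs/"  # from this tutorial


def run_url(job_name, job_id, base=JOBS_URL):
    """Return the monitoring URL of one job run."""
    # Guard against sending the literal placeholder from the tutorial.
    if not job_id or "<" in job_id or ">" in job_id:
        raise ValueError("replace <job-id> with the real job run ID")
    return f"{base}{job_name}/{job_id}"


def finish_request(job_name, job_id):
    """Build the POST request that finishes the given job run."""
    # Assumption: an empty request body is sufficient here.
    return urllib.request.Request(
        run_url(job_name, job_id) + "/finish", data=b"", method="POST"
    )
```

The request object can be sent with urllib.request.urlopen(finish_request("indexUpdate", job_id)), and run_url gives the address to open in the browser afterwards.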

In the SMILA.log file you will see messages like this:

 INFO ... internal.JobRunEngineImpl   - finish called for job 'indexUpdate', run '20110901-141457584011'
 ...
 INFO ... internal.JobRunEngineImpl   - Completing job run '20110901-141457584011' for job 'indexUpdate' with final state SUCCEEDED

Congratulations, you've just crawled the SMILA Eclipsepedia, indexed the pages and searched through them. For more, visit the SMILA Manual.

Further steps

Crawl the filesystem

SMILA also has a predefined job to crawl the file system ("crawlFilesystem"), but you will have to either adapt the predefined job to point it to a valid folder in your filesystem or create your own job.

We will settle for the second option, because it does not require stopping and restarting SMILA.

Create your Job

POST the following job description to SMILA's Job API at http://localhost:8080/smila/jobmanager/jobs. Adapt the rootFolder parameter to point to an existing folder on your machine where you have placed some files (e.g. plain text, office docs or HTML files). If your path includes backslashes, escape them with an additional backslash, e.g. c:\\data\\files.

#Request
POST http://localhost:8080/smila/jobmanager/jobs/
{
  "name":"crawlFilesAtData",
  "workflow":"fileCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"file",
    "rootFolder":"/data",
    "jobToPushTo":"indexUpdate",
    "mapping":{
      "fileContent":"Content",
      "filePath":"Path",       
      "fileName":"Filename",       
      "fileExtension":"Extension",
      "fileLastModified":"LastModifiedDate"
    }
  }
}
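
When the job description is generated by a script, the backslash escaping mentioned above does not have to be done by hand, because a JSON serializer takes care of it. The Python sketch below builds the same job description and the request for it. The function names are hypothetical, and the Content-Type header is an assumption that this tutorial does not confirm.

```python
import json
import urllib.request

JOBS_URL = "http://localhost:8080/smila/jobmanager/jobs/"  # from this tutorial


def file_crawl_job(name, root_folder):
    """Return the job description from this tutorial as a Python dict."""
    return {
        "name": name,
        "workflow": "fileCrawling",
        "parameters": {
            "tempStore": "temp",
            "dataSource": "file",
            "rootFolder": root_folder,
            "jobToPushTo": "indexUpdate",
            "mapping": {
                "fileContent": "Content",
                "filePath": "Path",
                "fileName": "Filename",
                "fileExtension": "Extension",
                "fileLastModified": "LastModifiedDate",
            },
        },
    }


def create_job_request(job):
    """Build the POST request that creates the job."""
    # json.dumps escapes each backslash in the path as a double backslash.
    body = json.dumps(job).encode("utf-8")
    return urllib.request.Request(
        JOBS_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed, not confirmed
    )
```

For a Windows folder, file_crawl_job("crawlFilesAtData", r"c:\data\files") produces a request body in which the path appears as c:\\data\\files.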

Hint: Not all file formats are supported by SMILA out-of-the-box. Have a look here for details.

Start your jobs

Start the indexUpdate job (see Start indexing job run) if you have already stopped it. If it is still running, that's fine:

#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/

Start your crawlFilesAtData job as described in Start the crawler, but use the job name crawlFilesAtData instead of crawlSmilaWiki. This new job behaves just like the web crawling job, but its run time might be shorter, depending on how much data there actually is in your rootFolder.

#Request
POST http://localhost:8080/smila/jobmanager/jobs/crawlFilesAtData/

Search for your new data

After the job run has finished, wait a moment, then check whether the data has been indexed (see Search the index for help).
It is also a good idea to check the log file for errors.

5 more minutes to change the workflow

5 more minutes to change the workflow shows how you can configure the system so that data from different data sources goes through different workflows and pipelines and is indexed into different indices.


@@ Line 2: / Line 2: @@
 [[Category:HowTo]]
-This page contains installation instructions for the SMILA application which will help you taking the first steps with SMILA.
+On this page we describe the necessary steps to install and run SMILA in order to create a search index on the [[SMILA]] Eclipsepedia pages and search them.
-== Download and unpack SMILA ==
+If you have any troubles or the results differ from what is described here, check the [[SMILA/FAQ|FAQ]].
-[http://www.eclipse.org/smila/downloads.php Download] the SMILA package and unpack it to an arbitrary folder. This will result in the following folder structure:
+== Supported Platforms ==
+The following platforms are supported:
+*Linux 32 Bit
+*Linux 64 Bit
+*Mac OS X 64 Bit (Cocoa)
+*Windows 32 Bit
+*Windows 64 Bit
-[[Image:Installation.png]]
+== Download and start SMILA ==
-== Check the preconditions ==
+[http://www.eclipse.org/smila/downloads.php Download] the SMILA package matching your [[#Supported_Platforms|operation system]] and unpack it to an arbitrary folder. This will result in the following folder structure:
-To be able to follow the steps below, check the following preconditions:
+<pre>
+/<SMILA>
+  /configuration
+  /features
+  /jmxclient
+  /plugins
+  /workspace
+  .eclipseproduct
+  ...
+  SMILA
+  SMILA.ini
+</pre>
-* You will have to provide a JRE executable to be able to run SMILA. The JVM version should be at least Java 5. <br> Either:
+=== Preconditions ===
-** add the path of your local JRE executable to the PATH environment variable <br>or<br>
+To be able to start SMILA, check the following preconditions first:
-** add the argument <tt>-vm <path/to/jre/executable></tt> right at the top of the file <tt>SMILA.ini</tt>. <br>Make sure that <tt>-vm</tt> is indeed the first argument in the file and that there is a line break after it. It should look similar to the following:
-<tt>
+==== JRE ====
- -vm
+You will have to provide a JRE executable to be able to run SMILA. The JVM version should be Java 7. You may either:
- d:/java/jre6/bin/java
+* add the path of your local JRE executable to the PATH environment variable <br>or<br>
- ...
+* add the argument <tt>-vm <path/to/jre/executable></tt> right at the top of the file <tt>SMILA.ini</tt>. <br>Make sure that <tt>-vm</tt> is indeed the first argument in the file, that there is a line break after it and that there are no leading or trailing blanks. It should look similar to the following:
-</tt>
+<div style="margin-left: 1.5em;">
-* Since we are going to use <tt>Jconsole</tt> as the JMX client later in this tutorial, it is recommended to install and use a Java SE Development Kit (JDK) and not just a Java SE Runtime Environment (JRE) because the latter does not include this application.
+<source lang="text">
-*When using the Linux distributable of SMILA, make sure that the files <tt>SMILA</tt> and <tt>jmxclient/run.sh</tt> have executable permissions. If not, set the permission by running the following commands in a console:
+-vm
+d:/java/jre7/bin/java
+...
+</source>
+</div>
+==== Linux ====
+When using the Linux distributable of SMILA, make sure that the files <tt>SMILA</tt> and <tt>jmxclient/run.sh</tt> have executable permissions. If not, set the permission by running the following commands in a console:
 <tt>
   chmod +x ./SMILA
@@ Line 29: / Line 53: @@
 </tt>
-== Start SMILA ==
+==== MacOS ====
+When using MAC switch to <tt>SMILA.app/Contents/MacOS/</tt> and set the permission by running the following command in a console:
+<tt>
+ chmod a+x ./SMILA
+</tt>
-To start the SMILA engine, simply double-click the <tt>SMILA</tt> executable. Alternatively, open a command line, navigate to the directory where you extracted the files to, and call the <tt>SMILA</tt> executable. Wait until the engine has been fully started. If everything works fine, you should see output similar to that on the following screenshot:
+=== Start SMILA ===
+To start the SMILA engine, simply double-click the <tt>SMILA</tt> executable. Alternatively, open a command line, navigate to the directory where you extracted the files to, and execute the <tt>SMILA</tt> executable. Wait until the engine has been fully started.
-[[Image:Smila-console-0.8.0.png]]
+You can tell if SMILA has fully started if the following line is printed on the OSGI console: <tt>HTTP server started successfully on port 8080</tt> and you can access SMILA's REST API at [http://localhost:8080/smila/ http://localhost:8080/smila/].
-== Check the log file ==
+When using MAC, navigate to <tt>SMILA.app/Contents/MacOS/</tt> in terminal, then start with <tt>./SMILA</tt>
-Open the SMILA log file in an editor of your choice to find out what is happening in the background. This file is named <tt>SMILA.log</tt> and can be found in the same directory as the SMILA executable.
-[[Image:Smila-log.png]]
+Before continuing, [[SMILA/FAQ#How_can_I_see_that_SMILA_started_correctly.3F|check the log file]] for possible errors.
-== Configure the File System crawler  ==
+=== Stop SMILA ===
-Prepare some local folder on your system whose contents we are going to index in the following. Add some text and HTML files to it, or if you do not have any at hand, create some files. The result could look similar to the following:<tt><br></tt>
+To stop the SMILA engine, type <tt>close</tt> into the OSGI console and press ''Enter'':
- <pre>/home
+<source lang="text">
-  /johndoe
+osgi> close
-    /mydata
+</source>
-      myfile.txt
-      someothertxtfile.txt
-      myfile.html
-      someotherhtmlfile.html</pre>
-{| width="100%" style="background-color:#ffcccc; padding-left:30px;"
+For further OSGI console commands, enter <tt>help</tt>:
-|-
-|
-Note: Currently, only plain text and HTML files can be crawled and indexed properly.
-|}
+<source lang="text">
+osgi> help
+</source>
-Open the configuration file at <tt>configuration/org.eclipse.smila.connectivity.framework/file.xml</tt> and adapt the ''BaseDir'' attribute to point to this folder. Make sure to set an absolute path:
+== Install a REST client ==
-<pre>
+We're going to use SMILA's REST API to start and stop jobs, so you need a REST client. In [[SMILA/Documentation/Using_The_ReST_API#Interactive_Tools|REST Tools]] you find a selection of recommended browser plugins if you haven't got a suitable REST client yet.
- &lt;Process&gt;
-  &lt;BaseDir&gt;/home/johndoe/mydata&lt;/BaseDir&gt;
-  ...
- &lt;/Process&gt;
-</pre>
-== Control crawler jobs ==
+== Start Indexing Job and Crawl Import ==
-Next step is to start a file system crawler job and let SMILA index the configured folder. Crawler jobs can be managed via the JMX protocol, therefore you can connect to SMILA using any JMX client you like. We are going to use JConsole in the following because it is included in the Java SE Development Kit.
-Start the JConsole executable in your JDK distribution (<tt><JAVA_HOME>/bin/jconsole</tt>). If the client is up and running, connect to <tt>localhost:9004</tt>.
+Now we're going to crawl the SMILA Eclipsepedia pages and index them using the embedded [[SMILA/Documentation/Solr|Solr integration]].
-[[Image:Jconsole.png-0.8.0.png]]
+=== Start indexing job run ===
-Next, switch to the ''MBeans'' tab, expand the ''SMILA'' node in the ''MBeans'' tree on the left-hand side, and click the ''CrawlerController'' node. This node is used to manage and monitor all crawling activities.
+We are going to start the predefined indexing job "indexUpdate" based on the predefined asynchronous "indexUpdate" workflow. This indexing job will process the imported data.
-[[Image:Mbeans-overview-0.8.0.png]]
+Use your favorite REST Client to start a job run for the job "indexUpdate":
-== Start the File System crawler  ==
+<source lang="javascript">
+#Request
+POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
+</source>
-To start the File System crawler, select ''SMILA &gt; CrawlerControl &gt; Operations'' on the left-hand side, enter "file" into the text field next to the ''startCrawlerTask'' button, then click the button:
+Your REST client will show a result like this:
-[[Image:Start-file-crawl-0.8.0.png]]
+<source lang="javascript">
+#Response
+{
+  "jobId" : "20110901-121343613053",
+  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/20110901-121343613053/"
+}
+</source>
-You should receive a message similar to the following, indicating that the crawler has been successfully started:
+You will need the "jobId" later on to finish the job run. The job run Id can also be found via the monitoring API for the job:
-[[Image:Start-crawl-file-result-0.8.0.png]]
+<source lang="javascript">
+#Request
-Now let's check the <tt>SMILA.log</tt> file to see what has happened in the background:
+GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
+</source>
+In the <tt>SMILA.log</tt> file you will see a message like that:
 <pre>
-...
+INFO ... internal.JobRunEngineImpl   - started job run '20110901-121343613053' for job 'indexUpdate'
-INFO  [Thread-21  ]  filesystem.FileSystemCrawler        - Initializing FileSystemCrawler...
-...
-INFO  [Thread-21  ]  Records                             - Record is routed with rule [Default Route Rule] and operation [null], record id=file:<Path=/home/doe01/mydata/smila-pipelets.html>
-INFO  [Thread-21  ]  Records                             - Record is routed with rule [Default Route Rule] and operation [null], record id=file:<Path=/home/doe01/mydata/smila-web-crawler.txt>
-...
-INFO  [Thread-21  ]  filesystem.FileSystemCrawler        - Closing FileSystemCrawler...
-...
-</pre>
-If the File System crawler cannot find the folder to index, the log file would look similar to the following:
-<pre>
-...
-INFO  [Thread-24  ]  filesystem.FileSystemCrawler        - Initializing FileSystemCrawler...
-WARN  [Thread-24  ]  performancecounters.CrawlerControllerPerformanceCounterHelper - Agent location [Crawlers/FileSystem/file - 1491048155] is not found
-WARN  [Thread-24  ]  performancecounters.CrawlerControllerPerformanceCounterHelper - Instance agent agent is null
-ERROR [Thread-24  ]  impl.CrawlThread                    -
-org.eclipse.smila.connectivity.framework.CrawlerCriticalException: Folder "/home/doe01/doesnotexist" is not found
-  at org.eclipse.smila.connectivity.framework.crawler.filesystem.FileSystemCrawler.checkFolders(FileSystemCrawler.java:347)
-  at org.eclipse.smila.connectivity.framework.crawler.filesystem.FileSystemCrawler.initialize(FileSystemCrawler.java:176)
-  at org.eclipse.smila.connectivity.framework.impl.CrawlThread.run(CrawlThread.java:214)
-INFO  [Thread-24  ]  filesystem.FileSystemCrawler        - Closing FileSystemCrawler...
-...
 </pre>
-The error message above states that the crawler tried to index a folder at <tt>/home/doe01/doesnotexist</tt> but was not able to find it. To solve this, provide data at the mentioned folder or [[#Configure_the_File_System_crawler|adapt the configuration of the File System crawler accordingly]].
+'''Further information''': The "indexUpdate" workflow uses the [[SMILA/Documentation/Worker/PipelineProcessorWorker|PipelineProcessorWorker]] that executes the synchronous "AddPipeline" BPEL workflow. So, the synchronous "AddPipeline" BPEL workflow is embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" workflow and "indexUpdate" job definitions see <tt>SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json</tt> and <tt>jobs.json</tt>). For more information about job management in general please check the [[SMILA/Documentation/JobManager|JobManager documentation]].
-== Search the index ==
-To search the index which was created by the crawlers, point your browser to <tt>http://localhost:8080/SMILA/search</tt>. There are currently two stylesheets from which you can select by clicking the respective links in the upper left corner of the header bar: The ''Default'' stylesheet shows a reduced search form with text fields like ''Query'', ''Result Size'', and ''Index Name'', adequate to query the full-text content of the indexed documents. The ''Advanced'' stylesheet in turn provides a more detailed search form with text fields for meta-data search like for example ''Path'', ''MimeType'', ''Filename'', and other document attributes.
-[[Image:Smila-search-form.png]]
-Now, let's try the ''Default'' stylesheet and enter our first simple search using a word that you expect to be contained in your dummy files. In this tutorial, we assume that there is a match for the term "data" in the indexed documents. First, select the index on which you want to search from the ''Indexlist'' column on the left-hand side. Currently, there should be only one in the list, namely an index called "test_index". Note that the selected index name will appear in the ''Index Name'' text field of the search form. Then enter the desired term into the ''Query'' text field. And finally, click ''OK'' to send your query to SMILA. Your result could be similar to the following:
-[[Image:Searching-for-text-in-file.png]]
-Now, let's use the ''Advanced'' stylesheet and search for the name of one the files contained in the indexed folder to check whether it was properly indexed. In our example, we are going to search for a file named <tt>smila-glossary.html</tt>. Click ''Advanced'' to switch to the detailed search form, enter the desired file name into the ''Filename'' text field, then click ''OK'' to submit your search. Your result could be similar to the following:
-[[Image:Searching-by-filename.png]]
-== Configure and run the Web crawler ==
+=== Start the crawler ===
-Now that we alreday know how to start and configure the File System crawler and how to search indices, configuring and running the Web crawler is rather straightforward:
-First, let's have a look at the configuration file of the Web crawler which you can find at <tt>configuration/org.eclipse.smila.connectivity.framework/web.xml</tt>:
+Now that the indexing job is running we need to push some data to it. There is a predefined job for indexing the SMILA Eclipsepedia pages which we are going to start right now.  For more information about crawl jobs please see [[SMILA/Documentation/Importing/Concept|Importing Concept]]. For more information on jobs and tasks in general visit the [[SMILA/Documentation/JobManager|JobManager manual]].
-<source lang="xml">
+To start the job run, POST the following JSON fragment with your REST client to SMILA:
-<DataSourceConnectionConfig  ...>
+<source lang="javascript">
-  <DataSourceID>web</DataSourceID>
+#Request
-  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.web</SchemaID>
+POST http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki/
-  <DataConnectionID>
-    <Crawler>WebCrawler</Crawler>
-  </DataConnectionID>
-  <RecordBuffer Size="20" FlushInterval="3000" />
-  <DeltaIndexing>full</DeltaIndexing>
-  <Attributes>
-    ....
-  </Attributes>
-  <Process>
-    <WebSite ProjectName="Example Crawler Configuration" Header="Accept-Encoding: gzip,deflate; Via: myProxy" Referer="http://myReferer">
-      <UserAgent Name="Crawler" Version="1.0" Description="teddy crawler" Url="http://www.teddy.com" Email="crawler@teddy.com"/>
-      <CrawlingModel Type="MaxDepth" Value="1000"/>
-      <CrawlScope Type="Path" />
-      <CrawlLimits>
-        ...
-      </CrawlLimits>
-      <Seeds FollowLinks="NoFollow">
-        <Seed>http://wiki.eclipse.org/SMILA</Seed>
-      </Seeds>
-      <Filters>
-        <Filter Type="RegExp" Value=".*action=edit.*" WorkType="Unselect"/>
-        <Filter Type="RegExp" Value="^((?!/SMILA).)*$" WorkType="Unselect"/>
-      </Filters>
-      <MetaTagFilters>
-        <MetaTagFilter Type="Name" Name="robots" Content="noindex,nofollow" WorkType="Unselect"/>
-      </MetaTagFilters>
-    </WebSite>
-  </Process>
 </source>
-By default, the Web crawler is configured to index the URL ''http://wiki.eclipse.org/SMILA''. To change this, set the content of the <tt>&lt;Seed&gt;</tt> element to the desired web address and adapt the <tt><Filters></tt> section accordingly. If you require further help on this configuration file refer to the [[SMILA/Documentation/Web_Crawler|Web crawler documentation]]. For example, in the following we changed the web address to the main page of Wikipedia and removed one of the <tt>&lt;Filter&gt;</tt> elements:
+This starts the job <tt>crawlSmilaWiki</tt>, which crawls the [[SMILA|SMILA Eclipsepedia]] starting with <tt>http://wiki.eclipse.org/SMILA</tt> and (by applying the configured filters) following only links that have the same prefix. All pages crawled matching this prefix will be pushed to the import job.
-<source lang="xml">
+If you like, you can monitor both job runs with your REST client at the following URIs:
- ...
+* Crawl job: [http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki]
- <Seeds FollowLinks="NoFollow">
+* Import job: [http://localhost:8080/smila/jobmanager/jobs/indexUpdate http://localhost:8080/smila/jobmanager/jobs/indexUpdate]
-   <Seed>http://en.wikipedia.org/wiki/Main_Page</Seed>
+Or both in one overview at
- </Seeds>
+* [http://localhost:8080/smila/jobmanager/jobs/ http://localhost:8080/smila/jobmanager/jobs/]
- <Filters>
-   <Filter Type="RegExp" Value=".*action=edit.*" WorkType="Unselect"/>
- </Filters>
- ...
-</source>
-To start the crawling process, save the configuration file, go back or reconnect to Jconsole, navigate to ''SMILA'' > ''CrawlerControl'' > ''Operations'', type "web" into the text field next to the <tt>startCrawlerTask</tt> button, then click the button.
+The crawling of the wikipedia page should take some time. If all pages are processed, the status of the [http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki crawlSmilaWiki]'s job run will change to {{code|SUCCEEDED}}. You can have a look at SMILA's search page to find out if some of the pages have already made their way into the Solr index.
-[[Image:Starting-web-crawler-0.8.0.png]]
+'''Further information:''' You can find details about the relevant [[SMILA/Manual#Importing|Import concepts here]].
-Although the default limit for spidered web sites is set to 1,000 in the Web crawler configuration file, it may take a while for the web crawling job to be finished. Click the <tt>getCrawlerTasksState</tt> button to monitor the job processing if you want to find out when it has finished. This will produce an output similar to the following:
+== Search the index ==
-[[Image:SMILA-One-active-crawl-found-0.8.0.png]]
+{{note|Since SMILA uses [[SMILA/Documentation/Solr#solrconfig.xml|Solr's autocommit feature]] (which is configured in <tt>solrconfig.xml</tt> to a period of 30 seconds or 1000 documents, whichever comes first) it might take some time until you retrieve results.}}
-If you do not want to wait, you may as well stop the crawling job manually. In order to do this, type "web" into the text field next to the (<tt>stopCrawlerTask</tt>) button, then click this button.
+To search the index which was created by the crawlers, point your browser to <tt>http://localhost:8080/SMILA/search</tt>. There are currently two stylesheets from which you can select by clicking the respective links in the upper left corner of the header bar: The ''Default'' stylesheet shows a reduced search form with text fields like ''Query'', ''Result Size'', and ''Index'', adequate to query the full-text content of the indexed documents. The ''Advanced'' stylesheet in turn provides a more detailed search form with text fields for meta-data search like for example ''Path'', ''MimeType'', ''Filename'', and other document attributes.
-As soon as the Web crawler's job has finished, go back to the search form to [[SMILA/Documentation_for_5_Minutes_to_Success#Search the index|search the generated index]].
+'''To use the ''Default'' Stylesheet''':
+#Point your browser to <tt>http://localhost:8080/SMILA/search</tt>.
+#Enter a word that you expect to be contained in your dummy files into the ''Query'' text field.
+# Click ''OK'' to send your query to SMILA.
-[[Category:SMILA]]
+'''To use the ''Advanced'' Stylesheet''':
+#Point your browser to <tt>http://localhost:8080/SMILA/search</tt>.
+#Click ''Advanced'' to switch to the detailed search form.
+#For example, to find a file by its name, enter the file name into the ''Filename'' text field, then click ''OK'' to submit your search.
-== Manage CrawlerController using the JMX Client  ==
+== Stop indexing job run ==
-Instead of managing the crawler jobs using JConsole it is also possible to use the JMX Client from the SMILA distribution for the same purpose. The JMX Client is a console application that allows managing crawler jobs and creating scripts intended for batch crawler execution. It can be found in the <tt>jmxclient</tt> directory of the SMILA distribution. Use the appropriate run script for your platform (i.e. <tt>run.bat</tt> or <tt>run.sh</tt>) to start the application.
+Although there's no need for it, we can finish our previously started indexing job run via REST client now:
-For example, to start the File System crawler use the following command:
+(replace <job-id> with the job-id you got before when [[#Start_indexing_job_run|you started the job run]]).
-<tt>
+<source lang="javascript">
-  run crawl file
+#Request
-</tt>
+POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish
+</source>
-For more information please check the [[SMILA/Documentation/Management#JMX_Client|JMX Client documentation]].
+You can monitor the job run via your browser to see that it has finished successfully:
+<source lang="javascript">
+#Request
+GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>
+</source>
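The two requests above can also be sent from a script instead of a REST client. The following is a minimal sketch in Python using only the standard library. The helper names (<tt>finish_request</tt>, <tt>status_request</tt>, <tt>send</tt>) are made up for this illustration, and the sketch assumes that SMILA is reachable at <tt>http://localhost:8080</tt> and that response bodies are either empty or JSON.

```python
import json
import urllib.request

# Base URL of the job manager API as used throughout this tutorial.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"


def finish_request(job_name, job_id):
    """Build the POST request that finishes the given job run."""
    url = f"{BASE_URL}/{job_name}/{job_id}/finish"
    # The request carries an empty body; only the URL and method matter here.
    return urllib.request.Request(url, data=b"", method="POST")


def status_request(job_name, job_id):
    """Build the GET request that returns the data of the given job run."""
    url = f"{BASE_URL}/{job_name}/{job_id}"
    return urllib.request.Request(url, method="GET")


def send(request):
    """Send a request and return the parsed JSON body (None if it is empty).

    Assumption: the server answers with JSON or with an empty body.
    """
    with urllib.request.urlopen(request) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body.strip() else None


# Usage (requires a running SMILA instance and a real job ID):
#   send(finish_request("indexUpdate", "20110901-141457584011"))
#   print(send(status_request("indexUpdate", "20110901-141457584011")))
```

Building the request objects separately from sending them makes it possible to inspect the URL and HTTP method before anything is sent to the server.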
-== 5 Minutes for changing the workflow  ==
+In the <tt>SMILA.log</tt> file you will see messages like this:
-In previous sections all data collected by crawlers was processed with the same workflow and was indexed into the same index named "test_index".
+<pre>
-It is possible, however, to configure SMILA so that data from different data sources will go through different workflows and will be indexed into different indices. This will require more advanced configuration features than before but still quite simple ones.
+ INFO ... internal.JobRunEngineImpl   - finish called for job 'indexUpdate', run '20110901-141457584011'
+ ...
+ INFO ... internal.JobRunEngineImpl   - Completing job run '20110901-141457584011' for job 'indexUpdate' with final state SUCCEEDED
+</pre>
-In the following sections we are going to create an additional workflow for webcrawler records so that webcrawler data will be indexed into a separate index named "web_index".
+Congratulations, you've just crawled the SMILA Eclipsepedia, indexed the pages and searched through them. For more information, visit the [[SMILA/Manual|SMILA Manual]].
-=== Modify Listener rules ===
+== Further steps ==
-The first step includes modifying and extending the Listener rules so that webcrawler records are to be processed by their own BPEL workflow. For more information on the Listener component, please see section [[SMILA/Documentation/QueueWorker/Listener|Listener]] of the [[SMILA/Documentation/QueueWorker|QueueWorker]] documentation.
+=== Crawl the filesystem ===
-Open the configuration of the Listener from <tt>configuration/org.eclipse.smila.connectivity.queue.worker.jms/QueueWorkerListenerConfig.xml</tt> and edit the <tt><Condition></tt> tag of the existing ''ADD Rule'' to skip webcrawler data. The result should be as follows:
+SMILA also has a predefined job to crawl the file system ("crawlFilesystem"), but you will have to either adapt the predefined job to point to a valid folder in your file system or create your own job.
-<source lang="xml">
-<Rule Name="ADD Rule" WaitMessageTimeout="10" Threads="4" MaxMessageBlockSize="20">
-  <Source BrokerId="broker1" Queue="SMILA.connectivity"/>
-  <Condition>Operation='ADD' and NOT(DataSourceID LIKE '%feeds%')
-    and NOT(DataSourceID LIKE '%xmldump%')
-    and NOT (DataSourceID LIKE 'web%')</Condition>
-  <Task>
-    <Process Workflow="AddPipeline"/>
-  </Task>
-</Rule>
-</source>
-Now add the following new rule:
-<source lang="xml">
-<Rule Name="Web ADD Rule" WaitMessageTimeout="10" Threads="2">
-  <Source BrokerId="broker1" Queue="SMILA.connectivity"/>
-  <Condition>Operation='ADD'
-    and DataSourceID LIKE 'web%'</Condition>
-  <Task>
-    <Process Workflow="AddWebPipeline"/>
-  </Task>
-</Rule>
-</source>
-This rule defines that webcrawler data will be processed by the ''AddWebPipeline'' workflow, which we will have to create in the next step.
-=== Create workflow for the BPEL WorkflowProcessor ===
+We will settle for the second option, because it does not require you to stop and restart SMILA.
-We need to add the ''AddWebPipeline'' workflow to the BPEL WorkflowProcessor. For more information about BPEL WorkflowProcessor please check the [[SMILA/Documentation/BPEL_Workflow_Processor|BPEL WorkflowProcessor]] documentation.
-BPEL WorkflowProcessor configuration files are contained in the <tt>configuration/org.eclipse.smila.processing.bpel/pipelines</tt> directory.
-There is a file called <tt>addpipeline.bpel</tt> which defines the "AddPipeline" process. Let's create the <tt>addwebpipeline.bpel</tt> file that will define the "AddWebPipeline" process and put the following code into it:
-<source lang="xml">
-<?xml version="1.0" encoding="utf-8" ?>
-<process name="AddWebPipeline" targetNamespace="http://www.eclipse.org/smila/processor"
-    xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable"
-    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
-    xmlns:proc="http://www.eclipse.org/smila/processor"
-    xmlns:rec="http://www.eclipse.org/smila/record">
-  <import location="processor.wsdl" namespace="http://www.eclipse.org/smila/processor"
-      importType="http://schemas.xmlsoap.org/wsdl/" />
-  <partnerLinks>
-    <partnerLink name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" myRole="service" />
-  </partnerLinks>
-  <extensions>
-    <extension namespace="http://www.eclipse.org/smila/processor" mustUnderstand="no" />
-  </extensions>
-  <variables>
-    <variable name="request" messageType="proc:ProcessorMessage" />
-  </variables>
-  <sequence>
-    <receive name="start" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process"
-        variable="request" createInstance="yes" />
-    <!-- only process text based content, skip everything else -->
-    <if name="conditionIsText">
-      <condition>starts-with($request.records/rec:Record[1]/rec:Val[@key="MimeType"],"text/")</condition>
-      <sequence name="processTextBasedContent">
-        <!-- extract txt from html files -->
-        <if name="conditionIsHtml">
-          <condition>starts-with($request.records/rec:Record[1]/rec:Val[@key="MimeType"],"text/html")
-            or
-            starts-with($request.records/rec:Record[1]/rec:Val[@key="MimeType"],"text/xml")
-          </condition>
-        </if>
-        <extensionActivity>
-          <proc:invokePipelet name="invokeHtml2Txt">
-            <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
-            <proc:variables input="request" output="request" />
-            <proc:configuration>
-              <rec:Val key="inputType">ATTACHMENT</rec:Val>
-              <rec:Val key="outputType">ATTACHMENT</rec:Val>
-              <rec:Val key="inputName">Content</rec:Val>
-              <rec:Val key="outputName">Content</rec:Val>
-              <rec:Val key="meta:title">Title</rec:Val>
-            </proc:configuration>
-          </proc:invokePipelet>
-        </extensionActivity>
-        <extensionActivity>
-          <proc:invokePipelet name="invokeLucenePipelet">
-            <proc:pipelet class="org.eclipse.smila.lucene.pipelets.LuceneIndexPipelet" />
-            <proc:variables input="request" output="request" />
-            <proc:configuration>
-              <rec:Map key="_indexing">
-                <rec:Val key="indexname">web_index</rec:Val>
-                <rec:Val key="executionMode">ADD</rec:Val>
-              </rec:Map>
-            </proc:configuration>
-          </proc:invokePipelet>
-        </extensionActivity>
-      </sequence>
-    </if>
-    <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType"
-operation="process" variable="request" />
-    <exit />
-  </sequence>
-</process>
-</source>
-Note that we use "web_index" index name for the LuceneService in the code above:
+==== Create your Job ====
-<source lang="xml">
+POST the following job description to [[SMILA/Documentation/JobDefinitions#List.2C_create.2C_modify_jobs|SMILA's Job API]] at <tt>http://localhost:8080/smila/jobmanager/jobs</tt>. Adapt the <tt>rootFolder</tt> parameter to point to an existing folder on your machine where you have placed some files (e.g. plain text, office docs or HTML files). If your path includes backslashes, escape them with an additional backslash, e.g. <tt>c:\\data\\files</tt>.
-<proc:configuration>
+<source lang="javascript">
-   <rec:Map key="_indexing">
+#Request
-     <rec:Val key="indexname">web_index</rec:Val>
+POST http://localhost:8080/smila/jobmanager/jobs/
-     <rec:Val key="executionMode">ADD</rec:Val>
+{
-  </rec:Map>
+  "name":"crawlFilesAtData",
-</proc:configuration>
+  "workflow":"fileCrawling",
+  "parameters":{
+    "tempStore":"temp",
+    "dataSource":"file",
+    "rootFolder":"/data",
+    "jobToPushTo":"indexUpdate",
+    "mapping":{
+      "fileContent":"Content",
+      "filePath":"Path",
+      "fileName":"Filename",
+      "fileExtension":"Extension",
+      "fileLastModified":"LastModifiedDate"
+    }
+  }
+}
 </source>
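If you generate the job description from a script, a JSON library takes care of the backslash escaping mentioned above. The following minimal Python sketch builds the same job description; the function name <tt>build_crawl_job</tt> and the example path are made up for this illustration, and sending the result to the Job API is left out.

```python
import json


def build_crawl_job(name, root_folder):
    """Return the job description from this section as a JSON string."""
    job = {
        "name": name,
        "workflow": "fileCrawling",
        "parameters": {
            "tempStore": "temp",
            "dataSource": "file",
            "rootFolder": root_folder,
            "jobToPushTo": "indexUpdate",
            "mapping": {
                "fileContent": "Content",
                "filePath": "Path",
                "fileName": "Filename",
                "fileExtension": "Extension",
                "fileLastModified": "LastModifiedDate",
            },
        },
    }
    # json.dumps writes every backslash in the path as a doubled backslash.
    return json.dumps(job, indent=2)


# A raw string keeps the Windows path readable in the Python source.
body = build_crawl_job("crawlFilesAtData", r"c:\data\files")
print(body)
```

The printed text can be used as the request body for the POST request shown above.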
-We need to add our pipeline description to the <tt>deploy.xml</tt> file placed in the same directory. Add the following code to the end of <tt>deploy.xml</tt> before the closing <tt></deploy></tt> tag:
+''Hint: Not all file formats are supported by SMILA out-of-the-box. Have a look [[SMILA/Documentation/TikaPipelet#Supported_document_types | here]] for details.''
-<source lang="xml">
-<process name="proc:AddWebPipeline">
-  <in-memory>true</in-memory>
-  <provide partnerLink="Pipeline">
-    <service name="proc:AddWebPipeline" port="ProcessorPort" />
-  </provide>
-</process>
-</source>
-Now we need to add our "web_index" to the LuceneIndexService configuration.
+==== Start your jobs ====
-=== Configure LuceneIndexService ===
+*Start the <tt>indexUpdate</tt> job (see [[#Start_indexing_job_run|Start indexing job run]]) if you have already stopped it. If it is still running, you can skip this step:
-For more information about the LuceneIndexService, please see [[SMILA/Documentation/LuceneIndexService|LuceneIndexService]].
+<div style="margin-left: 1.5em;">
+<source lang="javascript">
-Let's configure our "web_index" index structure and search template. Add the following code to the end of <tt>configuration/org.eclipse.smila.search.datadictionary/DataDictionary.xml</tt> file before the closing <tt></AnyFinderDataDictionary></tt> tag:
+#Request
-<source lang="xml">
+POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
-<Index Name="web_index">
-  <Connection xmlns="http://www.anyfinder.de/DataDictionary/Connection" MaxConnections="5"/>
-  <IndexStructure xmlns="http://www.anyfinder.de/IndexStructure" Name="web_index">
-    <Analyzer ClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
-    <IndexField FieldNo="8" IndexValue="true" Name="MimeType" StoreText="true" Tokenize="true" Type="Text"/>
-    <IndexField FieldNo="7" IndexValue="true" Name="Size" StoreText="true" Tokenize="true" Type="Text"/>
-    <IndexField FieldNo="6" IndexValue="true" Name="Extension" StoreText="true" Tokenize="true" Type="Text"/>
-    <IndexField FieldNo="5" IndexValue="true" Name="Title" StoreText="true" Tokenize="true" Type="Text"/>
-    <IndexField FieldNo="4" IndexValue="true" Name="Url" StoreText="true" Tokenize="false" Type="Text">
-      <Analyzer ClassName="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
-    </IndexField>
-    <IndexField FieldNo="3" IndexValue="true" Name="LastModifiedDate" StoreText="true" Tokenize="false" Type="Text"/>
-    <IndexField FieldNo="2" IndexValue="true" Name="Path" StoreText="true" Tokenize="true" Type="Text"/>
-    <IndexField FieldNo="1" IndexValue="true" Name="Filename" StoreText="true" Tokenize="true" Type="Text"/>
-    <IndexField FieldNo="0" IndexValue="true" Name="Content" StoreText="true" Tokenize="true" Type="Text"/>
-  </IndexStructure>
-  <Configuration xmlns="http://www.anyfinder.de/DataDictionary/Configuration" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-xsi:schemaLocation="http://www.anyfinder.de/DataDictionary/Configuration ../xml/DataDictionaryConfiguration.xsd">
-    <DefaultConfig>
-      <Field FieldNo="8">
-        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-      <Field FieldNo="7">
-        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-      <Field FieldNo="6">
-        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-      <Field FieldNo="5">
-        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-      <Field FieldNo="4">
-        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-      <Field FieldNo="3">
-        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-      <Field FieldNo="2">
-        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-      <Field FieldNo="1">
-        <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="OR" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-      <Field FieldNo="0">
-        <FieldConfig Constraint="required" Weight="1" xsi:type="FTText">
-          <NodeTransformer xmlns="http://www.anyfinder.de/Search/ParameterObjects" Name="urn:ExtendedNodeTransformer">
-            <ParameterSet xmlns="http://www.brox.de/ParameterSet"/>
-          </NodeTransformer>
-          <Parameter xmlns="http://www.anyfinder.de/Search/TextField" Operator="AND" Tolerance="exact"/>
-        </FieldConfig>
-      </Field>
-    </DefaultConfig>
- </Configuration>
-</Index>
 </source>
-Now we need to add mapping of attribute and attachment names to Lucene "FieldNo" defined in <tt>DataDictionary.xml</tt>. Open <tt>configuration/org.eclipse.smila.lucene/Mappings.xml</tt> file and add the following code to the end of file before closing <tt></Mappings></tt> tag:
+</div>
-<source lang="xml">
+*Start your <tt>crawlFilesAtData</tt> job as described in [[#Start_the_crawler|Start the crawler]], but use the job name <tt>crawlFilesAtData</tt> instead of <tt>crawlSmilaWiki</tt>. This new job behaves just like the web crawling job, but its run time might be shorter, depending on how much data there actually is in your <tt>rootFolder</tt>.
-<Mapping indexName="web_index">
+<div style="margin-left: 1.5em;">
-  <Attributes>
+<source lang="javascript">
-    <Attribute name="Filename" fieldNo="1" />
+#Request
-    <Attribute name="Path" fieldNo="2" />
+POST http://localhost:8080/smila/jobmanager/jobs/crawlFilesAtData/
-  <Attribute name="LastModifiedDate" fieldNo="3" />
-  <Attribute name="Url" fieldNo="4" />
-  <Attribute name="Title" fieldNo="5" />
-  <Attribute name="Extension" fieldNo="6" />
-  <Attribute name="Size" fieldNo="7" />
-  <Attribute name="MimeType" fieldNo="8" />
-  </Attributes>
-  <Attachments>
-    <Attachment name="Content" fieldNo="0" />
-  </Attachments>
-</Mapping>
 </source>
+</div>
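Both job runs can also be started from a script, in the same order as described above. This is a minimal Python sketch using only the standard library; the names <tt>START_ORDER</tt>, <tt>start_request</tt> and <tt>start_job</tt> are made up for this illustration, and it is an assumption that the server answers the start request with a JSON body.

```python
import json
import urllib.request

# Base URL of the job manager API as used throughout this tutorial.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"

# The indexing job must be running before the crawl job pushes records to it.
START_ORDER = ["indexUpdate", "crawlFilesAtData"]


def start_request(job_name):
    """Build the POST request that starts a new run of the given job."""
    url = f"{BASE_URL}/{job_name}/"
    return urllib.request.Request(url, data=b"", method="POST")


def start_job(job_name):
    """Start a job run and return the parsed response.

    Assumption: the response body is JSON.
    """
    with urllib.request.urlopen(start_request(job_name)) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running SMILA instance; skip "indexUpdate"
# if that job is still running):
#   for name in START_ORDER:
#       print(name, start_job(name))
```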
-=== Put it  all together ===
+==== Search for your new data ====
-Ok, now it seems that we have finally finished configuring SMILA for using separate workflows for file system and web crawling and index data from these crawlers into different indices.
+#After the job run has finished, wait a moment, then check whether the data has been indexed (see [[#Search_the_index|Search the index]] for help).
-Here is what we have done so far:
+#It is also a good idea to check the log file for errors.
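Checking the log file can be done with a few lines of Python. This is a minimal sketch: the function names are made up for this illustration, and it assumes that the log level appears as a separate word (<tt>INFO</tt>, <tt>ERROR</tt>) in each line, as in the log excerpt shown earlier. The location of <tt>SMILA.log</tt> depends on your installation.

```python
def find_error_lines(lines):
    """Return all lines that contain the word ERROR as a separate token."""
    return [line.rstrip("\n") for line in lines if "ERROR" in line.split()]


def errors_in_file(path):
    """Read a log file and return its ERROR lines."""
    # errors="replace" keeps the scan going if the file has odd bytes.
    with open(path, encoding="utf-8", errors="replace") as log:
        return find_error_lines(log)


# Usage (adapt the path to where your SMILA.log is located):
#   for line in errors_in_file("SMILA.log"):
#       print(line)
```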
-# We modified the Listener rules in order to use different workflows for web and file system crawling.
-# We created a new BPEL workflow for Web crawler data.
-# We added the <tt>web_index</tt> index to the Lucene configuration.
-Now we can start SMILA again and observe what will happen when starting the Web crawler.
-{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
-|
-It's very important to shutdown the SMILA engine and restart it afterwards because modified configurations are loaded during startup only.
-|}
-Now you can search the new index "web_index" using your browser:
-[[Image:Web_index-search.png]]
-== Configuration overview ==
-SMILA configuration files are located in the <tt>configuration</tt> directory of the SMILA application.
+=== 5 more minutes to change the workflow ===
-The following figure shows the configuration files relevant to this tutorial, regarding SMILA components and the data lifecycle. The names of SMILA components are formatted in black font, directories containing configuration files and filenames are shown in blue.
-[[Image:Smila-configuration-overview.jpg]]
+The [[SMILA/Documentation/5 more minutes to change the workflow|5 more minutes to change the workflow]] tutorial shows how you can configure the system so that data from different data sources goes through different workflows and pipelines and is indexed into different indices.
