SMILA/5 Minutes Tutorial


On this page we describe the steps necessary to install and run SMILA in order to create a search index of the SMILA Eclipsepedia pages and search them.

If you have any trouble or the results differ from what is described here, check the FAQ.

Supported Platforms

The following platforms are supported:

  • Linux 32 Bit
  • Linux 64 Bit
  • Mac OS X 64 Bit (Cocoa)
  • Windows 32 Bit
  • Windows 64 Bit

Download and start SMILA

Download the SMILA package matching your operating system and unpack it to an arbitrary folder. This will result in the following folder structure:

/<SMILA>
  /about_files
  /configuration
  /features
  /jmxclient
  /plugins
  /workspace
  .eclipseproduct
  ...
  SMILA
  SMILA.ini

Preconditions

To be able to start SMILA, check the following preconditions first:

JRE

You will have to provide a JRE executable to be able to run SMILA. The JVM version should be at least Java 5. You may either:

  • add the path of your local JRE executable to the PATH environment variable
    or
  • add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini.
    Make sure that -vm is indeed the first argument in the file, that there is a line break after it and that there are no leading or trailing blanks. It should look similar to the following:
-vm
d:/java/jre7/bin/java
...

Linux

When using the Linux distributable of SMILA, make sure that the files SMILA and jmxclient/run.sh have executable permissions. If not, set the permission by running the following commands in a console:

chmod +x ./SMILA
chmod +x ./jmxclient/run.sh

MacOS

On Mac OS X, switch to SMILA.app/Contents/MacOS/ and set the permission by running the following command in a console:

chmod a+x ./SMILA

Start SMILA

To start the SMILA engine, simply double-click the SMILA executable. Alternatively, open a command line, navigate to the directory where you extracted the files to, and execute the SMILA executable. Wait until the engine has been fully started.

You can tell that SMILA has fully started when the following line is printed on the OSGi console: HTTP server started successfully on port 8080. You can then access SMILA's REST API at http://localhost:8080/smila/.

On Mac OS X, navigate to SMILA.app/Contents/MacOS/ in a terminal, then start SMILA with ./SMILA.
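
For example, starting SMILA from a console on Linux or Mac OS X could look like this (a minimal sketch; the installation path /opt/SMILA is only an assumed example, replace it with the folder you unpacked SMILA to):

# Linux (on Windows, run the SMILA executable from the extracted folder instead)
cd /opt/SMILA
./SMILA

# Mac OS X: the executable lives inside the application bundle
cd /opt/SMILA/SMILA.app/Contents/MacOS
./SMILA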

Before continuing, check the log file for possible errors.

Stop SMILA

To stop the SMILA engine, type close into the OSGi console and press Enter:

osgi> close

For further OSGi console commands, enter help:

osgi> help

Install a REST client

We're going to use SMILA's REST API to start and stop jobs, so you need a REST client. In REST Tools you will find a selection of recommended browser plugins if you do not have a suitable REST client yet.
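
If you prefer the command line, plain curl calls will do as well (a minimal sketch, assuming curl is installed; the URLs are SMILA's REST API entry point and the job list used throughout this tutorial):

# Browse SMILA's REST API entry point
curl http://localhost:8080/smila/

# List the defined jobs
curl http://localhost:8080/smila/jobmanager/jobs/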

Start Indexing Job and Crawl Import

Now we're going to crawl the SMILA Eclipsepedia pages and index them using the embedded Solr integration.

Start indexing job run

We are going to start the predefined indexing job "indexUpdate" based on the predefined asynchronous "indexUpdate" workflow. This indexing job will process the imported data.

Use your favorite REST Client to start a job run for the job "indexUpdate":

#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
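
If you use curl instead of a browser plugin, the same request could be sent like this (a sketch; as shown above, no request body is needed):

# Start a run of the predefined indexUpdate job
curl -X POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/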

Your REST client will show a result like this:

#Response
{
  "jobId" : "20110901-121343613053",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/20110901-121343613053/"
}

You will need the "jobId" later on to finish the job run. The job run ID can also be found via the monitoring API for the job:

#Request
GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/

In the SMILA.log file you will see a message like this:

INFO ... internal.JobRunEngineImpl   - started job run '20110901-121343613053' for job 'indexUpdate'

Further information: The "indexUpdate" workflow uses the PipelineProcessorWorker, which executes the synchronous "AddPipeline" BPEL workflow, i.e. the synchronous "AddPipeline" BPEL workflow is embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" workflow and job definitions, see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json and jobs.json. For more information about job management in general, please check the JobManager documentation.

Start the crawler

Now that the indexing job is running we need to push some data to it. There is a predefined job for indexing the SMILA Eclipsepedia pages which we are going to start right now. For more information about crawl jobs please see Importing Concept. For more information on jobs and tasks in general visit the JobManager manual.

To start the job run, send the following POST request with your REST client to SMILA:

#Request
POST http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki/

This starts the job crawlSmilaWiki, which crawls the SMILA Eclipsepedia starting at http://wiki.eclipse.org/SMILA and (by applying the configured filters) following only links that have the same prefix. All crawled pages matching this prefix will be pushed to the import job.

If you like, you can monitor both job runs with your REST client at the following URIs:

  • Crawl job: http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki
  • Import job: http://localhost:8080/smila/jobmanager/jobs/indexUpdate

Or both in one overview at:

  • http://localhost:8080/smila/jobmanager/jobs/

The crawling of the wiki pages should take some time. When all pages have been processed, the status of the crawlSmilaWiki job run will change to SUCCEEDED. You can have a look at SMILA's search page to find out whether some of the pages have already made their way into the Solr index.

Further information: You can find details about the relevant Import concepts here.

Search the index

Note: Since SMILA uses Solr's autocommit feature (which is configured in solrconfig.xml to a period of 60 seconds or 1000 documents, whichever comes first), it might take some time until you retrieve results.


To search the index which was created by the crawlers, point your browser to http://localhost:8080/SMILA/search. There are currently two stylesheets, which you can select by clicking the respective links in the upper left corner of the header bar: the Default stylesheet shows a reduced search form with text fields like Query, Result Size, and Index, adequate for querying the full-text content of the indexed documents. The Advanced stylesheet in turn provides a more detailed search form with text fields for metadata search, for example Path, MimeType, Filename, and other document attributes.

To use the Default Stylesheet:

  1. Point your browser to http://localhost:8080/SMILA/search.
  2. Enter a word that you expect to be contained in the indexed documents into the Query text field.
  3. Click OK to send your query to SMILA.

To use the Advanced Stylesheet:

  1. Point your browser to http://localhost:8080/SMILA/search.
  2. Click Advanced to switch to the detailed search form.
  3. For example, to find a file by its name, enter the file name into the Filename text field, then click OK to submit your search.

Stop indexing job run

Although there is no need for it, we can now finish our previously started indexing job run via the REST client (replace <job-id> with the job ID you got before when you started the job run):

#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish

You can monitor the job run via your browser to see that it has finished successfully:

#Request
GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>
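
With curl, finishing and monitoring the run could look like this (a sketch; replace <job-id> with your actual job run ID as above):

# Finish the indexing job run
curl -X POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish

# Check the state of the job run
curl http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>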

In the SMILA.log file you will see messages like this:

 INFO ... internal.JobRunEngineImpl   - finish called for job 'indexUpdate', run '20110901-141457584011'
 ...
 INFO ... internal.JobRunEngineImpl   - Completing job run '20110901-141457584011' for job 'indexUpdate' with final state SUCCEEDED

Congratulations, you have just crawled the SMILA Eclipsepedia, indexed the pages, and searched through them. For more information, visit the SMILA Manual.

Further steps

Crawl the filesystem

SMILA also has a predefined job to crawl the file system ("crawlFilesystem"), but you will have to either adapt the predefined job to point it to a valid folder in your file system or create your own job.

We will settle for the second option, because it does not require you to stop and restart SMILA.

Create your Job

POST the following job description to SMILA's Job API at http://localhost:8080/smila/jobmanager/jobs. Adapt the rootFolder parameter to point to an existing folder on your machine where you have placed some files (e.g. plain text or HTML files). If your path includes backslashes, escape them with an additional backslash, e.g. c:\\data\\files.

#Request
POST http://localhost:8080/smila/jobmanager/jobs/
{
  "name":"crawlFilesAtData",
  "workflow":"fileCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"file",
    "rootFolder":"/data",
    "jobToPushTo":"indexUpdate",
    "mapping":{
      "fileContent":"Content",
      "filePath":"Path",       
      "fileName":"Filename",       
      "fileExtension":"Extension",
      "fileLastModified":"LastModifiedDate"
    }
  }
}
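
With curl, one way to do this is to save the job description above to a file (the file name crawlFilesAtData.json is just an assumed example) and POST it with a JSON content type (a sketch, assuming curl):

# Create the crawlFilesAtData job from the job description stored in crawlFilesAtData.json
curl -X POST -H "Content-Type: application/json" \
  -d @crawlFilesAtData.json \
  http://localhost:8080/smila/jobmanager/jobs/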

For file types other than plain text and HTML you cannot search inside the document's content (at least not right now, but you might have a look at the Aperture Pipelet, which addresses this problem).

Start your jobs

  • Start the indexUpdate job (see Start indexing job run) if you have already stopped it. If it is still running, that's fine:
#Request
POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
  • Start your crawlFilesAtData job similarly to Start the crawler, but now use the job name crawlFilesAtData instead of crawlSmilaWiki. This new job behaves just like the web crawling job, but its run time might be shorter, depending on how much data there actually is in your rootFolder.
#Request
POST http://localhost:8080/smila/jobmanager/jobs/crawlFilesAtData/

Search for your new data

  1. After the job run has finished, wait a bit, then check whether the data has been indexed (see Search the index for help).
  2. It is also a good idea to check the log file for errors.

5 more minutes to change the workflow

The tutorial 5 more minutes to change the workflow shows how you can configure the system so that data from different data sources goes through different workflows and pipelines and is indexed into different indices.

