SMILA/5 Minutes Tutorial

This page describes the steps needed to install and run SMILA in order to create a search index of the SMILA Eclipsepedia pages and search them.

If you run into trouble or your results differ from what is described here, check the FAQ.

Supported Platforms

The following platforms are supported:

  • Linux 32 Bit
  • Linux 64 Bit
  • Mac OS X 64 Bit (Cocoa)
  • Windows 32 Bit
  • Windows 64 Bit

Download and start SMILA

Download the SMILA package matching your operating system and unpack it to an arbitrary folder. This will result in the following folder structure:

/<SMILA>
  /configuration    
  ...
  SMILA
  SMILA.ini

Preconditions

Before starting SMILA, check the following preconditions:

JRE

You need a JRE executable to run SMILA; the JVM version should be Java 7 or newer. You may either:

  • add the path of your local JRE executable to the PATH environment variable
    or
  • add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini.
    Make sure that -vm is indeed the first argument in the file, that there is a line break after it and that there are no leading or trailing blanks. It should look similar to the following:
-vm
d:/java/jre7/bin/java
...
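
To check which JRE a console session picks up, you can print its version (a quick check; the output format varies by JVM vendor):

 # Print the version of the Java runtime found on the PATH.
 java -version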

Linux

When using Linux, make sure that the file SMILA has executable permissions. If not, set the permission by running the following command in a console:

chmod +x ./SMILA

MacOS

When using macOS, switch to SMILA.app/Contents/MacOS/ and set the permission by running the following command in a console:

chmod a+x ./SMILA

Start SMILA

To start SMILA, simply start the SMILA executable.

You can see that SMILA has fully started when the following line is printed on the OSGi console:

 ...
 HTTP server started successfully on port 8080

and you can access SMILA's REST API at http://localhost:8080/smila/.

If it doesn't work, check the log file (SMILA.log) for possible errors.
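
You can also verify from a console that the REST API is reachable (a sketch; assumes curl is installed):

 # Fetch the REST API root; an HTTP response indicates SMILA is up.
 curl http://localhost:8080/smila/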

Stop SMILA

To stop SMILA, type exit into the OSGi console and press Enter:

 osgi> exit

Start Indexing Job and Crawl Import

Now we're going to crawl and process the SMILA Eclipsepedia pages, then index and search them using the embedded Solr integration.

Install a REST client

We're going to use SMILA's REST API to start and stop jobs, so you need a REST client. If you don't have a suitable one yet, the REST Tools page lists a selection of recommended browser plugins.

Start the indexing job run

We are going to start the predefined indexing job "indexUpdate" based on the predefined asynchronous workflow with the same name. This indexing job will process the imported data.

Use your favorite REST Client to start a job run for the job "indexUpdate":

 POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/

Your REST client will show a result like this:

{
  "jobId" : "20110901-121343613053",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/20110901-121343613053/"
}

You will need the job run ID ("jobId") later to finish the job run. It can also be found via the job's monitoring API:

 GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
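
If you prefer the command line to a browser plugin, the same two requests can be issued with curl (a sketch; assumes curl is installed):

 # Start a job run; the response contains the "jobId".
 curl -X POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
 # List the job's runs and their states.
 curl http://localhost:8080/smila/jobmanager/jobs/indexUpdate/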

In the SMILA.log file you will see a message like this:

 INFO ... internal.JobRunEngineImpl   - started job run '20110901-121343613053' for job 'indexUpdate'

Further information: The "indexUpdate" workflow uses the ScriptProcessorWorker, which executes the JavaScript workflow "add.js". The synchronous script call is thus embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" workflow and job definitions, see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json and jobs.json. For more information about job management in general, please check the JobManager documentation.

Start the crawl job run

Now that the indexing job is running, we need to push some data to it. There is a predefined job for importing the SMILA Wiki pages, which we are going to start now.

 POST http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki/

This starts the job crawlSmilaWiki, which crawls the SMILA Wiki starting at http://wiki.eclipse.org/SMILA and, by applying the configured filters, follows only links that have the same prefix. All crawled pages matching this prefix will be pushed to the import job.

Both job runs can be monitored via SMILA's REST API:

  • All jobs: http://localhost:8080/smila/jobmanager/jobs/
  • Crawl job: http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki
  • Import job: http://localhost:8080/smila/jobmanager/jobs/indexUpdate

Crawling the SMILA Wiki pages takes some time. Once all pages have been processed, the status of the crawlSmilaWiki job run changes to SUCCEEDED. You can continue with the SMILA search (next chapter) to find out whether some of the pages have already made their way into the Solr index.
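
From a console you can poll the crawl job's status until it reports success (a minimal sketch; it simply greps the monitoring response for the state string):

 # Poll the crawl job every 10 seconds until a run reports SUCCEEDED.
 until curl -s http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki | grep -q SUCCEEDED; do
   echo "still crawling..."
   sleep 10
 done
 echo "crawl finished"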

Further information: For more information about importing and crawl jobs, please see SMILA Importing. For more information on jobs and tasks in general, visit the JobManager manual.

Search the index

To have a look at the index state, e.g. how many documents are already indexed, call:

 http://localhost:8080/solr/admin/
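
From a console, a quick liveness check is possible with curl (a sketch; the admin page is HTML, so we only inspect the HTTP status code):

 # Print only the HTTP status code; 200 means the Solr admin page is reachable.
 curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/solr/admin/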

To search the created index, point your browser to:

 http://localhost:8080/SMILA/search

There are currently two stylesheets, which you can select by clicking the respective links in the upper left corner of the header bar: the Default stylesheet shows a reduced search form with text fields such as Query, Result Size, and Index, adequate for querying the full-text content of the indexed documents. The Advanced stylesheet provides a more detailed search form with text fields for metadata search, for example Path, MimeType, Filename, and other document attributes.

To use the Default Stylesheet:

  1. Point your browser to http://localhost:8080/SMILA/search.
  2. Enter the search term(s) into the Query text field (e.g. "SMILA").
  3. Click OK to send your query to SMILA.

To use the Advanced Stylesheet:

  1. Point your browser to http://localhost:8080/SMILA/search.
  2. Click Advanced to switch to the detailed search form.
  3. For example, to find a file by its name, enter the file name into the Filename text field, then click OK to submit your search.

Stop indexing job run

Although there's no need for it, we can now finish the indexing job run we started earlier via the REST client. Replace <job-id> with the job run ID you received when you started the job run:

 POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish  

You can monitor the job run via your browser to see that it has finished successfully:

 GET http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>
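
The same calls with curl (a sketch; replace <job-id> as above, and keep the quotes so the shell does not interpret the angle brackets):

 # Ask the job manager to finish the run, then check its final state.
 curl -X POST "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish"
 curl "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>"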

In the SMILA.log file you will see messages like this:

INFO ... internal.JobRunEngineImpl   - finish called for job 'indexUpdate', run '20110901-141457584011'
...
INFO ... internal.JobRunEngineImpl   - Completing job run '20110901-141457584011' for job 'indexUpdate' with final state SUCCEEDED



Congratulations, you've just finished the tutorial!

You crawled the SMILA Wiki, indexed the pages and searched through them. For more, just continue with the chapter below or visit the SMILA Documentation.

Further steps

Crawl the filesystem

SMILA also has a predefined job to crawl the file system ("crawlFilesystem"), but you will have to either adapt it to point to a valid folder in your filesystem or create your own job.

We will go with the second option, because it does not require you to stop and restart SMILA.

Create your Job

POST the following job description to SMILA's Job API. Adapt the rootFolder parameter to point to an existing folder on your machine where you have placed some files (e.g. plain text, office docs or HTML files). If your path includes backslashes, escape them with an additional backslash, e.g. c:\\data\\files.

POST http://localhost:8080/smila/jobmanager/jobs/
{
 "name":"crawlFilesAtData",
 "workflow":"fileCrawling",
 "parameters":{
   "tempStore":"temp",
   "dataSource":"file",
   "rootFolder":"/data",
   "jobToPushTo":"indexUpdate",
   "mapping":{
     "fileContent":"Content",
     "filePath":"Path",       
     "fileName":"Filename",       
     "fileExtension":"Extension",
     "fileLastModified":"LastModifiedDate"
     }
  }
}
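
From a console, the description can be posted with curl (a sketch assuming you saved the JSON above as job.json, a hypothetical file name, in the current directory):

 # Create the job by posting its description to the Job API.
 curl -X POST -H "Content-Type: application/json" -d @job.json http://localhost:8080/smila/jobmanager/jobs/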

Hint: Not all file formats are supported by SMILA out-of-the-box. Have a look here for details.

Start your jobs

  • Start the indexUpdate job (see Start indexing job run) if you have already stopped it. (If it is still running, that's fine.)

 POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/

  • Start your crawlFilesAtData job. This new job behaves just like the web crawling job we used above, but its run time might be shorter, depending on how much data is actually in your rootFolder.

 POST http://localhost:8080/smila/jobmanager/jobs/crawlFilesAtData/

Search for your new data

  1. After the job run has finished, wait a moment, then check whether the data has been indexed (see Search the index).
  2. It is also a good idea to check the log file for errors.

5 more minutes to change the workflow

The tutorial 5 more minutes to change the workflow shows how you can configure the system so that data from different data sources goes through different workflows and scripts and is indexed into different indices.
