SMILA/5 Minutes Tutorial

Revision as of 12:15, 24 January 2012


This page contains installation instructions for the SMILA application that will help you take your first steps with SMILA.

If you run into trouble or your results differ from what is described here, check the FAQ.

Download and start SMILA

Download the SMILA package and unpack it to an arbitrary folder. This will result in the following folder structure:

/<SMILA>
  /about_files
  /configuration
  /features
  /jmxclient
  /plugins
  /workspace
  .eclipseproduct
  ...
  SMILA
  SMILA.ini

Preconditions

To be able to start SMILA, check the following preconditions:

JRE

  • You will have to provide a JRE executable to be able to run SMILA. The JVM version should be at least Java 5.
    Either:
    • add the path of your local JRE executable to the PATH environment variable
      or
    • add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini.
      Make sure that -vm is indeed the first argument in the file and that there is a line break after it. It should look similar to the following:

-vm
d:/java/jre6/bin/java
...

Linux

When using the Linux distributable of SMILA, make sure that the files SMILA and jmxclient/run.sh have executable permissions. If not, set the permission by running the following commands in a console:

chmod +x ./SMILA
chmod +x ./jmxclient/run.sh

MacOS

On MacOS, switch to SMILA.app/Contents/MacOS/ and set the permission by running the following command in a console:

chmod a+x ./SMILA

Start SMILA

To start the SMILA engine, simply double-click the SMILA executable. Alternatively, open a command line, navigate to the directory where you extracted the files, and run the SMILA executable. Wait until the engine has fully started. SMILA is up when the following line is printed in the console window: HTTP server started successfully on port 8080, and you can access SMILA's REST API at http://localhost:8080/smila/.

On MacOS, switch in a terminal to SMILA.app/Contents/MacOS/ and then start the engine with ./SMILA.

Before moving on, you should check the log file for errors that might have occurred.
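A quick way to do this check from the command line (a sketch; it assumes you run it from the SMILA installation folder, where the SMILA.log file is written):

```shell
# Scan the SMILA log for error entries; prints a note if none are found
# or if the log file does not exist yet.
LOG_FILE="SMILA.log"
grep -i "ERROR" "$LOG_FILE" || echo "no ERROR lines found (or no log file yet)"
```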

Install a REST client

We will use SMILA's REST API to start and stop jobs, so you need a REST client. In REST Tools you will find a selection of browser plugins in case you do not already have a suitable REST client.
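If you prefer the command line, curl also works as a REST client for this tutorial. A minimal sketch (it assumes SMILA runs on localhost:8080; the jobs overview endpoint is an assumption based on the job API used below):

```shell
# Use curl as a simple REST client against SMILA's REST API.
BASE="http://localhost:8080/smila"
# GET the jobs overview; prints a note if SMILA is not running.
curl -s "$BASE/jobmanager/jobs" || echo "SMILA not reachable"
```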

Start Indexing Job and Crawl Import

Now we're going to crawl the SMILA Eclipsepedia pages and index them using the embedded Solr.

Start indexing job run

We are going to start the predefined indexing job "indexUpdate" based on the predefined asynchronous "indexUpdate" workflow. This indexing job will process the imported data.

The "indexUpdate" workflow contains the worker PipelineProcessorWorker, which executes the synchronous "AddPipeline" BPEL workflow. So, the synchronous "AddPipeline" BPEL workflow is embedded in the asynchronous "indexUpdate" workflow. For more details about the "indexUpdate" workflow and job definitions, see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json and jobs.json. For more information about job management in general, please check the JobManager documentation.

Use your favourite REST client to start a job run for the job "indexUpdate":

POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate

Your REST client will show a result like this:

{
  "jobId" : "20110901-121343613053",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/20110901-121343613053/"
}

You will need the "jobId" later on to finish the job run. The job run Id can also be found via the monitoring API for the job:

http://localhost:8080/smila/jobmanager/jobs/indexUpdate

In the SMILA.log file you will see a message like this:

INFO ... internal.JobRunEngineImpl   - started job run '20110901-121343613053' for job 'indexUpdate'
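With curl you can start the job and capture the jobId from the JSON response in one step (a sketch; it assumes python3 is available for JSON parsing and SMILA is running on the default port):

```shell
# Start the indexUpdate job and extract the jobId from the JSON response.
BASE="http://localhost:8080/smila/jobmanager"
RESPONSE=$(curl -s -X POST "$BASE/jobs/indexUpdate" || echo '{}')
# Parse the "jobId" field; empty if SMILA is not reachable.
JOB_ID=$(echo "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin).get("jobId", ""))')
echo "started job run: $JOB_ID"
```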

Start the crawler

Now that the indexing job is running, we need to push some data to it. There is a predefined job for indexing the SMILA Eclipsepedia pages, which we are going to start now.

We need to start this job in the so-called runOnce mode, a special mode where tasks are generated by the system rather than by an input trigger, and where the job finishes automatically. For more information on why this is the case, please see the Importing Concept. For more information on jobs and tasks, visit the JobManager manual.

Please POST the following json fragment with your REST client to the SMILA job REST API at http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki:

{
  "mode": "runOnce"
}
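The same POST issued with curl (it assumes SMILA on localhost:8080):

```shell
# Start the crawlSmilaWiki job in runOnce mode.
URL="http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki"
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"mode": "runOnce"}' "$URL" || echo "SMILA not reachable"
```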

This will now start the job crawlSmilaWiki which crawls the SMILA Eclipsepedia starting from http://wiki.eclipse.org/SMILA and following only links with the same prefix.

All pages that have the specified prefix will be pushed to the importing job.

If you like, you can monitor these two jobs with your REST client at the following URIs:

http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki
http://localhost:8080/smila/jobmanager/jobs/indexUpdate

Or just in one overview at http://localhost:8080/smila/jobmanager/jobs

Crawling the wiki pages should take some time. Once all pages are processed, the status of the crawlSmilaWiki job run changes to SUCCEEDED. You can then have a look at SMILA's search page to find out whether some of the pages have already made their way into the Solr index.
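Waiting for that state change can also be scripted. A polling sketch (a real crawl may need many more attempts and a longer sleep interval):

```shell
# Poll the crawl job's monitoring URL until a run reports SUCCEEDED.
URL="http://localhost:8080/smila/jobmanager/jobs/crawlSmilaWiki"
ATTEMPTS=3
for i in $(seq "$ATTEMPTS"); do
  # grep for the final state in the JSON monitoring response.
  if curl -s "$URL" | grep -q "SUCCEEDED"; then
    echo "crawl finished"
    break
  fi
  sleep 2
done
```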

You can find more information about importing here.

Search the index

Note: Since SMILA uses Solr's autocommit (configured in solrconfig.xml to a period of 60 seconds or 1000 documents, whichever comes first), it might take some time until you retrieve results.


To search the index which was created by the crawlers, point your browser to http://localhost:8080/SMILA/search. There are currently two stylesheets from which you can select by clicking the respective links in the upper left corner of the header bar: The Default stylesheet shows a reduced search form with text fields like Query, Result Size, and Index Name, adequate to query the full-text content of the indexed documents. The Advanced stylesheet in turn provides a more detailed search form with text fields for meta-data search like for example Path, MimeType, Filename, and other document attributes.

Now, let's try the Default stylesheet and enter our first simple search using a word that you expect to be contained in the indexed pages. Enter the desired term into the Query text field and click OK to send your query to SMILA. You should see some results.

Now, let's use the Advanced stylesheet and search for the name of one of the indexed documents to check whether it was properly indexed. Click Advanced to switch to the detailed search form, enter the desired name into the Filename text field, then click OK to submit your search.

Stop indexing job run

Although there's no need for it, we can now finish our previously started indexing job run via the REST client (replace <job-id> with the job id you got when you started the job run):

POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>/finish  

You can monitor the job run via your browser to see that it finished successfully:

http://localhost:8080/smila/jobmanager/jobs/indexUpdate/<job-id>

In the SMILA.log file you will see messages like this:

 INFO ... internal.JobRunEngineImpl   - finish called for job 'indexUpdate', run '20110901-141457584011'
 ...
 INFO ... internal.JobRunEngineImpl   - Completing job run '20110901-141457584011' for job 'indexUpdate' with final state SUCCEEDED
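The finish call with curl looks like this (the id below is the example run id from the log excerpt above; substitute the jobId you received when starting the run):

```shell
# Finish the running indexUpdate job run.
JOB_ID="20110901-141457584011"   # example id; use your own job run id
curl -s -X POST "http://localhost:8080/smila/jobmanager/jobs/indexUpdate/$JOB_ID/finish" \
  || echo "SMILA not reachable"
```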

Congratulations, you've just crawled the SMILA Eclipsepedia, indexed the pages, and searched through them. For more, just visit the SMILA Manual.

Further steps

Crawl the filesystem

SMILA also has a predefined job to crawl the file system, but you will have to either adapt the predefined job to point it to a valid folder in your file system or create your own job.

We will settle for the second option, because it does not require you to stop and restart SMILA.

Create your Job

POST the following job description to SMILA's job API at http://localhost:8080/smila/jobmanager/jobs. The name is just an example, as is the rootFolder, which you should set to an existing folder on your machine where some data files (e.g. plain text or HTML files) reside.

{
  "name":"crawlFilesAtData",
  "workflow":"fileCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"file",
    "rootFolder":"/data",
    "jobToPushTo":"indexUpdate"
  }
}
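Posting this description with curl (it assumes SMILA on localhost:8080; adapt rootFolder to an existing directory on your machine first):

```shell
# Create the crawlFilesAtData job via SMILA's job API.
JOBS_API="http://localhost:8080/smila/jobmanager/jobs"
curl -s -X POST -H "Content-Type: application/json" -d '{
  "name": "crawlFilesAtData",
  "workflow": "fileCrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "file",
    "rootFolder": "/data",
    "jobToPushTo": "indexUpdate"
  }
}' "$JOBS_API" || echo "SMILA not reachable"
```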

For file types other than plain text and HTML you cannot search inside the document's content (at least not yet, but you might have a look at the Aperture Pipelet, which addresses this problem).

Start your jobs

You have to start the "indexUpdate" job (see Start indexing job run), if you have already stopped it. If it is still running, that's fine.

Now start your crawlFilesAtData job in the same way as described in Start the crawler, but use the new job's name crawlFilesAtData instead of crawlSmilaWiki.

This new job behaves just like the web crawl, but its run time may be shorter, depending on how much data actually resides in your rootFolder.

Search for your new data

After the job run has finished, wait a bit and then check whether the data has been indexed (see Search the index).

It is also a good idea to check the log file again for errors.

5 more minutes to change the workflow

5 more minutes to change the workflow shows how you can change the workflow with which the crawled data is processed.
