
SMILA/5 Minutes Tutorial


This page contains installation instructions for the SMILA application and helps you with your first steps in SMILA. Note that both this tutorial and the SMILA application itself occasionally use the abbreviation EILF, which was the former name of the SMILA project.

1. Download and unpack the SMILA application.

Save-and-extract.png


2. Start the SMILA engine.

To start the SMILA engine, open a terminal, navigate to the directory that contains the extracted files, and run the SMILA (EILF) executable. Wait until the engine is fully started. If everything is OK, you should see output similar to that shown in the following screenshot:

Start-engine.png
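
For example, starting the engine on Linux might look like this (a minimal sketch; the extraction directory ~/smila is hypothetical, and the executable is named EILF in this release, EILF.exe on Windows):

# change into the directory the archive was extracted to, then start the engine
cd ~/smila
./EILF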


3. Check the log file.

You can check what's happening in the background by opening the SMILA log file in an editor. This file is named SMILA.log (EILF.log) and can be found in the same directory as the SMILA executable.

Log-file.png
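
If you prefer to watch the log from a terminal instead of an editor, you can follow it while the engine runs (assuming a Unix-like system):

# print new log entries as they are written
tail -f EILF.log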

4. Configure the crawling jobs.

Now that the SMILA engine is up and running, we can start the crawling jobs. Crawling jobs are managed over JMX, which means we can connect to SMILA with a JMX client of our choice. We will use JConsole for this purpose since it is included in the Sun Java distribution by default.

Start the JConsole executable from your JDK distribution. Once the client is up and running, select the SMILA process (PID) in the Connect window and click Connect.

Jconsole-connect.png
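
JConsole can also be started from a terminal; it either lists the local Java processes to pick from or attaches to a PID passed on the command line (the PID below is hypothetical, use the one shown in your Connect window):

# launch JConsole and pick the SMILA process interactively
jconsole
# or attach to a known process directly
jconsole 12345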

Next, switch to the MBeans tab, expand the SMILA (EILF) node in the MBeans tree on the left side of the window, and click the org.eclipse.eilf.connectivity.framework.CrawlerController node. This node is used to manage and monitor all crawling activities. Click a sub node to see its crawling attributes in the right pane.

Mbeans-overview.png


5. Start the file system crawler.

To start a file system crawler, open the Operations tab on the right pane, type "file" into the text field next to the startCrawl button and click the button.

Start-file-crawl.png

You should receive a message similar to the following, indicating that the crawler has been successfully started:

Start-crawl-file-result.png

Now we can check the log file to see what happened:

File-crawl-log.png

6. Configure the file system crawler.

You may have already noticed the following error message in your log output after starting the file system crawler:

2008-09-11 18:14:36,059 [Thread-13] ERROR impl.CrawlThread - org.eclipse.eilf.connectivity.framework.CrawlerCriticalException: Folder "c:\data" is not found

The error message states that the crawler tried to index the folder c:\data but was not able to find it. To solve this, let's create a folder with sample data, say ~/tmp/data, put some dummy text files into it, and configure the file system crawler to index it (see the sketch below). To point the crawler at the new directory instead of c:\data, open the crawler's configuration file at configuration/org.eclipse.eilf.connectivity.framework/file and set the value of the BaseDir attribute to an absolute path that points to your sample directory. Don't forget to save the file.
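
Creating the sample data from a terminal might look like this (a sketch for a Unix-like system; the file names match the ones used in step 7 of this tutorial, and sample.txt deliberately contains the word "data"):

# create the sample directory and two dummy text files
mkdir -p ~/tmp/data
echo "some sample data" > ~/tmp/data/sample.txt
echo "more sample data" > ~/tmp/data/"file 1.txt"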

File-crawl-config-with-data.png

Now start the file system crawler with JConsole again (see step 5). This time there should be something interesting in the log file:

Filesystem-crawler-in-work.png

It looks like something was indexed. In the next step we'll try to search on the index that was created.

7. Search on the indices.

To search on the indices that were created by the crawlers, point your browser to http://localhost:8080/AnyFinder/SearchForm. The names of all available indices are listed in the left column below the Indexlist header. Currently, there should be only one index in the list. Click its name to open the search form:

Eilf-search-form.png
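
As a quick sanity check from the terminal that the search form is being served, fetching the page with curl (assuming it is installed) should return HTML:

# request the search form and show the first lines of the response
curl -s http://localhost:8080/AnyFinder/SearchForm | head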

Now let's search for a word that you know occurs in your dummy files. In this tutorial, we know that the word "data" occurs in a file named sample.txt.

Searching-for-text-in-file.png

There was also a file named file 1.txt in the sample folder. Let's check whether it was indexed. Type "1.txt" in the Filename field and click the search icon again:

Searching-by-filename.png

8. Configure and run the web crawler.

Now that we know how to start and configure the file system crawler and how to search on indices, configuring and running the web crawler is straightforward. The web crawler's configuration file is named web and is located in the configuration/org.eclipse.eilf.connectivity.framework directory:

Webcrawler-config.png

By default, the web crawler is configured to index the website http://www.brox.de. To change this, open the file in an editor of your choice and set the content of the <Seed> element to the desired website. Detailed information on the configuration of the web crawler is available on the Web crawler configuration page.
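
For illustration, the relevant element might look like this after the change (a sketch only; the surrounding elements of the configuration file are omitted, and http://www.example.com stands in for your desired site):

<!-- replace the default seed http://www.brox.de with your own site -->
<Seed>http://www.example.com</Seed>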

To start the crawling process, save the configuration file, open the Operations tab in JConsole again, type "web" into the text field next to the startCrawl button, and click the button.

Starting-web-crawler.png

Note that the Operations tab in JConsole also provides buttons to stop a crawler, list the active crawlers, and retrieve the current status of a particular crawling job. As an example, the following screenshot shows the result of clicking the getActiveCrawlsStatus button while the web crawler is running:

One-active-crawl-found.png

When the web crawler's job is finished, you can search the generated index just as described above for the file system crawler (see step 7).

Webcrawler-index-search.png
