SMILA/5 Minutes Tutorial


This page contains installation instructions for the SMILA application and helps you with your first steps in SMILA. Please note that in this tutorial as well as in the SMILA application you may sometimes come across the abbreviation EILF, which refers to the former name of the SMILA project.

1. Download and unpack the SMILA application.

Save-and-extract.png


2. Start the SMILA engine.

To start the SMILA engine, open a terminal, navigate to the directory that contains the extracted files, and run the SMILA (EILF) executable. Wait until the engine is fully started. If everything is OK, you should see output similar to the one on the following screenshot:

Start-engine.png


3. Check the log file.

You can check what's happening in the background by opening the SMILA log file in an editor. This file is named SMILA.log (EILF.log) and can be found in the same directory as the SMILA executable.

Log-file.png

4. Manage the crawling jobs.

Now that the SMILA engine is up and running, we can start the crawling jobs. Crawling jobs are managed via JMX, which means that we can connect to SMILA with a JMX client of our choice. We will use JConsole for this purpose since it ships with the Sun Java distribution.

Start the JConsole executable from your JDK distribution. When the client is up and running, select the PID of the SMILA process in the Connect window and click Connect.

Jconsole-connect.png

Next, switch to the MBeans tab, expand the SMILA (EILF) node in the MBeans tree on the left side of the window, and click the org.eclipse.eilf.connectivity.framework.CrawlerController node. This node is used to manage and monitor all crawling activities. Click a sub node and find the crawling attributes on the right pane.

Mbeans-overview.png


5. Start the file system crawler.

To start a file system crawler, open the Operations tab on the right pane, type "file" into the text field next to the startCrawl button and click the button.

Start-file-crawl.png

You should receive a message similar to the following, indicating that the crawler has been successfully started:

Start-crawl-file-result.png

Now we can check the log file to see what happened:

File-crawl-log.png
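
If you would rather script this step than click through JConsole, the startCrawl operation can also be invoked with the standard javax.management API. The following is a minimal sketch, not official SMILA client code: the JMX service URL and the MBean object name are placeholders, so replace them with the values of your installation (JConsole shows the exact object name of the CrawlerController node in the MBeans tree).

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StartFileCrawl {
    public static void main(String[] args) throws Exception {
        // Placeholder address: use the host/port your SMILA (EILF)
        // instance actually exposes for JMX.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9004/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans =
                    connector.getMBeanServerConnection();
            // Placeholder object name: copy the exact name of the
            // CrawlerController node from JConsole's MBeans tab.
            ObjectName controller =
                    new ObjectName("EILF:service=CrawlerController");
            // Equivalent to typing "file" and clicking startCrawl.
            Object result = mbeans.invoke(controller, "startCrawl",
                    new Object[] { "file" },
                    new String[] { String.class.getName() });
            System.out.println(result);
        } finally {
            connector.close();
        }
    }
}

Passing "web" instead of "file" starts the web crawler (see step 8) in the same way.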

6. Configure the file system crawler.

You may have already noticed the following error message in the log output after starting the file system crawler:

2008-09-11 18:14:36,059 [Thread-13] ERROR impl.CrawlThread - org.eclipse.eilf.connectivity.framework.CrawlerCriticalException: Folder "c:\data" is not found

The error message above states that the crawler tried to index a folder at c:\data but was not able to find it. To solve this, let's create a folder with sample data, say ~/tmp/data, put some dummy text files into it, and configure the file system crawler to index it. To point the crawler at the new directory instead of c:\data, open its configuration file at configuration/org.eclipse.eilf.connectivity.framework/file. Modify the BaseDir attribute by setting its value to an absolute path that points to your sample directory. Don't forget to save the file.

File-crawl-config-with-data.png
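
If you want to script the sample-data part of this step, the sketch below creates ~/tmp/data with the two dummy files used later in this tutorial: sample.txt containing the word "data", and file 1.txt (both are searched for in step 7). The file contents are just examples.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class CreateSampleData {
    public static void main(String[] args) throws IOException {
        // Sample directory from this step of the tutorial.
        File dir = new File(System.getProperty("user.home"), "tmp/data");
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Could not create " + dir);
        }
        // Two dummy files; step 7 searches for the word "data" in
        // sample.txt and for the filename "file 1.txt".
        write(new File(dir, "sample.txt"), "Some sample data to index.");
        write(new File(dir, "file 1.txt"), "Another dummy text file.");
        System.out.println("Sample data created in " + dir.getAbsolutePath());
    }

    private static void write(File file, String content) throws IOException {
        FileWriter out = new FileWriter(file);
        try {
            out.write(content);
        } finally {
            out.close();
        }
    }
}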

Now start the file system crawler with JConsole again (see step 5). This time there should be something interesting in the log file:

Filesystem-crawler-in-work.png

It looks like something was indexed. In the next step we'll try to search on the index that was created.

7. Search on the indices.

To search on the indices that were created by the crawlers, point your browser to http://localhost:8080/AnyFinder/SearchForm. The names of all available indices appear in the left column below the Indexlist header. Currently, there should be only one index in the list. Click its name to open the search form:

Eilf-search-form.png
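
As a side note, the search form is served over plain HTTP, so you can verify from code that it is reachable. The sketch below only fetches the HTML of the form page using the URL from this step; the searches themselves are submitted through the browser form as described next.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CheckSearchForm {
    public static void main(String[] args) throws Exception {
        // URL taken from this step of the tutorial.
        URL url = new URL("http://localhost:8080/AnyFinder/SearchForm");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("HTTP " + conn.getResponseCode());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // raw HTML of the search form
        }
        in.close();
    }
}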

Now let's try to search for a word that you know occurs in your dummy files. In this tutorial, we know that the word "data" appears in a file named sample.txt.

Searching-for-text-in-file.png

There was also a file named file 1.txt in the sample folder. Let's check whether it was indexed. Type "1.txt" in the Filename field and click the search icon again:

Searching-by-filename.png

8. Configure and run the web crawler.

Now that we know how to start and configure the file system crawler and how to search on the indices, configuring and running the web crawler is straightforward. The configuration file of the web crawler is located in the configuration/org.eclipse.eilf.connectivity.framework directory and is named web:

Webcrawler-config.png

By default, the web crawler is configured to index the website http://www.brox.de. To change this, open the file in an editor of your choice and set the content of the <Seed> element to the desired website. Detailed information on the configuration of the web crawler is also available on the Web crawler configuration page (SMILA/Documentation/Web_Crawler).

To start the crawling process, save the configuration file, open the Operations tab in JConsole again, type "web" into the text field next to the startCrawl button, and click the button.

Starting-web-crawler.png

Note that the Operations tab in JConsole also provides buttons to stop a crawler, to get the list of active crawlers, and to get the current status of a particular crawling job. As an example, the following screenshot shows the result of clicking the getActiveCrawlsStatus button while the web crawler is running:

One-active-crawl-found.png
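
The same status query can be issued programmatically, along the lines of the sketch in step 5. Again the service URL and object name are placeholders, and the no-argument signature of getActiveCrawlsStatus is an assumption based on the JConsole button:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CrawlStatus {
    public static void main(String[] args) throws Exception {
        // Placeholder address, as in the step 5 sketch.
        JMXConnector connector = JMXConnectorFactory.connect(
                new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://localhost:9004/jmxrmi"));
        try {
            MBeanServerConnection mbeans =
                    connector.getMBeanServerConnection();
            // Placeholder object name; copy the exact one from JConsole.
            ObjectName controller =
                    new ObjectName("EILF:service=CrawlerController");
            // Same operation as the getActiveCrawlsStatus button.
            Object status = mbeans.invoke(controller,
                    "getActiveCrawlsStatus", new Object[0], new String[0]);
            System.out.println(status);
        } finally {
            connector.close();
        }
    }
}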

When the web crawler's job is finished, you can search on the generated index just as described above for the file system crawler (see step 7).

Webcrawler-index-search.png
