Just another 5 minutes to change the workflow

In the 5 minutes tutorial, all data collected by the crawlers was processed with the same asynchronous "indexUpdate" workflow using the script "add.js", and all data was indexed into the same Solr/Lucene index, "DefaultCore". It is possible, however, to configure SMILA so that data from different data sources goes through different workflows and scripts and is indexed into different indices. This requires slightly more advanced configuration features than before, but still quite simple ones.

In the following sections we are going to use the generic asynchronous "indexWithScript" workflow, which lets you specify the JavaScript script that processes the data. We will create an additional script for webcrawler records so that webcrawler data is indexed into a separate index named "WebCore".
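To keep an overview, this is the data flow we are going to set up, using the names defined in the following sections:

    web crawl job "crawlWikiToWebCore" (workflow "webCrawling")
      --> indexing job "indexWebJob" (workflow "indexWithScript")
      --> script "addWeb.js"
      --> Solr index "WebCore"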

Configure a new Solr index

Please shut down SMILA now if it is still running.

To configure your own index "WebCore", follow the description in the SMILA documentation for creating your own Solr index.
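In essence, this means copying the "DefaultCore" core directory to a new "WebCore" directory and registering the new core in solr.xml. A minimal sketch of the resulting solr.xml, assuming the legacy <cores> layout of the bundled Solr (the attribute values shown are illustrative, not taken from this page):

    <solr persistent="true">
      <cores adminPath="/admin/cores" defaultCoreName="DefaultCore">
        <core name="DefaultCore" instanceDir="DefaultCore" />
        <!-- the new core for webcrawler records -->
        <core name="WebCore" instanceDir="WebCore" />
      </cores>
    </solr>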

If you have already started SMILA before (as we assume you have), please copy your new core configuration and the modified solr.xml file to the folder workspace\.metadata\.plugins\org.eclipse.smila.solr, because the configuration will not be copied again after the first start of the Solr bundle.
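On Windows, this copy step might look like the following sketch; the source folder configuration\org.eclipse.smila.solr is an assumption, so adjust both paths to your installation:

    REM copy the modified solr.xml into the workspace (paths are assumptions)
    copy configuration\org.eclipse.smila.solr\solr.xml workspace\.metadata\.plugins\org.eclipse.smila.solr\
    REM copy the new core configuration directory
    xcopy /E /I configuration\org.eclipse.smila.solr\WebCore workspace\.metadata\.plugins\org.eclipse.smila.solr\WebCore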

Please restart SMILA now.

Further information: For more information about Solr indexing, please see the SMILA Solr documentation.

Create a new indexing script

We need to add a new script for adding the imported webcrawler records. Predefined scripts are located in the configuration/org.eclipse.smila.scripting/js directory; new scripts can be added simply by placing them there.

Copy the script file "add.js", name the copy "addWeb.js", and change the Solr "CoreName" in it from "DefaultCore" to "WebCore":

    ...
    var solrIndexPipelet = pipelets.create("org.eclipse.smila.solr.index.SolrIndexPipelet", {
      "ExecutionMode" : "ADD",
      "CoreName" : "WebCore",
      ...

Further information: For more information about scripting, please check the Scripting documentation.

Create and start a new indexing job

We define an indexing job based on the predefined asynchronous workflow "indexWithScript" (see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json). This indexing job will process the imported data using our new script "addWeb.js".

The "importToPipeline" workflow contains a PipelineProcessorWorker worker which is not configured for dedicated BPEL pipelines, so the BPEL pipelines handling adds and deletes have to be set via job parameter.

Use your favourite REST client to create an appropriate job definition (the parameter names correspond to the placeholders in the workflow definition in workflows.json):

POST http://localhost:8080/smila/jobmanager/jobs/
  {
    "name":"indexWebJob",
    "parameters":{
      "tempStore": "temp",
      "addScript": "addWeb.js",
      "deleteScript": "delete.js"
    },
    "workflow":"indexWithScript"
  }

Note that the "DeletePipeline" is not needed for our test szenario here, but we must fulfill all undefined workflow parameters.

Afterwards, start a job run for the defined job:

POST http://localhost:8080/smila/jobmanager/jobs/indexWebJob
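The POST response should contain the id of the new job run. You can monitor the job and the state of its runs at any time by reading the job resource back; the exact response fields may vary between SMILA versions:

    GET http://localhost:8080/smila/jobmanager/jobs/indexWebJob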

Update the web crawl job

Since the predefined web crawl job pushes the crawled records to the "indexUpdate" job, we must now either define a new crawl job or update the existing job's definition in the jobs.json file. Here we choose the new job option.

POST the following job definition using your favourite REST client:

POST http://localhost:8080/smila/jobmanager/jobs/
{
  "name":"crawlWikiToWebCore",
  "workflow":"webCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"web",
    "startUrl":"http://wiki.eclipse.org/SMILA",
    "filter":{
      "urlPrefix":"http://wiki.eclipse.org/SMILA"
    },
    "jobToPushTo":"indexWebJob"
  }
}

Please note that we used the following line to let the crawl job push the records to our new job:

"jobToPushTo":"indexWebJob"

Now start the crawl job (don't forget runOnce!):

POST http://localhost:8080/smila/jobmanager/jobs/crawlWikiToWebCore
{
  "mode": "runOnce"
}
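Crawling takes a while. You can watch the crawl run's progress in the same way as for the indexing job and wait until the run is finished (again, response fields may vary between versions):

    GET http://localhost:8080/smila/jobmanager/jobs/crawlWikiToWebCore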

After allowing enough time for the data to be crawled, processed, and committed, have another look at the SMILA search page: the new core should be listed among the available cores, and if you select it, you can search the new WebCore, e.g. for "SMILA".
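If the search page is no longer open, it can be reached at the URL used in the 5 minutes tutorial; adjust host and port if your setup differs:

    http://localhost:8080/SMILA/search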

Put it all together

OK, it seems we have finally finished configuring SMILA to process file system and web crawl data with separate scripts and to index the data from these crawlers into different indices. Here is what we have done so far:

  1. We added the WebCore index to the Solr configuration and copied it to the workspace.
  2. We created a new indexing script "addWeb.js" for webcrawler records referencing the new Solr core.
  3. We created a separate indexing job "indexWebJob" for web indexing that references the new script.
  4. We defined a new web crawl job "crawlWikiToWebCore" that pushes its records to the new indexing job.
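For reference, here is the complete sequence of REST calls from this tutorial:

    POST http://localhost:8080/smila/jobmanager/jobs/                    (create "indexWebJob")
    POST http://localhost:8080/smila/jobmanager/jobs/indexWebJob         (start the indexing job)
    POST http://localhost:8080/smila/jobmanager/jobs/                    (create "crawlWikiToWebCore")
    POST http://localhost:8080/smila/jobmanager/jobs/crawlWikiToWebCore  (start the crawl job with "runOnce")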

Configuration overview

SMILA configuration files are located in the configuration directory of the SMILA application. The following lists the configuration files and documentation links relevant to this tutorial, grouped by SMILA component:

Jobmanager

Scripting

Solr
