= Just another 5 minutes to change the workflow =

In the [[SMILA/5_Minutes_Tutorial|5 minutes tutorial]] all data collected by crawlers was processed with the same asynchronous "indexUpdate" workflow using the script "add.js". All data was indexed into the same solr/lucene index "DefaultCore".
It is possible, however, to configure SMILA so that data from different data sources will go through different workflows and pipelines and will be indexed into different indices. This requires more advanced configuration features than before, but still quite simple ones.
  
In the following sections we are going to use the generic asynchronous "indexWithScript" workflow, which lets you specify the JavaScript script that processes the data. We create an additional script for webcrawler records so that webcrawler data will be indexed into a separate search index named "webCollection".
  
 
== Configure a new Solr index ==
 
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
|
Please shutdown SMILA now if it's still running.
|}
  
To configure your own search index "webCollection", copy the <tt>collection1</tt> configuration folder (see <tt>SMILA/configuration/org.eclipse.smila.solr/solr_home</tt>) to a new folder named <tt>webCollection</tt> in the same directory, delete the <tt>data</tt> folder in the copy, and adapt the <tt>core.properties</tt> file.

Afterwards add your new core to the file <tt>SMILA.application/configuration/org.eclipse.smila.solr/solr-config.json</tt>:

<source lang="javascript">
{
    "mode":"embedded",
    "idFields":{
        "collection1":"_recordid",
        "webCollection":"_recordid"
    },
    ...
}
</source>
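
For example, the <tt>core.properties</tt> file in the copied <tt>webCollection</tt> folder then only needs the core name adapted. A minimal sketch (your copied file may contain additional properties, which can stay as they are):

<pre>
# solr_home/webCollection/core.properties
name=webCollection
</pre>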
  
 
{|width="100%" style="background-color:#d8e4f1; padding-left:30px;"
|
Please restart SMILA now.
|}
  
'''Further information:''' For more information about Solr indexing, please see the [[SMILA/Documentation/Solr_4.x|SMILA Solr 4.x documentation]].
  
== Create a new indexing script ==
  
We need to add a new script for adding the imported webcrawler records.
Predefined scripts are contained in the <tt>configuration/org.eclipse.smila.scripting/js</tt> directory. We can add new scripts by simply placing them there.
  
Copy the script "add.js", name the copy "addWeb.js", and change the index name "indexname" in it from "collection1" to "webCollection":
  
 
<pre>
   ...
   var solrIndexPipelet = pipelets.create("org.eclipse.smila.solr.update.SolrUpdatePipelet", {
      "indexname" : "webCollection",
      ...
</pre>
  
'''Further information:''' For more information about scripting, please check the [[SMILA/Documentation/Scripting|Scripting]] documentation.
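
To verify that SMILA has picked up the new script, you can query the scripting REST API (listed in the configuration overview below). As a sketch, assuming a plain GET returns the available scripts:

<pre>
GET http://localhost:8080/smila/script
</pre>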
  
 
== Create and start a new indexing job ==
 
  
We define a new indexing job based on the predefined asynchronous workflow "indexWithScript" (see <tt>SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json</tt>). This indexing job will process the imported data by using our new script "addWeb.js".
  
The "importToPipeline" workflow contains a [[SMILA/Documentation/Worker/PipelineProcessorWorker|PipelineProcessorWorker worker]] which is not configured for dedicated BPEL pipelines, so the BPEL pipelines handling adds and deletes have to be set via job parameter.  
+
The "indexWithScript" workflow contains a [[SMILA/Documentation/Worker/ScriptProcessorWorker|ScriptProcessorWorker worker]] which is not configured for a dedicated script, so the scripts handling adds and deletes have to be set via job parameter.  
  
 
Use your favourite REST Client to create an appropriate job definition:

<pre>
POST http://localhost:8080/smila/jobmanager/jobs/
  {
    "name":"indexWebJob",
    "parameters":{
      "tempStore": "temp",
      "addScript": "addWeb",
      "deleteScript": "delete"
    },
    "workflow":"indexWithScript"
  }
</pre>
  
Note that the "DeletePipeline" is not needed for our test szenario here, but we must fulfill all undefined workflow parameters.
+
Notes:
 +
* the "deleteScript" is not needed for our test scenario here, but we must fulfill all undefined workflow parameters.
 +
* in the add and the delete script we use the standard function ("process"), so we don't have to set/change this via parameter.
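
For orientation, the overall shape of such a script might look like the following. This is a rough sketch only: the <tt>pipelets.create</tt> call and the "process" entry function are taken from the excerpt and notes above, while the exact call that hands the record to the pipelet is an assumption here, so check the predefined "add.js" for the actual code:

<pre>
// addWeb.js - sketch, not the verbatim script
var solrIndexPipelet = pipelets.create("org.eclipse.smila.solr.update.SolrUpdatePipelet", {
    "indexname" : "webCollection"
});

// standard entry function called for each record by the ScriptProcessorWorker
function process(record) {
    // assumption: the pipelet processes the record and returns the result
    return solrIndexPipelet.process(record);
}
</pre>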
  
 
Afterwards, start a job run for the defined job:

<pre>
POST http://localhost:8080/smila/jobmanager/jobs/indexWebJob
</pre>
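
To check whether the job run was started successfully, you can query the job via the jobmanager REST API (the response format is described in the [[SMILA/Documentation/JobManager|JobManager]] documentation):

<pre>
GET http://localhost:8080/smila/jobmanager/jobs/indexWebJob
</pre>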
  
== Create a new web crawl job ==
  
Since the predefined web crawl job pushes the crawled records to the <tt>indexUpdate</tt> job, we create a new crawl job here that uses our new indexing job.
  
POST the following job definition using your favourite REST client:

<pre>
POST http://localhost:8080/smila/jobmanager/jobs/
{
  "name":"crawlWikiToWebCore",
  "workflow":"webCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"web",
    "jobToPushTo":"indexWebJob",
    "startUrl":"http://wiki.eclipse.org/SMILA",
    "linksPerBulk": 100,
    "filters":{
      "urlPatterns": {
        "include": ["http://wiki\\.eclipse\\.org/SMILA.*",
            "http://wiki\\.eclipse\\.org/Image:.*",
            "http://wiki\\.eclipse\\.org/images/.*"],
        "exclude": [".*\\?.*",
            "http://wiki\\.eclipse\\.org/images/archive/.*",
            ".*\\.java"]
      }
    },
    "mapping": {
      "httpCharset": "Charset",
      "httpContenttype": "ContentType",
      "httpLastModified": "LastModifiedDate",
      "httpMimetype": "MimeType",
      "httpSize": "Size",
      "httpUrl": "Url",
      "httpContent": "Content"
    }
  }
}
</pre>
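
The "urlPatterns" entries are regular expressions: a discovered link is followed only if it matches at least one "include" pattern and no "exclude" pattern. For example, with the patterns above:

<pre>
http://wiki.eclipse.org/SMILA/Documentation    -> crawled (matches an include pattern)
http://wiki.eclipse.org/SMILA?action=edit      -> skipped (matches the exclude pattern .*\?.*)
http://wiki.eclipse.org/images/archive/x.png   -> skipped (matches the archive exclude pattern)
</pre>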

Please note that we used the following line to let the crawl job push the records to our new indexing job:
 
<pre>
"jobToPushTo":"indexWebJob"
</pre>
  
Now start the crawl job:
 
<pre>
POST http://localhost:8080/smila/jobmanager/jobs/crawlWikiToWebCore
</pre>
  
After some time for crawling, processing, and committing the data, you can have another look at the [http://localhost:8080/SMILA/search SMILA search page] to find your new core listed among the available cores, and if you choose it, you can search for e.g. "SMILA" in the new webCollection.
  
 
== Put it all together ==
  
Ok, now we have finally finished configuring SMILA to use separate scripts for file system and web crawling and to index the data from these crawlers into different indices.
 
Here is what we have done so far:
# We added the <tt>webCollection</tt> index to the Solr configuration.
# We created a new JavaScript script for web crawler data referencing the new Solr index.
# We used a separate job for web indexing that references the new script.
# We used a separate web crawl job to push the records to the new indexing job.
  
 
= Configuration overview =

SMILA configuration files are located in the <tt>configuration</tt> directory of the SMILA application. The following lists the configuration files and documentation links relevant to this tutorial, grouped by SMILA component:

'''Jobmanager'''
 
* configuration folder: <tt>org.eclipse.smila.jobmanager</tt>
** <tt>workflows.json</tt> (Predefined asynchronous workflows)
** <tt>jobs.json</tt> (Predefined jobs)
* Documentation
** [[SMILA/Documentation/JobManager|JobManager]]
* REST API: http://localhost:8080/smila/jobmanager
  

'''Scripting'''
* configuration folder: <tt>org.eclipse.smila.scripting</tt>
** <tt>js/</tt> (Predefined JavaScript scripts)
* Documentation
** [[SMILA/Documentation/Scripting|Scripting]]
** [[SMILA/Documentation/Scripting#ScriptProcessorWorker|ScriptProcessorWorker]]
* REST API: http://localhost:8080/smila/script
  
 
'''Solr'''
* configuration folder: <tt>org.eclipse.smila.solr</tt>
* Documentation
** [[SMILA/Documentation/Solr_4.x]]
