
Revision as of 06:45, 24 January 2012 by Andreas.schank.attensity.com (Talk | contribs) (New page: Category:SMILA Category:HowTo = Just another 5 minutes to change the workflow = In the 5 minutes to success all data collected b...)



Just another 5 minutes to change the workflow

In "5 minutes to success", all data collected by the crawlers was processed by the same asynchronous "indexUpdate" workflow, which uses the BPEL pipeline "AddPipeline", and was indexed into the same Solr/Lucene index, "DefaultCore". However, SMILA can be configured so that data from different data sources passes through different workflows and pipelines and is indexed into different indices. This requires somewhat more advanced configuration features than before, but they are still quite simple.

In the following sections we use the generic asynchronous "importToPipeline" workflow, which lets you specify the BPEL pipeline that processes the data. We create an additional BPEL pipeline for web crawler records so that web crawler data is indexed into a separate index named "WebCore".

Configure a new Solr index

It is very important to shut down and restart the SMILA engine after making the following configuration changes, because modified configurations are loaded only during startup.

To configure your own index "WebCore", follow the description in the SMILA documentation for creating your own Solr index.

We need a new index field "Url" in our index schema, so:

  • open SMILA/configuration/org.eclipse.smila.solr/WebCore/conf/schema.xml
  • add field "Url" to the index fields:
 ...
 <fields>
    ...
    <field name="Url" type="text_path" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true" />
    ...
 </fields> 
 ...
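
Before restarting SMILA, it can help to confirm that the edited schema is still well-formed XML and really declares the new field. The following is a minimal sketch, not part of SMILA: the helper name `has_field` and the shortened sample schema (including its `name` and `version` attributes) are assumptions for illustration. In practice you would read the text of SMILA/configuration/org.eclipse.smila.solr/WebCore/conf/schema.xml instead of the sample string.

```python
import xml.etree.ElementTree as ET


def has_field(schema_xml: str, field_name: str) -> bool:
    """Return True if the <fields> section of a Solr schema declares field_name."""
    # ET.fromstring raises ParseError if the edited file is no longer well-formed.
    root = ET.fromstring(schema_xml)
    return any(f.get("name") == field_name for f in root.findall("./fields/field"))


# Shortened stand-in for WebCore/conf/schema.xml (hypothetical, illustration only).
SAMPLE = """<schema name="WebCore" version="1.4">
  <fields>
    <field name="Url" type="text_path" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true" />
  </fields>
</schema>"""

print(has_field(SAMPLE, "Url"))
```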

For more information about Solr indexing, see the SMILA Solr documentation.

Create a new BPEL pipeline

We need to add the AddWebPipeline pipeline to the BPEL WorkflowProcessor. For more information about the BPEL WorkflowProcessor, see the BPEL WorkflowProcessor documentation. Predefined BPEL WorkflowProcessor configuration files are contained in the configuration/org.eclipse.smila.processing.bpel/pipelines directory. However, we can also add new BPEL pipelines with the SMILA REST API.

Start SMILA if it is not yet running, and use your favourite REST client to add the "AddWebPipeline" BPEL pipeline. The BPEL XML is hard to read because it has to be escaped to be valid JSON content; after posting the new pipeline you can get a readable version via the monitoring REST API (see below).

POST http://localhost:8080/smila/pipeline
  {
    "name":"AddWebPipeline",
    "definition":"<?xml version=\"1.0\" encoding=\"utf-8\" ?>\r\n<process name=\"AddWebPipeline\" targetNamespace=\"http://www.eclipse.org/smila/processor\"\r\n  xmlns=\"http://docs.oasis-open.org/wsbpel/2.0/process/executable\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"\r\n  xmlns:proc=\"http://www.eclipse.org/smila/processor\" xmlns:rec=\"http://www.eclipse.org/smila/record\"\r\n  xmlns:bpel=\"http://docs.oasis-open.org/wsbpel/2.0/process/executable\">\r\n\r\n  <import location=\"processor.wsdl\" namespace=\"http://www.eclipse.org/smila/processor\"\r\n    importType=\"http://schemas.xmlsoap.org/wsdl/\" />\r\n\r\n  <partnerLinks>\r\n    <partnerLink name=\"Pipeline\" partnerLinkType=\"proc:ProcessorPartnerLinkType\" myRole=\"service\" />\r\n  </partnerLinks>\r\n\r\n  <extensions>\r\n    <extension namespace=\"http://www.eclipse.org/smila/processor\" mustUnderstand=\"no\" />\r\n  </extensions>\r\n\r\n  <variables>\r\n    <variable name=\"request\" messageType=\"proc:ProcessorMessage\" />\r\n  </variables>\r\n\r\n  <sequence name=\"AddWebPipeline\">\r\n    <receive name=\"start\" partnerLink=\"Pipeline\" portType=\"proc:ProcessorPortType\"\r\n      operation=\"process\" variable=\"request\" createInstance=\"yes\" />\r\n\r\n    <forEach counterName=\"index\" parallel=\"yes\" name=\"iterateRecords\">\r\n      <startCounterValue>1</startCounterValue>\r\n      <finalCounterValue>count($request.records/rec:Record)</finalCounterValue>\r\n      <scope>\r\n        <sequence>\r\n          <if name=\"HasMimeType\">\r\n            <condition>not($request.records/rec:Record[$index]/rec:Val[@key=\"MimeType\"])</condition>\r\n            <extensionActivity>\r\n              <proc:invokePipelet name=\"detectMimeType\">\r\n                <proc:pipelet class=\"org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet\" />\r\n                <proc:variables input=\"request\" index=\"index\" />\r\n                <proc:configuration>\r\n                  <rec:Val 
key=\"FileExtensionAttribute\">Extension</rec:Val>\r\n                  <rec:Val key=\"MetaDataAttribute\">MetaData</rec:Val>\r\n                  <rec:Val key=\"MimeTypeAttribute\">MimeType</rec:Val>\r\n                </proc:configuration>\r\n              </proc:invokePipelet>\r\n            </extensionActivity>\r\n          </if>\r\n\r\n          <!-- only process text based content, skip everything else -->\r\n          <if name=\"IsText\">\r\n            <condition>starts-with($request.records/rec:Record[$index]/rec:Val[@key=\"MimeType\"],\"text/\")\r\n            </condition>\r\n            <if name=\"IsHTML\">\r\n              <condition>$request.records/rec:Record[$index]/rec:Val[@key=\"MimeType\"] = \"text/html\"\r\n                or $request.records/rec:Record[$index]/rec:Val[@key=\"MimeType\"] = \"text/xml\"\r\n              </condition>\r\n              <!-- extract txt from html and xml files -->\r\n              <extensionActivity>\r\n                <proc:invokePipelet name=\"invokeHtml2Txt\">\r\n                  <proc:pipelet class=\"org.eclipse.smila.processing.pipelets.HtmlToTextPipelet\" />\r\n                  <proc:variables input=\"request\" index=\"index\" />\r\n                  <proc:configuration>\r\n                    <rec:Val key=\"inputType\">ATTACHMENT</rec:Val>\r\n                    <rec:Val key=\"outputType\">ATTRIBUTE</rec:Val>\r\n                    <rec:Val key=\"inputName\">Content</rec:Val>\r\n                    <rec:Val key=\"outputName\">Content</rec:Val>\r\n                    <rec:Val key=\"meta:title\">Title</rec:Val>\r\n                  </proc:configuration>\r\n                </proc:invokePipelet>\r\n              </extensionActivity>\r\n\t\t\t  <else>\r\n                <!-- copy txt from attachment to attribute -->\r\n                <extensionActivity>\r\n                  <proc:invokePipelet name=\"invokeCopyContent\">\r\n                    <proc:pipelet 
class=\"org.eclipse.smila.processing.pipelets.CopyPipelet\" />\r\n                    <proc:variables input=\"request\" index=\"index\" />\r\n                    <proc:configuration>\r\n                      <rec:Val key=\"inputType\">ATTACHMENT</rec:Val>\r\n                      <rec:Val key=\"outputType\">ATTRIBUTE</rec:Val>\r\n                      <rec:Val key=\"inputName\">Content</rec:Val>\r\n                      <rec:Val key=\"outputName\">Content</rec:Val>\r\n                      <rec:Val key=\"mode\">COPY</rec:Val>\r\n                    </proc:configuration>\r\n                  </proc:invokePipelet>\r\n                </extensionActivity>\r\n              </else>\r\n            </if>\r\n          </if>\r\n        </sequence>\r\n      </scope>\r\n    </forEach>\r\n\r\n    <extensionActivity>\r\n      <proc:invokePipelet name=\"SolrIndexPipelet\">\r\n        <proc:pipelet class=\"org.eclipse.smila.solr.index.SolrIndexPipelet\" />\r\n        <proc:variables input=\"request\" output=\"request\" />\r\n        <proc:configuration>\r\n          <rec:Val key=\"ExecutionMode\">ADD</rec:Val>\r\n          <rec:Val key=\"CoreName\">WebCore</rec:Val>\r\n          <rec:Seq key=\"CoreFields\">\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">Path</rec:Val>\r\n            </rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">Filename</rec:Val>\r\n            </rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">Url</rec:Val>\r\n            </rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">MimeType</rec:Val>\r\n            </rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">Size</rec:Val>\r\n            </rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">LastModifiedDate</rec:Val>\r\n            </rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">Content</rec:Val>\r\n            
</rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">Extension</rec:Val>\r\n            </rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">Title</rec:Val>\r\n            </rec:Map>\r\n            <rec:Map>\r\n              <rec:Val key=\"FieldName\">Author</rec:Val>\r\n            </rec:Map>\r\n          </rec:Seq>\r\n        </proc:configuration>\r\n      </proc:invokePipelet>\r\n    </extensionActivity>\r\n\r\n    <reply name=\"end\" partnerLink=\"Pipeline\" portType=\"proc:ProcessorPortType\" operation=\"process\"\r\n      variable=\"request\" />\r\n  </sequence>\r\n</process>\r\n"
  }
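
If you prefer to keep the BPEL in a readable file, the escaping can be left to a JSON library. The following is a minimal sketch under stated assumptions: the helper names `build_pipeline_payload` and `post_pipeline` are hypothetical, the `Content-Type` header is an assumption, and the short `bpel` string is only a stand-in for the full pipeline definition shown above. Only the URL and the "name" and "definition" keys are taken from the request above.

```python
import json
import urllib.request


def build_pipeline_payload(name: str, bpel_xml: str) -> bytes:
    """Wrap readable BPEL XML in the JSON body used for POST /smila/pipeline."""
    # json.dumps escapes quotes and line breaks inside the XML for us.
    return json.dumps({"name": name, "definition": bpel_xml}).encode("utf-8")


def post_pipeline(payload: bytes, url: str = "http://localhost:8080/smila/pipeline"):
    """Send the payload to a running SMILA instance (not called in this sketch)."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


# Shortened stand-in for the real BPEL file; this is NOT a complete pipeline.
bpel = '<?xml version="1.0" encoding="utf-8" ?>\r\n<process name="AddWebPipeline" />\r\n'
payload = build_pipeline_payload("AddWebPipeline", bpel)
print(payload.decode("utf-8"))
```

With SMILA running, `post_pipeline(payload)` would submit the definition; the sketch only prints the escaped body so that it can be inspected first.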

You can monitor the defined BPEL pipelines in a browser, where you should now find your new pipeline:

http://localhost:8080/smila/pipeline

Note that we used the index name "WebCore" for the Solr index in the BPEL above:

...
<proc:configuration>
  <rec:Val key="CoreName">WebCore</rec:Val>
  ...
</proc:configuration>
...

Create and start a new indexing job

We define an indexing job based on the predefined asynchronous workflow "importToPipeline" (see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json). This indexing job will process the imported data by using our new BPEL pipeline "AddWebPipeline".

The "importToPipeline" workflow contains a PipelineProcessorWorker worker that is not configured for dedicated BPEL pipelines, so the BPEL pipelines handling adds and deletes have to be set via job parameters.

Use your favourite REST client to create an appropriate job definition:

POST http://localhost:8080/smila/jobmanager/jobs/
  {
    "name":"indexWebJob",
    "parameters":{      
      "tempStore": "temp",
      "addPipeline": "AddWebPipeline",
      "deletePipeline": "DeletePipeline" 
     },
    "workflow":"importToPipeline"
  }

Note that the "DeletePipeline" is not needed for our test scenario here, but we must supply values for all workflow parameters that are still undefined.
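
Because every open workflow parameter has to be supplied, it can be convenient to build the job definition in code and check the parameters before posting. This is a minimal sketch: the helper `build_job_definition` is hypothetical, and treating exactly `tempStore`, `addPipeline` and `deletePipeline` as the required set is an assumption taken from the example above, not read from workflows.json.

```python
import json

# Assumption: these are the parameters the "importToPipeline" job must set,
# as in the example request above.
REQUIRED_PARAMETERS = ("tempStore", "addPipeline", "deletePipeline")


def build_job_definition(name: str, workflow: str, parameters: dict) -> dict:
    """Return a job definition dict, refusing to build one with missing parameters."""
    missing = [p for p in REQUIRED_PARAMETERS if p not in parameters]
    if missing:
        raise ValueError("missing job parameters: " + ", ".join(missing))
    return {"name": name, "parameters": dict(parameters), "workflow": workflow}


job = build_job_definition(
    "indexWebJob",
    "importToPipeline",
    {"tempStore": "temp", "addPipeline": "AddWebPipeline", "deletePipeline": "DeletePipeline"},
)
# The printed JSON is the body for POST http://localhost:8080/smila/jobmanager/jobs/
print(json.dumps(job, indent=2))
```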

Afterwards, start a job run for the defined job:

POST http://localhost:8080/smila/jobmanager/jobs/indexWebJob
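
The same request can be prepared from a script. The sketch below only builds the request object and does not send it; the helper name `job_run_request` and the empty request body are assumptions, while the URL pattern is the one shown above.

```python
import urllib.request


def job_run_request(job_name: str, base_url: str = "http://localhost:8080/smila"):
    """Build (but do not send) the POST request that starts a run of job_name."""
    url = f"{base_url}/jobmanager/jobs/{job_name}"
    # Assumption: starting a run needs no request body, so an empty one is sent.
    return urllib.request.Request(url, data=b"", method="POST")


request = job_run_request("indexWebJob")
print(request.get_method(), request.full_url)
```

With SMILA running, passing the request to `urllib.request.urlopen(request)` would start the job run.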

Put it all together

OK, we have now finished configuring SMILA to use separate BPEL pipelines for file system and web crawling and to index the data from these crawlers into different indices. Here is what we have done so far:

  1. We added the WebCore index to the Solr configuration.
  2. We created a new BPEL pipeline for web crawler data that references the new Solr index.
  3. We used a separate job for web indexing that references the new BPEL pipeline.

Now run the web crawler again. Remember to use "indexWebJob" as the job name parameter!

Go back to your browser at http://localhost:8080/SMILA/search, select the new index "WebCore" and run a search.

Configuration overview

SMILA configuration files are located in the configuration directory of the SMILA application. The following list shows the configuration files and documentation links relevant to this tutorial, grouped by SMILA component:

Crawler

  • configuration folder: org.eclipse.smila.connectivity.framework
    • file.xml (FileSystem Crawler)
    • web.xml (Web Crawler)
  • Documentation

Jobmanager

BPEL Pipelines

Solr
