Revision as of 05:40, 1 December 2014
Just another 5 minutes to change the workflow
In the 5 minutes tutorial, all data collected by crawlers was processed with the same asynchronous "indexUpdate" workflow using the script "add.js", and all data was indexed into the same Solr/Lucene index "DefaultCore". It is possible, however, to configure SMILA so that data from different data sources goes through different workflows and pipelines and is indexed into different indices. This requires slightly more advanced configuration than before, but it is still quite simple.
In the following sections we are going to use the generic asynchronous "indexWithScript" workflow, which lets you specify the JavaScript script that processes the data. We create an additional script for web crawler records so that web crawler data is indexed into a separate search index named "WebCore".
Configure a new Solr index
Please shut down SMILA now if it's still running.
To configure your own search index "WebCore", follow the description in the SMILA documentation for creating your own Solr index.
If you have already started SMILA before (as we assume you have), please copy your new core configuration and the modified solr.xml file to the folder workspace\.metadata\.plugins\org.eclipse.smila.solr, because the configuration is not copied again after the first start of the Solr bundle.
Please restart SMILA now.
Further information: For more information about Solr indexing, please see the SMILA Solr documentation.
Create a new indexing script
We need a new script for adding the imported web crawler records. Predefined scripts are contained in the configuration/org.eclipse.smila.scripting/js directory, and new scripts can be added by simply placing them there.
Copy the script "add.js", name the copy "addWeb.js", and change the Solr "CoreName" in it from "DefaultCore" to "WebCore":
...
var solrIndexPipelet = pipelets.create("org.eclipse.smila.solr.index.SolrIndexPipelet", {
    "ExecutionMode" : "ADD",
    "CoreName" : "WebCore",
...
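The copy-and-edit step can also be expressed as a small text transformation: take the contents of "add.js" and swap the quoted core name. This is only a sketch; it assumes the core name occurs in the script exactly as the string "DefaultCore".

```javascript
// Sketch: derive the addWeb.js source from add.js by swapping the Solr
// core name. Assumes the name appears as the quoted string "DefaultCore".
function makeWebScript(addJsSource) {
  return addJsSource.split('"DefaultCore"').join('"WebCore"');
}

// Example input, shortened from the snippet above:
const addJs = '  "ExecutionMode" : "ADD",\n  "CoreName" : "DefaultCore",';
const addWebJs = makeWebScript(addJs);
// addWebJs now references "WebCore" instead of "DefaultCore"
```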
Further information: For more information about Scripting, please check the Scripting documentation.
Create and start a new indexing job
We define a new indexing job based on the predefined asynchronous workflow "indexWithScript" (see SMILA/configuration/org.eclipse.smila.jobmanager/workflows.json). This indexing job will process the imported data by using our new script "addWeb.js".
The "indexWithScript" workflow contains a ScriptProcessorWorker that is not configured for a dedicated script, so the scripts handling adds and deletes have to be set via job parameters.
Use your favourite REST Client to create an appropriate job definition:
POST http://localhost:8080/smila/jobmanager/jobs/
{
  "name": "indexWebJob",
  "parameters": {
    "tempStore": "temp",
    "addScript": "addWeb",
    "deleteScript": "delete"
  },
  "workflow": "indexWithScript"
}
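With any generic HTTP client this boils down to POSTing the JSON body above. The sketch below builds the same job definition as a plain object and shows one way to submit it; the endpoint URL is taken from the example above, and the `fetch` call is left commented out so you can review the definition first against a running SMILA instance.

```javascript
// Sketch: the indexWebJob definition as a plain object, to be posted to
// the jobmanager REST API. All field values are taken from the example above.
const jobDefinition = {
  name: "indexWebJob",
  parameters: {
    tempStore: "temp",
    addScript: "addWeb",
    deleteScript: "delete",
  },
  workflow: "indexWithScript",
};

async function createJob(definition) {
  // Requires a running SMILA instance on localhost:8080 (Node 18+ for fetch).
  const response = await fetch("http://localhost:8080/smila/jobmanager/jobs/", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(definition),
  });
  return response.status;
}
// createJob(jobDefinition); // run against a live SMILA instance
```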
Notes:
- The "deleteScript" is not needed for our test scenario here, but all workflow parameters without predefined values must be set.
- In both the add and the delete script we use the standard entry function ("process"), so we do not have to change it via a parameter.
Afterwards, start a job run for the defined job:
POST http://localhost:8080/smila/jobmanager/jobs/indexWebJob
Create a new web crawl job
Since the predefined web crawl job pushes the crawled records to the indexUpdate job, we create a new crawl job here that uses our new indexing job instead.
POST http://localhost:8080/smila/jobmanager/jobs/
{
  "name": "crawlWikiToWebCore",
  "workflow": "webCrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "web",
    "jobToPushTo": "indexWebJob",
    "startUrl": "http://wiki.eclipse.org/SMILA",
    "linksPerBulk": 100,
    "filters": {
      "urlPatterns": {
        "include": ["http://wiki\\.eclipse\\.org/SMILA.*",
                    "http://wiki\\.eclipse\\.org/Image:.*",
                    "http://wiki\\.eclipse\\.org/images/.*"],
        "exclude": [".*\\?.*",
                    "http://wiki\\.eclipse\\.org/images/archive/.*",
                    ".*\\.java"]
      }
    },
    "mapping": {
      "httpCharset": "Charset",
      "httpContenttype": "ContentType",
      "httpLastModified": "LastModifiedDate",
      "httpMimetype": "MimeType",
      "httpSize": "Size",
      "httpUrl": "Url",
      "httpContent": "Content"
    }
  }
}
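The "urlPatterns" filters are regular expressions: a crawled URL is kept when it matches at least one "include" pattern and no "exclude" pattern. The sketch below replays that rule with the patterns from the job definition above; whole-URL matching (anchored patterns) is an assumption about the crawler's behaviour.

```javascript
// Sketch: how the include/exclude URL patterns above play together.
// Anchoring each pattern to the whole URL is an assumption here.
const include = [
  "http://wiki\\.eclipse\\.org/SMILA.*",
  "http://wiki\\.eclipse\\.org/Image:.*",
  "http://wiki\\.eclipse\\.org/images/.*",
].map((p) => new RegExp("^" + p + "$"));

const exclude = [
  ".*\\?.*",
  "http://wiki\\.eclipse\\.org/images/archive/.*",
  ".*\\.java",
].map((p) => new RegExp("^" + p + "$"));

function isCrawled(url) {
  // Kept if at least one include matches and no exclude matches.
  return include.some((re) => re.test(url)) && !exclude.some((re) => re.test(url));
}
```

For example, `http://wiki.eclipse.org/SMILA/Documentation` is crawled, while `http://wiki.eclipse.org/SMILA?action=history` is skipped because of the `.*\?.*` exclude pattern.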
Please note that we used the following line to let the crawl job push the records to our new indexing job:
"jobToPushTo":"indexWebJob"
Now start the crawl job:
POST http://localhost:8080/smila/jobmanager/jobs/crawlWikiToWebCore
After the job has had some time to crawl, process, and commit the data, have another look at the SMILA search page: your new core is now listed among the available cores, and if you select it, you can search, for example, for "SMILA" in the new WebCore.
Put it all together
We have now finished configuring SMILA to use separate scripts for file system and web crawling and to index the data from these crawlers into different indices. Here is what we have done so far:
- We added the WebCore index to the Solr configuration.
- We created a new JavaScript script for Web crawler data referencing the new Solr index.
- We used a separate job for web indexing that references the new script.
- We used a separate web crawl job to push the records to the new indexing job.
Configuration overview
SMILA configuration files are located in the configuration directory of the SMILA application. The following lists the configuration files and documentation links relevant to this tutorial, grouped by SMILA component:
Jobmanager
- configuration folder: org.eclipse.smila.jobmanager
- workflows.json (Predefined asynchronous workflows)
- jobs.json (Predefined jobs)
- Documentation: JobManager
- REST API: http://localhost:8080/smila/jobmanager
Scripting
- configuration folder: org.eclipse.smila.scripting
- js/ (Predefined JavaScript scripts)
- Documentation: Scripting, ScriptProcessorWorker
- REST API: http://localhost:8080/smila/script
Solr
- DataDictionary
- configuration folder: org.eclipse.smila.solr
- Documentation