=== Obsolete, because processing services have been removed ===

== The Issue ==

While writing [[SMILA/Specifications/Search API]] I have realized that the kind of processing defined for import processes is not sufficient for processing searches. The main reason is that in search we have two separate kinds of objects to process: the query object and the current result set. Some services need to see both kinds of objects, e.g. highlighting or adaptation rule services. On the other hand, the processing service/pipelet API should remain simple for implementors that do not care about a query object but just want to process the current set of objects. These pipelets should also still be usable in search pipelines to process the query object (before the actual retrieval) or the result objects (afterwards).

== The Proposal ==

The basic idea is to introduce a second Pipelet/Service interface that receives the ID of the effective query object:

<source lang="java">
class SearchMessage {
  Id query;      // ID of the query record
  Id[] records;  // IDs of the current result records
}

interface SearchPipelet {
  SearchMessage process(Blackboard bb, SearchMessage message) throws ProcessingException;
}

interface SearchProcessingService {
  SearchMessage process(Blackboard bb, SearchMessage message) throws ProcessingException;
}
</source>

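For illustration, a minimal sketch of what an implementation of this interface could look like. The pipelet below is hypothetical and not part of the proposal: it passes the query ID through unchanged and truncates the result list to a fixed maximum, using only the <tt>SearchMessage</tt> and <tt>SearchPipelet</tt> definitions above.

<source lang="java">
// Hypothetical example pipelet, only to illustrate the proposed interface.
public class TruncateResultsPipelet implements SearchPipelet {
  private static final int MAX_RESULTS = 10;

  public SearchMessage process(final Blackboard bb, final SearchMessage message)
    throws ProcessingException {
    final SearchMessage result = new SearchMessage();
    result.query = message.query; // the query object is passed through unchanged
    if (message.records != null && message.records.length > MAX_RESULTS) {
      result.records = new Id[MAX_RESULTS];
      System.arraycopy(message.records, 0, result.records, 0, MAX_RESULTS);
    } else {
      result.records = message.records;
    }
    return result;
  }
}
</source>
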
WSDL extensions: Besides the standard ProcessorMessage and ProcessorPortType used in import pipelines, search pipelines additionally need:

<source lang="xml">
<types>
  <xsd:schema targetNamespace="http://www.eclipse.org/smila/processor">
    <xsd:import namespace="http://www.eclipse.org/smila/record" schemaLocation="record.xsd" />
    <xsd:element name="ReqId" type="xsd:string" />
    <xsd:complexType name="SearchMessage">
      <xsd:sequence>
        <xsd:element ref="proc:ReqId" minOccurs="1" maxOccurs="1" />
        <xsd:element ref="rec:Record" minOccurs="0" maxOccurs="1" />
        <xsd:element ref="rec:RecordList" minOccurs="0" maxOccurs="1" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:schema>
</types>

<message name="SearchProcessorMessage">
  <part name="records" type="proc:SearchMessage" />
</message>

<portType name="SearchProcessorPortType">
    <operation name="search">
        <input message="proc:SearchProcessorMessage" name="in" />
        <output message="proc:SearchProcessorMessage" name="out" />
        <fault message="proc:ProcessorException" name="ex" />
    </operation>
</portType>
</source>

It is then the job of the ODE integration layer to route the elements of the messages to the correct arguments in the service invocation (see the sketch after this list):

* The initial query is written to <tt>$request.query</tt>
* Invocation of SimplePipelets/ProcessingServices (the old ones):
** If the <tt>records</tt> part of the input message does not exist, invoke the pipelet/service with the ID of the (single) record in the <tt>query</tt> part and store the result back to the <tt>query</tt> part of the output message.
*** For the moment we assume that it is not useful to use query processing pipelets that produce multiple results, so we define that only the first record is stored in <tt>query</tt> and additional records are discarded. Alternatively, multiple records could also be stored in the <tt>records</tt> part.
** Else (i.e. <tt>records</tt> contains a record list, which may be empty), call the pipelet with the IDs in this record list and store the result in the <tt>records</tt> part of the output message.
* Invocation of SearchPipelets/SearchProcessingServices:
** Construct a search message from the <tt>query</tt> and <tt>records</tt> parts of the input message and call the pipelet/service with it.
** The result message's record list becomes the <tt>records</tt> part of the output message; the query ID is stored in the <tt>query</tt> part of the output message.

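A rough sketch of this dispatch logic, assuming the existing ProcessingService interface looks roughly like <tt>Id[] process(Blackboard, Id[])</tt> (the helper method and its surroundings are illustrative, not the actual ODE integration code):

<source lang="java">
// Illustrative routing of a message to either an old-style ProcessingService
// or a new-style SearchProcessingService.
SearchMessage route(final Blackboard bb, final SearchMessage input, final Object service)
  throws ProcessingException {
  if (service instanceof SearchProcessingService) {
    // new-style services see the complete message: query ID plus result IDs
    return ((SearchProcessingService) service).process(bb, input);
  }
  final ProcessingService simple = (ProcessingService) service; // old-style service (assumed signature)
  final SearchMessage output = new SearchMessage();
  if (input.records == null) {
    // no records part: process the single query record, keep only the first result
    final Id[] result = simple.process(bb, new Id[] { input.query });
    output.query = (result != null && result.length > 0) ? result[0] : input.query;
    output.records = null;
  } else {
    // records part present (possibly empty): process the result list
    output.query = input.query;
    output.records = simple.process(bb, input.records);
  }
  return output;
}
</source>
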
=== Blackboard handling in search pipelines ===

Different searches can have results that share the same objects. If both searches accessed their objects via the same blackboard instance, this could lead to conflicts. Things would be much simpler if each request were executed with its own blackboard instance. A blackboard implementation for this purpose could be quite light-weight (basically a Map<Id, Record>), because we do not need a connection to the record storages: at least in the first version, records used in searches would be transient and can be garbage-collected after the request is finished. This would also make it easy to clean up the resources used by a request afterwards: just throw away the complete request blackboard. Adding a storage connection in later versions, to be able to persist query objects or current result lists for query session handling, should not be a major performance problem either.

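A minimal sketch of such a transient, per-request blackboard, assuming records expose their ID via <tt>getId()</tt> (the class and its methods are illustrative, not the actual Blackboard API; SMILA imports are omitted):

<source lang="java">
import java.util.HashMap;
import java.util.Map;

// Illustrative per-request blackboard: just a map from record IDs to records,
// with no connection to persistent record storages.
public class TransientRequestBlackboard {
  private final Map<Id, Record> records = new HashMap<Id, Record>();

  public void setRecord(final Record record) {
    records.put(record.getId(), record);
  }

  public Record getRecord(final Id id) {
    return records.get(id);
  }

  public void invalidate() {
    // request finished: throw away all request-scoped records
    records.clear();
  }
}
</source>
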
So we would extend the WorkflowProcessor interface such that its <tt>process</tt> methods receive the Blackboard instance to use for processing the records:

<source lang="java">
// existing variant for import pipelines, now with an explicit blackboard argument
Id[] process(String workflowName, Blackboard blackboard, Id[] recordIds) throws ProcessingException;
// new variant for search pipelines
SearchMessage process(String workflowName, Blackboard blackboard, SearchMessage query) throws ProcessingException;
</source>

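A hypothetical caller sketch (the factory and variable names are illustrative): the search service creates a fresh blackboard per request, puts the query record on it, runs the pipeline, and discards the blackboard afterwards.

<source lang="java">
// Illustrative only: per-request blackboard handling around the extended process method.
final Blackboard blackboard = blackboardFactory.createTransientBlackboard(); // hypothetical factory
final SearchMessage request = new SearchMessage();
request.query = queryRecordId;   // ID of the query record already stored on the blackboard
request.records = new Id[0];     // no results yet
final SearchMessage result = workflowProcessor.process("SearchPipeline", blackboard, request);
// ... read the result records from the blackboard, build the search response,
// then simply drop the blackboard to release all request-scoped resources.
</source>
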
The same change is proposed in [[SMILA/Specifications/Partitioning Storages#Changes in the Blackboard service]], so it seems that this change makes sense anyway. However, it requires a further extension of the WSDLs of SMILA pipelines such that the BPEL messages contain a part that identifies the correct blackboard when a pipelet/service is invoked. This changes the WSDL processor message to:

<source lang="xml">
<message name="ProcessorMessage">
  <part name="id" element="proc:ReqId" />
  <part name="records" element="rec:RecordList" />
</message>
</source>

(The WSDL SearchMessage described above contains this ID already.)

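One way the integration layer could resolve the blackboard from the <tt>ReqId</tt> part is a simple registry of per-request blackboards (a hypothetical helper, shown only to illustrate the idea):

<source lang="java">
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative registry mapping request IDs to their blackboard instances.
public class RequestBlackboardRegistry {
  private final ConcurrentMap<String, Blackboard> blackboards =
    new ConcurrentHashMap<String, Blackboard>();

  public void register(final String requestId, final Blackboard blackboard) {
    blackboards.put(requestId, blackboard);
  }

  public Blackboard lookup(final String requestId) {
    return blackboards.get(requestId);
  }

  public void release(final String requestId) {
    blackboards.remove(requestId); // request finished: forget the blackboard
  }
}
</source>
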
== Some Example BPEL pipelines ==

All variables used in the following examples are assumed to be SearchProcessorMessages.

=== Simple Search Pipeline ===

<source lang="xml">
<sequence>
    <extensionActivity name="invokeTextminer">
        <proc:invokeService>
            <proc:service name="TextminerService" />
            <proc:variables input="request" output="request" />    <!-- process request.query -->
        </proc:invokeService>
    </extensionActivity>
    <extensionActivity name="invokeCompletionRules">
        <proc:invokeService>
            <proc:service name="CompletionRulesService" />
            <proc:variables input="request" output="request" />    <!-- process request.query -->
        </proc:invokeService>
    </extensionActivity>
    <extensionActivity name="invokeSearchIndex">
        <proc:invokeService>
            <proc:service name="SearchIndexService" />
            <proc:variables input="request" output="request" />
            <!-- uses request.query to produce request.records,
                 hence this needs to be a SearchProcessingService -->
        </proc:invokeService>
    </extensionActivity>
    <extensionActivity name="invokeCompletionRulesOnResults">
        <proc:invokeService>
            <proc:service name="CompletionRulesService" />
            <proc:variables input="request" output="request" />    <!-- process request.records -->
        </proc:invokeService>
    </extensionActivity>
    <extensionActivity name="invokeHighlighter">
        <proc:invokeService>
            <proc:service name="HighlighterService" />
            <proc:variables input="request" output="request" />
            <!-- uses request.query AND request.records,
                 hence this really needs to be a SearchProcessingService -->
        </proc:invokeService>
    </extensionActivity>
</sequence>
</source>

So far, so good.

=== Use search result as query ("more like this") ===

<source lang="xml">
<sequence>
    ...
    <extensionActivity name="invokeIndexSearch">
        <proc:invokeService>
            <proc:service name="SearchIndexService" />
            <proc:variables input="request" output="request" />
        </proc:invokeService>
    </extensionActivity>
    <extensionActivity name="invokeIndexSearchAgain">
        <proc:invokeService>
            <proc:service name="SearchIndexService" />
            <proc:variables input="request" output="request" />
            <proc:setAnnotations>
              <rec:An n="parameters">
                <rec:V n="useResultRecord">true</rec:V>
              </rec:An>
            </proc:setAnnotations>
        </proc:invokeService>
    </extensionActivity>
    ...
</sequence>
</source>

This can be done if the SearchIndex service is implemented such that it can use the first entry of the records list instead of the query object to perform the search. In this example, this behaviour is triggered by a special parameter for the pipelet.

Alternatively, a pipelet could be placed between the two searches that merges the first result of the first search into the query record (and clears the result list) so that the second search can use the updated query object. A sketch of such a pipelet follows.

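This sketch uses only the <tt>SearchMessage</tt> and <tt>SearchPipelet</tt> definitions from the proposal above; the class name is illustrative. It simply promotes the ID of the top result to become the new query and clears the result list (a real implementation would probably also copy or merge record content on the blackboard).

<source lang="java">
// Hypothetical "result becomes query" pipelet for the "more like this" pipeline.
public class ResultToQueryPipelet implements SearchPipelet {
  public SearchMessage process(final Blackboard bb, final SearchMessage message)
    throws ProcessingException {
    final SearchMessage out = new SearchMessage();
    if (message.records != null && message.records.length > 0) {
      out.query = message.records[0]; // the first result becomes the new query object
    } else {
      out.query = message.query;      // no results: keep the original query
    }
    out.records = new Id[0];          // clear the result list
    return out;
  }
}
</source>
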
=== A problematic use case: "Federated search" ===

<source lang="xml">
<sequence>
    ...
    <flow> <!-- search in parallel -->
        <extensionActivity name="invokeSearchIndex1">
            <proc:invokeService>
                <proc:service name="SearchIndexService1" />
                <proc:variables input="request" output="result1" />
            </proc:invokeService>
        </extensionActivity>
        <extensionActivity name="invokeSearchIndex2">
            <proc:invokeService>
                <proc:service name="SearchIndexService2" />
                <proc:variables input="request" output="result2" />
            </proc:invokeService>
        </extensionActivity>
        <extensionActivity name="invokeSearchIndex3">
            <proc:invokeService>
                <proc:service name="SearchIndexService3" />
                <proc:variables input="request" output="result3" />
            </proc:invokeService>
        </extensionActivity>
    </flow>

    <!-- how to merge the record parts of the resultX messages? -->
    ...

</sequence>
</source>

The merging could probably be done using XSLT. But if we wanted to introduce a MergePipelet, we would probably need a third kind of pipelet interface that can receive an arbitrary number of search messages:

<source lang="java">
interface AdvancedSearchPipelet {
    SearchMessage process(Blackboard bb, SearchMessage[] recordLists) throws ProcessingException;
}
</source>

and calling it via an extended <tt>invokePipelet</tt> activity:

<source lang="xml">
        ...
        <extensionActivity name="invokeResultMerge">
            <proc:invokePipelet>
                <proc:pipelet class="org.eclipse.smila.pipelets.SearchResultMergePipelet" />
                <proc:variables input="result1" output="mergeResult">
                  <proc:variable input="result2"/>
                  <proc:variable input="result3"/>
                </proc:variables>
            </proc:invokePipelet>
        </extensionActivity>
        ...
</source>
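
A possible sketch of the SearchResultMergePipelet referenced above, based on the AdvancedSearchPipelet interface proposed for this use case and the SearchMessage definition from this page (illustrative only; a real implementation would also have to rank and deduplicate the merged results):

<source lang="java">
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative merge pipelet: the query ID is taken from the first message,
// and the result lists of all messages are simply concatenated.
public class SearchResultMergePipelet implements AdvancedSearchPipelet {
  public SearchMessage process(final Blackboard bb, final SearchMessage[] recordLists)
    throws ProcessingException {
    final SearchMessage merged = new SearchMessage();
    merged.query = recordLists[0].query;
    final List<Id> allRecords = new ArrayList<Id>();
    for (final SearchMessage message : recordLists) {
      if (message.records != null) {
        allRecords.addAll(Arrays.asList(message.records));
      }
    }
    merged.records = allRecords.toArray(new Id[allRecords.size()]);
    return merged;
  }
}
</source>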
 
[[Category:SMILA]]