This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests in BPEL workflows, and the sample servlet used to create a simple search Web GUI.
Let's start at the top: if you have installed SMILA and created an index by starting a crawler, you can now point your web browser to http://localhost:8080/SMILA/search to search the index.
When you enter a query string and submit the form, a servlet creates a SMILA record from the HTTP parameters and uses the search service to execute a BPEL workflow on this record. The servlet receives an enriched version of the query record and a list of result records in XML form, and uses an XSLT stylesheet to create the result HTML page.
Using the Advanced link at the top, you can switch to a more detailed search page.
This page allows you to enter a more specific query. If you want to use the default search servlet for your own search page, use the XSLT files that create these two pages as a reference when designing it.
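The last step of the flow described above, turning the result XML into an HTML page, is plain XSLT processing. The following is a minimal sketch using the JDK's javax.xml.transform API. The class name XsltRenderSketch and the method render are made up for illustration, and the sample servlet's actual code may be organized differently.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper, not part of SMILA: applies an XSLT stylesheet to a result XML string. */
final class XsltRenderSketch {

  /** Transforms the given XML with the given stylesheet and returns the output as a string. */
  static String render(String resultXml, String stylesheet) {
    try {
      // Compile the stylesheet, then run it over the search result document.
      Transformer transformer = TransformerFactory.newInstance()
          .newTransformer(new StreamSource(new StringReader(stylesheet)));
      StringWriter out = new StringWriter();
      transformer.transform(new StreamSource(new StringReader(resultXml)), new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      // Wrap checked exceptions so callers in this sketch need no throws clause.
      throw new IllegalStateException("XSLT transformation failed", e);
    }
  }
}
```

A servlet would typically write the returned string to the HTTP response. A custom search page then only needs a different stylesheet, not different Java code.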
Having seen the tip of the iceberg, we dive down to the very bottom of SMILA search: the actual processing of search requests in SMILA BPEL pipelines. We assume that you are accustomed to the basic SMILA workflow processing features used in indexing workflows. You may want to refer to SMILA/Documentation/BPEL_Workflow_Processor for details.
Search workflows (or pipelines) are very similar to indexing pipelines, but there are a few extensions. The variables in indexing pipelines represent just a simple list of records. This is not sufficient for search pipelines where we need to distinguish between the single record representing the user query (the "query record") and the current list of result records (the "search result"). This results in a few general differences between the BPEL files of indexing and search pipelines:
- the partner link of the pipeline must be of type "proc:SearchProcessorPartnerLinkType":
<partnerLinks> <partnerLink name="Pipeline" partnerLinkType="proc:SearchProcessorPartnerLinkType" myRole="service" /> </partnerLinks>
- the input and output variables of the pipeline itself and of pipelet/service invocations must have the message type "proc:SearchProcessorMessage". This message has only a single part named "records" which can contain a single record (the query record) and a record list (the result records). Refer to org.eclipse.smila.processing.bpel/xml/processor.wsdl for the details of the schema definition.
<variables> <variable name="request" messageType="proc:SearchProcessorMessage" /> </variables>
- The <receive> and <reply> elements must use the portType "proc:SearchProcessorPortType":
<sequence>
  <receive name="start" partnerLink="Pipeline" portType="proc:SearchProcessorPortType" operation="process" variable="request" createInstance="yes" />
  <!-- service/pipelet invocation and other workflow logic -->
  <reply name="end" partnerLink="Pipeline" portType="proc:SearchProcessorPortType" operation="process" variable="request" />
</sequence>
Apart from this, pipelet/service invocations look the same as in indexing pipelines. See SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/SearchPipeline.bpel for a complete example search pipeline (the one used in the above sample).
SimplePipelets/ProcessingServices in Search pipelines
Recall that the signature of the invocation method of SimplePipelets/ProcessingServices is
Id[] process(Blackboard blackboard, Id[] recordIds) throws ProcessingException;
This means when used in search pipelines they cannot process a complete message variable. Therefore the engine selects one part of the message when invoking a "simple" pipeline element:
- if there is not yet a result record list in the message (not even an empty one) the pipelet is called with the query record ID and the output message contains only a single query record ID, too.
- else it is called with the result record list and the result becomes the record list of the output variable. The query record ID is just copied to the result variable.
The rationale behind this is that in a search pipeline first some pipelets may be needed to prepare the query object (enrich the query, set some defaults, etc.), then follows the actual search, which takes the query as input and produces a list of results (thus adds the result record list to the variable) and then additional pipelets may be needed to manipulate the result further. Using the distinction described above makes it possible to use the same pipelet implementation for query and result records, just depending on their position in the pipeline.
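The selection rule above can be sketched in a few lines. This is only a model of the described behavior, not the engine's code: SearchMessageSketch and invokeSimple are hypothetical names, record IDs are simplified to strings, and a "simple" pipeline element is modelled as a function from an ID list to an ID list.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a search message: one query ID plus an optional result list. */
final class SearchMessageSketch {
  final String queryId;
  /** null means "no result list yet", which is different from an empty list. */
  final List<String> resultIds;

  SearchMessageSketch(String queryId, List<String> resultIds) {
    this.queryId = queryId;
    this.resultIds = resultIds;
  }

  /** Invokes a "simple" element on the part of the message selected by the rule above. */
  static SearchMessageSketch invokeSimple(SearchMessageSketch in,
      UnaryOperator<List<String>> element) {
    if (in.resultIds == null) {
      // No result list yet: the element sees only the query record ID.
      // Assumption of this sketch: the element returns exactly one ID in that case.
      List<String> out = element.apply(List.of(in.queryId));
      return new SearchMessageSketch(out.get(0), null);
    }
    // A result list exists (possibly empty): process it and copy the query ID unchanged.
    return new SearchMessageSketch(in.queryId, element.apply(in.resultIds));
  }
}
```

The same element can therefore act as a query preprocessor when it is placed before the search step and as a result postprocessor when it is placed after it.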
For some operations in search pipelines this invocation pattern is not sufficient. The most prominent example is the actual search implementation itself: it needs the query record as input and produces a result record list. There may also be pipelets after the actual search that need to compare query and result records and therefore need access to both kinds of record. To support this, two new interfaces have been defined.
Concerning life cycle and configuration they are identical to standard Simple Pipelets and Processing Services: Pipelets are created and configured by the BPEL engine and must be declared in the MANIFEST.MF of the providing bundle. ProcessingServices are started independently from the BPEL engine as OSGi services (though for different service interfaces). The enhancement provided by the search pipelets/services is a new invocation method:
SearchMessage process(Blackboard blackboard, SearchMessage message) throws ProcessingException;
where SearchMessage consists of a query record ID and a record ID list.
Search Service API
The actual Search API is quite simple: SMILA registers an OSGi service with the interface org.eclipse.smila.search.api.SearchService. It provides a few methods that take a SMILA query record and the name of a search workflow as input, execute the workflow on the record and return the result in different formats:
- SearchResult search(String workflowName, Record query) throws ProcessingException: this is the basic method of the search service, which returns the result records as SMILA data structures. The other methods also call this method for the actual search execution and just convert the result.
- org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException: returns the search result as an XML DOM document. See below for the schema of the result.
- String searchAsXmlString(String workflowName, Record query) throws ProcessingException: returns the search result as an XML string. See below for the schema of the result.
The schema of XML search results is basically as follows (target namespace is http://www.eclipse.org/smila/search, see org.eclipse.smila.search.api/xml/search.xsd for the full definition):
<element name="SearchResult">
  <complexType>
    <sequence minOccurs="1" maxOccurs="1">
      <element name="Query" minOccurs="1" maxOccurs="1">
        <complexType>
          <sequence>
            <element name="Workflow" type="string" minOccurs="1" maxOccurs="1" />
            <element ref="rec:Record" minOccurs="0" maxOccurs="1" />
          </sequence>
        </complexType>
      </element>
      <element ref="rec:RecordList" minOccurs="0" maxOccurs="1"/>
    </sequence>
  </complexType>
</element>
You can view the result XML by using the sample SMILA search page and selecting the "Show XML result" checkbox before submitting the query.
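A client that receives such a result as a string can inspect it with the JDK's DOM parser. The sketch below reads the workflow name and checks whether a record list is present (the schema allows it to be absent). The class and method names are hypothetical. Elements are looked up by local name with the "*" namespace wildcard, so the sketch does not need to assume how the nested elements are namespace-qualified.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper, not part of SMILA: reads basic facts from a search result XML string. */
final class SearchResultXmlSketch {

  /** Parses the XML string into a namespace-aware DOM document. */
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      // Wrap checked exceptions so callers in this sketch need no throws clause.
      throw new IllegalStateException("not a parsable search result", e);
    }
  }

  /** Returns the text of the first Workflow element, or null if there is none. */
  static String workflowName(Document doc) {
    NodeList nodes = doc.getElementsByTagNameNS("*", "Workflow");
    return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
  }

  /** Reports whether the result contains a RecordList element at all. */
  static boolean hasResultList(Document doc) {
    return doc.getElementsByTagNameNS("*", "RecordList").getLength() > 0;
  }
}
```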
The content of the query record depends heavily on the search service used. For example, with the LuceneSearchService you can set attribute values to search in the index fields to which these attributes were mapped during indexing (refer to the Lucene integration documentation for details). Other search parameters are attached to the query record as annotations. However, the Search API also recommends where to put some basic, commonly used search parameters, which all index integrations should honor (of course they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations.
The query record consists mainly of parameters. The Search API defines names, allowed values and, for some of them, default values for a set of commonly used parameters. All implementations should use these parameters where possible, i.e. they should not introduce additional parameters for the same purpose. An implementation may leave certain parameters unsupported if they are not feasible with the underlying technology. All parameters are single-valued unless specified otherwise.
- query: either a search string using a query syntax, or a query record describing the query by setting values for application attributes. The index implementor can define a syntax to describe complex search criteria in a single string; SMILA does not currently define a query syntax of its own.
- using query string:
<Record> <Val key="query">meaning of life</Val> </Record>
- using query object:
<Record> <Map key="query"> <Val key="author">shakespeare</Val> <Val key="title">hamlet</Val> </Map> </Record>
- maxcount: number of records to return to the search client, default value is 10:
<Val key="query">meaning of life</Val> <Val key="maxcount" type="long">3</Val>
- offset: number of top results to skip, default value is 0. Use this parameter to implement result list paging: if maxcount=10, the "next page" queries can be identical to the initial query, but with offset=10, 20, ....
<Val key="query">meaning of life</Val> <Val key="maxcount" type="long">3</Val> <Val key="offset" type="long">3</Val>
- threshold: minimal relevance score that a result must have, default is 0.0.
<Val key="query">meaning of life</Val> <Val key="threshold" type="double">0.5</Val>
- language: language of the query, no default value. There may be language-specific pipelets/services that need to know the language in which the user expresses the query in order to work correctly.
<Val key="query">sinn des lebens</Val> <Val key="language">de</Val>
- index: some index services (like our LuceneIndexService) can manage multiple indexes at once and can use this parameter to select the index to search with this request. However, they should always have a default index name configured so that a request succeeds without this parameter being set.
<Val key="query">meaning of life</Val> <Val key="index">wikipedia</Val>
- resultAttributes: multi-valued parameter, describing the names of attributes that should be added to result records by the search engine. This list should only contain the attributes needed by pipelets after the search for processing or by the search page for displaying the results, because every additional attribute adds retrieval and transfer cost. Omitting this parameter should result in getting all available attributes.
<Val key="query">meaning of life</Val> <Seq key="resultAttributes"> <Val>author</Val> <Val>title</Val> </Seq>
- highlight: sequence of string values specifying attribute names for which highlighting should be produced.
<Val key="query">meaning of life</Val> <Seq key="highlight"> <Val>content</Val> </Seq>
- sortby: sequence of maps, each containing the keys "attribute" (any string) and "order" ("ascending"/"descending"), specifying that the search result should be sorted by the named attributes in the given direction. Omitting this parameter should result in a search result sorted by relevance (score, similarity, ranking, ...). Multiple maps can be added and should be evaluated in the order of appearance.
<Val key="query">meaning of life</Val>
<Seq key="sortby">
  <Map>
    <Val key="attribute">year</Val>
    <Val key="order">descending</Val>
  </Map>
  <Map>
    <Val key="attribute">author</Val>
    <Val key="order">ascending</Val>
  </Map>
</Seq>
- groupby: sequence of maps, each containing the keys "attribute" (any string) and "maxcount" (long). This tells the search to produce a grouping of the search results by these attributes, returning "maxcount" groups for each attribute. Optionally each groupby map may contain a map under the key "sortby" with the keys "order" ("ascending"/"descending") and "criterion" (any string, e.g. "count" or "value") specifying in which order to return these groups (e.g. "count" sorts by number of hits per group, "value" by attribute value name).
<Val key="query">meaning of life</Val>
<Seq key="groupby">
  <Map>
    <Val key="attribute">year</Val>
    <Val key="maxcount" type="long">10</Val>
  </Map>
  <Map>
    <Val key="attribute">author</Val>
    <Map key="sortby">
      <Val key="criterion">value</Val>
      <Val key="order">ascending</Val>
    </Map>
    <Val key="maxcount" type="long">5</Val>
  </Map>
</Seq>
- filter: sequence of maps describing for certain attributes which values they are required to have in valid result records. Each of the maps contains a key "attribute" and one or more value descriptions:
- "oneOf", "allOf", "noneOf": sequences of values describing required or forbidden attribute values.
- "atLeast", "atMost", "greaterThan", "lessThan": single values describing lower and upper bounds (including or excluding the bound values) for the attribute value.
<Val key="query">meaning of life</Val>
<Seq key="filter">
  <Map>
    <Val key="attribute">author</Val>
    <Seq key="oneOf">
      <Val>shakespeare</Val>
      <Val>adams</Val>
    </Seq>
  </Map>
  <Map>
    <Val key="attribute">year</Val>
    <Val key="atLeast">1500</Val>
    <Val key="lessThan">2000</Val>
  </Map>
</Seq>
- ranking: a configuration of how to rank the search results. This depends heavily on the search engine used, so SMILA does not specify it further.
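To make the intended semantics of the filter, sortby, offset and maxcount parameters described above more concrete, the following sketch applies them to an in-memory list of hits. It is a reference model only, not how an index integration works. All names are hypothetical and hits are plain maps from attribute name to a single value. The sketch also assumes the order filter, then sort, then page, and that a hit without the filtered attribute is rejected. The "allOf" description, groupby and threshold are not modelled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of filter, sortby, offset and maxcount; not a SMILA API. */
final class QueryParamsSketch {

  /** Convenience factory for a hit with the two attributes used in the examples. */
  static Map<String, Object> hit(String author, long year) {
    Map<String, Object> attrs = new HashMap<>();
    attrs.put("author", author);
    attrs.put("year", year);
    return attrs;
  }

  @SuppressWarnings("unchecked")
  private static int compare(Object a, Object b) {
    return ((Comparable<Object>) a).compareTo(b);
  }

  /** Checks one single-valued attribute value against one filter map. */
  static boolean accepts(Map<String, Object> filter, Object value) {
    if (value == null) {
      return false; // assumption: a hit without the attribute fails the filter
    }
    if (filter.containsKey("oneOf") && !((List<?>) filter.get("oneOf")).contains(value)) {
      return false;
    }
    if (filter.containsKey("noneOf") && ((List<?>) filter.get("noneOf")).contains(value)) {
      return false;
    }
    // atLeast/atMost include the bound value, greaterThan/lessThan exclude it.
    if (filter.containsKey("atLeast") && compare(value, filter.get("atLeast")) < 0) {
      return false;
    }
    if (filter.containsKey("greaterThan") && compare(value, filter.get("greaterThan")) <= 0) {
      return false;
    }
    if (filter.containsKey("atMost") && compare(value, filter.get("atMost")) > 0) {
      return false;
    }
    if (filter.containsKey("lessThan") && compare(value, filter.get("lessThan")) >= 0) {
      return false;
    }
    return true;
  }

  /** Chains the sortby maps in order of appearance; assumes every hit has the sort attributes. */
  static Comparator<Map<String, Object>> comparatorFor(List<Map<String, String>> sortby) {
    Comparator<Map<String, Object>> result = (a, b) -> 0; // no criteria: keep the incoming order
    for (Map<String, String> entry : sortby) {
      String attribute = entry.get("attribute");
      Comparator<Map<String, Object>> next =
          (a, b) -> compare(a.get(attribute), b.get(attribute));
      if ("descending".equals(entry.get("order"))) {
        next = next.reversed();
      }
      result = result.thenComparing(next);
    }
    return result;
  }

  /** Filters, sorts, then cuts out the page described by offset and maxcount. */
  static List<Map<String, Object>> apply(List<Map<String, Object>> hits,
      List<Map<String, Object>> filters, List<Map<String, String>> sortby,
      long offset, long maxcount) {
    List<Map<String, Object>> kept = new ArrayList<>();
    for (Map<String, Object> hit : hits) {
      boolean ok = true;
      for (Map<String, Object> filter : filters) {
        ok &= accepts(filter, hit.get((String) filter.get("attribute")));
      }
      if (ok) {
        kept.add(hit);
      }
    }
    kept.sort(comparatorFor(sortby)); // stable sort: ties keep their previous order
    int from = (int) Math.min(offset, kept.size());
    int to = (int) Math.min(offset + maxcount, kept.size());
    return new ArrayList<>(kept.subList(from, to));
  }
}
```

With maxcount=10, calling apply with offset 0, 10, 20, ... yields consecutive result pages, which matches the paging use of the offset parameter.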