
Difference between revisions of "SMILA/Documentation/Search"

(Query Parameters)
(Adding attachments)
(12 intermediate revisions by 4 users not shown)
Line 1: Line 1:
This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests in BPEL workflows, and the sample servlet used to create a simple search Web GUI.
+
This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests in BPEL workflows, and the sample servlet used to create a simple web-based GUI for search.  
  
=== Introduction ===
+
=== Introduction ===
  
Let's start at the top: If you have installed SMILA and created an index by starting a crawler you can now use your web browser to go to [http://localhost:8080/SMILA/search http://localhost:8080/SMILA/search] to search the index:
+
Let's start right at the top: Provided that you installed SMILA and created an index by starting a crawler as described in [[SMILA/Documentation for 5 Minutes to Success|5 Minutes to Success]], you can use your web browser to go to [http://localhost:8080/SMILA/search http://localhost:8080/SMILA/search] and search the index:  
  
[[Image:SMILA-search-page-default.png|500px|SMILA's sample search page]]
+
[[Image:SMILA-search-page-default.png|500px|SMILA's sample search page]]  
  
What happens behind the scenes when you enter a query string and submit the form is that a servlet creates a SMILA record from the HTTP parameters, uses the search service to execute a BPEL workflow on this record, receives an enriched version of the query record and a list of result records in XML form and uses an XSLT stylesheet to create a result HTML page.
+
What happens behind the scenes when you enter a query string and submit the form is that a servlet creates a SMILA record from the HTTP parameters, uses the search service to execute a BPEL workflow on this record, receives an enriched version of the query record together with a list of result records in XML form, and uses an XSLT stylesheet to create a result page in HTML format.  
  
Using the [http://localhost:8080/SMILA/search?style=SMILASearchAdvanced Advanced] link at the top you can switch to more detailed search page:
+
By clicking the ''Advanced'' link at the top of the search page (or by entering the URL <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>), you can switch to a more detailed search form page, which allows you to construct more specific search queries:  
  
[[Image:SMILA-search-page-advanced.png|500px|SMILA's advanced sample search page]]
+
[[Image:SMILA-search-page-advanced.png|500px|SMILA's advanced sample search page]]  
  
This page allows you to enter a more specific query. In case you want to use the default search servlet for your own search page you should use the XSLT files that create these two pages as a reference when trying to design your own search page.
+
If you want to use the default search servlet for your own search page, you are encouraged to use the two XSLT files creating these HTML pages as a reference or basis when building your pages.  
  
 +
=== Search Processing  ===
  
=== Search Processing ===
+
Having seen the tip of the iceberg, we dive down to the very bottom of SMILA search: the actual processing of search requests in SMILA BPEL pipelines. We assume that you are accustomed to the basic SMILA workflow processing features used in indexing workflows. You may want to refer to [[SMILA/Documentation/BPEL Workflow Processor]] for details.
  
Having seen the tip of the iceberg, we dive down to the very bottom of SMILA search: the actual processing of search requests in SMILA BPEL pipelines. We assume that you are accustomed to the basic SMILA workflow processing features used in indexing workflows. You may want to refer to [[SMILA/Documentation/BPEL_Workflow_Processor]] for details.
+
==== Search Pipelines  ====
  
==== Search Pipelines ====
+
Search workflows (or pipelines) look just like indexing pipelines but are used a bit differently: Instead of pushing lists of records corresponding to data source objects through them, they are invoked with a single record representing the search request. This record contains the values of the parameters which were defined by the Search API (see below). The request object can be analyzed and enriched with additional information during the workflow before the actual search on the index takes place. The results of this search are not added to the blackboard as records of their own, but are added to the request record under the key "records". Further pipelets may then do further processing based on the request data and the result record list (e.g. highlighting). Finally, the request record including the search results is returned to the client and can be presented.
  
Search workflows (or pipelines) are very similar to indexing pipelines, but there are a few extensions. The variables in indexing pipelines represent just a simple list of records. This is not sufficient for search pipelines where we need to distinguish between the single record representing the user query (the "query record") and the current list of result records (the "search result"). This results in a few general differences between the BPEL files of indexing and search pipelines:
+
Pipelet invocations look the same as in indexing pipelines. See <tt>SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/searchpipeline.bpel</tt> for a complete example search pipeline (the one used in the above sample).  
  
* the partner link of the pipeline must be of type "proc:SearchProcessorPartnerLinkType":
+
=== Search Service API  ===
  
<source lang="xml">
+
The actual Search API is quite simple: SMILA registers an OSGi service with the interface <tt>org.eclipse.smila.search.api.SearchService</tt>. It provides a few methods that take a SMILA query record and the name of a search workflow as input, execute the workflow on the record, and return the result in different formats:
<partnerLinks>
+
    <partnerLink name="Pipeline" partnerLinkType="proc:SearchProcessorPartnerLinkType" myRole="service" />
+
</partnerLinks>
+
</source>
+
  
* the input and output variables of the pipeline itself and of pipelet/service invocations must have the message type "proc:SearchProcessorMessage". This message has only a single part named "records" which can contain a single record (the query record) and a record list (the result records). Refer to <tt>org.eclipse.smila.processing.bpel/xml/processor.wsdl</tt> for the details of the schema definition.
+
*<tt>Record search(String workflowName, Record query) throws ProcessingException</tt>: This is the basic method of the search service, returning the result records as native SMILA data structures. The other methods call this method for the actual search execution, too, and then just convert the result.  
 +
*<tt>org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException</tt>: Returns the search result as an XML DOM document. See below for the schema of the result.
 +
*<tt>String searchAsXmlString(String workflowName, Record query) throws ProcessingException</tt>: Returns the search result as an XML string. See below for the schema of the result.
  
<source lang="xml">
+
The schema of XML search results is basically as follows (target namespace is <tt>http://www.eclipse.org/smila/search</tt>, see <tt>org.eclipse.smila.search.api/xml/search.xsd</tt> for the full definition):  
<variables>
+
    <variable name="request" messageType="proc:SearchProcessorMessage" />
+
</variables>
+
</source>
+
 
+
 
+
* The <receive> and <reply> elements must use the portType "proc:SearchProcessorPortType":
+
 
+
<source lang="xml">
+
<sequence>
+
    <receive name="start" partnerLink="Pipeline" portType="proc:SearchProcessorPortType"
+
        operation="process" variable="request" createInstance="yes" />
+
    <!-- service/pipelet invocation and other workflow logic -->
+
    <reply name="end" partnerLink="Pipeline" portType="proc:SearchProcessorPortType"
+
        operation="process" variable="request" />
+
</sequence>
+
</source>
+
 
+
Apart from this, pipelet/service invocations look the same as in indexing pipelines. See <tt>SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/SearchPipeline.bpel</tt> for a complete example search pipeline (the one used in the above sample).
+
 
+
 
+
==== SimplePipelets/ProcessingServices in Search pipelines ====
+
 
+
Recall that the signature of the invocation method of SimplePipelets/ProcessingServices is
+
 
+
<source lang="java">
+
Id[] process(Blackboard blackboard, Id[] recordIds) throws ProcessingException;
+
</source>
+
 
+
This means when used in search pipelines they cannot process a complete message variable. Therefore the engine selects one part of the message when invoking a "simple" pipeline element:
+
 
+
* if there is not yet a result record list in the message (not even an empty one) the pipelet is called with the query record ID and the output message contains only a single query record ID, too.
+
* else it is called with the result record list and the result becomes the record list of the output variable. The query record ID is just copied to the result variable.
+
 
+
The rationale behind this is that in a search pipeline first some pipelets may be needed to prepare the query object (enrich the query, set some defaults, etc.), then follows the actual search, which takes the query as input and produces a list of results (thus adds the result record list to the variable) and then additional pipelets may be needed to manipulate the result further. Using the distinction described above makes it possible to use the same pipelet implementation for query and result records, just depending on their position in the pipeline.
+
 
+
==== SearchPipelets/SearchProcessingServices ====
+
 
+
For some operations in search pipelines this invocation pattern is not sufficient, the most prominent being the actual search implementation itself: It needs the query record as input and produces a result record list. But there may be other pipelets after the actual search that need to compare query and result records and therefore need access to both kinds of record. To support this, two new interfaces have been defined:
+
 
+
* <tt>org.eclipse.smila.processing.SearchPipelet</tt>
+
* <tt>org.eclipse.smila.processing.SearchProcessingService</tt>
+
 
+
Concerning life cycle and configuration they are identical to standard Simple Pipelets and Processing Services: Pipelets are created and configured by the BPEL engine and must be declared in the MANIFEST.MF of the providing bundle. ProcessingServices are started independently from the BPEL engine as OSGi services (though for different service interfaces). The enhancement provided by the search pipelets/service is a new invocation method:
+
 
+
<source lang="java">
+
SearchMessage process(Blackboard blackboard, SearchMessage message) throws ProcessingException;
+
</source>
+
 
+
where <tt>SearchMessage</tt> consists of a query record ID and a record ID list.
+
 
+
=== Search Service API ===
+
 
+
The actual Search API is quite simple: SMILA registers an OSGi service with the interface <tt>org.eclipse.smila.search.api.SearchService</tt>. It provides a few methods that take a
+
SMILA query record and the name of a search workflow as input, execute the workflow on the record and return the result in different formats:
+
 
+
* <tt>SearchResult search(String workflowName, Record query) throws ProcessingException</tt>: this is the basic method of the search service, that returns the result records as SMILA data structures. The other methods call this method for the actual search execution, too, and just convert the result.
+
* <tt>org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException</tt>: returns the search result as an XML DOM document. See below for the schema of the result.
+
* <tt>String searchAsXmlString(String workflowName, Record query) throws ProcessingException</tt>: return the search result as an XML string. See below for the schema of the result.
+
 
+
The schema of XML search results is basically as follows (target namespace is <tt>http://www.eclipse.org/smila/search</tt>, see <tt>org.eclipse.smila.search.api/xml/search.xsd</tt> for the full definition):
+
  
 
<source lang="xml">
 
<source lang="xml">
 
<element name="SearchResult">
 
<element name="SearchResult">
    <complexType>
+
  <complexType>
        <sequence minOccurs="1" maxOccurs="1">
+
    <sequence minOccurs="1" maxOccurs="1">
            <element name="Query" minOccurs="1" maxOccurs="1">
+
      <element name="Workflow" type="string" minOccurs="1" maxOccurs="1" />
                <complexType>
+
      <element ref="rec:Record" minOccurs="0" maxOccurs="1" />
                    <sequence>
+
    </sequence>
                        <element name="Workflow" type="string" minOccurs="1" maxOccurs="1" />
+
  </complexType>
                        <element ref="rec:Record" minOccurs="0" maxOccurs="1" />
+
                    </sequence>
+
                </complexType>
+
            </element>
+
            <element ref="rec:RecordList" minOccurs="0" maxOccurs="1"/>
+
        </sequence>
+
    </complexType>
+
 
</element>
 
</element>
</source>
+
</source>  
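As a rough illustration of consuming the string returned by <tt>searchAsXmlString</tt>, the following Java sketch parses a result document with the standard JAXP DOM parser and reads the <tt>Workflow</tt> element. The sample document is hand-written for this sketch and is not real SMILA output, the class and method names are hypothetical, and the lookup deliberately ignores the namespace of <tt>Workflow</tt> because the schema excerpt above does not show whether local elements are namespace-qualified.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of SMILA.
final class SearchResultXmlSketch {

    /** Returns the text of the first Workflow element, or null if there is none. */
    static String workflowOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // "*" matches any namespace, so qualified and unqualified elements are both found.
            NodeList nodes = doc.getElementsByTagNameNS("*", "Workflow");
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse search result XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample document (assumption), containing only the Workflow element.
        String sample = "<SearchResult xmlns=\"http://www.eclipse.org/smila/search\">"
            + "<Workflow>SearchPipeline</Workflow></SearchResult>";
        System.out.println(workflowOf(sample));
    }
}
```

The same approach should also work on the <tt>org.w3c.dom.Document</tt> returned by <tt>searchAsXml</tt>, in which case the parsing step can be skipped.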
  
You can view the result XML by using the sample SMILA search page [http://localhost:8080/SMILA/search] and selecting the "Show XML result" checkbox before submitting the query.
+
You can view the result XML when using the sample SMILA search page at <tt>http://localhost:8080/SMILA/search</tt> if you enable the ''Show XML result'' option before submitting the query.  
  
The content of the query record depends largely on the search services used. E.g. using the LuceneSearchService, you can set attribute values to search in the index fields to which these attributes have been mapped during indexing (refer to the Lucene integration documentation for details). Other search parameters are attached to the query record as annotations. However, the Search API is also a recommendation on where to put some basic, commonly used search parameters, which all index integrations should honor (of course they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations.
+
The content of the query record depends largely on the search services used. However, the Search API also includes a recommendation on where to put some basic, commonly used search parameters which all index integrations should honor (of course they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations.  
  
 +
=== Query Parameters  ===
  
=== Query Parameters ===
+
The query record mainly consists of parameters. The Search API defines the names of these parameters, the allowed values as well as the default values for a set of commonly used parameters. All implementations should use these properties if possible, i.e. they should not introduce additional parameters for the same purpose, but it may be possible that certain parameters are not supported because it is not feasible with the underlying technology. For some parameters we also defined default values. All parameters are single-valued unless otherwise specified.
  
The query record consists mainly of parameters. The Search API defines these parameter names, allowed values and default values for a set of commonly used parameters. All implementations should use these properties if possible, i.e. they should not introduce additional parameters for the same purpose, but it may be possible that certain parameters are not supported because it is not feasible with the underlying technology. For some parameters we also define default values. All parameters are single values if not specified differently.
+
*''query'': Either a search string using a query syntax or a query record describing the query by setting values for attributes (aka fielded search). The implementer for a specific underlying technology may define a query syntax to be able to build complex search criteria in a single string. However, SMILA currently does not define its own query syntax and passes the string as is to its default search engine [[SMILA/Documentation/Solr|Solr]] (see there for handling and interpretation).
 +
**Example using a query string:
  
* query: either a search string using a query syntax, or a query record describing the query by setting values for application attributes. The index implementor can define a syntax to describe complex search criteria in a single string, SMILA does currently not define an own query syntax.
 
** using query string:
 
 
<source lang="xml">
 
<source lang="xml">
 
<Record>
 
<Record>
 
   <Val key="query">meaning of life</Val>
 
   <Val key="query">meaning of life</Val>
 
</Record>
 
</Record>
</source>
+
</source>  
** using query object:
+
 
 +
*Example using a query object (fielded search):
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Record>
 
<Record>
Line 139: Line 72:
 
   </Map>
 
   </Map>
 
</Record>
 
</Record>
</source>
+
</source>  
* maxcount: number of records to return to the search client, default value is 10:
+
 
 +
*''maxcount'': The maximum number of records which should be returned to the search client. Default value is 10. Example:
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
 
<Val key="maxcount" type="long">3</Val>
 
<Val key="maxcount" type="long">3</Val>
</source>
+
</source>  
* offset: number of top results to skip, default value is 0. Use this parameter to implement result list paging: If maxcount=10, the "next page" queries can be identical to the initial query, but with offset=10, 20, ....
+
 
 +
*''offset'': The number of hits which, starting from the top, should be skipped in the search result. Default value is 0. Use this parameter to implement result list paging and to provide the user a means to navigate through the result pages: If ''maxcount'' is 10, the "next page" queries can be identical to the initial query, but with ''offset'' set to 10, 20, and so on. Example:
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
 
<Val key="maxcount" type="long">3</Val>
 
<Val key="maxcount" type="long">3</Val>
 
<Val key="offset" type="long">3</Val>
 
<Val key="offset" type="long">3</Val>
</source>
+
</source>  
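The paging rule can be sketched in a few lines of Java. This is only an illustration of the arithmetic: the class and method names are hypothetical, pages are assumed to be counted from zero, and the XML is produced by plain string concatenation rather than by any SMILA helper class.

```java
// Hypothetical helper, not part of SMILA.
final class PagingSketch {

    /** Offset for a zero-based page index when every page holds maxcount records. */
    static long offsetForPage(long pageIndex, long maxcount) {
        if (pageIndex < 0 || maxcount <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and maxcount > 0");
        }
        return pageIndex * maxcount;
    }

    /** Renders the two paging parameters in the Val notation used in the examples on this page. */
    static String pagingParams(long pageIndex, long maxcount) {
        return "<Val key=\"maxcount\" type=\"long\">" + maxcount + "</Val>\n"
            + "<Val key=\"offset\" type=\"long\">" + offsetForPage(pageIndex, maxcount) + "</Val>";
    }

    public static void main(String[] args) {
        // Print the parameters for the first three result pages with three records per page.
        for (long page = 0; page < 3; page++) {
            System.out.println(pagingParams(page, 3));
        }
    }
}
```

With three records per page, the second page (index 1) yields an offset of 3, which corresponds to the XML example above.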
* threshold: minimal relevance score that a result must have, default is 0.0.
+
 
 +
*''threshold'': The minimal value of the relevance score that a result must have to be returned to the search client. Default is 0.0.
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
 
<Val key="threshold" type="double">0.5</Val>
 
<Val key="threshold" type="double">0.5</Val>
</source>
+
</source>  
* language: language of the query, no default value. There could be language specific pipelets/services that need to know in which language the user is expressing his query to work correctly.
+
 
 +
*''language'': The natural language of the query. No default value. This parameter may be required for language-specific pipelets/services that need to know in which language the user is expressing his or her query to be able to deliver feasible results. Example:
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">sinn des lebens</Val>
 
<Val key="query">sinn des lebens</Val>
 
<Val key="language">de</Val>
 
<Val key="language">de</Val>
</source>
+
</source>  
* index: some index services (like our LuceneIndexService) can manage multiple indexes at once, then they can use this parameter to select the index to search with this request. However, they always should have a default index name configured somehow so that a request succeeds without having this parameter set.
+
 
 +
*''indexname'': Some index services (like Solr) can manage multiple indices at once. When doing so, they can use this parameter to select the index which is to be searched with the current request. However, when using such a scenario, it is recommended to configure a default index name, so that search requests will succeed without having this parameter set explicitly. Example:
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
<Val key="index">wikipedia</Val>
+
<Val key="indexname">wikipedia</Val>
</source>
+
</source>  
* resultAttributes: multi-valued parameter, describing the names of attributes that should be added to result records by the search engine. This list should only contain the attributes needed by pipelets after the search for processing or by the search page for displaying the results, including too many attributes will always decrease performance. Omitting this parameter should result in getting all available attributes.
+
 
 +
*''resultAttributes'': A multi-valued parameter containing the names of the attributes which the search engine should add to the result records. Since including too many attributes will decrease performance, the list should contain only those attributes that are needed by some pipelets for further processing after the search has taken place or for displaying the results in the end. Omitting the parameter results in getting all available attributes. Example:
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
Line 173: Line 118:
 
   <Val>title</Val>
 
   <Val>title</Val>
 
</Seq>
 
</Seq>
</source>
+
</source>  
* highlight: sequence of string values specifying attribute names for which highlighting should be produced.
+
 
 +
*''highlight'': A sequence of string values specifying the attribute names for which highlighting should be returned. Example:
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
Line 180: Line 127:
 
   <Val>content</Val>
 
   <Val>content</Val>
 
</Seq>
 
</Seq>
</source>
+
</source>  
* sortby: sequence of maps each containing a key "attribute" (any string) and "order" ("ascending"/"descending") specifying that the search result should be sorted by the named attributes in the given direction. Omitting this parameter should result in a search result sorted by relevance (score, similarity, ranking, ...). Multiple maps can be added and should be evaluated in the order of appearance.
+
 
 +
*''sortby'': A sequence of maps each containing the ''key'' "attribute" (any string) and the ''key'' "order" ("ascending" | "descending") specifying that the search result should be sorted by the named attributes in the given order. Omitting this parameter results in a search result sorting by descending relevance (score, similarity, ranking, ....). Multiple maps can be added and should be evaluated in the order of their appearance. Example:
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
Line 194: Line 143:
 
   </Map>
 
   </Map>
 
</Seq>
 
</Seq>
</source>
+
</source>  
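To make the "evaluated in the order of their appearance" rule concrete, here is a small Java sketch that turns a list of ''sortby'' maps into a chained comparator. It is not how any search engine integration implements sorting; it only mimics the semantics on in-memory maps. All names are hypothetical, and attribute values are assumed to be plain strings, so numbers compare lexicographically.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of SMILA.
final class SortBySketch {

    /** Builds one comparator from sortby maps; earlier maps take precedence over later ones. */
    static Comparator<Map<String, String>> comparatorFor(List<Map<String, String>> sortby) {
        Comparator<Map<String, String>> result = (a, b) -> 0; // no criteria: keep input order
        for (Map<String, String> spec : sortby) {
            String attribute = spec.get("attribute");
            // Records lacking the attribute sort last in ascending order (first when reversed).
            Comparator<Map<String, String>> step = Comparator.comparing(
                (Map<String, String> r) -> r.get(attribute),
                Comparator.nullsLast(Comparator.<String>naturalOrder()));
            if ("descending".equals(spec.get("order"))) {
                step = step.reversed();
            }
            result = result.thenComparing(step);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>(List.of(
            Map.of("author", "adams", "year", "1979"),
            Map.of("author", "pratchett", "year", "1983"),
            Map.of("author", "adams", "year", "1992")));
        records.sort(comparatorFor(List.of(
            Map.of("attribute", "author", "order", "ascending"),
            Map.of("attribute", "year", "order", "descending"))));
        records.forEach(r -> System.out.println(r.get("author") + " " + r.get("year")));
    }
}
```

Because <tt>List.sort</tt> is stable, records that compare equal on all criteria keep their original (for example relevance-based) order.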
* groupby: sequence of maps each containing a key "attribute" (any string) and "maxcount" (long). This tells the search to produce a grouping of the search results by these attributes, returning "maxcount" groups for each attribute. Optionally each groupby map may contain a map under key "sortby" with keys "order" ("ascending"/"descending") and "criterion" (any string, e.g. "count" or "value") specifying in which order to return these groups (e.g. "count" by number of hits per group or "value" by attribute value name).
+
 
 +
*''facetby'': A sequence of maps each containing the ''key'' "attribute" (any string) and the ''key'' "maxcount" (long). This causes facets to be returned with the search results for the specified attributes, returning "maxcount" values for each attribute. Optionally, each facetby map may contain a map with key "sortby" with keys "order" ("ascending" | "descending") and "criterion" (any string, e.g. "count" or "value") specifying in which order to return the values (e.g. "count" by number of hits per facet or "value" by attribute value name). Example:
 +
 
 +
{{Note|since 1.0|Prior to 1.0 this parameter was named ''groupby''; it has merely been renamed (see [http://dev.eclipse.org/mhonarc/lists/smila-dev/msg00998.html mail thread]).}}
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
<Seq key="groupby">
+
<Seq key="facetby">
 
   <Map>
 
   <Map>
 
     <Val key="attribute">year</Val>
 
     <Val key="attribute">year</Val>
Line 212: Line 165:
 
   </Map>
 
   </Map>
 
</Seq>
 
</Seq>
</source>
+
</source>  
* filter: sequence of maps describing for certain attributes which values they are required to have in valid result records. Each of the maps contains a key "attribute" and one or more value descriptions:
+
 
** "oneOf", "allOf", "noneOf": sequences of values describing required or forbidden attribute values.
+
*''filter'': A sequence of maps describing for certain attributes which values they must have in valid result records. Each of the maps contains a ''key'' "attribute" and one or more value descriptions:  
** "atLeast", "atMost", "greaterThan", "lessThan": single values describing lower and upper bounds (including or excluding the bound values) for the attribute value.
+
**"oneOf", "allOf", "noneOf": sequences of values describing required or forbidden attribute values.  
 +
**"atLeast", "atMost", "greaterThan", "lessThan": single values describing lower and upper bounds (including or excluding the bound values) for the attribute value. Example:
 +
 
 
<source lang="xml">
 
<source lang="xml">
 
<Val key="query">meaning of life</Val>
 
<Val key="query">meaning of life</Val>
Line 222: Line 177:
 
     <Val key="attribute">author</Val>
 
     <Val key="attribute">author</Val>
 
     <Seq key="oneOf">
 
     <Seq key="oneOf">
       <Val>shakespeare</Val>
+
       <Val>pratchett</Val>
 
       <Val>adams</Val>
 
       <Val>adams</Val>
 
     </Seq>
 
     </Seq>
Line 228: Line 183:
 
   <Map>
 
   <Map>
 
     <Val key="attribute">year</Val>
 
     <Val key="attribute">year</Val>
     <Val key="atLeast">1500</Val>
+
     <Val key="atLeast">1990</Val>
 
     <Val key="lessThan">2000</Val>
 
     <Val key="lessThan">2000</Val>
    </Seq>
 
 
   </Map>
 
   </Map>
 
</Seq>
 
</Seq>
</source>
+
</source>  
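The bound keys can be read as a simple range check. The Java sketch below assumes that "atLeast" and "atMost" include the bound value while "greaterThan" and "lessThan" exclude it, which is the natural reading of the description above but is not spelled out there. The class is hypothetical and works on plain <tt>long</tt> values only.

```java
import java.util.Map;

// Hypothetical helper, not part of SMILA.
final class RangeFilterSketch {

    /** True if value satisfies every bound that is present in the filter map. */
    static boolean matches(Map<String, Long> filter, long value) {
        Long atLeast = filter.get("atLeast");
        Long atMost = filter.get("atMost");
        Long greaterThan = filter.get("greaterThan");
        Long lessThan = filter.get("lessThan");
        if (atLeast != null && value < atLeast) return false;          // inclusive lower bound
        if (atMost != null && value > atMost) return false;            // inclusive upper bound
        if (greaterThan != null && value <= greaterThan) return false; // exclusive lower bound
        if (lessThan != null && value >= lessThan) return false;       // exclusive upper bound
        return true;
    }

    public static void main(String[] args) {
        // Bounds taken from the "year" filter in the example above.
        Map<String, Long> yearFilter = Map.of("atLeast", 1990L, "lessThan", 2000L);
        for (long year : new long[] {1989, 1990, 1999, 2000}) {
            System.out.println(year + " -> " + matches(yearFilter, year));
        }
    }
}
```

Under these assumptions, the "year" filter from the example accepts 1990 through 1999 and rejects 2000.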
* ranking: a configuration of how to rank the search results. This is highly depending on the used search engine, so we don't specify this further in SMILA.
+
  
=== Result Annotations ===
+
*''ranking'': A configuration defining how to rank the search results. This depends heavily on the search engine used, so we don't specify this further in SMILA.
  
Annotations may not only be attached to the query record, but to the records in the search result, too. There are even additional annotations attached to the query object to describe result properties that do not refer to a single result record, but to the complete search result.
+
=== Result Annotations ===
  
* result statistics: After the search the query record contains an annotation named "result" that currently contains these named values
+
The search result is usually the request record, enriched with result data.  
** runtime: runtime for the invoked pipeline, in milliseconds
+
** totalHits: number of possible results for this query, i.e. all objects from an index that have a relevance score greater than the specified threshold (or zero).
+
** indexSize: complete number of objects in the searched index.  
+
  
* Score: Each result record should have a "result" annotation, too, giving at least the ranking score calculated by the search engine as named value "relevance" as a double value (usually 1 means: perfect match).
+
*''records'': A sequence of maps describing the actual search result, meaning the records retrieved from the index. Each record should have an additional attribute "_weight" describing the relevance score of this record with respect to the query. The size of the "records" sequence is limited by the "maxcount" parameter.
  
* Highlighting: TODO
+
<source lang="xml">
 +
<Val key="query">meaning of life</Val>
 +
<!-- other query parameters -->
 +
<Seq key="records">
 +
  <Map>
 +
    <Val key="_weight" type="double">0.95</Val>
 +
    <Val key="_recordid">file:hamlet</Val>
 +
    <Val key="title">Hamlet</Val>
 +
    <Val key="author">Shakespeare</Val>
 +
    ...
 +
  </Map>
 +
  <Map>
 +
    <Val key="_weight" type="double">0.90</Val>
 +
    <Val key="_recordid">file:hitchhiker</Val>
 +
    <Val key="title">Hitchhiker's Guide to the Galaxy</Val>
 +
    <Val key="author">Adams</Val>
 +
    ...
 +
  </Map>
 +
</Seq>
 +
</source> {{Note|return binary content|
 +
There is no nice way to return binary content anymore, as attachments may only be top-level children of a record. These two solutions are possible:
 +
# add an attachment to the search record with a name following this pattern: <resultItem-record.Id>.<resultItem.attachmentName>
 +
# convert the byte[] into a string (e.g. base64 encoding, so it is serializable) and return it in the AnyMap
 +
}}
  
* Facets: TODO
+
*''count'': The total number of records in the index that have any relevance to the query. For an example, see ''runtime''.
 +
*''indexSize'' (optional): The total number of records in the searched index. For an example, see ''runtime''.
 +
*''runtime'': The execution time of the request in milliseconds, added by the search service. Example:
  
* Terms: TODO
+
<source lang="xml">
 +
<Val key="query">meaning of life</Val>
 +
<Val key="count" type="long">123456</Val>
 +
<Val key="indexSize" type="long">987654321</Val>
 +
<Val key="runtime" type="long">42</Val>
 +
<!-- other query parameters -->
 +
<Seq key="records">
 +
  <!-- contains returned records -->
 +
</Seq>
 +
</source>
  
=== Helper Classes ===
+
*''facets'': The faceting results as requested by the ''facetby'' parameters. This Map contains a nested Seq for each requested facet and its values.
  
There are some classes that help a client to create query records with their annotations and to read out result records and their annotation. You can find them in package <tt>org.eclipse.smila.search.api.helper</tt>:
+
<source lang="xml">
 +
<Val key="query">meaning of life</Val>
 +
<Map key="facets">
 +
  <Seq key="year">
 +
    <Map>
 +
      <Val key="value">2000</Val>
 +
      <Val key="count" type="long">42</Val>
 +
    </Map>
 +
    <Map>
 +
      <Val key="value">2001</Val>
 +
      <Val key="count" type="long">21</Val>
 +
    </Map>
 +
    ...
 +
  </Seq>
 +
  <Seq key="author">
 +
    <Map>
 +
      <Val key="value">adams</Val>
 +
      <Val key="count" type="long">13</Val>
 +
    </Map>
 +
    <Map>
 +
      <Val key="value">shakespeare</Val>
 +
      <Val key="count" type="long">17</Val>
 +
    </Map>
 +
    ...
 +
  </Seq>
 +
</Map>
 +
</source>  
  
* <tt>QueryBuilder</tt>: helper class for building queries and sending the query to search service. Returns a result in the form of the next class:
+
*''_highlight'': An annotation of the result record, usually used to highlight relevant sections of the result documents so that the user can see at a glance whether a result matches what he or she was looking for. What exactly is returned here depends on the search engine used. For example, the Solr integration in SMILA returns the raw form of the text and information about the matching parts to be highlighted. Example:
* <tt>ResultAccessor</tt>: wrapper for the complete search result. Does not do much on its own, but basically creates instances of the following classes to access the records of the result.
+
* <tt>QueryRecordAccessor</tt>: Defines methods for accessing literals and annotations of the enriched query that is part of the search result.
+
* <tt>ResultRecordAccessor</tt>: Defines methods for reading literals and annotations of search result records.
+
  
See the source code or javadocs for more details of the provided methods.
+
<source lang="xml">
 +
<Seq key="records">
 +
  <Map>
 +
    ...
 +
    <Map key="_highlight">
 +
      <Map key="content">
 +
        <Val key="text">... To be or not to be ...</Val>
 +
        <Seq key="positions">
 +
          <Map>
 +
            <Val key="start" type="long">7</Val>
 +
            <Val key="end" type="long">9</Val>
 +
            <Val key="quality" type="long">100</Val>
 +
          </Map>
 +
          <Map>
 +
            <Val key="start" type="long">20</Val>
 +
            <Val key="end" type="long">22</Val>
 +
            <Val key="quality" type="long">95</Val>
 +
          </Map>
 +
        </Seq>
 +
      </Map>
 +
    </Map>
 +
    ...
 +
  </Map>
 +
</Seq>
 +
</source> Using the HighlightingPipelet this can be transformed into a highlighted text fragment (here using * as the highlight tag): <source lang="xml">
 +
<Seq key="records">
 +
  <Map>
 +
    <Val key="_weight" type="double">0.95</Val>
 +
    <Val key="_recordid">file:hamlet</Val>
 +
    <Val key="title">Hamlet</Val>
 +
    <Val key="author">Shakespeare</Val>
 +
    <Map key="_highlight">
 +
      <Map key="content">
 +
        <Val key="text">... To *be* or not to *be* ...</Val>
 +
      </Map>
 +
    </Map>
 +
    ...
 +
  </Map>
 +
</Seq>
 +
</source>
  
 +
=== Helper Classes  ===
  
=== Servlet ===
+
There are some classes that help a client to create query records with their annotations and to read result records and their annotation. You can find them in package <tt>org.eclipse.smila.search.api.helper</tt>:
  
Additionally to this "search backend", SMILA contains a simple servlet that creates a query record from HTTP parameters and displays the result as an HTML page by converting the XML search result using an XSLT stylesheet. This servlet is intended for quick demos, not for productive use. It is usually deployed in the Tomcat instance that comes with SMILA at "/SMILA/search". On first invocation it currently creates a quite empty query record (it sets some default parameters like resultSize etc) and processes it with the default pipeline "SearchPipeline". The pipeline should be able to process such a query and return an empty result list, not an error. The XML representation of this empty result is then transformed using the default stylesheet ("SMILASearchDefault") to present an initial search page.  
+
*<tt>QueryBuilder</tt>: A helper class for building queries and sending the query to search service. Returns a result in the form of the next class:
 +
*<tt>ResultAccessor</tt>: A wrapper for the complete search result. Provides methods to access the basic top-level result annotations and to access each search result record wrapped by a:
 +
*<tt>ResultRecordAccessor</tt>: Defines methods for accessing some of the result record annotations.
  
Note that the servlet actually enriches the XML search result a bit, so the input for the XSLT stylesheet does not completely conform to the defined XML schema. Currently, it adds a section containing the names of indexes available in the LuceneSearchService so that the search page can display the names for selection on the left side:
+
See the source code or JavaDocs for more details on the provided methods.
 +
 
 +
=== SMILA Search Servlet  ===
 +
 
 +
In addition to the "search backend", SMILA contains a simple servlet that creates a query record from HTTP parameters and displays the result as an HTML page by converting the XML search result using an XSLT stylesheet. This servlet is intended for quick demos only, not for production use. It is usually deployed in the Jetty instance that comes with SMILA at <tt>/SMILA/search</tt>. On first invocation, it currently creates a nearly empty query record (it sets some default parameters like ''maxcount'' etc.) and processes it with the default pipeline "SearchPipeline". The pipeline should be able to process such a query and return an empty result list, not an error. The XML representation of this empty result is then transformed using the default stylesheet ("SMILASearchDefault") to present an initial search page.
 +
 
 +
Note that the servlet actually enriches the XML search result a bit, so the input for the XSLT stylesheet does not completely conform to the defined XML schema. Currently, it adds a section containing the names of indices available in Solr so that the search page can display the names for selection on the left side:
  
 
<source lang="xml">
 
<source lang="xml">
 
<SearchResult xmlns="http://www.eclipse.org/smila/search">
 
<SearchResult xmlns="http://www.eclipse.org/smila/search">
   <Query>
+
   <Workflow>searchpipeline</Workflow>
  </Query>
+
   <Record xmlns="http://www.eclipse.org/smila/record">
   <RecordList xmlns="http://www.eclipse.org/smila/record">
+
     <!-- effective query and embedded result records -->
     ...
+
   </Record>
   </RecordList>
+
 
   <!-- part added by SearchServlet -->
 
   <!-- part added by SearchServlet -->
 
   <IndexNames>
 
   <IndexNames>
Line 283: Line 336:
 
   </IndexNames>
 
   </IndexNames>
 
</SearchResult>
 
</SearchResult>
</source>
+
</source>  
  
You can use the same mechanism to add other information to the XML that is necessary to display the search form but not contained in the search service result, you just have to implement your own servlet or extend the default servlet. Please refer to the source code for details.
+
You can use the same mechanism to add other information to the XML that is necessary for displaying purposes in the search form but not contained in the search service result: You just have to implement your own servlet or extend the default servlet. Please refer to the source code for details.  
  
==== XSLT Stylehsheets for SMILA search and result pages ====
+
==== XSLT Stylesheets for SMILA search and result pages ====
  
The stylesheets are loaded from the configuration directory "org.eclipse.smila.search.servlet" and selected using the HTTP parameter "style". The value of this parameter must be the stylesheet filename without suffix, the suffix must be <tt>.xsl</tt>. The servlet currently uses the hardcoded default name "SMILASearchDefault" if no other value is set.
+
The stylesheets are loaded from the configuration directory <tt>org.eclipse.smila.search.servlet</tt> and are selected by adding the HTTP parameter "style" to the URL. The value of this parameter must be the filename of the desired stylesheet without the suffix. The file's extension must be <tt>.xsl</tt>. The servlet currently uses the hardcoded default name "SMILASearchDefault" if no other value is set.
  
In the default application, three stylesheets are available:
+
In the default application, three stylesheets are available:
  
* SMILASearchDefault: the default search page. Use this as a reference for how to describe simple queries and to present result lists, including paging through bigger results.
+
*SMILASearchDefault: The default search page. Use this as a reference on how to describe simple queries and present result lists, including paging through bigger results.  
* SMILASearchAdvanced: same layout for the result list, but demonstrates how to create more complex query records with attribute values and filters.
+
*SMILASearchAdvanced: Same layout for the result list but demonstrates how to create more complex query records with attribute values and filters.
* SMILASearchTest: primitive layout, no paging, but demonstrates the setting of even more query features.
+
*SMILASearchTest: Primitive layout without paging but demonstrates the setting of even more query features.
  
To start with a stylesheet other than the default, you can add a "style" parameter to the initial URL. E.g., to start with the "advanced" stylesheet, use: [http://localhost:8080/SMILA/search?style=SMILASearchAdvanced http://localhost:8080/SMILA/search?style=SMILASearchAdvanced].
+
To start with a stylesheet other than the default, you can add a ''style'' parameter to the initial URL. E.g., to start with the "advanced" stylesheet, use: <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>.
  
In the following we will describe how to set query record features using the servlet. Please have a look at those sample stylesheets for complete examples of how to apply them, as we will not present something like a full tutorial here (-;
+
In the following we will describe how to set query record features using the servlet. Please have a look at those sample stylesheets for complete examples on how to apply them, as we will not present something like a full tutorial here (-;  
  
 +
==== Setting parameters  ====
  
==== Setting parameters ====
+
To set a parameter, just use the parameter name as the HTTP parameter name. All values for this HTTP parameter are added to the "parameters" annotation of the query record. E.g., to set the ''resultSize'' parameter to 7 using a hidden HTML input field, use:  
 
+
To set a parameter, just use the parameter name as the HTTP parameter name. All values for this HTTP parameter are added to the "parameters" annotation of the query record. E.g., to set the "resultSize" parameter to 7 using an  HTML hidden input field, use:
+
  
 
<source lang="xml">
 
<source lang="xml">
 
<input type="hidden" name="resultSize" value="7" />
 
<input type="hidden" name="resultSize" value="7" />
</source>
+
</source>  
 
+
See below for naming rules for the HTTP parameter names to set attribute literals and annotations. Note that you cannot set a parameter with a name that matches one of these rules.
+
  
 +
See below for naming rules for the HTTP parameter names to set attribute literals and annotations. Note that you cannot set a parameter with a name that matches one of these rules.
  
==== Setting attributes ====
+
==== Setting attributes ====
  
You can add literal string values to attributes using "A.<AttributeName>" as the HTTP parameter name. E.g., to set a value from a HTML text input field as an literal in attribute "Title", use:
+
You can add literal string values to attributes using "A.&lt;AttributeName&gt;" as the HTTP parameter name. E.g., to set a value from an HTML text input field as a literal in attribute "Title", use:  
  
 
<source lang="xml">
 
<source lang="xml">
 
<input type="text" name="A.Title" />
 
<input type="text" name="A.Title" />
</source>
+
</source>  
  
 +
==== Setting other parameters  ====
  
==== Setting other annotations ====
+
To add a "sortby" parameter for an attribute, use "sortby.&lt;AttributeName&gt;=&lt;order&gt;", e.g.
 
+
To set a named value in the ranking annotation for the complete record or an attribute, use "R.<ValueName>[.<AttributeName>]". You are not limited to the predefined ranking value names "name" and "boost". E.g., the following input field sets add a named value "Operator=OR" to attribute "Content":
+
  
 
<source lang="xml">
 
<source lang="xml">
<input type="hidden" name="R.Operator.Content" value="OR" />
+
<input type="hidden" name="sortby.FileSize" value="descending" />
</source>
+
</source>  
  
To create a filter for an attribute, use HTTP params:
+
To create a filter for an attribute, use HTTP params:  
  
* "F.<AttributeName>" to set the filter mode ("ALL", "ANY", "ONLY", "NONE")
+
*"F.val.&lt;AttributeName&gt;" to add filter values to an "oneOf" filter.  
* "Fval.<AttributeName>" to add filter values to an enumeration filter.
+
*"F.min.&lt;AttributeName&gt;" and "F.max.&lt;AttributeName&gt;" to set the lower/upper bounds of an "atLeast"/"atMost" filter.
* "Fmin.<AttributeName>" and "Fmax.<AttributeName>" to set the lower/upper bounds of a range filter.
+
  
If both "Fval" and "Fmin/Fmax" parameters are set, the servlet will create both an enumeration filter and a range filter with the same filter mode. It depends on the used search engine integration what happens in this case. E.g.
+
If both "F.val" and "F.min/F.max" parameters are set, the servlet will create both an enumeration filter and a range filter with the same filter mode. It depends on the used search engine integration what happens in this case. E.g.
  
* to set a filter for attribute "MimeType" restricting the result to HTML documents, use:
+
*To set a filter for attribute ''MimeType'' restricting the result to HTML documents, use:
  
 
<source lang="xml">
 
<source lang="xml">
<input type="hidden" name="Fval.MimeType" value="text/html" />
+
<input type="hidden" name="F.val.MimeType" value="text/html" />
</source>
+
</source>  
  
* to set a filter for attribute "FileSize" restricting the result to document sizes between 1000 and 10000 bytes, use:
+
*To set a filter for attribute ''FileSize'' restricting the result to document sizes between 1000 and 10000 bytes, use:
  
 
<source lang="xml">
 
<source lang="xml">
<input type="hidden" name="Fmin.FileSize" value="1000" />
+
<input type="hidden" name="F.min.FileSize" value="1000" />
<input type="hidden" name="Fmax.FileSize" value="10000" />
+
<input type="hidden" name="F.max.FileSize" value="10000" />
 +
</source>
 +
 
 +
To set a value in the ranking parameter for the complete record or an attribute, use "R[.&lt;AttributeName&gt;].&lt;ValueName&gt;". E.g., the following input field adds a parameter "Operator=OR" to attribute "Content":
 +
 
 +
<source lang="xml">
 +
<input type="hidden" name="R.Operator.Content" value="OR" />
 
</source>
 
</source>
  
To set named values in other attribute annotations, use "A.<AttributeName>.(<AnnotationName>.)+<ValueName>". Note that this does not work for attribute and annotation names containing "." characters. E.g., the following snippet create an annotation "highlight" on attribute "Concent", with a sub-annotation "HighlightingTransformer" and a named value "name=Sentence":
+
==== Adding attachments ====
  
 +
Attachments can be added to the query record by adding file upload fields to the search form, for example:
  
 
<source lang="xml">
 
<source lang="xml">
<input type="hidden" name="A.Content.highlight.HighlightingTransformer.name" value="Sentence" />
+
<input type="file" name="Content"/>    
 
</source>
 
</source>
 +
 +
If the user selects a file for this field, it will be uploaded to SMILA and added as attachment "Content". Of course, there must be pipelets in your search pipeline that can process this attachment. Note that in a default SMILA configuration the attachments are kept in memory, so they should not be too large.
 +
 +
=== Record Search Servlet  ===
 +
 +
In addition there exists the very basic Record Search Servlet available at {{Path|/SMILA/recordsearch}}.
 +
 +
You can do a POST or GET request on this URL with a SMILA search record in XML representation as the request body. The servlet then parses the given XML and calls the Search Service. The default is to use the SearchPipeline, but you can define any other pipeline by adding the {{code|_workflow}} annotation to the search record with the respective pipeline name.
 +
 +
The servlet returns the XML representation of the record returned by the Search Service as is, in which you can find the search results (see above).
 +
  
 
[[Category:SMILA]]
 
[[Category:SMILA]]

Revision as of 06:07, 4 May 2012

This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests in BPEL workflows, and the sample servlet used to create a simple web-based GUI for search.

Introduction

Let's start right at the top: Provided that you installed SMILA and created an index by starting a crawler as described in 5 Minutes to Success, you can use your web browser to go to http://localhost:8080/SMILA/search and search the index:

SMILA's sample search page

What happens behind the scenes when you enter a query string and submit the form, is that a servlet creates a SMILA record from the HTTP parameters, uses the search service to execute a BPEL workflow on this record, receives an enriched version of the query record and also a list of result records in XML form, and uses an XSLT stylesheet to create a result page in HTML format.

By clicking the Advanced link at the top of the search page (or by entering the URL http://localhost:8080/SMILA/search?style=SMILASearchAdvanced), you can switch to a more detailed search form page, which allows you to construct more specific search queries:

SMILA's advanced sample search page

If you want to use the default search servlet for your own search page, you are encouraged to use the two XSLT files creating these HTML pages as a reference or basis when building your pages.

Search Processing

Having seen the tip of the iceberg, we dive down to the very bottom of SMILA search: the actual processing of search requests in SMILA BPEL pipelines. We assume that you are accustomed to the basic SMILA workflow processing features used in indexing workflows. You may want to refer to SMILA/Documentation/BPEL Workflow Processor for details.

Search Pipelines

Search workflows (or pipelines) look just like indexing pipelines, they are only used a bit differently: Instead of pushing lists of records corresponding to data source objects through them, they are invoked with a single record representing the search request. This record contains the values of the parameters which were defined by the Search API (see below). The request object can be analyzed and enriched with additional information during the workflow before the actual search on the index takes place. The results of this search are not added to the blackboard as records of their own, but are added to the request record under the key "records". Further pipelets may then do further processing based on the request data and the result record list (e.g. highlighting). Finally, the request record including the search results is returned to the client and can be presented.

Pipelet invocations look the same as in indexing pipelines. See SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/searchpipeline.bpel for a complete example search pipeline (the one used in the above sample).

Search Service API

The actual Search API is quite simple: SMILA registers an OSGi service with the interface org.eclipse.smila.search.api.SearchService. It provides a few methods that take a SMILA query record and the name of a search workflow as input, execute the workflow on the record, and return the result in different formats:

  • Record search(String workflowName, Record query) throws ProcessingException: This is the basic method of the search service, returning the result records as native SMILA data structures. The other methods call this method for the actual search execution, too, and then just convert the result.
  • org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException: Returns the search result as an XML DOM document. See below for the schema of the result.
  • String searchAsXmlString(String workflowName, Record query) throws ProcessingException: Returns the search result as an XML string. See below for the schema of the result.

The schema of XML search results is basically as follows (target namespace is http://www.eclipse.org/smila/search, see org.eclipse.smila.search.api/xml/search.xsd for the full definition):

<element name="SearchResult">
  <complexType>
    <sequence minOccurs="1" maxOccurs="1">
      <element name="Workflow" type="string" minOccurs="1" maxOccurs="1" />
      <element ref="rec:Record" minOccurs="0" maxOccurs="1" />
    </sequence>
  </complexType>
</element>

You can view the result XML when using the sample SMILA search page at http://localhost:8080/SMILA/search if you enable the Show XML result option before submitting the query.

The content of the query record basically depends a lot on the used search services. However, the Search API also includes a recommendation where to put some basic commonly used search parameters which all index integrations should honor (of course they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations.

Query Parameters

The query record mainly consists of parameters. The Search API defines the names of these parameters, the allowed values as well as the default values for a set of commonly used parameters. All implementations should use these properties if possible, i.e. they should not introduce additional parameters for the same purpose, but it may be possible that certain parameters are not supported because it is not feasible with the underlying technology. For some parameters we also defined default values. All parameters are single-valued unless otherwise specified.

  • query: Either a search string using a query syntax or a query record describing the query by setting values for attributes (aka fielded search). The implementer for a specific underlying technology may define a query syntax to be able to build complex search criteria in a single string. However, SMILA currently does not define its own query syntax and passes the string as is to its default search engine Solr (see there for handling and interpretation).
    • Example using a query string:
<Record>
  <Val key="query">meaning of life</Val>
</Record>
  • Example using a query object (fielded search):
<Record>
  <Map key="query">
    <Val key="author">shakespeare</Val>
    <Val key="title">hamlet</Val>
  </Map>
</Record>
  • maxcount: The maximum number of records which should be returned to the search client. Default value is 10. Example:
<Val key="query">meaning of life</Val>
<Val key="maxcount" type="long">3</Val>
  • offset: The number of hits which, starting from the top, should be skipped in the search result. Default value is 0. Use this parameter to implement result list paging and to provide the user a means to navigate through the result pages: If maxcount=10, the "next page" queries can be identical to the initial query, but with offset=10, 20, ... Example:
<Val key="query">meaning of life</Val>
<Val key="maxcount" type="long">3</Val>
<Val key="offset" type="long">3</Val>
  • threshold: The minimal value of the relevance score that a result must have to be returned to the search client. Default is 0.0.
<Val key="query">meaning of life</Val>
<Val key="threshold" type="double">0.5</Val>
  • language: The natural language of the query. No default value. This parameter may be required for language-specific pipelets/services that need to know in which language the user is expressing his or her query to be able to deliver feasible results. Example:
<Val key="query">sinn des lebens</Val>
<Val key="language">de</Val>
  • indexname: Some index services (like Solr) can manage multiple indices at once. When doing so, they can use this parameter to select the index which is to be searched with the current request. However, when using such a scenario, it is recommended to configure a default index name, so that search requests will succeed without having this parameter set explicitly. Example:
<Val key="query">meaning of life</Val>
<Val key="indexname">wikipedia</Val>
  • resultAttributes: A multi-valued parameter containing the names of the attributes which the search engine should add to the result records. Since including too many attributes will decrease performance, the list should contain only those attributes that are needed by some pipelets for further processing after the search has taken place or for displaying the results in the end. Omitting the parameter results in getting all available attributes. Example:
<Val key="query">meaning of life</Val>
<Seq key="resultAttributes">
  <Val>author</Val>
  <Val>title</Val>
</Seq>
  • highlight: A sequence of string values specifying the attribute names for which highlighting should be returned. Example:
<Val key="query">meaning of life</Val>
<Seq key="highlight">
  <Val>content</Val>
</Seq>
  • sortby: A sequence of maps each containing the key "attribute" (any string) and the key "order" ("ascending" | "descending") specifying that the search result should be sorted by the named attributes in the given order. Omitting this parameter results in a search result sorting by descending relevance (score, similarity, ranking, ....). Multiple maps can be added and should be evaluated in the order of their appearance. Example:
<Val key="query">meaning of life</Val>
<Seq key="sortby">
  <Map>
    <Val key="attribute">year</Val>
    <Val key="order">descending</Val>
  </Map>
  <Map>
    <Val key="attribute">author</Val>
    <Val key="order">ascending</Val>
  </Map>
</Seq>
  • facetby: A sequence of maps each containing the key "attribute" (any string) and the key "maxcount" (long). This causes facets to be returned by the search results for the specified attributes, returning "maxcount" values for each attribute. Optionally, each facetby map may contain a map with key "sortby" with keys "order" ("ascending" | "descending") and "criterion" (any string, e.g. "count" or "value") specifying in which order to return the values (e.g. "count" by number of hits per facet value or "value" by attribute value name). Example:
Note (since 1.0): Prior to 1.0 this parameter was named groupby; it has merely been renamed (see mail thread).


<Val key="query">meaning of life</Val>
<Seq key="facetby">
  <Map>
    <Val key="attribute">year</Val>
    <Val key="maxcount" type="long">10</Val>
  </Map>
  <Map>
    <Val key="attribute">author</Val>
    <Map key="sortby">
      <Val key="criterion">value</Val>
      <Val key="order">ascending</Val>        
    </Map>
    <Val key="maxcount" type="long">5</Val>
  </Map>
</Seq>
  • filter: A sequence of maps describing for certain attributes which values they must have in valid result records. Each of the maps contains a key "attribute" and one or more value descriptions:
    • "oneOf", "allOf", "noneOf": sequences of values describing required or forbidden attribute values.
    • "atLeast", "atMost", "greaterThan", "lessThan": single values describing lower and upper bounds (including or excluding the bound values) for the attribute value. Example:
<Val key="query">meaning of life</Val>
<Seq key="filter">
  <Map>
    <Val key="attribute">author</Val>
    <Seq key="oneOf">
      <Val>pratchett</Val>
      <Val>adams</Val>
    </Seq>
  </Map>
  <Map>
    <Val key="attribute">year</Val>
    <Val key="atLeast">1990</Val>
    <Val key="lessThan">2000</Val>
  </Map>
</Seq>
  • ranking: A configuration defining how to rank the search results. This depends heavily on the search engine used, so SMILA does not specify it further.
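
The paging scheme described for the offset and maxcount parameters above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names are hypothetical and not part of SMILA, pages are counted from zero, and the parameters are collected in a plain map rather than in a SMILA query record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of SMILA): derives paging parameters for a result page. */
class PagingSketch {

  /** Returns the number of hits to skip for a zero-based page index, i.e. page * maxcount. */
  static long offsetForPage(long page, long maxcount) {
    if (page < 0 || maxcount <= 0) {
      throw new IllegalArgumentException("page must be >= 0 and maxcount must be > 0");
    }
    return page * maxcount;
  }

  /** Collects the two paging parameters in a plain name/value map (not a SMILA record). */
  static Map<String, Long> pagingParameters(long page, long maxcount) {
    Map<String, Long> params = new LinkedHashMap<>();
    params.put("maxcount", maxcount);
    params.put("offset", offsetForPage(page, maxcount));
    return params;
  }

  public static void main(String[] args) {
    // Third page (index 2) with 10 hits per page: the first 20 hits are skipped.
    System.out.println(pagingParameters(2, 10));
  }
}
```

All other query parameters stay unchanged between pages; only the offset value differs from the initial request.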

Result Annotations

The search result is usually the request record, enriched with result data.

  • records: A sequence of maps describing the actual search result, meaning the records retrieved from the index. Each record should have an additional attribute "_weight" describing the relevance score of this record with respect to the query. The size of the "records" sequence is limited by the "maxcount" parameter.
<Val key="query">meaning of life</Val>
<!-- other query parameters -->
<Seq key="records">
  <Map>
    <Val key="_weight" type="double">0.95</Val>
    <Val key="_recordid">file:hamlet</Val>
    <Val key="title">Hamlet</Val>
    <Val key="author">Shakespeare</Val>
    ...
  </Map>
  <Map>
    <Val key="_weight" type="double">0.90</Val>
    <Val key="_recordid">file:hitchhiker</Val>
    <Val key="title">Hitchhiker's Guide to the Galaxy</Val>
    <Val key="author">Adams</Val>
    ...
  </Map>
</Seq>
Note (returning binary content): There is no nice way to return binary content anymore, as attachments may only be top-level children of a record. These two solutions are possible:
  1. Add an attachment to the search record with a name following this pattern: <resultItem-record.Id>.<resultItem.attachmentName>
  2. Convert the byte[] into a string (e.g. using base64 encoding, so that it is serializable) and return it in the AnyMap.


  • count: The total number of records in the index that have any relevance to the query. For an example, see runtime.
  • indexSize (optional): The total number of records in the searched index. For an example, see runtime.
  • runtime: The execution time of the request in milliseconds, added by the search service. Example:
<Val key="query">meaning of life</Val>
<Val key="count" type="long">123456</Val>
<Val key="indexSize" type="long">987654321</Val>
<Val key="runtime" type="long">42</Val>
<!-- other query parameters -->
<Seq key="records">
  <!-- contains returned records -->
</Seq>
  • facets: The faceting results as requested by the facetby parameters. This Map contains a nested Seq for each requested facet and its values.
<Val key="query">meaning of life</Val>
<Map key="facets">
  <Seq key="year">
    <Map>
      <Val key="value">2000</Val>
      <Val key="count" type="long">42</Val>
    </Map>
    <Map>
      <Val key="value">2001</Val>
      <Val key="count" type="long">21</Val>
    </Map>
    ...
  </Seq>
  <Seq key="author">
    <Map>
      <Val key="value">adams</Val>
      <Val key="count" type="long">13</Val>
    </Map>
    <Map>
      <Val key="value">shakespeare</Val>
      <Val key="count" type="long">17</Val>
    </Map>
    ...
  </Seq>
</Map>
  • _highlight: The annotation of the result record, usually used to highlight relevant sections from the result documents so that the user can see at a glance whether a document suits what he or she was looking for. What exactly is returned here depends on the search engine used. For example, the Solr integration in SMILA returns the raw form of the text and information about the matching parts to be highlighted. Example:
<Seq key="records">
  <Map>
    ...
    <Map key="_highlight">
      <Map key="content">
        <Val key="text">... To be or not to be ...</Val>
        <Seq key="positions">
          <Map>
            <Val key="start" type="long">7</Val>
            <Val key="end" type="long">9</Val>
            <Val key="quality" type="long">100</Val>
          </Map>
          <Map>
            <Val key="start" type="long">20</Val>
            <Val key="end" type="long">22</Val>
            <Val key="quality" type="long">95</Val>
          </Map>
        </Seq>
      </Map>
    </Map>
    ...
  </Map>
</Seq>
Using the HighlightingPipelet this can be transformed into a highlighted text fragment (here using * as the highlight tag):
<Seq key="records">
 <Map>
   <Val key="_weight" type="double">0.95</Val>
   <Val key="_recordid">file:hamlet</Val>
   <Val key="title">Hamlet</Val>
   <Val key="author">Shakespeare</Val>
   <Map key="_highlight">
     <Map key="content">
       <Val key="text">... To *be* or not to *be* ...</Val>
     </Map>
   </Map>
   ...
 </Map>
</Seq>
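
How the "positions" entries relate to the highlighted text can be illustrated with a short Java sketch. It is not the HighlightingPipelet implementation; it merely reproduces the example above under the assumptions that "start" is a zero-based inclusive offset, "end" is exclusive, and the positions are sorted and do not overlap. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not SMILA code): wraps highlight positions in a marker string. */
class HighlightSketch {

  /** One highlight position; start is inclusive and end exclusive (an assumption). */
  static final class Position {
    final int start;
    final int end;

    Position(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Inserts the tag before and after every position; positions must be sorted and disjoint. */
  static String highlight(String text, List<Position> positions, String tag) {
    StringBuilder result = new StringBuilder(text);
    // Walk backwards so that earlier offsets stay valid while markers are inserted.
    for (int i = positions.size() - 1; i >= 0; i--) {
      Position p = positions.get(i);
      result.insert(p.end, tag);
      result.insert(p.start, tag);
    }
    return result.toString();
  }

  public static void main(String[] args) {
    List<Position> positions = Arrays.asList(new Position(7, 9), new Position(20, 22));
    System.out.println(highlight("... To be or not to be ...", positions, "*"));
  }
}
```

With the text and the two positions from the example, the sketch yields the same fragment as shown in the transformed result.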

Helper Classes

There are some classes that help a client to create query records with their annotations and to read result records and their annotation. You can find them in package org.eclipse.smila.search.api.helper:

  • QueryBuilder: A helper class for building queries and sending the query to search service. Returns a result in the form of the next class:
  • ResultAccessor: A wrapper for the complete search result. Provides methods to access the basic top-level result annotations and to access each search result record wrapped by a:
  • ResultRecordAccessor: Defines methods for accessing some of the result record annotations.

See the source code or JavaDocs for more details on the provided methods.

SMILA Search Servlet

In addition to the "search backend", SMILA contains a simple servlet that creates a query record from HTTP parameters and displays the result as an HTML page by converting the XML search result using an XSLT stylesheet. This servlet is intended for quick demos only, not for production use. It is usually deployed in the Jetty instance that comes with SMILA at /SMILA/search. On first invocation, it currently creates a nearly empty query record (it sets some default parameters like maxcount etc.) and processes it with the default pipeline "SearchPipeline". The pipeline should be able to process such a query and return an empty result list, not an error. The XML representation of this empty result is then transformed using the default stylesheet ("SMILASearchDefault") to present an initial search page.

Note that the servlet actually enriches the XML search result a bit, so the input for the XSLT stylesheet does not completely conform to the defined XML schema. Currently, it adds a section containing the names of the indices available in Solr so that the search page can display the names for selection on the left side:

<SearchResult xmlns="http://www.eclipse.org/smila/search">
  <Workflow>searchpipeline</Workflow>
  <Record xmlns="http://www.eclipse.org/smila/record">
    <!-- effective query and embedded result records -->
  </Record>
  <!-- part added by SearchServlet -->
  <IndexNames>
    <IndexName>test_index</IndexName>
  </IndexNames>
</SearchResult>

You can use the same mechanism to add other information to the XML that is necessary for displaying purposes in the search form but not contained in the search service result: You just have to implement your own servlet or extend the default servlet. Please refer to the source code for details.
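On the client side, the extra section can be read like any other XML. The following is a minimal Python sketch, assuming the enriched result looks like the sample above; the function name index_names and the SAMPLE string are made up for illustration and are not part of SMILA:

```python
import xml.etree.ElementTree as ET

# Namespace of the SearchResult element, taken from the sample above.
SEARCH_NS = "http://www.eclipse.org/smila/search"

# Hypothetical sample input, modelled on the servlet output shown above.
SAMPLE = """<SearchResult xmlns="http://www.eclipse.org/smila/search">
  <Workflow>searchpipeline</Workflow>
  <Record xmlns="http://www.eclipse.org/smila/record">
    <!-- effective query and embedded result records -->
  </Record>
  <IndexNames>
    <IndexName>test_index</IndexName>
  </IndexNames>
</SearchResult>"""


def index_names(xml_text):
    """Return the index names that the servlet appended to the result."""
    root = ET.fromstring(xml_text)
    namespaces = {"s": SEARCH_NS}
    # IndexNames inherits the default namespace of SearchResult.
    return [el.text for el in root.findall("s:IndexNames/s:IndexName", namespaces)]


names = index_names(SAMPLE)
```

A stylesheet would select the same elements with a namespace-aware XPath expression; the Python version is only meant to make the structure explicit.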

XSLT Stylesheets for SMILA search and result pages

The stylesheets are loaded from the configuration directory org.eclipse.smila.search.servlet and are selected by adding the HTTP parameter "style" to the URL. The value of this parameter must be the filename of the desired stylesheet without the suffix. The file's extension must be .xsl. The servlet currently uses the hardcoded default name "SMILASearchDefault" if no other value is set.

In the default application, three stylesheets are available:

  • SMILASearchDefault: The default search page. Use this as a reference on how to describe simple queries and present result lists, including paging through bigger results.
  • SMILASearchAdvanced: Same layout for the result list, but demonstrates how to create more complex query records with attribute values and filters.
  • SMILASearchTest: Primitive layout without paging but demonstrates the setting of even more query features.

To start with a stylesheet other than the default one, add a style parameter to the initial URL. E.g., to start with the "advanced" stylesheet, use: http://localhost:8080/SMILA/search?style=SMILASearchAdvanced.
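If such URLs are generated by a script, the style parameter can be appended as in the following Python sketch. The helper name search_url is hypothetical, and the base URL assumes the default local installation mentioned above:

```python
from urllib.parse import urlencode

# Default location of the sample search servlet in a local SMILA installation.
BASE_URL = "http://localhost:8080/SMILA/search"


def search_url(style=None, base_url=BASE_URL):
    """Build the initial search URL, optionally selecting a stylesheet."""
    if style is None:
        # Without a style parameter the servlet falls back to its default.
        return base_url
    if style.endswith(".xsl"):
        # The parameter value is the filename without the .xsl suffix.
        raise ValueError("pass the stylesheet name without the .xsl suffix")
    return base_url + "?" + urlencode({"style": style})


advanced_url = search_url("SMILASearchAdvanced")
```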

In the following we will describe how to set query record features using the servlet. Please have a look at those sample stylesheets for complete examples on how to apply them, as we will not present something like a full tutorial here (-;

Setting parameters

To set a parameter, just use the parameter name as the HTTP parameter name. All values for this HTTP parameter are added to the "parameters" annotation of the query record. E.g., to set the resultSize parameter to 7 using a hidden HTML input field, use:

<input type="hidden" name="resultSize" value="7" />

See below for naming rules for the HTTP parameter names to set attribute literals and annotations. Note that you cannot set a parameter with a name that matches one of these rules.

Setting attributes

You can add literal string values to attributes using "A.<AttributeName>" as the HTTP parameter name. E.g., to set a value from an HTML text input field as a literal in attribute "Title", use:

<input type="text" name="A.Title" />

Setting other parameters

To add a "sortby" parameter for an attribute, use "sortby.<AttributeName>=<order>", e.g.

<input type="hidden" name="sortby.FileSize" value="descending" />
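The naming rules for plain parameters, attribute literals, and sort orders can also be applied outside of an HTML form, for example when a test script assembles the request. The following Python sketch does this under some assumptions: the function servlet_params is hypothetical, and the list of reserved prefixes is only an approximation of the rule that a parameter name must not match one of the special naming patterns.

```python
from urllib.parse import urlencode

# Assumed approximation of the reserved naming patterns described in this section.
RESERVED_PREFIXES = ("A.", "F.val.", "F.min.", "F.max.", "R.", "sortby.")


def servlet_params(parameters=None, attributes=None, sort_by=None):
    """Translate query settings into (name, value) pairs for the servlet."""
    pairs = []
    for name, value in (parameters or {}).items():
        if name.startswith(RESERVED_PREFIXES):
            raise ValueError("parameter name clashes with a naming rule: " + name)
        pairs.append((name, str(value)))
    for attribute, literal in (attributes or {}).items():
        # Attribute literals use the "A.<AttributeName>" pattern.
        pairs.append(("A." + attribute, str(literal)))
    for attribute, order in (sort_by or {}).items():
        # Sort orders use the "sortby.<AttributeName>" pattern from the example.
        pairs.append(("sortby." + attribute, order))
    return pairs


pairs = servlet_params(
    parameters={"resultSize": 7},
    attributes={"Title": "Hamlet"},
    sort_by={"FileSize": "descending"},
)
query_string = urlencode(pairs)
```

The resulting query string can be appended to the servlet URL after a "?" or sent as a form-encoded request body.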

To create a filter for an attribute, use HTTP params:

  • "F.val.<AttributeName>" to add filter values to a "oneOf" filter.
  • "F.min.<AttributeName>" and "F.max.<AttributeName>" to set the lower/upper bounds of an "atLeast"/"atMost" filter.

If both "F.val" and "F.min/F.max" parameters are set, the servlet will create both an enumeration filter and a range filter with the same filter mode. What happens in this case depends on the search engine integration in use. For example:

  • To set a filter for attribute MimeType restricting the result to HTML documents, use:
<input type="hidden" name="F.val.MimeType" value="text/html" />
  • To set a filter for attribute FileSize restricting the result to document sizes between 1000 and 10000 bytes, use:
<input type="hidden" name="F.min.FileSize" value="1000" />
<input type="hidden" name="F.max.FileSize" value="10000" />
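The same filter parameters can be generated programmatically. The following Python sketch maps a filter description to the "F.val", "F.min", and "F.max" parameter names described above; the function filter_params and its arguments are made up for illustration:

```python
from urllib.parse import urlencode


def filter_params(attribute, values=(), minimum=None, maximum=None):
    """Return (name, value) pairs describing filters on one attribute."""
    # Each enumeration value becomes one "F.val.<AttributeName>" parameter.
    params = [("F.val." + attribute, str(value)) for value in values]
    if minimum is not None:
        params.append(("F.min." + attribute, str(minimum)))
    if maximum is not None:
        params.append(("F.max." + attribute, str(maximum)))
    return params


# Restrict results to HTML documents between 1000 and 10000 bytes.
params = filter_params("MimeType", values=["text/html"])
params += filter_params("FileSize", minimum=1000, maximum=10000)
query_string = urlencode(params)
```

Note that urlencode percent-encodes the slash in "text/html", which a browser would also do when submitting the hidden form field.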

To set a value in the ranking parameter for the complete record or an attribute, use "R[.<AttributeName>].<ValueName>". E.g., the following input field adds a parameter "Operator=OR" to attribute "Content":

<input type="hidden" name="R.Operator.Content" value="OR" />

Adding attachments

Attachments can be added to the query record by adding file upload fields to the search form, for example:

<input type="file" name="Content"/>

If the user selects a file for this field, it will be uploaded to SMILA and added as attachment "Content". Of course, there must be pipelets in your search pipeline that can process this attachment. Note that attachments are kept in memory in a default SMILA configuration, so they should not be too large.

Record Search Servlet

In addition, a very basic Record Search Servlet is available at /SMILA/recordsearch.

You can send a POST or GET request to this URL with a SMILA search record in XML representation as the request body. The servlet then parses the given XML and calls the Search Service. By default, the SearchPipeline is used, but you can select any other pipeline by adding the _workflow annotation with the respective pipeline name to the search record.

The servlet returns the XML representation of the record returned by the Search Service as is; it contains the search results (see above).
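As an illustration, a client could prepare such a request as in the following Python sketch. Several details are assumptions and should be checked against your installation: the host and port, the "query" key, the representation of the _workflow annotation as a plain Val element, and the Content-Type header. The helper names are hypothetical.

```python
import urllib.request
from xml.sax.saxutils import escape

RECORD_NS = "http://www.eclipse.org/smila/record"
# Assumed default location of the Record Search Servlet on a local installation.
SERVLET_URL = "http://localhost:8080/SMILA/recordsearch"


def build_search_record(query, workflow=None):
    """Build a minimal search record in XML (assumed layout)."""
    values = ['<Val key="query">%s</Val>' % escape(query)]
    if workflow is not None:
        # Selects a pipeline other than the default one.
        values.append('<Val key="_workflow">%s</Val>' % escape(workflow))
    return '<Record xmlns="%s">%s</Record>' % (RECORD_NS, "".join(values))


def build_request(record_xml, url=SERVLET_URL):
    """Wrap the record in a POST request without sending it."""
    return urllib.request.Request(
        url,
        data=record_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=UTF-8"},  # assumed
        method="POST",
    )


request = build_request(build_search_record("hamlet", workflow="SearchPipeline"))
# urllib.request.urlopen(request) would send the request to a running SMILA
# instance; it is left out so that the sketch runs without a server.
```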