Difference between revisions of "SMILA/Documentation/2011.Simplification/Search"

(Search Service API)
(For SMILA 1.0: Simplification pages are obsolete, redirect to SMILA/Documentation/Search)
 
(46 intermediate revisions by 3 users not shown)
This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests in BPEL workflows, and the sample servlet used to create a simple search Web GUI.
+
#REDIRECT [[SMILA/Documentation/Search]]
 
+
=== Introduction ===
+
 
+
Let's start at the top: If you have installed SMILA and created an index by starting a crawler, you can now use your web browser to go to [http://localhost:8080/SMILA/search http://localhost:8080/SMILA/search] to search the index:
+
 
+
[[Image:SMILA-search-page-default.png|500px|SMILA's sample search page]]
+
 
+
When you enter a query string and submit the form, a servlet creates a SMILA record from the HTTP parameters and uses the search service to execute a BPEL workflow on this record. It receives an enriched version of the query record and a list of result records in XML form, and uses an XSLT stylesheet to create the result HTML page.
+
 
+
Using the [http://localhost:8080/SMILA/search?style=SMILASearchAdvanced Advanced] link at the top you can switch to a more detailed search page:
+
 
+
[[Image:SMILA-search-page-advanced.png|500px|SMILA's advanced sample search page]]
+
 
+
This page allows you to enter a more specific query. In case you want to use the default search servlet for your own search page you should use the XSLT files that create these two pages as a reference when trying to design your own search page.
+
 
+
 
+
=== Search Processing ===
+
 
+
Having seen the tip of the iceberg, we dive down to the very bottom of SMILA search: the actual processing of search requests in SMILA BPEL pipelines. We assume that you are accustomed to the basic SMILA workflow processing features used in indexing workflows. You may want to refer to [[SMILA/Documentation/BPEL_Workflow_Processor]] for details.
+
 
+
==== Search Pipelines ====
+
 
+
Search workflows (or pipelines) look just like indexing pipelines; they are only used a bit differently. Instead of pushing lists of records that correspond to data source objects through them, they are invoked with a single record that represents the search request by containing values for parameters defined by the Search API (see below). During the workflow this request object can be analyzed and enriched, and eventually the actual search in an index is done. The results of this search are not added to the blackboard as records of their own, but are added to the request record under the key "records". Further pipelets may then do additional work based on the request data and the result record list (e.g. highlighting). Finally, the request record, which now also contains the search result, is returned to the client and can be presented.
+
 
+
Pipelet invocations look the same as in indexing pipelines. See <tt>SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/searchpipeline.bpel</tt> for a complete example search pipeline (the one used in the above sample).
+
 
+
=== Search Service API ===
+
 
+
The actual Search API is quite simple: SMILA registers an OSGi service with the interface <tt>org.eclipse.smila.search.api.SearchService</tt>. It provides a few methods that take a
+
SMILA query record and the name of a search workflow as input, execute the workflow on the record and return the result in different formats:
+
 
+
* <tt>Record search(String workflowName, Record query) throws ProcessingException</tt>: this is the basic method of the search service, that returns the result records as SMILA data structures. The other methods call this method for the actual search execution, too, and just convert the result.
+
* <tt>org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException</tt>: return the search result as an XML DOM document. See below for the schema of the result.
+
* <tt>String searchAsXmlString(String workflowName, Record query) throws ProcessingException</tt>: return the search result as an XML string. See below for the schema of the result.
+
 
+
The schema of XML search results is basically as follows (target namespace is <tt>http://www.eclipse.org/smila/search</tt>, see <tt>org.eclipse.smila.search.api/xml/search.xsd</tt> for the full definition):
+
 
+
<source lang="xml">
+
  <element name="SearchResult">
+
    <complexType>
+
      <sequence minOccurs="1" maxOccurs="1">
+
        <element name="Workflow" type="string" minOccurs="1" maxOccurs="1" />
+
        <element ref="rec:Record" minOccurs="0" maxOccurs="1" />
+
      </sequence>
+
    </complexType>
+
  </element>
+
</source>
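
As a minimal sketch of how a client might read such an XML result string, the following uses only the JDK's DOM parser to pick out the <tt>Workflow</tt> element. The class name <tt>SearchResultXmlReader</tt> and the sample XML in <tt>main</tt> are illustrative assumptions written by hand from the schema above; they are not part of SMILA.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of SMILA: reads the workflow name
// from a search result XML string that follows the schema above.
class SearchResultXmlReader {
  static final String SEARCH_NS = "http://www.eclipse.org/smila/search";

  // Returns the text of the first Workflow element, or null if there is none.
  static String workflowOf(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // the result elements live in a namespace
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      NodeList nodes = doc.getElementsByTagNameNS(SEARCH_NS, "Workflow");
      return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    } catch (Exception e) {
      throw new IllegalArgumentException("not a parseable search result", e);
    }
  }

  public static void main(String[] args) {
    // Hand-written sample; a real result would also carry a Record element.
    String xml = "<SearchResult xmlns=\"" + SEARCH_NS + "\">"
        + "<Workflow>SearchPipeline</Workflow></SearchResult>";
    System.out.println(workflowOf(xml));
  }
}
```

The same approach should work on the DOM document returned by <tt>searchAsXml</tt>, in which case the parsing step can be skipped.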
+
 
+
You can view the result XML by using the sample SMILA search page [http://localhost:8080/SMILA/search] and selecting the "Show XML result" checkbox before submitting the query.
+
 
+
The content of the query record depends largely on the search service used. E.g. using the LuceneSearchService, you can set attribute values to search in the index fields to which these attributes have been mapped during indexing (refer to the Lucene integration documentation for details). Other search parameters are attached to the query record as annotations. However, the Search API also recommends where to put some basic, commonly used search parameters, which all index integrations should honor (of course they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations.
+
 
+
=== Query Parameters ===
+
 
+
Parameters are stored in a single place in the query record and used to describe relatively simple query properties: The query record has a single annotation named "parameters", which can contain:
+
 
+
* single valued parameters: named values of the annotation
+
* multi valued parameters: subannotation with the parameter name and the values as its anon-values list.
+
* map valued parameters: subannotations with named values
+
 
+
The Search API also defines the names, allowed values and default values for a set of commonly used parameters. All implementations should use these parameters if possible, i.e. they should not introduce additional parameters for the same purpose. Certain parameters may not be supported, however, because they are not feasible with the underlying technology. All parameters are single-valued unless specified otherwise.
+
 
+
* query: the search string. The index implementor can define a syntax to describe complex search criteria in a single string; SMILA currently does not define a query syntax of its own. The index implementor might or might not be able to merge this query string with search criteria described by attribute values/annotations of the query record.
+
* resultSize: number of records to return to the search client, default value is 10.
+
* resultOffset: number of top results to skip, default value is 0. Use this parameter to implement result list paging: If resultSize=10, the "next page" queries can be identical to the initial query, but with resultOffset=10, 20, ....
+
* threshold: minimal relevance score that a result must have, default is 0.0.
+
* language: language of the query, no default value. There could be language-specific pipelets/services that need to know in which language the user expresses the query in order to work correctly.
+
* indexName: some index services (like our LuceneIndexService) can manage multiple indexes at once, then they can use this parameter to select the index to search with this request. However, they always should have a default index name configured somehow so that a request succeeds without having this parameter set.
+
* resultAttributes: multi-valued parameter, describing the names of attributes that should be added to result records by the search engine. This list should only contain the attributes needed by pipelets after the search for processing or by the search page for displaying the results. Including too many attributes will decrease performance. Omitting this parameter should result in getting all available attributes.
+
* orderBy: map valued parameter with named values "attribute" (any string) and "mode" ("ASC"/"DESC") specifying that the search result should be sorted by the named attribute in the given direction. Omitting this parameter should result in a search result sorted by relevance (score, similarity, ranking, ....). Multiple orderBy annotations can be added and should be evaluated in the order of appearance.
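
The paging rule for resultSize and resultOffset can be sketched as follows. This is only an illustration: the class <tt>SearchPaging</tt> and its methods are hypothetical, pages are counted from zero, and the code assumes that the sample search servlet accepts the parameter names directly as HTTP parameter names.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of SMILA: builds search URLs for result paging.
class SearchPaging {

  // Offset of the first result on a zero-based page.
  static int offsetForPage(int page, int resultSize) {
    if (page < 0 || resultSize <= 0) {
      throw new IllegalArgumentException("page must be >= 0 and resultSize > 0");
    }
    return page * resultSize;
  }

  // Same query for every page; only resultOffset changes.
  static String pageUrl(String baseUrl, String query, int page, int resultSize) {
    return baseUrl
        + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
        + "&resultSize=" + resultSize
        + "&resultOffset=" + offsetForPage(page, resultSize);
  }

  public static void main(String[] args) {
    String base = "http://localhost:8080/SMILA/search";
    for (int page = 0; page < 3; page++) {
      System.out.println(pageUrl(base, "eclipse smila", page, 10));
    }
  }
}
```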
+
 
+
=== Query Attribute Annotations ===
+
 
+
Additional annotations can be added to attributes for which they describe refinements of the search. They usually contain only named values and anonymous values.
+
 
+
* Filters: Filters describe hard restrictions on the values of result record attributes that must be matched for a record to be included in a result (in contrast to attribute values, which may describe only soft criteria). Advanced search engines might even allow adding multiple filter annotations to a single attribute.
+
** type: "ENUMERATION"/"RANGE": Specifies if the filter is described by an explicit enumeration of allowed/forbidden values or by giving the lower and/or upper bound. For enumeration filters, the actual filter values are added as anonymous values to the filter annotation. For range filters, see below.
+
** mode: "ALL"/"ANY"/"ONLY"/"NONE": Specify whether an allowed object must have all, any, only or none of the filter values to match the filter.
+
** min/max: Specify the lower and/or upper bound of a range filter.
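
One possible reading of the four enumeration filter modes is sketched below with plain Java sets. This is an interpretation, not SMILA code: the class <tt>EnumerationFilterSketch</tt> is hypothetical, and in particular the treatment of "ONLY" (the object has no values outside the filter values) is an assumption. The actual behavior depends on the search engine integration.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch, not part of SMILA: one reading of the filter modes.
class EnumerationFilterSketch {
  enum Mode { ALL, ANY, ONLY, NONE }

  // Decides whether an object with the given attribute values passes the filter.
  static boolean matches(Mode mode, Set<String> filterValues, Set<String> objectValues) {
    switch (mode) {
      case ALL:  // object has every filter value
        return objectValues.containsAll(filterValues);
      case ANY:  // object has at least one filter value
        return !Collections.disjoint(objectValues, filterValues);
      case ONLY: // assumed: object has no value outside the filter values
        return filterValues.containsAll(objectValues);
      case NONE: // object has none of the filter values
        return Collections.disjoint(objectValues, filterValues);
      default:
        throw new AssertionError(mode);
    }
  }

  public static void main(String[] args) {
    Set<String> filter = Set.of("text/html", "text/plain");
    System.out.println(matches(Mode.ANY, filter, Set.of("text/html")));
    System.out.println(matches(Mode.NONE, filter, Set.of("application/pdf")));
  }
}
```

Note that under this reading an object without any value for the attribute passes an "ONLY" filter; an integration may well decide otherwise.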
+
 
+
* Ranking: Contains properties that modify the ranking or relevance score of results by manipulating the relevance valuation for a single attribute or for the complete record (if attached to the record itself, not an attribute). Two property names are predefined, but search engine integrations may include additional names:
+
** name: if the engine knows a number of different named ways or algorithms to compute the relevance this property can be used to select a different one than the default
+
** boost: changes the weight of this attribute when the local relevance is accumulated into a global one.
+
 
+
 
+
=== Result Annotations ===
+
 
+
Annotations may not only be attached to the query record, but to the records in the search result, too. There are even additional annotations attached to the query object to describe result properties that do not refer to a single result record, but to the complete search result.
+
 
+
* Result statistics: After the search the query record contains an annotation named "result" that currently contains these named values:
+
** runtime: runtime for the invoked pipeline, in milliseconds
+
** totalHits: number of possible results for this query, i.e. all objects from an index that have a relevance score greater than the specified threshold (or zero).
+
** indexSize: complete number of objects in the searched index.
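
A search page can combine totalHits with the resultSize and resultOffset parameters to drive its paging controls. The following is a small sketch of that arithmetic; the class <tt>ResultPages</tt> is hypothetical and not taken from the SMILA sources.

```java
// Hypothetical helper, not part of SMILA: paging arithmetic on result statistics.
class ResultPages {

  // Number of pages needed to show totalHits results, resultSize per page.
  static long pageCount(long totalHits, int resultSize) {
    if (totalHits < 0 || resultSize <= 0) {
      throw new IllegalArgumentException("totalHits must be >= 0 and resultSize > 0");
    }
    return (totalHits + resultSize - 1) / resultSize; // integer ceiling
  }

  // True if there are results beyond the page starting at resultOffset.
  static boolean hasNextPage(long totalHits, int resultOffset, int resultSize) {
    return (long) resultOffset + resultSize < totalHits;
  }

  public static void main(String[] args) {
    System.out.println(pageCount(25, 10));
    System.out.println(hasNextPage(25, 20, 10));
  }
}
```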
+
 
+
* Score: Each result record should have a "result" annotation, too, giving at least the ranking score calculated by the search engine as named value "relevance" as a double value (usually 1 means: perfect match).
+
 
+
* Highlighting: TODO
+
 
+
* Facets: TODO
+
 
+
* Terms: TODO
+
 
+
=== Helper Classes ===
+
 
+
There are some classes that help a client to create query records with their annotations and to read out result records and their annotations. You can find them in package <tt>org.eclipse.smila.search.api.helper</tt>:
+
 
+
* <tt>QueryBuilder</tt>: helper class for building queries and sending the query to search service. Returns a result in the form of the next class:
+
* <tt>ResultAccessor</tt>: wrapper for the complete search result. Does not do much on its own, but basically creates instances of the following classes to access the records of the result.
+
* <tt>QueryRecordAccessor</tt>: Defines methods for accessing literals and annotations of the enriched query that is part of the search result.
+
* <tt>ResultRecordAccessor</tt>: Defines methods for reading literals and annotations of search result records.
+
 
+
See the source code or javadocs for more details of the provided methods.
+
 
+
 
+
=== Servlet ===
+
 
+
In addition to this "search backend", SMILA contains a simple servlet that creates a query record from HTTP parameters and displays the result as an HTML page by converting the XML search result using an XSLT stylesheet. This servlet is intended for quick demos, not for production use. It is usually deployed in the Tomcat instance that comes with SMILA at "/SMILA/search". On first invocation it currently creates a nearly empty query record (it sets some default parameters like resultSize etc.) and processes it with the default pipeline "SearchPipeline". The pipeline should be able to process such a query and return an empty result list, not an error. The XML representation of this empty result is then transformed using the default stylesheet ("SMILASearchDefault") to present an initial search page.
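
The XML-to-HTML step can be illustrated with the JDK's standard XSLT API. This is a sketch of the general technique, not the servlet's actual code: the class <tt>ResultPageRenderer</tt>, the tiny stylesheet and the sample result XML are all assumptions made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch, not the SMILA servlet: renders result XML with XSLT.
class ResultPageRenderer {
  // Hand-written sample result and stylesheet, for illustration only.
  static final String SAMPLE_RESULT =
      "<SearchResult xmlns='http://www.eclipse.org/smila/search'>"
      + "<Workflow>SearchPipeline</Workflow></SearchResult>";
  static final String SAMPLE_STYLESHEET =
      "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:s='http://www.eclipse.org/smila/search'"
      + " exclude-result-prefixes='s'>"
      + "<xsl:output method='html'/>"
      + "<xsl:template match='/s:SearchResult'>"
      + "<html><body><p>Workflow: <xsl:value-of select='s:Workflow'/></p></body></html>"
      + "</xsl:template></xsl:stylesheet>";

  // Applies the stylesheet to the result XML and returns the generated page.
  static String render(String resultXml, String stylesheetXml) {
    try {
      Transformer transformer = TransformerFactory.newInstance()
          .newTransformer(new StreamSource(new StringReader(stylesheetXml)));
      StringWriter html = new StringWriter();
      transformer.transform(
          new StreamSource(new StringReader(resultXml)), new StreamResult(html));
      return html.toString();
    } catch (TransformerException e) {
      throw new IllegalStateException("could not render search result", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(render(SAMPLE_RESULT, SAMPLE_STYLESHEET));
  }
}
```

Because the result elements are namespaced, the stylesheet has to bind a prefix to <tt>http://www.eclipse.org/smila/search</tt> in its match patterns; an unprefixed pattern such as <tt>SearchResult</tt> would not match.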
+
 
+
Note that the servlet actually enriches the XML search result a bit, so the input for the XSLT stylesheet does not completely conform to the defined XML schema. Currently, it adds a section containing the names of indexes available in the LuceneSearchService so that the search page can display the names for selection on the left side:
+
 
+
<source lang="xml">
+
<SearchResult xmlns="http://www.eclipse.org/smila/search">
+
  <Query>
+
  </Query>
+
  <RecordList xmlns="http://www.eclipse.org/smila/record">
+
    ...
+
  </RecordList>
+
  <!-- part added by SearchServlet -->
+
  <IndexNames>
+
    <IndexName>test_index</IndexName>
+
  </IndexNames>
+
</SearchResult>
+
</source>
+
 
+
You can use the same mechanism to add other information to the XML that is necessary to display the search form but not contained in the search service result. To do so, you have to implement your own servlet or extend the default servlet. Please refer to the source code for details.
+
 
+
==== XSLT Stylesheets for SMILA search and result pages ====
+
 
+
The stylesheets are loaded from the configuration directory "org.eclipse.smila.search.servlet" and selected using the HTTP parameter "style". The value of this parameter must be the stylesheet filename without suffix; the suffix must be <tt>.xsl</tt>. The servlet currently uses the hardcoded default name "SMILASearchDefault" if no other value is set.
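
The naming rule can be summarized in a few lines of Java. This is a sketch only: the class <tt>StylesheetNames</tt> is hypothetical, and treating an empty "style" value like a missing one is an assumption, not documented servlet behavior.

```java
// Hypothetical sketch, not the SMILA servlet: maps the "style" parameter to a file name.
class StylesheetNames {
  static final String DEFAULT_STYLE = "SMILASearchDefault";
  static final String SUFFIX = ".xsl";

  // Falls back to the default style if the parameter is missing (or empty, by assumption).
  static String fileNameFor(String styleParameter) {
    String style = (styleParameter == null || styleParameter.isEmpty())
        ? DEFAULT_STYLE
        : styleParameter;
    return style + SUFFIX;
  }

  public static void main(String[] args) {
    System.out.println(fileNameFor(null));
    System.out.println(fileNameFor("SMILASearchAdvanced"));
  }
}
```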
+
 
+
In the default application, three stylesheets are available:
+
 
+
* SMILASearchDefault: the default search page. Use this as a reference for how to describe simple queries and to present result lists, including paging through bigger results.
+
* SMILASearchAdvanced: same layout for the result list, but demonstrates how to create more complex query records with attribute values and filters.
+
* SMILASearchTest: primitive layout, no paging, but demonstrates the setting of even more query features.
+
 
+
To start with a stylesheet other than the default, you can add a "style" parameter to the initial URL. E.g., to start with the "advanced" stylesheet, use: [http://localhost:8080/SMILA/search?style=SMILASearchAdvanced http://localhost:8080/SMILA/search?style=SMILASearchAdvanced].
+
 
+
In the following we will describe how to set query record features using the servlet. Please have a look at those sample stylesheets for complete examples of how to apply them, as we will not present something like a full tutorial here (-;
+
 
+
 
+
==== Setting parameters ====
+
 
+
To set a parameter, just use the parameter name as the HTTP parameter name. All values for this HTTP parameter are added to the "parameters" annotation of the query record. E.g., to set the "resultSize" parameter to 7 using an HTML hidden input field, use:
+
 
+
<source lang="xml">
+
<input type="hidden" name="resultSize" value="7" />
+
</source>
+
 
+
See below for naming rules for the HTTP parameter names to set attribute literals and annotations. Note that you cannot set a parameter with a name that matches one of these rules.
+
 
+
 
+
==== Setting attributes ====
+
 
+
You can add literal string values to attributes using "A.<AttributeName>" as the HTTP parameter name. E.g., to set a value from an HTML text input field as a literal in attribute "Title", use:
+
 
+
<source lang="xml">
+
<input type="text" name="A.Title" />
+
</source>
+
 
+
 
+
==== Setting other annotations ====
+
 
+
To set a named value in the ranking annotation for the complete record or an attribute, use "R.<ValueName>[.<AttributeName>]". You are not limited to the predefined ranking value names "name" and "boost". E.g., the following input field adds a named value "Operator=OR" to the ranking annotation of attribute "Content":
+
 
+
<source lang="xml">
+
<input type="hidden" name="R.Operator.Content" value="OR" />
+
</source>
+
 
+
To create a filter for an attribute, use HTTP params:
+
 
+
* "F.<AttributeName>" to set the filter mode ("ALL", "ANY", "ONLY", "NONE")
+
* "Fval.<AttributeName>" to add filter values to an enumeration filter.
+
* "Fmin.<AttributeName>" and "Fmax.<AttributeName>" to set the lower/upper bounds of a range filter.
+
 
+
If both "Fval" and "Fmin/Fmax" parameters are set, the servlet will create both an enumeration filter and a range filter with the same filter mode. What happens in this case depends on the search engine integration used. E.g.
+
 
+
* to set a filter for attribute "MimeType" restricting the result to HTML documents, use:
+
 
+
<source lang="xml">
+
<input type="hidden" name="Fval.MimeType" value="text/html" />
+
</source>
+
 
+
* to set a filter for attribute "FileSize" restricting the result to document sizes between 1000 and 10000 bytes, use:
+
 
+
<source lang="xml">
+
<input type="hidden" name="Fmin.FileSize" value="1000" />
+
<input type="hidden" name="Fmax.FileSize" value="10000" />
+
</source>
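
If the filters are set from a link rather than a form, the same HTTP parameters go into the query string. The following sketch builds such a query string with the filter parameter names described above; the class <tt>FilterQueryString</tt> is hypothetical and only illustrates the naming scheme.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SMILA: builds filter parameters for a search URL.
class FilterQueryString {
  private final List<String> pairs = new ArrayList<>();

  // "F.<AttributeName>": filter mode ("ALL", "ANY", "ONLY", "NONE").
  FilterQueryString mode(String attribute, String mode) {
    return add("F." + attribute, mode);
  }

  // "Fval.<AttributeName>": one value of an enumeration filter; may be repeated.
  FilterQueryString value(String attribute, String value) {
    return add("Fval." + attribute, value);
  }

  // "Fmin.<AttributeName>" / "Fmax.<AttributeName>": range bounds; null omits a bound.
  FilterQueryString range(String attribute, String min, String max) {
    if (min != null) {
      add("Fmin." + attribute, min);
    }
    if (max != null) {
      add("Fmax." + attribute, max);
    }
    return this;
  }

  private FilterQueryString add(String name, String value) {
    pairs.add(URLEncoder.encode(name, StandardCharsets.UTF_8)
        + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8));
    return this;
  }

  String build() {
    return String.join("&", pairs);
  }

  public static void main(String[] args) {
    System.out.println(new FilterQueryString()
        .value("MimeType", "text/html")
        .range("FileSize", "1000", "10000")
        .build());
  }
}
```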
+
 
+
To set named values in other attribute annotations, use "A.<AttributeName>.(<AnnotationName>.)+<ValueName>". Note that this does not work for attribute and annotation names containing "." characters. E.g., the following snippet creates an annotation "highlight" on attribute "Content", with a sub-annotation "HighlightingTransformer" and a named value "name=Sentence":
+
 
+
 
+
<source lang="xml">
+
<input type="hidden" name="A.Content.highlight.HighlightingTransformer.name" value="Sentence" />
+
</source>
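
The naming rules for attribute literals, ranking values and nested annotation values can be collected in a small helper, for example when a search page is generated by code. This is a sketch under the rules stated above; the class <tt>ServletParameterNames</tt> is hypothetical and not part of SMILA.

```java
import java.util.List;

// Hypothetical helper, not part of SMILA: composes HTTP parameter names for the servlet.
class ServletParameterNames {

  // "A.<AttributeName>"
  static String attributeLiteral(String attributeName) {
    return "A." + attributeName;
  }

  // "R.<ValueName>[.<AttributeName>]"; a null attribute targets the whole record.
  static String rankingValue(String valueName, String attributeName) {
    return attributeName == null
        ? "R." + valueName
        : "R." + valueName + "." + attributeName;
  }

  // "A.<AttributeName>.(<AnnotationName>.)+<ValueName>"
  static String annotationValue(String attributeName, List<String> annotationNames,
      String valueName) {
    if (annotationNames.isEmpty()) {
      throw new IllegalArgumentException("at least one annotation name is required");
    }
    StringBuilder name = new StringBuilder("A.").append(requireNoDot(attributeName));
    for (String annotationName : annotationNames) {
      name.append('.').append(requireNoDot(annotationName));
    }
    return name.append('.').append(valueName).toString();
  }

  // Names containing "." cannot be expressed in this scheme.
  private static String requireNoDot(String name) {
    if (name.contains(".")) {
      throw new IllegalArgumentException("name must not contain '.': " + name);
    }
    return name;
  }

  public static void main(String[] args) {
    System.out.println(annotationValue(
        "Content", List.of("highlight", "HighlightingTransformer"), "name"));
  }
}
```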
+
 
+
[[Category:SMILA]]
+

Latest revision as of 07:10, 19 January 2012
