Difference between revisions of "SMILA/Specifications/Search API"

Revision as of 06:11, 26 January 2011

Goals

The goal is to design a (relatively ;-) easy-to-use search API for SMILA.

Thoughts/Requirements:

  • Processing of search requests is done with the BPEL engine. This does not mean that they have to use the same very generic Processor interface (Record list in, Record list out). But it should be similar, so that authors do not have to get used to writing completely different kinds of BPEL files.
  • It should be designed to be somehow "specific" for search: We do not want to create an interface where literally everything can be returned, making it quite difficult for clients (and newbie users) to recognize what is returned where, and requiring a lot of configuration effort on the client side to describe which part of a search result can be found where in the server result.
  • On the other hand, we can learn from the current usage of the empolis IAS (eIAS) APIs what elements are needed in search queries and results:
    • The query needs to have things like query values, filters, boost factors (importance) and other parameters that control the evaluation.
    • The result needs to contain the "effective" query (enriched by text mining, rules, ...), the total number of hits, a result list (result objects + similarities), faceted classification (grouping, dialog questions), highlighting/markup of result attributes.
  • We have found that it does not make sense to invent new result structures for each new service, but that we rather try to express new service results in terms of existing result structures, because then existing clients can use the new results without changes.
  • We want to design an API that is extensible when new requirements for API elements arise. "Ad-hoc" extensibility would be preferable: I.e., the search API should only be a wrapper for some generic objects to ease accessing them, but it should be possible to use new features by using the generic objects immediately so that one does not have to wait for an API extension.
  • The Search API needs to be usable efficiently over different types of transports. While the most common client may be a servlet (portlet, ...) running in a Tomcat service in the same VM as the actual SMILA processing and search services, other clients may use RMI calls or SOAP web services to talk to the SMILA search.
  • The search API should be easily consumable by Web, Java, and .NET clients.

Design

Elements of queries

A search request (or "query" for short) consists of

  • workflow name: name of pipeline processing the query.
  • A single query record
    • The ID of the record could denote a user session (source = client ID, key = session ID) so that it is possible to support session handling in the search backend. If no ID is specified, a GUID is generated as the ID key.
    • Attributes contain the values to search for: Only objects in the index that relate to these attribute values (i.e. have the same values in exact search, or "similar" values in fuzzy search) are part of the result set.
  • Filters: conceptually limit the allowed values of a single search attribute in the result (e.g. if you want to limit the search results to a specific source from which your index entries originate). Whether filters are usable only as pure filters (which do not select any objects for the result set the way attribute values do, but merely restrict the result set created by the attribute values), or can also be used without actual query values to produce an unranked result, depends on the retrieval service used in the called pipeline.
  • Tweaking of relevance ranking/similarity evaluation: Change weights of single attributes (boosting, "Importance" in eIAS), use different calculation methods for complete query or single attributes ("UseMeasure" in eIAS).
  • Textual query string: Alternatively a query may be described as a "single string" representation of query attributes, filters and boosting. To enable this we have to define a search syntax like the Lucene query parser syntax, for example. It is not necessary that everything is expressible in this search string, but it should be possible to merge a string query with an "object query" (the query expressed by the things above) so one can do whatever is possible with the string search and add advanced features with the query object. In this case the string search would be useful as an easy start for the most common search features.
    • Note that adopters of SMILA do not have to support the "officially chosen SMILA syntax" if they want to integrate other search engines that support a different (more sophisticated) query syntax natively. The search API will treat this only as a string and will not perform syntax checking.
  • Control parameters for pipelets and services: These control the operation of the pipelets and services in the search workflow, such as:
    • size of the result set,
    • "cursor position" or page in the result,
    • query language, user ID or token (e.g. if needed for security),
    • sort order,
    • etc.
See SMILA/Specifications/Service Runtime Parameters for details on the representation and the evaluation of these parameters in the services/pipelets.

Elements of results

A search result consists of:

  • a single object representing the "effective query": During the search workflow the query object and parameters could be changed or enhanced by text mining, rules and other query processing services. The search client may need to access the results of this query analysis and enrichment, e.g. to highlight concepts recognized in the query or to present spellchecker proposals ("Did you mean:"). This object may have further information attached that is not related to a single search result object and can be used by the user to create a follow-up query. An example of this is a faceted classification of the search result (aka grouping, dialog questions).
  • a result object list representing the hits to display in the client as the result of the search. The result objects consist of information that is returned from the index or an associated record store, e.g. SMILA XML storage, and may be enriched by pipelets and services that create single-result-object related information, e.g. highlighting/marker services or adaptation rules.

For efficient communication with the client, both parts of the result are optional. This is obvious for the "effective query" object since there may be clients that are only interested in the result objects. But there could also be searches that are only interested in the query analysis, enrichment, classification and do not need the actual search result, e.g. an AJAX client looking for query proposals while the user is typing in the search text field.

Low level API and representation

"Low-level" means a description of the actual objects that are given to the workflow processor (i.e. the BPEL engine, currently) and that are returned as a result. The base of the low level representation is the SMILA record, i.e. everything listed above as a part of a query or search result must be represented as a part of a record. This way each part of the query (and result) is available inside the BPEL pipeline by just accessing record attributes or record annotations.

This means that the interface of the Search Service is:

interface SearchService {
    SearchResult search(String workflowName, Record query) throws ProcessingException;
}
 
interface SearchResult {
    String getWorkflowName(); // to be able to create follow-up queries.
    Record getQuery();
    List<Record> getResult();
}

Simple, isn't it? This is also the API that is exposed by a remote Search Service, be it as an RMI interface, a web service interface or whatever.

Every query part apart from the workflow name is attached to the query record as an annotation. Likewise, each part of the result is attached either to the "effective query" record or to a record in the result list, depending on whether it relates to the complete query as such or to a single result.
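
To get a feeling for the terms "named value", "anonymous value" and "sub-annotation" used in the following sub-sections, here is a minimal sketch of such a structure. AnnotationSketch and its members are hypothetical and only model the idea with plain Java collections; they make no claim to match the actual SMILA Record and Annotation interfaces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of an annotation (not the SMILA interface).
final class AnnotationSketch {
    // named values, e.g. "resultSize" -> "6"
    final Map<String, String> namedValues = new LinkedHashMap<>();
    // anonymous values, e.g. the values of an enumeration filter
    final List<String> anonymousValues = new ArrayList<>();
    // sub-annotations; several may share one name, list order is preserved
    final Map<String, List<AnnotationSketch>> subAnnotations = new LinkedHashMap<>();

    // Creates a new sub-annotation under the given name and returns it.
    AnnotationSketch addSubAnnotation(String name) {
        AnnotationSketch child = new AnnotationSketch();
        subAnnotations.computeIfAbsent(name, k -> new ArrayList<>()).add(child);
        return child;
    }
}
```

A parameter annotation with a result size and two order-by specifications could then be filled by putting "resultSize" into namedValues and calling addSubAnnotation("orderBy") twice; the order of the two calls would determine the sort precedence.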


The following sub-sections describe how query and result elements are represented in this low-level structure.

Representation of query elements

  • query attribute values: stored as record attributes, of course.
  • service/pipelet parameters: This is stored in annotations of the query record, see SMILA/Specifications/Service Runtime Parameters for details on representation and evaluation:
    • string query: named value "query"
    • result set size: named value "resultSize"
    • offset in result set (for paging): named value "resultOffset"
    • minimal relevance: named value "threshold"
    • language of query: named value "language"
    • order-by specification: list of sub-annotations "orderBy" with named values "attribute" = attribute to order by, "mode" = "ASC"/"DESC", precedence is given by list order
    • more parameters can be added and should be observed by all pipelets if no specific parameters are set (see below)
  • filter: annotation "filter" of attribute to which the filter applies. A list of filter values can be added in the list of anonymous values. The meaning of the filter is determined by named values:
    • "type": "enumeration", "range"
    • "mode": "all" (all filter values must be in result attribute), "any" (at least one filter value must be in result attribute), "none" (no filter value is in result attribute), "only" (result attribute contains only filter values)
    • for range filters: no anonymous values, but "min" and "max" named values describing the range of the filter. Only "any", "none", "only" make sense as modes for range filters.
  • changing relevance ranking: annotation "ranking" on complete record or attribute to manipulate. named values
    • "name": select another ranking calculation method than the default.
    • "boost": change weight of attribute (irrelevant in top-level annotation)
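
The four modes of an enumeration filter can be read as simple set relations between the filter values and the values of the result attribute. The following is only a sketch of that reading: EnumFilterSketch and matches are hypothetical names, values are simplified to strings, and an actual retrieval service may evaluate filters differently (e.g. inside the index).

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

enum FilterMode { ANY, ALL, NONE, ONLY }

// Hypothetical helper, not part of the SMILA API.
final class EnumFilterSketch {
    private EnumFilterSketch() { }

    // Returns true if the attribute values of one result object pass the filter.
    static boolean matches(FilterMode mode, List<String> filterValues, List<String> attributeValues) {
        Set<String> filter = new HashSet<>(filterValues);
        Set<String> actual = new HashSet<>(attributeValues);
        switch (mode) {
            case ALL:  // all filter values must be in the result attribute
                return actual.containsAll(filter);
            case ANY:  // at least one filter value must be in the result attribute
                return !Collections.disjoint(filter, actual);
            case NONE: // no filter value is in the result attribute
                return Collections.disjoint(filter, actual);
            case ONLY: // the result attribute contains only filter values
                       // (assumption: an empty attribute passes trivially)
                return filter.containsAll(actual);
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }
}
```

With the filter values "pdf" and "html", an attribute holding "pdf" and "txt" would pass in mode ANY, but fail in modes ALL, NONE and ONLY.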


Representation of result elements

  • Effective query: a single record as part of search result. Contains everything the original query object contained, plus additional information computed by services in the workflow.
  • total number of hits: annotation "result" of effective query object, named value "totalNoOfHits". This is currently the only value of this annotation, but other general result properties could be added later.
  • Result objects: a list of records as part of the search result. Can be created by the index service or loaded from record (XML) storage and may be manipulated by subsequent services in the pipeline.
  • relevance ranking: top-level annotation "result" of each result object, named value "relevance" (Double, usually between 0 and 1)
  • Text mining proposals (spellchecker, wildcards, "did you mean"): list of annotations named "terms" on analysed attribute with named values "concept" (recognized term), "token" (original string), "target", "startChar", "endChar", "startWord", "endWord", "pos" (part-of-speech), "method" (Typo, Wildcard, ...), "quality"
  • Facet classification: Annotation "facets" on classified attributes of effective query.
    • named value "name": description of the group (usually the classifying value).
    • named value "count": no of hits in this facet
    • filter: either named value "filter" in query string syntax or sub-annotation "filter" using filter annotation structure (see query elements) - or both?
    • sub-facets: List of sub-annotations with same structure. Name of annotation is name of classified attribute.
  • Highlighting: annotation "highlight" on top-level or on single attribute. Contents:
    • named value "text": either plain text to markup (smartfinder like) or already marked up on server side (orenge:Marker like), depending on configuration
    • sub-annotations "positions": if text is not marked up already: info where to put highlights, so that client can do it on its own. Named values: "startPos", "endPos", "quality", "queryGroup", "type" (see smartfinder highlighting for explanation)
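
If the text is not marked up on the server, the client has to apply the "positions" itself. A minimal sketch of how this could look: HighlightSketch, Position and markUp are hypothetical names, and the sketch assumes that positions are character offsets into the text, that "endPos" is exclusive, and that the positions are sorted and do not overlap. None of these assumptions is fixed by this specification.

```java
import java.util.List;

// Hypothetical client-side helper, not part of the SMILA API.
final class HighlightSketch {
    private HighlightSketch() { }

    // One "positions" sub-annotation, reduced to its start and end offsets.
    static final class Position {
        final int startPos;
        final int endPos; // assumed to be exclusive

        Position(int startPos, int endPos) {
            this.startPos = startPos;
            this.endPos = endPos;
        }
    }

    // Wraps each highlighted range of the text in the given open/close markers.
    static String markUp(String text, List<Position> positions, String open, String close) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Position p : positions) {
            out.append(text, cursor, p.startPos);   // plain text before the hit
            out.append(open).append(text, p.startPos, p.endPos).append(close);
            cursor = p.endPos;
        }
        out.append(text, cursor, text.length());    // plain text after the last hit
        return out.toString();
    }
}
```

For the text "fast search engine" and a single position from 5 to 11, markUp with the markers "<b>" and "</b>" yields "fast <b>search</b> engine".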

High level API

"High-level API" means a set of Java classes that can be used in clients to create the low-level objects. Note that a client can always bypass the high level API in order to express and access things that are not yet supported by the high level API, but are representable in elements of the low level API. Usually, however, this should not be necessary: it should be possible to create queries and read results by using only the high level API.

The query builder API is designed as a "fluent API" that makes it simpler to set all the parts in simple statements: Most methods return a QueryBuilder which is just the same object on which the method was called ("return this;"). See the Examples section below. This means that this API should never be remoted, but used locally on the client side to build a complete request that is sent to a remote search service for evaluation.
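
The "return this;" idea can be illustrated with a stripped-down stand-in for the builder. FluentQuerySketch is a hypothetical class: it stores the parameters in a plain map instead of a SMILA record, and only the named values "query", "resultSize" and "resultOffset" are taken from this specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed QueryBuilder; a plain map replaces
// the query record to keep the sketch self-contained.
final class FluentQuerySketch {
    private final String workflowName;
    private final Map<String, Object> parameters = new LinkedHashMap<>();

    FluentQuerySketch(String workflowName) {
        this.workflowName = workflowName;
    }

    // Each setter stores one value and returns the same builder instance.
    FluentQuerySketch setQuery(String queryString) {
        parameters.put("query", queryString);
        return this;
    }

    FluentQuerySketch setResultSize(int size) {
        parameters.put("resultSize", size);
        return this;
    }

    FluentQuerySketch setResultOffset(int offset) {
        parameters.put("resultOffset", offset);
        return this;
    }

    String getWorkflowName() {
        return workflowName;
    }

    Map<String, Object> getParameters() {
        return parameters;
    }
}
```

Because every setter returns the builder itself, a call chain such as new FluentQuerySketch("SearchPipeline").setQuery("smila").setResultSize(6) modifies one single object, which is why such a builder is only useful locally and not as a remote interface.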

All the following classes or interfaces are not meant to be a specification claiming to be complete. I'm pretty sure that during implementation and initial usage we will find a lot of needs (or at least wishes) for more convenience methods.

enum FilterMode { ANY, ALL, NONE, ONLY }; // use in filter creation
 
enum OrderMode { ASC, DESC }; // use in OrderBy spec creation
 
class QueryBuilder {
    QueryBuilder(String workflowName); // init request for given pipeline
    QueryBuilder(String workflowName, RecordFactory factory); // init request for given pipeline, use non-default RecordFactory.
    QueryBuilder setQuery(String queryString);
    QueryBuilder setResultSize(int size);
    QueryBuilder setResultOffset(int offset);
    QueryBuilder setThreshold(double threshold);
    QueryBuilder setLanguage(String language);
    QueryBuilder addLiteral(String attribute, Object value) throws InvalidTypeException;
    QueryBuilder addLiteral(String attribute, Literal literal);
    QueryBuilder addEnumFilter(String attribute, FilterMode mode, Iterable<Object> filterValues) throws InvalidTypeException; 
    QueryBuilder addRangeFilter(String attribute, FilterMode mode, Object lowerBound, Object upperBound) throws InvalidTypeException;
    QueryBuilder addFacetFilter(Facets facetList, int index); // copy filter from a facet to the attribute of this facet.
    QueryBuilder addOrderBy(String attribute, OrderMode mode);
    QueryBuilder addParameter(String valueName, String value); // sets a named value in a general param annotation
    QueryBuilder addParameter(String name, String valueName, String value); // sets a named value in a general param sub-annotation
    QueryBuilder addParameter(String name, String value); // adds an anonymous value to a general param sub-annotation
    ResultAccessor executeRequest(SearchService searchService) throws ProcessingException; 
        // execute query on given search service and wrap result in high level result helper
    Record getRecord(); // access underlying query record for advanced manipulation
    String getWorkflowName(); // for completeness ...
}

Note that the SearchService for the executeRequest does not need to be the WorkflowProcessor itself, but can also be a remote proxy to a real service. This way it's transparent if the client uses a local SearchService or a remote one and which protocols are used to talk to the remote service.

And then we need some classes to access the search result:

class ResultAccessor {
    ResultAccessor(SearchResult result);
    SearchResult getResult(); // access original result
 
    String getWorkflowName(); // for convenience
    QueryRecordAccessor getQuery();
    int resultLength();
    ResultRecordAccessor getResult(int index);
 
    // create new QueryBuilder for same pipeline from effective query object of this result:
    QueryBuilder newQueryBuilder(); // use complete query object
    QueryBuilder newQueryBuilder(String recordFilterName); 
      // keep only parts of query object as described by record filter.
      // parameters are always copied.
}
class RecordAccessor {
    RecordAccessor(Record record); 
    // methods to read literals and annotations, similar to blackboard.
    // this is just "to get the idea", it's not meant as a complete description.
    // if more convenience methods are needed, they should be added on demand.
 
    // access literals
    boolean hasLiterals(String attributeName);
    int literalSize(String attributeName);
    Literal getLiteral(String attributeName);
    List<Literal> getLiterals(String attributeName);
 
    // access top-level annotations
    boolean hasAnnotations(String annotationName);
    int annotationSize(String annotationName);
    Annotation getAnnotation(String annotationName);
    List<Annotation> getAnnotations(String annotationName);
 
    // access attribute annotations
    boolean hasAnnotations(String attributeName,String annotationName);
    int annotationSize(String attributeName,String annotationName);
    Annotation getAnnotation(String attributeName, String annotationName); 
    List<Annotation> getAnnotations(String attributeName, String annotationName); 
 
    Record getRecord(); // get underlying record.
}
class QueryRecordAccessor extends RecordAccessor{
    QueryRecordAccessor(Record record);
 
    // methods to read query parameters and annotations
    int getResultSize();
    int getResultOffset();
    double getThreshold();
    String getLanguage();
    int getTotalNoOfHits();
 
    String getParameter(String valueName); // gets a named value from general param annotation
    String getParameter(String name, String valueName); // gets a named value from a general param sub-annotation
    List<String> getParameter(String name); // gets anonymous values from a general param sub-annotation
 
    Terms getTerms(String attributeName);
    Facets getFacets(String attributeName);
}
class ResultRecordAccessor extends RecordAccessor{
    ResultRecordAccessor(Record record);
 
    // methods to read result annotations
    double getRelevance(); // "relevance" is a Double, usually between 0 and 1
    HighlightInfo getHighlightInfo(); // get top-level highlight.
    HighlightInfo getHighlightInfo(String attributeName); // get attribute highlight
}
class Terms extends AnnotationListWrapper {
    String getAttributeName();
    int length();
    String getConcept(int index);
    String getToken(int index);
    String getTargetAttributeName(int index);
    int getStartCharPos(int index);
    int getEndCharPos(int index);
    int getStartWordPos(int index);
    int getEndWordPos(int index);
    String getPartOfSpeech(int index);
    String getMethod(int index);
    double getQuality(int index);
 
    // for extensibility, if a special service creates more data.
    String getProperty(int index, String name); // access named value of n'th annotation in list
 
    List<Annotation> getSource();
}
class Facets {
    String getAttributeName();
    int length();
    String getName(int index);
    String getStringFilter(int index);
    Annotation getObjectFilter(int index);
    int getCount(int index);
    Facets getSubFacets(int index);
 
    // for extensibility, if a special service creates more data.
    String getProperty(int index, String name); // access named value of n'th annotation in list
 
    List<Annotation> getSource(); // access to original objects
}
class HighlightInfo {
    String getAttributeName();
    String getText();
 
    // for extensibility, if a special service creates more data.
    String getProperty(String name); // access named value of highlight annotation
 
    boolean isHighlighted(); // iff (length() == 0), i.e. no positions are given
    int length();
    int getStartPos(int index);
    int getEndPos(int index);
    double getQuality(int index);
    int getQueryGroup(int index);
    int getType(int index);
 
    // for extensibility, if a special service creates more data.
    String getProperty(int index, String name); // access named value of n'th sub-annotation
 
    Annotation getSource(); // access to original object
}


Examples

A sample query session:

Start with only a fulltext query:

ResultAccessor result = new QueryBuilder("SearchPipeline")
    .setQuery(queryString)
    .setResultSize(6)
    .setResultOffset(0) // optional, of course
    .setThreshold(0.2)
    .setLanguage(userLanguage)
    .executeRequest(searchService);
// display result.

Create a follow-up query with filtering by a user-selected facet:

result = result.newQueryBuilder("queryTextFilter")
    .addFacetFilter(selectFacet, selectedIndex)
    .executeRequest(searchService);
// display result.

Or do some paging in the result list:

 
result = result.newQueryBuilder("queryTextFilter")
    .setResultOffset(page * 6) // 6 = result size used in the first query
    .executeRequest(searchService);
// display result.

Or re-order the result list:

 
result = result.newQueryBuilder("queryTextFilter")
    .addOrderBy("PRICE", OrderMode.ASC)
    .executeRequest(searchService);
// display result.

Additional Notes

  • I propose to extend the SMILA Literal interface by the possibility to add a language specific label to the literal value. This way the SearchService could add labels in the search language to literals contained in the result which the client could use to nicely display concept values in the results.
  • This proposal will have an effect on the ProcessingService/Pipelet interface, because at the moment we do not have the possibility to give an extra query object to the pipelet. See SMILA/Specifications/Search Processing for details, ideas and discussions.
  • The proposed helper interfaces currently support only literal attributes, i.e. no MObjects as attribute values. This is based on the assumption that this will be sufficient for nearly all search applications. Complex objects can still be created using the Record API directly, because all helper classes allow direct access to the underlying objects. An extension of the helper methods to allow the use of attribute paths instead of just names would also be possible, if necessary.
  • An extension of the high level API supporting representation of information about the searching user according to SMILA/Specifications/Smila Security Concept will be added when the security specification is finished.

Copyright © Eclipse Foundation, Inc. All Rights Reserved.