
Difference between revisions of "SMILA/Documentation/Solr"

(Setup another core)
 
(26 intermediate revisions by 4 users not shown)
Line 1: Line 1:
Solr is an open source search server based on the Lucene search engine. In addition to a powerful full-text-search, sorting and filtering, Solr comes with a lot of build in features like highlighting, facets, auto-suggest and spell checking.
+
Solr is an open source search server based on the Lucene search engine. In addition to a powerful full-text-search, sorting and filtering, Solr comes with a lot of built-in features like highlighting, facets, auto-suggest and spell checking.
 +
 
 +
{{Note|The current implementation is a Work In Progress. The goal is to use this implementation as the default search implementation for SMILA replacing the current embedded Lucene integration. As a consequence things are likely to change in future versions. So stay tuned.}}
  
 
= SolrServerManager & SolrProperties =
 
= SolrServerManager & SolrProperties =
Line 15: Line 17:
 
</source>
 
</source>
  
= Default configuration =
+
= Configuration =
SMILA support Solr only in multicore configuration, regardless if Solr server runs embedded or remote. If SMILA starts up for the first time and Solr is configured embedded, SolrHelper copy the default configuration to Solr workspace (solr.home). The default configuration include <tt>configuration/org.eclipse.smila.solr/solr.xml</tt> which is required for multicore support. It defines all cores (like an index in Lucene universe) and map them to whose their configurations. Besides there is a folder <tt>configuration/org.eclipse.smila.solr/DefaultCore</tt> which holds a default solr core configuration which support all features implemented in SMILA.
+
 
More information about solr.xml: http://wiki.apache.org/solr/CoreAdmin
+
SMILA supports Solr only in a multicore setup ("core" is the Solr term for a search index), regardless of whether Solr runs embedded or remote.
 +
 
 +
== DefaultCore ==
 +
 
 +
The default configuration included in SMILA is defined in <tt>configuration/org.eclipse.smila.solr</tt>. The default mode is {{code|embedded}}, in which case SMILA starts up its own internal Solr server and uses the full Solr multicore configuration present in that configuration folder. This setup defines a single core, DefaultCore, which is suitable for the HowTo cases in SMILA.
 +
 
 +
If SMILA should connect to an already running Solr server instead of starting its own instance, the property {{code|solr.embedded}} must be set to {{code|false}}. In that case the URL of the (external) Solr server has to be provided by setting the property {{code|solr.serverUrl}} in the properties file.
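The interplay of the two properties can be pictured with the following sketch. It is not SMILA's actual implementation: the class and method names are hypothetical, and only the property keys {{code|solr.embedded}} and {{code|solr.serverUrl}} are taken from the text above. It assumes that a missing {{code|solr.embedded}} entry means embedded mode, since embedded is the default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, for illustration only (not part of SMILA).
final class SolrModeSketch {

  /** Returns the server URL to connect to, or null when Solr runs embedded. */
  static String resolveServerUrl(final Properties props) {
    // assumption: embedded is the default, so a missing entry counts as "true"
    final boolean embedded = Boolean.parseBoolean(props.getProperty("solr.embedded", "true"));
    if (embedded) {
      return null;
    }
    final String url = props.getProperty("solr.serverUrl");
    if (url == null || url.trim().isEmpty()) {
      throw new IllegalStateException("solr.serverUrl must be set when solr.embedded is false");
    }
    return url.trim();
  }

  /** Parses properties-file syntax from a string. */
  static Properties parse(final String text) {
    final Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (final IOException e) {
      throw new IllegalStateException(e);
    }
    return props;
  }

  public static void main(final String[] args) {
    final Properties remote =
      parse("solr.embedded=false\nsolr.serverUrl=http://localhost:8983/solr\n");
    System.out.println(resolveServerUrl(remote));
    System.out.println(resolveServerUrl(new Properties()));
  }
}
```

Failing fast on a missing URL is a design choice of this sketch; how SMILA itself reacts to an incomplete configuration is not stated here.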
 +
 
 +
Please note that you have to add the <tt>PingRequestHandler</tt> to each core's <tt>solrconfig.xml</tt> file, see [[#solrconfig.xml|section solrconfig.xml]].
 +
 
 +
More information about solr cores and their configuration can be found at: http://wiki.apache.org/solr/CoreAdmin
 +
 
 +
If SMILA starts up for the first time and Solr is configured as embedded, the configuration is copied to the Solr workspace (solr.home).
  
== schema.xml ==
+
=== schema.xml ===
 
One of the most important configuration files is <tt>configuration/org.eclipse.smila.solr/DefaultCore/conf/schema.xml</tt>. This file defines index fields and types. SMILA comes with the following set of predefined fields:
 
One of the most important configuration files is <tt>configuration/org.eclipse.smila.solr/DefaultCore/conf/schema.xml</tt>. This file defines index fields and types. SMILA comes with the following set of predefined fields:
  
Line 35: Line 49:
 
</source>
 
</source>
  
Also in schema.xml the '''uniqueKey''' property could be set. If it is set Solr know on his own when to add or update a document in index. Because in SMILA there is always an Id the Id field should be declared here.
+
The schema.xml also contains the '''uniqueKey''' property, which tells Solr which field identifies a document so that it can transparently decide whether to add or update a document. By default it is set to '''Id'''.
  
All other configuration possibilities like field types, default search field copy field and many more you can look up here: http://wiki.apache.org/solr/SchemaXml
+
Information about other configuration possibilities like field types, default search field, copy fields and many more can be found here: http://wiki.apache.org/solr/SchemaXml
  
== solrconfig.xml ==
+
=== solrconfig.xml ===
An other major configuration file is <tt>configuration/org.eclipse.smila.solr/DefaultCore/conf/solfconfig.xml</tt>. This is the configuration for all SearchComponents, RequestHandlers and the general indexing and query configuration.
+
Another major configuration file is <tt>configuration/org.eclipse.smila.solr/DefaultCore/conf/solrconfig.xml</tt>. This is the configuration for all SearchComponents, RequestHandlers and the general indexing and query configuration.
  
Very important is the '''dataDir''' property. By default the index data goes to <tt>solr.home/DefaultCore/data/</tt>. So in embedded mode the SMILA workspace could be grow very fast. With the '''dataDir''' an alternate directory can be specified.
+
Please refer to its documentation here: http://wiki.apache.org/solr/SolrConfigXml
  
Furhermore SMILA uses '''autoCommit''' via solr.DirectUpdateHandler2. It tells Solr to commit automatically all 60 seconds or after 1000 documents were added. If this property is not set no commit will occur and the indexed data is not persistent or search-able.
+
Important for SMILA is that in the embedded case the '''dataDir''' property defaults to the data/ subfolder of the core instance (i.e. <tt>solr.home/DefaultCore/data/</tt>). Hence, in embedded mode the SMILA workspace may grow quite large. Set this property in this file, or set it in solr.xml on the core, to provide an alternative location.
  
Complete solrconfig.xml documentation: http://wiki.apache.org/solr/SolrConfigXml
+
SMILA uses '''autoCommit''' via solr.DirectUpdateHandler2. It tells Solr to commit automatically every 60 seconds or after 1000 documents have been added. If this property is not set, no commit will occur and the indexed data will be neither persistent nor searchable unless you send the appropriate Solr commands yourself. The values are a compromise between these factors:
 +
* how soon shall/must a user that searches see the updates?
 +
* how many update requests are sent to Solr?
 +
 
 +
Note that during a commit the Solr server stalls updates, which might lead to index pipelet timeouts.
 +
 
 +
Note that when using an external Solr server, you have to add the PingRequestHandler, since this handler is required by the SolrAdminHttpHandler to check whether the cores exist and are alive before addressing them. You have to add the handler to each core's configuration file:
 +
 
 +
<source lang="xml">
 +
<requestHandler name="/admin/ping" class="PingRequestHandler">
 +
  <lst name="defaults">
 +
    <str name="qt">standard</str>
 +
    <str name="q">solrpingquery</str>
 +
    <str name="echoParams">all</str>
 +
  </lst>
 +
</requestHandler>
 +
</source>
 +
 
 +
== Setup another core ==
 +
 
 +
If you don't want to use the default Solr index (<tt>DefaultCore</tt>), you can easily set up your own core. Just copy the <tt>DefaultCore</tt> configuration folder (see <tt>SMILA/configuration/org.eclipse.smila.solr</tt>) under another name, e.g. <tt>MyCore</tt>, in the same directory and adapt the configuration files described above to your needs.
 +
 
 +
Afterwards add your new core to the file <tt>SMILA.application/configuration/org.eclipse.smila.solr/solr.xml</tt>:
 +
 
 +
<source lang="xml">
 +
<?xml version='1.0' encoding='UTF-8'?>
 +
<solr persistent="true">
 +
  <cores adminPath="/admin/cores">
 +
    <core name="DefaultCore" instanceDir="DefaultCore"/>
 +
    <core name="MyCore" instanceDir="MyCore"/>
 +
  </cores>
 +
</solr>
 +
</source>
  
 
= How to use Solr with SMILA =
 
= How to use Solr with SMILA =
Line 105: Line 151:
 
== Search ==
 
== Search ==
  
The SolrSearchPipelet offers the possibility to search indexed data on a Solr server. The pipelet need only a small configuration without any special parameters.
+
The SMILA standard search servlet already uses solr to search via the SolrSearchPipelet since SMILA version 1.0. Up to version 0.9 the SMILA standard search servlet used plain lucene search.
 +
 
 +
== Search Pipelet Config ==
 +
The SolrSearchPipelet offers the possibility to search a Solr index. The pipelet needs only a small configuration without any special parameters.
  
 
<source lang="xml">
 
<source lang="xml">
Line 118: Line 167:
 
</source>
 
</source>
  
For full feature support a special kind of SearchRecord is required. The easiest way to create such a record is to use the '''SolrQueryBuilder'''. This class extend native '''QueryBuilder''' with methods to configure a Solr request and special Solr feature like highlighting or facets. To configure additional Solr features there exist adapter classes which give an overview of possible parameters. All Solr specific parameters are stored in a map within SearchRecord named '''_solr.query'''.  
+
== Solr Specific Search Record ==
 +
 
 +
For full feature support an enhanced search record is required.  
 +
This section provides both XML samples showing how the features are configured in the search record and descriptions of the helper classes that are available from within SMILA. Path notations for elements in the record use the key names of the respective elements as path elements and always start from the root, e.g. <tt>_solr.query/highlighting</tt>.
 +
 
 +
To understand the following sections you must know the standard [[SMILA/Documentation/Search#Query_Parameters|SMILA search record]].
 +
 
 +
=== Standard Parameters ===
 +
 
 +
The following SMILA standard query parameters are supported:
 +
 
 +
* maxcount
 +
* offset
 +
* indexname, this must correspond to an existing solr core name
 +
* resultAttributes
 +
* query
 +
 
 +
The Solr pipelet supports only a single <tt>query</tt> element with a string value, which it passes unaltered to Solr. The Solr default handler assumes this to be a valid Lucene query string, but ultimately this depends on the configured handler. All escaping needs to be done by whoever constructs the search record (note: there is no need to URL-encode it, as this is done internally).
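Because escaping is left to whoever builds the record, a small helper for field values can be useful. The following is a minimal sketch that assumes the classic Lucene query syntax and its usual set of special characters; the class name is hypothetical and not part of SMILA. SolrJ ships a comparable helper, <tt>ClientUtils.escapeQueryChars</tt>, which may be preferable where SolrJ is on the classpath.

```java
// Hypothetical helper, for illustration only (not part of SMILA).
final class LuceneEscapeSketch {

  // characters with special meaning in the classic Lucene query syntax (assumed set)
  private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

  /** Prefixes every special character and every whitespace character with a backslash. */
  static String escape(final String value) {
    final StringBuilder sb = new StringBuilder(value.length() + 8);
    for (int i = 0; i < value.length(); i++) {
      final char c = value.charAt(i);
      if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(final String[] args) {
    // only the value is escaped; the "Id:" field prefix is query syntax and stays as is
    System.out.println("Id:" + escape("file:Euklid.html"));
  }
}
```

Only the field value is escaped here. Escaping the whole query string would also escape the colon that separates field name and value, and thereby change the meaning of the query.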
 +
 
 +
<source lang='xml'>
 +
<Record xmlns="http://www.eclipse.org/SMILA/record" version="2.0">
 +
  <!-- query (q) -->
 +
  <Val key="query">Content:solr Content:eclipse</Val>
 +
  <Val key="maxcount" type="long">3</Val>
 +
  <Val key="offset" type="long">3</Val> 
 +
  <Val key="indexname">wikipedia</Val>
 +
  <Seq key="resultAttributes">
 +
    <Val>Content</Val>
 +
    <Val>Id</Val>
 +
  </Seq>
 +
    ...
 +
</Record>
 +
 
 +
</source>
 +
 
 +
The above sample shows a query on the index field <tt>Content</tt> for the terms "solr" and "eclipse".
 +
 
 +
=== Highlighting ===
 +
 
 +
Highlighting for Solr deviates from the standard SMILA way in order to support Solr features. The configuration is contained in <tt>_solr.query/highlighting</tt>:
 +
<source lang='xml'>
 +
<Map key="_solr.query">
 +
  ...
 +
  <Seq key="highlighting">
 +
    <Map>
 +
      <Val key="attribute">global.solr.params</Val>
 +
      <Val key="hl" type="boolean">true</Val>
 +
      <!-- list of fields to be highlighted, space delimited -->
 +
      <Val key="hl.fl">Content  Title</Val>
 +
      <Val key="hl.simple.pre">&lt;b&gt;</Val>
 +
      <Val key="hl.simple.post">&lt;/b&gt;</Val>
 +
    </Map>
 +
    <!-- other maps with attribute = field name for per-field configuration -->
 +
  </Seq>
 +
  ...
 +
</Map>
 +
 
 +
</source>
 +
 
 +
The configuration can be done globally (applies to all highlighted fields) as well as per field. Each configuration is contained in a map that must have an <tt>attribute</tt> entry, which contains either the value <tt>global.solr.params</tt>, signifying the global highlight settings, or the name of the attribute/field to be configured for highlighting.
 +
The other entries in this map correspond in name and values to the ones solr supports.
 +
See http://wiki.apache.org/solr/HighlightingParameters.
 +
 
 +
In order to turn on highlighting, at least the global config must be present with the entry <tt>hl=true</tt>.
 +
 
 +
Programmatic highlighting configuration is done through <tt>HighlightingQueryConfigAdapter</tt>. The default constructor creates a configuration object with global highlighting parameters, which is required to enable highlighting. The other constructor provides an optional per-field configuration.
 +
 
 +
<source lang="java">
 +
  // create global highlighting configuration (required, enables highlighting)
 +
    final HighlightingQueryConfigAdapter highlighting = new HighlightingQueryConfigAdapter();
 +
    highlighting.setHighlightingFields("Content Title");
 +
    highlighting.setHighlightingSimplePre("<b>");
 +
    highlighting.setHighlightingSimplePost("</b>");
 +
    builder.addHighlightingConfiguration(highlighting);
 +
</source>
 +
 
 +
Unlike in standard SMILA, the <tt>_highlight</tt> annotation is not created per result item; instead the highlighted text replaces the normally returned field value. That is, when the <tt>Content</tt> field is to be returned by your search and you have also configured highlighting on it, the search returns only the highlighted value for the <tt>Content</tt> field.
 +
 
 +
=== Facets ===
 +
 
 +
Facets are specified for solr through the <tt>/facetby</tt> Seq as defined in the standard. However, the following differences exist:
 +
* maxcount is optional
 +
* Solr doesn't support ordering of facets, so if ordering is set, a warning is written to the log and the setting is otherwise ignored.
 +
 
 +
Faceting is turned on as soon as the <tt>facetby</tt> Seq is present.
 +
 
 +
Note that the attribute value must be the Solr field name, as the mapping from the SolrSearchPipelet is not applied.
 +
 
 +
The values in the <tt>nativeParameters</tt> Map are passed verbatim to Solr for the field, following the pattern <tt>f.${attribute}.${key}=${value}</tt>. This allows you to specify any valid Solr parameter/value pair at field level without any interference on our part.
 +
Global facet parameters may be defined in the <tt>_solr.query</tt> map.
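The per-field expansion pattern <tt>f.${attribute}.${key}=${value}</tt> can be pictured with the following sketch. It only illustrates the naming scheme and is not the pipelet's actual code; the class and method names are hypothetical, and URL encoding is deliberately left out because the pipelet is said to handle that internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of SMILA).
final class NativeFacetParamsSketch {

  /** Expands each native parameter to the form f.<attribute>.<key>=<value>. */
  static List<String> expand(final String attribute, final Map<String, String> nativeParameters) {
    final List<String> params = new ArrayList<>();
    for (final Map.Entry<String, String> entry : nativeParameters.entrySet()) {
      params.add("f." + attribute + "." + entry.getKey() + "=" + entry.getValue());
    }
    return params;
  }

  public static void main(final String[] args) {
    // LinkedHashMap keeps the insertion order of the parameters
    final Map<String, String> nativeParameters = new LinkedHashMap<>();
    nativeParameters.put("facet.date.start", "NOW/DAY-5DAYS");
    nativeParameters.put("facet.date.gap", "+1DAY");
    nativeParameters.put("facet.date.end", "NOW/DAY+1DAY");
    for (final String param : expand("LastModifiedDate", nativeParameters)) {
      System.out.println(param);
    }
  }
}
```

For the attribute <tt>LastModifiedDate</tt>, the first line printed is <tt>f.LastModifiedDate.facet.date.start=NOW/DAY-5DAYS</tt>.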
 +
 
 +
Solr supports different kinds of faceting, which can be selected with the <tt>type</tt> parameter. Its value is Solr's respective parameter name and is passed on as given. No checks are performed here, so that future faceting methods work out of the box. It defaults to <tt>facet.field</tt> if missing. Solr's <tt>facet.query</tt> is currently not supported through this structure, as it needs to be formulated quite differently; it must therefore be given as global parameters in the <tt>_solr.query</tt> map. Nonetheless, the facets are returned the normal way.
 +
 
 +
<source lang='xml'>
 +
<Seq key="facetby">
 +
  <!-- per-field configuration for facet.field -->
 +
  <Map>
 +
    <Val key="type">facet.field</Val>
 +
    <Val key="maxcount" type="long">10</Val>
 +
    <Val key="attribute">Extension</Val>
 +
  </Map>
 +
  <!-- per-field configuration for facet.date -->
 +
  <Map>
 +
    <Val key="type">facet.date</Val>
 +
    <Val key="attribute">LastModifiedDate</Val>
 +
    <Map key="nativeParameters">
 +
      <Val key="facet.date.start">NOW/DAY-5DAYS</Val>
 +
      <Val key="facet.date.gap">+1DAY</Val>
 +
      <Val key="facet.date.end">NOW/DAY+1DAY</Val>
 +
    </Map>
 +
  </Map>
 +
</Seq>
 +
</source>
 +
 
 +
Facets are returned the SMILA standard way in the <tt>facets</tt> map.
 +
 
 +
=== Solr Specific Parameters (_solr.query) ===
 +
 
 +
Some configuration deviations from the SMILA standard and other solr specialties are put into a Solr specific <tt>_solr.query</tt> Map element at top level of the search record.
 +
 
 +
The following are supported:
 +
* filters
 +
* shards
 +
* request handler
 +
 
 +
==== Filters ====
 +
Solr filters may only be specified directly, i.e. as native query strings via the <tt>fq</tt> element.  Multiple ones will be automatically ANDed.
 +
Note that the QueryBuilder's methods to add filters and the <tt>/filter</tt> Seq are not supported (yet).
 +
 
 +
==== Shards ====
 +
Shards are only supported in remote mode and may be defined through the <tt>_solr.query/shards</tt> Seq.
 +
 
 +
==== Solr Request Handler ====
 +
To select another solr request handler add the <tt>_solr.query/qt</tt> entry.
 +
 
 +
The following XML snippet shall illustrate these cases:
 +
 +
<source lang="xml">
 +
<Record xmlns="http://www.eclipse.org/SMILA/record" version="2.0">
 +
  ...
 +
  <Map key="_solr.query">
 +
    <!-- filter query (fq) -->
 +
    <Seq key="fq">
 +
      <Val>Size:[500 TO 1000]</Val>
 +
      <Val>Author:"H. Simpson"</Val>
 +
    </Seq>
 +
 
 +
    <!-- shards -->
 +
    <Seq key="shards">
 +
      <Val>http://localhost:8983/solr</Val>
 +
      <Val>http://remote-server:8983/solr</Val>
 +
    </Seq>
 +
 
 +
    <!-- request handler (qt) -->
 +
    <Val key="qt">/custom</Val>
 +
  </Map>
 +
</Record>
 +
</source> 
 +
 
 +
==== SolrQueryBuilder  ====
  
 +
Instead of assembling the XML/Record yourself you can use the <tt>SolrQueryBuilder</tt> from within SMILA. This class extends the native <tt>QueryBuilder</tt> with methods to configure a Solr request and special Solr features like highlighting or facets. Adapter classes exist for configuring additional Solr features; they give an overview of the possible parameters.
 +
 
<source lang="java">
 
<source lang="java">
// create Solr specific query builder
+
  // create Solr specific query builder
 
     final SolrQueryBuilder builder = new SolrQueryBuilder();
 
     final SolrQueryBuilder builder = new SolrQueryBuilder();
  
Line 148: Line 358:
 
</source>
 
</source>
  
More information about common query parameters: http://wiki.apache.org/solr/CommonQueryParameters
+
=== Auxiliary Search Functions ===
  
=== Highlighting ===
+
==== Auto-suggest/Terms ====
  
To configure highlighting the <tt>HihglightingQueryConfigAdapter</tt> is used. The default constructor create a configuration object with global highlighting parameters which is required to enable highlighting. The other constructor provides an optional per-field configuration.
+
Auto suggest/completion is also done via a search request, albeit a very special, stripped down version, which looks like so in the default setup:
  
<source lang="java">
+
<source lang="xml">
// create global highlighting configuration (required, enables highlighting)
+
<Record >
    final HighlightingQueryConfigAdapter highlighting = new HighlightingQueryConfigAdapter();
+
  <Map key="_solr.query">
     highlighting.setHighlightingFields("Content");
+
     <Map key="terms">
    highlighting.setHighlightingSimplePre("<b>");
+
      <Val key="terms" type="boolean">true</Val>
    highlighting.setHighlightingSimplePost("</b>");
+
      <Val key="terms.fl">Content</Val>
     builder.addHighlightingConfiguration(highlighting);
+
      <Val key="terms.prefix">con</Val>
 +
     </Map>
 +
    <Val key="qt">/terms</Val>
 +
  </Map>
 +
</Record>
 
</source>
 
</source>
  
More information about all highlighting parameters: http://wiki.apache.org/solr/HighlightingParameters
+
The only items that have to be present are the <tt>terms</tt> map and the <tt>qt</tt> entry, which needs to be set to an appropriate handler (by default this is /terms). The entries in the <tt>terms</tt> map are passed as is to Solr.
 +
For more information about terms configuration and parameters see http://wiki.apache.org/solr/TermsComponent.
  
=== Facets ===
+
The results are returned in the <tt>_solr.result/terms</tt> map, where each key is a completed word and its value tells you how many documents in the index contain this word.
  
The <tt>FacetQueryConfigAdapter</tt> provide only one constructor but takes a FacetType parameter instead. At least there is one configuration of <tt>FacetType.GLOBAL</tt> required (global configuration) to enabled facets. The other types are <tt>FacetType.DATE</tt>, <tt>FacetType.FIELD</tt> and <tt>FacetType.QUERY</tt> which takes an Array of Strings.
+
<source lang="xml">
 +
<Record >
 +
  <Seq key="records"></Seq>
 +
  <Val key="runtime" type="long">3</Val>
 +
  <Map key="_solr.result">
 +
    <Map key="terms">
 +
      <Val key="congratulations" type="long">1</Val>
 +
      <Val key="conjugate" type="long">1</Val>
 +
      <Val key="containing" type="long">1</Val>
 +
    </Map>
 +
  </Map>
 +
</Record>
 +
</source> 
 +
 
 +
In SMILA code this can be done like so:
  
 
<source lang="java">
 
<source lang="java">
    // create global facet configuration (required, enables facets)
+
     final TermsQueryConfigAdapter terms = new TermsQueryConfigAdapter(_solrField);
     final FacetQueryConfigAdapter facet_global = new FacetQueryConfigAdapter(FacetType.GLOBAL);
+
    terms.setTermsPrefix("con");
     builder.addFacetConfiguration(SolrConstants.GLOBAL, facet_global);
+
     _queryBuilder.setTermsConfiguration(terms);
 +
    _queryBuilder.setRequestHandler("/terms");
 +
</source>
  
    // create field facet configuration
+
==== Spellcheck (Did you mean) ====
    final FacetQueryConfigAdapter facet_field = new FacetQueryConfigAdapter(FacetType.FIELD);
+
    facet_field.setFacetLimit(10);
+
    builder.addFacetConfiguration("Extension", facet_field);
+
  
    // create facet date configuration
+
SMILA's default setup has spell checking (Did you mean) enabled for the <tt>Content</tt> field. In most cases it's useful to configure the default request handler to use the SpellCheckComponent (solrconfig.xml), and this has been done. Otherwise the correct request handler must be set (solrconfig.xml example: /spell). By default the SpellCheckComponent uses a separate index which is created on the fly and updated on every commit. Therefore, to retrieve alternative suggestions for possibly misspelled input words, you just need to add the <tt>spellcheck</tt> map to <tt>_solr.query</tt>:
    final FacetQueryConfigAdapter facet_date = new FacetQueryConfigAdapter(FacetType.DATE);
+
    facet_date.setFacetDateStart("NOW/DAY-5DAYS");
+
    facet_date.setFacetDateGap("+1DAY");
+
    facet_date.setFacetDateEnd("NOW/DAY+1DAY");
+
    builder.addFacetConfiguration("LastModifiedDate", facet_date);
+
  
    // create facet query configuration (range example)
+
<source lang="xml">
     final String[] fq = { "* TO 1000", "1000 TO 5000", "5000 * TO" };
+
  <Map key="_solr.query">
    final FacetQueryConfigAdapter facet_query = new FacetQueryConfigAdapter(FacetType.QUERY, fq);
+
    ....
    builder.addFacetConfiguration("Size", facet_query);
+
     <Map key="spellcheck">
 +
      <Val key="spellcheck" type="boolean">true</Val>
 +
      <Val key="spellcheck.count" type="long">5</Val>
 +
      <Val key="spellcheck.extendedResults" type="boolean">true</Val>
 +
      <Val key="spellcheck.collate" type="boolean">true</Val>
 +
    </Map>
 +
  </Map>
 
</source>
 
</source>
  
More on how different facet types and their parameter work: http://wiki.apache.org/solr/SimpleFacetParameters
+
The map contains solr parameters (see http://wiki.apache.org/solr/SpellCheckComponent) that are passed "as is" to solr.
  
=== Terms (Auto-suggest) ===
+
This will add the <tt>spellcheck</tt> map to <tt>_solr.result</tt>:
  
The <tt>TermsQueryConfigAdapeter</tt> comes with a default constructor which enables terms. To use the TermsComponent the RequestHandler must set to the matching configuration (solrconfig.xml, default: /terms).
+
<source lang="xml">
 
+
  <Map key="_solr.result">
<source lang="java">
+
    ...
// create terms configuration (auto suggest example)
+
    <Map key="spellcheck">
    final TermsQueryConfigAdapter terms = new TermsQueryConfigAdapter("Title");
+
      <Map key="rust">
    terms.setTermsLower("auto");
+
        <Val key="just" type="long">1</Val>
    terms.setTermsPrefix("sug");
+
        <Val key="bust" type="long">1</Val>
     builder.setTermsConfiguration(terms);
+
      </Map>
 +
      <Val key="collation">Content:just</Val>
 +
    </Map>
 +
     ...
 +
  </Map>
 
</source>
 
</source>
  
More information about terms configuration and parameters: http://wiki.apache.org/solr/TermsComponent
+
For each misspelled word there is a nested map containing the corrections, where the key is the corrected term and the value is the frequency of the term in the index. The value for the frequency must be turned on via spellcheck.extendedResults and defaults to -1 otherwise.
  
=== Spellcheck (Did you mean) ===
+
When <tt>collate</tt> is on then you can also find a full alternative query under the key <tt>collation</tt>.
  
The default constructor of <tt>SpellCheckQueryConfigAdapter</tt> is required to enable SpellCheckComponent. In most cases it's useful to configure the default request handler to use SpellCheckComponent (solrconfig.xml). Otherwise the correct request handler must be set (solrconfig.xml example: /spell). By default SpellCheckComponent use a separate index which is create on the fly (in SMILAS's default configuration on every commit).
+
The above XML snippets have been generated with the following code:
  
 
<source lang="java">
 
<source lang="java">
// create spell check configuration
+
    addSolrDoc("1",
 +
      "This is a simple text without real meaning as i dont want to bust my behind for smth. with more sense.");
 +
    addSolrDoc("2", "It is just used for testing.");
 +
    indexAndCommit();
 +
 
 +
    // setup search
 
     final SpellCheckQueryConfigAdapter spellcheck = new SpellCheckQueryConfigAdapter();
 
     final SpellCheckQueryConfigAdapter spellcheck = new SpellCheckQueryConfigAdapter();
 
     spellcheck.setSpellCheckCount(5);
 
     spellcheck.setSpellCheckCount(5);
 
     spellcheck.setSpellCheckExtendedResults(true);
 
     spellcheck.setSpellCheckExtendedResults(true);
 
     spellcheck.setSpellCheckCollate(true);
 
     spellcheck.setSpellCheckCollate(true);
     builder.setSpellCheckConfiguration(spellcheck);
+
     _queryBuilder.setSpellCheckConfiguration(spellcheck);
 +
    _queryBuilder.setQuery("Content:rust");
 
</source>
 
</source>
  
More information about SpellCheckComponent configuration and parameters: http://wiki.apache.org/solr/SpellCheckComponent
+
==== More Like This / What's related ====
  
=== SolrSearchRecord XML ===
+
Solr offers a feature to return ''related'' documents, which Solr calls ''More Like This'' (MLT). Two modes are supported:
  
With the aid of the SolrQueryBuilder and the adapters above the following Record structure will be created.
+
# return the top N related documents for all items in the SRL, see [http://wiki.apache.org/Solr/MoreLikeThis]
 +
# return the related documents ad hoc for just one document, using a dedicated request handler, see [http://wiki.apache.org/Solr/MoreLikeThisHandler]
 +
The first variant is obviously much more expensive than the second.
  
 +
Both modes are supported by SMILA and configured very similarly. SMILA doesn't do anything special to the arguments you pass in with the record and hands them on to Solr as-is, except that it performs any necessary URL encoding for you. While you may assign specific data types to the parameters, this is not necessary: all values may be given as strings, as this is what is passed on to Solr anyhow.
 +
 +
Which mode is active ultimately depends on your handler configuration in solrconfig.xml. However, we will assume SMILA's default setup here, which binds the MLT handler to <tt>/mlt</tt> and a normal query to <tt>/select</tt>.
 +
 +
Both modes share most of the MLT parameters but also need/support specific ones.
 +
 
 
<source lang="xml">
 
<source lang="xml">
<?xml version="1.1" encoding="utf-8"?>
+
<record>
<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
+
 
<Val key="_recordid">SolrSearchRecordId: 61ff6a7e-d314-4c86-ab05-5f8bdcc50f90
+
  <!-- this is the lucene query expression that is executed in both cases. -->
</Val>
+
  <Val key="query">euklid</Val>
<!-- query (q) -->
+
  ...
<Val key="query">query</Val>
+
  <Map key="_solr.query">
<!-- start -->
+
    <!-- this selects the Solr request handler; set it to /mlt when you want to use the MLT handler -->
<Val key="offset" type="long">10</Val>
+
    <!-- <Val key="qt">/mlt</Val> -->
<!-- rows -->
+
    <!-- determines the list of fields returned for both the normal results as well as the MLT results  -->
<Val key="maxcount" type="long">5</Val>
+
    <Val key="fl" >Id,score,Size</Val>
<!-- result field (fl) -->
+
    ...
<Seq key="resultAttributes">
+
    <Map key="moreLikeThis">
<Val>Path</Val>
+
      <Val key="mlt" >true</Val>
<Val>Size</Val>
+
      <Val key="mlt.fl" >Content</Val>
<Val>Content</Val>
+
      <Val key="mlt.mindf">1</Val>
</Seq>
+
      <Val key="mlt.mintf">1</Val>
<Map key="_solr.query">
+
      ...
<!-- filter query (fq) -->
+
    </Map>
<Seq key="fq">
+
  </Map>
<Val>Size:[500 TO 1000]</Val>
+
</record>
</Seq>
+
</source>
<!-- shards -->
+
 
<Seq key="shards">
+
 
<Val>http://localhost:8983/solr</Val>
+
===== MLT Results w/o Handler =====
<Val>http://remote-server:8983/solr</Val>
+
 
</Seq>
+
In this case Solr adds the <tt>moreLikeThis</tt> section on the same level as the normal <tt>response</tt> section, and you would need to manually look up the MLT docs for each given result item. SMILA, on the other hand, transforms the Solr result by nesting the MLT information inside SMILA's result item, like so:
<!-- request handler (qt) -->
+
 
<Val key="qt">/terms</Val>
+
<source lang="xml">
<!-- highlighting configuration -->
+
<Seq key="records">
<Seq key="highlighting">
+
  <Map>
<!-- a global map to enable highlighting -->
+
    <Val key="_recordid">file:Euklid.html</Val>
<Map>
+
    <Val key="_weight" type="double">0.7635468</Val>
<Val key="attribute">global.solr.params</Val>
+
    <Map key="_mlt.meta">
<Val key="hl" type="boolean">true</Val>
+
      <Val key="start" type="long">0</Val>
<Val key="hl.fl">Content</Val>
+
      <Val key="count" type="long">3</Val>
<Val key="hl.simple.pre">&lt;b&gt;</Val>
+
      <Val key="max_score" type="double">0.8115930557250977</Val>
<Val key="hl.simple.post">&lt;/b&gt;</Val>
+
    </Map>
</Map>
+
    <Seq key='_mlt'>
<!-- other maps with attribute = field name for per-field configuration -->
+
      <Map>
</Seq>
+
        <Val key="_recordid">file:Archytas_von_Tarent_7185.html</Val>
<!-- terms configuration -->
+
        <Val key="_weight" type="double">0.5511907</Val>
<Map key="terms">
+
        <Val key="Size" type="long">47934</Val>
<Val key="terms" type="boolean">true</Val>
+
        ...               
<Val key="terms.fl">Title</Val>
+
      </Map>
<Val key="terms.lower">auto</Val>
+
      <Map>
<Val key="terms.prefix">sug</Val>
+
        <Val key="_recordid">file:Aristoxenos.html</Val>
</Map>
+
        <Val key="_weight" type="double">0.44604447</Val>
<!-- spell check configuration -->
+
        <Val key="Size" type="long">39332</Val>
<Map key="spellcheck">
+
        ...               
<Val key="spellcheck" type="boolean">true</Val>
+
      </Map>
<Val key="spellcheck.count" type="long">5</Val>
+
      ...
<Val key="spellcheck.extendedResults" type="boolean">true</Val>
+
    </Seq>
<Val key="spellcheck.collate" type="boolean">true</Val>
+
    ...
</Map>
+
  </Map>
</Map>
+
  ...
<!-- facet configuration -->
+
</Seq>
<Seq key="groupby">
+
</source>
<!-- a global map to enable facets -->
+
 
<Map>
+
This sample contains the Solr result item with the id <tt>file:Euklid.html</tt>. With MLT turned on, it now contains a nested <tt>_mlt</tt> Seq which holds the N related docs for that result item, each represented by a Map (MLT-Map). (This prevents you from having a Solr doc field of the same name returned in this MLT mode.) The Val elements in each MLT-Map are defined by the list of fields in the <tt>fl</tt> parameter. But how do the <tt>_recordid</tt> and <tt>_weight</tt> Vals get in there if the value is actually <tt>Id,score,Size</tt>? SMILA defines the fields <tt>Id</tt> and <tt>score</tt> and automatically maps them to <tt>_recordid</tt> and <tt>_weight</tt>. Any other field that you include through <tt>fl</tt> is added as a Val element to the MLT result item with the same key as the field name, as is shown for ''Size'' here.
<Val key="facet" type="boolean">true</Val>
+
There is also the <tt>_mlt.meta</tt> Map that contains result info regarding the MLT result, such as number of items, start (offset), and max_score. The keys of these values are the same as for the normal result. 
<Val key="attribute">global.solr.params</Val>
+
 
</Map>
+
===== MLT Results with Handler =====
<!-- per-field configuration for facet.field-->
+
 
<Map>
+
The more common use case of MLT is to actually return the related docs for just one document due to performance considerations.  This is done by making a request against the MLT handler itself.
<Val key="_facet">facet.field</Val>
+
 
<Val key="facet.limit" type="long">10</Val>
+
The document for which you want the related docs is usually known, e.g. from a previous search and your rendered result list contains a link to fetch/show related docs. In this case the query just selects the given document by its Id ( as shown in the example below). But you also may provide any other query here. However, if the query returns >1 docs it will select just one depending on the other MLT parameter and return only the related docs for that document.
<Val key="attribute">Extension</Val>
+
 
</Map>
+
The differences to the query record above are like so:
<!-- per-field configuration for facet.date-->
+
 
<Map>
+
<source lang="xml">
<Val key="_facet">facet.date</Val>
+
<record>
<Val key="facet.date.start">NOW/DAY-5DAYS</Val>
+
 
<Val key="facet.date.gap">+1DAY</Val>
+
  <!-- this is the lucene query to select an document by its Id. Note, the escaping of the ID string! -->
<Val key="facet.date.end">NOW/DAY+1DAY</Val>
+
  <Val key="query">Id:file\:Euklid.html</Val>
<Val key="attribute">LastModifiedDate</Val>
+
  ...
</Map>
+
  <Map key="_solr.query">
<!-- per-field configuration for facet.query-->
+
    <!-- this select the solr MLT request handler. --> 
<Map>
+
    <Val key="qt">/mlt</Val>
<Val key="_facet">facet.query</Val>
+
    ...
<Seq key="_fc">
+
  </Map>
<Val>* TO 1000</Val>
+
<record>
<Val>1000 TO 5000</Val>
+
</source>
<Val>5000 * TO</Val>
+
 
</Seq>
+
The results for such an MLT request are contained in the standard <tt>records</tt> Seq the same way that normal search results are returned, except that they signify MLT docs.
<Val key="attribute">Size</Val>
+
 
</Map>
+
<source lang="xml">
</Seq>
+
<Seq key="records">
</Record>
+
  <Map>
 +
    <Val key="_recordid">file:Archytas_von_Tarent_7185.html</Val>
 +
    <Val key="_weight" type="double">0.5511907</Val>
 +
    <Val key="Size" type="long">47934</Val>
 +
  </Map>
 +
  <Map>
 +
    <Val key="_recordid">file:Aristoxenos.html</Val>
 +
    <Val key="_weight" type="double">0.44604447</Val>
 +
    <Val key="Size" type="long">39332</Val>
 +
  </Map>
 +
  ...
 +
</Seq>
 +
 
 +
</source>
 +
 
 +
{{note| help wanted |due to lack of need, returning the mlt.interestingTerms has not been impl'ed yet}}
 +
 
 +
In case of <tt>mlt.interestingTerms=details</tt> the result record will contain the following additional information:
 +
 
 +
<source lang="xml">
 +
<Map key="_solr.result">
 +
    ...
 +
    <Map key="interestingTerms">
 +
      <Val key="Content:euklid" type="double">1.0</Val>
 +
      <Val key="Content:geometrie" type="double">1.0</Val>
 +
      ...
 +
    </Map>
 +
    ...
 +
  </Seq>
 +
</Map>
 +
</source>
 +
 
 +
or in case of <tt>mlt.interestingTerms=list</tt> just:
 +
 
 +
<source lang="xml">
 +
<Map key="_solr.result">
 +
    ...
 +
    <Seq key="interestingTerms">
 +
     
 +
      <Val>euklid</Val>
 +
      <Val>geometrie</Val>
 +
      ...
 +
    </Map>
 +
    ...
 +
  </Seq>
 +
</Map>
 
</source>
 
</source>

Latest revision as of 08:08, 1 December 2014

Solr is an open source search server based on the Lucene search engine. In addition to a powerful full-text-search, sorting and filtering, Solr comes with a lot of built-in features like highlighting, facets, auto-suggest and spell checking.

Note: The current implementation is a work in progress. The goal is to use this implementation as the default search implementation for SMILA, replacing the current embedded Lucene integration. As a consequence, things are likely to change in future versions. So stay tuned.


SolrServerManager & SolrProperties

Solr can run as a standalone remote server or as an embedded server within SMILA. A properties file controls the running mode: configuration/org.eclipse.smila.solr/solr.properties

##### If true, SMILA loads the default configuration for an embedded Solr instance (see below) #####
solr.embedded=true
 
##### Alternative workspace folder equals solr.home (embedded only) #####
solr.workspaceFolder=./workspace/.metadata/.plugins/org.eclipse.smila.solr
 
##### Server url for http connections to Solr server (remote only) #####
solr.serverUrl=http://localhost:8983/solr

Configuration

SMILA supports Solr only in a multicore setup ("core" is the Solr term for a search index), regardless of whether Solr runs embedded or remote.

DefaultCore

The default configuration included in SMILA is defined in configuration/org.eclipse.smila.solr. The default mode is 'embedded', in which case SMILA starts up its own internal Solr server. In embedded mode, the full Solr multicore configuration present in the configuration folder is used. This setup defines a single core, DefaultCore, which is suitable for the HowTo cases in SMILA.

If SMILA should connect to an already running Solr server instead of starting its own instance, the property solr.embedded must be set to false. In that case the URL of the (external) Solr server has to be provided by setting the property solr.serverUrl in the properties file.

Please note that you have to add the PingRequestHandler to each core's solrconfig.xml file; see the section solrconfig.xml.

More information about solr cores and their configuration can be found at: http://wiki.apache.org/solr/CoreAdmin

When SMILA starts up for the first time with Solr configured as embedded, the configuration is copied to the Solr workspace (solr.home).

schema.xml

One of the most important configuration files is configuration/org.eclipse.smila.solr/DefaultCore/conf/schema.xml. This file defines index fields and types. SMILA comes with the following set of predefined fields:

<field name="Id" type="string_id" indexed="true" stored="true" required="true" />
<field name="LastModifiedDate" type="date" indexed="true" stored="true" />
<field name="Filename" type="text_path" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Path" type="text_path" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Extension" type="textgen" indexed="true" stored="true" />
<field name="Size" type="long" indexed="true" stored="true" />
<field name="MimeType" type="textgen" indexed="true" stored="true" />
<field name="Content" type="textgen" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Title" type="textgen" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true" />

The schema.xml also contains the uniqueKey property, which tells Solr which field identifies a document, so that Solr can transparently treat a request as an add or an update accordingly. By default it is set to Id.
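For orientation, the corresponding declaration in schema.xml is sketched below. The uniqueKey element is standard Solr schema syntax, and the value is the default named above; the exact position within SMILA's shipped schema.xml is not shown here.

```xml
<!-- schema.xml: the field that identifies a document; must match a declared field name -->
<uniqueKey>Id</uniqueKey>
```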

Information about other configuration possibilities like field types, default search field, copy fields and many more can be found here: http://wiki.apache.org/solr/SchemaXml

solrconfig.xml

Another major configuration file is configuration/org.eclipse.smila.solr/DefaultCore/conf/solrconfig.xml. It holds the configuration for all SearchComponents and RequestHandlers as well as the general indexing and query configuration.

Please refer to its documentation here: http://wiki.apache.org/solr/SolrConfigXml

Important for SMILA is that in the embedded case the dataDir property defaults to the data/ sub folder of the core instance (i.e. solr.home/DefaultCore/data/). Hence, in embedded mode the SMILA workspace may grow quite large. To provide an alternative location, set this property in solrconfig.xml or set it on the core in solr.xml.
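The two options for relocating the index data could look roughly like the sketch below. The dataDir element and the dataDir attribute on a core are standard Solr configuration; the path /data/solr/DefaultCore is a made-up example and has to be replaced with a location on your system. Only one of the two variants is needed.

```xml
<!-- variant 1, in DefaultCore/conf/solrconfig.xml (hypothetical path) -->
<dataDir>/data/solr/DefaultCore</dataDir>

<!-- variant 2, in solr.xml, on the core entry (hypothetical path) -->
<core name="DefaultCore" instanceDir="DefaultCore" dataDir="/data/solr/DefaultCore"/>
```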

SMILA uses autoCommit via solr.DirectUpdateHandler2. It tells Solr to commit automatically every 60 seconds or after 1000 documents have been added. If this property is not set, no commit occurs, and the indexed data is neither persisted nor searchable unless you send the appropriate Solr commands yourself. The values are a compromise between these factors:

  • how soon must a searching user see the updates?
  • how many update requests are sent to Solr?

Note that during a commit the Solr server stalls updates, which might lead to index pipelet timeouts.
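The autoCommit settings described above correspond to a solrconfig.xml fragment along the following lines. This is a sketch using standard Solr elements with the values stated in the text (60 seconds, 1000 documents); the file shipped with SMILA may contain further settings.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit after this many added documents -->
    <maxDocs>1000</maxDocs>
    <!-- commit after this many milliseconds (60 seconds) -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```

Lower values make updates visible sooner but cause more frequent commits, during which updates are stalled.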

Note that when using an external Solr server, you have to add the PingRequestHandler, since the SolrAdminHttpHandler requires it to check whether the cores exist and are alive before addressing them. Add the handler to each core's configuration file:

<requestHandler name="/admin/ping" class="PingRequestHandler">
  <lst name="defaults">
    <str name="qt">standard</str>
    <str name="q">solrpingquery</str>
    <str name="echoParams">all</str>
  </lst>
</requestHandler>

Setup another core

If you don't want to use the default Solr index (DefaultCore), you can easily set up your own core. Just copy the DefaultCore configuration folder (see SMILA/configuration/org.eclipse.smila.solr) under another name, e.g. MyCore, in the same directory and adapt the configuration files described before to your needs.

Afterwards add your new core to the file SMILA.application/configuration/org.eclipse.smila.solr/solr.xml:

<?xml version='1.0' encoding='UTF-8'?>
 <solr persistent="true">
  <cores adminPath="/admin/cores">
   <core name="DefaultCore" instanceDir="DefaultCore"/>
   <core name="MyCore" instanceDir="MyCore"/>
  </cores>
 </solr>
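
To actually use the new core, both indexing and search have to refer to it by name. The following fragments are a sketch, assuming the CoreName pipelet parameter and the indexname query parameter described in the sections below; MyCore is the example name used above.

```xml
<!-- in the SolrIndexPipelet configuration of the add and delete pipelines -->
<rec:Val key="CoreName">MyCore</rec:Val>

<!-- in the search record -->
<Val key="indexname">MyCore</Val>
```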

How to use Solr with SMILA

Indexing data

The SolrIndexPipelet can add, update or delete records (equates to Solr documents) in an index.

Configuration in addpipeline:

	<extensionActivity>
      <proc:invokePipelet name="SolrIndexPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.index.SolrIndexPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <!-- either ADD or DELETE. -->
          <rec:Val key="ExecutionMode">ADD</rec:Val>
          <!-- defines the default core into which the record will be written. optional, but if missing then the target core 
            must be set in the record via SolrConstants.DYNAMIC_TARGET_CORE -->
          <rec:Val key="CoreName">DefaultCore</rec:Val>
          <!-- seq of fields that are to be filled. each tuple is a map that defines the target core field, the source field 
            (optional) and the source type (optional ) -->
          <rec:Seq key="CoreFields">
            <rec:Map>
              <!-- target field name in the solr core -->
              <rec:Val key="FieldName">Folder</rec:Val>
              <!-- name of the source attribute or attachment in the record. optional, defaults to the target field name -->
              <rec:Val key="RecSourceName">Path</rec:Val>
              <!-- either ATTRIBUTE or ATTACHMENT. optional, defaults to ATTRIBUTE. -->
              <rec:Val key="RecSourceType">ATTRIBUTE</rec:Val>
            </rec:Map>
            <rec:Map>
              <rec:Val key="FieldName">Filename</rec:Val>
            </rec:Map>
            ...
          </rec:Seq>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

Configuration in deletepipeline:

	<extensionActivity>
      <proc:invokePipelet name="SolrIndexPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.index.SolrIndexPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="ExecutionMode">DELETE</rec:Val>
          <rec:Val key="CoreName">DefaultCore</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

Search

Since SMILA version 1.0, the standard search servlet searches Solr via the SolrSearchPipelet. Up to version 0.9 it used plain Lucene search.

Search Pipelet Config

The SolrSearchPipelet offers the possibility to search a Solr index. The pipelet needs only a small configuration without any special parameters.

    <extensionActivity>
      <proc:invokePipelet name="invokeSolrSearchPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.search.SolrSearchPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

Solr Specific Search Record

For full feature support an enhanced search record is required. This section provides both XML samples showing how the features are configured in the search record and descriptions of the helper classes that are available from within SMILA. Path notations for elements in the record use the key names of the respective elements as path segments and always start from the root, e.g. _solr.query/highlighting.

To understand the following section you must know the standard SMILA search record.

Standard Parameters

The following SMILA standard query parameters are supported:

  • maxcount
  • offset
  • indexname, this must correspond to an existing solr core name
  • resultAttributes
  • query

The Solr pipelet supports only a single query element as a string value, which it passes unaltered to Solr. The Solr default handler assumes this to be a valid Lucene query string, but ultimately this depends on the configured handler. All escaping needs to be done by whoever constructs the search record. (Note: there is no need to URL-encode it, as this is done internally.)

<Record xmlns="http://www.eclipse.org/SMILA/record" version="2.0">
  <!-- query (q) -->
  <Val key="query">Content:solr Content:eclipse</Val>
  <Val key="maxcount" type="long">3</Val>
  <Val key="offset" type="long">3</Val>  
  <Val key="indexname">wikipedia</Val>
  <Seq key="resultAttributes">
    <Val>Content</Val>
    <Val>Id</Val>
  </Seq>
    ...
</Record>

The above sample shows a query on the index field Content for the terms "solr" and "eclipse".
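
Because the query string is passed to Solr unaltered, characters with a special meaning in the Lucene query syntax have to be escaped with a backslash, and characters with a special meaning in XML have to be written as entities. The two alternative query values below are illustrative only; the title phrase is a made-up example, and a record carries just one query element.

```xml
<!-- the colon inside the id is a Lucene special character, hence the backslash -->
<Val key="query">Id:file\:Euklid.html</Val>

<!-- phrase query; the ampersand must be written as an XML entity -->
<Val key="query">Title:"Research &amp; Development"</Val>
```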

Highlighting

Highlighting for Solr deviates from the SMILA standard in order to support Solr's features. The configuration is contained in _solr.query/highlighting.

<Map key="_solr.query">
  ...
  <Seq key="highlighting">
    <Map>
      <Val key="attribute">global.solr.params</Val>
      <Val key="hl" type="boolean">true</Val>
      <!-- list of fields to be highlighted, space delimited -->
      <Val key="hl.fl">Content  Title</Val>
      <Val key="hl.simple.pre">&lt;b&gt;</Val>
      <Val key="hl.simple.post">&lt;/b&gt;</Val>
    </Map>
    <!-- other maps with attribute = field name for per-field configuration -->
  </Seq>
  ...
</Map>

The configuration can be done globally (applies to all highlighted fields) as well as per field. Each configuration is contained in a map that must have an entry attribute. Its value is either global.solr.params, which signifies the global highlight settings, or the name of the attribute/field that is to be configured. The other entries in this map correspond in name and value to the ones Solr supports. See http://wiki.apache.org/solr/HighlightingParameters.

In order to turn on highlighting, at least the global config must be present with the entry hl=true.
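
A per-field configuration next to the global one could look like the following sketch. hl.snippets and hl.fragsize are standard Solr highlighting parameters; that they are written with their full Solr names in the per-field map is an assumption based on the description above, and the chosen values are arbitrary.

```xml
<Seq key="highlighting">
  <!-- global settings, required to turn highlighting on -->
  <Map>
    <Val key="attribute">global.solr.params</Val>
    <Val key="hl" type="boolean">true</Val>
    <Val key="hl.fl">Content Title</Val>
  </Map>
  <!-- settings that apply to the Content field only (assumed notation) -->
  <Map>
    <Val key="attribute">Content</Val>
    <Val key="hl.snippets" type="long">2</Val>
    <Val key="hl.fragsize" type="long">150</Val>
  </Map>
</Seq>
```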

Programmatic highlighting configuration is done through HighlightingQueryConfigAdapter. The default constructor creates a configuration object with global highlighting parameters, which is required to enable highlighting. The other constructor provides an optional per-field configuration.

   // create global highlighting configuration (required, enables highlighting)
    final HighlightingQueryConfigAdapter highlighting = new HighlightingQueryConfigAdapter();
    highlighting.setHighlightingFields("Content Title");
    highlighting.setHighlightingSimplePre("<b>");
    highlighting.setHighlightingSimplePost("</b>");
    builder.addHighlightingConfiguration(highlighting);

Unlike in standard SMILA, the _highlight annotation is not created per result item. Instead, the highlighted text replaces the normally returned field value: if the Content field is returned by your search and highlighting is configured on it, the search returns only the highlighted value for the Content field.

Facets

Facets are specified for solr through the /facetby Seq as defined in the standard. However, the following differences exist:

  • maxcount is optional
  • Solr doesn't support ordering of facets; if ordering is set, a warning is written to the log and the setting is otherwise ignored.

Faceting is turned on as soon as the facetby Seq is present.

Note that the attribute value must be the Solr field name, as the mapping from the SolrSearchPipelet is not applied.

The values in the nativeParameters Map are passed verbatim to Solr for the field, following the pattern f.${attribute}.${key}=${value}. This allows you to specify any valid Solr parameter/value pair at field level without any intervention by SMILA. Global facet parameters may be defined in the _solr.query map.

Solr supports different kinds of faceting, which can be selected with the type parameter. Its value is Solr's respective parameter name and is passed as given. No checks are performed, so that future faceting methods work out of the box. It defaults to facet.field if missing. Solr's facet.query is currently not supported through this structure, as it needs to be formulated quite differently; it must instead be given as global parameters in the _solr.query map. Nonetheless, the facets are returned the normal way.

<Seq key="facetby">
  <!-- per-field configuration for facet.field -->
  <Map>
    <Val key="type">facet.field</Val>
    <Val key="maxcount" type="long">10</Val>
    <Val key="attribute">Extension</Val>
  </Map>
  <!-- per-field configuration for facet.date -->
  <Map>
    <Val key="type">facet.date</Val>
    <Val key="attribute">LastModifiedDate</Val>
    <Map key="nativeParameters">
      <Val key="facet.date.start">NOW/DAY-5DAYS</Val>
      <Val key="facet.date.gap">+1DAY</Val>
      <Val key="facet.date.end">NOW/DAY+1DAY</Val>
    </Map>
  </Map>
</Seq>

Facets are returned the SMILA standard way in the facets map.
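
To illustrate the f.${attribute}.${key}=${value} pattern described above, the following sketch adds two native parameters to a field facet. facet.mincount and facet.missing are standard Solr facet parameters; the comments show the request parameters that would result from the pattern, and the values are arbitrary.

```xml
<Seq key="facetby">
  <Map>
    <Val key="type">facet.field</Val>
    <Val key="attribute">Extension</Val>
    <Map key="nativeParameters">
      <!-- passed to Solr as f.Extension.facet.mincount=1 -->
      <Val key="facet.mincount" type="long">1</Val>
      <!-- passed to Solr as f.Extension.facet.missing=true -->
      <Val key="facet.missing" type="boolean">true</Val>
    </Map>
  </Map>
</Seq>
```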

Solr Specific Parameters (_solr.query)

Some configuration deviations from the SMILA standard and other solr specialties are put into a Solr specific _solr.query Map element at top level of the search record.

The following are supported:

  • filters
  • shards
  • request handler

Filters

Solr filters may only be specified directly, i.e. as native query strings via the fq element. Multiple filters are automatically ANDed. Note that the QueryBuilder's methods to add filters and the /filter Seq are not supported (yet).

Shards

Shards are only supported in remote mode and may be defined through the _solr.query/shards Seq.

Solr Request Handler

To select another solr request handler add the _solr.query/qt entry.

The following XML snippet shall illustrate these cases:

<Record xmlns="http://www.eclipse.org/SMILA/record" version="2.0">
  ...
  <Map key="_solr.query">
    <!-- filter query (fq) -->
    <Seq key="fq">
      <Val>Size:[500 TO 1000]</Val>
      <Val>Author:"H. Simpson"</Val>
    </Seq>
 
    <!-- shards -->
    <Seq key="shards">
      <Val>http://localhost:8983/solr</Val>
      <Val>http://remote-server:8983/solr</Val>
    </Seq>
 
    <!-- request handler (qt) -->
    <Val key="qt">/custom</Val>
  </Map>
</Record>

SolrQueryBuilder

Instead of assembling the XML/Record yourself, you can use the SolrQueryBuilder from within SMILA. This class extends the native QueryBuilder with methods to configure a Solr request and special Solr features like highlighting or facets. For configuring additional Solr features there are adapter classes, which also give an overview of the possible parameters.

   // create Solr specific query builder
    final SolrQueryBuilder builder = new SolrQueryBuilder();
 
    // set query
    builder.setQuery("query");
 
    // set start (equals offset, default: 0)
    builder.setStart(10);
 
    // set rows (equals max count, default: 10)
    builder.setRows(5);
 
    // set fields (equals result attributes, default: Id, score)
    final String[] fl = { "Path", "Size", "Content" };
    builder.addFields(fl);
 
    // add a filter query (example: size between 500 and 1000)
    builder.addFilterQuery("Size:[500 TO 1000]");
 
    // set shards
    final String[] shards = { "http://localhost:8983/solr", "http://remote-server:8983/solr" };
    builder.setShards(shards);
 
    // set request handler
    builder.setRequestHandler("/terms");

Auxiliary Search Functions

Auto-suggest/Terms

Auto-suggest/completion is also done via a search request, albeit a very special, stripped-down one, which looks like this in the default setup:

<Record >
  <Map key="_solr.query">
    <Map key="terms">
      <Val key="terms" type="boolean">true</Val>
      <Val key="terms.fl">Content</Val>
      <Val key="terms.prefix">con</Val>
    </Map>
    <Val key="qt">/terms</Val>
  </Map>
</Record>

The only items that have to be present are the terms map and the qt entry, which needs to be set to an appropriate handler (by default this is /terms). The entries in the terms map are passed as-is to Solr. For more information about terms configuration and parameters see http://wiki.apache.org/solr/TermsComponent.
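
Since the entries of the terms map are handed to Solr unchanged, further TermsComponent parameters can be added in the same way. The sketch below limits and filters the suggestions; terms.limit and terms.mincount are standard TermsComponent parameters, while the field, prefix, and values are arbitrary examples.

```xml
<Record>
  <Map key="_solr.query">
    <Map key="terms">
      <Val key="terms" type="boolean">true</Val>
      <Val key="terms.fl">Title</Val>
      <Val key="terms.prefix">sug</Val>
      <!-- return at most 5 suggestions -->
      <Val key="terms.limit" type="long">5</Val>
      <!-- skip terms that occur in fewer than 2 documents -->
      <Val key="terms.mincount" type="long">2</Val>
    </Map>
    <Val key="qt">/terms</Val>
  </Map>
</Record>
```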

The results are returned in the _solr.result/terms map: each key is a completed word, and its value is the number of documents in the index that contain this word.

<Record >
  <Seq key="records"></Seq>
  <Val key="runtime" type="long">3</Val>
  <Map key="_solr.result">
    <Map key="terms">
      <Val key="congratulations" type="long">1</Val>
      <Val key="conjugate" type="long">1</Val>
      <Val key="containing" type="long">1</Val>
    </Map>
  </Map>
</Record>

In SMILA code this can be done like so:

    final TermsQueryConfigAdapter terms = new TermsQueryConfigAdapter(_solrField);
    terms.setTermsPrefix("con");
    _queryBuilder.setTermsConfiguration(terms);
    _queryBuilder.setRequestHandler("/terms");

Spellcheck (Did you mean)

SMILA's default setup has spell checking (Did you mean) enabled for the Content field. In most cases it is useful to configure the default request handler to use the SpellCheckComponent (solrconfig.xml), and this has been done. Otherwise the correct request handler must be set (solrconfig.xml example: /spell). By default the SpellCheckComponent uses a separate index which is created on the fly and updated on every commit. Therefore, to retrieve alternative suggestions for possibly misspelled input words, you just need to add the spellcheck map to _solr.query:

  <Map key="_solr.query">
     ....
    <Map key="spellcheck">
      <Val key="spellcheck" type="boolean">true</Val>
      <Val key="spellcheck.count" type="long">5</Val>
      <Val key="spellcheck.extendedResults" type="boolean">true</Val>
      <Val key="spellcheck.collate" type="boolean">true</Val>
    </Map>
  </Map>

The map contains solr parameters (see http://wiki.apache.org/solr/SpellCheckComponent) that are passed "as is" to solr.

This will add the spellcheck map to _solr.result:

  <Map key="_solr.result">
    ...
    <Map key="spellcheck">
      <Map key="rust">
        <Val key="just" type="long">1</Val>
        <Val key="bust" type="long">1</Val>
      </Map>
      <Val key="collation">Content:just</Val>
    </Map>
    ...
  </Map>

For each misspelled word there is a nested map containing the corrections, where the key is the corrected term and the value is the frequency of the term in the index. The value for the frequency must be turned on via spellcheck.extendedResults and defaults to -1 otherwise.

When collate is on, you can also find a full alternative query under the key collation.

The XML snippets above were generated with the following code:

    addSolrDoc("1",
      "This is a simple text without real meaning as i dont want to bust my behind for smth. with more sense.");
    addSolrDoc("2", "It is just used for testing.");
    indexAndCommit();
 
    // setup search
    final SpellCheckQueryConfigAdapter spellcheck = new SpellCheckQueryConfigAdapter();
    spellcheck.setSpellCheckCount(5);
    spellcheck.setSpellCheckExtendedResults(true);
    spellcheck.setSpellCheckCollate(true);
    _queryBuilder.setSpellCheckConfiguration(spellcheck);
    _queryBuilder.setQuery("Content:rust");

More Like This / What's related

Solr offers a feature to return related documents, called More Like This (MLT). Two modes are supported:

  1. return the top N related documents for all items in the SRL, see [1]
  2. return the related documents ad hoc for just one document, using a dedicated request handler, see [2]

The first variant is considerably more expensive than the second.

Both modes are supported through SMILA and configured very similarly. SMILA doesn't do anything special to the arguments you pass in with the record and hands them on to Solr as-is, except that it performs any necessary URL encoding for you. While you may assign specific data types to the parameters, this is not necessary: all values may be given as strings, since that is what is passed on to Solr anyhow.

Which mode is active ultimately depends on your handler configuration in solrconfig.xml. However, we will assume here SMILA's default setup, which binds the MLT handler to /mlt and a normal query to /select.

Both modes share most of the MLT parameters but also need/support specific ones.

<record>
 
  <!-- this is the lucene query expression that is executed in both cases. -->  
  <Val key="query">euklid</Val>
  ...
  <Map key="_solr.query">
    <!-- this selects the Solr request handler. Set it to /mlt when you want to use the MLT handler. -->
    <!-- <Val key="qt">/mlt</Val> -->
    <!-- determines the list of fields returned for both the normal results as well as the MLT results  -->  
    <Val key="fl" >Id,score,Size</Val>
    ...
    <Map key="moreLikeThis">
      <Val key="mlt" >true</Val>
      <Val key="mlt.fl" >Content</Val>
      <Val key="mlt.mindf">1</Val>
      <Val key="mlt.mintf">1</Val>
      ...
    </Map>
  </Map>
</record>


MLT Results w/o Handler

In this case Solr adds the moreLikeThis section on the same level as the normal response section, so you would need to look up the MLT docs for each result item manually. SMILA, on the other hand, transforms the Solr result so that the MLT information becomes a nested part of SMILA's result item, like so:

<Seq key="records">
  <Map>
    <Val key="_recordid">file:Euklid.html</Val>
    <Val key="_weight" type="double">0.7635468</Val>
    <Map key="_mlt.meta">
      <Val key="start" type="long">0</Val>
      <Val key="count" type="long">3</Val>
      <Val key="max_score" type="double">0.8115930557250977</Val>
    </Map>
    <Seq key='_mlt'>
      <Map>
        <Val key="_recordid">file:Archytas_von_Tarent_7185.html</Val>
        <Val key="_weight" type="double">0.5511907</Val>
        <Val key="Size" type="long">47934</Val>
        ...                
      </Map>
      <Map>
        <Val key="_recordid">file:Aristoxenos.html</Val>
        <Val key="_weight" type="double">0.44604447</Val>
        <Val key="Size" type="long">39332</Val>
        ...                
      </Map>
      ... 
    </Seq>
    ...
  </Map>
   ... 
</Seq>

This sample contains the Solr result item with the id file:Euklid.html. With MLT turned on, it contains a nested _mlt Seq which holds the N related docs for that result item, each represented by a Map (MLT-Map). Note that this prevents you from having a Solr document field of the same name returned in this MLT mode. The Val elements in each MLT-Map are defined by the list of fields in the fl parameter. The _recordid and _weight Vals appear even though the fl value is Id,score,Size because SMILA defines the fields Id and score and automatically maps them to _recordid and _weight. Any other field that you include through fl is added as a Val element to the MLT result item with the field name as its key, as shown for Size here.

There is also the _mlt.meta Map that contains information about the MLT result, such as the number of items (count), start (offset), and max_score. The keys of these values are the same as for the normal result.

MLT Results with Handler

For performance reasons, the more common use case of MLT is to return the related docs for just one document. This is done by making a request against the MLT handler itself.

The document for which you want the related docs is usually known, e.g. from a previous search whose rendered result list contains a link to fetch or show related docs. In this case the query just selects the given document by its Id (as shown in the example below). You may also provide any other query here. However, if the query returns more than one document, the handler selects just one, depending on the other MLT parameters, and returns only the related docs for that document.

The differences from the query record above are:

<record>
 
  <!-- this is the Lucene query to select a document by its Id. Note the escaping of the Id string! -->  
  <Val key="query">Id:file\:Euklid.html</Val>
  ...
  <Map key="_solr.query">
    <!-- this selects the Solr MLT request handler. -->  
    <Val key="qt">/mlt</Val>
    ...
  </Map>
</record>
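
Putting the shared parameters and the handler-specific differences together, a complete request against the MLT handler might look like the sketch below. It only combines elements already shown on this page; which further MLT parameters you need depends on your index, so treat the values as examples rather than recommended settings.

```xml
<Record xmlns="http://www.eclipse.org/SMILA/record" version="2.0">
  <!-- select the source document by its Id (colon escaped for Lucene) -->
  <Val key="query">Id:file\:Euklid.html</Val>
  <Map key="_solr.query">
    <!-- route the request to the MLT handler -->
    <Val key="qt">/mlt</Val>
    <!-- fields returned for each related document -->
    <Val key="fl">Id,score,Size</Val>
    <Map key="moreLikeThis">
      <Val key="mlt">true</Val>
      <!-- field used to compute similarity -->
      <Val key="mlt.fl">Content</Val>
      <Val key="mlt.mindf">1</Val>
      <Val key="mlt.mintf">1</Val>
    </Map>
  </Map>
</Record>
```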

The results of such an MLT request are contained in the standard records Seq, in the same way that normal search results are returned, except that the items represent the MLT docs.

<Seq key="records">
  <Map>
    <Val key="_recordid">file:Archytas_von_Tarent_7185.html</Val>
    <Val key="_weight" type="double">0.5511907</Val>
    <Val key="Size" type="long">47934</Val>
  </Map>
  <Map>
    <Val key="_recordid">file:Aristoxenos.html</Val>
    <Val key="_weight" type="double">0.44604447</Val>
    <Val key="Size" type="long">39332</Val>
  </Map>
  ...
</Seq>

Note (help wanted): Due to lack of need, returning mlt.interestingTerms has not been implemented yet.


In case of mlt.interestingTerms=details the result record will contain the following additional information:

<Map key="_solr.result">
    ...
    <Map key="interestingTerms">
      <Val key="Content:euklid" type="double">1.0</Val>
      <Val key="Content:geometrie" type="double">1.0</Val>
      ...
    </Map>
    ...
</Map>

or in case of mlt.interestingTerms=list just:

<Map key="_solr.result">
    ...
    <Seq key="interestingTerms">
      <Val>euklid</Val>
      <Val>geometrie</Val>
      ...
    </Seq>
    ...
</Map>
