Difference between revisions of "SMILA/Documentation/Solr"

Revision as of 10:33, 14 July 2011

Solr is an open source search server based on the Lucene search engine. In addition to a powerful full-text-search, sorting and filtering, Solr comes with a lot of built-in features like highlighting, facets, auto-suggest and spell checking.

Note: The current implementation is a work in progress. The goal is to make it the default search implementation for SMILA, replacing the current embedded Lucene integration. As a consequence, things are likely to change in future versions.


SolrServerManager & SolrProperties

Solr can run either as a standalone remote server or as an embedded server within SMILA. A properties file controls the running mode: configuration/org.eclipse.smila.solr/solr.properties

##### If true, SMILA loads the default configuration for an embedded Solr instance (see below) #####
solr.embedded=true
 
##### Alternative workspace folder equals solr.home (embedded only) #####
solr.workspaceFolder=./workspace/.metadata/.plugins/org.eclipse.smila.solr
 
##### Server url for http connections to Solr server (remote only) #####
solr.serverUrl=http://localhost:8983/solr
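
To illustrate how the three entries interact, the following is a minimal sketch in plain Java that parses such a file and reports which mode it selects. It is not SMILA's own loading code; the class name SolrModeCheck and the choice to treat a missing solr.embedded entry as remote mode are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class SolrModeCheck {

  /** Returns a one-line summary of the mode selected by the given solr.properties content. */
  static String describeMode(final String propertiesText) {
    final Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
    // assumption of this sketch: a missing solr.embedded entry means remote mode
    final boolean embedded = Boolean.parseBoolean(props.getProperty("solr.embedded", "false"));
    if (embedded) {
      // solr.workspaceFolder is only relevant in embedded mode
      return "embedded, solr.home=" + props.getProperty("solr.workspaceFolder");
    }
    // solr.serverUrl is only relevant in remote mode
    return "remote, url=" + props.getProperty("solr.serverUrl");
  }

  public static void main(final String[] args) {
    final String sample = "solr.embedded=true\n"
      + "solr.workspaceFolder=./workspace/.metadata/.plugins/org.eclipse.smila.solr\n"
      + "solr.serverUrl=http://localhost:8983/solr\n";
    System.out.println(describeMode(sample));
  }
}
```

With solr.embedded=true the server URL is ignored, and with solr.embedded=false the workspace folder is ignored.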

Default configuration

SMILA supports Solr only in a multicore setup (core is the Solr term for a search index), regardless of whether Solr runs embedded or remote.

The default configuration included in SMILA is defined in configuration/org.eclipse.smila.solr. The default mode is 'remote', in which case no internal Solr server is started. However, the same folder contains a full Solr multicore configuration that is used when the mode is set to embedded. This setup defines a single core, DefaultCore, which is suitable for the HowTo cases in SMILA.

More information about solr cores and their configuration can be found at: http://wiki.apache.org/solr/CoreAdmin

When SMILA starts up for the first time with Solr configured as embedded, this configuration is copied to the Solr workspace (solr.home).

schema.xml

One of the most important configuration files is configuration/org.eclipse.smila.solr/DefaultCore/conf/schema.xml. This file defines the index fields and types. SMILA comes with the following set of predefined fields:

<field name="Id" type="string_id" indexed="true" stored="true" required="true" />
<field name="LastModifiedDate" type="date" indexed="true" stored="true" />
<field name="Filename" type="text_path" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Path" type="text_path" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Extension" type="textgen" indexed="true" stored="true" />
<field name="Size" type="long" indexed="true" stored="true" />
<field name="MimeType" type="textgen" indexed="true" stored="true" />
<field name="Content" type="textgen" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Title" type="textgen" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true" />

The schema.xml also contains the uniqueKey property, which tells Solr which field identifies a document so that it can transparently handle adds and updates accordingly. By default it is set to Id.

All other configuration options, such as field types, the default search field, copy fields and many more, are described here: http://wiki.apache.org/solr/SchemaXml

solrconfig.xml

Another major configuration file is configuration/org.eclipse.smila.solr/DefaultCore/conf/solrconfig.xml. It configures all SearchComponents and RequestHandlers as well as the general indexing and query behavior.

Please refer to its documentation here: http://wiki.apache.org/solr/SolrConfigXml

Important for SMILA is that in the embedded case the dataDir property defaults to the data/ subfolder of the core instance (i.e. solr.home/DefaultCore/data/). Hence, in embedded mode the SMILA workspace may grow quite large. To provide an alternative location, set this property in this file or set it for the core in solr.xml.

SMILA uses autoCommit via solr.DirectUpdateHandler2. It tells Solr to commit automatically every 60 seconds or after 1000 documents have been added. If this property is not set, no commit occurs, and the indexed data is neither persistent nor searchable unless you send the appropriate Solr commands yourself. The values are a compromise between two factors:

  • how soon a searching user must see the updates
  • how many update requests are sent to Solr

Note that during a commit the Solr server stalls updates, which might lead to index pipelet timeouts.

How to use Solr with SMILA

Indexing data

The SolrIndexPipelet can add, update or delete records (equates to Solr documents) in an index.

Configuration in addpipeline:

	<extensionActivity>
      <proc:invokePipelet name="SolrIndexPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.index.SolrIndexPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <!-- either ADD or DELETE. -->
          <rec:Val key="ExecutionMode">ADD</rec:Val>
          <!-- defines the default core into which the record will be written. optional, but if missing then the target core 
            must be set in the record via SolrConstants.DYNAMIC_TARGET_CORE -->
          <rec:Val key="CoreName">DefaultCore</rec:Val>
          <!-- seq of fields that are to be filled. each tuple is a map that defines the target core field, the source field 
            (optional) and the source type (optional ) -->
          <rec:Seq key="CoreFields">
            <rec:Map>
              <!-- target field name in the solr core -->
              <rec:Val key="FieldName">Folder</rec:Val>
              <!-- name of the source attribute or attachment in the record. optional, defaults to the target field name -->
              <rec:Val key="RecSourceName">Path</rec:Val>
              <!-- either ATTRIBUTE or ATTACHMENT. optional, defaults to ATTRIBUTE. -->
              <rec:Val key="RecSourceType">ATTRIBUTE</rec:Val>
            </rec:Map>
            <rec:Map>
              <rec:Val key="FieldName">Filename</rec:Val>
            </rec:Map>
            ...
          </rec:Seq>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

Configuration in deletepipeline:

	<extensionActivity>
      <proc:invokePipelet name="SolrIndexPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.index.SolrIndexPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="ExecutionMode">DELETE</rec:Val>
          <rec:Val key="CoreName">DefaultCore</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

Search

At the time of writing, the standard SMILA search servlet does not yet work with the Solr search. Hence, you either need to write your own servlet and integrate it into SMILA, or use the generic Record Search Servlet. The search pipeline must be configured manually to use the SolrSearchPipelet because it is not the default at the moment.

Search Pipelet Config

The SolrSearchPipelet offers the possibility to search a Solr index. The pipelet needs only a small configuration without any special parameters.

    <extensionActivity>
      <proc:invokePipelet name="invokeSolrSearchPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.search.SolrSearchPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

Solr Specific Search Record

For full feature support an enhanced search record is required. This section provides both XML samples showing how the features are configured in the search record and descriptions of the helper classes that are available from within SMILA. Path notations for elements in the record use the key names of the respective elements as path segments and always start from the root, e.g. _solr.query/highlighting.

To understand the following sections you must know the standard SMILA search record.

Standard Parameters

The following SMILA standard query parameters are supported:

  • maxcount
  • offset
  • indexname, this must correspond to an existing solr core name
  • resultAttributes
  • query

The Solr pipelet supports only a single query element, given as a string value, which it passes unaltered to Solr. The default Solr handler assumes this to be a valid Lucene query string, but ultimately this depends on the configured handler. All escaping must be done by whoever constructs the search record. (Note: there is no need to URL-encode it, as this is done internally.)

<Record xmlns="http://www.eclipse.org/SMILA/record" version="2.0">
  <!-- query (q) -->
  <Val key="query">Content:solr Content:eclipse</Val>
  <Val key="maxcount" type="long">3</Val>
  <Val key="offset" type="long">3</Val>  
  <Val key="indexname">wikipedia</Val>
  <Seq key="resultAttributes">
    <Val>Content</Val>
    <Val>Id</Val>
  </Seq>
    ...
</Record>

The above sample shows a query on the index field Content for the terms "solr" and "eclipse".
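
Because the query string is passed to Solr unaltered, user input that ends up inside it should have Lucene's special characters escaped first. The following is a minimal sketch of such a helper in plain Java; the class LuceneQueryEscaper and its methods are hypothetical and not part of SMILA. It does not escape whitespace, so it is meant for single terms only. SolrJ ships a comparable utility (ClientUtils.escapeQueryChars) that may be preferable where SolrJ is on the classpath.

```java
final class LuceneQueryEscaper {

  // characters with a special meaning in the Lucene query syntax;
  // & and | are only special when doubled, escaping them singly is harmless
  private static final String SPECIAL = "\\+-!(){}[]^\"~*?:/&|";

  /** Prefixes every special character in the given term with a backslash. */
  static String escape(final String term) {
    final StringBuilder sb = new StringBuilder(term.length() + 8);
    for (int i = 0; i < term.length(); i++) {
      final char c = term.charAt(i);
      if (SPECIAL.indexOf(c) >= 0) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  /** Builds a "field:term" clause with the term escaped; the field name is used as is. */
  static String fieldTerm(final String field, final String term) {
    return field + ":" + escape(term);
  }

  public static void main(final String[] args) {
    // the resulting string could be used as the value of the query element
    System.out.println(fieldTerm("Path", "C:\\docs\\a+b.txt"));
  }
}
```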

Highlighting

Highlighting for Solr deviates from the standard SMILA way in order to support Solr's features. The configuration is contained in _solr.query/highlighting.

<Map key="_solr.query">
  ...
  <Seq key="highlighting">
    <Map>
      <Val key="attribute">global.solr.params</Val>
      <Val key="hl" type="boolean">true</Val>
      <!-- list of fields to be highlighted, space delimited -->
      <Val key="hl.fl">Content  Title</Val>
      <Val key="hl.simple.pre">&lt;b&gt;</Val>
      <Val key="hl.simple.post">&lt;/b&gt;</Val>
    </Map>
    <!-- other maps with attribute = field name for per-field configuration -->
  </Seq>
  ...
</Map>

The configuration can be done globally (applying to all highlighted fields) as well as per field. Each configuration is a map that must have an attribute entry. Its value is either global.solr.params, which marks the global highlighting settings, or the name of the attribute/field to be configured. The other entries in the map correspond in name and value to the parameters Solr supports. See http://wiki.apache.org/solr/HighlightingParameters.

In order to turn on highlighting, at least the global config must be present with the entry hl=true.

Programmatic highlighting configuration is done through HighlightingQueryConfigAdapter. The default constructor creates a configuration object with global highlighting parameters, which is required to enable highlighting. The other constructor creates an optional per-field configuration.

   // create global highlighting configuration (required, enables highlighting)
    final HighlightingQueryConfigAdapter highlighting = new HighlightingQueryConfigAdapter();
    highlighting.setHighlightingFields("Content Title");
    highlighting.setHighlightingSimplePre("<b>");
    highlighting.setHighlightingSimplePost("</b>");
    builder.addHighlightingConfiguration(highlighting);

Unlike in standard SMILA, the _highlight annotation is not created per result item. Instead, the highlighted value replaces the normally returned field value: if the Content field is among the result attributes of your search and you have also configured highlighting on it, the search returns only the highlighted value for the Content field.

Facets

Facets are specified for Solr through the /groupby seq as defined in the standard. However, the contained elements have been extended and changed in order to:

  • leverage solr's features, i.e. enable also range facet on date and numeric fields
  • simplify programming, i.e. no mapping between solr and SMILA constants

The latter allows you to just specify any valid solr parameter/value pair on both global and field level without any interaction on our part, i.e. all entries with the exception of attribute and those starting with _ are passed to solr as is.

The global facet parameters are defined in a map that contains attribute=global.solr.params. In order to turn faceting on it also must contain facet=true. Faceting configs on field level are done by adding maps that contain attribute=<field name>.

To tell which kind of faceting is to be performed you need to provide the _facet annotation which may take these values:

  • facet.field
  • facet.date
  • facet.query


<Seq key="groupby">
  <!-- a global map to enable facets -->
  <Map>
    <Val key="facet" type="boolean">true</Val>
    <Val key="attribute">global.solr.params</Val>
  </Map>
  <!-- per-field configuration for facet.field -->
  <Map>
    <Val key="_facet">facet.field</Val>
    <Val key="facet.limit" type="long">10</Val>
    <Val key="attribute">Extension</Val>
  </Map>
  <!-- per-field configuration for facet.date -->
  <Map>
    <Val key="_facet">facet.date</Val>
    <Val key="facet.date.start">NOW/DAY-5DAYS</Val>
    <Val key="facet.date.gap">+1DAY</Val>
    <Val key="facet.date.end">NOW/DAY+1DAY</Val>
    <Val key="attribute">LastModifiedDate</Val>
  </Map>
  <!-- per-field configuration for facet.query -->
  <Map>
    <Val key="_facet">facet.query</Val>
    <Seq key="_fc">
      <Val>* TO 1000</Val>
      <Val>1000 TO 5000</Val>
      <Val>5000 TO *</Val>
    </Seq>
    <Val key="attribute">Size</Val>
  </Map>
</Seq>

Facets are returned the standard SMILA way, in the groups map.

The FacetQueryConfigAdapter has no default constructor; its constructor takes a FacetType parameter instead. At least one configuration of type FacetType.GLOBAL (the global configuration) is required to enable facets. The other types are FacetType.DATE, FacetType.FIELD and FacetType.QUERY, the last of which additionally takes an array of Strings.

    // create global facet configuration (required, enables facets)
    final FacetQueryConfigAdapter facet_global = new FacetQueryConfigAdapter(FacetType.GLOBAL);
    builder.addFacetConfiguration(SolrConstants.GLOBAL, facet_global);
 
    // create field facet configuration
    final FacetQueryConfigAdapter facet_field = new FacetQueryConfigAdapter(FacetType.FIELD);
    facet_field.setFacetLimit(10);
    builder.addFacetConfiguration("Extension", facet_field);
 
    // create facet date configuration
    final FacetQueryConfigAdapter facet_date = new FacetQueryConfigAdapter(FacetType.DATE);
    facet_date.setFacetDateStart("NOW/DAY-5DAYS");
    facet_date.setFacetDateGap("+1DAY");
    facet_date.setFacetDateEnd("NOW/DAY+1DAY");
    builder.addFacetConfiguration("LastModifiedDate", facet_date);
 
    // create facet query configuration (range example)
    final String[] fq = { "* TO 1000", "1000 TO 5000", "5000 TO *" };
    final FacetQueryConfigAdapter facet_query = new FacetQueryConfigAdapter(FacetType.QUERY, fq);
    builder.addFacetConfiguration("Size", facet_query);

For more on how the different facet types and their parameters work, see: http://wiki.apache.org/solr/SimpleFacetParameters
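
The range strings for a facet.query configuration follow a regular pattern, with * standing for the open side of the first and last range. As a rough sketch, they can be generated from a list of ascending boundaries with plain Java. The class RangeFacetQueries is hypothetical and not part of SMILA; it only builds the strings in the form used in the samples on this page and makes no assumption about how the pipelet turns them into Solr parameters.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeFacetQueries {

  /** Builds contiguous ranges "* TO b0", "b0 TO b1", ..., "bn TO *" from ascending boundaries. */
  static String[] fromBoundaries(final long... boundaries) {
    final List<String> ranges = new ArrayList<>();
    String lower = "*"; // the first range is open at the lower end
    for (final long boundary : boundaries) {
      ranges.add(lower + " TO " + boundary);
      lower = Long.toString(boundary);
    }
    ranges.add(lower + " TO *"); // the last range is open at the upper end
    return ranges.toArray(new String[0]);
  }

  public static void main(final String[] args) {
    // prints: * TO 1000, 1000 TO 5000, 5000 TO *
    System.out.println(String.join(", ", fromBoundaries(1000, 5000)));
  }
}
```

The resulting array could then be passed where the samples use a hand-written String array for FacetType.QUERY.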

Solr Specific Parameters (_solr.query)

Some configuration deviations from the SMILA standard and other solr specialties are put into a Solr specific _solr.query Map element at top level of the search record.

The following are supported:

  • filters
  • shards
  • request handler

Filters

Solr filters may only be specified directly, i.e. as native query strings via the fq element. Multiple filters are automatically ANDed. Note that the QueryBuilder's methods for adding filters and the /filter Seq are not supported (yet).

Shards

Shards are only supported in remote mode and may be defined through the _solr.query/shards Seq.

Solr Request Handler

To select another solr request handler add the _solr.query/qt entry.

The following XML snippet shall illustrate these cases:

<Record xmlns="http://www.eclipse.org/SMILA/record" version="2.0">
  ...
  <Map key="_solr.query">
    <!-- filter query (fq) -->
    <Seq key="fq">
      <Val>Size:[500 TO 1000]</Val>
      <Val>Author:"H. Simpson"</Val>
    </Seq>
 
    <!-- shards -->
    <Seq key="shards">
      <Val>http://localhost:8983/solr</Val>
      <Val>http://remote-server:8983/solr</Val>
    </Seq>
 
    <!-- request handler (qt) -->
    <Val key="qt">/custom</Val>
  </Map>
</Record>

SolrQueryBuilder

Instead of assembling the XML/Record yourself, you can use the SolrQueryBuilder from within SMILA. This class extends the standard QueryBuilder with methods to configure a Solr request and special Solr features like highlighting or facets. Additional Solr features are configured through adapter classes, which also give an overview of the possible parameters.

   // create Solr specific query builder
    final SolrQueryBuilder builder = new SolrQueryBuilder();
 
    // set query
    builder.setQuery("query");
 
    // set start (equals offset, default: 0)
    builder.setStart(10);
 
    // set rows (equals max count, default: 10)
    builder.setRows(5);
 
    // set fields (equals result attributes, default: Id, score)
    final String[] fl = { "Path", "Size", "Content" };
    builder.addFields(fl);
 
    // add a filter query (example: size between 500 and 1000)
    builder.addFilterQuery("Size:[500 TO 1000]");
 
    // set shards
    final String[] shards = { "http://localhost:8983/solr", "http://remote-server:8983/solr" };
    builder.setShards(shards);
 
    // set request handler
    builder.setRequestHandler("/terms");

Auxiliary Search Functions

Auto-suggest/Terms

Auto-suggest/completion is also done via a search request, albeit a special, stripped-down one. In the default setup it looks like this:

<Record >
  <Map key="_solr.query">
    <Map key="terms">
      <Val key="terms" type="boolean">true</Val>
      <Val key="terms.fl">attr_test_Simple</Val>
      <Val key="terms.prefix">con</Val>
    </Map>
    <Val key="qt">/terms</Val>
  </Map>
</Record>

The only items that have to be present are the terms map and the qt entry, which must be set to an appropriate handler (by default this is /terms). The entries in the terms map are passed as is to Solr. For more information about terms configuration and parameters see http://wiki.apache.org/solr/TermsComponent.

The results are returned in the _solr.result/terms map. Each key is a completed word, and its value tells you how many documents in the index contain this word.

<Record >
  <Seq key="records"></Seq>
  <Val key="runtime" type="long">3</Val>
  <Map key="_solr.result">
    <Map key="terms">
      <Val key="congratulations" type="long">1</Val>
      <Val key="conjugate" type="long">1</Val>
      <Val key="containing" type="long">1</Val>
    </Map>
  </Map>
</Record>

In SMILA code this can be done like so:

    final TermsQueryConfigAdapter terms = new TermsQueryConfigAdapter(_solrField);
    terms.setTermsPrefix("con");
    _queryBuilder.setTermsConfiguration(terms);
    _queryBuilder.setRequestHandler("/terms");
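
Since each entry of the returned terms map pairs a completed word with its document count, a client will often want to show the most frequent completions first. The following is a minimal sketch in plain Java, assuming the map has already been read from the result record into a java.util.Map. The class TermSuggestions is hypothetical, and the counts in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermSuggestions {

  /** Returns up to limit terms, most frequent first; ties are ordered alphabetically. */
  static List<String> top(final Map<String, Long> terms, final int limit) {
    final List<Map.Entry<String, Long>> entries = new ArrayList<>(terms.entrySet());
    entries.sort((a, b) -> {
      final int byCount = Long.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    final List<String> result = new ArrayList<>();
    for (final Map.Entry<String, Long> entry : entries) {
      if (result.size() >= limit) {
        break;
      }
      result.add(entry.getKey());
    }
    return result;
  }

  public static void main(final String[] args) {
    // illustrative counts, not taken from a real index
    final Map<String, Long> terms = new HashMap<>();
    terms.put("congratulations", 2L);
    terms.put("conjugate", 1L);
    terms.put("containing", 4L);
    // prints: [containing, congratulations]
    System.out.println(top(terms, 2));
  }
}
```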

Spellcheck (Did you mean)

SMILA's default setup has spell checking (Did you mean) enabled for the Content field. In most cases it is useful to configure the default request handler to use the SpellCheckComponent (in solrconfig.xml), and this has been done. Otherwise the correct request handler must be set (in the solrconfig.xml example: /spell). By default the SpellCheckComponent uses a separate index, which is created on the fly and updated on every commit. Therefore, to retrieve alternative suggestions for possibly misspelled input words, you just need to add the spellcheck map to _solr.query:

  <Map key="_solr.query">
     ....
    <Map key="spellcheck">
      <Val key="spellcheck" type="boolean">true</Val>
      <Val key="spellcheck.count" type="long">5</Val>
      <Val key="spellcheck.extendedResults" type="boolean">true</Val>
      <Val key="spellcheck.collate" type="boolean">true</Val>
    </Map>
  </Map>

The map contains solr parameters (see http://wiki.apache.org/solr/SpellCheckComponent) that are passed "as is" to solr.

This will add the spellcheck map to _solr.result:

  <Map key="_solr.result">
    ...
    <Map key="spellcheck">
      <Map key="rust">
        <Val key="just" type="long">1</Val>
        <Val key="bust" type="long">1</Val>
      </Map>
      <Val key="collation">Content:just</Val>
    </Map>
    ...
  </Map>

For each misspelled word there is a nested map containing the corrections, where the key is the corrected term and the value is the frequency of the term in the index. The value for the frequency must be turned on via spellcheck.extendedResults and defaults to -1 otherwise.

When collate is on then you can also find a full alternative query under the key collation.

The XML snippets above were generated with the following code:

    addSolrDoc("1",
      "This is a simple text without real meaning as i dont want to bust my behind for smth. with more sense.");
    addSolrDoc("2", "It is just used for testing.");
    indexAndCommit();
 
    // setup search
    final SpellCheckQueryConfigAdapter spellcheck = new SpellCheckQueryConfigAdapter();
    spellcheck.setSpellCheckCount(5);
    spellcheck.setSpellCheckExtendedResults(true);
    spellcheck.setSpellCheckCollate(true);
    _queryBuilder.setSpellCheckConfiguration(spellcheck);
    _queryBuilder.setQuery("Content:rust");
