Difference between revisions of "SMILA/Documentation/Solr"

Revision as of 03:44, 12 July 2011

Solr is an open source search server based on the Lucene search engine. In addition to powerful full-text search, sorting and filtering, Solr comes with many built-in features such as highlighting, facets, auto-suggest and spell checking.

SolrServerManager & SolrProperties

Solr can run as a stand-alone remote server or as an embedded server within SMILA. A properties file controls the running mode: configuration/org.eclipse.smila.solr/solr.properties

##### If true SMILA load default configuration for an embedded Solr instance (see below) #####
solr.embedded=true
 
##### Alternative workspace folder equals solr.home (embedded only) #####
solr.workspaceFolder=./workspace/.metadata/.plugins/org.eclipse.smila.solr
 
##### Server url for http connections to Solr server (remote only) #####
solr.serverUrl=http://localhost:8983/solr
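
As a rough illustration of how these three keys interact, the following sketch parses the file with the standard java.util.Properties class and reports which setting applies. The class SolrPropertiesSketch and its methods are hypothetical and not part of SMILA; treating a missing solr.embedded key as false is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper, not part of SMILA: interprets the keys shown above. */
class SolrPropertiesSketch {

  /** Parses text in the standard java.util.Properties format. */
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new IllegalStateException("cannot read properties", e);
    }
    return props;
  }

  /** Describes which setting is relevant, depending on solr.embedded. */
  static String describeTarget(Properties props) {
    // Assumption: a missing key means remote mode.
    boolean embedded = Boolean.parseBoolean(props.getProperty("solr.embedded", "false"));
    return embedded
        ? "embedded, solr.home=" + props.getProperty("solr.workspaceFolder")
        : "remote, url=" + props.getProperty("solr.serverUrl");
  }

  public static void main(String[] args) {
    String text = "solr.embedded=true\n"
        + "solr.workspaceFolder=./workspace/.metadata/.plugins/org.eclipse.smila.solr\n"
        + "solr.serverUrl=http://localhost:8983/solr\n";
    System.out.println(describeTarget(parse(text)));
  }
}
```

In embedded mode only solr.workspaceFolder matters; in remote mode only solr.serverUrl does.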

Default configuration

SMILA supports Solr only in a multicore setup (a core is the Solr term for a search index), regardless of whether Solr runs embedded or remote.

The default configuration included in SMILA is defined in configuration/org.eclipse.smila.solr. The default mode is 'remote', in which case no internal Solr server is started. However, the same folder contains a full Solr multicore configuration that is used when the mode is set to 'embedded'. This setup defines a single core, DefaultCore, which is suitable for the HowTo cases in SMILA.

More information about Solr cores and their configuration can be found at: http://wiki.apache.org/solr/CoreAdmin

When SMILA starts up for the first time and Solr is configured as embedded, the configuration is copied to the Solr workspace (solr.home).

schema.xml

One of the most important configuration files is configuration/org.eclipse.smila.solr/DefaultCore/conf/schema.xml. This file defines index fields and types. SMILA comes with the following set of predefined fields:

<field name="Id" type="string_id" indexed="true" stored="true" required="true" />
<field name="LastModifiedDate" type="date" indexed="true" stored="true" />
<field name="Filename" type="text_path" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Path" type="text_path" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Extension" type="textgen" indexed="true" stored="true" />
<field name="Size" type="long" indexed="true" stored="true" />
<field name="MimeType" type="textgen" indexed="true" stored="true" />
<field name="Content" type="textgen" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="Title" type="textgen" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true" />

The schema.xml also contains the uniqueKey property, which tells Solr which field identifies a document. With it, Solr transparently decides whether an incoming document is an add or an update. By default it is set to Id.
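
The add-or-update behaviour that uniqueKey enables can be pictured as a map keyed by the Id field. The sketch below is a toy model of that rule only, assuming Id is the unique key; the class UniqueKeySketch is hypothetical and says nothing about how Solr stores documents internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of add-or-update by unique key; not Solr's implementation. */
class UniqueKeySketch {
  private final Map<String, Map<String, Object>> docsById = new LinkedHashMap<>();

  /**
   * Stores the document under its Id.
   * Returns true if an existing document with the same Id was replaced.
   */
  boolean addOrUpdate(Map<String, Object> doc) {
    Object id = doc.get("Id");
    if (id == null) {
      throw new IllegalArgumentException("field Id is required");
    }
    return docsById.put(id.toString(), doc) != null;
  }

  /** Number of distinct documents currently stored. */
  int size() {
    return docsById.size();
  }

  public static void main(String[] args) {
    UniqueKeySketch index = new UniqueKeySketch();
    index.addOrUpdate(Map.of("Id", "file:/a.txt", "Size", 100L));
    // Same Id again: the first document is replaced, not duplicated.
    boolean replaced = index.addOrUpdate(Map.of("Id", "file:/a.txt", "Size", 200L));
    System.out.println(replaced + " " + index.size());
  }
}
```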

All other configuration options, such as field types, the default search field, copy fields and many more, are described here: http://wiki.apache.org/solr/SchemaXml

solrconfig.xml

Another major configuration file is configuration/org.eclipse.smila.solr/DefaultCore/conf/solrconfig.xml. It configures all SearchComponents and RequestHandlers as well as general indexing and query behaviour.

Please refer to its documentation here: http://wiki.apache.org/solr/SolrConfigXml

Important for SMILA: in the embedded case the dataDir property defaults to the data/ subfolder of the core instance (i.e. solr.home/DefaultCore/data/). Hence, in embedded mode the SMILA workspace may grow quite large. Set this property in solrconfig.xml, or on the core entry in solr.xml, to provide an alternative location.

SMILA uses autoCommit via solr.DirectUpdateHandler2. It tells Solr to commit automatically every 60 seconds or after 1000 documents have been added. If this property is not set, no commit occurs and the indexed data is neither persistent nor searchable unless you send the appropriate Solr commands yourself. The values are a compromise between these factors:
- How soon must a searching user see the updates?
- How many update requests are sent to Solr? Note that the Solr server stalls updates during a commit, which might lead to index pipelet timeouts.
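
The commit rule described above (whichever limit is reached first triggers the commit) can be written down as a small predicate. This is a toy model for reasoning about the two thresholds, not Solr's implementation; the class AutoCommitSketch and its constants are hypothetical names that mirror the 1000 documents and 60 seconds mentioned in the text.

```java
/** Toy model of the autoCommit rule; not Solr's implementation. */
class AutoCommitSketch {
  static final int MAX_DOCS = 1000;        // commit after this many pending documents
  static final long MAX_TIME_MS = 60_000L; // or after this many milliseconds

  /**
   * Returns true if a commit is due.
   * pendingDocs: documents added since the last commit.
   * msSinceOldestPending: age of the oldest uncommitted document in milliseconds.
   */
  static boolean shouldCommit(int pendingDocs, long msSinceOldestPending) {
    if (pendingDocs <= 0) {
      return false; // nothing to commit
    }
    return pendingDocs >= MAX_DOCS || msSinceOldestPending >= MAX_TIME_MS;
  }

  public static void main(String[] args) {
    System.out.println(shouldCommit(10, 5_000L));    // neither limit reached
    System.out.println(shouldCommit(1000, 5_000L));  // document limit reached
    System.out.println(shouldCommit(10, 60_000L));   // time limit reached
  }
}
```

Lower limits make updates visible sooner but cause more commits, and therefore more stalls for the index pipelet.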


How to use Solr with SMILA

Indexing data

The SolrIndexPipelet can add, update or delete records (equates to Solr documents) in an index.

Configuration in addpipeline:

	<extensionActivity>
      <proc:invokePipelet name="SolrIndexPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.index.SolrIndexPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <!-- either ADD or DELETE. -->
          <rec:Val key="ExecutionMode">ADD</rec:Val>
          <!-- defines the default core into which the record will be written. optional, but if missing then the target core 
            must be set in the record via SolrConstants.DYNAMIC_TARGET_CORE -->
          <rec:Val key="CoreName">DefaultCore</rec:Val>
          <!-- seq of fields that are to be filled. each tuple is a map that defines the target core field, the source field 
            (optional) and the source type (optional ) -->
          <rec:Seq key="CoreFields">
            <rec:Map>
              <!-- target field name in the solr core -->
              <rec:Val key="FieldName">Folder</rec:Val>
              <!-- name of the source attribute or attachment in the record. optional, defaults to the target field name -->
              <rec:Val key="RecSourceName">Path</rec:Val>
              <!-- either ATTRIBUTE or ATTACHMENT. optional, defaults to ATTRIBUTE. -->
              <rec:Val key="RecSourceType">ATTRIBUTE</rec:Val>
            </rec:Map>
            <rec:Map>
              <rec:Val key="FieldName">Filename</rec:Val>
            </rec:Map>
            ...
          </rec:Seq>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

Configuration in deletepipeline:

	<extensionActivity>
      <proc:invokePipelet name="SolrIndexPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.index.SolrIndexPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
          <rec:Val key="ExecutionMode">DELETE</rec:Val>
          <rec:Val key="CoreName">DefaultCore</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

Search

The SolrSearchPipelet offers the possibility to search indexed data on a Solr server. The pipelet needs only a small configuration without any special parameters.

    <extensionActivity>
      <proc:invokePipelet name="invokeSolrSearchPipelet">
        <proc:pipelet class="org.eclipse.smila.solr.search.SolrSearchPipelet" />
        <proc:variables input="request" output="request" />
        <proc:configuration>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>

For full feature support a special kind of SearchRecord is required. The easiest way to create such a record is to use the SolrQueryBuilder. This class extends the native QueryBuilder with methods to configure a Solr request and Solr-specific features such as highlighting or facets. Adapter classes exist for configuring these additional features; they also give an overview of the possible parameters. All Solr-specific parameters are stored in a map named _solr.query within the SearchRecord.

	 // create Solr specific query builder
    final SolrQueryBuilder builder = new SolrQueryBuilder();
 
    // set query
    builder.setQuery("query");
 
    // set start (equals offset, default: 0)
    builder.setStart(10);
 
    // set rows (equals max count, default: 10)
    builder.setRows(5);
 
    // set fields (equals result attributes, default: Id, score)
    final String[] fl = { "Path", "Size", "Content" };
    builder.addFields(fl);
 
    // add a filter query (example: size between 500 and 1000)
    builder.addFilterQuery("Size:[500 TO 1000]");
 
    // set shards
    final String[] shards = { "http://localhost:8983/solr", "http://remote-server:8983/solr" };
    builder.setShards(shards);
 
    // set request handler
    builder.setRequestHandler("/terms");

More information about common query parameters: http://wiki.apache.org/solr/CommonQueryParameters
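
For orientation, the builder calls above correspond to the common Solr parameters q, start, rows, fl and fq. The sketch below shows how such values would look as an HTTP query string. It is not how SMILA sends the request; the class SolrQueryStringSketch is hypothetical and only uses the Java standard library (Java 10 or later for the Charset overload of URLEncoder.encode).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: renders common Solr parameters as a query string. */
class SolrQueryStringSketch {

  private static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  /** Builds "q=...&start=...&rows=...&fl=...&fq=..." with URL-encoded values. */
  static String build(String q, int start, int rows, List<String> fl, List<String> fq) {
    List<String> parts = new ArrayList<>();
    parts.add("q=" + encode(q));
    parts.add("start=" + start);
    parts.add("rows=" + rows);
    if (!fl.isEmpty()) {
      // fl is a single comma-separated parameter
      parts.add("fl=" + encode(String.join(",", fl)));
    }
    for (String filter : fq) {
      // fq may be repeated, one parameter per filter query
      parts.add("fq=" + encode(filter));
    }
    return String.join("&", parts);
  }

  public static void main(String[] args) {
    System.out.println(build("query", 10, 5,
        List.of("Path", "Size", "Content"),
        List.of("Size:[500 TO 1000]")));
  }
}
```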

Highlighting

Highlighting is configured with the HighlightingQueryConfigAdapter. The default constructor creates a configuration object with global highlighting parameters, which is required to enable highlighting. The other constructor provides an optional per-field configuration.

	 // create global highlighting configuration (required, enables highlighting)
    final HighlightingQueryConfigAdapter highlighting = new HighlightingQueryConfigAdapter();
    highlighting.setHighlightingFields("Content");
    highlighting.setHighlightingSimplePre("<b>");
    highlighting.setHighlightingSimplePost("</b>");
    builder.addHighlightingConfiguration(highlighting);

More information about all highlighting parameters: http://wiki.apache.org/solr/HighlightingParameters
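
To make the effect of the simple pre and post markers concrete, the following sketch wraps every occurrence of a term in a text with the two strings. It is only an approximation for illustration: Solr highlights analyzed tokens and returns fragments, whereas this hypothetical HighlightSketch class does a plain case-insensitive substring match (Java 9 or later).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration of pre/post markers; not Solr's highlighter. */
class HighlightSketch {

  /** Wraps each case-insensitive occurrence of term in pre and post. */
  static String highlight(String text, String term, String pre, String post) {
    Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
    return pattern.matcher(text)
        .replaceAll(m -> Matcher.quoteReplacement(pre + m.group() + post));
  }

  public static void main(String[] args) {
    System.out.println(highlight("Solr is a search server", "search", "<b>", "</b>"));
  }
}
```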

Facets

The FacetQueryConfigAdapter has no default constructor; its constructor takes a FacetType parameter instead. At least one configuration of type FacetType.GLOBAL (the global configuration) is required to enable facets. The other types are FacetType.DATE, FacetType.FIELD and FacetType.QUERY, the last of which additionally takes an array of Strings.

    // create global facet configuration (required, enables facets)
    final FacetQueryConfigAdapter facet_global = new FacetQueryConfigAdapter(FacetType.GLOBAL);
    builder.addFacetConfiguration(SolrConstants.GLOBAL, facet_global);
 
    // create field facet configuration
    final FacetQueryConfigAdapter facet_field = new FacetQueryConfigAdapter(FacetType.FIELD);
    facet_field.setFacetLimit(10);
    builder.addFacetConfiguration("Extension", facet_field);
 
    // create facet date configuration
    final FacetQueryConfigAdapter facet_date = new FacetQueryConfigAdapter(FacetType.DATE);
    facet_date.setFacetDateStart("NOW/DAY-5DAYS");
    facet_date.setFacetDateGap("+1DAY");
    facet_date.setFacetDateEnd("NOW/DAY+1DAY");
    builder.addFacetConfiguration("LastModifiedDate", facet_date);
 
    // create facet query configuration (range example)
    final String[] fq = { "* TO 1000", "1000 TO 5000", "5000 TO *" };
    final FacetQueryConfigAdapter facet_query = new FacetQueryConfigAdapter(FacetType.QUERY, fq);
    builder.addFacetConfiguration("Size", facet_query);

More on how the different facet types and their parameters work: http://wiki.apache.org/solr/SimpleFacetParameters
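
The date facet in the example uses Solr date math: NOW/DAY rounds the current time down to the start of the day (in UTC), and the gap of +1DAY splits the range into daily buckets. The sketch below reproduces that arithmetic with java.time to show which buckets result. The class FacetDateSketch is hypothetical, and the start, end and gap values are hard-coded to match the example.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

/** Worked example of the date math used in the facet_date configuration. */
class FacetDateSketch {

  /** Returns the start instant of every one-day bucket for the given "now". */
  static List<Instant> bucketStarts(Instant now) {
    Instant today = now.truncatedTo(ChronoUnit.DAYS);  // NOW/DAY
    Instant start = today.minus(5, ChronoUnit.DAYS);   // NOW/DAY-5DAYS
    Instant end = today.plus(1, ChronoUnit.DAYS);      // NOW/DAY+1DAY
    List<Instant> starts = new ArrayList<>();
    for (Instant t = start; t.isBefore(end); t = t.plus(1, ChronoUnit.DAYS)) { // +1DAY
      starts.add(t);
    }
    return starts;
  }

  public static void main(String[] args) {
    // Fixed "now" so the result is reproducible.
    for (Instant bucket : bucketStarts(Instant.parse("2011-07-12T03:44:00Z"))) {
      System.out.println(bucket);
    }
  }
}
```

With these settings the facet covers six daily buckets: the five previous days plus the current day.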

Terms (Auto-suggest)

The TermsQueryConfigAdapter comes with a default constructor which enables terms. To use the TermsComponent, the request handler must be set to the one configured for it in solrconfig.xml (default: /terms).

	 // create terms configuration (auto suggest example)
    final TermsQueryConfigAdapter terms = new TermsQueryConfigAdapter("Title");
    terms.setTermsLower("auto");
    terms.setTermsPrefix("sug");
    builder.setTermsConfiguration(terms);

More information about terms configuration and parameters: http://wiki.apache.org/solr/TermsComponent
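
The two parameters in the example restrict which indexed terms are returned: terms.lower sets the term at which the listing starts, and terms.prefix keeps only terms beginning with the given string. The sketch below models that selection over a sorted set of strings. It is a simplification under stated assumptions: the lower bound is treated as inclusive, and counts, limits and sorting options of the real TermsComponent are ignored. The class TermsSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of terms.lower and terms.prefix; not the TermsComponent itself. */
class TermsSketch {

  /** Returns terms that are >= lower (inclusive) and start with prefix, in index order. */
  static List<String> suggest(SortedSet<String> terms, String lower, String prefix) {
    List<String> result = new ArrayList<>();
    for (String term : terms.tailSet(lower)) {
      if (term.startsWith(prefix)) {
        result.add(term);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    SortedSet<String> terms =
        new TreeSet<>(List.of("apple", "auto", "sugar", "suggest", "sun"));
    System.out.println(suggest(terms, "auto", "sug"));
  }
}
```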

Spellcheck (Did you mean)

The default constructor of SpellCheckQueryConfigAdapter is required to enable the SpellCheckComponent. In most cases it is useful to configure the default request handler to use the SpellCheckComponent (solrconfig.xml). Otherwise the correct request handler must be set (solrconfig.xml example: /spell). By default the SpellCheckComponent uses a separate index which is created on the fly (in SMILA's default configuration, on every commit).

	 // create spell check configuration
    final SpellCheckQueryConfigAdapter spellcheck = new SpellCheckQueryConfigAdapter();
    spellcheck.setSpellCheckCount(5);
    spellcheck.setSpellCheckExtendedResults(true);
    spellcheck.setSpellCheckCollate(true);
    builder.setSpellCheckConfiguration(spellcheck);

More information about SpellCheckComponent configuration and parameters: http://wiki.apache.org/solr/SpellCheckComponent

SolrSearchRecord XML

Using the SolrQueryBuilder and the adapters above creates the following Record structure.

<?xml version="1.1" encoding="utf-8"?>
<Record xmlns="http://www.eclipse.org/smila/record" version="2.0">
	<!-- query (q) -->
	<Val key="query">query</Val>
	<!-- start -->
	<Val key="offset" type="long">10</Val>
	<!-- rows -->
	<Val key="maxcount" type="long">5</Val>
	<!-- result field (fl) -->
	<Seq key="resultAttributes">
		<Val>Path</Val>
		<Val>Size</Val>
		<Val>Content</Val>
	</Seq>
	<Map key="_solr.query">
		<!-- filter query (fq) -->
		<Seq key="fq">
			<Val>Size:[500 TO 1000]</Val>
		</Seq>
		<!-- shards -->
		<Seq key="shards">
			<Val>http://localhost:8983/solr</Val>
			<Val>http://remote-server:8983/solr</Val>
		</Seq>
		<!-- request handler (qt) -->
		<Val key="qt">/terms</Val>
		<!-- highlighting configuration -->
		<Seq key="highlighting">
			<!-- a global map to enable highlighting -->
			<Map>
				<Val key="attribute">global.solr.params</Val>
				<Val key="hl" type="boolean">true</Val>
				<Val key="hl.fl">Content</Val>
				<Val key="hl.simple.pre">&lt;b&gt;</Val>
				<Val key="hl.simple.post">&lt;/b&gt;</Val>
			</Map>
			<!-- other maps with attribute = field name for per-field configuration -->
		</Seq>
		<!-- terms configuration -->
		<Map key="terms">
			<Val key="terms" type="boolean">true</Val>
			<Val key="terms.fl">Title</Val>
			<Val key="terms.lower">auto</Val>
			<Val key="terms.prefix">sug</Val>
		</Map>
		<!-- spell check configuration -->
		<Map key="spellcheck">
			<Val key="spellcheck" type="boolean">true</Val>
			<Val key="spellcheck.count" type="long">5</Val>
			<Val key="spellcheck.extendedResults" type="boolean">true</Val>
			<Val key="spellcheck.collate" type="boolean">true</Val>
		</Map>
	</Map>
	<!-- facet configuration -->
	<Seq key="groupby">
		<!-- a global map to enable facets -->
		<Map>
			<Val key="facet" type="boolean">true</Val>
			<Val key="attribute">global.solr.params</Val>
		</Map>
		<!-- per-field configuration for facet.field-->
		<Map>
			<Val key="_facet">facet.field</Val>
			<Val key="facet.limit" type="long">10</Val>
			<Val key="attribute">Extension</Val>
		</Map>
		<!-- per-field configuration for facet.date-->
		<Map>
			<Val key="_facet">facet.date</Val>
			<Val key="facet.date.start">NOW/DAY-5DAYS</Val>
			<Val key="facet.date.gap">+1DAY</Val>
			<Val key="facet.date.end">NOW/DAY+1DAY</Val>
			<Val key="attribute">LastModifiedDate</Val>
		</Map>
		<!-- per-field configuration for facet.query-->
		<Map>
			<Val key="_facet">facet.query</Val>
			<Seq key="_fc">
				<Val>* TO 1000</Val>
				<Val>1000 TO 5000</Val>
				<Val>5000 TO *</Val>
			</Seq>
			<Val key="attribute">Size</Val>
		</Map>
	</Seq>
</Record>
