SMILA/Specifications/LuceneIntegration

Description

This page is about the integration of Lucene as a sample indexing/search engine in Smila.

Discussion

Status Quo

At the moment we have two ProcessingServices for indexing and searching records in Lucene:

  • LuceneIndexService
  • LuceneSearchService

Both services support multiple indexes, selectable via annotations. For now, the brox anyfinder classes serve as the integration layer between these services and the Lucene API. The services, the Lucene index, and some search properties are configured through a mixture of two files: a service-specific file that maps records to index fields (mappings.xml) and anyfinder's own DataDictionary (DataDictionary.xml).


Technical proposal

One of the goals of Smila was to create the framework from scratch without any legacy code. Therefore we should refactor the anyfinder Lucene integration so that it contains only the classes that are needed. Below are some thoughts about issues with the current implementation and about what to reuse or refactor:

Features

The following features should be supported by the integration:

  • configuration of index fields (analyzers, indexation, tokenization)
  • simple search (query over a dedicated text field)
  • advanced search (query over multiple fields and filter support)
  • simple highlighting (return a formatted HTML text)
  • advanced highlighting (return highlight positions and weights)
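
As a rough sketch of how the two highlighting features could relate: if advanced highlighting returns positions and weights, simple highlighting can be derived from it by rendering those positions as HTML. The class names HighlightSpan and SimpleHighlighter, the use of character offsets, and the <b> markup are assumptions made for illustration only; they are not existing Smila or anyfinder classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical result of advanced highlighting: one highlighted range with a weight. */
final class HighlightSpan {
    final int start;      // inclusive character offset into the text
    final int end;        // exclusive character offset into the text
    final double weight;  // relevance of this hit; not used for rendering here

    HighlightSpan(int start, int end, double weight) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
        this.weight = weight;
    }
}

/** Hypothetical simple highlighting built on top of the spans above. */
final class SimpleHighlighter {
    private SimpleHighlighter() {
    }

    /** Wraps every span in <b>...</b> and escapes the remaining text for HTML. */
    static String toHtml(String text, List<HighlightSpan> spans) {
        List<HighlightSpan> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt(s -> s.start));
        StringBuilder html = new StringBuilder();
        int cursor = 0;
        for (HighlightSpan span : sorted) {
            // Skip spans that overlap an earlier one or run past the end of the text.
            if (span.start < cursor || span.end > text.length()) {
                continue;
            }
            html.append(escape(text.substring(cursor, span.start)));
            html.append("<b>").append(escape(text.substring(span.start, span.end))).append("</b>");
            cursor = span.end;
        }
        html.append(escape(text.substring(cursor)));
        return html.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

With the text "Lucene & SMILA" and a single span covering offsets 0 to 6, toHtml returns `<b>Lucene</b> &amp; SMILA`. Keeping the weights in the span object leaves room for a transformer to pick the best snippets before rendering.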


Lucene specific vs. generic

Anyfinder abstracts from concrete search engines, offering a generic API for search engine integration. Smila offers the same abstraction through its BPEL Pipelet/ProcessingService approach. Therefore most abstract classes or interfaces of anyfinder can be removed or merged with the concrete Lucene implementations, which will minimize the number of classes.


Configuration

The configuration files mappings.xml and DataDictionary.xml should be merged into one XML configuration. The configuration for result and highlighting attributes should be a default configuration that is used if the search process does not explicitly request other results. As it is not relevant for the LuceneIndexService, it could be moved into a separate config file. The defined mapping of record attributes/attachments to index fields should be reused by the LuceneSearchService (by having a reference to the LuceneIndexService and providing methods to access the mapping information in both directions).
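
To illustrate the idea of accessing the mapping in both directions, here is a minimal sketch of a lookup that the LuceneIndexService could expose to the LuceneSearchService. The class IndexFieldMapping and its method names are hypothetical, and the sketch assumes a one-to-one relation between record attributes and index fields, which the real configuration may not guarantee.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: two-way lookup between record attribute names and index field names. */
final class IndexFieldMapping {
    private final Map<String, String> attributeToField = new HashMap<>();
    private final Map<String, String> fieldToAttribute = new HashMap<>();

    /** Registers one pair; duplicates are rejected so both directions stay unambiguous. */
    void add(String attributeName, String indexFieldName) {
        if (attributeToField.containsKey(attributeName)
                || fieldToAttribute.containsKey(indexFieldName)) {
            throw new IllegalArgumentException(
                "duplicate mapping: " + attributeName + " <-> " + indexFieldName);
        }
        attributeToField.put(attributeName, indexFieldName);
        fieldToAttribute.put(indexFieldName, attributeName);
    }

    /** Direction used when indexing: which index field receives this attribute? */
    Optional<String> fieldFor(String attributeName) {
        return Optional.ofNullable(attributeToField.get(attributeName));
    }

    /** Direction used when building results: which attribute does this field fill? */
    Optional<String> attributeFor(String indexFieldName) {
        return Optional.ofNullable(fieldToAttribute.get(indexFieldName));
    }
}
```

The index service would fill such a mapping once while reading its configuration, and the search service would only read from it, so both services stay consistent without parsing the mapping twice.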

Here are my ideas for an index and search configuration, reusing anyfinder concepts:

<LuceneIndexConfig>
    <Index Name="test_index" MaxConnections="5">
        <IndexStructure>
            <Analyzer ClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <Attribute name="Title">
                <IndexField Name="Title" IndexValue="true" StoreText="true" Tokenize="true" Type="Text"/>
            </Attribute>
            <Attribute name="Url">
                <IndexField Name="Url" IndexValue="true" StoreText="true" Tokenize="false" Type="Text">
                    <Analyzer ClassName="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
                </IndexField>
            </Attribute>
            <Attribute name="LastModifiedDate">
                <IndexField Name="LastModifiedDate" IndexValue="true" StoreText="true" Tokenize="false" Type="Text"/>
            </Attribute>
            ...
            <Attachment path="Content">
                <IndexField Name="Content" IndexValue="true" StoreText="true" Tokenize="true" Type="Text"/>
            </Attachment>
            ...
        </IndexStructure>
    </Index>
    <Index Name="another_index" MaxConnections="5">
        ...
    </Index>
</LuceneIndexConfig>
<LuceneSearchConfig xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Index Name="test_index">
        <Result>            
            <Attribute name="MimeType"/>
            <Attribute name="LastModifiedDate" />
            <Attribute name="Url" />
            ...
            <Attachment name="Content">
                <HighlightingTransformer Name="urn:Sentence">
                    <ParameterSet>
                        <Parameter Name="MaxLength" xsi:type="Integer">
                            <Value>300</Value>
                        </Parameter>
                        <Parameter Name="MaxHLElements" xsi:type="Integer">
                            <Value>999</Value>
                        </Parameter>
                        <Parameter Name="MaxSucceedingCharacters" xsi:type="Integer">
                            <Value>30</Value>
                        </Parameter>
                        <Parameter Name="SucceedingCharacters" xsi:type="String">
                            <Value>...</Value>
                        </Parameter>
                        <Parameter Name="SortAlgorithm" xsi:type="String">
                            <Value>Occurrence</Value>
                        </Parameter>
                        <Parameter Name="TextHandling" xsi:type="String">
                            <Value>ReturnSnipplet</Value>
                        </Parameter>
                    </ParameterSet>
                </HighlightingTransformer>
            </Attachment>
            ...
        </Result>
    </Index>
</LuceneSearchConfig>


In addition, a Lucene index needs two special IndexFields that are not configurable but fixed:

  • ID: this is a hashed version of the record Id. It is used to identify the record in the index
  • XMLID: this contains the XML representation of the record Id. It is only stored in the index and is part of every result, as it is used to create Id objects from it
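
The relation between the two fixed fields can be sketched as follows. The page does not say which hash function is used for the ID field, so SHA-256 with hex encoding is purely an assumption here, and the class RecordIdHasher is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives the value of the ID field from the XML form of a record Id. */
final class RecordIdHasher {
    private RecordIdHasher() {
    }

    /** Returns a fixed-length hex string; SHA-256 is an assumed choice, not the documented one. */
    static String hash(String xmlId) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(xmlId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The hash gives a short, fixed-length key for lookups and deletes, while the stored XMLID keeps the full information needed to rebuild the Id object, since a hash cannot be reversed.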


Bundles, Packages, Extension Points

All classes needed for the Lucene integration should be in the bundle org.eclipse.smila.lucene or in bundles extending this package structure. org.eclipse.smila.search should be reserved for the Smila Search API and more generic stuff to come (perhaps the highlighting transformer could fit in there).

There are some packages and lots of classes whose purpose I don't know:

  • org.eclipse.smila.transformation (except the Highlighting* classes)
  • org.eclipse.smila.transformation.transformer
  • org.eclipse.smila.search.datadictionary - should most of these classes be generated by JAXB?
  • org.eclipse.smila.search.feature
  • org.eclipse.smila.search.irm
  • org.eclipse.smila.search.tools - why are common classes such as exceptions in here?
  • org.eclipse.smila.search.tools.indexstructur (seems to be obsolete if merged with org.eclipse.smila.lucene)
  • what are all those D-classes for? Why are there duplicate class names in different packages? They seem to be wrapper classes where Lucene classes could be used directly instead.
  • what are all those template classes about? I guess we don't need them anymore.

The anyfinder bundles also make use of extension points. What are they used for? I don't think they are needed for a concrete Lucene integration.