


Orion/Search Plan

Revision as of 13:41, 27 August 2014 by John arthorne.ca.ibm.com (Talk | contribs) (Crawler)

This page is a design document for the search component in the Orion project.

Current State

Orion currently has two search implementations, each with distinct advantages and drawbacks. Each one backs a different set of end-user search capabilities exposed in the user interface.

Crawler

The first implementation is a client-side search implementation written in JavaScript. It does not use an index, but traverses the entire document space on each search.

Advantages

  • Highly accurate: matches arbitrary character sequences in the search expression exactly as they appear in the document.
  • Supports both case-sensitive and case-insensitive search.
  • Supports regular expression search.
  • Agnostic of server implementation, so it works equally well on Java and JavaScript server implementations.
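
The crawler's behavior can be illustrated with a minimal sketch. It is written in Python for brevity rather than the JavaScript that Orion uses, and the function name, parameters, and sample documents are hypothetical. It only shows the general technique: every document is scanned on every search, which is what makes exact, case-sensitive, and regular expression matching straightforward.

```python
import re

def crawl_search(documents, expression, case_sensitive=True, use_regex=False):
    """Scan every document on every call; no index is consulted."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # A literal expression is escaped so that every character,
    # including punctuation, must match exactly.
    pattern = re.compile(expression if use_regex else re.escape(expression), flags)
    hits = []
    for path, text in documents.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path, line_number))
    return hits

# Hypothetical in-memory workspace: file path -> file content.
docs = {
    "a.js": "function getOranges() {}\nvar x = obj.setOranges(1);",
    "b.js": "// ORANGES are not apples",
}

print(crawl_search(docs, "etOranges("))
print(crawl_search(docs, "oranges", case_sensitive=False))
print(crawl_search(docs, r"[gs]etOranges\(\)", use_regex=True))
```

The cost is visible in the loop: the work grows with the total size of the workspace, regardless of how many matches there are.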

Disadvantages

  • Extremely slow, both because the content to be searched has to be transferred across the network to the client, and because it exhaustively examines every document on every search. For example, a single search on a modest developer workspace can take minutes to complete.
  • No pagination. Does not scale well when there are thousands of matches.

Indexed Search

The second search implementation runs on the Java server using Apache Lucene/Solr. It is index-based, making it very fast even over large document spaces. It supports both case-sensitive and case-insensitive search on filenames, but only case-insensitive search of file content. Lucene can support both, but doing so doubles the size of the search index. Currently the indexer runs periodically and does not track which files have changed. This means there can be a significant lag between a change to a file and the corresponding update to the index (several minutes, or even an hour on a very large workspace such as OrionHub with 25,000 users).
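
The index size trade-off for case sensitivity can be sketched with a toy inverted index. This is a simplified model in Python, not Lucene's actual data structure or API. The function names, the "folded:" and "exact:" key prefixes, and the sample files are all hypothetical.

```python
from collections import defaultdict

def build_index(documents, keep_case=False):
    """Map each term to the set of files containing it.

    Terms are always stored case-folded. With keep_case=True the exact
    spelling is stored as well, under a separate key prefix.
    """
    index = defaultdict(set)
    for path, text in documents.items():
        for word in text.split():
            index["folded:" + word.lower()].add(path)
            if keep_case:
                index["exact:" + word].add(path)
    return index

def posting_count(index):
    """Total number of (term, file) entries, a rough proxy for index size."""
    return sum(len(paths) for paths in index.values())

docs = {"a.txt": "Orange juice", "b.txt": "ORANGE peel"}
folded_only = build_index(docs)
both = build_index(docs, keep_case=True)
print(posting_count(folded_only), posting_count(both))
```

With this toy data the index that keeps both forms holds twice as many entries as the case-folded one. The exact ratio in a real index depends on the content.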

A more subtle problem is that the indexed search by default will only match searches on whole words or word prefixes. For example, if the document contains the word "oranges", then searching for "orange" will match, but searching for "ranges" will not. This is intuitive and expected behavior for natural language searches, but it does not meet the expectations of a developer performing a code search. For example, a developer might be looking for the functions "getOranges" and "setOranges" with a search for the term "etOranges". Lucene also supports suffix queries (a leading wildcard), but they are much more expensive because they must traverse the entire index. The Orion indexed search will find both "getOranges" and "setOranges" if the user searches for "*etOranges", but adding the wildcard by default would make all searches slow.
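
The cost difference between prefix and suffix matching can be illustrated with a sorted list of terms. This is a simplified model in Python, not Lucene's actual term dictionary. The function names and the term list are hypothetical, and the terms are lower-cased on the assumption of a case-insensitive index.

```python
from bisect import bisect_left

def prefix_terms(sorted_terms, prefix):
    """Binary-search to the first candidate, then read the contiguous run of matches."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches

def suffix_terms(sorted_terms, suffix):
    """A leading wildcard gives no starting point, so every term is examined."""
    return [term for term in sorted_terms if term.endswith(suffix)]

terms = sorted(["apple", "getoranges", "orange", "oranges", "ranger", "setoranges"])

print(prefix_terms(terms, "orange"))     # whole word or prefix: finds matches
print(prefix_terms(terms, "ranges"))     # inner fragment: finds nothing
print(suffix_terms(terms, "etoranges"))  # like "*etOranges": full scan
```

The prefix lookup touches only the matching run of terms, while the suffix lookup visits every term, which is why applying the wildcard to every search would be slow.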

The indexed search also cannot currently combine phrase and partial word search. For example, when a document contains "hello world", searching for "hello wor" will not find it. Making this kind of search work involves an indexing performance trade-off.
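
The combination described above amounts to a phrase match in which the final word is treated as a prefix. The following Python sketch shows that matching rule over a plain token list. It does not use an index and is not how Lucene evaluates phrases. The function name and the last_is_prefix parameter are hypothetical.

```python
def phrase_match(doc_tokens, query_tokens, last_is_prefix=False):
    """Return True if query_tokens occur consecutively in doc_tokens.

    With last_is_prefix=True the final query token only needs to be a
    prefix of the corresponding document token.
    """
    n = len(query_tokens)
    if n == 0:
        return False
    for start in range(len(doc_tokens) - n + 1):
        window = doc_tokens[start:start + n]
        if window[:-1] != query_tokens[:-1]:
            continue
        last_doc, last_query = window[-1], query_tokens[-1]
        if last_doc == last_query or (last_is_prefix and last_doc.startswith(last_query)):
            return True
    return False

doc = "hello world".split()
print(phrase_match(doc, "hello wor".split()))                       # whole words only
print(phrase_match(doc, "hello wor".split(), last_is_prefix=True))  # phrase plus partial word
```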

The indexed search uses whitespace and punctuation as word delimiters, and therefore does not support search phrases containing a combination of text and punctuation. This is again expected when searching natural language, but not for a developer searching over code, where punctuation and whitespace are important.
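
A short sketch shows why punctuation is lost. The tokenizer below is a simplified stand-in written in Python, not the analyzer that Orion configures in Lucene/Solr. The function name and the sample source line are hypothetical.

```python
import re

def tokenize(text):
    """Split on whitespace and punctuation, lower-casing each word."""
    return [word.lower() for word in re.split(r"[^A-Za-z0-9]+", text) if word]

source = "if (a.length >= 0) { foo.bar(a); }"

print(tokenize(source))
print(tokenize("foo.bar("))
print(tokenize("foo bar"))
print(tokenize(">="))
```

After tokenizing, the query "foo.bar(" is indistinguishable from "foo bar", and a query such as ">=" produces no terms at all, so neither can be matched precisely against code.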
