Difference between revisions of "Orion/Search Plan"

Revision as of 11:53, 27 August 2014

This page is a design document for the search component in the Orion project.

Current State

Orion currently has two search implementations, each with distinct advantages and drawbacks. They are used for different sets of end user search capabilities exposed in the user interface.

Crawler

The first implementation is a client-side search written in JavaScript. It does not use an index, but traverses the entire document space on each search. It is highly accurate and supports both case-sensitive and case-insensitive search, as well as regular expression search. Because it runs on the client, it is agnostic of the server implementation and works equally well with the Java and JavaScript servers. On the downside it is extremely slow, both because the content to be searched has to be transferred across the network to the client and because it exhaustively examines every document on every search.
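The crawler approach can be illustrated with a minimal sketch. This is not Orion's JavaScript code; the function name `crawl_search`, its parameters, and the in-memory `workspace` dictionary are all hypothetical stand-ins, and a real client would first have to fetch each file over the network.

```python
import re

def crawl_search(documents, pattern, case_sensitive=True, use_regex=False):
    """Scan every document on every search; no index is consulted."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Treat the pattern as a literal string unless a regex search was requested.
    source = pattern if use_regex else re.escape(pattern)
    matcher = re.compile(source, flags)
    hits = []
    for path, text in documents.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if matcher.search(line):
                hits.append((path, line_number))
    return hits

# Hypothetical workspace standing in for files fetched from the server.
workspace = {
    "fruit.js": "function getOranges() {}\nfunction setOranges() {}",
    "readme.txt": "Oranges are round.",
}

# A substring search finds both functions because every line is examined.
print(crawl_search(workspace, "etOranges"))
```

The cost of each call grows with the total size of the workspace, which is why this style of search is accurate but slow.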

Indexed Search

The second search implementation runs on the Java server using Apache Lucene/Solr. It is index based, making it very fast even over large document spaces. It supports case-sensitive search on filenames, but only case-insensitive search of file content. Lucene can support both for file content, but doing so doubles the size of the search index. Currently the indexer runs periodically and does not track which files have changed. This means there can be a significant lag between a change to a file and the index being updated (several minutes, or even an hour on a very large workspace such as OrionHub with 25,000 users).

A more subtle problem is that the indexed search by default will only match whole words or word prefixes. For example, if the document contains the word "oranges", then searching for "orange" will match, but searching for "ranges" will not. This is intuitive and expected behavior for natural language searches, but it does not meet the expectations of a developer performing a code search. For example, a developer might be looking for the functions "getOranges" and "setOranges" with a search for the term "etOranges". Lucene also supports suffix queries, but they are much more expensive because they must traverse the entire index. The Orion indexed search will find both "getOranges" and "setOranges" if the user searches for "*etOranges", but adding the wildcard by default would make all searches slow.
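The cost difference between prefix and suffix queries can be sketched with a sorted list of terms. This is a simplified model and not Lucene's actual term dictionary; the function names and the sample `terms` list are assumptions made for illustration, with terms lowercased to mirror the case-insensitive content index.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Binary-search to the first candidate, then read only adjacent terms."""
    index = bisect_left(sorted_terms, prefix)
    matches = []
    while index < len(sorted_terms) and sorted_terms[index].startswith(prefix):
        matches.append(sorted_terms[index])
        index += 1
    return matches

def suffix_matches(sorted_terms, suffix):
    """A leading wildcard gives no starting point, so every term is examined."""
    return [term for term in sorted_terms if term.endswith(suffix)]

# Hypothetical term list for a small index.
terms = sorted(["apple", "getoranges", "orange", "oranges", "setoranges"])

print(prefix_matches(terms, "orange"))     # matches "orange" and "oranges"
print(prefix_matches(terms, "ranges"))     # no term starts with "ranges"
print(suffix_matches(terms, "etoranges"))  # full scan finds both functions
```

In this model the prefix lookup touches only the matching terms after a logarithmic search, while the suffix lookup is linear in the number of distinct terms, which is the trade-off behind not adding the wildcard by default.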
