
Orion/Search Plan

This page is a design document for the search component in the Orion project.

Current State

Orion currently has two search implementations, each with distinct advantages and drawbacks. They are used for different sets of end user search capabilities exposed in the user interface.

Crawler

The first implementation is a client-side search written in JavaScript. It does not use an index, but traverses the entire document space on each search.

Advantages

  • Highly accurate: matches arbitrary character sequences in the search expression exactly as they appear in the document
  • Supports both case-sensitive and case-insensitive search
  • Supports regular expression search
  • Agnostic of server implementation: works equally well with the Java and JavaScript server implementations

Disadvantages

  • Extremely slow, both because the content to be searched has to be transferred across the network to the client, and because it exhaustively examines every document on every search. For example, each search on a modest developer workspace can take minutes to complete.
  • No pagination; does not scale well when there are thousands of matches
  • No accurate model of which files are binary versus text. Currently a blacklist is used to skip known binary types (webEditingPlugin.js lines 53-62) and known folder names (searchCrawler.js line 16)
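For illustration, the crawler's approach can be sketched as a recursive walk with a folder blacklist. This is a simplification over an in-memory file tree; the real crawler fetches directory listings and file contents over the network (the main source of its slowness), and the function and field names here are assumptions, not Orion's actual API:

```javascript
// Sketch of an index-free crawling search over a tree of
// {name, directory, children, contents} nodes (illustrative shape).
var EXCLUDED_FOLDERS = [".git", "bin"]; // illustrative blacklist

function crawlSearch(node, pattern, results) {
  results = results || [];
  if (node.directory) {
    if (EXCLUDED_FOLDERS.indexOf(node.name) !== -1) {
      return results; // skip blacklisted folders entirely
    }
    (node.children || []).forEach(function (child) {
      crawlSearch(child, pattern, results);
    });
  } else if (pattern.test(node.contents || "")) {
    // Regular expressions and case sensitivity come for free,
    // because we match directly against the raw file contents.
    results.push(node.name);
  }
  return results;
}
```

Because every file's contents are examined on every search, accuracy is perfect but cost grows linearly with workspace size.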

Indexed Search

The second search implementation runs on the Java server using Apache Lucene/Solr.

Advantages

  • Extremely fast even over large document spaces (nearly instantaneous on a modest developer workspace)
  • Supports both case-sensitive and case-insensitive search on filenames
  • Supports pagination and sorting of search results
  • Supports searching by file type, size, and last modified date (not currently used)

Disadvantages

  • Does not support case-sensitive search on file contents. Lucene can support this, but doing so doubles the size of the search index.
  • Currently the indexer runs periodically and has no knowledge of which files have changed. This means there can be a significant lag between changes to files and when the index is updated (several minutes, or even an hour on a very large workspace such as OrionHub with 25,000 users).
  • No suffix matching by default. Searches match only on whole words or word prefixes. For example, if the document contains the word "oranges", then searching for "orange" will match, but searching for "ranges" will not. This is intuitive and expected behavior for natural language searches, but does not meet the expectations of a developer performing a code search. For example, a developer might be looking for the functions "getOranges" and "setOranges" with a search for the term "etOranges". Lucene does support suffix queries, but they are much more expensive because they must traverse the entire index. The Orion indexed search will find both "getOranges" and "setOranges" if the user searches for "*etOranges", but adding the wildcard by default would make all searches slow.
  • Does not currently combine phrase and partial word search. For example when a document contains "hello world", searching for "hello wor" will not find it. There is an indexing performance trade-off to make this kind of search work.
  • Does not support terms containing whitespace or punctuation. The indexed search uses whitespace and punctuation as word delimiters, and therefore does not support search phrases containing a combination of text and punctuation. This is again expected when searching natural language, but not for a developer searching over code, where punctuation and whitespace are important.
  • Index scalability needs to be managed. When the search space becomes very large (500+ GB) search performance begins to decline dramatically. Solr supports sharding to manage scalability on very large indexes but this requires engineering work in Orion that we have not yet done.
  • No support for regular expressions; only simple ? and * wildcards are supported
  • No accurate model of which files are binary versus text. Currently a whitelist is used to include only known text file types.
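Several of these limitations stem from how the index tokenizes content. The behavior can be illustrated with a toy model (this is not Lucene's actual implementation): contents are split into tokens on whitespace and punctuation, and a query term matches a token only as a whole word or prefix unless a leading wildcard forces a scan of every token.

```javascript
// Toy model of indexed-search matching (illustrative, not Lucene).
function tokenize(text) {
  // Punctuation and whitespace are discarded as delimiters, which is
  // why phrases mixing text and punctuation cannot be searched for.
  return text.split(/[\s\W]+/).filter(Boolean);
}

function matchesToken(token, term) {
  if (term.charAt(0) === "*") {
    // Leading wildcard: in a real index this is expensive, because
    // every token must be examined rather than one prefix range.
    return token.indexOf(term.slice(1)) !== -1;
  }
  return token.indexOf(term) === 0; // whole word or prefix only
}

function indexedSearch(doc, term) {
  return tokenize(doc).some(function (t) { return matchesToken(t, term); });
}

// indexedSearch("function getOranges()", "getOr")      → true  (prefix)
// indexedSearch("function getOranges()", "etOranges")  → false (suffix)
// indexedSearch("function getOranges()", "*etOranges") → true  (wildcard)
```

The trade-off is visible here: prefix matches can be answered from a sorted term dictionary, while a leading wildcard degenerates into a full scan.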

Current usage

  • Filename searches are performed exclusively with the indexed search. This works very well apart from the lag between when new files are created and when they are first indexed.
  • Text body searches use the crawler if either "case sensitive" or "regular expression" mode is used.
  • Other text body searches use a combination of indexed and crawling search.
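The dispatch rules above amount to a small decision function; the function and option names here are illustrative, not Orion's actual code:

```javascript
// Sketch of the current search dispatch logic (names are assumptions).
function chooseSearchStrategy(options) {
  if (options.filenameSearch) {
    return "indexed"; // filenames always use the index
  }
  if (options.caseSensitive || options.regEx) {
    return "crawler"; // the index cannot answer these precisely
  }
  return "combined"; // indexed results supplemented by crawling
}
```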
