This plan lays out the work areas required to re architect the current Orion Java server for very large scale deployment. This is not all work we will be able to scope within a single release, but is intended to provide a roadmap of the work required to make this happen. Work items are roughly sorted by order of priority.
- 1 Assessment of current state
- 2 Goal architecture
- 3 Work areas for scalability
- 4 Links
Assessment of current state
Much of the current Orion Java server is designed to enable scalability. There are a number of problem areas detailed in the next section, but the current server has some existing strengths:
- Nearly stateless. With the exception of Java session state for authentication, and some metadata caching for performance, the current server is stateless. The server can be killed and restarted at any time with no data loss. All data is persisted to disk at the end of each HTTP request, with the exception of asynchronous long running operations that occur in a background thread and span multiple HTTP requests.
- Fast startup. The current server starts within 2-3 seconds, most of which is Java and OSGi startup. No data is loaded until client requests start to arrive. The exception is the background search indexer, which starts in a background thread immediately and starts crawling. If the search index has been deleted (common practice on server upgraded) it can take several minutes for it to complete indexing. This in no way blocks client requests - search results will just be potentially incomplete until the index is fully populated.
- Externalized configuration. Most server configuration is centrally managed in the orion.conf file. There are a small number of properties specified in orion.ini as well. Generally server configuration can be managed without code changes.
- Scalable core technologies. All of the core technologies we are using on the server are known to be capable of significant scaling: Equinox, Jetty, JGit, and Apache Solr. In some cases we will need to change how these services are configured but nothing we need to throw away and start from scratch. Solr doesn't handle search crawling but we could use complementary packages like Apache Nutch for crawling if needed.
- Pluggable credential store. The Orion server currently supports a completely pluggable credential store for Orion account information and passwords. We have a default implementation that uses Equinox secure storage, but we have also done implementations that support OpenID, Persona, and LDAP. While the default backing store is not suitable for enterprise scale use, we are confident that a scalable off the shelf credential solution can be plugged in with no code change.
Our goal is to support an architecture that enables very large scale Orion deployments. What would be required if we had a million accounts or 100,000 simultaneous users? A vanilla Orion download would not need to support this out of the box without further configuration, but we want it to be possible to support such scenarios.
Our goal is to support the classic three tier web server architecture:
- Front end server used for load balancing, SSL encyrption, URL mapping and possibly caching of static content. Examples include the Apache front-end used at orionhub.org, or light front-ends such as nginx or Squid.
- Process servers of varying types, scaled independently. Current examples include Orion file server, git server, and search server.
- Persistence layer. Our current persistence layer is the file system, but it should be possible to interchange with other persistence implementations (e.g., database) without code changes in the Orion servers.
Work areas for scalability
Orion metadata is stored in simple JSON files, on a per-user basis. As such, contention for access to these files only occurs when multiple requests/processes are being carried out for the same user concurrently. To avoid corruption or invalid behavior we must use file-level locking to ensure integrity of the metadata when concurrent requests do occur. Action: sweep through all metadata read/write and ensure file locking is performed.
Orion instances are generally designed to be disposable. Killing the Orion process is generally safe, but may result in a user operation being left incomplete. We need to ensure Orion responds promptly to a stop request and is able to do an orderly shutdown. Action: ensure all long running tasks that span requests respond to cancellation promptly.
Scalable search engine and indexer
Our interaction with the search engine is entirely over HTTP to an in-process Solr server. The indexer is currently run as a pair of background threads in the Orion server process (one for updating, one for handling deletion). Action: ensure there is only one indexer running when there are multiple Orion instances in play.
Longer term, the Solr server should be forked out of process so it can be scaled out independently. We will likely need to eventually move to a more scalable search indexer such as Apache Nutch or Elasticsearch.
Move long running Git operations out of process
Long running Git operations are currently run in background threads within the Orion server process. This should be moved to separate Git processes so that it can be scaled out independently. This would also enable Git processes to continue running if the Orion server node falls over or gets switched out. This can be implemented with a basic job queue. The Orion server writes a description of the long running Git operation to execute to a persistent queue. A separate Git server polls that queue and performs the git operations.
In addition we should investigate optimizing the configuration of JGit and any JVM running JGit. JGit makes a number of speed/space trade-offs that can be adjusted by configuring it different. For details see bug 344143.
Move Java session state out of process
Authentication between client and Orion server is based on Java session state. We should investigate externalizing this state to an external storage such as memcached. This would allow sessions to persist across Orion servers being recycled, or subsequent requests from the same client being handled by different Orion server instances from the one that initiated authentication.
Orion primarily uses SLF4J for logging, written to stdout/stderr of the Orion server process. The log configuration is stored in a bundle of the Orion server. It needs to be possible to configure the log external to the Orion build itself. I.e., the Orion server configuration controls the logging rather than settings embedded with the code. We should continue writing log data to stdout so that external applications can be used to persist logs for remote monitoring and debugging.
Search for implicit assumptions about system tools (exec calls)
The Orion server is pure Java code so it should be able to run anywhere that can run a modern JVM (generally Java 7 or higher). We need to validate that we don't make assumptions about external system tools (i.e., System.exec calls). We should bundle the Java 7 back end of the Eclipse file system API to enable running any Orion build on arbitrary hardware.
Main bug report: bug 404935
Reading on scalability:
Assorted cloud platform docs: