SLES 10 Upgrade Plan

Revision as of 07:34, 13 November 2006 by Denis.roy.eclipse.org (Talk | contribs)

What is this?

This is a work plan to upgrade the eclipse.org server infrastructure to SLES 10 (SUSE Linux Enterprise Server 10).


Motivation

There is a need to upgrade to SLES 10 for the following reasons:

  • to stay up-to-date with the latest server OS
  • to upgrade to new OS packages that will provide new, desired functionality, such as MySQL 5.x and PHP 5.x
  • to upgrade to new OS packages that will provide better system stability, such as MySQL 5.x


Why document this plan here?

The upgrade process will require short service interruptions, and some services are expected to become "read-only" while we synchronize and switch servers.


Server classes in need of an upgrade

  • Eclipse Cluster, backend
    • This is a pair of redundant servers offering backend data and authentication services: NFS, MySQL and LDAP
    • These servers cannot be interchanged transparently
  • Eclipse Cluster, frontend
    • This is a set of four servers that are load-balanced.
    • Any number of these may be taken offline transparently, as server load permits
  • Eclipse Foundation servers
    • These servers include a pair of redundant servers offering Foundation services, such as e-mail, database and internal web. These servers cannot be interchanged transparently
    • Also includes one local file and print server
    • Also includes one offsite eclipse.org Disaster Recovery server
  • build.eclipse.org
    • This server is used for project builds and JAR signing


Upgrade plan

https://secure-support.novell.com/KanisaPlatform/Publishing/60/3647244_f.SAL_Public.html

"The only supported method of upgrading is by doing the upgrade in "down-server upgrade" style, meaning the server must be taken down to do the upgrade."

Because the Cluster servers perform mission-critical tasks, their upgrade must be thoroughly tested and isolated.


Eclipse Cluster, backend

Currently, one backend server acts as the primary MySQL server, and the other acts as the primary NFS and LDAP server. The "Secondary" backend cannot be taken offline at this time because it is the MySQL master. The master DB function must first be moved to the Primary backend server so that the Secondary can be taken offline.

  1. Advise the community that all DB-related services will be in read-only mode for approximately 1 hour
  2. Set databases and applications in read-only mode
  3. Perform a binary replication of all the MySQL data from the Master (Secondary) to the Slave (Primary)
  4. Switch the dbmaster/dbslave hostname lookups on all nodes
  5. Re-enable all services
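Step 4 above can be sketched as follows. This is only an illustration: it assumes the dbmaster/dbslave names are resolved through hosts-file style entries, which this plan does not state, and the function name, hostnames and addresses are hypothetical.

```python
def swap_db_roles(hosts_text: str) -> str:
    """Return hosts-file text with the dbmaster and dbslave aliases exchanged.

    Assumption: each node resolves dbmaster/dbslave from hosts-file style
    lines ("<ip> <name> [<alias> ...]"). Whitespace between fields is
    normalized to single spaces.
    """
    swapped = []
    for line in hosts_text.splitlines():
        # Leave blank lines and comment lines untouched.
        if not line.strip() or line.lstrip().startswith("#"):
            swapped.append(line)
            continue
        fields = line.split()
        # fields[0] is the IP address; the remaining fields are names.
        names = [
            "dbslave" if name == "dbmaster"
            else "dbmaster" if name == "dbslave"
            else name
            for name in fields[1:]
        ]
        swapped.append(" ".join([fields[0]] + names))
    return "\n".join(swapped)


# Hypothetical entries: backend1 is the Primary, backend2 the Secondary.
before = "10.0.0.1 backend1 dbslave\n10.0.0.2 backend2 dbmaster"
after = swap_db_roles(before)
```

Because the swap is symmetric, applying the same function a second time restores the original roles, which is convenient if the switch has to be rolled back.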

When this is complete, the Secondary server may be taken offline indefinitely to be upgraded, then re-entered into production as the Secondary server for testing. When testing is complete, the Primary-Secondary server functions need to be interchanged:

  1. Advise the community that all services will be in read-only mode momentarily. Both DB servers should be in perfect sync at this point.
  2. Set databases and applications in read-only mode
  3. Ensure Primary and Secondary are perfectly in sync
  4. Switch Primary/Secondary designations on all nodes
  5. Re-enable all services
  6. Stand by, ready to switch P/S back to their original state
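One way to check step 3 is to compare the master's binary log coordinates with the position the slave has executed. The sketch below assumes the rows returned by MySQL's `SHOW MASTER STATUS` and `SHOW SLAVE STATUS` have already been fetched into dictionaries keyed by column name; the function name and sample values are hypothetical, and the comparison is only meaningful once the master is read-only.

```python
def replication_in_sync(master_status: dict, slave_status: dict) -> bool:
    """True when the slave has executed everything the master has logged.

    master_status: one row of SHOW MASTER STATUS (File, Position).
    slave_status:  one row of SHOW SLAVE STATUS.
    Assumes writes on the master have already been stopped; otherwise
    the master position keeps moving and the result is stale at once.
    """
    return (
        # Both replication threads must be running on the slave.
        slave_status["Slave_IO_Running"] == "Yes"
        and slave_status["Slave_SQL_Running"] == "Yes"
        # The slave must be executing from the master's current log file...
        and slave_status["Relay_Master_Log_File"] == master_status["File"]
        # ...and must have executed up to the master's current position.
        and int(slave_status["Exec_Master_Log_Pos"]) == int(master_status["Position"])
    )


# Hypothetical sample values for illustration only.
master = {"File": "mysql-bin.000042", "Position": 1307}
slave = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Relay_Master_Log_File": "mysql-bin.000042",
    "Exec_Master_Log_Pos": 1307,
}
ready_to_switch = replication_in_sync(master, slave)
```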

At this point, the former Secondary server, now running SLES 10, is in its permanent function as Primary. The new Secondary (the former Primary, still on SLES 9) can be taken offline and upgraded, then re-entered into service as the Secondary. However, before doing this, we will wait several weeks until we feel that the new SLES 10 Primary is capable of standing on its own.


Eclipse Cluster, frontend

Because each front-end server is identical, any one can be taken out-of-service. The upgrade strategy is to take one, and only one, server offline and format-reinstall it as a new SLES 10 node. It can then be tested while out-of-service, to ensure all core services function normally. Once it is deemed to work, we can enter it into service, where it will handle 1/4 of the eclipse.org load. During this time, we do not upgrade any other servers, and we remain vigilant for reports that "sometimes this error happens", meaning that something doesn't work particularly well on that specific node.

  1. Set node1 out-of-service. Once all traffic to this node has ceased, take it offline
  2. Reformat the local disks, install SLES 10 from scratch
  3. Enable as a node server (scripts do this)
  4. Put online (yet out-of-service) and test all core service functionality.
  5. Put in service

When we are confident that the SLES 10 node1 has been performing reliably, repeat the above steps for nodes 2, 3 and 4.
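The one-node-at-a-time strategy can be sketched as a small loop. This is a simplified illustration, not the actual node scripts: `upgrade` and `passes_tests` are hypothetical callbacks standing in for the reinstall and the core-service tests, and the observation period between node1 and the remaining nodes is not modelled.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],
    passes_tests: Callable[[str], bool],
) -> List[str]:
    """Upgrade nodes one at a time; stop at the first node that fails its tests.

    Returns the nodes that were upgraded and returned to service.
    """
    in_service = list(nodes)
    upgraded = []
    for node in list(in_service):
        # Take exactly one node out of service before touching it.
        in_service.remove(node)
        upgrade(node)
        if not passes_tests(node):
            # Leave the failing node out of service and do not
            # continue with the remaining nodes.
            break
        # Tests passed: put the node back in service.
        in_service.append(node)
        upgraded.append(node)
    return upgraded


# Example with stand-in callbacks: node3 is made to fail its tests.
attempted = []
done = rolling_upgrade(
    ["node1", "node2", "node3", "node4"],
    attempted.append,
    lambda node: node != "node3",
)
```

In this example the loop stops after node3 fails, so node4 is never taken out of service and three of the four nodes keep serving traffic.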


Eclipse Foundation servers

The pair of redundant servers offering Foundation services, such as e-mail, database and internal web, will be upgraded in a similar fashion to the redundant backend servers: Secondary first, test, promote to Primary, then Primary (as the new Secondary).


build.eclipse.org

The build.eclipse.org server is not redundant in any way, so it is expected to be unavailable for the entire duration of the upgrade.


For those using PHP

Once the upgrade is complete, the current PHP 4.3.4 will be replaced by PHP 5.0.x. So that you can test your PHP scripts, we will upgrade a single front-end server (node1) first and make it available via a special URL, allowing you to test your site on PHP 5. We'll advise you on the actual procedure later on.


Downtime and service unavailability summary

Although our redundant infrastructure should let us avoid any complete downtime, some services will need to be disabled and/or restricted while we switch from one backend server to the other. To minimize the impact on the community, this will happen on Sunday mornings, between 6:00am and 8:00am Eastern time - our least busy time of the week.

  1. Master Database to Primary switch: All database services must be read-only. Bugzilla will be disabled, Wiki will be read-only, eclipseplugincentral will be disabled, eclipse.org website will be read-only, IPZilla will be disabled. Expected duration: one hour.
  2. Primary <-> Secondary permutation: All "write" services, including RSYNC, CVS, SSH and Database, will need to be disabled and/or placed in read-only mode while the nodes switch from the Primary to the Secondary. Expected duration: one hour.
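A script that triggers either switch could first confirm that it is running inside the Sunday 6:00am to 8:00am window. The sketch below is a minimal illustration: it assumes the caller passes a naive datetime already expressed in Eastern time (no time zone conversion is done), treats the 8:00am end as exclusive, and uses a hypothetical function name.

```python
from datetime import datetime, time

WINDOW_START = time(6, 0)  # 6:00am Eastern
WINDOW_END = time(8, 0)    # 8:00am Eastern, treated as exclusive


def in_maintenance_window(local_eastern: datetime) -> bool:
    """True if the given Eastern-time datetime falls in the Sunday window.

    Assumption: local_eastern is a naive datetime already in Eastern time.
    """
    # datetime.weekday(): Monday is 0, Sunday is 6.
    is_sunday = local_eastern.weekday() == 6
    return is_sunday and WINDOW_START <= local_eastern.time() < WINDOW_END


# Sunday November 12, 2006 at 6:30am Eastern.
ok_to_switch = in_maintenance_window(datetime(2006, 11, 12, 6, 30))
```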


Key dates

October 9-13: Upgrade Foundation servers

October 16: Upgrade node1 + test

Sunday November 12 6:00am *: Move DB Master to Primary, disable Secondary

Sunday November 12: Upgrade Secondary + node2 + node3

Sunday November 26: Upgrade build.eclipse.org + node4

Sunday December 3 6:00am *: Interchange Primary <-> Secondary, keep the new Secondary online "just in case"

Sunday December 10: Upgrade Secondary


The * denotes service interruptions