SLES 10 Upgrade Plan
What is this?
This is a work plan to upgrade the eclipse.org server infrastructure to SLES 10 (SuSE Linux Enterprise Server).
Motivation
There is a need to upgrade to SLES 10 for the following reasons:
- to stay up-to-date with the latest server OS
- to upgrade to new OS packages that will provide new, desired functionality, such as MySQL 5.x and PHP 5.x
- to upgrade to new OS packages that will provide better system stability, such as MySQL 5.x
Why document this plan here?
The upgrade process will require short service interruptions. Some services are also expected to become "read-only" while we synchronize and switch servers.
Server classes in need of an upgrade
- Eclipse Cluster, backend
- This is a pair of redundant servers offering backend data and authentication services: NFS, MySQL and LDAP
- These servers cannot be interchanged transparently
- Eclipse Cluster, frontend
- This is a set of four servers that are load-balanced.
- Any number of these may be taken offline transparently, as server load permits
- Eclipse Foundation servers
- These servers include a pair of redundant servers offering Foundation services, such as e-mail, database and internal web. These servers cannot be interchanged transparently
- Also includes one local file and print server
- Also includes one offsite eclipse.org Disaster Recovery server
- build.eclipse.org
- This server is used for project builds and JAR signing
Upgrade plan
According to Novell (https://secure-support.novell.com/KanisaPlatform/Publishing/60/3647244_f.SAL_Public.html):
"The only supported method of upgrading is by doing the upgrade in 'down-server upgrade' style, meaning the server must be taken down to do the upgrade."
Because the Cluster servers perform mission-critical tasks, their upgrade must be thoroughly tested and isolated.
Eclipse Cluster, backend
Currently, one backend server acts as the primary MySQL server, and the other acts as the primary NFS and LDAP server. At this time, the Secondary backend cannot be taken offline because it is the MySQL master. The MySQL master function must first be moved to the Primary backend server so that the Secondary can be taken offline:
- Advise the community that all DB-related services will be in read-only mode for approximately 1 hour
- Set databases and applications in read-only mode
- Perform a binary replication of all the MySQL data from the Master (Secondary) to the Slave (Primary)
- Switch the dbmaster/dbslave hostname lookups on all nodes
- Re-enable all services
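The plan does not say how the dbmaster/dbslave lookups are implemented. As a minimal sketch, assuming they are aliases in a hosts-style file on each node, the switch amounts to exchanging the two aliases. The function name, the addresses and the `backend1`/`backend2` host names below are hypothetical and used only for illustration.

```python
def swap_host_aliases(hosts_text, alias_a="dbmaster", alias_b="dbslave"):
    """Return hosts-style text with alias_a and alias_b exchanged.

    Assumption: each entry is 'address name [name ...]', optionally
    followed by a '#' comment. Blank and comment-only lines pass through.
    """
    swap = {alias_a: alias_b, alias_b: alias_a}
    out = []
    for line in hosts_text.splitlines():
        body, sep, comment = line.partition("#")
        fields = body.split()
        if len(fields) < 2:
            # Nothing to swap on blank or comment-only lines.
            out.append(line)
            continue
        address, names = fields[0], fields[1:]
        new_line = " ".join([address] + [swap.get(n, n) for n in names])
        if sep:
            new_line += " #" + comment
        out.append(new_line)
    trailing = "\n" if hosts_text.endswith("\n") else ""
    return "\n".join(out) + trailing


if __name__ == "__main__":
    # Hypothetical entries; 192.0.2.0/24 is a documentation-only range.
    before = (
        "192.0.2.1 backend1 dbslave\n"
        "192.0.2.2 backend2 dbmaster # current master\n"
    )
    print(swap_host_aliases(before), end="")
```

Because the swap is symmetric, applying it a second time restores the original role assignment, which is convenient if the switch has to be backed out.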
When this is complete, the Secondary server may be taken offline indefinitely to be upgraded, then re-entered into production as the Secondary server for testing. When testing is complete, the Primary-Secondary server functions need to be interchanged:
- Advise the community that all services will be in read-only mode momentarily. Both DB servers should be in perfect sync at this point.
- Set databases and applications in read-only mode
- Ensure Primary and Secondary are perfectly in sync
- Switch Primary/Secondary designations on all nodes
- Re-enable all services
- Stand by, ready to switch Primary/Secondary back to their original state
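The "perfectly in sync" check in the steps above can be made concrete. The following is a minimal sketch, assuming the output of MySQL's `SHOW MASTER STATUS` and `SHOW SLAVE STATUS` has already been fetched into dictionaries keyed by column name. How the rows are fetched is left out, and the function name is hypothetical. The comparison is only meaningful once the databases are read-only, since otherwise the master position keeps moving.

```python
def replica_in_sync(master_status, slave_status):
    """Return True if the slave has executed everything the master has logged.

    master_status: one row of SHOW MASTER STATUS as a dict (File, Position).
    slave_status:  one row of SHOW SLAVE STATUS as a dict.
    """
    # Both replication threads must be running on the slave.
    threads_ok = (
        slave_status["Slave_IO_Running"] == "Yes"
        and slave_status["Slave_SQL_Running"] == "Yes"
    )
    # The slave's executed position must match the master's current position.
    same_file = slave_status["Relay_Master_Log_File"] == master_status["File"]
    same_pos = int(slave_status["Exec_Master_Log_Pos"]) == int(master_status["Position"])
    return threads_ok and same_file and same_pos


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    master = {"File": "mysql-bin.000042", "Position": 1234}
    slave = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Relay_Master_Log_File": "mysql-bin.000042",
        "Exec_Master_Log_Pos": 1234,
    }
    print(replica_in_sync(master, slave))
```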
At this point, the former Secondary server (now running SLES 10) is in its permanent role as Primary. The new Secondary (the former Primary, still running SLES 9) can then be taken offline, upgraded, and returned to service as the Secondary. Before doing this, however, we will wait several weeks until we are confident that the new SLES 10 Primary can stand on its own.
Eclipse Cluster, frontend
Because the front-end servers are identical, any one of them can be taken out-of-service. The upgrade strategy is to take one, and only one, server offline and format-reinstall it as a new SLES 10 node. It can then be tested while out-of-service to ensure all core services function normally. Once it is deemed to work, we can put it into service, effectively handling 1/4 of the eclipse.org load. During this time, we do not upgrade any other servers, and we remain vigilant for reports that "sometimes this error happens", meaning that something doesn't work particularly well on that specific node.
- Set node1 out-of-service. Once all traffic from this node has ceased, take it offline
- Reformat the local disks and install SLES 10 from scratch
- Enable as a node server (scripts do this)
- Put it online (still out-of-service) and test all core service functionality
- Put it in service
When we are confident that the SLES 10 node1 has been performing reliably, repeat the above steps for nodes 2, 3 and 4.
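One way to judge whether the upgraded node is "performing reliably" is to compare its error rate with that of the other nodes, since an intermittent "sometimes this error happens" report should show up as a higher rate on one node only. The sketch below assumes request records can be reduced to (node name, HTTP status) pairs. That record format and the function name are assumptions for illustration, not a description of the actual eclipse.org logging.

```python
from collections import Counter


def error_rate_by_node(requests):
    """Return {node: fraction of requests answered with a 5xx status}.

    requests: iterable of (node_name, http_status) pairs (assumed format).
    """
    totals, errors = Counter(), Counter()
    for node, status in requests:
        totals[node] += 1
        if status >= 500:
            errors[node] += 1
    return {node: errors[node] / totals[node] for node in totals}


if __name__ == "__main__":
    # Hypothetical sample: node1 is the freshly upgraded node.
    sample = [
        ("node1", 200), ("node1", 500), ("node1", 200), ("node1", 200),
        ("node2", 200), ("node2", 200),
        ("node3", 200), ("node3", 404),
    ]
    for node, rate in sorted(error_rate_by_node(sample).items()):
        print(node, rate)
```

Client errors such as 404 are deliberately not counted, because they usually reflect the request rather than the health of the node.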
Eclipse Foundation servers
The pair of redundant servers offering Foundation services, such as e-mail, database and internal web, will be upgraded in a similar fashion to the redundant backend servers: Secondary first, test, promote to Primary, then Primary (as the new Secondary).
build.eclipse.org
build.eclipse.org is not redundant in any way; it is therefore expected to be unavailable for the entire duration of its upgrade.
For those using PHP
Once the upgrade is complete, the current PHP 4.3.4 will have been replaced by PHP 5.0.x. So that you can test your PHP scripts, we will upgrade a single front-end server (node1) first and make it available via a special URL, allowing you to test your site on PHP 5. We'll advise you on the actual procedure later on.
Although our redundant infrastructure allows us to anticipate no downtime at all, some services will need to be disabled and/or restricted while we switch from one backend server to the other. To minimize the impact on the community, this will happen on Sunday mornings, between 6:00am and 8:00am Eastern time - our least busy time of the week.
- Master Database to Primary switch: All database services must be read-only. Bugzilla will be disabled, Wiki will be read-only, eclipseplugincentral will be disabled, eclipse.org website will be read-only, IPZilla will be disabled. Expected duration: one hour. This actually took 30 minutes. --Denis.roy.eclipse.org 11:03, 13 November 2006 (EST)
- Primary <-> Secondary permutation: All "write" services, including RSYNC, CVS, SSH and Database, will need to be disabled and/or placed in read-only mode while the nodes switch from Primary to the Secondary. Expected duration: one hour. This actually took 90 minutes. --Denis.roy.eclipse.org 16:22, 4 December 2006 (EST)
Key dates
October 9-13: Upgrade Foundation servers
October 16: Upgrade node1 + test
Sunday, November 12, 6:00am *: Move DB Master to Primary, disable Secondary
Sunday, November 12: Upgrade Secondary + node2 + node3
Sunday, November 26: Upgrade build.eclipse.org + node4
Sunday, December 3, 6:00am *: Interchange Primary <-> Secondary, keep secondary online "just in case"
Friday, December 15: Upgrade secondary
Sunday, December 17: resync slave database, use secondary backend as database master for load sharing
The * denotes service interruptions
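Since service interruptions are meant to fall on Sundays, the weekday labels in the schedule are worth a quick sanity check. The sketch below assumes the year is 2006, taken from the signature timestamps on this page. Only entries that name a weekday are included, and the function name is hypothetical.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# (weekday label from the schedule, calendar date); year 2006 is an assumption.
KEY_DATES = [
    ("Sunday", date(2006, 11, 12)),
    ("Sunday", date(2006, 11, 26)),
    ("Sunday", date(2006, 12, 3)),
    ("Friday", date(2006, 12, 15)),
    ("Sunday", date(2006, 12, 17)),
]


def weekday_mismatches(entries):
    """Return the entries whose label disagrees with the calendar."""
    return [(label, d) for label, d in entries
            if WEEKDAYS[d.weekday()] != label]


if __name__ == "__main__":
    # An empty list means every label matches its date.
    print(weekday_mismatches(KEY_DATES))
```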