SLES 10 Upgrade Plan

Note: This page is deprecated. It described our migration plan to SLES 10, which was completed successfully on Tuesday, December 19, 2006.

What is this?

This is a work plan to upgrade the eclipse.org server infrastructure to SLES 10 (SuSE Linux Enterprise Server).

Motivation

There is a need to upgrade to SLES 10 for the following reasons:

to stay up-to-date with the latest server OS
to upgrade to new OS packages that will provide new, desired functionality, such as MySQL 5.x and PHP 5.x
to upgrade to new OS packages that will provide better system stability, such as MySQL 5.x

Why document this plan here?

The upgrade process will require short service interruptions. Some services are also expected to become "read-only" while we synchronize and switch servers.

Server classes in need of an upgrade

Eclipse Cluster, backend
- This is a pair of redundant servers offering backend data and authentication services: NFS, MySQL and LDAP
- These servers cannot be interchanged transparently
Eclipse Cluster, frontend
- This is a set of four servers that are load-balanced.
- Any number of these may be taken offline transparently, as server load permits
Eclipse Foundation servers
- These servers include a pair of redundant servers offering Foundation services, such as e-mail, database and internal web. These servers cannot be interchanged transparently
- Also includes one local file and print server
- Also includes one offsite eclipse.org Disaster Recovery server
build.eclipse.org
- This server is used for project builds and JAR signing

Upgrade plan

https://secure-support.novell.com/KanisaPlatform/Publishing/60/3647244_f.SAL_Public.html

"The only supported method of upgrading is by doing the upgrade in "down-server upgrade" style, meaning the server must be taken down to do the upgrade."

Because the Cluster servers perform mission-critical tasks, their upgrade must be thoroughly tested and isolated.

Eclipse Cluster, backend

Currently, one backend server acts as the primary MySQL server, and the other acts as the primary NFS and LDAP server. At this time, the Secondary backend cannot be taken offline because it is the MySQL master. The master DB function must first be moved to the Primary backend server so that the Secondary can be taken offline.

Advise the community that all DB-related services will be in read-only mode for approximately 1 hour
Set databases and applications in read-only mode
Perform a binary replication of all the MySQL data from the Master (Secondary) to the Slave (Primary)
Switch the dbmaster/dbslave hostname lookups on all nodes
Re-enable all services
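
The hostname switch in the steps above can be sketched as follows. This is a minimal illustration, assuming the dbmaster/dbslave names are resolved through a hosts-style file on each node; the plan does not say which lookup mechanism is actually used, and the function name swap_host_roles and the addresses in the usage note are hypothetical.

```python
def swap_host_roles(hosts_text, name_a="dbmaster", name_b="dbslave"):
    """Return hosts-style text with the two role names exchanged.

    Assumption: each entry looks like "<address> <name> [<name> ...]",
    optionally followed by a "#" comment.
    """
    swapped = []
    for line in hosts_text.splitlines():
        # Split off any trailing comment so it is preserved unchanged.
        body, sep, comment = line.partition("#")
        fields = body.split()
        if len(fields) >= 2:
            address, names = fields[0], fields[1:]
            # Exchange the two role names; leave every other name alone.
            names = [name_b if n == name_a else name_a if n == name_b else n
                     for n in names]
            line = address + "\t" + " ".join(names)
            if sep:
                line += " " + sep + comment
        swapped.append(line)
    return "\n".join(swapped) + "\n"
```

For example, given the two entries "10.0.0.1 dbmaster" and "10.0.0.2 dbslave", the function returns text in which 10.0.0.1 carries the dbslave name and 10.0.0.2 carries dbmaster. Writing the result to every node at the same time is what makes the switch effective, so the file would be generated once and then distributed.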

When this is complete, the Secondary server may be taken offline indefinitely to be upgraded, then re-entered into production as the Secondary server for testing. When testing is complete, the Primary-Secondary server functions need to be interchanged:

Advise the community that all services will be in read-only mode momentarily. Both DB servers should be in perfect sync at this point.
Set databases and applications in read-only mode
Ensure Primary and Secondary are perfectly in sync
Switch Primary/Secondary designations on all nodes
Re-enable all services
Stand by, ready to switch P/S back to their original state
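
One way to confirm that the two database servers are in sync before the switch, assuming they are linked by standard MySQL replication, is to compare the master's binary log coordinates with what the slave has executed. The sketch below takes rows shaped like the output of SHOW MASTER STATUS and SHOW SLAVE STATUS as plain dictionaries. The function name and the sample values are hypothetical, and the comparison is only meaningful once writes have stopped (the read-only step above).

```python
def replication_in_sync(master_status, slave_status):
    """Return True when the slave has executed everything the master logged.

    master_status: one row of SHOW MASTER STATUS (keys File, Position).
    slave_status:  one row of SHOW SLAVE STATUS.
    """
    # Both replication threads must be running on the slave.
    threads_ok = (slave_status.get("Slave_IO_Running") == "Yes"
                  and slave_status.get("Slave_SQL_Running") == "Yes")
    # The slave must be executing from the master's current binary log file.
    same_file = (slave_status.get("Relay_Master_Log_File")
                 == master_status.get("File"))
    # The executed position must match the master's position. The mismatched
    # defaults make a missing key count as "not in sync".
    same_pos = (int(slave_status.get("Exec_Master_Log_Pos", -1))
                == int(master_status.get("Position", -2)))
    return threads_ok and same_file and same_pos


# Hypothetical sample rows, for illustration only.
master = {"File": "mysql-bin.000042", "Position": 1234}
slave = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Relay_Master_Log_File": "mysql-bin.000042",
    "Exec_Master_Log_Pos": "1234",
}
print(replication_in_sync(master, slave))
```

If the check fails, the safer choice is to keep services read-only and wait, rather than to switch designations with the servers out of step.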

At this point, the former Secondary server (previously SLES 9) is in its permanent function as Primary (SLES 10). The new Secondary (the former Primary) can be taken offline and upgraded, then re-entered into service as the Secondary. However, before doing this, we will wait several weeks until we are confident that the new SLES 10 Primary is capable of standing on its own.

Eclipse Cluster, frontend

Because the front-end servers are identical, any one of them can be taken out of service. The upgrade strategy is to take one, and only one, server offline and format-reinstall it as a new SLES 10 node. It can then be tested while out of service to ensure all core services function normally. Once it is deemed to work, we can enter it into service, where it will handle 1/4 of the eclipse.org load. During this time, we do not upgrade any other servers, and we remain vigilant for reports that "sometimes this error happens", which would suggest that something doesn't work well on that specific node.

Set node1 out-of-service. Once all traffic from this node has ceased, take it offline
Reformat the local disks, install SLES 10 from scratch
Enable as a node server (scripts do this)
Put online (yet out-of-service) and test all core service functionality.
Put in service

When we are confident that the SLES 10 node1 has been performing reliably, repeat the above steps for nodes 2, 3 and 4.
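
The one-node-at-a-time strategy described above can be sketched as a simple loop. This is an illustration only, not the scripts actually used: the upgrade and is_reliable callbacks are hypothetical stand-ins for the manual steps (out of service, reinstall, test, back in service) and for the judgment call that a node has been performing reliably.

```python
def rolling_upgrade(nodes, upgrade, is_reliable):
    """Upgrade nodes strictly one at a time.

    upgrade(node) performs the full cycle for one node and returns once the
    node is back in service. is_reliable(node) reports whether the upgraded
    node has been behaving well. The rollout stops at the first node that is
    not reliable, so the remaining nodes stay on the old OS.

    Returns the list of nodes that were upgraded and judged reliable.
    """
    confirmed = []
    for node in nodes:
        upgrade(node)
        if not is_reliable(node):
            break  # hold the rollout and investigate this node
        confirmed.append(node)
    return confirmed


# Hypothetical usage: record which nodes were touched, and pretend node3
# turns out to be unreliable after its upgrade.
touched = []
result = rolling_upgrade(
    ["node1", "node2", "node3", "node4"],
    touched.append,
    lambda node: node != "node3",
)
```

Because the loop never starts on the next node before the current one has been confirmed, at most one front-end server is out of service at any time, and a problem on an upgraded node leaves the others untouched.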

Eclipse Foundation servers

The pair of redundant servers offering Foundation services, such as e-mail, database and internal web, will be upgraded in a similar fashion to the redundant backend servers: Secondary first, test, promote to Primary, then Primary (as the new Secondary).

build.eclipse.org

build.eclipse.org is not redundant in any way, so it is expected to be unavailable for the entire duration of the upgrade.

For those using PHP

This upgrade means the current PHP 4.3.4 will be replaced by PHP 5.0.x once all is done. So that you can test your PHP scripts, we will upgrade a single front-end server (node1) first and make it available via a special URL, allowing you to test your site on PHP 5. We'll advise you on the actual procedure later on.

Downtime and service unavailability summary

Although our redundant infrastructure allows us to anticipate no downtime at all, some services will need to be disabled and/or restricted while we switch from one backend server to the other. To minimize the impact on the community, this will happen on Sunday mornings, between 6:00am and 8:00am Eastern time - our least busy time of the week.

Master Database to Primary switch: All database services must be read-only, for approximately one hour. Bugzilla will be disabled, Wiki will be read-only, eclipseplugincentral will be disabled, eclipse.org website will be read-only, IPZilla will be disabled. Expected duration: one hour. This actually took 30 minutes. --Denis.roy.eclipse.org 11:03, 13 November 2006 (EST)

Primary <-> Secondary permutation: All "write" services, including RSYNC, CVS, SSH and Database, will need to be disabled and/or placed in read-only mode while the nodes switch from Primary to the Secondary. Expected duration: one hour. This actually took 90 minutes. --Denis.roy.eclipse.org 16:22, 4 December 2006 (EST)

Key dates

~~October 9-13: Upgrade Foundation servers~~

~~October 16: Upgrade node1 + test~~

~~Sunday November 12 6:00am *: Move DB Master to Primary, disable Secondary~~

~~Sunday November 12: Upgrade Secondary + node2 + node3~~

Sunday November 26: ~~Upgrade build.eclipse.org + node4~~

Sunday December 3 6:00am *: ~~Interchange Primary <-> Secondary, keep secondary online "just in case"~~

Friday, December 15: ~~Upgrade secondary~~

Sunday, December 17: ~~resync slave database, use secondary backend as database master for load sharing~~

The * denotes service interruptions

Notice: This Wiki is now read only and edits are no longer possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
