


Difference between revisions of "IT SLA"

(Requesting Support)
(Maintenance)
(14 intermediate revisions by 4 users not shown)
Line 1: Line 1:
The Eclipse Foundation's IT team (the Webmasters) provides computer and network services and support that enable the Eclipse community, committers, members and EMO staff to access information and networked applications in a timely manner.
+
The Eclipse Foundation's IT team (the Webmasters) provides computer and network services and support that enable the Eclipse community, committers, members and EMO staff to access information and networked applications in a timely manner. [http://status.eclipse.org Access the Status Page].
  
 
= Webmaster Support =
 
= Webmaster Support =
Line 16: Line 16:
 
|-
 
|-
 
! Blocker  
 
! Blocker  
(Tier 1 service down; blocking entire team/project)
+
(Tier 1 service down)
| IM (if available), Email to Webmaster
+
| IM/SMS text (if available), Email to Webmaster
| Strategic Members: see [[Support Policy]]
+
| Enterprise/Strategic Members: see [[Support Policy]]
 
Others: IM/SMS text (if available), Email to Webmaster  
 
Others: IM/SMS text (if available), Email to Webmaster  
 
| Immediate  
 
| Immediate  
Line 24: Line 24:
 
|-
 
|-
 
! Major  
 
! Major  
(Tier 2 service down; password reset; permissions preventing commit & unable to commit; other issues blocking an individual committer)
+
(Tier 2 service down, and: password reset; signing, permissions preventing commit & unable to commit; other issues blocking an individual committer or project)
|IM/SMS text (if available), Email to Webmaster || Strategic Members: see [[Support Policy]]  Others: Email to Webmaster || Within 2 hours || Strategic Members: Upon notification  Others: next business day
+
|IM/SMS text (if available), Email to Webmaster || Enterprise/Strategic Members: see [[Support Policy]]  Others: Email to Webmaster || Within 2 hours || Strategic Members: Upon notification  Others: next business day
 
|-
 
|-
 
! Normal  
 
! Normal  
(Tier 3 service down; regular, non-blocking requests; signing)
+
(Tier 3 service down; and: non-blocking requests)
 
|Open Bug || Open Bug || Within 4 hours || Within next business day
 
|Open Bug || Open Bug || Within 4 hours || Within next business day
 
|-
 
|-
Line 50: Line 50:
 
== Maintenance ==
 
== Maintenance ==
  
Occasionally, services must be shut down for maintenance. The maintenance window is Sunday, from 6:00am to 8:00am ET. 
+
Occasionally, services must be shut down for maintenance. Two maintenance windows will be utilized for systems upkeep, depending on the impact to the service:
  
At least three (3) days notice will be given for scheduled maintenance on Tier 1 and Tier 2 services affecting all users. In cases where the maintenance affects specific projects (such as CVS refactoring, or CVS/SVN migrations), notification and scheduling will be co-ordinated with the affected projects via bugzilla or public mailing list.  
+
* Tier I & II, Blocking: Sunday, from 6:00am to 8:00am ET.
 +
* All Tiers, Non-blocking: The last Friday of every month, from 1:00pm ET to 5:00pm ET
 +
 
 +
Blocking maintenance means a service will be completely down for more than five minutes. Examples include: upgrading a service to a new version. 
 +
 
 +
Non-blocking maintenance means the service is not completely taken down and remains available; however, compute jobs can be interrupted during this maintenance while the service is transitioned to a different compute node.
 +
 
 +
At least three (3) days notice will be given for maintenance on Tier 1 and Tier 2 services affecting all users. In cases where the maintenance affects specific projects (such as SCM refactoring, or SCM migrations), notification and scheduling will be co-ordinated with the affected projects via bugzilla or public mailing list.
  
 
Emergency maintenance may occur at any time, and service notices will be made on a "Best Effort" basis.
 
Emergency maintenance may occur at any time, and service notices will be made on a "Best Effort" basis.
  
 +
=== Jenkins/JIPP upgrades ===
 +
 +
Instances will be upgraded over the course of a week. The "upgrade week" will be announced two weeks in advance. During that week, each Jenkins instance will be upgraded when no build is running. To make that possible, the instance is put into "quiet mode": builds in progress run to completion, but Jenkins does not start any new ones. New build requests are queued instead, and queued builds are started once the upgrade is done. An upgrade usually requires a restart of the Jenkins instance, which will be down for at most 1 hour.
  
 
== Services Covered ==
 
== Services Covered ==
Line 63: Line 73:
  
 
* Bugzilla
 
* Bugzilla
* SCM (CVS (pserver and SSH) / Subversion (svn and svn+ssh) / Git (git and SSH)
+
* SCM: Git/Gerrit
 
* Website: www.eclipse.org
 
* Website: www.eclipse.org
  
Line 69: Line 79:
 
These services offer support for important Eclipse-related activities, and their availability is based on "best effort"; Webmasters may be contacted (by authorized persons) on mobile devices for problem resolution, and will make a reasonable effort to restore service outside of support hours.
 
These services offer support for important Eclipse-related activities, and their availability is based on "best effort"; Webmasters may be contacted (by authorized persons) on mobile devices for problem resolution, and will make a reasonable effort to restore service outside of support hours.
  
* Build server, Hudson infra
+
* CBI (Common build) services: JIPPs, ci.eclipse.org, signing, packaging, nexus (repo.eclipse.org)
 
* Mailing lists
 
* Mailing lists
* Websites: dev, download, wiki, EclipseCON
+
* Websites: git.eclipse.org, Downloads, Wiki, EclipseCON, Marketplace
  
 
=== Tier 3 - Next Business Day ===
 
=== Tier 3 - Next Business Day ===
Line 77: Line 87:
  
 
* Project vservers
 
* Project vservers
* Websites: help, EPIC, EclipseLive, PlanetEclipse, Blogs
+
* Websites: Infocenter Help, PlanetEclipse
 +
* CBI: Sonar
 
* Other services not listed in Tier 1 and Tier 2
 
* Other services not listed in Tier 1 and Tier 2
  
Line 89: Line 100:
 
|-
 
|-
 
! Tier 1
 
! Tier 1
| 99.99%  
+
| >99.98%  
 
|-
 
|-
 
! Tier 2
 
! Tier 2
|99%  
+
|Best Effort (>99%)
 
|-
 
|-
 
! Tier 3
 
! Tier 3
|Best Effort
+
|Next Business Day (>95%)
 
|}
 
|}
  
Line 117: Line 128:
 
* Installed software must be production quality - no Alpha or Beta code.
 
* Installed software must be production quality - no Alpha or Beta code.
 
* Only required software is to be installed and used on Tier 1 and Tier 2 clusters. Software that is not required for the basic operation of the service increases the risk of memory leaks and security vulnerabilities, and may negatively affect performance.
 
* Only required software is to be installed and used on Tier 1 and Tier 2 clusters. Software that is not required for the basic operation of the service increases the risk of memory leaks and security vulnerabilities, and may negatively affect performance.
* Server-side services, such as CVS and Apache, must be bundled with the Entreprise OS we use. Web-based services, such as Bugzilla, can be compiled from source, as they use an underlying OS service to manage ports, access and privilege separation.
+
* Server-side services, such as SCM systems and Apache, must be bundled with the Enterprise OS we use. Web-based services, such as Bugzilla, can be compiled from source, as they use an underlying OS service to manage ports, access and privilege separation.
 
* Installed software must be tested on an isolated node to ensure it doesn't impact the other services.
 
* Installed software must be tested on an isolated node to ensure it doesn't impact the other services.
  
Line 134: Line 145:
 
* Read and apply the Software Upgrade policies - no betas, etc.
 
* Read and apply the Software Upgrade policies - no betas, etc.
 
* One cluster node is usually set up with make/gcc etc. We don't usually leave the make tools on all nodes.
 
* One cluster node is usually set up with make/gcc etc. We don't usually leave the make tools on all nodes.
* Only download/compile software from a reputable source. Run MD5 sums.
+
* Only download/compile software from a reputable source. Run MD5/SHA1 sums.
 
* If software must be compiled from source, '''software must be compiled as a non-root user'''. This is non-negotiable, as there is no reason to compile as root. Document any compilation and/or installation process so we can upgrade later.
 
* If software must be compiled from source, '''software must be compiled as a non-root user'''. This is non-negotiable, as there is no reason to compile as root. Document any compilation and/or installation process so we can upgrade later.
 
* If software is to be installed on each cluster node, such as SVN, create an RPM package and/or use a 'make install' procedure so that we can repeat the installation on other nodes.
 
* If software is to be installed on each cluster node, such as SVN, create an RPM package and/or use a 'make install' procedure so that we can repeat the installation on other nodes.
Line 145: Line 156:
 
* OS upgrades must be tested on an isolated node, and tested in a production environment before being deployed to the entire cluster.
 
* OS upgrades must be tested on an isolated node, and tested in a production environment before being deployed to the entire cluster.
 
* Backend servers (storage, database, authentication) are *not* upgraded unless a problem arises where upgrading may solve it (i.e., MySQL) or there is a security issue that poses a risk to Tier 1 Services.
 
* Backend servers (storage, database, authentication) are *not* upgraded unless a problem arises where upgrading may solve it (i.e., MySQL) or there is a security issue that poses a risk to Tier 1 Services.
 +
 +
=== Backup Coverage ===
 +
 +
We maintain backups for all tier 1 data.  Some tier 2 and 3 services/data are also covered.

Revision as of 09:38, 6 May 2020

The Eclipse Foundation's IT team (the Webmasters) provides computer and network services and support that enable the Eclipse community, committers, members and EMO staff to access information and networked applications in a timely manner. Access the Status Page.

Webmaster Support

Webmaster Hours

Eclipse Webmasters are available full-time from Monday to Friday, from 8:00am to 5:00pm Eastern Time, and on call outside those hours.

Requesting Support

Webmasters will attempt to provide support and resolve issues in a timely manner according to the severity of the issue and prevailing conditions. Due to the varying nature of requests and the fluctuating demands on the Webmasters, resolution times may vary. For service definitions, please see Services Covered below.

Webmaster Support Request

Severity | Request process (webmaster hours) | Request process (outside webmaster hours) | Response time [1] (webmaster hours) | Response time [1] (outside webmaster hours)
Blocker (Tier 1 service down) | IM/SMS text (if available), Email to Webmaster | Enterprise/Strategic Members: see Support Policy; Others: IM/SMS text (if available), Email to Webmaster | Immediate | Upon notification
Major (Tier 2 service down, and: password reset; signing, permissions preventing commit & unable to commit; other issues blocking an individual committer or project) | IM/SMS text (if available), Email to Webmaster | Enterprise/Strategic Members: see Support Policy; Others: Email to Webmaster | Within 2 hours | Strategic Members: Upon notification; Others: next business day
Normal (Tier 3 service down; and: non-blocking requests) | Open Bug | Open Bug | Within 4 hours | Within next business day
Provisioning (Account; Project; vserver; code restructuring) | Open Bug | Open Bug | Within next 5 business days | Within next 5 business days
Enhancement (Requesting new software; site improvements; etc) | Open Bug | Open Bug | Best Effort | Best Effort

[1] Typical time to respond to a request. Time to complete a request will vary according to the complexity of the request and the time required to gather all the information needed to complete the request.
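
The table above can be read as a lookup keyed on severity and on whether the request arrives during webmaster hours. The following is a minimal sketch of that lookup, not an official tool. The names `RESPONSE_TIMES`, `in_webmaster_hours` and `typical_response` are hypothetical, and the sketch assumes the datetime passed in is already expressed in Eastern Time.

```python
from datetime import datetime

# Typical response times copied from the table above:
# severity -> (during webmaster hours, outside webmaster hours)
RESPONSE_TIMES = {
    "blocker": ("Immediate", "Upon notification"),
    "major": (
        "Within 2 hours",
        "Strategic Members: upon notification; others: next business day",
    ),
    "normal": ("Within 4 hours", "Within next business day"),
    "provisioning": ("Within next 5 business days", "Within next 5 business days"),
    "enhancement": ("Best Effort", "Best Effort"),
}


def in_webmaster_hours(moment: datetime) -> bool:
    """True for Monday to Friday, 8:00am to 5:00pm.

    Assumption: `moment` is already expressed in Eastern Time.
    """
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and 8 <= moment.hour < 17


def typical_response(severity: str, moment: datetime) -> str:
    """Return the typical response time for a severity at a given moment."""
    during, outside = RESPONSE_TIMES[severity.lower()]
    return during if in_webmaster_hours(moment) else outside
```

For example, `typical_response("Major", datetime(2020, 5, 6, 10, 0))` falls on a Wednesday morning and returns "Within 2 hours".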

Computer Systems

Service Hours

All services are expected to be available 24 hours a day, 365 days per year, except during scheduled maintenance periods.

Maintenance

Occasionally, services must be shut down for maintenance. Two maintenance windows will be utilized for systems upkeep, depending on the impact to the service:

  • Tier 1 & 2, Blocking: Sunday, from 6:00am to 8:00am ET.
  • All Tiers, Non-blocking: The last Friday of every month, from 1:00pm ET to 5:00pm ET

Blocking maintenance means a service will be completely down for more than five minutes. Examples include: upgrading a service to a new version.

Non-blocking maintenance means the service is not completely taken down and remains available; however, compute jobs can be interrupted during this maintenance while the service is transitioned to a different compute node.
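
As an illustration of the two windows described above, the sketch below classifies a point in time as falling inside the blocking window, the non-blocking window, or neither. It is an assumption-laden example rather than the Webmasters' tooling: `last_friday` and `maintenance_window` are hypothetical names, and the datetime is assumed to be expressed in Eastern Time already.

```python
import calendar
from datetime import date, datetime
from typing import Optional


def last_friday(year: int, month: int) -> date:
    """Return the date of the last Friday of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    end_of_month = date(year, month, last_day)
    # weekday(): Monday is 0, Friday is 4. Step back to the nearest Friday.
    offset = (end_of_month.weekday() - 4) % 7
    return end_of_month.replace(day=last_day - offset)


def maintenance_window(moment: datetime) -> Optional[str]:
    """Classify `moment` (assumed to be Eastern Time) against both windows."""
    # Blocking window: Sunday, 6:00am to 8:00am.
    if moment.weekday() == 6 and 6 <= moment.hour < 8:
        return "blocking"
    # Non-blocking window: last Friday of the month, 1:00pm to 5:00pm.
    if moment.date() == last_friday(moment.year, moment.month) and 13 <= moment.hour < 17:
        return "non-blocking"
    return None
```

A Friday afternoon that is not the last Friday of its month returns `None`, which is the main case a hand-written check tends to get wrong.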

At least three (3) days' notice will be given for maintenance on Tier 1 and Tier 2 services affecting all users. In cases where the maintenance affects specific projects (such as SCM refactoring, or SCM migrations), notification and scheduling will be co-ordinated with the affected projects via Bugzilla or public mailing list.

Emergency maintenance may occur at any time, and service notices will be made on a "Best Effort" basis.

Jenkins/JIPP upgrades

Instances will be upgraded over the course of a week. The "upgrade week" will be announced two weeks in advance. During that week, each Jenkins instance will be upgraded when no build is running. To make that possible, the instance is put into "quiet mode": builds in progress run to completion, but Jenkins does not start any new ones. New build requests are queued instead, and queued builds are started once the upgrade is done. An upgrade usually requires a restart of the Jenkins instance, which will be down for at most 1 hour.
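
The quiet-mode behaviour described above can be pictured with a small toy model. This is only a sketch of the queueing logic under the assumptions stated in the paragraph; it does not call Jenkins, and the class `JenkinsInstanceModel` and its methods are hypothetical names invented for illustration.

```python
from collections import deque


class JenkinsInstanceModel:
    """Toy model of quiet mode; this is not the Jenkins API."""

    def __init__(self):
        self.running = set()   # builds currently in progress
        self.queue = deque()   # build requests held back during quiet mode
        self.quiet_mode = False

    def request_build(self, name):
        # In quiet mode new requests are queued instead of started.
        if self.quiet_mode:
            self.queue.append(name)
        else:
            self.running.add(name)

    def finish_build(self, name):
        # Builds in progress are allowed to run to completion.
        self.running.discard(name)

    def enter_quiet_mode(self):
        self.quiet_mode = True

    def can_upgrade(self):
        # The upgrade happens only when no build is running.
        return self.quiet_mode and not self.running

    def upgrade(self):
        if not self.can_upgrade():
            raise RuntimeError("quiet mode is off or builds are still running")
        # After the upgrade (and restart), queued builds are started in order.
        self.quiet_mode = False
        while self.queue:
            self.running.add(self.queue.popleft())
```

In this model a build requested during quiet mode is never lost: it waits in the queue and starts as soon as `upgrade()` completes.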

Services Covered

Tier 1 - Critical

These services are the backbone of the Eclipse.org community and must be available at all times.

  • Bugzilla
  • SCM: Git/Gerrit
  • Website: www.eclipse.org

Tier 2 - Best Effort

These services offer support for important Eclipse-related activities, and their availability is based on "best effort"; Webmasters may be contacted (by authorized persons) on mobile devices for problem resolution, and will make a reasonable effort to restore service outside of support hours.

  • CBI (Common build) services: JIPPs, ci.eclipse.org, signing, packaging, nexus (repo.eclipse.org)
  • Mailing lists
  • Websites: git.eclipse.org, Downloads, Wiki, EclipseCON, Marketplace

Tier 3 - Next Business Day

These services are supported during webmaster hours. Webmasters may tend to issues during off-hours if they happen to be observed at that time.

  • Project vservers
  • Websites: Infocenter Help, PlanetEclipse
  • CBI: Sonar
  • Other services not listed in Tier 1 and Tier 2
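
The tier lists above include a catch-all rule: anything not named in Tier 1 or Tier 2 is Tier 3. A minimal sketch of that classification follows. The set contents are transcribed from the lists above, while the names `TIER_1`, `TIER_2` and `tier_of`, and the choice of lowercase matching keys, are assumptions made for this example.

```python
# Service names transcribed from the tier lists above, lowercased for matching.
TIER_1 = {"bugzilla", "git", "gerrit", "www.eclipse.org"}
TIER_2 = {
    "jipp", "ci.eclipse.org", "signing", "packaging", "repo.eclipse.org",
    "mailing lists", "git.eclipse.org", "downloads", "wiki", "eclipsecon",
    "marketplace",
}


def tier_of(service: str) -> int:
    """Return the support tier (1, 2 or 3) for a service name."""
    key = service.strip().lower()
    if key in TIER_1:
        return 1
    if key in TIER_2:
        return 2
    # "Other services not listed in Tier 1 and Tier 2" fall into Tier 3.
    return 3
```

With this fallback, the services listed explicitly under Tier 3 (for example PlanetEclipse or Sonar) need no entry of their own.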

Service Availability

Service is considered unavailable if it is unable to respond to user requests after 5 attempts in three minutes. The service is not considered unavailable if it is simply degraded or slow, although the IT team will consider degraded performance a high priority issue.
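
The unavailability rule above (five failed attempts in three minutes) can be sketched as a small probe loop. This is an illustration only: `is_unavailable` is a hypothetical name, the `probe` callable stands in for whatever request the monitoring actually makes, and spacing the attempts evenly across the window is an assumption, since the policy does not say how they are distributed.

```python
import time


def is_unavailable(probe, attempts=5, window_seconds=180.0, sleep=time.sleep):
    """Return True only if every attempt within the window fails.

    `probe` is any callable returning True when the service responded.
    A slow but successful response counts as available, matching the rule
    that degraded service is not considered unavailable.
    """
    interval = window_seconds / attempts  # assumption: evenly spaced attempts
    for attempt in range(attempts):
        try:
            if probe():
                return False
        except Exception:
            pass  # an error is treated as a failed attempt
        if attempt < attempts - 1:
            sleep(interval)
    return True
```

Passing a custom `sleep` makes the function easy to exercise without waiting three minutes.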

Service Availability

Tier | Availability
Tier 1 | >99.98%
Tier 2 | Best Effort (>99%)
Tier 3 | Next Business Day (>95%)

Please note: scheduled maintenance does not constitute a down time.
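
To make the percentages above concrete, the following sketch converts an availability target into the unscheduled downtime it allows. The function name `downtime_budget_minutes` is hypothetical, and measuring over a 365-day year is an assumption; the policy does not state the measurement period.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Upper bound on unscheduled downtime for a given availability target."""
    return period_minutes * (100.0 - availability_percent) / 100.0


for tier, target in (("Tier 1", 99.98), ("Tier 2", 99.0), ("Tier 3", 95.0)):
    minutes = downtime_budget_minutes(target)
    print(f"{tier}: under {minutes:.2f} minutes ({minutes / 60:.2f} hours) per year")
```

Under these assumptions, Tier 1 allows a little over 105 minutes of unscheduled downtime per year, Tier 2 about 87.6 hours, and Tier 3 about 438 hours. Because the targets are stated as "greater than", these figures are ceilings rather than expected values.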

SLA strategies

As a rule, the IT team abides by the following guidelines to ensure server uptime, responsiveness and stability:

  • Eclipse.org production servers are not used as test machines.
  • Beta, Alpha, or test code on production servers is prohibited.
  • Anything that poses a threat to the availability, the data integrity or the performance of Tier 1 and Tier 2 services can and must be terminated.
  • Committers and EMO staff are not permitted to run code on any server or hardware hosting a Tier 1 service.
  • Eclipse.org IT uses F/OSS software only.


Software installation policies and procedures

  • Clusters are used for Tier 1 and Tier 2 services where fault tolerance, scalability and performance are required.
  • Installed software must be production quality - no Alpha or Beta code.
  • Only required software is to be installed and used on Tier 1 and Tier 2 clusters. Software that is not required for the basic operation of the service increases the risk of memory leaks and security vulnerabilities, and may negatively affect performance.
  • Server-side services, such as SCM systems and Apache, must be bundled with the Enterprise OS we use. Web-based services, such as Bugzilla, can be compiled from source, as they use an underlying OS service to manage ports, access and privilege separation.
  • Installed software must be tested on an isolated node to ensure it doesn't impact the other services.


Software upgrade policies and procedures

  • Release-quality software is used. No Release Candidates or Milestones.
  • A period of at least 10 working days must pass before software is upgraded, to allow the maintainers to detect and fix any defects with the shipped product.
  • Software upgrades must be tested on an isolated node to minimize impact on other services.
  • If software is to be compiled from source (avoid!), follow the Software Compiling policies.
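
The 10-working-day waiting period in the list above can be computed as follows. This is a sketch under stated assumptions: `tenth_working_day_after` is a hypothetical name, working days are taken to be Monday to Friday, and public holidays are ignored.

```python
from datetime import date, timedelta


def tenth_working_day_after(release_date: date, working_days: int = 10) -> date:
    """Return the date on which the Nth working day after a release falls.

    Assumptions: working days are Monday to Friday; holidays are ignored.
    The upgrade should not take place before this day has passed.
    """
    current = release_date
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current
```

For a release on Friday, 1 May 2020, the tenth working day is Friday, 15 May 2020.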


Software Compiling policies

  • As much as possible, avoid compiling software from source, as maintenance is tedious. Use a vendor OS package instead.
  • Read and apply the Software Upgrade policies - no betas, etc.
  • One cluster node is usually set up with make/gcc etc. We don't usually leave the make tools on all nodes.
  • Only download/compile software from a reputable source. Verify the MD5/SHA1 checksums of the download against the published values.
  • If software must be compiled from source, software must be compiled as a non-root user. This is non-negotiable, as there is no reason to compile as root. Document any compilation and/or installation process so we can upgrade later.
  • If software is to be installed on each cluster node, such as SVN, create an RPM package and/or use a 'make install' procedure so that we can repeat the installation on other nodes.
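
The checksum step in the list above can be done with Python's standard `hashlib` module. The sketch below is one possible way to do it, not a prescribed procedure; `file_digest` and `matches_published_checksum` are hypothetical helper names, and the expected value is assumed to be the hexadecimal digest published by the software's maintainers.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Compute the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex, algorithm="sha1"):
    """Compare a file's digest with the published value (case-insensitive)."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

The `algorithm` argument accepts any name `hashlib.new` supports, so the same helper works if the maintainers publish a stronger digest such as SHA-256, which is preferable when available because MD5 and SHA1 are no longer collision-resistant.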


Operating System upgrade policies and procedures

  • Only upgrade to Release-quality software. No Release Candidates or Milestones.
  • Kernel upgrades must be tested on an isolated node, and tested in a production environment before being deployed to the entire cluster.
  • OS upgrades must be tested on an isolated node, and tested in a production environment before being deployed to the entire cluster.
  • Backend servers (storage, database, authentication) are not upgraded unless a problem arises where upgrading may solve it (e.g., MySQL) or there is a security issue that poses a risk to Tier 1 Services.

Backup Coverage

We maintain backups for all tier 1 data. Some tier 2 and 3 services/data are also covered.
