IT SLA
Revision as of 12:41, 6 January 2022
The Eclipse Foundation's IT team (the Webmasters) provides computer and network services and support that enable the Eclipse community, committers, members and EMO staff to access information and networked applications in a timely manner. Access the Status Page.
- 1 Webmaster Support
- 2 Computer Systems
- 2.1 Service Hours
- 2.2 Maintenance
- 2.3 Services Covered
- 2.4 Service Availability
- 2.5 SLA strategies
Webmaster Support
Eclipse Webmasters are available full-time Monday to Friday, 8:00am to 5:00pm Eastern Time, and are on call outside those hours.
Webmasters will attempt to provide support and resolve issues in a timely manner according to the severity of the issue and prevailing conditions. Due to the varying nature of requests and the fluctuating demands on the Webmasters, resolution times may vary. For service definitions, please see Services Covered below.
|Severity||Request process (webmaster hours)||Request process (outside webmaster hours)||Response time (webmaster hours)||Response time (outside webmaster hours)|
|Tier 1 service down||IM/SMS text (if available), email to Webmaster||Enterprise/Strategic Members: see Support Policy; others: IM/SMS text (if available), email to Webmaster|
|Tier 2 service down; password reset; signing or permission issues preventing a commit; inability to commit; other issues blocking an individual committer or project||IM/SMS text (if available), email to Webmaster||Enterprise/Strategic Members: see Support Policy; others: email to Webmaster||Within 2 hours||Strategic Members: upon notification; others: next business day|
|Tier 3 service down; non-blocking requests||Open a bug||Open a bug||Within 4 hours||Within next business day|
|Account, project, vserver, and code restructuring requests||Open a bug||Open a bug||Within 5 business days||Within 5 business days|
|New software requests, site improvements, etc.||Open a bug||Open a bug||Best effort||Best effort|
Response times are typical times to respond to a request. The time to complete a request will vary with its complexity and with the time needed to gather all the information required to complete it.
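The response times in the table depend on whether a request arrives during webmaster hours, so they can be read as a simple lookup. The sketch below makes several assumptions:

- Timestamps are naive `datetime` values already expressed in Eastern Time.
- 5:00pm itself counts as outside webmaster hours.
- Public holidays are ignored.
- The category keys are hypothetical labels.

Only the rows that list response times are included.

```python
from datetime import datetime, time

# Webmaster hours: Monday-Friday, 8:00am-5:00pm Eastern Time.
# Assumption: timestamps are naive datetimes already in Eastern Time; holidays are ignored.
WEBMASTER_START = time(8, 0)
WEBMASTER_END = time(17, 0)

# Hypothetical category keys -> (response during webmaster hours, response outside them),
# taken from the severity table rows that list response times.
RESPONSE_TARGETS = {
    "tier2_down_or_blocking": (
        "Within 2 hours",
        "Strategic Members: upon notification; others: next business day",
    ),
    "tier3_down_or_non_blocking": ("Within 4 hours", "Within next business day"),
    "account_project_vserver": ("Within 5 business days", "Within 5 business days"),
    "new_software_or_improvement": ("Best effort", "Best effort"),
}


def in_webmaster_hours(ts: datetime) -> bool:
    """True if ts is on a weekday between 8:00am (inclusive) and 5:00pm (exclusive)."""
    return ts.weekday() < 5 and WEBMASTER_START <= ts.time() < WEBMASTER_END


def response_target(category: str, ts: datetime) -> str:
    """Typical response time for a request of the given category submitted at ts."""
    during_hours, outside_hours = RESPONSE_TARGETS[category]
    return during_hours if in_webmaster_hours(ts) else outside_hours


if __name__ == "__main__":
    # Thursday, 6 January 2022 at 12:41 ET is within webmaster hours.
    print(response_target("tier3_down_or_non_blocking", datetime(2022, 1, 6, 12, 41)))
```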
Computer Systems
Service Hours
All services are expected to be available 24 hours a day, 365 days per year, except during scheduled maintenance periods.
Maintenance
Occasionally, services must be shut down for maintenance. Two maintenance windows are used for system upkeep, depending on the impact on the service:
- Tier 1 & 2, blocking: Sundays, 6:00am to 8:00am ET.
- All tiers, non-blocking: the last Friday of every month, 1:00pm to 5:00pm ET.
Blocking maintenance means a service will be completely down for more than five minutes; for example, when a service is upgraded to a new version.
Non-blocking maintenance means the service is not completely taken down and remains available; however, compute jobs can be interrupted while the service is moved to a different compute node.
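Whether a given time falls inside one of these windows can be checked programmatically. This is a minimal sketch with hypothetical function names. It assumes naive Eastern Time timestamps, treats window end times as exclusive, and uses the five-minute threshold above as the only test for blocking maintenance.

```python
import calendar
from datetime import date, datetime, time


def last_friday(year: int, month: int) -> date:
    """Date of the last Friday of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    last_day = date(year, month, days_in_month)
    # Step back from the last day of the month to the nearest Friday (weekday 4).
    offset = (last_day.weekday() - calendar.FRIDAY) % 7
    return date(year, month, days_in_month - offset)


def is_blocking(expected_downtime_minutes: float) -> bool:
    """Blocking maintenance takes a service completely down for more than five minutes."""
    return expected_downtime_minutes > 5


def in_maintenance_window(ts: datetime, blocking: bool) -> bool:
    """True if ts (naive, Eastern Time) falls inside the window for this kind of maintenance."""
    if blocking:
        # Tier 1 & 2, blocking: Sunday, 6:00am-8:00am ET.
        return ts.weekday() == calendar.SUNDAY and time(6) <= ts.time() < time(8)
    # All tiers, non-blocking: last Friday of the month, 1:00pm-5:00pm ET.
    return ts.date() == last_friday(ts.year, ts.month) and time(13) <= ts.time() < time(17)


if __name__ == "__main__":
    print(last_friday(2022, 1))  # 2022-01-28
    # A version upgrade expected to take 30 minutes is blocking; Sunday 9 January 2022 at 7:00am ET fits.
    print(in_maintenance_window(datetime(2022, 1, 9, 7, 0), blocking=is_blocking(30)))  # True
```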
At least three (3) days' notice will be given for maintenance on Tier 1 and Tier 2 services that affects all users. When maintenance affects specific projects (such as SCM refactoring or SCM migrations), notification and scheduling will be coordinated with the affected projects via Bugzilla or a public mailing list.
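The notice rule for Tier 1 and Tier 2 maintenance is a simple date comparison. This minimal sketch assumes that "three days" means three calendar days rather than business days, and the function name is hypothetical.

```python
from datetime import date, timedelta

MIN_NOTICE = timedelta(days=3)  # assumption: calendar days


def notice_is_sufficient(announced: date, maintenance: date) -> bool:
    """True if maintenance affecting all users was announced at least three days ahead."""
    return maintenance - announced >= MIN_NOTICE


if __name__ == "__main__":
    # Announcing on Thursday 6 January 2022 for Sunday 9 January gives exactly three days.
    print(notice_is_sufficient(date(2022, 1, 6), date(2022, 1, 9)))  # True
```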
Emergency maintenance may occur at any time, and service notices will be made on a "Best Effort" basis.
Jenkins instances will be upgraded over the course of a week, and this "upgrade week" will be announced two weeks in advance. During that week, each Jenkins instance will be upgraded when no build is running. To make this possible, the instance is put into "quiet mode": builds in progress run to completion, but Jenkins does not start any new ones. New build requests are queued instead, and queued builds start once the upgrade is done. An upgrade usually requires a restart of the Jenkins instance, which will then be down for at most 1 hour.
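The quiet-mode sequence can be modelled as a small state machine. The sketch below is a toy model with hypothetical class and method names; it does not call the Jenkins API. It only illustrates the ordering described above:

1. Running builds finish.
2. New requests are queued.
3. The upgrade happens once nothing is running.
4. Queued builds start afterwards.

```python
from collections import deque


class JenkinsInstanceModel:
    """Toy model of an instance being upgraded via quiet mode (not the Jenkins API)."""

    def __init__(self) -> None:
        self.running = set()  # names of builds in progress
        self.queue = deque()  # build requests waiting for the upgrade to finish
        self.quiet = False
        self.version = 1

    def request_build(self, job: str) -> None:
        # In quiet mode new builds are queued instead of started.
        if self.quiet:
            self.queue.append(job)
        else:
            self.running.add(job)

    def finish_build(self, job: str) -> None:
        self.running.discard(job)

    def try_upgrade(self) -> bool:
        """Enter quiet mode; upgrade (and restart) only once no build is running."""
        self.quiet = True
        if self.running:
            return False  # builds in progress are allowed to run to completion
        self.version += 1  # stands in for the upgrade and restart
        self.quiet = False
        # Queued builds start once the upgrade is done.
        while self.queue:
            self.running.add(self.queue.popleft())
        return True


if __name__ == "__main__":
    jenkins = JenkinsInstanceModel()
    jenkins.request_build("nightly")
    print(jenkins.try_upgrade())  # False: "nightly" is still running
    jenkins.request_build("pr-check")  # queued, not started
    jenkins.finish_build("nightly")
    print(jenkins.try_upgrade())  # True: upgraded, "pr-check" now running
    print(jenkins.version, sorted(jenkins.running))  # 2 ['pr-check']
```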
Services Covered
Tier 1 - Critical
These services are the backbone of the Eclipse.org community and must be available at all times.
- SCM: GitLab, Git/Gerrit
- Website: www.eclipse.org
Tier 2 - Best Effort
These services offer support for important Eclipse-related activities, and their availability is based on "best effort"; Webmasters may be contacted (by authorized persons) on mobile devices for problem resolution, and will make a reasonable effort to restore service outside of support hours.
- CBI (Common Build Infrastructure) services: JIPPs, ci.eclipse.org, signing, packaging, Nexus (repo.eclipse.org)
- Mailing lists
- Websites: git.eclipse.org, Downloads, Wiki, EclipseCon, Marketplace
Tier 3 - Next Business Day
These services are supported during webmaster hours. Webmasters may tend to issues during off-hours if they happen to be observed at that time.
- Project vservers
- Websites: Infocenter Help, PlanetEclipse
- CBI: Sonar
- Other services not listed in Tier 1 or Tier 2
Service Availability
A service is considered unavailable if it fails to respond to user requests after five attempts within three minutes. A service that is merely degraded or slow is not considered unavailable, although the IT team treats degraded performance as a high-priority issue.
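The availability rule can be expressed as a probe that gives up only after five consecutive failures spread over three minutes. This minimal sketch rests on a few assumptions:

- `check` is a hypothetical callable that returns True whenever the service answers at all, so a slow but responsive service still counts as available.
- `sleep` is injectable, so the logic can be exercised without waiting.
- The attempts are spaced evenly, 45 seconds apart.

```python
import time
from typing import Callable


def is_unavailable(
    check: Callable[[], bool],
    attempts: int = 5,
    window_seconds: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """True only if every one of `attempts` checks spread over `window_seconds` fails."""
    # Five attempts evenly spaced across three minutes: at 0s, 45s, 90s, 135s and 180s.
    interval = window_seconds / (attempts - 1)
    for attempt in range(attempts):
        if check():
            return False  # any response, even a slow one, means the service is up
        if attempt < attempts - 1:
            sleep(interval)
    return True


if __name__ == "__main__":
    answers = iter([False, False, True])
    # The third attempt succeeds, so the service is not considered unavailable.
    print(is_unavailable(lambda: next(answers), sleep=lambda s: None))  # False
```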
|Tier 2||Best Effort (>99%)|
|Tier 3||Next Business Day (>95%)|
Please note: scheduled maintenance does not count as downtime.
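An availability target translates into a downtime budget for a measurement period. Over a 30-day month (720 hours), >99% leaves at most about 7.2 hours of unplanned downtime, and >95% leaves about 36 hours. The sketch below assumes outages are recorded as hypothetical (hours, scheduled) pairs. Following the note above, it counts only unscheduled outages as downtime.

```python
from typing import Iterable, Tuple


def downtime_budget_hours(target_percent: float, period_hours: float) -> float:
    """Maximum unplanned downtime allowed in the period for a given availability target."""
    return period_hours * (1 - target_percent / 100)


def availability_percent(period_hours: float, outages: Iterable[Tuple[float, bool]]) -> float:
    """Availability over the period; each outage is (duration_hours, was_scheduled_maintenance)."""
    # Scheduled maintenance does not count as downtime.
    unplanned = sum(hours for hours, scheduled in outages if not scheduled)
    return 100 * (period_hours - unplanned) / period_hours


if __name__ == "__main__":
    month = 30 * 24  # 720 hours
    print(round(downtime_budget_hours(99, month), 2))  # 7.2
    print(round(downtime_budget_hours(95, month), 2))  # 36.0
    # A 2-hour scheduled maintenance window plus a 3-hour unplanned outage:
    print(round(availability_percent(month, [(2.0, True), (3.0, False)]), 3))  # 99.583
```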
SLA strategies
As a rule, the IT team observes the following guidelines to ensure server uptime, responsiveness and stability:
- Eclipse.org production servers are not used as test machines.
- Beta, Alpha, or test code on production servers is prohibited.
- Anything that poses a threat to the availability, data integrity or performance of Tier 1 and Tier 2 services can and must be terminated.
- Committers and EMO staff are not permitted to run code on any server or hardware hosting a Tier 1 service.
- Eclipse.org IT uses F/OSS only.
Software installation policies and procedures
- Clusters are used for Tier 1 and Tier 2 services where fault tolerance, scalability and performance are required.
- Installed software must be production quality - no Alpha or Beta code.
- Only required software is to be installed and used on Tier 1 and Tier 2 clusters. Software that is not required for the basic operation of the service increases the risk of memory leaks and security vulnerabilities, and may negatively affect performance.
- Server-side services, such as SCM systems and Apache, must be the versions bundled with the Enterprise OS we use. Web-based services, such as Bugzilla, can be compiled from source, as they rely on an underlying OS service to manage ports, access and privilege separation.
- Installed software must be tested on an isolated node to ensure it doesn't impact the other services.
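One way to support the "production quality only" rule is to screen version strings for pre-release markers before anything is installed. This is a heuristic sketch with a hypothetical function name. The marker list (alpha, beta, rc, milestone, snapshot, and short forms such as `M3` or `b2`) is an assumption, and version schemes vary widely between projects.

```python
import re

# Assumption: alphabetic tags that commonly mark pre-release builds,
# e.g. "2.0.0-rc1", "4.0.0-M3", "1.5b2", "3.0-SNAPSHOT".
PRE_RELEASE_TAGS = {
    "a", "alpha", "b", "beta", "rc", "pre", "preview",
    "m", "milestone", "snapshot", "dev",
}


def is_pre_release(version: str) -> bool:
    """True if any alphabetic run in the version string is a known pre-release tag."""
    return any(tag in PRE_RELEASE_TAGS for tag in re.findall(r"[a-z]+", version.lower()))


if __name__ == "__main__":
    for v in ["2.4.52", "2.0.0-rc1", "4.0.0-M3", "5.15.0-91-generic", "3.0-SNAPSHOT"]:
        print(v, "reject" if is_pre_release(v) else "ok")
```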
Software upgrade policies and procedures
- Release-quality software is used. No Release Candidates or Milestones.
- At least 10 working days must pass after a new version is released before software is upgraded to it, to allow the maintainers to detect and fix any defects in the shipped product.
- Software upgrades must be tested on an isolated node to minimize impact on other services.
- If software must be compiled from source (avoid this!), follow the Software Compiling policies.
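The 10-working-day waiting period can be computed from the upstream release date. This minimal sketch makes three assumptions:

- Working days are Monday to Friday.
- Public holidays are ignored.
- The upgrade may go ahead on the tenth working day after the release.

The function name is hypothetical.

```python
from datetime import date, timedelta


def earliest_upgrade_date(release: date, working_days: int = 10) -> date:
    """Return the tenth (by default) working day after `release`, skipping weekends."""
    current = release
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


if __name__ == "__main__":
    # A release on Thursday 6 January 2022 becomes eligible two weeks later.
    print(earliest_upgrade_date(date(2022, 1, 6)))  # 2022-01-20
```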
Software Compiling policies
- As much as possible, avoid compiling software from source, as maintenance is tedious. Use a vendor OS package instead.
- Read and apply the Software Upgrade policies - no betas, etc.
- Usually one cluster node is set up with build tools (make, gcc, etc.); these tools are not normally left on all nodes.
- Only download and compile software from a reputable source, and verify its MD5/SHA1 checksums.
- If software must be compiled from source, it must be compiled as a non-root user. This is non-negotiable, as there is no reason to compile as root. Document any compilation and/or installation process so it can be repeated for later upgrades.
- If software is to be installed on each cluster node (for example, SVN), create an RPM package and/or use a 'make install' procedure so that the installation can be repeated on the other nodes.
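Two of these rules lend themselves to a small pre-build check: verifying a downloaded archive against the checksum published by its source, and refusing to build as root. This is a minimal sketch with hypothetical function names. `hashlib.new` accepts "md5" and "sha1" as named in the policy, as well as stronger digests such as "sha256" when upstream publishes them.

```python
import hashlib
import os
import sys
from typing import Optional


def file_digest(path: str, algorithm: str = "sha1") -> str:
    """Hex digest of a file, read in chunks so large archives need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_checksum(path: str, expected_hex: str, algorithm: str = "sha1") -> bool:
    """Compare a file's digest with the value published by the download source."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


def refuse_root(euid: Optional[int] = None) -> None:
    """Exit if the effective user is root (uid 0); compiling as root is not permitted."""
    if euid is None:
        euid = os.geteuid()  # POSIX only
    if euid == 0:
        sys.exit("Refusing to compile as root; rerun as an unprivileged user.")
```

In a build script, `refuse_root()` would be called first, followed by `verify_checksum(archive, published_sha1)` before the archive is unpacked.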
Operating System upgrade policies and procedures
- Only upgrade to Release-quality software. No Release Candidates or Milestones.
- Kernel upgrades must be tested on an isolated node, and tested in a production environment before being deployed to the entire cluster.
- OS upgrades must be tested on an isolated node, and tested in a production environment before being deployed to the entire cluster.
- Backend servers (storage, database, authentication) are *not* upgraded unless a problem arises that upgrading may solve (e.g., MySQL) or there is a security issue that poses a risk to Tier 1 services.
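The kernel and OS rollout order can be sketched as three stages, each of which must pass before the next begins. This is a hypothetical illustration, not a description of Eclipse tooling. The node names, the `upgrade` and `healthy` callables, and the choice of a single production node for the second stage are all assumptions.

```python
from typing import Callable, List, Sequence


def rollout_stages(isolated: str, cluster: Sequence[str]) -> List[List[str]]:
    """Stage 1: isolated test node; stage 2: one production node; stage 3: the rest of the cluster."""
    if not cluster:
        return [[isolated]]
    return [[isolated], [cluster[0]], list(cluster[1:])]


def staged_upgrade(
    isolated: str,
    cluster: Sequence[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade stage by stage, stopping as soon as a node in a stage fails its health check."""
    done: List[str] = []
    for stage in rollout_stages(isolated, cluster):
        for node in stage:
            upgrade(node)
            done.append(node)
        if not all(healthy(node) for node in stage):
            break  # do not proceed to the next stage
    return done
```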
We maintain backups of all Tier 1 data. Some Tier 2 and Tier 3 services and data are also covered.