Incident Management
We are currently building out this section
What is an Incident
The term "incident" in this context is generally understood as a disruption or degradation of a service that in some way impacts normal operation. Incidents can vary in impact and urgency.
Incident Severity and Priority
Severity
We define Severity as the formal measurement of the impact of a given incident. Essentially, how severely does the issue affect the experience of end users (or potentially internal users)?
Note: Currently these levels are used as a relative judge of impact. Not all of the criteria in a description have to be met exactly.
Level | Impact | Description |
---|---|---|
SEV0 | Critical | The site or a core component is completely unreachable or unusable. |
SEV1 | High | The site or a core component is experiencing a major degradation in service quality or usability that impacts functionality. |
SEV2 | Normal | The site or a core component is experiencing a minor degradation impacting service quality or usability while still being functional. |
SEV3 | Low | The site or a core component is experiencing a degradation with no impacts on usability, but possibly impacts aesthetics or branding. |
Priority
We define Priority as the formal measurement of how quickly the incident needs to be addressed. This can be assessed by considering several factors, including but not limited to:
- How widespread is the issue? Does it affect everyone, or only one user?
- Is there a business motivation behind the issue (marketing, business operations, etc.)?
Note: Currently these levels are used as a relative judge of urgency. Not all of the criteria in a description have to be met exactly.
Level | Urgency | Description |
---|---|---|
P0 | Urgent | Affects all users, or is critically important for business functioning. |
P1 | High | Affects many users, or is important for business functioning. |
P2 | Normal | Affects a single or a small number of users, or has little impact on business functioning. |
P3 | Low | Affects no users. No impact on business functioning. The issue is also not likely to grow in severity. |
P4 | Trivial | Not used for incidents. |
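The two scales above are independent of each other, and tooling that tags or sorts incidents may want to model them separately. The following is a minimal sketch, assuming the levels are represented as ordered enums. The class and function names are hypothetical and are not part of any existing Minds tooling.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Impact levels from the Severity table; a lower value means greater impact."""
    SEV0 = 0  # Critical
    SEV1 = 1  # High
    SEV2 = 2  # Normal
    SEV3 = 3  # Low


class Priority(IntEnum):
    """Urgency levels from the Priority table; a lower value means more urgent."""
    P0 = 0  # Urgent
    P1 = 1  # High
    P2 = 2  # Normal
    P3 = 3  # Low
    P4 = 4  # Trivial, not used for incidents


def is_valid_incident_priority(priority: Priority) -> bool:
    """Incidents use P0 through P3; P4 is not used for incidents."""
    return priority != Priority.P4


@dataclass
class Incident:
    """Hypothetical record pairing the two independent measurements."""
    title: str
    severity: Severity
    priority: Priority
```

Keeping the two fields separate allows, for example, a SEV3 cosmetic issue to carry a P1 priority when there is a business motivation behind it.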
Superhero
What is a Superhero
A Superhero is a critical site outage that requires immediate attention. We use PagerDuty to handle our on-call schedules and notification routing.
Core features
Superhero events should only be triggered when a core feature is seriously degraded in a way that impacts the majority of the user base.
Examples of a core feature outage
- Minds.com is returning 50X errors and is inaccessible
- Users are consistently unable to register or log in to the site, and this can be reliably reproduced
- Users' newsfeeds are inaccessible and will not load
- Users are unable to create posts
Examples of non-core features
The following examples can be created with the "Priority::Urgent" label and resolved during office hours, but should not be considered Superhero events:
- Rewards were not issued
- Push notifications are not being delivered or are delayed
- Rich embed thumbnails are not displaying on posts
- Analytics are showing no results, or they are inaccurate
Minds Chat
Minds Chat relies on Synapse, which provides limited scaling abilities and no high availability support. Until these fundamental technical issues are resolved, its stability is considered outside of Superhero support.
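The rule described in this section can be sketched as a small predicate. This is only an illustration of the criteria above, not an existing tool: the feature identifiers are hypothetical names derived from the core feature examples, and the final call always rests on human judgment.

```python
# Hypothetical identifiers for the core features listed in the examples above.
CORE_FEATURES = {
    "site_availability",  # Minds.com returns 50X errors and is inaccessible
    "registration",       # users cannot register
    "login",              # users cannot log in
    "newsfeed",           # newsfeeds will not load
    "post_creation",      # users cannot create posts
}


def is_superhero(feature: str, major_degradation: bool, majority_affected: bool) -> bool:
    """Return True only when a core feature is seriously degraded for most users.

    Anything outside CORE_FEATURES (rewards, push notifications, analytics,
    Minds Chat, and so on) is never a Superhero event under this rule.
    """
    return feature in CORE_FEATURES and major_degradation and majority_affected
```

For instance, `is_superhero("chat", True, True)` is false because Minds Chat is outside Superhero support, while `is_superhero("newsfeed", True, True)` is true.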
How to declare a Superhero
Follow the diagram below to determine if a Superhero should be called:
- Always open a new browser window and clear all cookies and session data.
- Ensure you are not in Canary mode and do not have any canary cookies set.
- Leave a message in the Superhero room.
- Create a new issue at gitlab.com/minds/minds with the Superhero template.
- The template will automatically apply the Type::Superhero and Superhero::Triggered labels.
- PagerDuty is triggered automatically via GitLab. Treat the GitLab issue as the central communication hub, with the #superhero room on Zulip used for additional offline support.
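As a complement to the browser check in the first two steps, a cookie-less request can confirm whether the site is returning 50X errors independently of any session or canary cookie. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the script does not replace verifying the issue in a clean browser window.

```python
import urllib.error
import urllib.request


def is_server_error(status: int) -> bool:
    """Return True for any 5xx HTTP status code."""
    return 500 <= status <= 599


def probe(url: str, timeout: float = 10.0) -> int:
    """Fetch `url` without cookies and return the final HTTP status code.

    The default urllib opener keeps no cookie jar, so no session or canary
    cookie is sent. Connection failures raise urllib.error.URLError, which
    is deliberately left uncaught so that the caller sees it.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses; the code is on the exception.
        return err.code


# Example usage (performs a real network request):
#   status = probe("https://www.minds.com/")
#   print(status, "server error" if is_server_error(status) else "no 5xx")
```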
How to communicate a non-Superhero issue
If you've followed the above steps and determined that the issue in question is not a Superhero, follow these steps:
Before you begin: Is this known to be an application issue? If so, stop and follow the steps here to create a bug report.
- Create an issue here with the Priority::High label. Do not use the Superhero template.
- Create a post in the DevOps channel.
- To reduce troubleshooting time, provide the following information if possible:
  - What environment is this issue occurring in? (Staging, Canary, or Production)
  - If the issue is easily reproducible and the steps are known, please share them.
  - Are there relevant error messages or logs present?
  - Are there any Sentry alarms for this issue?
  - Are there any suspicious metrics in Grafana?
If these points are not relevant and/or you do not know, just create a post with what you have. However, providing this information up front when available will greatly reduce time to resolution.
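The checklist above can be captured as a small report template, so that a post always has the same shape even when some details are unknown. This is a minimal sketch under assumptions: the `IssueReport` class and its field names are hypothetical and do not correspond to an existing GitLab template.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ENVIRONMENTS = ("Staging", "Canary", "Production")


@dataclass
class IssueReport:
    """Hypothetical container for the troubleshooting details listed above."""
    summary: str
    environment: Optional[str] = None
    repro_steps: List[str] = field(default_factory=list)
    errors_or_logs: Optional[str] = None
    sentry_alarms: Optional[str] = None
    grafana_metrics: Optional[str] = None

    def to_markdown(self) -> str:
        """Render the report; missing details are marked as unknown, not omitted."""
        if self.environment is not None and self.environment not in ENVIRONMENTS:
            raise ValueError(f"environment must be one of {ENVIRONMENTS}")
        lines = [self.summary, ""]
        lines.append(f"- Environment: {self.environment or 'unknown'}")
        if self.repro_steps:
            lines.append("- Steps to reproduce:")
            lines.extend(f"  {i}. {step}" for i, step in enumerate(self.repro_steps, 1))
        else:
            lines.append("- Steps to reproduce: unknown")
        lines.append(f"- Error messages or logs: {self.errors_or_logs or 'unknown'}")
        lines.append(f"- Sentry alarms: {self.sentry_alarms or 'unknown'}")
        lines.append(f"- Grafana metrics: {self.grafana_metrics or 'unknown'}")
        return "\n".join(lines)
```

A report with only a summary still renders, which matches the guidance to post with whatever information is available.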
Managing a Superhero Incident
Roles and Responsibilities
Superhero
The Superhero is the primary on-call, and in most cases serves the traditional SRE responsibilities of Incident Commander (IC) and Ops Lead (OL). As a Superhero, your responsibilities entail:
- Own the outage, meaning that you are responsible for driving the incident to its conclusion.
- Perform any and all changes needed to the application or the environment, as well as keeping track of what was attempted and the result of the attempt.
- Communicate your learnings effectively with the Sidekick, so that they can have adequate information to perform their task.
- After resolving the issue, coordinate and lead the postmortem for the incident.
Sidekick
The Sidekick serves as the Communications Lead (CL). As a Sidekick, your responsibilities entail:
- Communicate effectively with the Superhero to understand what has been attempted and the result of the attempts.
- Field any and all questions that may surface from those joining the call, in order to protect the Superhero's focus while working.
- Perform both internal and external communications to the relevant channels, currently the Superhero group chat as well as the status page.
Postmortem
Following the resolution of an incident, the Superhero will lead the Postmortem (sometimes referred to as the "5 Whys" or the Root Cause Analysis). During this ceremony, the team members involved in the incident attempt to determine the root cause of the issue by reviewing historical data (metrics, traces, etc.) and repeatedly asking "Why?" to trace the symptoms of the incident back until the true catalyst is discovered.
A few things to consider when conducting a Postmortem:
- The focus is on learning the cause of the incident so that it can be prevented in the future. The focus is never to find who personally was at fault for the incident. While we do uncover mistakes, we should strive for a blameless culture of learning so that we may all learn and improve. Often mistakes can be prevented in the future through automation and process improvements.
- Each Postmortem should conclude with at least one GitLab issue in response to the conversation in order to prevent the issue from occurring again.
- These sessions should also be used to take stock of the current monitoring and alerting for the project, and any gaps should be addressed. This serves as the re-entry into the "Prepare" stage in the Superhero lifecycle, as we learn from this outage and prepare for the next.
Runbooks
Currently the Runbooks are documented here.
You will need access to the GitLab project to view the Runbooks.