Incident Handling: Difference between revisions

(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
= This is the process =
= This is the process =
This document describes the process. This document supersedes all prior documents on process.  
This document is an authoritative description of the process. This document supersedes all prior documents on the process.  


= Deviating from the process =
= Deviating from the process =
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, ideally as soon as possible after deciding to deviate.
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate.


= Checklist =
= Checklist =
Line 14: Line 14:
# Check if the incident is still ongoing.
# Check if the incident is still ongoing.
# Determine whether the incident is ongoing
# Determine whether the incident is ongoing
# If this report came in via SRE - Report:
## keep that thread open until the incident is resolved
## post a link to the SRE - Report thread in any related underlying technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational
# Determine whether clients are potentially affected, if so:
# Determine whether clients are potentially affected, if so:
## notify the affected clients (Slack preferred)
## notify the affected clients (Slack preferred if available)
## share the message sent to the client in the incident Zulip thread
## share the message sent to the client in the incident Zulip thread
# Document all actions taken in the Zulip topic.
# Document all actions taken in the Zulip topic.
Line 26: Line 29:
## Mark Zulip topic as resolved if no other incidents for the host.
## Mark Zulip topic as resolved if no other incidents for the host.
## Check for related triggers and resolve them.
## Check for related triggers and resolve them.
## If there were any SRE - Report threads:
### post a high-level summary of the incident, stating that it is resolved and how it was resolved.
### post that summary message to any client channels such as Slack too.
### close the thread in SRE - Report


Common issues that have occurred previously, and ''could'' occur again:
Note: we do not accept discussions on the how or why of incident response in the SRE - Report channel; redirect those to either Retro or Organisational channels. The only reason to reopen a thread in SRE - Report is to report that there is still impact and that the incident was marked resolved prematurely.
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check the devteam email
* HTTPS down on Sunday: this can be due to GitLab updates
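
For the SSH item above, it can help to see what MaxStartups throttling means in numbers. The following is a minimal sketch, assuming the value uses the sshd_config form <code>start:rate:full</code> (OpenSSH default <code>10:30:100</code>) or a bare integer hard limit. The function names are hypothetical, the result only approximates sshd's integer arithmetic, and the contents of the custom SSH config mentioned above are not known here.

```python
def parse_max_startups(value: str) -> tuple[int, int, int]:
    """Parse a MaxStartups value into (start, rate, full).

    Accepts 'start:rate:full' or a bare integer N, which is treated
    as a hard limit: nothing is dropped below N, everything at N.
    """
    parts = value.strip().split(":")
    if len(parts) == 1:
        full = int(parts[0])
        return full, 100, full
    if len(parts) == 3:
        start, rate, full = (int(p) for p in parts)
        return start, rate, full
    raise ValueError(f"unexpected MaxStartups value: {value!r}")


def drop_probability(pending: int, start: int, rate: int, full: int) -> float:
    """Approximate percentage of new connections refused.

    'pending' is the number of unauthenticated connections currently
    open. Below 'start' nothing is refused; from 'start' the refusal
    rate begins at 'rate' percent and grows linearly to 100 at 'full'.
    """
    if pending < start:
        return 0.0
    if pending >= full or full <= start:
        return 100.0
    return rate + (100 - rate) * (pending - start) / (full - start)


if __name__ == "__main__":
    start, rate, full = parse_max_startups("10:30:100")
    for pending in (5, 10, 55, 100):
        pct = drop_probability(pending, start, rate, full)
        print(f"{pending:>3} pending -> {pct:.0f}% refused")
```

If many unauthenticated connections are pending (for example during a scan or a burst of automated logins), legitimate logins are refused at the rate computed here, which looks like SSH being down.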


=== Non-Critical Incidents ===
=== Non-Critical Incidents ===