Incident Handling: Difference between revisions

No edit summary
 
(3 intermediate revisions by the same user not shown)
Line 3: Line 3:


= Deviating from the process =
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, ideally as soon as possible after deciding to deviate.
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate.


= Checklist =
Line 14: Line 14:
# Check if the incident is still ongoing.
# Determine whether the incident is ongoing
# If this report came in via SRE - Report:
## keep that thread open until the incident is resolved
## post a link to the SRE - Report thread in any related underlying technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational
# Determine whether clients are potentially affected; if so:
## notify the affected clients (Slack preferred)
## notify the affected clients (Slack preferred if available)
## share the message sent to the client in the incident Zulip thread
# Document all actions taken in the Zulip topic.
Line 26: Line 29:
## Mark Zulip topic as resolved if no other incidents for the host.
## Check for related triggers and resolve them.
## If there were any SRE - Report threads:
### post a summary describing the high-level incident, that it is resolved, and how it was resolved.
### post that summary message to any client channels such as Slack too.
### close the thread in SRE - Report


Common issues that have occurred previously, and ''could'' occur again:
Note: we do not accept discussions about the how or why of incident response in the SRE - Report channel; those should be redirected to either the Retro or Organisational channels. The only reason to reopen a thread in SRE - Report should be to report that there is still impact and that the incident was marked as resolved prematurely.
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check the devteam email
* HTTPS down on Sunday: this can be due to GitLab updates
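As background for the SSH item above: OpenSSH's <code>MaxStartups</code> setting, in its <code>start:rate:full</code> form, makes sshd refuse new unauthenticated connections with a probability that starts at <code>rate</code> percent once <code>start</code> connections are pending and rises linearly to 100 percent at <code>full</code>. The following is a minimal sketch of that calculation, assuming the behaviour documented in sshd_config(5); the function names are illustrative and not part of any existing tooling, and the example values are the OpenSSH defaults, not necessarily what our hosts use. The effective value on a host can usually be inspected with <code>sshd -T</code>.

```python
from typing import Tuple


def parse_max_startups(value: str) -> Tuple[int, int, int]:
    """Parse a MaxStartups value into (start, rate, full).

    Accepts the three-value form "start:rate:full" or a single number,
    which is treated as a hard limit (everything refused at that count).
    """
    parts = [int(p) for p in value.strip().split(":")]
    if len(parts) == 1:
        return parts[0], 100, parts[0]
    if len(parts) != 3:
        raise ValueError(f"unexpected MaxStartups value: {value!r}")
    start, rate, full = parts
    return start, rate, full


def drop_probability(unauthenticated: int, start: int, rate: int, full: int) -> float:
    """Probability (0.0 to 1.0) that sshd refuses a new connection attempt,
    given the current number of unauthenticated connections."""
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full or rate >= 100:
        return 1.0
    # Linear ramp from rate% at `start` up to 100% at `full`.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100


if __name__ == "__main__":
    start, rate, full = parse_max_startups("10:30:100")  # OpenSSH default
    for pending in (5, 10, 55, 100):
        print(pending, drop_probability(pending, start, rate, full))
```

If the number of pending unauthenticated connections on a host is regularly at or above <code>start</code>, legitimate logins will intermittently be refused, which matches the "SSH down" symptom.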


=== Non-Critical Incidents ===