| (10 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
= This is the process = | |||
This document is an authoritative description of the process. This document supersedes all prior documents on the process. | |||
= Deviating from the process = | |||
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate. | |||
= Checklist = | = Checklist = | ||
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material. | This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material. | ||
| Line 8: | Line 14: | ||
# Check if the incident is still ongoing. | # Check if the incident is still ongoing. | ||
# Determine whether the incident is ongoing | # Determine whether the incident is ongoing | ||
# If this report came in via SRE - Report: | |||
## keep that thread open until the incident is resolved | |||
## post a link to the SRE - Report thread in any related underlying technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational | |||
# Determine whether clients are potentially affected, if so: | # Determine whether clients are potentially affected, if so: | ||
## notify the affected clients (Slack preferred) | ## notify the affected clients (Slack preferred if available) | ||
## share the message sent to the client in the incident Zulip thread | ## share the message sent to the client in the incident Zulip thread | ||
# Document all actions taken in the Zulip topic. | # Document all actions taken in the Zulip topic. | ||
| Line 20: | Line 29: | ||
## Mark Zulip topic as resolved if no other incidents for the host. | ## Mark Zulip topic as resolved if no other incidents for the host. | ||
## Check for related triggers and resolve them. | ## Check for related triggers and resolve them. | ||
## If there were any SRE - Report threads: | |||
### post a summary describing the high-level incident, that it is resolved, and how it was resolved. | |||
### post that summary message to any client channels such as Slack too. | |||
### close the thread in SRE - Report | |||
Note: we do not accept discussions on the how or why of incident response in the SRE - Report channel; those should be redirected to either the Retro or the Organisational channels. The only reason to reopen a thread in SRE - Report is to report that there is still impact and the incident was marked as resolved prematurely. | |||
=== Non-Critical Incidents === | === Non-Critical Incidents === | ||
| Line 139: | Line 149: | ||
When you are done, there should be no critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that either no one has taken responsibility for, or that you have taken responsibility for but are not actively handling. | When you are done, there should be no critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that either no one has taken responsibility for, or that you have taken responsibility for but are not actively handling. | ||
== Non-Critical incidents == | == Non-Critical incidents == | ||
| Line 196: | Line 200: | ||
== Handover == | == Handover == | ||
See [[Handover]] | |||