No edit summary |
|||
| (12 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
= This is the process = | |||
This document is an authoritative description of the process. This document supersedes all prior documents on the process. | |||
= Deviating from the process = | |||
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate. | |||
= Checklist = | = Checklist = | ||
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material. | This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material. | ||
| Line 6: | Line 12: | ||
# Acknowledge trigger in Zabbix. | # Acknowledge trigger in Zabbix. | ||
# Check if incident is still ongoing. | # Check if the incident is still ongoing. | ||
# If | # Determine whether the incident is ongoing | ||
# Document all actions taken in Zulip topic. | # If this report came in via SRE - Report: | ||
# Create plan of action. | ## keep that thread open until the incident is resolved | ||
## post a link to the SRE - Report thread in any related underlying technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational | |||
# Determine whether clients are potentially affected, if so: | |||
## notify the affected clients (Slack preferred if available) | |||
## share the message sent to the client in the incident Zulip thread | |||
# Document all actions taken in the Zulip topic. | |||
# Create a plan of action. | |||
# Execute plan and document results in Zabbix thread. | # Execute plan and document results in Zabbix thread. | ||
# If unresolved, create new plan. | # If unresolved, create a new plan. | ||
# When resolved: | # When resolved: | ||
## Verify trigger is no longer firing. | ## Verify trigger is no longer firing. | ||
## Decide on when to notify affected clients (that you have notified of the incident) the incident has been resolved, and communicate this internally | ## Decide when to tell the affected clients (those you notified of the incident) that the incident has been resolved, and communicate this internally | ||
## Mark Zulip topic as resolved if no other incidents for host. | ## Mark Zulip topic as resolved if no other incidents for the host. | ||
## Check for related triggers and resolve them. | ## Check for related triggers and resolve them. | ||
## If there were any SRE - Report threads: | |||
### post a summary describing the incident at a high level, that it is resolved, and how it was resolved. | |||
### post that summary message to any client channels such as Slack too. | |||
### close the thread in SRE - Report | |||
Note: we do not accept discussions on the how or why of incident response in the SRE - Report channel; those should be redirected to either Retro or Organisational channels. The only reason to reopen a thread in SRE - Report should be to report that there is still impact and the incident was marked as resolved prematurely. | |||
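The first checklist step, acknowledging the trigger in Zabbix, can also be scripted. The following is a minimal sketch, not part of the documented process: it only builds the JSON-RPC request body for Zabbix's `event.acknowledge` method and does not send it. The event ID and message are hypothetical, and authentication is left out because it differs between Zabbix versions.

```python
import json

# Bit flags for the `action` field of Zabbix's `event.acknowledge` method.
ACTION_ACKNOWLEDGE = 2
ACTION_ADD_MESSAGE = 4


def build_ack_request(event_ids, message, request_id=1):
    """Build a JSON-RPC body that acknowledges events and attaches a message."""
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": list(event_ids),
            # Combine flags: acknowledge the event and add a message to it.
            "action": ACTION_ACKNOWLEDGE | ACTION_ADD_MESSAGE,
            "message": message,
        },
        "id": request_id,
    }


# Hypothetical event ID and message, for illustration only.
body = build_ack_request(["12345"], "Picked up, see Zulip topic")
print(json.dumps(body, indent=2))
```

The message field is a convenient place to point at the Zulip topic where the actions are documented, so that the acknowledgement and the incident thread stay linked.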
=== Non-Critical Incidents === | === Non-Critical Incidents === | ||
| Line 57: | Line 70: | ||
# While an incident is in progress and person A is handling it, all incidents in area X are handled by person A rather than the FR, unless person A's working day ends. Person A should communicate clearly to the FR when their day is over. | # While an incident is in progress and person A is handling it, all incidents in area X are handled by person A rather than the FR, unless person A's working day ends. Person A should communicate clearly to the FR when their day is over. | ||
# FR always has the last word on what solution to apply for resolving an incident. | # FR always has the last word on what solution to apply for resolving an incident. | ||
== Critical incidents == | == Critical incidents == | ||
| Line 148: | Line 149: | ||
When you are done, no critical trigger should be firing in Zabbix or open in the Zabbix-Mattermost integration that either has no one responsible for it or is your responsibility but is not being actively handled by you. | When you are done, no critical trigger should be firing in Zabbix or open in the Zabbix-Mattermost integration that either has no one responsible for it or is your responsibility but is not being actively handled by you. | ||
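The "done" condition above can be read as a filter over the critical triggers that are still open. The sketch below illustrates that reading only. The `Trigger` class, its fields, and the example data are hypothetical and do not correspond to any Zabbix or Mattermost data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trigger:
    name: str
    owner: Optional[str]     # who took responsibility, None if no one did
    actively_handled: bool   # whether the owner is working on it right now


def unattended(triggers, me):
    """Return triggers that block 'done': unowned, or owned by `me` but idle."""
    return [
        t for t in triggers
        if t.owner is None or (t.owner == me and not t.actively_handled)
    ]


# Hypothetical open critical triggers.
triggers = [
    Trigger("disk full on db1", None, False),
    Trigger("high load on web2", "alice", False),
    Trigger("backup failed", "bob", False),
    Trigger("cert expiring", "alice", True),
]

remaining = unattended(triggers, me="alice")
print([t.name for t in remaining])
```

Triggers owned by someone else do not count against you, even if that person is not actively handling them. You are done when the returned list is empty.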
== Non-Critical incidents == | == Non-Critical incidents == | ||
| Line 205: | Line 200: | ||
== Handover == | == Handover == | ||
See [[Handover]] | |||