Incident Handling: Difference between revisions

(remove common issues in imperative checklist, better listed below)
 
(3 intermediate revisions by the same user not shown)
Line 14: Line 14:
# Check if the incident is still ongoing.
# Determine whether the incident is ongoing
# If this report came in via SRE - Report:
## keep that thread open until the incident is resolved
## post a link to the SRE - Report thread in any related technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational
# Determine whether clients are potentially affected, if so:
## notify the affected clients (Slack preferred)
## notify the affected clients (Slack preferred if available)
## share the message sent to the client in the incident Zulip thread
# Document all actions taken in the Zulip topic.
Line 26: Line 29:
## Mark Zulip topic as resolved if no other incidents for the host.
## Check for related triggers and resolve them.
## If there were any SRE - Report threads:
### post a summary describing the high-level incident, that it is resolved, and how it was resolved.
### post that summary message to any client channels such as Slack too.
### close the thread in SRE - Report
Note: the SRE - Report channel is not the place to discuss the how or why of incident response; redirect those discussions to the Retro or Organisational channels. The only reason to reopen a thread in SRE - Report is to report that there is still impact and the incident was resolved prematurely.


=== Non-Critical Incidents ===
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.
Non-critical incidents must be acknowledged within one working day and resolved within three weeks.


# Acknowledge in Zabbix thread
Line 142: Line 151:


== Non-Critical incidents ==
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.
Non-critical incidents are acknowledged within one working day and resolved within three weeks.
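
The acknowledgement and resolution windows above can be expressed as simple deadline arithmetic. The following is a minimal sketch, not part of the documented process: the function names are hypothetical, and it assumes that working days are Monday to Friday, that "one working day" means the same clock time on the next working day, and that public holidays are ignored.

```python
from datetime import datetime, timedelta

# Assumptions (not stated in the policy text): working days are Monday-Friday,
# "one working day" means the same clock time on the next working day,
# and public holidays are ignored.
ACK_WORKING_DAYS = 1
RESOLVE_WEEKS = 3


def add_working_days(start: datetime, days: int) -> datetime:
    """Advance `start` by `days` working days, skipping Saturday and Sunday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def non_critical_deadlines(raised_at: datetime) -> tuple[datetime, datetime]:
    """Return (acknowledge_by, resolve_by) for a non-critical incident."""
    acknowledge_by = add_working_days(raised_at, ACK_WORKING_DAYS)
    resolve_by = raised_at + timedelta(weeks=RESOLVE_WEEKS)
    return acknowledge_by, resolve_by
```

Under these assumptions, an incident raised on a Friday afternoon would need acknowledging by the same time on the following Monday, while the resolution deadline is counted in calendar weeks from the time it was raised.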


=== Acknowledging ===