Incident Handling: Difference between revisions

(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
= This is the process =
= This is the process =
This document describes the process. This document supersedes all prior documents on process.  
This document is an authoritative description of the process. This document supersedes all prior documents on the process.  


= Deviating from the process =
= Deviating from the process =
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, ideally as soon as possible after deciding to deviate.
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate.


= Checklist =
= Checklist =
Line 14: Line 14:
# Check if the incident is still ongoing.
# Check if the incident is still ongoing.
# Determine whether the incident is ongoing
# Determine whether the incident is ongoing
# If this report came in via SRE - Report:
## keep that thread open until the incident is resolved
## post a link to the SRE - Report thread in any related underlying technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational
# Determine whether clients are potentially affected, if so:
# Determine whether clients are potentially affected, if so:
## notify the affected clients (Slack preferred)
## notify the affected clients (Slack preferred if available)
## share the message sent to the client in the incident Zulip thread
## share the message sent to the client in the incident Zulip thread
# Document all actions taken in the Zulip topic.
# Document all actions taken in the Zulip topic.
Line 26: Line 29:
## Mark Zulip topic as resolved if no other incidents for the host.
## Mark Zulip topic as resolved if no other incidents for the host.
## Check for related triggers and resolve them.
## Check for related triggers and resolve them.
## If there were any SRE - Report threads:
### post a summary describing the high-level incident, that it is resolved, and how it was resolved.
### post that summary message to any client channels such as Slack too.
### close the thread in SRE - Report


Common issues that have occurred previously, and ''could'' occur again:
Note: we do not accept discussions on the how or why of incident response in the SRE - Report channel; those should be redirected to the Retro or Organisational channels. The only reason to reopen a thread in SRE - Report is to report that there is still impact and the incident was marked as resolved prematurely.
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check the devteam email
* HTTPS down on Sunday: this can be due to GitLab updates


=== Non-Critical Incidents ===
=== Non-Critical Incidents ===
Line 145: Line 149:


When you are done, there should be no critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that either no one has taken responsibility for, or that you have taken responsibility for but are not actively handling.
When you are done, there should be no critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that either no one has taken responsibility for, or that you have taken responsibility for but are not actively handling.
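The "done" condition above combines two cases, so a small sketch may help to read it. This is only an illustration under assumptions: the <code>Trigger</code> record, its field names, and the example data are hypothetical and do not correspond to the Zabbix API or to any existing tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trigger:
    # Hypothetical record; these fields are illustrative, not Zabbix API fields.
    name: str
    severity: str            # e.g. "critical" or "warning"
    owner: Optional[str]     # who has taken responsibility; None if nobody has
    actively_handled: bool   # whether the owner is working on it right now


def blocks_done(trigger: Trigger, me: str) -> bool:
    """Return True if this trigger prevents `me` from being done."""
    if trigger.severity != "critical":
        return False
    if trigger.owner is None:
        # Case 1: nobody has taken responsibility.
        return True
    # Case 2: I took responsibility but am not actively handling it.
    return trigger.owner == me and not trigger.actively_handled


def is_done(triggers: List[Trigger], me: str) -> bool:
    """True when no critical trigger falls into either blocking case."""
    return not any(blocks_done(t, me) for t in triggers)


# Made-up example data.
triggers = [
    Trigger("disk full", "critical", None, False),
    Trigger("ssh down", "critical", "alice", False),
    Trigger("https down", "critical", "bob", False),
    Trigger("high load", "warning", None, False),
]
blocking = [t.name for t in triggers if blocks_done(t, "alice")]
```

In this sketch a critical trigger owned by someone else does not block you; only unowned triggers and your own idle ones do.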
===Additional context===
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical '''SLA - Critical'''].


== Non-Critical incidents ==
== Non-Critical incidents ==
Line 199: Line 200:


== Handover ==
== Handover ==
When handing over the responsibility of '''first responder''' (FR), the following needs to happen:
See [[Handover]]
* The handover can be initiated by either the upcoming FR or the acting FR.
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group and, if the upcoming FR has that set up, enables Zabbix calling for them by going to Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails, opened topics in SRE General, etc.), updated with the latest status, and properly documented.
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.
* The upcoming FR makes sure they are subscribed to the right channels.
 
The following steps can be done async or in person:
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Zulip's [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).
* If the acting FR wants to hand over responsibility for any ongoing incident, they also state which incidents they want the upcoming FR to take over.
* If there are any particularities the upcoming FR needs to be aware of, those are shared.
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role.
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and, if part of the SRE team, [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].
* The upcoming FR announces in Zulip's [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] that they are now the acting FR.
* The now acting FR removes the previous FR from the IPA sla-first-responder user group and, if the previous FR had that enabled, disables Zabbix calling for them by going to Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]