Incident Handling: Difference between revisions

(13 intermediate revisions by 2 users not shown)
= This is the process =
This document is an authoritative description of the process. This document supersedes all prior documents on the process.
= Deviating from the process =
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate.
= Checklist =
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material.


# Acknowledge trigger in Zabbix.
# Check if the incident is still ongoing.
# If this report came in via SRE - Report:
## Keep that thread open until the incident is resolved.
## Post a link to the SRE - Report thread in any related underlying technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational.
# Determine whether clients are potentially affected. If so:
## Notify the affected clients (Slack preferred if available).
## Share the message sent to the client in the incident Zulip thread.
# Document all actions taken in the Zulip topic.
# Create a plan of action.
# Execute plan and document results in Zabbix thread.
# If unresolved, create a new plan.
# When resolved:
## Verify trigger is no longer firing.
## Decide when to tell the affected clients (those you notified of the incident) that the incident has been resolved, and communicate this internally.
## Mark Zulip topic as resolved if no other incidents for the host.
## Check for related triggers and resolve them.
## If there were any SRE - Report threads:
### Post a summary describing the high-level incident, that it is resolved, and how it was resolved.
### Post that summary message to any client channels such as Slack too.
### Close the thread in SRE - Report.


Note: we do not accept discussions on the how or why of incident response in the SRE - Report channel; those should be redirected to either the Retro or Organisational channels. The only reason to reopen a thread in SRE - Report should be to report that there is still impact and the incident was marked as resolved prematurely.

Common issues that have occurred previously, and ''could'' occur again:
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* HTTPS down on Sunday: this can be due to GitLab updates


=== Non-Critical Incidents ===
# When an incident is in progress and person A is handling it, all incidents in area X are handled by person A rather than the FR, unless person A's working day ends. Person A should communicate clearly to the FR when their day is over.
# The FR always has the last word on what solution to apply for resolving an incident.
== Zulip migration ==
Due to a migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix.
* Triggers are grouped in a topic on Zulip per host.
* When an incident has been fully resolved, mark the topic as resolved once any other incidents reported for the host are also resolved.
* There's no `?ongoing`; for now, we can track open incidents by checking for unresolved topics.
* The posting of incidents is less smart (only posting when not posted yet). To prevent an incident from going unreported due to network issues or the like, a message is posted again after an interval (8 hours for non-critical and lower, 1 hour for critical and above) for as long as the incident has not been acknowledged.
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.
* There is no automatic GitLab issue creation or syncing anymore.
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.
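
The reminder interval and the unresolved-topic check described above can be sketched as follows. This is a minimal illustration, not the actual Zabbix-Zulip integration: the function names and the boolean severity flag are hypothetical, and it assumes that Zulip marks resolved topics by prefixing the topic name with "✔ ".

```python
from datetime import datetime, timedelta

# Assumption: Zulip prefixes the names of resolved topics with this marker.
RESOLVED_PREFIX = "✔ "


def reminder_interval(critical_or_above: bool) -> timedelta:
    """Interval between repeated posts for an unacknowledged incident.

    1 hour for critical and above, 8 hours for non-critical and lower.
    """
    return timedelta(hours=1) if critical_or_above else timedelta(hours=8)


def should_repost(last_posted: datetime, now: datetime,
                  acknowledged: bool, critical_or_above: bool) -> bool:
    """Hypothetical helper: is a repeat message due for this incident?"""
    if acknowledged:
        # Acknowledged incidents are not posted again.
        return False
    return now - last_posted >= reminder_interval(critical_or_above)


def open_incident_topics(topic_names: list[str]) -> list[str]:
    """Return the topics (one per host) that are not marked as resolved."""
    return [name for name in topic_names
            if not name.startswith(RESOLVED_PREFIX)]
```

For example, an unacknowledged critical incident posted at 12:00 would be due for a repeat message at 13:00, while a non-critical one would be due at 20:00.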


== Critical incidents ==


When you are done, there should be no critical triggers firing in Zabbix, or open in the Zabbix-Mattermost integration, for which no one has taken responsibility, or for which you have taken responsibility but are not actively handling.
===Additional context===
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical '''SLA - Critical'''].
* <s>When it is being tracked on GitLab a heavy check mark is added to the message.</s>
* <s>Responses on the thread and on GitLab are automatically synced (to some extent)</s>
* <s>When you reply with '''I agree that this has been fully resolved''' eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.</s>


== Non-Critical incidents ==


== Handover ==
See [[Handover]].

When handing over the responsibility of '''first responder''' (FR), the following needs to happen:
* The handover can be initiated by either the upcoming FR or the acting FR.
* The acting FR adds the upcoming FR to the IPA sla-first-responder user group and, if the upcoming FR uses Zabbix calling, enables it for them by going to Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails, opened topics in SRE General, etc.), updated with the latest status, and properly documented.
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.
* The upcoming FR makes sure they are subscribed to the right channels.
 
The following steps can be done async or in person:
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Zulip's [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.
* If there are any particularities the upcoming FR needs to be aware of, those are shared.
* The upcoming FR asks their questions until they are satisfied and able to take over as FR.
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].
* The upcoming FR announces in Zulip's [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] that they are now the acting FR.
* The now acting FR removes the previous FR from the IPA sla-first-responder user group and, if the previous FR had Zabbix calling enabled, disables it by going to Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]