Incident Handling: Difference between revisions

no edit summary
No edit summary
No edit summary
Line 1: Line 1:
== Zulip migration ==
Due to the migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix
* Triggers are grouped in a topic on Zulip per host
* When an incident has been fully resolved, mark the topic as resolved, but only once any other incidents reported for the host are resolved as well
* There's no `?ongoing`; instead, for now we can track open incidents by checking for unresolved topics
* The posting of incidents is less smart (only posting when not posted yet), so to prevent an incident from going unreported due to network issues or the like, a message is posted at an interval (8 hours for non-critical and lower, 1 hour for critical and above) for as long as the incident has not been acknowledged.
* Incidents can be manually tracked by creating a topic by hand and reporting the problem
* There is no automatic GitLab issue creation or syncing anymore.
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.
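The reminder and tracking rules listed above can be illustrated with a minimal sketch. This is not the actual Zabbix-Zulip integration: the `Severity` ordering, the function names and the data shapes are all assumptions made for illustration, and the `✔ ` prefix reflects how Zulip renames a topic when it is marked as resolved.

```python
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical ordering; the real Zabbix severity mapping is not
    # described on this page.
    INFORMATIONAL = 1
    NON_CRITICAL = 2
    CRITICAL = 3


# Zulip prepends this marker to the name of a topic marked as resolved.
RESOLVED_PREFIX = "✔ "


def reminder_interval(severity: Severity) -> timedelta:
    """1 hour for critical and above, 8 hours for non-critical and lower."""
    if severity >= Severity.CRITICAL:
        return timedelta(hours=1)
    return timedelta(hours=8)


def should_repost(severity: Severity, last_posted: datetime,
                  now: datetime, acknowledged: bool) -> bool:
    """Post again only while the incident is still unacknowledged
    and the interval for its severity has elapsed."""
    if acknowledged:
        return False
    return now - last_posted >= reminder_interval(severity)


def open_incident_topics(topic_names: list[str]) -> list[str]:
    """Stand-in for `?ongoing`: keep the topics not marked as resolved."""
    return [name for name in topic_names
            if not name.startswith(RESOLVED_PREFIX)]
```

With per-host topics, `open_incident_topics` returns the hosts that still have at least one open incident, which is why a topic should only be resolved once every incident for that host is resolved.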
== Critical incidents ==
== Critical incidents ==
* Critical incidents are resolved within 16 hours.
* Critical incidents are resolved within 16 hours.
Line 79: Line 91:


===Additional context===
===Additional context===
* Critical incidents are posted in '''Infrastructure'''.
* Critical incidents are posted in '''SLA - Critical'''.
* When it is being tracked on GitLab a heavy check mark is added to the message.
* {{When it is being tracked on GitLab a heavy check mark is added to the message.}}
* Responses on the thread and on GitLab are automatically synced (to some extent)
* {{Responses on the thread and on GitLab are automatically synced (to some extent)}}
* When you reply with '''I agree that this has been fully resolved''' eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.
* {{When you reply with '''I agree that this has been fully resolved''' eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.}}


== Non-Critical incidents ==
== Non-Critical incidents ==
Line 102: Line 114:
# If action needed, perform action
# If action needed, perform action


== If an incident is reported in the SRE channel by other means than the Zabbix-Mattermost integration ==
== If an incident is reported by other means than the Zabbix-Zulip integration ==
# Acknowledge receipt.
# Acknowledge receipt.
# Classify the incident as critical, non-critical, or informational.
# Classify the incident as critical, non-critical, or informational.
# Create an issue using the `[https://chat.empiresmod.com/era/pl/xt669m4fcprfbciwee9iiq5xte ?track]` command and state that you've done so.
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)


Line 113: Line 125:
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.
The following steps can be done async or in person:
The following steps can be done async or in person:
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Mattermost's organisational channel if async).
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Zulip's organisational channel if async).
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.
* If there are any particularities the upcoming FR needs to be aware of, those are shared.
* If there are any particularities the upcoming FR needs to be aware of, those are shared.
* The upcoming FR asks their questions until they are satisfied and able to take over the FR
* The upcoming FR asks their questions until they are satisfied and able to take over the FR
* The upcoming FR announces/informs that they are now the acting FR over Mattermost's organisational channel
* The upcoming FR announces/informs that they are now the acting FR over Zulip's organisational channel
* The now acting FR removes the previous FR from the IPA sla-first-responder user group (Optional: and disables Zabbix calling for that person if they had that enabled by going to Zabbix > Configuration > Actions > Trigger actions)
* The now acting FR removes the previous FR from the IPA sla-first-responder user group (Optional: and disables Zabbix calling for that person if they had that enabled by going to Zabbix > Configuration > Actions > Trigger actions)