| (12 intermediate revisions by 2 users not shown) | |||
| Line 1: | Line 1: | ||
= Checklist = | |||
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material. | |||
=== Critical Incidents === | |||
Critical incidents must be resolved within 16 hours. | |||
# Acknowledge trigger in Zabbix. | |||
# Check if incident is still ongoing. | |||
# If ongoing and clients are potentially affected, notify the affected clients via Slack. | |||
# Document all actions taken in Zulip topic. | |||
# Create plan of action. | |||
# Execute plan and document results in Zabbix thread. | |||
# If unresolved, create new plan. | |||
# When resolved: | |||
## Verify trigger is no longer firing. | |||
## Mark Zulip topic as resolved if no other incidents for host. | |||
## Check for related triggers and resolve them. | |||
Common issues that have occurred previously, and ''could'' occur again: | |||
* SSH down: Check MaxStartups throttling, apply custom SSH config | |||
* No backup: Verify backup process is running, check devteam email | |||
* HTTPS down on Sunday: this can be due to GitLab updates | |||
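Step 1 of the critical checklist (acknowledging the trigger) can also be scripted against the Zabbix JSON-RPC API. The sketch below only builds the request body for the <code>event.acknowledge</code> method; it does not send anything. The event ID and message are hypothetical, and the API endpoint and authentication method depend on the Zabbix version and are deliberately left out, so treat this as an illustration rather than the team's actual tooling.

```python
import json

# Bit flags of the "action" parameter of Zabbix's event.acknowledge method.
ACKNOWLEDGE = 2   # acknowledge the event
ADD_MESSAGE = 4   # attach a message to the event


def build_ack_request(event_id, message, request_id=1):
    """Return a JSON-RPC 2.0 body that acknowledges one event with a message.

    event_id and message are supplied by the caller; the values used in the
    usage note are hypothetical. Authentication is NOT included here because
    it differs between Zabbix versions.
    """
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": [event_id],
            "action": ACKNOWLEDGE | ADD_MESSAGE,  # 2 + 4 = 6
            "message": message,
        },
        "id": request_id,
    }


def to_json(payload):
    """Serialise the request body for an HTTP POST."""
    return json.dumps(payload)
```

For example, <code>build_ack_request("12345", "FR is investigating")</code> yields a body that could be posted to the Zabbix API once credentials are added.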
=== Non-Critical Incidents === | |||
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week. | |||
# Acknowledge in Zabbix thread | |||
# Check metrics sheet for existing milestone | |||
## If a milestone exists: | |||
### Add Lynx project ID to Zulip topic | |||
### Add 🔁 emoji if ID already reported | |||
## If no milestone exists: | |||
### Add to metrics sheet | |||
### Create Lynx project (priority 99, then 20 after estimation) | |||
### Create Kimai activity | |||
### Document IDs in Zulip topic | |||
=== Informational Incidents === | |||
Informational incidents must be acknowledged within 72 hours. | |||
# Acknowledge in Zabbix | |||
# Verify issue | |||
# Take action if needed | |||
=== External Reports === | |||
# Acknowledge receipt | |||
# Classify report as critical, non-critical or informational. | |||
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details. | |||
# Proceed with checklist above for the type of incident. | |||
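The time limits in the checklists above can be summarised in a small lookup. The following is a minimal sketch, assuming the clock starts when the incident is reported (the checklist does not state the starting point); the names <code>SLA_LIMITS</code> and <code>sla_deadlines</code> are illustrative only.

```python
from datetime import datetime, timedelta

# Limits copied from the checklists. None means the checklist states no limit
# for that step.
SLA_LIMITS = {
    "critical": {"acknowledge": None, "resolve": timedelta(hours=16)},
    "non-critical": {"acknowledge": timedelta(hours=9), "resolve": timedelta(weeks=1)},
    "informational": {"acknowledge": timedelta(hours=72), "resolve": None},
}


def sla_deadlines(classification, reported_at):
    """Return the acknowledge/resolve deadlines for an incident.

    Assumption: deadlines are counted from reported_at. A value of None means
    the checklist defines no deadline for that step.
    """
    limits = SLA_LIMITS[classification]
    return {
        step: (reported_at + limit if limit is not None else None)
        for step, limit in limits.items()
    }
```

A non-critical incident reported on a Monday at 08:00 would, under this assumption, need acknowledging by 17:00 the same day and resolving by 08:00 the following Monday.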
= Full procedure = | |||
== Zulip migration == | == Zulip migration == | ||
Because of the migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes: | Because of the migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes: | ||
| Line 138: | Line 191: | ||
# If action needed, perform action | # If action needed, perform action | ||
== If an incident is reported by other means than the Zabbix-Zulip | == If an incident is reported by other means than the Zabbix-Zulip integration == | ||
Besides the automated Zabbix-Zulip integration, incidents can also be reported through other channels, such as emails from cron jobs, direct emails from customers, or topics in SRE General (for example, alerts about Zulip updates or issues raised by colleagues). | |||
# Acknowledge receipt. | # Acknowledge receipt. | ||
# Classify the incident as critical, non-critical, or informational. | # Classify the incident as critical, non-critical, or informational. | ||
| Line 147: | Line 201: | ||
When handing over the responsibility of '''first responder''' (FR), the following needs to happen: | When handing over the responsibility of '''first responder''' (FR), the following needs to happen: | ||
* The handover can be initiated by either the upcoming FR or the acting FR | * The handover can be initiated by either the upcoming FR or the acting FR | ||
* Acting FR adds the upcoming FR | * Acting FR adds the upcoming FR to the IPA sla-first-responder user group and, if the upcoming FR has Zabbix calling set up, enables it for them via Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions] | ||
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes emails or open topics in SRE General, etc.), updated with the latest status, and properly documented. | * Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails or open topics in SRE General, etc.), updated with the latest status, and properly documented. | ||
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR. | * The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR. | ||
* The upcoming FR makes sure they are subscribed to the right channels. | * The upcoming FR makes sure they are subscribed to the right channels. | ||