Incident Handling

= Checklist =
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material.
=== Critical Incidents ===
Critical incidents must be resolved within 16 hours.
# Acknowledge trigger in Zabbix.
# Check if incident is still ongoing.
# If ongoing and clients are potentially affected, notify the affected clients via Slack.
# Document all actions taken in Zulip topic.
# Create plan of action.
# Execute plan and document results in Zabbix thread.
# If unresolved, create new plan.
# When resolved:
## Verify trigger is no longer firing.
## Mark Zulip topic as resolved if no other incidents for host.
## Check for related triggers and resolve them.
Common issues that have occurred previously, and ''could'' occur again:
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* HTTPS down on Sunday: Can be caused by GitLab updates
=== Non-Critical Incidents ===
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.
# Acknowledge in Zabbix thread
# Check metrics sheet for existing milestone
## If a milestone exists:
### Add Lynx project ID to Zulip topic
### Add 🔁 emoji if ID already reported
## If no milestone exists:
### Add to metrics sheet
### Create Lynx project (priority 99, then 20 after estimation)
### Create Kimai activity
### Document IDs in Zulip topic
=== Informational Incidents ===
Informational incidents must be acknowledged within 72 hours.
# Acknowledge in Zabbix
# Verify issue
# Take action if needed
=== External Reports ===
# Acknowledge receipt
# Classify report as critical, non-critical or informational.
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details.
# Proceed with checklist above for the type of incident.
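
The deadlines stated in the checklists above can be sketched as a small lookup. This is only an illustration: the names <code>DEADLINES</code> and <code>deadlines_for</code> are hypothetical, and the sketch assumes the clock starts when the trigger fires, which the checklist does not state explicitly.

```python
from datetime import datetime, timedelta

# Time limits as stated in the checklists; None means no limit is stated.
# Assumption: limits are measured from the moment the trigger fired.
DEADLINES = {
    "critical": {"acknowledge": None, "resolve": timedelta(hours=16)},
    "non-critical": {"acknowledge": timedelta(hours=9), "resolve": timedelta(weeks=1)},
    "informational": {"acknowledge": timedelta(hours=72), "resolve": None},
}


def deadlines_for(severity, triggered_at):
    """Return the acknowledge/resolve deadlines for an incident (hypothetical helper)."""
    limits = DEADLINES[severity]
    return {
        step: (triggered_at + limit if limit is not None else None)
        for step, limit in limits.items()
    }


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 8, 0)
    for severity in DEADLINES:
        print(severity, deadlines_for(severity, fired))
```

An external report would be classified first and then looked up in the same table under its classification.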
= Full procedure =
== Zulip migration ==
Because of the migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes: