Incident Handling: Difference between revisions

Incident Handling (view source)

Revision as of 07:53, 4 March 2025

341 bytes removed, 4 March

→ Critical Incidents

Jakobbuis (116 edits)

@@ Line 5: / Line 5: @@
 Critical incidents must be resolved within 16 hours.
-==== Initial Response ====
+# Acknowledge trigger in Zabbix.
-# Acknowledge trigger in Zabbix
+# Check if incident is still ongoing.
-# Check if incident is still ongoing
+# If ongoing and clients are potentially affected, notify the affected clients via Slack.
-# If ongoing, notify affected clients via Slack
+# Document all actions taken in Zulip topic.
-# Document all actions in Zulip topic
+# Create plan of action.
+# Execute plan and document results in Zabbix thread.
-==== Resolution Process ====
+# If unresolved, create new plan.
-# Create plan of action
-# Execute plan and document results
-# If unresolved, create new plan
 # When resolved:
-## Verify trigger is no longer firing
+## Verify trigger is no longer firing.
-## Mark Zulip topic as resolved if no other incidents for host
+## Mark Zulip topic as resolved if no other incidents for host.
-## Check for related triggers and resolve them
+## Check for related triggers and resolve them.
-==== Common issues ====
+Common issues that have occurred previously, and ''could'' occur again:
 * SSH down: Check MaxStartups throttling, apply custom SSH config
 * No backup: Verify backup process is running, check devteam email
-* HTTPS down: May be due to Sunday Gitlab updates
+* HTTPS down on Sunday: this can be due to Gitlab updates
 === Non-Critical Incidents ===
 Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.
-# Acknowledge in Zabbix
+# Acknowledge in Zabbix thread
 # Check metrics sheet for existing milestone
-# If a milestone exists:
+## If a milestone exists:
-## Add Lynx project ID to Zulip topic
+### Add Lynx project ID to Zulip topic
-## Add 🔁 emoji if ID already reported
+### Add 🔁 emoji if ID already reported
-# If no milestone:
+## If no milestone exists:
-## Add to metrics sheet
+### Add to metrics sheet
-## Create Lynx project (priority 99, then 20 after estimation)
+### Create Lynx project (priority 99, then 20 after estimation)
-## Create Kimai activity
+### Create Kimai activity
-## Document IDs in Zulip topic
+### Document IDs in Zulip topic
 === Informational Incidents ===
@@ Line 49: / Line 46: @@
 # Acknowledge receipt
-# Classify criticality
+# Classify report as critical, non-critical or informational.
-# Create Zulip topic in appropriate channel
+# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details.
-# Follow standard process based on classification
+# Proceed with checklist above for the type of incident.
-=== Handover Steps ===
-==== Acting FR: ====
-* Add new FR to IPA group
-* Enable Zabbix calling
-* Document all active incidents
-* Share special circumstances
-==== New FR: ====
-* Review SLA status
-* Subscribe to channels:
-** SRE - General
-** SRE # Critical
-** SRE ## Non-Critical
-** SRE ### Informational
-* Announce takeover in Organisational channel
-* Remove old FR from IPA group
-* Disable old FR's Zabbix calling
-=== Naming Convention ===
-* Kimai: <YYYY-MM> <problem_title>
-* Milestone: Delft Solutions Hosting Incident response work <kimai_activity_name>
-* Lynx ID: SRE<YYMM><XXX>
 = Full procedure =
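
The time limits stated in the diff above (critical incidents resolved within 16 hours; non-critical incidents acknowledged within 9 hours and resolved within 1 week) can be sketched as a small deadline calculation. This is only an illustration: the `SLA` table layout and the `sla_deadlines` function are hypothetical names, and the sketch assumes the limits are counted in wall-clock time from the moment the incident is reported, which the page does not state.

```python
from datetime import datetime, timedelta

# Time limits taken from the procedure text.
# The table layout and key names are assumptions made for this sketch.
SLA = {
    "critical": {"resolve": timedelta(hours=16)},
    "non-critical": {
        "acknowledge": timedelta(hours=9),
        "resolve": timedelta(weeks=1),
    },
}


def sla_deadlines(severity: str, reported_at: datetime) -> dict[str, datetime]:
    """Return the deadline for each SLA step of the given severity.

    Assumes wall-clock time counted from `reported_at` (not business hours).
    Raises KeyError for severities that have no stated time limit.
    """
    return {step: reported_at + limit for step, limit in SLA[severity].items()}


if __name__ == "__main__":
    reported = datetime(2025, 3, 4, 7, 53)
    for severity in SLA:
        for step, deadline in sla_deadlines(severity, reported).items():
            print(f"{severity}: {step} by {deadline:%Y-%m-%d %H:%M}")
```

Informational incidents are left out of the table on purpose, because the visible part of the page gives no time limit for them.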

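The naming convention that this revision removes from the page (Kimai: `<YYYY-MM> <problem_title>`, milestone: `Delft Solutions Hosting Incident response work <kimai_activity_name>`, Lynx ID: `SRE<YYMM><XXX>`) can be illustrated with a few formatting helpers. The function names are hypothetical, and reading `<XXX>` as a zero-padded three-digit sequence number is an assumption; the page does not define it.

```python
from datetime import date


def kimai_activity_name(day: date, problem_title: str) -> str:
    """Kimai activity: <YYYY-MM> <problem_title>."""
    return f"{day:%Y-%m} {problem_title}"


def milestone_name(kimai_name: str) -> str:
    """Milestone: fixed prefix followed by the Kimai activity name."""
    return f"Delft Solutions Hosting Incident response work {kimai_name}"


def lynx_id(day: date, sequence: int) -> str:
    """Lynx ID: SRE<YYMM><XXX>.

    Assumption: <XXX> is a three-digit, zero-padded sequence number.
    """
    return f"SRE{day:%y%m}{sequence:03d}"


if __name__ == "__main__":
    day = date(2025, 3, 4)
    activity = kimai_activity_name(day, "SSH down")
    print(activity)
    print(milestone_name(activity))
    print(lynx_id(day, 7))
```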