Incident Handling: Difference between revisions

= New procedure =
=== Critical Incidents (16hr resolution) ===
==== Initial Response ====
# Acknowledge trigger in Zabbix
# Check if incident is still ongoing
# If ongoing, notify affected clients via Slack
# Document all actions in Zulip topic
==== Resolution Process ====
# Create plan of action
# Execute plan and document results
# If unresolved, create new plan
# When resolved:
#* Verify trigger is no longer firing
#* Mark Zulip topic as resolved if no other incidents for host
#* Check for related triggers and resolve them
==== Common Issues ====
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* HTTPS down: May be due to Sunday GitLab updates
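When checking MaxStartups throttling, it can help to see how a given setting translates into refused connections. The sketch below follows the start:rate:full behaviour described in the sshd_config manual; the function names are illustrative, and treating a single integer as a hard limit is an assumption made here for simplicity.

```python
def parse_max_startups(value: str) -> tuple[int, int, int]:
    """Parse an sshd MaxStartups value into (start, rate, full)."""
    parts = value.strip().split(":")
    if len(parts) == 1:
        # Assumption: a single integer acts as a hard limit with no early drop.
        limit = int(parts[0])
        return limit, 100, limit
    if len(parts) != 3:
        raise ValueError(f"expected 'start:rate:full' or one integer, got {value!r}")
    start, rate, full = (int(p) for p in parts)
    return start, rate, full


def drop_probability(unauthenticated: int, start: int, rate: int, full: int) -> float:
    """Chance that sshd refuses a new connection at this many unauthenticated connections."""
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    base = rate / 100
    # The probability rises linearly from rate% at `start` to 100% at `full`.
    return base + (1 - base) * (unauthenticated - start) / (full - start)


if __name__ == "__main__":
    start, rate, full = parse_max_startups("10:30:100")
    for n in (5, 10, 55, 100):
        print(n, round(drop_probability(n, start, rate, full), 2))
```

The effective value on a host can be read from the `maxstartups` line of `sshd -T` output and passed to `parse_max_startups`.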
=== Non-Critical Incidents (9hr acknowledge, 1wk resolution) ===
# Acknowledge in Zabbix
# Check metrics sheet for existing milestone
# If milestone exists:
#* Add Lynx project ID to Zulip topic
#* Add 🔁 emoji if ID already reported
# If no milestone:
#* Add to metrics sheet
#* Create Lynx project (priority 99, then 20 after estimation)
#* Create Kimai activity
#* Document IDs in Zulip topic
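The branching in the steps above can be sketched as a small checklist function. This is only an illustration of the decision flow: the function name and parameters are hypothetical, and nothing here talks to Zabbix, Lynx, Kimai or Zulip.

```python
def non_critical_steps(milestone_exists: bool, id_already_reported: bool = False) -> list[str]:
    """Return the ordered checklist for a non-critical incident."""
    steps = ["Acknowledge in Zabbix", "Check metrics sheet for existing milestone"]
    if milestone_exists:
        steps.append("Add Lynx project ID to Zulip topic")
        if id_already_reported:
            # The emoji marks a repeat report of an already known ID.
            steps.append("Add 🔁 emoji")
    else:
        steps += [
            "Add to metrics sheet",
            "Create Lynx project (priority 99, then 20 after estimation)",
            "Create Kimai activity",
            "Document IDs in Zulip topic",
        ]
    return steps


if __name__ == "__main__":
    for step in non_critical_steps(milestone_exists=False):
        print("-", step)
```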
=== Informational Incidents (72hr acknowledge) ===
# Acknowledge in Zabbix
# Verify issue
# Take action if needed
=== External Reports ===
# Acknowledge receipt
# Classify criticality
# Create Zulip topic in appropriate channel
# Follow standard process based on classification
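Once a report is classified, the SLA windows from the section headings determine the deadlines. A minimal sketch, assuming the clock starts when the report is received and that the windows are plain wall-clock hours (neither is stated on this page); the names `SLA_WINDOWS` and `sla_deadlines` are illustrative.

```python
from datetime import datetime, timedelta

# Windows taken from the section headings on this page.
SLA_WINDOWS = {
    "critical": {"resolve": timedelta(hours=16)},
    "non-critical": {"acknowledge": timedelta(hours=9), "resolve": timedelta(weeks=1)},
    "informational": {"acknowledge": timedelta(hours=72)},
}


def sla_deadlines(classification: str, reported_at: datetime) -> dict[str, datetime]:
    """Return the acknowledge/resolve deadlines that apply to a classification."""
    try:
        windows = SLA_WINDOWS[classification.lower()]
    except KeyError:
        raise ValueError(f"unknown classification: {classification!r}") from None
    # Assumption: every window is measured from the moment of the report.
    return {name: reported_at + window for name, window in windows.items()}


if __name__ == "__main__":
    print(sla_deadlines("non-critical", datetime(2024, 1, 1, 8, 0)))
```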
=== Handover Steps ===
==== Acting FR: ====
* Add new FR to IPA group
* Enable Zabbix calling
* Document all active incidents
* Share special circumstances
==== New FR: ====
* Review SLA status
* Subscribe to channels:
** SRE - General
** SRE # Critical
** SRE ## Non-Critical
** SRE ### Informational
* Announce takeover in Organisational channel
* Remove old FR from IPA group
* Disable old FR's Zabbix calling
=== Naming Convention ===
* Kimai: <YYYY-MM> <problem_title>
* Milestone: Delft Solutions Hosting Incident response work <kimai_activity_name>
* Lynx ID: SRE<YYMM><XXX>
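The naming patterns above can be generated with a few helpers. This is a sketch rather than an official tool: the function names are hypothetical, and reading `<XXX>` as a zero-padded three-digit sequence number is an assumption.

```python
from datetime import date


def kimai_activity_name(day: date, problem_title: str) -> str:
    """Build '<YYYY-MM> <problem_title>'."""
    return f"{day:%Y-%m} {problem_title}"


def milestone_name(kimai_activity: str) -> str:
    """Build the milestone title from an existing Kimai activity name."""
    return f"Delft Solutions Hosting Incident response work {kimai_activity}"


def lynx_id(day: date, sequence: int) -> str:
    """Build 'SRE<YYMM><XXX>'; XXX is assumed to be a zero-padded sequence number."""
    if not 0 <= sequence <= 999:
        raise ValueError("sequence must fit in three digits")
    return f"SRE{day:%y%m}{sequence:03d}"


if __name__ == "__main__":
    today = date(2024, 3, 5)
    activity = kimai_activity_name(today, "SSH down on host")
    print(activity)
    print(milestone_name(activity))
    print(lynx_id(today, 7))
```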
= Old procedure =
== Zulip migration ==
Following the migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes: