Incident Handling: Difference between revisions

Line 6:


==== Initial Response ====
# Acknowledge the trigger in Zabbix.
# Check whether the incident is still ongoing.
# If it is ongoing and clients are potentially affected, notify the affected clients via Slack.
# Document all actions in the Zulip topic.
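
The acknowledgement step can also be scripted. The following is a minimal sketch, assuming the Zabbix JSON-RPC API is used and that the event ID of the problem is already known. It only builds the request body for <code>event.acknowledge</code>; the function name, the example event ID and the message text are hypothetical, and sending the request (URL, authentication token) is left out because it depends on the local Zabbix setup and version.

```python
# Bitmask values defined by the Zabbix event.acknowledge method.
ACTION_ACKNOWLEDGE = 2   # acknowledge the event
ACTION_ADD_MESSAGE = 4   # attach a message to the event


def build_ack_request(event_id: str, message: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC body that acknowledges one event and adds a message.

    Hypothetical helper: authentication is intentionally omitted, since it
    differs between Zabbix versions and installations.
    """
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": [event_id],
            "action": ACTION_ACKNOWLEDGE | ACTION_ADD_MESSAGE,
            "message": message,
        },
        "id": request_id,
    }


if __name__ == "__main__":
    import json

    # Example values only; "12345" is not a real event ID.
    print(json.dumps(build_ack_request("12345", "FR is investigating"), indent=2))
```

The same message text can then be pasted into the Zulip topic so that the acknowledgement is documented in both places.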


==== Resolve the incident ====
# Create a plan of action.
# Execute the plan and document the results.
# If the incident is unresolved, create a new plan.
# When resolved:
## Verify that the trigger is no longer firing.
## Mark the Zulip topic as resolved if there are no other incidents for the host.
## Check for related triggers and resolve them.
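
The "when resolved" checks can be expressed as a small sketch. It assumes that Zulip marks a topic as resolved by prefixing the topic name with "✔ " (which is what the Zulip client does when a topic is resolved by hand). Both function names and their parameters are hypothetical; the actual rename would still be done in the Zulip client or through its API.

```python
RESOLVED_PREFIX = "✔ "  # assumed Zulip convention for resolved topics


def can_mark_resolved(trigger_firing: bool, other_open_incidents: int) -> bool:
    """Return True only if the trigger has stopped firing and the host
    has no other open incidents (the two conditions from the checklist)."""
    return (not trigger_firing) and other_open_incidents == 0


def resolved_topic_name(topic: str) -> str:
    """Return the topic name with the resolved prefix, without doubling it."""
    if topic.startswith(RESOLVED_PREFIX):
        return topic
    return RESOLVED_PREFIX + topic


if __name__ == "__main__":
    # "web01" is a made-up host name used only for illustration.
    if can_mark_resolved(trigger_firing=False, other_open_incidents=0):
        print(resolved_topic_name("web01"))
```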


==== Common issues ====
* SSH down: Check for MaxStartups throttling and apply the custom SSH config.
* No backup: Verify that the backup process is running and check the devteam email.
* HTTPS down on Sunday: This can be due to GitLab updates.
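
For the SSH case, a quick way to check for throttling is to search the sshd log for lines that mention MaxStartups. The sketch below assumes that sshd includes the word "MaxStartups" in its log line when it throttles or drops connections; the exact wording varies between OpenSSH versions, and the sample log lines are illustrative only. The function name is hypothetical.

```python
from typing import Iterable, List


def find_maxstartups_lines(log_lines: Iterable[str]) -> List[str]:
    """Return all log lines that mention MaxStartups (assumed throttling marker)."""
    return [line.rstrip("\n") for line in log_lines if "MaxStartups" in line]


if __name__ == "__main__":
    # Illustrative sample, not copied from a real host.
    sample = [
        "sshd[101]: Accepted publickey for deploy from 192.0.2.10 port 50022\n",
        "sshd[102]: error: beginning MaxStartups throttling\n",
        "sshd[103]: drop connection #10 from [192.0.2.11]:40000 past MaxStartups\n",
    ]
    for hit in find_maxstartups_lines(sample):
        print(hit)
```

If such lines show up, the limit configured by the <code>MaxStartups</code> option in <code>sshd_config</code> is the setting to compare against the custom SSH config.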


=== Non-Critical Incidents ===
Line 33:
## Add the Lynx project ID to the Zulip topic.
## Add the 🔁 emoji if the ID has already been reported.
# If no milestone exists:
## Add to the metrics sheet.
## Create a Lynx project (priority 99, then 20 after estimation).
Line 49:


# Acknowledge receipt.
# Classify the report as critical, non-critical or informational.
# Create a Zulip topic in SRE # Critical, SRE ## Non-Critical or SRE ### Informational (depending on the classification) and add sufficient details.
# Proceed with the checklist above for the type of incident.
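
The mapping from classification to channel can be sketched as follows. The channel names are taken from this page; the enum, the function name and the example topic are hypothetical. The returned dictionary uses the field names of the Zulip "send message" endpoint (<code>type</code>, <code>to</code>, <code>topic</code>, <code>content</code>), but the sketch does not send anything.

```python
from enum import Enum


class Classification(Enum):
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"
    INFORMATIONAL = "informational"


# Channel names as listed on this page.
CHANNELS = {
    Classification.CRITICAL: "SRE # Critical",
    Classification.NON_CRITICAL: "SRE ## Non-Critical",
    Classification.INFORMATIONAL: "SRE ### Informational",
}


def build_topic_message(classification: Classification, topic: str, details: str) -> dict:
    """Build the parameters for the first message of a new Zulip topic."""
    if not details.strip():
        # The checklist asks for sufficient details, so refuse an empty body.
        raise ValueError("details must not be empty")
    return {
        "type": "stream",
        "to": CHANNELS[classification],
        "topic": topic,
        "content": details,
    }


if __name__ == "__main__":
    # Made-up report used only for illustration.
    print(build_topic_message(Classification.NON_CRITICAL, "web01", "Disk usage above 80%"))
```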
 
=== Handover Steps ===
 
==== Acting FR: ====
* Add new FR to IPA group
* Enable Zabbix calling
* Document all active incidents
* Share special circumstances
 
==== New FR: ====
* Review SLA status
* Subscribe to channels:
** SRE - General
** SRE # Critical
** SRE ## Non-Critical
** SRE ### Informational
* Announce takeover in Organisational channel
* Remove old FR from IPA group
* Disable old FR's Zabbix calling


=== Naming Convention ===