Incident Handling: Difference between revisions

(2 intermediate revisions by the same user not shown)
Line 5: Line 5:
Critical incidents must be resolved within 16 hours.  


==== Initial Response ====
# Acknowledge trigger in Zabbix.
# Acknowledge trigger in Zabbix
# Check if incident is still ongoing.
# Check if incident is still ongoing
# If ongoing and clients are potentially affected, notify the affected clients via Slack.
# If ongoing, notify affected clients via Slack
# Document all actions taken in Zulip topic.
# Document all actions in Zulip topic
# Create plan of action.
 
# Execute plan and document results in Zabbix thread.
==== Resolution Process ====
# If unresolved, create new plan.
# Create plan of action
# Execute plan and document results
# If unresolved, create new plan
# When resolved:
## Verify trigger is no longer firing
## Verify trigger is no longer firing.
## Mark Zulip topic as resolved if no other incidents for host
## Mark Zulip topic as resolved if no other incidents for host.
## Check for related triggers and resolve them
## Check for related triggers and resolve them.


==== Common issues ====
Common issues that have occurred previously, and ''could'' occur again:
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* HTTPS down: May be due to Sunday GitLab updates
* HTTPS down on Sunday: this can be due to GitLab updates


=== Non-Critical Incidents ===
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.
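
The deadlines stated on this page (16 hours to resolve a critical incident; 9 hours to acknowledge and 1 week to resolve a non-critical one) can be turned into concrete timestamps. The following is a minimal sketch, not part of the documented procedure: the names <code>SLA</code> and <code>sla_deadlines</code> are hypothetical, and it assumes the windows are counted in wall-clock time from the moment the trigger fired, which the page does not specify.

```python
from datetime import datetime, timedelta

# SLA windows as stated on this page. The dictionary layout is illustrative.
# Assumption: windows run in wall-clock time, not business hours.
SLA = {
    "critical": {"resolve": timedelta(hours=16)},
    "non-critical": {
        "acknowledge": timedelta(hours=9),
        "resolve": timedelta(weeks=1),
    },
}


def sla_deadlines(severity, triggered_at):
    """Return a deadline per SLA stage for an incident of the given severity."""
    if severity not in SLA:
        raise ValueError(f"no SLA window defined for severity: {severity!r}")
    return {stage: triggered_at + window for stage, window in SLA[severity].items()}


if __name__ == "__main__":
    fired = datetime(2024, 3, 4, 9, 0)
    for stage, deadline in sla_deadlines("non-critical", fired).items():
        print(f"{stage}: {deadline:%Y-%m-%d %H:%M}")
```

If the windows are meant to exclude nights or weekends, the plain <code>timedelta</code> addition would need to be replaced by a business-hours calculation.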


# Acknowledge in Zabbix
# Acknowledge in Zabbix thread
# Check metrics sheet for existing milestone
# If a milestone exists:
## If a milestone exists:
## Add Lynx project ID to Zulip topic
### Add Lynx project ID to Zulip topic
## Add 🔁 emoji if ID already reported
### Add 🔁 emoji if ID already reported
# If no milestone:
## If no milestone exists:
## Add to metrics sheet
### Add to metrics sheet
## Create Lynx project (priority 99, then 20 after estimation)
### Create Lynx project (priority 99, then 20 after estimation)
## Create Kimai activity
### Create Kimai activity
## Document IDs in Zulip topic
### Document IDs in Zulip topic


=== Informational Incidents ===
Line 49: Line 46:


# Acknowledge receipt
# Classify criticality
# Classify report as critical, non-critical or informational.
# Create Zulip topic in appropriate channel
# Create a Zulip topic in the channel that matches the classification (SRE # Critical, SRE ## Non-Critical or SRE ### Informational) and add sufficient details.
# Follow standard process based on classification
# Proceed with checklist above for the type of incident.
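
The classification step above maps each report to exactly one Zulip channel. A minimal sketch of that lookup follows; the channel names are the ones listed on this page, while <code>CHANNELS</code> and <code>channel_for</code> are hypothetical names used only for illustration.

```python
# Zulip channels per classification, as named on this page.
CHANNELS = {
    "critical": "SRE # Critical",
    "non-critical": "SRE ## Non-Critical",
    "informational": "SRE ### Informational",
}


def channel_for(classification):
    """Return the Zulip channel in which the incident topic should be created."""
    key = classification.strip().lower()
    if key not in CHANNELS:
        raise ValueError(f"unknown classification: {classification!r}")
    return CHANNELS[key]


if __name__ == "__main__":
    print(channel_for("Non-critical"))
```

Rejecting unknown classifications, rather than falling back to a default channel, keeps a mistyped severity from silently landing in the wrong place.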
 
=== Handover Steps ===
 
==== Acting FR: ====
* Add new FR to IPA group
* Enable Zabbix calling
* Document all active incidents
* Share special circumstances
 
==== New FR: ====
* Review SLA status
* Subscribe to channels:
** SRE - General
** SRE # Critical
** SRE ## Non-Critical
** SRE ### Informational
* Announce takeover in Organisational channel
* Remove old FR from IPA group
* Disable old FR's Zabbix calling
 
=== Naming Convention ===
* Kimai: <YYYY-MM> <problem_title>
* Milestone: Delft Solutions Hosting Incident response work <kimai_activity_name>
* Lynx ID: SRE<YYMM><XXX>
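
The three naming patterns above can be generated mechanically. The sketch below is illustrative only: the function names are hypothetical, and it assumes that <code>XXX</code> in the Lynx ID is a zero-padded three-digit sequence number, which this page does not state explicitly.

```python
from datetime import date


def kimai_activity_name(opened, problem_title):
    """Kimai: <YYYY-MM> <problem_title>"""
    return f"{opened:%Y-%m} {problem_title}"


def milestone_name(kimai_name):
    """Milestone: Delft Solutions Hosting Incident response work <kimai_activity_name>"""
    return f"Delft Solutions Hosting Incident response work {kimai_name}"


def lynx_id(opened, sequence):
    """Lynx ID: SRE<YYMM><XXX>.

    Assumption: XXX is a zero-padded sequence number between 0 and 999.
    """
    if not 0 <= sequence <= 999:
        raise ValueError("sequence must fit in three digits")
    return f"SRE{opened:%y%m}{sequence:03d}"


if __name__ == "__main__":
    opened = date(2024, 3, 4)
    activity = kimai_activity_name(opened, "SSH down on example host")
    print(activity)
    print(milestone_name(activity))
    print(lynx_id(opened, 7))
```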


= Full procedure =