Incident Handling: Difference between revisions

Line 13: Line 13:
==== Resolve the incident ====
==== Resolve the incident ====
# Create plan of action.
# Create plan of action.
# Execute plan and document results.
# Execute plan and document results in Zabbix thread.  
# If unresolved, create new plan.
# If unresolved, create new plan.
# When resolved:
# When resolved:
Line 21: Line 21:


==== Common issues ====
==== Common issues ====
Common issues that have occurred previously, and ''could'' occur again:
* SSH down: Check MaxStartups throttling, apply custom SSH config
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* No backup: Verify backup process is running, check devteam email
Line 28: Line 29:
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.
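The two time limits above can be turned into concrete deadlines. The following is a minimal sketch, not part of the documented procedure: it assumes the clock starts when the incident is reported and that the limits are plain wall-clock durations (not business hours). The function name and constants are hypothetical.

```python
from datetime import datetime, timedelta

# Limits taken from the text above; interpreting them as wall-clock
# durations measured from the report time is an assumption.
ACK_WINDOW = timedelta(hours=9)
RESOLVE_WINDOW = timedelta(weeks=1)


def non_critical_deadlines(reported_at: datetime) -> tuple[datetime, datetime]:
    """Return (acknowledge_by, resolve_by) for a non-critical incident."""
    return reported_at + ACK_WINDOW, reported_at + RESOLVE_WINDOW


if __name__ == "__main__":
    ack_by, resolve_by = non_critical_deadlines(datetime(2024, 3, 4, 9, 0))
    print("Acknowledge by:", ack_by.isoformat())
    print("Resolve by:", resolve_by.isoformat())
```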


# Acknowledge in Zabbix
# Acknowledge in Zabbix thread
# Check metrics sheet for existing milestone
# Check metrics sheet for existing milestone
# If a milestone exists:
## If a milestone exists:
## Add Lynx project ID to Zulip topic
### Add Lynx project ID to Zulip topic
## Add 🔁 emoji if ID already reported
### Add 🔁 emoji if ID already reported
# If no milestone exists:
## If no milestone exists:
## Add to metrics sheet
### Add to metrics sheet
## Create Lynx project (priority 99, then 20 after estimation)
### Create Lynx project (priority 99, then 20 after estimation)
## Create Kimai activity
### Create Kimai activity
## Document IDs in Zulip topic
### Document IDs in Zulip topic


=== Informational Incidents ===
=== Informational Incidents ===
Line 51: Line 52:
# Classify report as critical, non-critical or informational.  
# Classify report as critical, non-critical or informational.  
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details.  
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details.  
# Proceed with checklist above for the type of incident.  
# Proceed with checklist above for the type of incident.
 
=== Naming Convention ===
* Kimai: <YYYY-MM> <problem_title>
* Milestone: Delft Solutions Hosting Incident response work <kimai_activity_name>
* Lynx ID: SRE<YYMM><XXX>
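The naming patterns above could be generated with a few small helpers. This is only an illustrative sketch: the function names are hypothetical, and it assumes that <code>XXX</code> in the Lynx ID is a zero-padded three-digit sequence number, which the convention does not state explicitly.

```python
from datetime import date


def kimai_activity_name(day: date, problem_title: str) -> str:
    """Kimai: <YYYY-MM> <problem_title>"""
    return f"{day:%Y-%m} {problem_title}"


def milestone_name(kimai_name: str) -> str:
    """Milestone: fixed prefix followed by the Kimai activity name."""
    return f"Delft Solutions Hosting Incident response work {kimai_name}"


def lynx_id(day: date, sequence: int) -> str:
    """Lynx ID: SRE<YYMM><XXX>.

    Assumption: XXX is a zero-padded sequence number between 0 and 999.
    """
    if not 0 <= sequence <= 999:
        raise ValueError("sequence must fit in three digits")
    return f"SRE{day:%y%m}{sequence:03d}"


if __name__ == "__main__":
    reported = date(2024, 3, 5)
    activity = kimai_activity_name(reported, "SSH down")
    print(activity)
    print(milestone_name(activity))
    print(lynx_id(reported, 7))
```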


= Full procedure =
= Full procedure =