Incident Handling: Difference between revisions

Line 5: Line 5:
Critical incidents must be resolved within 16 hours.  
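
As a minimal sketch of the 16-hour rule, the deadline can be computed from the time the trigger fired. The page does not say when the clock starts, so measuring from the trigger time is an assumption here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Resolution window for critical incidents, as stated above.
RESOLUTION_WINDOW = timedelta(hours=16)


def resolution_deadline(triggered_at):
    """Return the latest time by which the incident should be resolved.

    Assumption: the window is measured from the moment the trigger fired.
    """
    return triggered_at + RESOLUTION_WINDOW


def is_overdue(triggered_at, now):
    """Return True if the resolution window has already passed."""
    return now > resolution_deadline(triggered_at)


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)  # example value
    print("Resolve by:", resolution_deadline(fired).isoformat())
```

Using timezone-aware timestamps avoids confusion when responders work in different time zones.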


==== Initial Response ====
# Acknowledge trigger in Zabbix.
# Check if incident is still ongoing.
# If ongoing and clients are potentially affected, notify the affected clients via Slack.
# Document all actions in Zulip topic.
# Document all actions taken in Zulip topic.
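
The acknowledgement in the first step is normally done in the Zabbix web interface. For anyone scripting it, the sketch below only builds the JSON-RPC request body for the Zabbix API method `event.acknowledge`; it does not send anything. The event ID and message are made-up example values, the helper name is hypothetical, and authentication is left out on purpose because it differs between Zabbix versions.

```python
import json

# Bit flags for the "action" field of event.acknowledge.
ACK_EVENT = 2    # acknowledge the event
ADD_MESSAGE = 4  # attach a message to the event


def build_ack_request(event_id, message, request_id=1):
    """Build a JSON-RPC body that acknowledges one event with a message.

    Authentication (API token or session) is intentionally omitted and
    must be added according to the Zabbix version in use.
    """
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": [event_id],
            "action": ACK_EVENT | ADD_MESSAGE,
            "message": message,
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # "12345" is a placeholder event ID, not a real one.
    body = build_ack_request("12345", "Investigating, details in Zulip topic")
    print(json.dumps(body, indent=2))
```

Attaching a short message to the acknowledgement makes it easier to find the matching Zulip topic later.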
 
==== Resolve the incident ====
# Create plan of action.
# Execute plan and document results in Zabbix thread.  
Line 20: Line 17:
## Check for related triggers and resolve them.


==== Common issues ====
Common issues that have occurred previously, and ''could'' occur again:
* SSH down: Check MaxStartups throttling, apply custom SSH config
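
To illustrate why MaxStartups throttling can look like "SSH down": with a setting of the form `start:rate:full`, sshd starts refusing unauthenticated connections at random once `start` of them are pending, and refuses all of them at `full`. The sketch below follows the behaviour described in the `sshd_config` manual page; the function names are hypothetical, and the value `10:30:100` is only an example, not the custom configuration this page refers to.

```python
def parse_max_startups(value):
    """Parse a 'start:rate:full' MaxStartups value into three integers."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError("expected the form start:rate:full")
    start, rate, full = (int(p) for p in parts)
    return start, rate, full


def refusal_probability(unauthenticated, start, rate, full):
    """Probability that sshd refuses a new unauthenticated connection.

    Below `start` nothing is refused; at `start` the probability is
    rate/100; it then rises linearly to 1.0 at `full`.
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100


if __name__ == "__main__":
    start, rate, full = parse_max_startups("10:30:100")  # example value
    for pending in (5, 10, 55, 100):
        print(pending, refusal_probability(pending, start, rate, full))
```

If the number of pending unauthenticated connections is well above `start`, intermittent connection failures are expected even though the daemon itself is running.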