Incident Handling: Difference between revisions

→‎Critical Incidents: textual improvements
No edit summary
(→‎Critical Incidents: textual improvements)
Line 6: Line 6:


# Acknowledge trigger in Zabbix.
# Check if incident is still ongoing.
# Check if the incident is still ongoing.
# If ongoing and clients are potentially affected, notify the affected clients via Slack (when done, communicate this internally).
# Determine whether the incident is ongoing.
# Document all actions taken in Zulip topic.
# Determine whether clients are potentially affected; if so:
# Create plan of action.
## Notify the affected clients (Slack preferred).
## Share the message sent to the client in the incident Zulip thread.
# Document all actions taken in the Zulip topic.
# Create a plan of action.
# Execute the plan and document the results in the Zulip thread.
# If unresolved, create new plan.
# If unresolved, create a new plan.
# When resolved:
## Verify trigger is no longer firing.
## Decide on when to notify affected clients (that you have notified of the incident) the incident has been resolved, and communicate this internally
## Decide when to tell the clients you notified of the incident that it has been resolved, and communicate this internally.
## Mark Zulip topic as resolved if no other incidents for host.
## Mark Zulip topic as resolved if no other incidents for the host.
## Check for related triggers and resolve them.
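
The first step above (acknowledging in Zabbix) can also be done through the Zabbix JSON-RPC API instead of the web interface. The following is a minimal sketch, not a description of how this team does it: it only builds the request body for the <code>event.acknowledge</code> method and sends nothing. The function name, the event ID and the message text are made up for illustration, and authentication is left out because it differs between Zabbix versions. Note that the API acknowledges the problem ''event'' raised by a trigger, so the event ID has to be looked up first.

```python
import json

# Bit flags for the "action" parameter of event.acknowledge.
ACKNOWLEDGE = 2   # mark the event as acknowledged
ADD_MESSAGE = 4   # attach a message to the event


def build_ack_request(event_id: str, message: str, request_id: int = 1) -> dict:
    """Build the JSON-RPC body that acknowledges one problem event.

    Hypothetical helper: the caller still has to POST the result to the
    Zabbix API endpoint with whatever authentication the server expects.
    """
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": [event_id],
            "action": ACKNOWLEDGE | ADD_MESSAGE,
            "message": message,
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # Example values only; "12345" is not a real event ID.
    body = build_ack_request("12345", "Investigating, see the Zulip topic")
    print(json.dumps(body, indent=2))
```

Adding a message together with the acknowledgement keeps a pointer to the Zulip topic next to the event itself.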


Common issues that have occurred previously, and ''could'' occur again:
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* No backup: Verify backup process is running, check the devteam email
* HTTPS down on Sunday: this can be due to Gitlab updates
* HTTPS down on Sunday: this can be due to GitLab updates
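
For the SSH item above, it can help to see what a given <code>MaxStartups</code> value implies. The sketch below assumes the <code>start:rate:full</code> form described in <code>sshd_config(5)</code>: below <code>start</code> unauthenticated connections nothing is dropped, at <code>start</code> new connections are refused with probability <code>rate</code> percent, and the probability rises linearly until every connection is refused at <code>full</code>. The function names are illustrative, and the value <code>10:30:100</code> in the example is only a sample, not necessarily what is configured on any host here.

```python
from typing import Tuple


def parse_max_startups(value: str) -> Tuple[int, int, int]:
    """Parse a MaxStartups value into (start, rate, full).

    A single integer N is treated as a hard limit: (N, 100, N).
    """
    parts = [int(p) for p in value.strip().split(":")]
    if len(parts) == 1:
        return parts[0], 100, parts[0]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError(f"unexpected MaxStartups value: {value!r}")


def drop_probability(unauthenticated: int, start: int, rate: int, full: int) -> float:
    """Estimate the chance (0.0 to 1.0) that a new connection is refused."""
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full or rate >= 100:
        return 1.0
    # Linear ramp from rate% at `start` up to 100% at `full`.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100


if __name__ == "__main__":
    start, rate, full = parse_max_startups("10:30:100")
    for n in (5, 10, 55, 100):
        print(n, drop_probability(n, start, rate, full))
```

If the estimated probability is already high at the number of pending connections seen during an incident, throttling is a plausible cause of the "SSH down" trigger.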


=== Non-Critical Incidents ===