= New procedure =
=== Critical Incidents (16hr resolution) ===
==== Initial Response ====
# Acknowledge trigger in Zabbix
# Check if incident is still ongoing
# If ongoing, notify affected clients via Slack
# Document all actions in Zulip topic
==== Resolution Process ====
# Create plan of action
# Execute plan and document results
# If unresolved, create new plan
# When resolved:
#* Verify trigger is no longer firing
#* Mark Zulip topic as resolved if there are no other incidents for the host
#* Check for related triggers and resolve them
==== Common Issues ====
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* HTTPS down: May be due to Sunday GitLab updates
=== Non-Critical Incidents (9hr acknowledge, 1wk resolution) ===
# Acknowledge in Zabbix
# Check metrics sheet for existing milestone
# If milestone exists:
#* Add Lynx project ID to Zulip topic
#* Add 🔁 emoji if the ID was already reported
# If no milestone:
#* Add to metrics sheet
#* Create Lynx project (priority 99; change to 20 after estimation)
#* Create Kimai activity
#* Document IDs in Zulip topic
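
The branch on whether a milestone already exists can be sketched as a small checklist generator. This is only an illustration of the steps listed above: the function name <code>non_critical_steps</code> and its parameters are hypothetical, and it does not call Zabbix, Zulip, Lynx or Kimai.

```python
def non_critical_steps(milestone_exists: bool, id_already_reported: bool = False) -> list[str]:
    """Return the checklist for a non-critical incident (hypothetical helper)."""
    steps = [
        "Acknowledge in Zabbix",
        "Check metrics sheet for existing milestone",
    ]
    if milestone_exists:
        steps.append("Add Lynx project ID to Zulip topic")
        if id_already_reported:
            # The emoji marks an ID that was reported before.
            steps.append("Add 🔁 emoji")
    else:
        steps += [
            "Add to metrics sheet",
            "Create Lynx project (priority 99, then 20 after estimation)",
            "Create Kimai activity",
            "Document IDs in Zulip topic",
        ]
    return steps
```

Calling <code>non_critical_steps(milestone_exists=False)</code> returns the two common steps followed by the four steps for a new milestone.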
=== Informational Incidents (72hr acknowledge) ===
# Acknowledge in Zabbix
# Verify issue
# Take action if needed
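
The time limits in the three incident headings above can be turned into concrete deadlines. The sketch below is a minimal illustration under two assumptions that this page does not state: the clock starts when the trigger fires, and the limits are plain wall-clock hours rather than business hours. The names <code>SLA</code> and <code>sla_deadlines</code> are hypothetical.

```python
from datetime import datetime, timedelta

# Limits copied from the section headings.
# None means the heading states no limit for that step.
SLA = {
    "critical": {"acknowledge": None, "resolve": timedelta(hours=16)},
    "non-critical": {"acknowledge": timedelta(hours=9), "resolve": timedelta(weeks=1)},
    "informational": {"acknowledge": timedelta(hours=72), "resolve": None},
}


def sla_deadlines(severity: str, triggered_at: datetime) -> dict:
    """Return acknowledge/resolve deadlines for one incident.

    Assumption: limits are counted in wall-clock time from triggered_at.
    """
    limits = SLA[severity]
    return {
        step: (triggered_at + limit if limit is not None else None)
        for step, limit in limits.items()
    }
```

For example, under these assumptions a non-critical trigger that fires on a Monday at 08:00 must be acknowledged by 17:00 the same day and resolved by 08:00 the following Monday.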
=== External Reports ===
# Acknowledge receipt
# Classify criticality
# Create Zulip topic in appropriate channel
# Follow standard process based on classification
=== Handover Steps ===
==== Acting FR: ====
* Add new FR to IPA group
* Enable Zabbix calling
* Document all active incidents
* Share special circumstances
==== New FR: ====
* Review SLA status
* Subscribe to channels:
** SRE - General
** SRE # Critical
** SRE ## Non-Critical
** SRE ### Informational
* Announce takeover in Organisational channel
* Remove old FR from IPA group
* Disable old FR's Zabbix calling
=== Naming Convention ===
* Kimai: <YYYY-MM> <problem_title>
* Milestone: Delft Solutions Hosting Incident response work <kimai_activity_name>
* Lynx ID: SRE<YYMM><XXX>
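
The three patterns above can be generated consistently with a few small helpers. This is a sketch, not an official tool: the function names are hypothetical, and it assumes that <code>XXX</code> is a zero-padded three-digit sequence number and that the date parts refer to the month of the incident. Neither assumption is stated on this page.

```python
from datetime import date

MILESTONE_PREFIX = "Delft Solutions Hosting Incident response work"


def kimai_activity_name(month: date, problem_title: str) -> str:
    """Build '<YYYY-MM> <problem_title>'."""
    return f"{month:%Y-%m} {problem_title}"


def milestone_name(activity_name: str) -> str:
    """Build the milestone name from an existing Kimai activity name."""
    return f"{MILESTONE_PREFIX} {activity_name}"


def lynx_id(month: date, sequence: int) -> str:
    """Build 'SRE<YYMM><XXX>'.

    Assumption: XXX is a zero-padded sequence number from 0 to 999.
    """
    if not 0 <= sequence <= 999:
        raise ValueError("sequence must fit in three digits")
    return f"SRE{month:%y%m}{sequence:03d}"
```

With these assumptions, an incident in March 2025 with sequence number 7 would get the Lynx ID <code>SRE2503007</code>.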
= Old procedure =
== Zulip migration ==
Due to the migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes: