Incident Handling

= Checklist =
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material.
=== Critical Incidents ===
Critical incidents must be resolved within 16 hours.
# Acknowledge trigger in Zabbix.
# Check if incident is still ongoing.
# If ongoing and clients are potentially affected, notify the affected clients via Slack.
# Document all actions taken in Zulip topic.
# Create plan of action.
# Execute plan and document results in Zabbix thread.
# If unresolved, create new plan.
# When resolved:
## Verify trigger is no longer firing.
## Mark Zulip topic as resolved if no other incidents for host.
## Check for related triggers and resolve them.
Common issues that have occurred previously, and ''could'' occur again:
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* HTTPS down on Sunday: this can be due to Gitlab updates
=== Non-Critical Incidents ===
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.
# Acknowledge in Zabbix thread
# Check metrics sheet for existing milestone
## If a milestone exists:
### Add Lynx project ID to Zulip topic
### Add 🔁 emoji if ID already reported
## If no milestone exists:
### Add to metrics sheet
### Create Lynx project (priority 99, then 20 after estimation)
### Create Kimai activity
### Document IDs in Zulip topic
=== Informational Incidents ===
Informational incidents must be acknowledged within 72 hours.
# Acknowledge in Zabbix
# Verify issue
# Take action if needed
=== External Reports ===
# Acknowledge receipt
# Classify report as critical, non-critical or informational.
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details (see the example message after this list).
# Proceed with checklist above for the type of incident.
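If you prefer to create the Zulip topic from the command line, a minimal sketch using the Zulip messages API is shown below. It assumes a bot account with an API key; the bot address, topic, and content are placeholders, and creating the topic via the Zulip web UI is equally fine.
<pre>
# Posting to a stream with a new topic name creates that topic in Zulip.
# sre-bot@dsinternal.net and API_KEY are placeholders for a real bot account.
curl -s -X POST https://chat.dsinternal.net/api/v1/messages \
  -u sre-bot@dsinternal.net:API_KEY \
  --data-urlencode 'type=stream' \
  --data-urlencode 'to=SRE # Critical' \
  --data-urlencode 'topic=<hostname>: <short problem description>' \
  --data-urlencode 'content=External report received by email; details and initial classification: ...'
</pre>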
= Full procedure =
== Zulip migration ==
Due to the migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:
Fixing the underlying problem can be more complex. Sometimes a trade-off needs to be made between resolving technical debt and simply patching the current system. We usually look for a resolution that ensures the problem won't re-occur soon, or at least makes it unexpected/unlikely to re-occur. The timeframe available to resolve the incident determines which trade-offs are acceptable. An example: normal backups of VMs are failing because the Proxmox Backup Server is down/unreachable, and it is determined that this cannot be fixed at that moment. We could set up automatic backups to local storage temporarily, which resolves the immediate problem and keeps our SLOs, or we could set up a new Proxmox Backup Server at a different location. Since we don't have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and to set up a new Proxmox Backup Server later as a separate issue. A sketch of such a temporary local backup job follows below.
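As a minimal sketch of the temporary workaround from the example above, assuming Proxmox VE's vzdump is available and that 'local' is the name of a storage with enough space; the VM ID and schedule are illustrative, not values from our environment:
<pre>
# /etc/cron.d/temporary-local-backup -- remove once the Proxmox Backup Server is back.
# Back up VM 101 (placeholder ID) to the 'local' storage every night at 02:30.
30 2 * * * root vzdump 101 --storage local --mode snapshot --compress zstd --quiet 1
</pre>
Local backups don't protect against losing the host itself, so setting up a new Proxmox Backup Server should still be tracked as its own issue.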


===== Some known issues and their resolutions =====
* SSH service is down: The internet is a vile place. There's constant port scanning and hacking going on against any machine connected to the internet (mostly over IPv4). Because of this, SSH has a throttling feature built in to prevent a system from being overwhelmed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be confirmed with `journalctl -u ssh | grep 'MaxStartups throttling'` (you probably want to select a relevant time period with `--since "2 hours ago"` or something similar to avoid processing a month of logs), as shown in the sketch after this list. You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution is to apply our custom SSH configuration: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].
* No backup for 3 days: Our S3 backup is very slow. There isn't much of an underlying issue to prove here; what needs to be done is to check that the backup process is still ongoing. The Zabbix latest data can be checked to verify that backups are running, by confirming that that day's backups were done for the smaller buckets. The devteam email can be checked to see whether the backup process could not start on a given day because it was already running (it takes 24+ hours, and cron attempts to start it each day).
* git.* HTTPS is down: Mostly on Sundays, Gitlab gets automatically updated, which incurs some downtime while the service restarts. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it's longer. If the service does not stay down, the issue can simply be resolved.
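A minimal sketch of the SSH throttling check referenced in the first bullet; run it on the affected host and adjust the time window to the incident:
<pre>
# Limit the journal to a relevant window to avoid processing weeks of logs.
journalctl -u ssh --since "2 hours ago" --until "now" | grep 'MaxStartups throttling'

# Lines about "beginning MaxStartups throttling" and "exited MaxStartups throttling"
# mark the period in which new connections (including Zabbix's checks) may have been
# refused; compare those timestamps with the item data in Zabbix.
</pre>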
=== Acknowledging ===
Fully acknowledging a non-critical incident requires the following tasks to have been completed:
* Acknowledging the incident on Zabbix, which means you take responsibility for completing the steps listed below (a command-line sketch of acknowledging via the Zabbix API follows after this list).
 
 
The next steps don't have to be done immediately, as they have dependencies, but they must be started and scheduled for completion the next work day.
 
Check if there's already an uncompleted milestone for this host with this issue in the metrics sheet.
If a milestone is already present:
* Report in the topic the Lynx project ID for resolving this issue.
* If the ID has already been reported in the topic, we don't want to report it again and again; instead, add the 🔁 emoji (:repeat:) under the Zabbix bot alert
 
If a milestone is NOT already present:
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention below
** Start date is the date of the incident
** DoD states what needs to be true for the non-critical incident to be considered resolved
* Add the non-critical incident to Lynx as a project
** Follow the naming convention below for the title & project ID
** Tasks need to be added
** The final task needs to have the SLO deadline set as a 'constraint'
** Project priority is set to 99 while the project is not yet estimated; after estimation, the priority should be set to 20
** The tasks are estimated in SP (story points)
* The Lynx project ID is reported in the non-critical incident's topic on Zulip, and logged in the metrics sheet
* A Kimai activity is created in Kimai for the non-critical incident, following the naming convention below
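Acknowledging is normally done in the Zabbix frontend, but as a hedged sketch it can also be done through the Zabbix API's event.acknowledge method. The event ID and API token below are placeholders, and depending on the Zabbix version the token may need to be sent as an Authorization header instead of the auth field.
<pre>
# Acknowledge a problem event and add a message via the Zabbix JSON-RPC API.
# action 6 = acknowledge (2) + add message (4); <event-id> and <api-token> are placeholders.
curl -s -X POST https://status.delftinfra.net/zabbix/api_jsonrpc.php \
  -H 'Content-Type: application/json-rpc' \
  -d '{
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
          "eventids": "<event-id>",
          "action": 6,
          "message": "Taking responsibility; plan of action follows in the Zulip topic."
        },
        "auth": "<api-token>",
        "id": 1
      }'
</pre>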


==== Naming convention ====
* Kimai activity name needs to follow the pattern: '<YYYY-MM> <problem_title>'. For <problem_title>, incorporate the trigger title and hostname for clarity.
* Milestone name needs to follow the pattern: 'Delft Solutions Hosting Incident response work <kimai_activity_name>'
* Lynx project name needs to follow the pattern: 'Delft Solutions Hosting Incident response work <kimai_activity_name>'
* Lynx project ID needs to follow the pattern: 'SRE<YYMM><XXX>', where <XXX> is some three-letter shorthand that relates to the problem/host
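As an illustration of the convention (the date, trigger title, and shorthand below are made-up values, not a real incident):
<pre>
# Hypothetical incident: SSH down on git, March 2024.
YEAR_MONTH="2024-03"                           # <YYYY-MM>
PROBLEM_TITLE="SSH service is down on git"     # trigger title + hostname

KIMAI_ACTIVITY="${YEAR_MONTH} ${PROBLEM_TITLE}"
MILESTONE="Delft Solutions Hosting Incident response work ${KIMAI_ACTIVITY}"
LYNX_PROJECT="Delft Solutions Hosting Incident response work ${KIMAI_ACTIVITY}"
LYNX_PROJECT_ID="SRE2403SSH"                   # SRE<YYMM><XXX>, <XXX> relates to the problem/host

printf '%s\n' "$KIMAI_ACTIVITY" "$MILESTONE" "$LYNX_PROJECT" "$LYNX_PROJECT_ID"
</pre>
This yields, for example, the Kimai activity '2024-03 SSH service is down on git' and the Lynx project ID 'SRE2403SSH'.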


== Informational incidents ==
# If action needed, perform action


== If an incident is reported by other means than the Zabbix-Zulip integration ==
Besides the automated Zabbix-Zulip integration, incidents can also be reported through emails from cron jobs, direct emails from customers, or topics in SRE General (such as alerts about Zulip updates or issues raised by colleagues), etc.
# Acknowledge receipt.
# Classify the incident as critical, non-critical, or informational.
When handing over the responsibility of '''first responder''' (FR), the following needs to happen:
* The handover can be initiated by either the upcoming FR or the acting FR
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group (see the example command after this list) and enables Zabbix calling for the upcoming FR if they have that set, by going to Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails or opened topics in SRE General, etc...), updated with the latest status, and properly documented.
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.
* The upcoming FR makes sure they are subscribed to the right channels.
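A hedged sketch of the IPA group change, assuming the FreeIPA CLI is available and that 'jdoe' is the upcoming FR's account (a placeholder):
<pre>
# Run as a user with IPA admin rights; 'jdoe' is a placeholder account name.
kinit admin
ipa group-add-member sla-first-responder --users=jdoe

# Verify the membership afterwards.
ipa group-show sla-first-responder
</pre>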