= Checklist =
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material.
=== Critical Incidents ===
Critical incidents must be resolved within 16 hours.
# Acknowledge trigger in Zabbix (see the API sketch after this list).
# Check if incident is still ongoing.
# If ongoing and clients are potentially affected, notify the affected clients via Slack.
# Document all actions taken in Zulip topic.
# Create plan of action.
# Execute plan and document results in Zabbix thread.
# If unresolved, create new plan.
# When resolved:
## Verify trigger is no longer firing.
## Mark Zulip topic as resolved if no other incidents for host.
## Check for related triggers and resolve them.
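Acknowledging is normally done in the Zabbix frontend, but it can also be scripted against the Zabbix JSON-RPC API. The following is only a sketch: the event ID and the token in `ZABBIX_API_TOKEN` are placeholders, and the authentication mechanism differs between Zabbix versions.
<pre>
# Hedged sketch: acknowledge a Zabbix event and attach a message via the JSON-RPC API.
# The event ID and $ZABBIX_API_TOKEN are placeholders; newer Zabbix versions take the
# token as a Bearer header, older ones expect an "auth" field in the request body.
curl -s -X POST https://status.delftinfra.net/zabbix/api_jsonrpc.php \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${ZABBIX_API_TOKEN}" \
  -d '{
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
          "eventids": "12345",
          "action": 6,
          "message": "Investigating, see the Zulip topic for details."
        },
        "id": 1
      }'
# "action" is a bitmask: 2 = acknowledge, 4 = add message, so 6 does both.
</pre>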
Common issues that have occurred previously, and ''could'' occur again:
* SSH down: Check MaxStartups throttling, apply custom SSH config
* No backup: Verify backup process is running, check devteam email
* HTTPS down on Sunday: this can be due to Gitlab updates
=== Non-Critical Incidents ===
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.
# Acknowledge in Zabbix thread
# Check metrics sheet for existing milestone
## If a milestone exists:
### Add Lynx project ID to Zulip topic
### Add 🔁 emoji if ID already reported
## If no milestone exists:
### Add to metrics sheet
### Create Lynx project (priority 99, then 20 after estimation)
### Create Kimai activity
### Document IDs in Zulip topic
=== Informational Incidents ===
Informational incidents must be acknowledged within 72 hours.
# Acknowledge in Zabbix
# Verify issue
# Take action if needed
=== External Reports ===
# Acknowledge receipt
# Classify report as critical, non-critical or informational.
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical, or SRE ### Informational (depending on classification) and add sufficient details (see the API sketch after this list).
# Proceed with checklist above for the type of incident.
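Creating the Zulip topic is normally done in the web client; for reference, the same can be done with the Zulip REST API by posting the topic's first message. This is only a sketch: the bot account, API key, channel name, and topic name below are placeholders.
<pre>
# Hedged sketch: open a new incident topic by posting its first message via the Zulip API.
# The bot email, API key, channel name, and topic name are placeholders/examples.
curl -s -X POST https://chat.dsinternal.net/api/v1/messages \
  -u "incident-bot@dsinternal.net:${ZULIP_API_KEY}" \
  --data-urlencode "type=stream" \
  --data-urlencode "to=SRE # Critical" \
  --data-urlencode "topic=example.host.net: HTTPS down (external report)" \
  --data-urlencode "content=External report received by email from a customer; investigating."
</pre>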
= Full procedure =
== Zulip migration ==
Due to the migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:
Fixing the underlying problem can be more complex. Sometimes a trade-off needs to be made between resolving technical debt and simply patching the current system to resolve the issue. We usually look for a resolution that ensures the problem won't re-occur soon, or at least makes re-occurrence unlikely. Taking into account the timeframe available to resolve the incident, you can make some trade-offs. An example: normal backups of VMs are failing because the Proxmox Backup Server is down/unreachable, and it is determined that this cannot be fixed at that moment. We can temporarily set up automatic backups to local storage to resolve the immediate problem and keep our SLOs, versus setting up a new Proxmox Backup Server at a different location. Since we don't have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.
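To make the example concrete, such a temporary stop-gap could be a nightly vzdump job to local storage on the Proxmox node. This is only a sketch; the storage name, schedule, and options are assumptions, not our actual configuration.
<pre>
# Hedged sketch: temporary nightly backups to local storage while the Proxmox Backup
# Server is unreachable. Storage name, schedule, and options are placeholders.
# /etc/cron.d/local-vzdump-fallback
30 2 * * * root vzdump --all 1 --storage local --mode snapshot --compress zstd --quiet 1
</pre>
The stop-gap job and the local dumps should be cleaned up again once the new Proxmox Backup Server is in place.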
===== Some known issues and their resolutions =====
* SSH service is down: The internet is a vile place. There are constant port scans and hacking attempts against any machine connected to the internet (mostly IPv4). Because of this, SSH has a throttling mechanism built in to prevent a system from being DDoS'ed by the sheer volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be verified with `journalctl -u ssh | grep 'MaxStartups throttling'` (you probably want to limit the time period with `--since "2 hours ago"` or similar to avoid processing a month of logs); see the example below this list. You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution is to apply our custom SSH configuration: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here; what needs to be done is to check that the backup process is still ongoing. The Zabbix latest data can be used to verify that backups are running, by checking that that day's backups were done for the smaller buckets. The devteam email can be checked to see whether the backup process could not start on a given day because it was already running (a full run takes 24+ hours, and cron attempts to start it every day).
* git.* HTTPS is down: Mostly on Sundays, Gitlab gets automatically updated, which incurs some downtime while the service restarts. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it takes longer. If the service does not stay down, the issue can simply be resolved.
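For the SSH case above, the log check boils down to the following; the time window is just an example value.
<pre>
# Hedged sketch: confirm that sshd was throttling new connections (MaxStartups) around
# the time the Zabbix SSH checks failed. The time window is an example value.
journalctl -u ssh --since "2 hours ago" | grep 'MaxStartups throttling'
# Compare the throttling start/end timestamps with the timestamps of the failing item data.
</pre>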
# If action needed, perform action
== If an incident is reported by other means than the Zabbix-Zulip integration ==
Besides the automated Zabbix-Zulip integration, incidents can also be reported through emails from cron jobs, direct emails from customers, topics in SRE General (such as alerts about Zulip updates or issues raised by colleagues), and so on.
# Acknowledge receipt.
# Classify the incident as critical, non-critical, or informational.
When handing over the responsibility of '''first responder''' (FR), the following needs to happen:
* The handover can be initiated by either the upcoming FR or the acting FR
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group (see the sketch after this list), and enables Zabbix calling for the upcoming FR, if they have that set up, via Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails, opened topics in SRE General, etc.), updated with the latest status, and properly documented.
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.
* The upcoming FR makes sure they are subscribed to the right channels.
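The IPA group change from the handover steps above can be done with the FreeIPA CLI from any enrolled host; a minimal sketch, where the username is a placeholder:
<pre>
# Hedged sketch: add the upcoming first responder to the sla-first-responder group.
# The username is a placeholder; kinit as an account allowed to manage the group.
kinit admin
ipa group-add-member sla-first-responder --users=upcoming.fr
</pre>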