Line 31:
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is, however, entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or it can be because of a shared cause of failure. For example, a server crashing will cause that server's VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what is ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to
* Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident. It also stops Zabbix from posting reminders on Zulip.
==== Determine if incident is still ongoing ====
Line 51:
** SSL certificate is expiring in < 24 hours: This depends somewhat on how soon the incident is handled, but if it is handled quickly, the certificate never actually expires and there is no disruption to the client's service, so there is no need to communicate about it.
* Determining which clients are affected can be done by looking at the host's DNS name in the trigger, and/or by looking up the VM in Proxmox and checking the VM's tags for client names. If this issue is causing multiple other critical triggers to fire, you also have to check which clients are affected by those incidents.
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.
As always, report the decisions taken and actions made in the incident thread (e.g.: I've sent a message in the Slack to let Kaboom know that we are aware of problem x, and that we are investigating it).
Line 73:
* SSH service is down: The internet is a vile place. There is constant port scanning and there are ongoing hacking attempts against any machine connected to the internet (mostly over IPv4). Because of this, SSH has throttling functionality built in to prevent a system from being DDoS'ed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be verified with `journalctl -u ssh | grep 'MaxStartupsThrottling'` (you probably want to select a relevant time period with `--since "2 hours ago"` or something similar to avoid having to process a month of logging). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here. What needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running by checking that that day's backups were done for the smaller buckets. The devteam email can be checked to see whether the backup process could not start on a given day because it was already running (it takes 24+ hours, and an attempt to start it is made each day by cron).
* git.* HTTPS is down: Mostly on Sundays, GitLab is updated automatically, which incurs some downtime as the service is restarted. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it takes longer. If the service does not stay down, the incident can simply be resolved.
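The comparison described under "SSH service is down" can be sketched as follows. This is only a minimal illustration and not part of our tooling: the function name `failures_explained_by_throttling` and all sample timestamps are hypothetical. It assumes the throttling windows were read manually from the `journalctl` output and the failure times from the item's latest data in Zabbix.

```python
from datetime import datetime


def failures_explained_by_throttling(failures, windows):
    """Split failed-check timestamps by whether they fall inside a throttling window.

    failures: list of datetime, times at which the Zabbix SSH check failed
    windows:  list of (start, end) datetime pairs copied from the journalctl output
    """
    explained, unexplained = [], []
    for ts in failures:
        if any(start <= ts <= end for start, end in windows):
            explained.append(ts)
        else:
            unexplained.append(ts)
    return explained, unexplained


# Hypothetical sample data, for illustration only.
windows = [(datetime(2024, 1, 7, 3, 10), datetime(2024, 1, 7, 3, 25))]
failures = [
    datetime(2024, 1, 7, 3, 12),
    datetime(2024, 1, 7, 3, 18),
    datetime(2024, 1, 7, 4, 0),
]

explained, unexplained = failures_explained_by_throttling(failures, windows)
print(len(explained), "failure(s) inside a throttling window")
print(len(unexplained), "failure(s) not explained by throttling")
```

If any failures remain unexplained, throttling alone probably does not account for the trigger and the SSH service itself deserves a closer look.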
==== Resolve incident + cleanup ====
When you've executed and verified the resolution in the previous steps, we can proceed with resolving the issue in our Mattermost integration. Resolving an incident can be done as follows:
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved while the trigger is still firing. If the trigger is still firing but you're sure that you've resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host's configuration on Zabbix and selecting 'Execute Now'; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the item's latest data to see whether it was updated.
# Close the incident by marking the topic as resolved when there are no other triggers firing for the host.
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused them.
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (problem still persists, trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).
## Ensuring the mentioned problem is no longer observable.
## The trigger has resolved (you might need to force an update with `Execute Now`).
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.
## Closing the incident by marking the topic as resolved when there are no other triggers firing for the host.
When you are done, there should be no more critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration for which no one has taken responsibility, or for which you have taken responsibility but are not actively handling.
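The acknowledgement step above is normally done in the Zabbix web interface. If several related events need the same acknowledgement message, the Zabbix JSON-RPC method `event.acknowledge` could be used instead. The sketch below only builds the request body and assumes API access is available to you. The helper name `build_ack_request`, the event ID and the message text are hypothetical, and authentication and the endpoint URL are left out on purpose because they depend on the Zabbix version and our setup.

```python
import json

# Bit flags for the "action" parameter of event.acknowledge.
ACKNOWLEDGE = 2  # acknowledge the event
ADD_MESSAGE = 4  # attach a message to the acknowledgement


def build_ack_request(event_ids, cause, request_id=1):
    """Build a JSON-RPC body that acknowledges events and names the causing incident."""
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": list(event_ids),
            "action": ACKNOWLEDGE | ADD_MESSAGE,
            "message": f"Caused by: {cause}",
        },
        "id": request_id,
    }


# Hypothetical event ID and cause, for illustration only.
body = build_ack_request(["12345"], "router connectivity incident")
print(json.dumps(body, indent=2))
```

Combining the two flags keeps the acknowledgement and the explanatory message in a single call, so the cause is always recorded together with the acknowledgement.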