Incident Handling

# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.
# The trigger resolved itself and the problem can still be observed.
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing them to fire. A simple example would be that Zabbix reports that the DNS for a site can't be resolved, but in reality there's a bug in the script we wrote that checks whether the DNS resolves, and the DNS actually resolves fine. Final note: keep in mind that 'it works on my machine' does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice. (A sketch of such manual checks follows this list.)
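As a rough illustration of such manual checks (the hostnames below are placeholders, not our actual hosts), verifying the HTTPS and DNS examples by hand could look like this:
<pre>
# Does HTTPS actually respond? Prints the HTTP status code, or an error if the connection fails.
curl -sS -o /dev/null -w '%{http_code}\n' https://example.org/

# Does the DNS record actually resolve? Query both the default resolver and an
# external one, to rule out a purely local resolver problem.
dig +short example.org
dig +short example.org @1.1.1.1
</pre>
Where possible, run such checks from a vantage point comparable to the one Zabbix uses, to limit the 'works on my machine' effect mentioned above.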
 
To make sure you are actually observing the same thing the trigger is looking for, check the trigger definition and the current data of the associated item(s). Some triggers fire if any one of multiple conditions is met (such as a trigger that monitors the ping response time firing either when the value exceeds a certain threshold or when no data was observed for a certain period of time).
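As an illustration (this is example Zabbix trigger expression syntax, not necessarily our exact definition, and the host and threshold are placeholders), such a multi-condition ping trigger could look like:
<pre>
avg(/my-host/icmppingsec,5m)>0.15 or nodata(/my-host/icmppingsec,5m)=1
</pre>
The first clause fires on a high average response time, while the second fires when no ping data arrived at all in the last 5 minutes, so a healthy-looking ping alone does not disprove the trigger.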
 
Make sure to report your findings in the incident's thread. It's advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion on why the trigger is firing, and add your own observations/pings.)
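The observations you attach can be as simple as the output of a few commands run by hand, for example (the host name is a placeholder):
<pre>
# A handful of pings to the affected host; note the round-trip times and any packet loss,
# and compare them against the threshold in the trigger definition.
ping -c 10 my-host.example.org
</pre>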


==== Communicate to affected clients ====
* Determining which clients are being affected can be done by looking at the host's DNS in the trigger, and/or by looking up the VM in Proxmox and checking the tags of the VMs for client names (a command-line sketch follows this list). In the case that this issue is causing multiple other critical triggers to fire, you would have to check which clients are affected by those incidents as well.
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Mattermost.
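A sketch of how to identify the affected machine and its client tags (the IP, VM id, and hostnames are placeholders, and the exact fields returned depend on the Proxmox version):
<pre>
# Reverse-lookup the host/IP from the trigger to identify the machine or site
dig +short -x 203.0.113.10

# On a Proxmox node: list VMs with their tags (client names are kept in the tags)
pvesh get /cluster/resources --type vm --output-format json | jq '.[] | {name, tags}'

# Or inspect a single VM's config by its VM id
qm config 123 | grep -i tags
</pre>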
As always, report the decisions taken and actions performed in the incident thread. (e.g.: I've sent a message in Slack to let Kaboom know that we are aware of problem x, and that we are investigating it)
==== Communicate plan/next steps + Communicate findings/results of executed plan ====
This is the main part of handling an incident. There are several actions you can take in these steps, but at their core they consist of sharing your next steps, performing them, and reporting the results. The reason all of this needs to be reported is to ensure that all known information about a problem is logged, making it easier for someone else to be onboarded into the issue, useful for later reference if a similar issue is encountered, and even useful during the incident itself in case an older configuration needs to be referenced after you changed it.
The objective of these steps is to determine what is actually wrong and how to resolve it. Depending on the earlier observations on whether the incident is still ongoing and is (still) observable, your investigation can go in different directions. (e.g. finding the underlying cause for a trigger, or determining why the trigger is firing while it likely shouldn't, and then how to resolve that underlying cause or how to update the trigger to work better)
There are three main types of steps defined, but you are not limited to these:
# Hypothesis: If you have an idea what could be causing it, you state your hypothesis and your next step would be to prove that hypothesis. For example, for an incident 'SSH service is down on X' your hypothesis could be that this is due to 'MaxStartups' throttling, which can be proven by grep'ing journalctl for it and comparing the start and end times of the throttling with the timestamps of the item reporting the status of the SSH service.
# Information gathering: Sometimes it just helps to collect some facts about the situation. What information is useful and relevant depends on the trigger, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to an underlying problem in various levels of explicitness), the ping response from several hosts on the route to a host, or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what's wrong. (A sketch of typical commands follows this list.)
# Investigative: The most rigorous of the three. The full process was originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize, when you don't know why something is failing, and/or don't have any decent hypotheses to follow up, you can follow this process to systematically find the problem.
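As a sketch of typical information-gathering commands (the time window and hostname are examples, adjust them to the incident):
<pre>
# System log around the time of the incident; narrow the window to the trigger's timestamps
journalctl --since "2 hours ago" -p warning

# Network path towards the affected host, to spot where latency or loss starts
traceroute my-host.example.org
</pre>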
Some known issues and their resolutions:
* SSH service is down: The internet is a vile place. There are constant port scanning and hacking attempts against any machine connected to the internet (mostly over IPv4). Because of this, SSH has a throttling functionality built in to prevent a system from being DDoS'ed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with a `journalctl -u ssh | grep 'MaxStartupsThrottling'` (you probably want to select a relevant time period with `--since "2 hours ago"` or something similar to prevent having to process a month of logging; an expanded version of this check is sketched after this list). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration [https://chat.empiresmod.com/era/pl/oxiu4ark4t8e5paueftr981iyo Custom SSH Configuration].
* No backup for 3 days: Our S3 backup is very slow. There's not much to prove as an underlying issue here. What needs to be done is check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running by checking that that day's backups were done for the smaller buckets. The devteam email can be checked to see whether the backup process could not start on a given day because it was already running (it takes 24+ hours, and an attempt to start it is made each day by cron).
* git.* HTTPS is down: On Sunday, Git gets automatically updated, but this incurs some downtime. This is usually short enough not to be reported to Mattermost as per our settings, but sometimes it's longer. If the service does not stay down for more than 20 minutes, the issue can simply be resolved.
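For the SSH throttling check above, an expanded version could look like this (the exact log wording varies between OpenSSH versions, so fall back to grepping for the broader 'MaxStartups' if the exact pattern returns nothing; the unit may be `sshd` instead of `ssh` depending on the distribution):
<pre>
# Limit the window so you don't process weeks of logs, then look for throttling messages
journalctl -u ssh --since "2 hours ago" | grep -i 'MaxStartups'

# Compare the throttling start/end times found above with the timestamps of the
# Zabbix item that reports the SSH service status; if they overlap, the hypothesis holds.
</pre>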
==== Resolve incident ====


===Additional context===