Incident Handling

From Delft Solutions
Revision as of 05:53, 27 May 2024 by Dortund (talk | contribs)

Critical incidents

  • Critical incidents are resolved within 16 hours.

Process

The general process is made up of the following steps. Each step is described in more detail in the sections below.

  1. Take responsibility for seeing the incident resolved
  2. Determine if incident is still ongoing
  3. If ongoing: Communicate to affected clients that the issue is being investigated
  4. Communicate plan/next steps (even if that is gathering information)
  5. Communicate findings/results of executed plan, go back to previous step if not resolved
  6. Resolve incident

While working on an incident, all communication is expected to happen in the incident's thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident is resolved by work done for another incident. In that case, post a link to that other thread in the incident's thread, with a comment that the resolution is handled there.

Acknowledge the incident on Zabbix

The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is, however, entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or it can be because of a shared cause of failure. For example, a server crashing will cause that server's VMs to reboot, and the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what's ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to

  • Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident.

Determine if incident is still ongoing

The next step is to check if the reported problem is still ongoing. Depending on what you observe here, the process to follow and the steps needed to resolve the incident can change. There are three options:

  1. The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.
  2. The trigger resolved itself and the problem can still be observed.
  3. The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing it to fire. A simple example would be that Zabbix reports that the DNS for a site can't be resolved, but in reality there's a bug in the script we wrote to check the DNS, and the DNS resolves fine. Final note: keep in mind that 'it works on my machine' does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice.
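
The options above, plus the implicit fourth case where the trigger is firing and the problem is observable, can be laid out as a small lookup. This is only an illustrative sketch: the function name and the wording of the returned strings are assumptions and not part of our tooling.

```python
def assess(trigger_firing: bool, problem_observed: bool) -> str:
    """Map the two observations onto the situations described above."""
    if not trigger_firing and not problem_observed:
        return "resolved itself, not observable"
    if not trigger_firing and problem_observed:
        return "resolved itself, still observable"
    if trigger_firing and not problem_observed:
        return "firing, not observable: check the trigger itself"
    # Not listed as a separate option above: the ordinary ongoing incident.
    return "firing and observable: ongoing incident"


print(assess(trigger_firing=True, problem_observed=False))
```

Writing the outcome down in these terms in the incident's thread makes it clear which direction the investigation takes next.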

To make sure you are actually trying to observe the same thing the trigger is looking for, check the trigger definition and the current data of the associated item(s). Some triggers fire if any one of multiple conditions is met, such as a trigger that monitors the ping response time and fires if the value exceeds a certain threshold or if no data was observed for a certain period of time.
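
To illustrate why it matters which condition fired, here is a minimal sketch of a ping trigger with two conditions. It does not use Zabbix itself; the function name, the threshold, and the no-data window are hypothetical values chosen for the example, so always check the real trigger definition.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (time of measurement, ping response in ms)


def firing_reason(samples: List[Sample], now: datetime,
                  max_ms: float, nodata: timedelta) -> Optional[str]:
    """Return which condition would fire, or None if neither does."""
    # Keep only the samples that fall inside the no-data window.
    recent = [s for s in samples if now - s[0] <= nodata]
    if not recent:
        return "no data"
    # Compare the most recent sample against the threshold.
    latest = max(recent, key=lambda s: s[0])
    if latest[1] > max_ms:
        return "threshold exceeded"
    return None


now = datetime(2024, 5, 27, 12, 0)
samples = [(now - timedelta(minutes=20), 12.0)]
print(firing_reason(samples, now, max_ms=100.0, nodata=timedelta(minutes=10)))
```

In this example the last value is a healthy 12 ms, yet the sketch still reports a reason to fire, because the value is older than the no-data window. Pinging the host yourself would not reproduce that, which is why the item data needs to be checked as well.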

Make sure to report your findings in the incident's thread. It's advised to post a screenshot of the relevant item(s) together with your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion on why the trigger is firing, and add your own observations/pings.)

Communicate to affected clients

If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. This is because a critical incident usually means the service is down, which is something clients can notice or are affected by, so we need to be transparent that something is going on. There are some additional notes to this though:

  • If an incident has already resolved itself and the problem is no longer observable, we don't communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.
  • Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don't have an exhaustive list, but here are those I can think of:
    • SSH service is down: We don't have any clients that SSH into their services, so it's generally not a problem for them. SSH is mostly used for SRE maintenance and for publishing new builds. SRE maintenance is an internal matter, so there is no need to communicate it to the client. Publishing over SSH is done to Kaboom and the two SM VMs, so an outage there prevents new builds from being published.
    • No backup for x days: Clients don't notice if a backup is running late, so there is no need to communicate with clients. We just need to make sure the backup gets completed.
    • SSL certificate is expiring in < 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly, the certificate never actually expires and there has been no disruption to the client's service, so there is no need to communicate about it.
  • Determining which clients are affected can be done by looking at the host's DNS in the trigger, and/or by looking up the VM in Proxmox and checking the tags of the VMs for client names. If this issue is causing multiple other critical triggers to fire, you also have to check which clients are affected by those incidents.
  • Communicating to DS about ongoing incidents is usually assumed to have been done automatically, by the fact that the incident was reported on Mattermost.

As always, report the decisions taken and actions made in the incident thread. (e.g.: I've sent a message in the Slack to let Kaboom know that we are aware of problem x, and that we are investigating it)

Communicate plan/next steps + Communicate findings/results of executed plan

This is the main part of handling an incident. There are several actions you can take in these steps, but at their core they consist of sharing your next steps, performing them, and reporting the results. The reason all this needs to be reported is to ensure that all known information about a problem is logged. That makes it easier to onboard someone else into the issue, gives a reference for later if a similar issue is encountered, and even helps during the incident itself in case an older configuration needs to be referenced after you changed it. The objective of these steps is to determine what is actually wrong and how to resolve it. Depending on the earlier observations on whether the incident is still ongoing and (still) observable, your investigation can go in different directions. (e.g. finding the underlying cause for a trigger, or determining why the trigger is firing while it likely shouldn't, and then how to resolve that underlying cause or how to update the trigger to work better)

There are three main types of steps defined, but you are not limited to these:

  1. Hypothesis: If you have an idea of what could be causing the incident, you state your hypothesis, and your next step is to prove that hypothesis. For example, for an incident 'SSH service is down on X' your hypothesis could be that this is due to 'MaxStartups' throttling, which can be proven by 'grep'ing journalctl for that and comparing the start and end times of the throttling with the timestamps of the item reporting the status of the SSH service.
  2. Information gathering: Sometimes it just helps to collect some facts about the situation. Which information is useful and relevant depends on the triggers, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to an underlying problem, in various levels of explicitness), and the ping response from several hosts on the route to a host, or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what's wrong.
  3. Investigative: The most rigorous process. The full process was originally described in Drive - Final Coundown - General Investigative Process. To summarize: when you don't know why something is failing, and/or don't have any decent hypotheses to follow up on, you can follow this process to systematically find the problem.
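
As a sketch of the comparison in the hypothesis example, the snippet below pairs throttling start and end lines into windows and checks whether the failed SSH checks fall inside them. It assumes the log was exported with `journalctl -o short-iso`, and that sshd logs lines containing 'beginning MaxStartups throttling' and 'exited MaxStartups throttling'; verify that wording against your own logs. The sample lines and function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Window = Tuple[datetime, datetime]


def throttle_windows(lines: List[str]) -> List[Window]:
    """Pair 'beginning'/'exited' MaxStartups throttling lines into windows."""
    windows, start = [], None
    for line in lines:
        if "MaxStartups throttling" not in line:
            continue
        # short-iso output starts with a timestamp like 2024-05-27T03:10:00+0200
        stamp = datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S%z")
        if "beginning" in line:
            start = stamp
        elif "exited" in line and start is not None:
            windows.append((start, stamp))
            start = None
    return windows


def explained_by_throttling(failures: List[datetime], windows: List[Window]) -> bool:
    """True if there are failures and each one falls inside a throttling window."""
    return bool(failures) and all(
        any(begin <= f <= end for begin, end in windows) for f in failures
    )


# Made-up sample lines, not real log output.
log = [
    "2024-05-27T03:10:00+0200 host sshd[101]: error: beginning MaxStartups throttling",
    "2024-05-27T03:14:30+0200 host sshd[101]: error: exited MaxStartups throttling after 00:04:30, 52 connections dropped",
]
tz = timezone(timedelta(hours=2))
failed_checks = [datetime(2024, 5, 27, 3, 12, tzinfo=tz)]
print(explained_by_throttling(failed_checks, throttle_windows(log)))
```

If some failed checks fall outside every window, the hypothesis does not fully explain the incident and the next step should say so in the thread.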

Some known issues and their resolutions:

  • SSH service is down: The internet is a vile place. Port scanning and hacking attempts are constantly directed at any machine connected to the internet (mostly over IPv4). Because of this, SSH has throttling functionality built in to prevent a system from being DDoS'ed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with `journalctl -u ssh | grep 'MaxStartups'` (you probably want to select a relevant time period with `--since "2 hours ago"` or something similar, to avoid processing a month of logging). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration (see Custom SSH Configuration).
  • No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here. What needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running, by checking that that day's backups were done for the smaller buckets. The devteam email can be checked to see whether the backup process could not start on a given day because it was already running (it takes 24+ hours, and cron attempts to start it each day).
  • git.* HTTPS is down: On Sunday, Git gets automatically updated, which incurs some downtime. This is usually short enough not to be reported to Mattermost as per our settings, but sometimes it takes longer. If the service does not stay down for more than 20 minutes, the issue can simply be resolved.
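
The git.* rule of thumb can be written down as a small check. This is a sketch only: the function name is hypothetical, and requiring the outage to start on a Sunday is an assumption of this example, based on the update schedule mentioned above.

```python
from datetime import datetime, timedelta

MAX_UPDATE_DOWNTIME = timedelta(minutes=20)  # limit mentioned in the list above


def matches_weekly_update(down_from: datetime, down_until: datetime) -> bool:
    """True if the outage started on a Sunday and lasted at most 20 minutes."""
    on_sunday = down_from.weekday() == 6  # Monday is 0, Sunday is 6
    return on_sunday and (down_until - down_from) <= MAX_UPDATE_DOWNTIME


# 26 May 2024 was a Sunday.
print(matches_weekly_update(datetime(2024, 5, 26, 4, 0), datetime(2024, 5, 26, 4, 15)))
```

An outage that does not match this pattern still deserves the normal hypothesis or information-gathering steps.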

Resolve incident

Additional context

  • Critical incidents are posted in Infrastructure.
  • When it is being tracked on GitLab a heavy check mark is added to the message.
  • Responses on the thread and on GitLab are automatically synced (to some extent).
  • When you reply with "I agree that this has been fully resolved", our Zabbix-Mattermost integration will eventually pick this up and add a green check mark to the message.

Non-Critical incidents

  • Non-critical incidents are acknowledged within 9 hours and resolved within one week.

Checklist

  1. Acknowledge on Zabbix and state who is responsible for resolving this in the description
  2. Communicate plan/next steps (even if that is gathering information)
  3. Communicate findings/results of executed plan, go back to previous step if not resolved
  4. If there is no resolution to the incident, evaluate if the trigger needs updating/disabling
  5. Resolve incident

Informational incidents

  • Informational incidents are acknowledged within 72 hours
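
To make the time limits for the three severities concrete, here is a small sketch that turns them into deadlines. The numbers are taken from this page; starting the clock at the moment the incident is reported is an assumption of this example, and the names are hypothetical.

```python
from datetime import datetime, timedelta

# Targets as stated on this page; None means the page states no target.
SLA = {
    "critical": {"acknowledge": None, "resolve": timedelta(hours=16)},
    "non-critical": {"acknowledge": timedelta(hours=9), "resolve": timedelta(weeks=1)},
    "informational": {"acknowledge": timedelta(hours=72), "resolve": None},
}


def deadlines(severity: str, reported_at: datetime) -> dict:
    """Return the acknowledge/resolve deadlines for an incident, or None where unset."""
    return {
        step: (reported_at + limit if limit is not None else None)
        for step, limit in SLA[severity].items()
    }


print(deadlines("non-critical", datetime(2024, 5, 27, 8, 0)))
```

For example, under these assumptions a critical incident reported on a Monday at 08:00 would have to be resolved by midnight at the end of that day.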

Checklist

  1. Acknowledge on Zabbix
  2. Sanity check the event, post result in thread
  3. If action needed, perform action

If an incident is reported in the SRE channel by a human

  • Acknowledge receipt.
  • Classify the incident as critical, non-critical, or informational.
  • Create an issue and state that you've done so.

Handover

When handing over the responsibility of first responder (FR), the following needs to happen:

  • The acting FR adds the upcoming FR to the IPA sla-first-responder user group, and enables Zabbix calling for that person
  • The upcoming FR makes sure he is aware of the state of the SLA and knows what questions he wants to ask the acting FR

The following steps can be done async or in person:

  • The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Mattermost's organisational channel if async).
  • If the acting FR wants to hand over responsibility for any ongoing incident, he also states which incidents he wants the upcoming FR to take over.
  • If there are any particularities the upcoming FR needs to be aware of, he shares them then.
  • The upcoming FR asks his questions until he is satisfied and able to take over as FR
  • The upcoming FR announces that he is now the acting FR in Mattermost's organisational channel
  • The now acting FR removes the previous FR from the IPA sla-first-responder user group, and disables Zabbix calling for that person