Incident Handling
Critical incidents
- Critical incidents are resolved within 16 hours.
Process
The general process consists of the following steps. The sections below give additional information on how to handle each step.
- Take responsibility for seeing the incident resolved
- Determine if incident is still ongoing
- If ongoing: Communicate to affected clients that the issue is being investigated
- Communicate plan/next steps (even if that is gathering information)
- Communicate findings/results of executed plan, go back to previous step if not resolved
- Resolve incident
While working on an incident, all communication is expected to happen in the incident's thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident is resolved by work done in another incident. In that case, you are required to post a link to that other thread in the incident's thread, with a comment that the resolution is handled there.
Acknowledge the incident on Zabbix
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is, however, entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or it can be because of a shared cause of failure. For example, a server crashing will cause the VMs on that server to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what is going on (Zabbix is the best source of firing triggers for this) and pick the incident that is most likely the root cause.
- Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident.
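Acknowledging is normally done in the Zabbix web interface. For illustration only, the same action can be sketched as a request to the Zabbix JSON-RPC API using its event.acknowledge method. The function name, event ID, and message below are hypothetical. Authentication and the HTTP call itself are left out on purpose, since they depend on the Zabbix version and on how the installation is set up.

```python
import json

# Bit flags for the "action" field of Zabbix's event.acknowledge method.
ACTION_ACKNOWLEDGE = 2
ACTION_ADD_MESSAGE = 4


def build_acknowledge_request(event_id: str, message: str = "", request_id: int = 1) -> dict:
    """Build a JSON-RPC body that acknowledges a single Zabbix event.

    Only the payload is built here. Sending it (URL, credentials) is
    deliberately not shown, because that is installation specific.
    """
    params = {"eventids": [event_id], "action": ACTION_ACKNOWLEDGE}
    if message:
        # Adding a message requires setting the "add message" flag as well.
        params["action"] |= ACTION_ADD_MESSAGE
        params["message"] = message
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": params,
        "id": request_id,
    }


if __name__ == "__main__":
    # "12345" is a made-up event ID used purely as an example.
    print(json.dumps(build_acknowledge_request("12345", "FR is investigating"), indent=2))
```

Including a message is optional. It is shown here because stating who is responsible in the acknowledgement is part of the non-critical checklist.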
Determine if incident is still ongoing
The next step is to check if the reported problem is still ongoing. Depending on what you observe here, the process to follow and the steps needed to resolve the incident can change. There are three options:
- The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.
- The trigger resolved itself and the problem can still be observed.
- The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing the trigger to fire. A simple example would be that Zabbix reports that the DNS for a site can't be resolved, while in reality the DNS resolves fine and there is a bug in the script we wrote to check it.
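The options above combine two observations: whether the trigger is still firing, and whether the FR can observe the problem. A minimal sketch of that decision follows. All names are hypothetical, and the fourth combination (trigger firing and problem observable) is not one of the listed options; it is assumed here to be the plain case of an ongoing incident.

```python
from enum import Enum


class IncidentState(Enum):
    RESOLVED = "trigger resolved, problem not observable"
    STILL_BROKEN = "trigger resolved, problem still observable"
    SUSPECT_TRIGGER = "trigger firing, problem not observable"
    ONGOING = "trigger firing, problem observable"  # assumed fourth case


def classify(trigger_firing: bool, problem_observed: bool) -> IncidentState:
    """Map the two observations made by the FR onto one of the states."""
    if trigger_firing:
        return IncidentState.ONGOING if problem_observed else IncidentState.SUSPECT_TRIGGER
    return IncidentState.STILL_BROKEN if problem_observed else IncidentState.RESOLVED


if __name__ == "__main__":
    # Example: Zabbix still reports a DNS failure, but the DNS resolves fine.
    print(classify(trigger_firing=True, problem_observed=False).value)
```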
Communicate to affected clients
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. A critical incident usually means the service is down, which is something clients can notice or are affected by, so we need to be transparent that something is going on. There are some additional notes to this though:
- If an incident has already resolved itself and the problem is no longer observable, we don't communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.
- Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don't have an exhaustive list, but here are those I can think of:
- SSH Service is down: We don't have any clients that SSH into their services, so it's generally not a problem for them. SSH is mostly used for SRE maintenance and for publishing new builds. SRE maintenance is an internal matter, so there is no need to communicate to the client. Publishing is done to Kaboom and the two SM VMs, so an SSH outage there prevents new builds from being published.
- No backup for x days: Clients don't notice it if a backup is running late, so there is no need to communicate with clients. We just need to make sure the backup gets completed.
- SSL certificate is expiring in < 24 hours: This depends on how soon the incident is handled. If it is handled quickly, the certificate never actually expires and there is no disruption to the client's service, so there is no need to communicate about it.
- Determining which clients are affected can be done by looking at the host's DNS in the trigger, and/or by looking up the VM in Proxmox and checking the VM's tags for client names. If this issue is causing multiple other critical triggers to fire, you also have to check which clients are affected by those incidents.
- Communication to DS about ongoing incidents is usually assumed to have happened automatically, because the incident was reported on Mattermost.
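The lookup of affected clients through Proxmox tags can be sketched as follows. This is only an illustration: it assumes the VM list has already been fetched, that each VM's tags arrive as one semicolon-separated string (check this against your Proxmox version), and that a set of known client names exists. The host names, tags, and client names are made up.

```python
def clients_for_hosts(vms, affected_hosts, known_clients):
    """Return the sorted client names tagged on the affected VMs.

    vms: list of dicts with a "name" and an optional "tags" string.
    affected_hosts: set of VM names taken from the firing triggers.
    known_clients: set of lowercase client names (hypothetical list).
    """
    affected = set()
    for vm in vms:
        if vm.get("name") not in affected_hosts:
            continue
        raw_tags = vm.get("tags") or ""
        tags = {tag.strip().lower() for tag in raw_tags.split(";") if tag.strip()}
        # Only tags that match a known client count as client names.
        affected |= tags & known_clients
    return sorted(affected)


if __name__ == "__main__":
    example_vms = [
        {"name": "web-1", "tags": "client-a;prod"},
        {"name": "db-1", "tags": "client-b"},
    ]
    print(clients_for_hosts(example_vms, {"web-1"}, {"client-a", "client-b"}))
```

When one incident causes several other critical triggers to fire, passing all affected host names at once gives the combined set of clients to inform.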
Additional context
- Critical incidents are posted in Infrastructure.
- When an incident is being tracked on GitLab, a heavy check mark is added to the message.
- Responses on the thread and on GitLab are automatically synced (to some extent).
- When you reply with "I agree that this has been fully resolved", our Zabbix-Mattermost integration will eventually pick this up and add a green check mark to the message.
Non-Critical incidents
- Non-critical incidents are acknowledged within 9 hours and resolved within one week.
Checklist
- Acknowledge on Zabbix and state who is responsible for resolving this in the description
- Communicate plan/next steps (even if that is gathering information)
- Communicate findings/results of executed plan, go back to previous step if not resolved
- If there is no resolution to the incident, evaluate if the trigger needs updating/disabling
- Resolve incident
Informational incidents
- Informational incidents are acknowledged within 72 hours
Checklist
- Acknowledge on Zabbix
- Sanity check the event, post result in thread
- If action needed, perform action
If an incident is reported in the SRE channel by a human
- Acknowledge receipt.
- Classify the incident as critical, non-critical, or informational.
- Create an issue and state that you've done so.
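Once an incident is classified, the targets stated in the sections above (critical: resolved within 16 hours; non-critical: acknowledged within 9 hours and resolved within one week; informational: acknowledged within 72 hours) determine the deadlines. A minimal sketch follows, assuming the clock starts when the incident is reported and that the hours are plain wall-clock hours. Both are assumptions, and the names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Targets as stated in this document, per classification.
SLA_TARGETS = {
    "critical": {"resolve": timedelta(hours=16)},
    "non-critical": {"acknowledge": timedelta(hours=9), "resolve": timedelta(weeks=1)},
    "informational": {"acknowledge": timedelta(hours=72)},
}


def deadlines(classification: str, reported_at: datetime) -> dict:
    """Return the deadline for each target that applies to the classification."""
    targets = SLA_TARGETS[classification]  # raises KeyError for unknown classes
    return {name: reported_at + delta for name, delta in targets.items()}


if __name__ == "__main__":
    reported = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)  # example timestamp
    for name, due in deadlines("non-critical", reported).items():
        print(f"{name} by {due.isoformat()}")
```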
Handover
When handing over the responsibility of first responder (FR), the following needs to happen:
- The acting FR adds the upcoming FR to the IPA sla-first-responder user group, and enables Zabbix calling for that person
- The upcoming FR makes sure he is aware of the state of the SLA and knows what questions he wants to ask the acting FR
The following steps can be done async or in person:
- The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Mattermost's organisational channel if async).
- If the acting FR wants to hand over responsibility for any ongoing incident, he also states which incidents he wants the upcoming FR to take over.
- If there are any particularities the upcoming FR needs to be aware of, he shares them then.
- The upcoming FR asks his questions until he is satisfied and able to take over the FR role
- The upcoming FR announces/informs that he is now the acting FR over Mattermost's organisational channel
- The now acting FR removes the previous FR from the IPA sla-first-responder user group, and disables Zabbix calling for that person
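The group changes in the handover can be sketched with the FreeIPA command line tools, assuming the ipa CLI is what is used to manage the group. The sketch only builds the commands and does not run them. The user names are hypothetical, and enabling or disabling Zabbix calling is not covered because it depends on how Zabbix is configured.

```python
GROUP = "sla-first-responder"


def handover_commands(acting_fr: str, upcoming_fr: str) -> list:
    """Return the ipa commands for a handover, in the order they should run.

    The upcoming FR is added first and the previous FR is removed last,
    so the group is never empty during the handover.
    """
    if acting_fr == upcoming_fr:
        raise ValueError("acting and upcoming FR must be different people")
    return [
        ["ipa", "group-add-member", GROUP, f"--users={upcoming_fr}"],
        ["ipa", "group-remove-member", GROUP, f"--users={acting_fr}"],
    ]


if __name__ == "__main__":
    # "alice" and "bob" are placeholder logins.
    for command in handover_commands(acting_fr="alice", upcoming_fr="bob"):
        print(" ".join(command))
```

The removal is meant to run only after the upcoming FR has announced that he is now the acting FR, matching the order of the steps above.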