Incident Handling

* Critical incidents are resolved within 16 hours.


=== Process ===
The general process is made up of the following steps. Each step has additional information on how to handle/execute them in the sections below.
# Take responsibility for seeing the incident resolved
# Determine if incident is still ongoing
# If ongoing: Communicate to affected clients that the issue is being investigated
# Communicate plan/next steps (even if that is gathering information)
# Communicate findings/results of executed plan, go back to previous step if not resolved
# Resolve incident
While working on an incident, it is expected that all communication happens in the incident's thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident can be resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident's thread with a comment that the resolution is done there.
==== Acknowledge the incident on Zabbix ====
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is, however, entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or it can be because of a shared cause of failure. For example, a server crashing will cause that server's VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what's ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to work on first.
* Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident.
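Acknowledgement can also be scripted through Zabbix's JSON-RPC API using the '''event.acknowledge''' method. A minimal sketch of building the request body — the endpoint URL, API token, and event ID below are placeholders, not values from this wiki:

```python
# Sketch: build the JSON-RPC body for Zabbix's event.acknowledge method.
# The event ID and message are placeholders; adapt to the actual incident.

def build_ack_payload(event_id: str, message: str, request_id: int = 1) -> dict:
    """Return a Zabbix event.acknowledge request body.

    action is a bitmask: 2 = acknowledge the event, 4 = add a message,
    so 6 does both at once.
    """
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": event_id,
            "action": 6,  # acknowledge (2) + add message (4)
            "message": message,
        },
        "id": request_id,
    }

# State who is taking responsibility, as the process requires.
payload = build_ack_payload("10045", "Taken by the First Responder, investigating.")
# Sending is then e.g. (hypothetical URL/token):
#   requests.post("https://zabbix.example/api_jsonrpc.php", json=payload,
#                 headers={"Authorization": "Bearer <api-token>"})
```

Acknowledging via the web UI remains the simplest route; a script like this only helps when handling many simultaneous triggers.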
==== Determine if incident is still ongoing ====
The next step is to check if the reported problem is still ongoing. Depending on the observations made here, the process to follow and the steps needed to resolve the incident can change. There are three options:
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.
# The trigger resolved itself and the problem can still be observed.
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing it to fire. A simple example would be that Zabbix reports that the DNS for a site can't be resolved, but in reality there's a bug in the script we wrote to check DNS resolution, and the DNS actually resolves fine.
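The three options amount to a small decision table over two checks: is the trigger still firing, and can the problem still be observed? An illustrative sketch — the function name and return labels are made up for this page, not part of any tooling:

```python
# Hypothetical triage helper for the two observations described above.

def triage(trigger_firing: bool, problem_observed: bool) -> str:
    """Map (trigger state, observed state) to one of the cases above."""
    if problem_observed:
        # The problem is real whether or not the trigger already cleared:
        # treat it as an ongoing incident and continue the process.
        return "ongoing"
    if trigger_firing:
        # Trigger fires but nothing is observably wrong:
        # suspect the check itself (e.g. a bug in our monitoring script).
        return "false-positive"
    # Trigger cleared and nothing is observably wrong.
    return "self-resolved"
```

For example, `triage(False, False)` returns `"self-resolved"`, the case where no client communication is needed.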
==== Communicate to affected clients ====
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. This is because critical incidents usually mean the service is down, something the clients notice or are affected by, so we want to be transparent that something is going on. There are some additional notes to this though:
* If an incident has already resolved itself and the problem is no longer observable, we don't communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.
* Although a critical incident generally means that the client's service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don't have an exhaustive list, but here are those I can think of:
** SSH Service is down: We don't have any clients that SSH into their services, so it's generally not a problem for them. SSH is mostly used for SRE maintenance and publishing new builds. The SRE maintenance is an internal problem, so there is no need to communicate to the client. Publishing is done to Kaboom and the two SM VMs, so an SSH outage there prevents new builds from being published.
** No backup for x days: Clients don't notice if a backup is running late, so there is no need to communicate with clients. We just need to make sure the backup gets completed.
** SSL certificate is expiring in < 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly, the certificate never actually expires and there is no disruption to the client's service, so there is no need to communicate about it.
* Determining which clients are affected can be done by looking at the host's DNS name in the trigger, and/or looking up the VM in Proxmox and checking the VM's tags for client names. If this issue is causing multiple other critical triggers to fire, you would have to check which clients are affected by those incidents as well.
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Mattermost.
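The tag lookup can be sketched in code. Proxmox's <code>/cluster/resources</code> API returns VM records whose <code>tags</code> field is a semicolon-separated string; given those records, collecting the client tags on the hosts involved in the incident is a set intersection. The VM names, tags, and client list below are invented examples:

```python
# Illustrative sketch: pick client names out of Proxmox VM tags.
# Records mimic the shape of Proxmox's /cluster/resources output,
# where "tags" is a semicolon-separated string; all values are made up.

def affected_clients(vms: list[dict], incident_hosts: set[str],
                     known_clients: set[str]) -> set[str]:
    """Collect client tags on the VMs named in incident_hosts."""
    clients: set[str] = set()
    for vm in vms:
        if vm.get("name") in incident_hosts:
            tags = set(vm.get("tags", "").split(";"))
            clients |= tags & known_clients  # keep only client-name tags
    return clients

vms = [
    {"name": "web-1", "tags": "acme;production"},
    {"name": "db-1", "tags": "acme;globex"},
    {"name": "ci-1", "tags": "internal"},
]
print(sorted(affected_clients(vms, {"web-1", "db-1"}, {"acme", "globex"})))
# → ['acme', 'globex']
```

In practice the incident host names come from the firing trigger(s), so when one root cause fires several triggers, the union over all of them gives the full affected-client list.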


===Additional context===
* When it is being tracked on GitLab a heavy check mark is added to the message.
* Responses on the thread and on GitLab are automatically synced (to some extent)
* When you reply with '''I agree that this has been fully resolved''' eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.


== Non-Critical incidents ==