== Critical incidents ==
* Critical incidents are resolved within 16 hours.
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are required to do all the work yourself. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.
Involving multiple people can quickly become necessary if multiple critical incidents with different causes occur simultaneously. In that case, the first responder usually takes on more of an information-management role and steers those who are brought in towards resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quickly be determined to be a single issue, the crashed server, so you wouldn't need to call in people to manage each incident. But a client's service being down in one cluster while a VM in a different cluster no longer boots are likely two different issues, so you'd want to call in help to resolve both incidents in time.)
=== Process ===
# Communicate plan/next steps (even if that is gathering information)
# Communicate findings/results of executed plan, go back to previous step if not resolved
# Resolve incident + cleanup
While working on an incident, all communication is expected to happen in the incident's thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident is resolved by work done in another incident. In that case, post a link to that thread in the incident's thread with a comment that the resolution is done there.
# Information gathering: Sometimes it simply helps to collect some facts about the situation. What information is relevant depends on the triggers, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can reference the underlying problem with varying levels of explicitness), the ping responses from several hosts on the route to a host, or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis about what's wrong (see the sketch after this list).
# Investigative: The most rigorous process. The full process is originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize: when you don't know why something is failing, and/or don't have any decent hypotheses to follow up on, you can follow this process to systematically find the problem.
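A minimal information-gathering sketch, assuming SSH access to the affected host; the hostname and time window below are hypothetical placeholders, not part of our tooling:
<pre>
#!/usr/bin/env bash
# Hypothetical placeholders: fill in the affected host and the incident time window.
HOST="vm42.example.internal"
SINCE="2024-01-01 03:00"
UNTIL="2024-01-01 04:00"

# Journal around the time of the incident; it can reference the underlying problem.
ssh "$HOST" "journalctl --since '$SINCE' --until '$UNTIL' --no-pager" | tail -n 200

# Basic reachability checks; a failing hop points towards a network issue.
ping -c 5 "$HOST"
traceroute "$HOST"
</pre>
Paste the relevant output into the incident's thread so the collected facts stay in one place.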
Regarding the resolution of an incident: the resolution to any incident is usually one of two things:
# Fix the underlying problem.
# Fix the trigger itself.
Fixing the trigger is relatively straightforward, but make sure to document in the thread which trigger you changed and how.
Fixing the underlying problem can be more complex. A trade-off sometimes needs to be made between resolving technical debt and simply patching the current system to resolve the issue. We usually look for a resolution that ensures the problem won't re-occur soon, or at least makes it unexpected/unlikely to re-occur. Taking into account the timeframe available to resolve the incident, you can make some trade-offs. An example: normal backups of VMs are failing because the Proxmox Backup Server is down/unreachable, and it is determined that this cannot be resolved at that moment. We can set up automatic backups to local storage temporarily to resolve the immediate problem and keep our SLOs, versus setting up a new Proxmox Backup Server at a different location. Since we don't have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.
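As a rough sketch of the "temporary local backups" workaround from the example above, assuming Proxmox VE with a storage named `local`; the VM ID and schedule are made-up examples, not part of this page:
<pre>
# One-off: verify that a snapshot-mode backup of VM 100 to local storage works.
vzdump 100 --storage local --mode snapshot --compress zstd

# Temporary schedule (e.g. a line in /etc/cron.d/tmp-local-backups) until the
# Proxmox Backup Server is reachable again; remove it once the PBS is restored.
# 30 2 * * * root vzdump 100 --storage local --mode snapshot --compress zstd --quiet 1
</pre>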
Some known issues and their resolutions:
* git.* HTTPS is down: On Sunday, Git gets automatically updated, which incurs some downtime. This is usually short enough not to be reported to Mattermost as per our settings, but sometimes it's longer. If the service does not stay down for more than 20 minutes, the issue can simply be resolved.
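If you want to confirm the service came back up on its own before resolving, a quick check from the command line is enough (the hostname below is a placeholder for the actual git host):
<pre>
# Prints only the HTTP status code; 200 or a 30x means the HTTPS frontend is reachable again.
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 https://git.example.com/
</pre>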
==== Resolve incident + cleanup ====
When you've executed and verified the resolution in the previous steps, you can proceed to resolve the issue in our Mattermost integration. Resolving an incident is done as follows:
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved while that is the case. If the trigger is still firing but you're sure you've resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host's configuration in Zabbix and selecting 'Execute Now'; after a short period this forces Zabbix to re-execute the item. You can check the timestamps in the item's latest data to see whether it was updated (see the sketch below for a command-line alternative).
# Type the magic string 'I agree that this has been fully resolved' in the thread. During the next iteration of the Zabbix-Mattermost integration the incident will be closed.
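As a command-line alternative to checking the trigger in the web UI, the Zabbix JSON-RPC API can list currently unresolved problems. This is a sketch only; the frontend URL and the API token are assumptions and need to match our setup (token-in-header auth requires a reasonably recent Zabbix version):
<pre>
# List currently unresolved problems; if the one you worked on is gone, the trigger has cleared.
curl -sS -X POST "https://zabbix.example.com/api_jsonrpc.php" \
  -H "Content-Type: application/json-rpc" \
  -H "Authorization: Bearer $ZABBIX_API_TOKEN" \
  -d '{
        "jsonrpc": "2.0",
        "method": "problem.get",
        "params": {"output": ["eventid", "name", "clock"], "recent": false},
        "id": 1
      }'
</pre>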
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Mattermost for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.
# First, these incidents need to be acknowledged in Zabbix, and in the acknowledgement message you mention the incident/problem that caused them (see the sketch after this list).
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, or the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).
## Ensure the mentioned problem is no longer observable.
## Verify the trigger has resolved (you might need to force an update with `Execute Now`).
## Post a link to the main incident you resolved, with the comment that the underlying problem was resolved in that thread.
## Close the incident with the magic string `I agree that this has been fully resolved`.
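Acknowledging the related events can be done in the Zabbix web UI, but as a sketch the same can be done through the API (same assumptions about URL and token as above; the event ID and message are placeholders):
<pre>
# Acknowledge a related event and attach a message pointing at the causing incident.
# The action value is a bitmask: 2 = acknowledge, 4 = add message, so 6 does both.
curl -sS -X POST "https://zabbix.example.com/api_jsonrpc.php" \
  -H "Content-Type: application/json-rpc" \
  -H "Authorization: Bearer $ZABBIX_API_TOKEN" \
  -d '{
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {"eventids": "12345", "action": 6, "message": "Caused by <link to the main incident thread>"},
        "id": 1
      }'
</pre>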
When you are done, there should be no critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration for which no-one has taken responsibility, or which you have taken responsibility for but are not actively handling.
===Additional context===
# If action needed, perform action
== If an incident is reported in the SRE channel by means other than the Zabbix-Mattermost integration ==
# Acknowledge receipt.
# Classify the incident as critical, non-critical, or informational.
# Create an issue using the `[https://chat.empiresmod.com/era/pl/xt669m4fcprfbciwee9iiq5xte ?track]` command and state that you've done so.
# Proceed to treat the incident according to the criticality you just classified it as (so for a critical incident, you now start the critical incident handling process).
== Handover ==
When handing over the responsibility of first responder (FR), the following needs to happen:
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group (Optional: and enables Zabbix calling for that person if they have that set)
* The upcoming FR makes sure they are aware of the state of the SLA and what questions they want to ask the acting FR.
The following steps can be done async or in person:
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Mattermost's organisational channel if async).
* If the acting FR wants to hand over responsibility for any ongoing incidents, they also state which incidents they want the upcoming FR to take over.
* If there are any particularities the upcoming FR needs to be aware of, those are shared.
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role.
* The upcoming FR announces over Mattermost's organisational channel that they are now the acting FR.
* The now acting FR removes the previous FR from the IPA sla-first-responder user group (Optional: and disables Zabbix calling for that person if they had that enabled). Both group changes can be made with the commands sketched after this list.
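Assuming the sla-first-responder group is managed through the standard FreeIPA CLI (the usernames below are placeholders), the group changes look roughly like this:
<pre>
# During handover: add the upcoming FR to the group.
ipa group-add-member sla-first-responder --users=upcoming.fr

# After the handover announcement: remove the previous FR from the group.
ipa group-remove-member sla-first-responder --users=previous.fr
</pre>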