Incident Handling

= This is the process =
This document is an authoritative description of the process. This document supersedes all prior documents on the process.

= Deviating from the process =
You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate.


= Checklist =
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You're encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material.

=== Critical Incidents ===
Critical incidents must be resolved within 16 hours.

# Acknowledge the trigger in Zabbix.
# Determine whether the incident is still ongoing.
# If the report came in via SRE - Report:
## Keep that thread open until the incident is resolved.
## Post a link to the SRE - Report thread in any related underlying technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational.
# Determine whether clients are potentially affected. If so:
## Notify the affected clients (Slack preferred, if available).
## Share the message sent to the client in the incident Zulip thread.
# Document all actions taken in the Zulip topic.
# Create a plan of action.
# Execute the plan and document the results in the Zabbix thread.
# If unresolved, create a new plan.
# When resolved:
## Verify the trigger is no longer firing.
## Decide when to notify the affected clients (those you notified of the incident) that the incident has been resolved, and communicate this decision internally.
## Mark the Zulip topic as resolved if there are no other incidents for the host.
## Check for related triggers and resolve them.
## If there were any SRE - Report threads:
### Post a summary describing the high-level incident, that it is resolved, and how it was resolved.
### Post that summary message to any client channels, such as Slack, as well.
### Close the thread in SRE - Report.


Note: we do not accept discussions on the how or why of incident response in the SRE - Report channel; those should be redirected to either the Retro or Organisational channels. The only reason to reopen a thread in SRE - Report should be to report that there's still impact and the incident has been resolved prematurely.


=== Non-Critical Incidents ===
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.

# Acknowledge in Zabbix thread
# Check metrics sheet for existing milestone
## If a milestone exists:
### Add Lynx project ID to Zulip topic
### Add 🔁 emoji if ID already reported
## If no milestone exists:
### Add to metrics sheet
### Create Lynx project (priority 99, then 20 after estimation)
### Create Kimai activity
### Document IDs in Zulip topic


=== Informational Incidents ===
Informational incidents must be acknowledged within 72 hours.

# Acknowledge in Zabbix


# Acknowledge receipt.
# Classify the report as critical, non-critical or informational.
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details.
# Proceed with the checklist above for the type of incident.


= Full procedure =

== General Rules ==
# When an incident is in progress and person A is handling it, all incidents in area X are handled by person A rather than the FR, unless person A's working day ends. Person A should communicate clearly to the FR when their day is over.
# The FR always has the last word on what solution to apply for resolving an incident.


== Critical incidents ==
'''Critical incidents are resolved within 16 hours.'''


As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are the person required to do all the work. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.
Involving multiple people can quickly become necessary if multiple critical incidents with different causes occur simultaneously. In that case, the First Responder usually takes on a more information-management role and steers those who are brought in towards resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quite quickly be determined to be a single issue, the crashed server, so you wouldn't need to call in people to manage each incident. But a client's service being down in one cluster while in a different cluster a different VM no longer boots is likely to be two different issues, so in order to resolve both in time you'd want to call in help.)


=== Process ===
The general process is made up of the following steps. Each step has additional information on how to handle/execute them in the sections below.
# Take responsibility for seeing the incident resolved
# Determine if incident is still ongoing


==== Acknowledge the incident on Zabbix ====
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is however entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or can be because of a shared cause of failure. For example, a server crashing will cause its VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what's ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to acknowledge first.


* Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident, and stops Zabbix from posting reminders on Zulip.
* If an incident has already resolved itself and the problem is no longer observable, we don't communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.
* Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don't have an exhaustive list, but here are those I can think of:
** SSH Service is down: We don't have any clients that SSH into their services, so it's generally not a problem. But SSH is mostly used for SRE maintenance and publishing new builds. The SRE maintenance is an internal problem, so there is no need to communicate to the client. The publishing is done to Kaboom, which means new builds cannot be published, and to the two SM VMs.
** No backup for x days: Clients don't notice it if a backup is running late, so no need to communicate with clients. Just make sure the backup gets completed.
** SSL certificate is expiring in < 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly, the certificate never actually expired and there has not been any disruption to the client's service, so there is no need to communicate about it.
* Determining which clients are being affected can be done by looking at the host's DNS in the trigger, and/or by looking up the VM in Proxmox and checking the tags of the VMs for client names (see the example after this list). In the case that this issue is causing multiple other critical triggers to fire, you would have to check which clients are affected by those incidents.
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.
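
A minimal sketch of that client lookup, assuming you have shell access to the relevant Proxmox node; the host name `affected-host.example.net` and VM ID `123` are placeholders:

 # Resolve the host named in the trigger to confirm which address it points at.
 dig +short affected-host.example.net
 
 # On the Proxmox node: list the VMs, then inspect the tags of the relevant VM
 # (the client names are kept in the VM tags).
 qm list
 qm config 123 | grep -i tags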


As always, report the decisions taken and actions made in the incident thread. (e.g.: I've sent a message in Slack to let Kaboom know that we are aware of problem x, and that we are investigating it.)


==== Communicate plan/next steps + Communicate findings/results of executed plan ====
# Hypothesis: If you have an idea what could be causing it, state your hypothesis; your next step is then to prove that hypothesis. For example, for an incident 'SSH service is down on X' your hypothesis could be that this is due to 'MaxStartups' throttling, which can be proven by grep'ing journalctl for that and comparing the start and end times of the throttling with the timestamps of the item reporting the status of the SSH service.
# Information gathering: Sometimes it just helps to get some facts about the situation collected. What information is useful depends on the triggers, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to the underlying problem with various levels of explicitness), or the ping response from several hosts on the route to a host or a traceroute (this helps with networking issues); see the sketch after this list. The gathered information is usually intended to help you come up with a hypothesis on what's wrong.
# Investigative: The most rigorous approach. The full process is originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize, when you don't know why something is failing, and/or don't have any decent hypotheses to follow up, you can follow this process to systematically find the problem.
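
A minimal sketch of the information-gathering step, run against a hypothetical host `affected-host.example.net`; the exact time window and commands depend on the incident:

 # System log around the incident window (run on the affected host; adjust the times).
 journalctl --since "2024-01-01 12:00" --until "2024-01-01 14:00" | less
 
 # Basic network diagnostics from another machine towards the affected host.
 ping -c 5 affected-host.example.net
 traceroute affected-host.example.net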


Regarding the resolution to an incident: The resolution to any incident is usually one of two things:
# Fix the underlying problem.
# Fix the trigger itself.
Fixing the trigger is relatively straightforward, but do make sure to document in the thread what you changed to which trigger.
Fixing the underlying problem can be more complex. A trade-off sometimes needs to be made between resolving technical debt or simply patching the current system to resolve the issue. We usually look for a resolution that ensures that the problem won't re-occur soon, or makes it unexpected/unlikely for the problem to re-occur. Taking into account the time frame that is available to resolve the incident, you can make some trade-offs. An example would be: normal backups of VMs are failing due to the Proxmox backup server being down/unreachable, and it is determined that this cannot be resolved at that moment. We can set up automatic backups to local storage temporarily to resolve the immediate problem and ensure we keep our SLOs, versus setting up a new Proxmox Backup Server at a different location. Since we don't have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.


===== Some known issues and their resolutions =====
* SSH service is down: The internet is a vile place. There is constant port scanning and there are constant hacking attempts against any machine connected to the internet (mostly IPv4). Due to this, SSH has a throttling functionality built in to prevent a system from being DDoSed by the amount of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with `journalctl -u ssh | grep 'MaxStartups throttling'` (you probably want to select a relevant time period with `--since "2 hours ago"` or something similar to prevent having to process a month of logging; see the sketch after this list). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here; what needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running by checking that that day's backups were done for the smaller buckets. The devteam email can be checked to see whether the backup process could not start on a given day because it was already running (it takes 24+ hours, and an attempt to start it is made each day by cron).
* git.* HTTPS is down: Mostly on Sundays, Gitlab gets automatically updated, and this incurs some downtime while the service is restarted. This is usually short enough to not be reported to Zulip as per our settings, but sometimes it's longer. If the service does not stay down, the incident can simply be resolved.
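
A minimal sketch of the MaxStartups check described in the first item above, run on the affected host; the time window is illustrative and should be adjusted to when the trigger fired:

 # Look for SSH throttling events around the time of the incident.
 journalctl -u ssh --since "2 hours ago" | grep 'MaxStartups throttling'
 
 # Compare the throttling start/end times printed above with the timestamps
 # of the Zabbix item that monitors the SSH service (Monitoring > Latest data).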


==== Resolve incident + cleanup ====
When you've executed and verified the resolution in the previous steps, we can proceed with resolving the issue in our Mattermost integration. Resolving an incident can be done by doing the following:
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved if the trigger is still firing. If the trigger is still firing but you're sure that you've resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host's configuration on Zabbix and selecting 'Execute Now'; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to check whether it was updated (see the sketch after this list).
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.
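
If you want to spot-check what an item currently returns before or after using 'Execute Now', a direct query can help. This is a sketch that assumes the item is collected via the Zabbix agent and that `zabbix_get` is run from the Zabbix server; the host name and item key are placeholders:

 # Query the agent item directly, independent of Zabbix's own polling schedule.
 zabbix_get -s affected-host.example.net -k 'net.tcp.service[ssh]'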


Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.


When you are done, there should be no more critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration for which no one has taken responsibility, or which you have taken responsibility for but are not actively handling.


== Non-Critical incidents ==




The next steps don't have to be done immediately, as they have dependencies, but they should be started and scheduled for completion the next work day.


Check if there's already an uncompleted milestone for this host with this issue in the metrics sheet.
** Follow the naming convention below for the title & project ID
** Tasks need to be added
** The final task needs to have the SLO deadline set as 'constraint'
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20
** The tasks are estimated for SP


== Handover ==
See [[Handover]]