|
|
| (5 intermediate revisions by the same user not shown) |
| Line 1: |
Line 1: |
| = This is the process = | | = This is the process = |
| This document describes the process. This document supersedes all prior documents on process. | | This document is an authoritative description of the process. This document supersedes all prior documents on the process. |
|
| |
|
| = Deviating from the process = | | = Deviating from the process = |
| You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, ideally as soon as possible after deciding to deviate. | | You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate. |
|
| |
|
| = Checklist = | | = Checklist = |
| Line 26: |
Line 26: |
| ## Mark Zulip topic as resolved if no other incidents for the host. | | ## Mark Zulip topic as resolved if no other incidents for the host. |
| ## Check for related triggers and resolve them. | | ## Check for related triggers and resolve them. |
|
| |
| Common issues that have occurred previously, and ''could'' occur again:
| |
| * SSH down: Check MaxStartups throttling, apply custom SSH config
| |
| * No backup: Verify backup process is running, check the devteam email
| |
| * HTTPS down on Sunday: this can be due to GitLab updates
| |
|
| |
|
| === Non-Critical Incidents === | | === Non-Critical Incidents === |
| Line 145: |
Line 140: |
|
| |
|
| When you are done, there should be no critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that either no one has taken responsibility for, or that you have taken responsibility for but are not actively handling. | | When you are done, there should be no critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that either no one has taken responsibility for, or that you have taken responsibility for but are not actively handling. |
|
| |
| ===Additional context===
| |
| * Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical '''SLA - Critical'''].
| |
| * <s>When it is being tracked on GitLab a heavy check mark is added to the message.</s>
| |
| * <s>Responses on the thread and on GitLab are automatically synced (to some extent)</s>
| |
| * <s>When you reply with '''I agree that this has been fully resolved''' eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.</s>
| |
|
| |
|
| == Non-Critical incidents == | | == Non-Critical incidents == |
| Line 202: |
Line 191: |
|
| |
|
| == Handover == | | == Handover == |
| When handing over the responsibility of '''first responder''' (FR), the following needs to happen:
| | See [[Handover]] |
| * The handover can be initiated by either the upcoming FR or the acting FR.
| |
| * The acting FR adds the upcoming FR to the IPA sla-first-responder user group and, if the upcoming FR uses Zabbix calling, enables it for them under Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions].
| |
| * Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails, opened topics in SRE General, etc.), updated with the latest status, and properly documented.
| |
| * The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.
| |
| * The upcoming FR makes sure they are subscribed to the right channels.
| |
| | |
| The following steps can be done async or in person:
| |
| * The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Zulip's [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).
| |
| * If the acting FR wants to hand over responsibility for any ongoing incidents, they also state which incidents they want the upcoming FR to take over.
| |
| * If there are any particularities the upcoming FR needs to be aware of, those are shared.
| |
| * The upcoming FR asks their questions until they are satisfied and able to take over the FR role.
| |
| * The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and, if part of the SRE team, [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].
| |
| * The upcoming FR announces in Zulip's [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] that they are now the acting FR.
| |
| * The now acting FR removes the previous FR from the IPA sla-first-responder user group and, if the previous FR had Zabbix calling enabled, disables it under Zabbix > Configuration > Actions > [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions].
| |
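
The ordering of the handover steps in the earlier revision matters: the upcoming FR is added before the previous FR is removed, so the sla-first-responder group is never empty. The following is a minimal sketch of that ordering only. It does not talk to IPA or Zabbix; <code>OnCallState</code>, <code>hand_over</code> and their parameters are hypothetical names introduced for illustration, and the two sets merely stand in for the IPA group and the Zabbix calling setting.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OnCallState:
    """Hypothetical in-memory stand-in; not an IPA or Zabbix API."""
    responder_group: set[str] = field(default_factory=set)  # models sla-first-responder
    calling_enabled: set[str] = field(default_factory=set)  # models Zabbix calling


def hand_over(state: OnCallState, acting: str, upcoming: str,
              upcoming_wants_calls: bool, incidents_acknowledged: bool) -> None:
    """Apply the handover steps in the order the checklist gives them."""
    if acting not in state.responder_group:
        raise ValueError(f"{acting} is not the acting first responder")
    if not incidents_acknowledged:
        # The acting FR must acknowledge and document open incidents first.
        raise ValueError("acknowledge and document all active incidents first")

    # Step 1 (acting FR): add the upcoming FR before anything is removed.
    state.responder_group.add(upcoming)
    if upcoming_wants_calls:
        state.calling_enabled.add(upcoming)

    # Between the steps both people are in the group, so there is no gap.
    assert {acting, upcoming} <= state.responder_group

    # Step 2 (now acting FR): remove the previous FR and their calling.
    state.responder_group.discard(acting)
    state.calling_enabled.discard(acting)


state = OnCallState(responder_group={"alice"}, calling_enabled={"alice"})
hand_over(state, acting="alice", upcoming="bob",
          upcoming_wants_calls=True, incidents_acknowledged=True)
print(sorted(state.responder_group))
```

In practice the two steps are performed by different people with announcements in between; the sketch collapses them into one function only to make the add-before-remove order explicit.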