Incident Handling
Critical incidents
- Critical incidents are resolved within 16 hours.
Checklist
- Acknowledge the incident on Zabbix and state in the description who is responsible for resolving it
- Determine affected clients
- Communicate to affected clients that the issue is being investigated
- Communicate plan/next steps (even if that is gathering information)
- Communicate findings/results of executed plan, go back to previous step if not resolved
- Resolve incident
Additional context
- Critical incidents are posted in Infrastructure.
- When an incident is tracked on GitLab, a heavy check mark is added to the message.
- Responses on the thread and on GitLab are automatically synced (to some extent).
- When you reply with "I agree that this has been fully resolved", Zabbix will eventually pick this up and a green check mark is added to the message.
Non-Critical incidents
- Non-critical incidents are acknowledged within 9 hours and resolved within one week.
Checklist
- Acknowledge the incident on Zabbix and state in the description who is responsible for resolving it
- Communicate plan/next steps (even if that is gathering information)
- Communicate findings/results of executed plan, go back to previous step if not resolved
- If there is no resolution to the incident, evaluate if the trigger needs updating/disabling
- Resolve incident
Informational incidents
- Informational incidents are acknowledged within 72 hours.
Checklist
- Acknowledge on Zabbix
- Sanity-check the event and post the result in the thread
- If action is needed, perform it
If an incident is reported in the SRE channel by a human
- Acknowledge receipt.
- Classify the incident as critical, non-critical, or informational.
- Create an issue and state that you've done so.
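Once an incident has been classified, the acknowledgement and resolution targets stated on this page can be looked up per class. The following is a minimal sketch of that lookup. The names `SlaTarget`, `SLA_TARGETS` and `sla_deadlines` are hypothetical, and it assumes that the clock starts when the incident is reported and that the targets are wall-clock hours; this page does not state either.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Tuple


class SlaTarget(NamedTuple):
    # None means this page states no target of that kind for the class.
    acknowledge_within: Optional[timedelta]
    resolve_within: Optional[timedelta]


# Targets copied from this page; the keys are illustrative names.
SLA_TARGETS = {
    "critical": SlaTarget(None, timedelta(hours=16)),
    "non-critical": SlaTarget(timedelta(hours=9), timedelta(weeks=1)),
    "informational": SlaTarget(timedelta(hours=72), None),
}


def sla_deadlines(
    severity: str, reported_at: datetime
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (acknowledge_by, resolve_by) for an incident.

    Assumption: the clock starts at reported_at and runs in wall-clock time.
    """
    target = SLA_TARGETS[severity]
    acknowledge_by = (
        reported_at + target.acknowledge_within
        if target.acknowledge_within is not None
        else None
    )
    resolve_by = (
        reported_at + target.resolve_within
        if target.resolve_within is not None
        else None
    )
    return acknowledge_by, resolve_by
```

Under these assumptions, a non-critical incident reported on 2024-01-01 at 08:00 would need to be acknowledged by 17:00 the same day and resolved by 2024-01-08 at 08:00.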
Handover
When handing over the responsibility of first responder (FR), the following needs to happen:
- The acting FR adds the upcoming FR to the IPA sla-first-responder user group, and enables Zabbix calling for that person
- The upcoming FR makes sure he is aware of the state of the SLA and knows what questions he wants to ask the acting FR
The following steps can be done async or in person:
- The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Mattermost's organisational channel if async).
- If the acting FR wants to hand over responsibility for any ongoing incidents, he also states which incidents he wants the upcoming FR to take over.
- If there are any particularities the upcoming FR needs to be aware of, the acting FR shares them at this point.
- The upcoming FR asks his questions until he is satisfied and able to take over as FR
- The upcoming FR announces in Mattermost's organisational channel that he is now the acting FR
- The now acting FR removes the previous FR from the IPA sla-first-responder user group, and disables Zabbix calling for that person
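The order of the handover steps means the sla-first-responder group is never empty: the upcoming FR is added before anything else happens, and the previous FR is removed only at the very end. The sketch below simulates just that ordering on a plain set of user names. It does not talk to IPA or Zabbix, and the function name `hand_over` and the user names are hypothetical.

```python
from typing import List, Set


def hand_over(members: Set[str], acting: str, upcoming: str) -> List[Set[str]]:
    """Return the group membership after each membership-changing step."""
    if acting not in members:
        raise ValueError(f"{acting} is not in the first-responder group")
    if acting == upcoming:
        raise ValueError("acting and upcoming FR must be different people")

    # First step: the acting FR adds the upcoming FR to the group.
    after_add = members | {upcoming}
    # Last step: the now acting FR removes the previous FR from the group.
    after_remove = after_add - {acting}
    return [after_add, after_remove]


steps = hand_over({"alice"}, acting="alice", upcoming="bob")
# Both people are in the group while questions are asked and answered,
# and the group is non-empty after every step.
assert all(len(group) > 0 for group in steps)
```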