Incident Handling


Revision as of 07:04, 3 May 2024

Critical incidents

  • Critical incidents are resolved within 16 hours.

Checklist

  1. Acknowledge the incident in Zabbix and state in the description who is responsible for resolving it
  2. Determine affected clients
  3. Communicate to affected clients that the issue is being investigated
  4. Communicate plan/next steps (even if that is gathering information)
  5. Communicate findings/results of the executed plan; go back to the previous step if the incident is not resolved
  6. Resolve incident
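
The acknowledgement step could also be done through the Zabbix JSON-RPC API instead of the web interface. The following is a minimal sketch that only builds the request body for the `event.acknowledge` method; it sends nothing. The function name, the example event ID, and the message wording are hypothetical. The API URL and authentication are left out because they depend on the Zabbix version and deployment.

```python
import json

# Bit flags for the "action" field of Zabbix's event.acknowledge method.
ACTION_ACKNOWLEDGE = 2
ACTION_ADD_MESSAGE = 4


def build_ack_request(event_id: str, responsible: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC body that acknowledges an event and names the owner."""
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": [event_id],
            # Acknowledge and attach a message in a single call.
            "action": ACTION_ACKNOWLEDGE | ACTION_ADD_MESSAGE,
            # Hypothetical wording; the checklist only asks to name the owner.
            "message": f"Acknowledged. Responsible for resolving: {responsible}",
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # "12345" and "alice" are placeholder values for illustration.
    print(json.dumps(build_ack_request("12345", "alice"), indent=2))
```

Combining both flags keeps the acknowledgement and the name of the responsible person in one event update, so the two cannot drift apart.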

Non-Critical incidents

  • Non-critical incidents are acknowledged within 9 hours and resolved within one week.

Checklist

  1. Acknowledge the incident in Zabbix and state in the description who is responsible for resolving it
  2. Communicate plan/next steps (even if that is gathering information)
  3. Communicate findings/results of the executed plan; go back to the previous step if the incident is not resolved
  4. If there is no resolution to the incident, evaluate whether the trigger needs updating or disabling
  5. Resolve incident

Informational incidents

  • Informational incidents are acknowledged within 72 hours.

Checklist

  1. Acknowledge on Zabbix
  2. Sanity-check the event and post the result in the thread
  3. If action is needed, perform it
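
The time limits for the three incident classes can be turned into concrete deadlines. The sketch below assumes the clock starts when the incident is opened and that the limits are plain wall-clock hours; this page does not say whether business hours are meant. The names `SLA` and `deadlines` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Time limits as stated on this page; None means the page gives no limit.
SLA = {
    "critical": {"acknowledge": None, "resolve": timedelta(hours=16)},
    "non-critical": {"acknowledge": timedelta(hours=9), "resolve": timedelta(weeks=1)},
    "informational": {"acknowledge": timedelta(hours=72), "resolve": None},
}


def deadlines(severity: str, opened_at: datetime) -> dict:
    """Return the acknowledge/resolve deadlines for an incident.

    Assumption: limits are counted in wall-clock time from opened_at.
    """
    limits = SLA[severity]
    return {
        step: (opened_at + limit if limit is not None else None)
        for step, limit in limits.items()
    }


if __name__ == "__main__":
    opened = datetime(2024, 5, 3, 7, 4, tzinfo=timezone.utc)
    for severity in SLA:
        print(severity, deadlines(severity, opened))
```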

If an incident is reported in the SRE channel by a human

  • Acknowledge receipt.
  • Classify the incident as critical, non-critical, or informational.
  • Create an issue and state that you've done so.

Handover

When handing over the responsibility of first responder (FR), the following needs to happen:

  • The acting FR adds the upcoming FR to the IPA sla-first-responder user group, and enables Zabbix calling for that person
  • The upcoming FR makes sure he is aware of the state of the SLA and knows what questions he wants to ask the acting FR

The following steps can be done async or in person:

  • The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Mattermost's organisational channel if async).
  • If the acting FR wants to hand over responsibility for any ongoing incidents, he also states which incidents he wants the upcoming FR to take over.
  • If there are any particularities the upcoming FR needs to be aware of, he shares them then.
  • The upcoming FR asks his questions until he is satisfied and able to take over the FR role.
  • The upcoming FR announces in Mattermost's organisational channel that he is now the acting FR.
  • The now acting FR removes the previous FR from the IPA sla-first-responder user group, and disables Zabbix calling for that person.
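
The group changes in the handover can be sketched as an ordered list of commands. This assumes that "IPA" refers to FreeIPA and that its `ipa` command-line client is used. The function name and user logins are hypothetical, and enabling or disabling Zabbix calling is not modelled. The sketch only builds the commands; it does not run them.

```python
GROUP = "sla-first-responder"


def handover_commands(acting: str, upcoming: str) -> list[list[str]]:
    """Return the IPA group commands for a handover, in the order given above."""
    return [
        # First step: the acting FR adds the upcoming FR to the group.
        ["ipa", "group-add-member", GROUP, f"--users={upcoming}"],
        # Last step: the new acting FR removes the previous FR from the group.
        ["ipa", "group-remove-member", GROUP, f"--users={acting}"],
    ]


if __name__ == "__main__":
    # "alice" and "bob" are placeholder logins for illustration.
    for command in handover_commands(acting="alice", upcoming="bob"):
        print(" ".join(command))
```

Adding the upcoming FR before removing the previous one means the group is never empty during the handover, which matches the order of the steps above.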