Incident Handling

From Delft Solutions

Revision as of 00:07, 30 August 2023

Critical incidents

  • Critical incidents are resolved within 16 hours.

Checklist

  1. Acknowledge the incident on Zabbix and state in the description who is responsible for resolving it
  2. Determine affected clients
  3. Communicate to affected clients that the issue is being investigated
  4. Communicate plan/next steps (even if that is gathering information)
  5. Communicate the findings/results of the executed plan; go back to the previous step if the incident is not resolved
  6. Resolve incident
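
Step 1 is normally done in the Zabbix web interface. If it were scripted instead, the request could be sketched as below. This is a minimal sketch under assumptions, not the team's tooling: it only builds the JSON-RPC body for the Zabbix API method `event.acknowledge`, where the `action` value 2 acknowledges the event and 4 adds a message. The function name `build_ack_request`, the event ID, the person's name, and the message wording are hypothetical. Authentication and the HTTP call are deliberately left out, since they depend on the Zabbix version and setup.

```python
import json

# Action flags of the Zabbix API method event.acknowledge (combined as a bitmask).
ACKNOWLEDGE = 2   # acknowledge the event
ADD_MESSAGE = 4   # attach a message to the event


def build_ack_request(event_id, responsible, request_id=1):
    """Build a JSON-RPC body that acknowledges an event and names the owner.

    Hypothetical helper: the message wording is an assumption, and the body
    carries no authentication (add it as your Zabbix version requires).
    """
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": [event_id],
            "action": ACKNOWLEDGE | ADD_MESSAGE,
            "message": f"Responsible for resolving: {responsible}",
        },
        "id": request_id,
    }


# Example with made-up values.
print(json.dumps(build_ack_request("12345", "alice"), indent=2))
```

Putting the responsible person in the acknowledgement message keeps ownership visible on the event itself, which is what the checklist step asks for.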

Non-Critical incidents

  • Non-critical incidents are acknowledged within 9 hours and resolved within one week.

Checklist

  1. Acknowledge the incident on Zabbix and state in the description who is responsible for resolving it
  2. Communicate plan/next steps (even if that is gathering information)
  3. Communicate the findings/results of the executed plan; go back to the previous step if the incident is not resolved
  4. If the incident cannot be resolved, evaluate whether the trigger needs to be updated or disabled
  5. Resolve incident

Informational incidents

  • Informational incidents are acknowledged within 72 hours.

Checklist

  1. Acknowledge on Zabbix
  2. Sanity-check the event and post the result in the thread
  3. If action is needed, perform it
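
The time limits for the three incident classes can be summarised as a small lookup. The sketch below is illustrative only. It assumes the clock starts when the incident is reported, which this page does not state, and the names `SLA` and `deadlines` are hypothetical. Where the page gives no limit (acknowledging critical incidents, resolving informational ones), the value is `None`.

```python
from datetime import datetime, timedelta, timezone

# Time limits as stated on this page; None means the page gives no limit.
SLA = {
    "critical": {"acknowledge": None, "resolve": timedelta(hours=16)},
    "non-critical": {"acknowledge": timedelta(hours=9), "resolve": timedelta(weeks=1)},
    "informational": {"acknowledge": timedelta(hours=72), "resolve": None},
}


def deadlines(severity, reported_at):
    """Return the acknowledge/resolve deadlines for an incident.

    Assumption: both limits are counted from the time of the report.
    """
    limits = SLA[severity]
    return {
        step: (reported_at + limit if limit is not None else None)
        for step, limit in limits.items()
    }


# Example with a made-up report time.
reported = datetime(2023, 8, 30, 0, 7, tzinfo=timezone.utc)
for severity in SLA:
    print(severity, deadlines(severity, reported))
```

Using timezone-aware datetimes avoids ambiguity when the person reporting and the person resolving are in different time zones.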

If an incident is reported in the SRE channel by a human

  • Acknowledge receipt.
  • Classify the incident as critical, non-critical, or informational.
  • Create an issue and state that you've done so.