Incident Handling: Difference between revisions

Revision as of 06:57, 12 January 2026

598 bytes added, 12 January

→‎Checklist

Jakobbuis (118 edits)

@@ Line 1: / Line 1: @@
 = This is the process =
-This document describes the process. This document supersedes all prior documents on process.
+This document is an authoritative description of the process. This document supersedes all prior documents on the process.
 = Deviating from the process =
-You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, ideally as soon as possible after deciding to deviate.
+You may deviate from the process at any moment. A deviation should be communicated to the dev-team, preferably in the Zulip topic about the applicable incident, as soon as possible after deciding to deviate.
 = Checklist =
@@ Line 14: / Line 14: @@
 # Check if the incident is still ongoing.
 # Determine whether the incident is ongoing
+# If this report came in via SRE - Report:
+## keep that thread open until the incident is resolved
+## post a link to the SRE - Report thread in any related underlying technical threads in SRE # Critical, SRE ## Non-critical, or SRE ### Informational
 # Determine whether clients are potentially affected, if so:
-## notify the affected clients (Slack preferred)
+## notify the affected clients (Slack preferred if available)
 ## share the message sent to the client in the incident Zulip thread
 # Document all actions taken in the Zulip topic.
@@ Line 26: / Line 29: @@
 ## Mark Zulip topic as resolved if no other incidents for the host.
 ## Check for related triggers and resolve them.
+## If there were any SRE - Report threads:
+### post a summary describing the high-level incident, that it is resolved, and how it was resolved.
+### post that summary message to any client channels such as Slack too.
+### close the thread in SRE - Report
-Common issues that have occurred previously, and ''could'' occur again:
+Note: we do not accept discussions on the how or why of incident response in the SRE - Report channel; redirect those to either the Retro or Organisational channels. The only reason to reopen a thread in SRE - Report should be to report that there is still impact and the incident was marked as resolved prematurely.
-* SSH down: Check MaxStartups throttling, apply custom SSH config
-* No backup: Verify backup process is running, check the devteam email
-* HTTPS down on Sunday: this can be due to GitLab updates
 === Non-Critical Incidents ===
