<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://docs.delftsolutions.nl/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Alois</id>
	<title>Delft Solutions - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://docs.delftsolutions.nl/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Alois"/>
	<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/wiki/Special:Contributions/Alois"/>
	<updated>2026-04-04T04:41:20Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.3</generator>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=WS_Proxmox_node_reboot&amp;diff=693</id>
		<title>WS Proxmox node reboot</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=WS_Proxmox_node_reboot&amp;diff=693"/>
		<updated>2026-02-12T12:32:30Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Reboot process */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tips &amp;amp; Notes ==&lt;br /&gt;
* If you&#039;re expecting to reboot every node in the cluster, do the node with the containers last, to limit the amount of downtime and reboots for them&lt;br /&gt;
* Updating a node: `apt update` and `apt full-upgrade`&lt;br /&gt;
* Make sure all VMs are actually migratable before adding to a HA group&lt;br /&gt;
* If there are containers on the device you are looking to reboot, you will also need to create a maintenance mode to cover them (for example teamspeak or awstats)&lt;br /&gt;
* Containers will inherit the OS of their host, so you will also need to handle triggers related to their OS updating, where appropriate&lt;br /&gt;
== Pre-Work ==&lt;br /&gt;
* If a VM or container is going to incur downtime, you must let the affected parties know in advance. Ideally they should be informed the previous day.  &lt;br /&gt;
&lt;br /&gt;
== Pre-flight checks ==&lt;br /&gt;
* Check all Ceph pools are running on at least 3/2 replication&lt;br /&gt;
* Check that all running VMs on the node you want to reboot are in HA (if not, add them or migrate them away manually)&lt;br /&gt;
** &#039;&#039;&#039;The `compute.*` VMs are not to be migrated! Rebooting a node with such a VM present requires shutting down the VM!&#039;&#039;&#039;&lt;br /&gt;
* Check that Ceph is healthy -&amp;gt; no remapped PGs or degraded data redundancy&lt;br /&gt;
* You have communicated that downtime is expected to the users who will be affected (Ideally one day in advance)&lt;br /&gt;
&lt;br /&gt;
== Update Process ==&lt;br /&gt;
* Update the node: `apt update` and `apt full-upgrade`&lt;br /&gt;
* Check that the packages being removed/updated/installed are correct and make sense&lt;br /&gt;
&lt;br /&gt;
== Reboot process ==&lt;br /&gt;
* Complete the pre-flight checks&lt;br /&gt;
* If you want to reboot for a kernel update, make sure the kernel is updated by following the Update Process written above&lt;br /&gt;
* Start maintenance mode for the Proxmox node and any containers running on the node&lt;br /&gt;
* Start maintenance mode for Ceph, specifying that we only want to suppress the trigger for the health state being in warning by setting tag `ceph_health` equal to `warning`&lt;br /&gt;
* Let affected parties know that the maintenance period you told them about in the preflight checks is about to take place.&lt;br /&gt;
[[File:Ceph-maintenance.png|thumb]]&lt;br /&gt;
* Set noout flag on host: `ceph osd set-group noout &amp;lt;node&amp;gt;`&lt;br /&gt;
&lt;br /&gt;
# Gain SSH access to the host&lt;br /&gt;
# Log in through IPA&lt;br /&gt;
# Run the command&lt;br /&gt;
* Place the node in HA maintenance mode so the cluster does not trigger failover or recovery actions `ha-manager crm-command node-maintenance enable &amp;lt;node&amp;gt;` &lt;br /&gt;
* &#039;&#039;&#039;Reboot&#039;&#039;&#039; node through web GUI&lt;br /&gt;
* Wait for node to come back up&lt;br /&gt;
* Wait for OSDs to be back online&lt;br /&gt;
* disable node maintenance mode `ha-manager crm-command node-maintenance disable &amp;lt;node&amp;gt;`&lt;br /&gt;
* Remove noout flag on host: `ceph osd unset-group noout &amp;lt;node&amp;gt;` (same SSH/IPA access as above)&lt;br /&gt;
* If a kernel update was done, manually execute the `Operating system` item to detect the update. Manually executing the two items that indicate a reboot is also useful if they were firing, to stop them and check that no further reboots are needed.&lt;br /&gt;
* Acknowledge &amp;amp; close triggers&lt;br /&gt;
* Remove maintenance modes&lt;br /&gt;
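The bullet sequence above can be sketched as one command sequence (a sketch only, to be run with root privileges on a cluster node; `nodename` is a placeholder for the actual node name):&lt;br /&gt;

```shell
#!/bin/sh
# Sketch of the reboot sequence above; "nodename" is a placeholder.
NODE="nodename"

# Prevent Ceph from marking the node's OSDs out and rebalancing
# while they are briefly offline
ceph osd set-group noout "$NODE"

# Put the node in HA maintenance mode so the cluster does not
# trigger failover or recovery actions
ha-manager crm-command node-maintenance enable "$NODE"

# (Reboot the node through the web GUI and wait for the node and
# its OSDs to come back online before continuing.)

# Leave HA maintenance mode and allow Ceph to rebalance again
ha-manager crm-command node-maintenance disable "$NODE"
ceph osd unset-group noout "$NODE"
```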
&lt;br /&gt;
== Aftercare ==&lt;br /&gt;
* Ensure that Kaboom API is running on Screwdriver or Paloma. This is to get the best performance for the VM.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=WS_Proxmox_node_reboot&amp;diff=692</id>
		<title>WS Proxmox node reboot</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=WS_Proxmox_node_reboot&amp;diff=692"/>
		<updated>2026-02-12T08:26:56Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Update Process */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tips &amp;amp; Notes ==&lt;br /&gt;
* If you&#039;re expecting to reboot every node in the cluster, do the node with the containers last, to limit the amount of downtime and reboots for them&lt;br /&gt;
* Updating a node: `apt update` and `apt full-upgrade`&lt;br /&gt;
* Make sure all VMs are actually migratable before adding to a HA group&lt;br /&gt;
* If there are containers on the device you are looking to reboot, you will also need to create a maintenance mode to cover them (for example teamspeak or awstats)&lt;br /&gt;
* Containers will inherit the OS of their host, so you will also need to handle triggers related to their OS updating, where appropriate&lt;br /&gt;
== Pre-Work ==&lt;br /&gt;
* If a VM or container is going to incur downtime, you must let the affected parties know in advance. Ideally they should be informed the previous day.  &lt;br /&gt;
&lt;br /&gt;
== Pre-flight checks ==&lt;br /&gt;
* Check all Ceph pools are running on at least 3/2 replication&lt;br /&gt;
* Check that all running VMs on the node you want to reboot are in HA (if not, add them or migrate them away manually)&lt;br /&gt;
** &#039;&#039;&#039;The `compute.*` VMs are not to be migrated! Rebooting a node with such a VM present requires shutting down the VM!&#039;&#039;&#039;&lt;br /&gt;
* Check that Ceph is healthy -&amp;gt; no remapped PGs or degraded data redundancy&lt;br /&gt;
* You have communicated that downtime is expected to the users who will be affected (Ideally one day in advance)&lt;br /&gt;
&lt;br /&gt;
== Update Process ==&lt;br /&gt;
* Update the node: `apt update` and `apt full-upgrade`&lt;br /&gt;
* Check that the packages being removed/updated/installed are correct and make sense&lt;br /&gt;
&lt;br /&gt;
== Reboot process ==&lt;br /&gt;
* Complete the pre-flight checks&lt;br /&gt;
* If you want to reboot for a kernel update, make sure the kernel is updated by following the Update Process written above&lt;br /&gt;
* Start maintenance mode for the Proxmox node and any containers running on the node&lt;br /&gt;
* Start maintenance mode for Ceph, specifying that we only want to suppress the trigger for the health state being in warning by setting tag `ceph_health` equal to `warning`&lt;br /&gt;
* Let affected parties know that the maintenance period you told them about in the preflight checks is about to take place.&lt;br /&gt;
[[File:Ceph-maintenance.png|thumb]]&lt;br /&gt;
* Set noout flag on host: `ceph osd set-group noout &amp;lt;node&amp;gt;`&lt;br /&gt;
&lt;br /&gt;
# Gain SSH access to the host&lt;br /&gt;
# Log in through IPA&lt;br /&gt;
# Run the command&lt;br /&gt;
* &#039;&#039;&#039;Reboot&#039;&#039;&#039; node through web GUI&lt;br /&gt;
* Wait for node to come back up&lt;br /&gt;
* Wait for OSDs to be back online&lt;br /&gt;
* Remove noout flag on host: `ceph osd unset-group noout &amp;lt;node&amp;gt;` (same SSH/IPA access as above)&lt;br /&gt;
* If a kernel update was done, manually execute the `Operating system` item to detect the update. Manually executing the two items that indicate a reboot is also useful if they were firing, to stop them and check that no further reboots are needed.&lt;br /&gt;
* Acknowledge &amp;amp; close triggers&lt;br /&gt;
* Remove maintenance modes&lt;br /&gt;
&lt;br /&gt;
== Aftercare ==&lt;br /&gt;
* Ensure that Kaboom API is running on Screwdriver or Paloma. This is to get the best performance for the VM.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=IDRAC:_Upload_a_new_pem_certificate_for_automatic_updates&amp;diff=675</id>
		<title>IDRAC: Upload a new pem certificate for automatic updates</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=IDRAC:_Upload_a_new_pem_certificate_for_automatic_updates&amp;diff=675"/>
		<updated>2025-11-04T10:44:39Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Context:&#039;&#039;&#039;&lt;br /&gt;
This procedure applies to maintaining automatic firmware updates on our Dell PowerEdge servers.&lt;br /&gt;
This is related to the broader guide on [https://docs.google.com/document/d/1GEl6RmGIseqLzkdTC4lahqU2XsDLCVZn2blqt6ZdNpg/edit?tab=t.0#heading=h.62n2umxpabzo Setting up a new Proxmox server in our cluster], specifically Appendix G: Configuring automatic updates on the Idrac.&lt;br /&gt;
Appendix G explains the initial configuration steps for enabling automatic update checks.&lt;br /&gt;
&lt;br /&gt;
Once a new PEM certificate has been created, follow the steps below to upload it to each iDRAC.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;iDRAC 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Requirements: Administrator privileges on the iDRAC&lt;br /&gt;
	&lt;br /&gt;
# Log in to the iDRAC web interface&lt;br /&gt;
# Navigate to: iDRAC Settings → Update and Rollback → Automatic Update&lt;br /&gt;
# Locate the line Upload Server Certificate.&lt;br /&gt;
# Click Choose File and select the new PEM certificate.&lt;br /&gt;
# Click Upload.&lt;br /&gt;
# Confirm the operation if prompted.&lt;br /&gt;
[[File:Idrac8 upload pem.png|thumb|center]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;iDRAC 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Requirements: Administrator privileges on the iDRAC.&lt;br /&gt;
	&lt;br /&gt;
# Log in to the iDRAC web interface (https://&amp;lt;idrac-ip&amp;gt;/).&lt;br /&gt;
# Navigate to: Maintenance → System Update → Automatic Update&lt;br /&gt;
# Find the line Server Certificate and click it.&lt;br /&gt;
# A modal window opens showing the current certificate.&lt;br /&gt;
# Click Replace Certificate, select the new PEM file, and upload it.&lt;br /&gt;
# Confirm and apply the change.&lt;br /&gt;
[[File:idrac9 pem upload1.png|thumb|center]]&lt;br /&gt;
[[File:Idrac9 pem upload2.png|thumb|center]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=IDRAC:_Upload_a_new_pem_certificate_for_automatic_updates&amp;diff=674</id>
		<title>IDRAC: Upload a new pem certificate for automatic updates</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=IDRAC:_Upload_a_new_pem_certificate_for_automatic_updates&amp;diff=674"/>
		<updated>2025-11-04T10:44:17Z</updated>

		<summary type="html">&lt;p&gt;Alois: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;Context:&amp;#039;&amp;#039;&amp;#039; This procedure applies to maintaining automatic firmware updates on our Dell PowerEdge servers. This is related to the broader guide on [https://docs.google.com/document/d/1GEl6RmGIseqLzkdTC4lahqU2XsDLCVZn2blqt6ZdNpg/edit?tab=t.0#heading=h.62n2umxpabzo Setting up a new Proxmox server in our cluster], specifically Appendix G: Configuring automatic updates on the Idrac. Appendix G explains the initial configuration steps for enabling automatic update checks....&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Context:&#039;&#039;&#039;&lt;br /&gt;
This procedure applies to maintaining automatic firmware updates on our Dell PowerEdge servers.&lt;br /&gt;
This is related to the broader guide on [https://docs.google.com/document/d/1GEl6RmGIseqLzkdTC4lahqU2XsDLCVZn2blqt6ZdNpg/edit?tab=t.0#heading=h.62n2umxpabzo Setting up a new Proxmox server in our cluster], specifically Appendix G: Configuring automatic updates on the Idrac.&lt;br /&gt;
Appendix G explains the initial configuration steps for enabling automatic update checks.&lt;br /&gt;
&lt;br /&gt;
Once a new PEM certificate has been created, follow the steps below to upload it to each iDRAC.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;iDRAC 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Requirements: Administrator privileges on the iDRAC&lt;br /&gt;
	&lt;br /&gt;
# Log in to the iDRAC web interface&lt;br /&gt;
# Navigate to: iDRAC Settings → Update and Rollback → Automatic Update&lt;br /&gt;
# Locate the line Upload Server Certificate.&lt;br /&gt;
# Click Choose File and select the new PEM certificate.&lt;br /&gt;
# Click Upload.&lt;br /&gt;
# Confirm the operation if prompted.&lt;br /&gt;
[[File:Idrac8 upload pem.png|thumb|center]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;iDRAC 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Requirements: Administrator privileges on the iDRAC.&lt;br /&gt;
	&lt;br /&gt;
# Log in to the iDRAC web interface (https://&amp;lt;idrac-ip&amp;gt;/).&lt;br /&gt;
# Navigate to: Maintenance → System Update → Automatic Update&lt;br /&gt;
# Find the line Server Certificate and click it.&lt;br /&gt;
# A modal window opens showing the current certificate.&lt;br /&gt;
# Click Replace Certificate, select the new PEM file, and upload it.&lt;br /&gt;
# Confirm and apply the change.&lt;br /&gt;
[[File:idrac9 pem upload1.png|thumb|center]]&lt;br /&gt;
[[File:Idrac9 pem upload2.png|thumb|center]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Idrac9_pem_upload2.png&amp;diff=673</id>
		<title>File:Idrac9 pem upload2.png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Idrac9_pem_upload2.png&amp;diff=673"/>
		<updated>2025-11-04T10:40:27Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;iDRAC 9 upload new pem file&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Idrac9_pem_upload1.png&amp;diff=672</id>
		<title>File:Idrac9 pem upload1.png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Idrac9_pem_upload1.png&amp;diff=672"/>
		<updated>2025-11-04T10:37:39Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;iDRAC 9 upload new pem file&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Idrac8_upload_pem.png&amp;diff=671</id>
		<title>File:Idrac8 upload pem.png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Idrac8_upload_pem.png&amp;diff=671"/>
		<updated>2025-11-04T10:29:23Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;upload pem file on idrac8&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Internal&amp;diff=670</id>
		<title>Internal</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Internal&amp;diff=670"/>
		<updated>2025-11-04T10:04:38Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* SRE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Strategy ==&lt;br /&gt;
* [[Flow metrics|Flow metrics]]&lt;br /&gt;
* [[12%-time|12%-time]]&lt;br /&gt;
&lt;br /&gt;
== Finance ==&lt;br /&gt;
&lt;br /&gt;
=== Exact ===&lt;br /&gt;
&lt;br /&gt;
* [[booking bonus|Booking bonus]]&lt;br /&gt;
* [[booking wages|Booking wages]]&lt;br /&gt;
* [[booking quarterly hosting invoice|Booking quarterly hosting invoice]]&lt;br /&gt;
* [[new receipt|Enter a new receipt]]&lt;br /&gt;
* [[reconciliation|Reconciliation of transaction]]&lt;br /&gt;
* [[invoicing|Send an invoice]]&lt;br /&gt;
* [[payment reminders|Send payment reminder]]&lt;br /&gt;
* [[invoice approval|Process for approving invoices (/filed receipts)]]&lt;br /&gt;
&lt;br /&gt;
=== Bunq ===&lt;br /&gt;
&lt;br /&gt;
* [[top up account|Top up expense account]]&lt;br /&gt;
&lt;br /&gt;
== Work Process ==&lt;br /&gt;
&lt;br /&gt;
* [[Definition of done|Definition of Done]]&lt;br /&gt;
* [[Incident Handling|Incident Handling]]&lt;br /&gt;
* [[SRE Maintenance|SRE Maintenance]]&lt;br /&gt;
* [[Release checklist|Release checklist]]&lt;br /&gt;
* [[Handover]]&lt;br /&gt;
&lt;br /&gt;
== Internal Process ==&lt;br /&gt;
* [[timetracking|Timetracking process]]&lt;br /&gt;
* [[Starting work for a new client]]&lt;br /&gt;
* [[12 percent|12% time]]&lt;br /&gt;
* [[Annual leave|Annual leave]]&lt;br /&gt;
* [[Bonus allocation|Bonus allocation]]&lt;br /&gt;
* [[Calamity leave|Calamity leave]]&lt;br /&gt;
* [[Overtime|Overtime]]&lt;br /&gt;
* [[Retrospectives|Retrospectives]]&lt;br /&gt;
* [[Sick leave|Sick leave]]&lt;br /&gt;
* [[Training and self-study|Training and Self-Study]]&lt;br /&gt;
* [[Daily|Daily]]&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* Era Inventory [[project_era_inventory_api|API Description]]&lt;br /&gt;
&lt;br /&gt;
== SRE ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;To be further populated with guide from drive&#039;&#039;&lt;br /&gt;
* [[create gitlab runner host|Create a GitLab runner host]]&lt;br /&gt;
* [[vm setup|Create a (Debian) VM]]&lt;br /&gt;
* [[border update|Process for updating a border]]&lt;br /&gt;
* [[border reboot|Reboot border without downtime]]&lt;br /&gt;
* [[WS Proxmox node reboot|Reboot WS Proxmox node without downtime]]&lt;br /&gt;
* [[Resize VM Disk]]&lt;br /&gt;
* [[SRE tools]]&lt;br /&gt;
* [[Enroll Mac in Kerberos]]&lt;br /&gt;
* [[New Mac Setup]]&lt;br /&gt;
* [[Creating a VM on Hetzner]]&lt;br /&gt;
* [[Rebooting VM]]&lt;br /&gt;
* [[Rebooting Offsite]]&lt;br /&gt;
* [[ssh-fingerprints|Verifying SSH fingerprints]]&lt;br /&gt;
* [[Removing VM]]&lt;br /&gt;
* [[Install a new Disk in Server]]&lt;br /&gt;
* [[Setting Up Wildcard Subdomains with SSL on a Debian Application]]&lt;br /&gt;
* [[s3 bucket backup|Get credentials to backup S3 bucket to Zombie]]&lt;br /&gt;
* [[Hardware Incident Response: Memory Slot Failure on banshee]]&lt;br /&gt;
* [[dfz switch setup|DFZ Switch setup]]&lt;br /&gt;
* [[Update Zulip Server version]]&lt;br /&gt;
* [[iDRAC: Upload a new pem certificate for automatic updates]]&lt;br /&gt;
=== SLA ===&lt;br /&gt;
* [[Response for Backup Service being down for an extended period of time]]&lt;br /&gt;
&lt;br /&gt;
== Other ==&lt;br /&gt;
&lt;br /&gt;
* [[stack|Greenfield stack]]&lt;br /&gt;
* [[standard tools|Standard Tools]]&lt;br /&gt;
* [[list of unfurl debuggers|List of unfurl debuggers]]&lt;br /&gt;
* [[Recommended suppliers]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Update_Zulip_Server_version&amp;diff=668</id>
		<title>Update Zulip Server version</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Update_Zulip_Server_version&amp;diff=668"/>
		<updated>2025-10-27T10:22:07Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Context:&#039;&#039;&#039;&lt;br /&gt;
This procedure was written at the time of Zulip 11.4, and applies to upgrades performed on the chat.dsinternal.net server.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Important Notes&#039;&#039;&#039;&lt;br /&gt;
* It is possible (and supported) to upgrade across multiple versions at once — there is no need to install each intermediate release.&lt;br /&gt;
* Always consult the official Zulip upgrade documentation first. It supersedes this checklist: &lt;br /&gt;
https://zulip.readthedocs.io/en/latest/production/upgrade.html&lt;br /&gt;
* Make sure to note any important and relevant details concerning the next upgrade(s) by reading the upgrade notes for every upcoming major version:&lt;br /&gt;
https://zulip.readthedocs.io/en/latest/overview/changelog.html&lt;br /&gt;
* In our case we&#039;re following the upgrading-to-a-release method, not the upgrading-from-a-git-repository one.&lt;br /&gt;
* Read any version-specific upgrade notes in the official docs before proceeding.&lt;br /&gt;
&lt;br /&gt;
== Pre-upgrade Steps ==&lt;br /&gt;
# Read the [https://zulip.readthedocs.io/en/latest/overview/changelog.html upgrade notes] and prepare accordingly if some specific actions need to be taken&lt;br /&gt;
# Make sure the Debian and PostgreSQL versions are compatible with the upcoming update&lt;br /&gt;
# Schedule and announce the maintenance window&lt;br /&gt;
# Create a maintenance period in Zabbix&lt;br /&gt;
# Create a backup and/or snapshot&lt;br /&gt;
&lt;br /&gt;
== Upgrade Procedure ==&lt;br /&gt;
&#039;&#039;&#039;Important Notes&#039;&#039;&#039;&lt;br /&gt;
This procedure is taken from the official doc; if the official doc has been updated, this checklist should be updated too&lt;br /&gt;
# Fetch the latest build &amp;lt;code&amp;gt;curl -fLO https://download.zulip.com/server/zulip-server-latest.tar.gz&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the upgrade &amp;lt;code&amp;gt;/home/zulip/deployments/current/scripts/upgrade-zulip zulip-server-latest.tar.gz&amp;lt;/code&amp;gt;&lt;br /&gt;
# Monitor the process; expect temporary CPU and memory spikes right after the upgrade due to migrations and cache rebuilds. You probably want to wait for things to settle down before going on to the next steps.&lt;br /&gt;
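The fetch-and-upgrade steps above amount to the following (a sketch only; paths and URL are the ones quoted from the official doc, run as root on the Zulip host):&lt;br /&gt;

```shell
#!/bin/sh
set -eu
# Download the latest Zulip server release tarball
curl -fLO https://download.zulip.com/server/zulip-server-latest.tar.gz
# Run the in-place upgrade script shipped with the current deployment;
# this performs migrations and cache rebuilds, so expect load spikes
/home/zulip/deployments/current/scripts/upgrade-zulip zulip-server-latest.tar.gz
```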
&lt;br /&gt;
== Post-upgrade Tasks ==&lt;br /&gt;
# Update the settings.py file following the method described [https://zulip.readthedocs.io/en/latest/production/upgrade.html#updating-settings-py-inline-documentation here]&lt;br /&gt;
# Once you&#039;ve made sure everything is running as expected, remove the snapshot and maintenance period&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Update_Zulip_Server_version&amp;diff=667</id>
		<title>Update Zulip Server version</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Update_Zulip_Server_version&amp;diff=667"/>
		<updated>2025-10-27T10:19:57Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Pre-upgrade Steps */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Context:&#039;&#039;&#039;&lt;br /&gt;
This procedure was written at the time of Zulip 11.4, and applies to upgrades performed on the chat.dsinternal.net server.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Important Notes&#039;&#039;&#039;&lt;br /&gt;
* It is possible (and supported) to upgrade across multiple versions at once — there is no need to install each intermediate release.&lt;br /&gt;
* Always consult the official Zulip upgrade documentation first. It supersedes this checklist: &lt;br /&gt;
https://zulip.readthedocs.io/en/latest/production/upgrade.html&lt;br /&gt;
* Make sure to note any important and relevant details concerning the next upgrade(s) by reading the upgrade notes for every upcoming major version:&lt;br /&gt;
https://zulip.readthedocs.io/en/latest/overview/changelog.html&lt;br /&gt;
* In our case we&#039;re following the upgrading-to-a-release method, not the upgrading-from-a-git-repository one.&lt;br /&gt;
* Read any version-specific upgrade notes in the official docs before proceeding.&lt;br /&gt;
&lt;br /&gt;
== Pre-upgrade Steps ==&lt;br /&gt;
# Read the [https://zulip.readthedocs.io/en/latest/overview/changelog.html upgrade notes] and prepare accordingly if some specific actions need to be taken&lt;br /&gt;
# Make sure the Debian and PostgreSQL versions are compatible with the upcoming update&lt;br /&gt;
# Schedule and announce the maintenance window&lt;br /&gt;
# Create a maintenance period in Zabbix&lt;br /&gt;
# Create a backup and/or snapshot&lt;br /&gt;
&lt;br /&gt;
== Upgrade Procedure ==&lt;br /&gt;
&#039;&#039;&#039;Important Notes&#039;&#039;&#039;&lt;br /&gt;
This procedure is taken from the official doc; if the official doc has been updated, this checklist should be updated too&lt;br /&gt;
# Fetch the latest build &amp;lt;code&amp;gt;curl -fLO https://download.zulip.com/server/zulip-server-latest.tar.gz&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the upgrade &amp;lt;code&amp;gt;/home/zulip/deployments/current/scripts/upgrade-zulip zulip-server-latest.tar.gz&amp;lt;/code&amp;gt;&lt;br /&gt;
# Monitor the process; expect temporary CPU and memory spikes right after the upgrade due to migrations and cache rebuilds. You probably want to wait for things to settle down before going on to the next steps.&lt;br /&gt;
&lt;br /&gt;
== Post-upgrade Tasks ==&lt;br /&gt;
# Update the settings.py file following the method described [https://zulip.readthedocs.io/en/latest/production/upgrade.html#updating-settings-py-inline-documentation here]&lt;br /&gt;
# Once you&#039;ve made sure everything is running as expected, remove the snapshot and maintenance period&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Update_Zulip_Server_version&amp;diff=666</id>
		<title>Update Zulip Server version</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Update_Zulip_Server_version&amp;diff=666"/>
		<updated>2025-10-27T10:17:39Z</updated>

		<summary type="html">&lt;p&gt;Alois: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;Context:&amp;#039;&amp;#039;&amp;#039; This procedure was written at the time of Zulip 11.4, and applies to upgrades performed on the chat.dsinternal.net server.  &amp;#039;&amp;#039;&amp;#039;Important Notes&amp;#039;&amp;#039;&amp;#039; * It is possible (and supported) to upgrade across multiple versions at once — there is no need to install each intermediate release. * Always consult the official Zulip upgrade documentation first. It supersedes this checklist:  https://zulip.readthedocs.io/en/latest/production/upgrade.html * Make sure to note...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Context:&#039;&#039;&#039;&lt;br /&gt;
This procedure was written at the time of Zulip 11.4, and applies to upgrades performed on the chat.dsinternal.net server.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Important Notes&#039;&#039;&#039;&lt;br /&gt;
* It is possible (and supported) to upgrade across multiple versions at once — there is no need to install each intermediate release.&lt;br /&gt;
* Always consult the official Zulip upgrade documentation first. It supersedes this checklist: &lt;br /&gt;
https://zulip.readthedocs.io/en/latest/production/upgrade.html&lt;br /&gt;
* Make sure to note any important and relevant details concerning the next upgrade(s) by reading the upgrade notes for every upcoming major version:&lt;br /&gt;
https://zulip.readthedocs.io/en/latest/overview/changelog.html&lt;br /&gt;
* In our case we&#039;re following the upgrading-to-a-release method, not the upgrading-from-a-git-repository one.&lt;br /&gt;
* Read any version-specific upgrade notes in the official docs before proceeding.&lt;br /&gt;
&lt;br /&gt;
== Pre-upgrade Steps ==&lt;br /&gt;
# Read the [https://zulip.readthedocs.io/en/latest/overview/changelog.html upgrade notes]&lt;br /&gt;
# Make sure the Debian and PostgreSQL versions are compatible with the upcoming update&lt;br /&gt;
# Schedule and announce the maintenance window&lt;br /&gt;
# Create a maintenance period in Zabbix&lt;br /&gt;
# Create a backup and/or snapshot&lt;br /&gt;
&lt;br /&gt;
== Upgrade Procedure ==&lt;br /&gt;
&#039;&#039;&#039;Important Notes&#039;&#039;&#039;&lt;br /&gt;
This procedure is taken from the official doc; if the official doc has been updated, this checklist should be updated too&lt;br /&gt;
# Fetch the latest build &amp;lt;code&amp;gt;curl -fLO https://download.zulip.com/server/zulip-server-latest.tar.gz&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the upgrade &amp;lt;code&amp;gt;/home/zulip/deployments/current/scripts/upgrade-zulip zulip-server-latest.tar.gz&amp;lt;/code&amp;gt;&lt;br /&gt;
# Monitor the process; expect temporary CPU and memory spikes right after the upgrade due to migrations and cache rebuilds. You probably want to wait for things to settle down before going on to the next steps.&lt;br /&gt;
&lt;br /&gt;
== Post-upgrade Tasks ==&lt;br /&gt;
# Update the settings.py file following the method described [https://zulip.readthedocs.io/en/latest/production/upgrade.html#updating-settings-py-inline-documentation here]&lt;br /&gt;
# Once you&#039;ve made sure everything is running as expected, remove the snapshot and maintenance period&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Internal&amp;diff=665</id>
		<title>Internal</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Internal&amp;diff=665"/>
		<updated>2025-10-27T09:48:43Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* SRE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Strategy ==&lt;br /&gt;
* [[Flow metrics|Flow metrics]]&lt;br /&gt;
* [[12%-time|12%-time]]&lt;br /&gt;
&lt;br /&gt;
== Finance ==&lt;br /&gt;
&lt;br /&gt;
=== Exact ===&lt;br /&gt;
&lt;br /&gt;
* [[booking bonus|Booking bonus]]&lt;br /&gt;
* [[booking wages|Booking wages]]&lt;br /&gt;
* [[booking quarterly hosting invoice|Booking quarterly hosting invoice]]&lt;br /&gt;
* [[new receipt|Enter a new receipt]]&lt;br /&gt;
* [[reconciliation|Reconciliation of transaction]]&lt;br /&gt;
* [[invoicing|Send an invoice]]&lt;br /&gt;
* [[payment reminders|Send payment reminder]]&lt;br /&gt;
* [[invoice approval|Process for approving invoices (/filed receipts)]]&lt;br /&gt;
&lt;br /&gt;
=== Bunq ===&lt;br /&gt;
&lt;br /&gt;
* [[top up account|Top up expense account]]&lt;br /&gt;
&lt;br /&gt;
== Work Process ==&lt;br /&gt;
&lt;br /&gt;
* [[Definition of done|Definition of Done]]&lt;br /&gt;
* [[Incident Handling|Incident Handling]]&lt;br /&gt;
* [[SRE Maintenance|SRE Maintenance]]&lt;br /&gt;
* [[Release checklist|Release checklist]]&lt;br /&gt;
* [[Handover]]&lt;br /&gt;
&lt;br /&gt;
== Internal Process ==&lt;br /&gt;
* [[timetracking|Timetracking process]]&lt;br /&gt;
* [[Starting work for a new client]]&lt;br /&gt;
* [[12 percent|12% time]]&lt;br /&gt;
* [[Annual leave|Annual leave]]&lt;br /&gt;
* [[Bonus allocation|Bonus allocation]]&lt;br /&gt;
* [[Calamity leave|Calamity leave]]&lt;br /&gt;
* [[Overtime|Overtime]]&lt;br /&gt;
* [[Retrospectives|Retrospectives]]&lt;br /&gt;
* [[Sick leave|Sick leave]]&lt;br /&gt;
* [[Training and self-study|Training and Self-Study]]&lt;br /&gt;
* [[Daily|Daily]]&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* Era Inventory [[project_era_inventory_api|API Description]]&lt;br /&gt;
&lt;br /&gt;
== SRE ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;To be further populated with guide from drive&#039;&#039;&lt;br /&gt;
* [[create gitlab runner host|Create a GitLab runner host]]&lt;br /&gt;
* [[vm setup|Create a (Debian) VM]]&lt;br /&gt;
* [[border update|Process for updating a border]]&lt;br /&gt;
* [[border reboot|Reboot border without downtime]]&lt;br /&gt;
* [[WS Proxmox node reboot|Reboot WS Proxmox node without downtime]]&lt;br /&gt;
* [[Resize VM Disk]]&lt;br /&gt;
* [[SRE tools]]&lt;br /&gt;
* [[Enroll Mac in Kerberos]]&lt;br /&gt;
* [[New Mac Setup]]&lt;br /&gt;
* [[Creating a VM on Hetzner]]&lt;br /&gt;
* [[Rebooting VM]]&lt;br /&gt;
* [[Rebooting Offsite]]&lt;br /&gt;
* [[ssh-fingerprints|Verifying SSH fingerprints]]&lt;br /&gt;
* [[Removing VM]]&lt;br /&gt;
* [[Install a new Disk in Server]]&lt;br /&gt;
* [[Setting Up Wildcard Subdomains with SSL on a Debian Application]]&lt;br /&gt;
* [[s3 bucket backup|Get credentials to backup S3 bucket to Zombie]]&lt;br /&gt;
* [[Hardware Incident Response: Memory Slot Failure on banshee]]&lt;br /&gt;
* [[dfz switch setup|DFZ Switch setup]]&lt;br /&gt;
* [[Update Zulip Server version]]&lt;br /&gt;
=== SLA ===&lt;br /&gt;
* [[Response for Backup Service being down for an extended period of time]]&lt;br /&gt;
&lt;br /&gt;
== Other ==&lt;br /&gt;
&lt;br /&gt;
* [[stack|Greenfield stack]]&lt;br /&gt;
* [[standard tools|Standard Tools]]&lt;br /&gt;
* [[list of unfurl debuggers|List of unfurl debuggers]]&lt;br /&gt;
* [[Recommended suppliers]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=New_receipt&amp;diff=644</id>
		<title>New receipt</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=New_receipt&amp;diff=644"/>
		<updated>2025-10-14T12:42:10Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;# Log into Exact&lt;br /&gt;
# Go to “Purchase” &amp;gt; “Entries” &amp;gt; “Create”&lt;br /&gt;
# Click “60 - Inkoopboek”&lt;br /&gt;
# Add a picture of the receipt as an attachment&lt;br /&gt;
## Always make sure you have the receipt!&lt;br /&gt;
## If you don&#039;t have a receipt (and can&#039;t get one), you should still fill in a description line with the correct account and VAT, as described in steps 6, 7 and 8.&lt;br /&gt;
# Under “Supplier”, search for and select the correct supplier&lt;br /&gt;
## If the supplier does not exist yet, you will have to add a new supplier; there is no guide for this&lt;br /&gt;
# Under “Invoice Data”&lt;br /&gt;
## Enter a description of what is on the receipt (e.g. “lunch stuff” for a receipt for lunch groceries)&lt;br /&gt;
## Set the &#039;Payment Condition&#039;:&lt;br /&gt;
### IDEAL for receipts paid with ideal&lt;br /&gt;
### PIN for receipts paid with debit card&lt;br /&gt;
### One of the payment durations if it&#039;s an invoice with a certain payment term&lt;br /&gt;
## Enter total amount on the receipt under “Total amount” &lt;br /&gt;
## Update “Invoice Date” (and usually also “Due date”) to match the date on the receipt&lt;br /&gt;
# Make sure the expenses are for the correct G/L Account, e.g. lunch/borrel stuff should be under 4012&lt;br /&gt;
# Make sure the VAT code is correct, as well as the VAT amount&lt;br /&gt;
## For receipts for foodstuffs (lunch, drinks, dinner, etc.), the VAT code needs to be `0` (`Geen BTW`), even if the receipt itself states VAT&lt;br /&gt;
# Click “Save + new”&lt;br /&gt;
# Ask Wouter to reconcile the payment+receipt&lt;br /&gt;
# Done&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Payment_reminders&amp;diff=627</id>
		<title>Payment reminders</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Payment_reminders&amp;diff=627"/>
		<updated>2025-07-30T09:56:58Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;# Log in to Exact&lt;br /&gt;
# On the main Exact page go to &amp;quot;Verkoop&amp;quot; &amp;gt; &amp;quot;Openstaande posten&amp;quot; &amp;gt; &amp;quot;Herinneringen afdrukken&amp;quot; / &amp;quot;Sales&amp;quot; &amp;gt; &amp;quot;Outstanding items&amp;quot; &amp;gt; &amp;quot;Print reminders&amp;quot;&lt;br /&gt;
# Select the invoice for which you want to send a reminder and click &amp;quot;Email&amp;quot; (If you want Exact to send the email) or &amp;quot;Afdrukken&amp;quot; / &amp;quot;Print&amp;quot; (If you want to generate a pdf that you can email yourself)&lt;br /&gt;
# Select the correct template, verify that this is what you want, and send&lt;br /&gt;
# When returning to the overview page you will now see a link to view sent reminders for this invoice&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Border_reboot&amp;diff=626</id>
		<title>Border reboot</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Border_reboot&amp;diff=626"/>
		<updated>2025-07-22T12:01:38Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Pre-flight checks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: Throughout this guide &amp;lt;ipv4&amp;gt; and &amp;lt;ipv6&amp;gt; are to be replaced by the correct IPs. If you don&#039;t know them, press the Tab key twice after typing &#039;neighbors&#039; to see the available options.&lt;br /&gt;
&lt;br /&gt;
== Pre-flight checks ==&lt;br /&gt;
These checks are to be done on the &#039;&#039;&#039;OTHER&#039;&#039;&#039; border (the border that will stay online), to ensure the cluster won&#039;t lose network connectivity while the border being rebooted is down. The commands are to be invoked in `vtysh`.&lt;br /&gt;
* Confirm our IPv4 block is announced over BGP with `show ip bgp neighbors &amp;lt;ipv4&amp;gt; advertised-routes`&lt;br /&gt;
* Confirm our IPv6 block is announced over BGP with `show bgp neighbors &amp;lt;ipv6&amp;gt; advertised-routes`&lt;br /&gt;
* Confirm that the border receives the ROUTED IPv4 routes from the router with `show ip route`&lt;br /&gt;
* Confirm that the border receives the ROUTED &amp;amp; LAN IPv6 routes from the router with `show ipv6 route`&lt;br /&gt;
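&lt;br /&gt;
If `vtysh` is available on the PATH, the four checks above can also be run non-interactively from the shell (a sketch, assuming &amp;lt;ipv4&amp;gt; and &amp;lt;ipv6&amp;gt; are replaced as described in the note at the top):&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Each -c runs a single command and prints the result to stdout&lt;br /&gt;
vtysh -c &amp;quot;show ip bgp neighbors &amp;lt;ipv4&amp;gt; advertised-routes&amp;quot;&lt;br /&gt;
vtysh -c &amp;quot;show bgp neighbors &amp;lt;ipv6&amp;gt; advertised-routes&amp;quot;&lt;br /&gt;
vtysh -c &amp;quot;show ip route&amp;quot;&lt;br /&gt;
vtysh -c &amp;quot;show ipv6 route&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;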
&lt;br /&gt;
These checks are to be done on the host you want to reboot:&lt;br /&gt;
* Set a maintenance period on Zabbix.&lt;br /&gt;
* Post in the Zulip in the relevant topic (incident&#039;s topic / &#039;SRE - General&#039; stream) that the border is going to be rebooted.&lt;br /&gt;
&lt;br /&gt;
== Disabling routing through a border ==&lt;br /&gt;
* First, perform the pre-flight checks on the &#039;&#039;&#039;OTHER&#039;&#039;&#039; border.&lt;br /&gt;
* Second, set a maintenance period on Zabbix for the border that&#039;s being rebooted.&lt;br /&gt;
&lt;br /&gt;
Third, on a border in `vtysh`, update the running configuration by invoking the following:&lt;br /&gt;
&lt;br /&gt;
* config&lt;br /&gt;
* router bgp&lt;br /&gt;
* neighbor &amp;lt;ipv4&amp;gt; shutdown&lt;br /&gt;
* neighbor &amp;lt;ipv6&amp;gt; shutdown&lt;br /&gt;
* exit&lt;br /&gt;
* router ospf&lt;br /&gt;
* no default-information originate&lt;br /&gt;
* exit&lt;br /&gt;
* router ospf6&lt;br /&gt;
* no default-information originate&lt;br /&gt;
* exit&lt;br /&gt;
* exit&lt;br /&gt;
* exit&lt;br /&gt;
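&lt;br /&gt;
The same sequence can also be issued as a single non-interactive invocation (a sketch; multiple `-c` flags are executed in order within one `vtysh` session):&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vtysh \&lt;br /&gt;
  -c &amp;quot;configure terminal&amp;quot; \&lt;br /&gt;
  -c &amp;quot;router bgp&amp;quot; \&lt;br /&gt;
  -c &amp;quot;neighbor &amp;lt;ipv4&amp;gt; shutdown&amp;quot; \&lt;br /&gt;
  -c &amp;quot;neighbor &amp;lt;ipv6&amp;gt; shutdown&amp;quot; \&lt;br /&gt;
  -c &amp;quot;exit&amp;quot; \&lt;br /&gt;
  -c &amp;quot;router ospf&amp;quot; \&lt;br /&gt;
  -c &amp;quot;no default-information originate&amp;quot; \&lt;br /&gt;
  -c &amp;quot;exit&amp;quot; \&lt;br /&gt;
  -c &amp;quot;router ospf6&amp;quot; \&lt;br /&gt;
  -c &amp;quot;no default-information originate&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;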
&lt;br /&gt;
== Reboot the border ==&lt;br /&gt;
&lt;br /&gt;
* After performing the pre-flight checks and disabling the routing, you can choose to wait until traffic has decreased (e.g. using `bmon` to check bandwidth used on interfaces)&lt;br /&gt;
* Execute `reboot` command&lt;br /&gt;
* When the border is back online, execute the relevant items (system uptime, operating system, reboot required) to ensure these will not activate a trigger after disabling maintenance mode&lt;br /&gt;
* If you do not expect any Zabbix alert related to the reboot to be fired, delete the maintenance period&lt;br /&gt;
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
&lt;br /&gt;
Undoing the shutdown of the neighbors can be done by invoking `no neighbor &amp;lt;ipv4&amp;gt;/&amp;lt;ipv6&amp;gt; shutdown` in the `router bgp` part of the configuration.&lt;br /&gt;
&lt;br /&gt;
The `no default-information originate` can be undone by invoking `default-information originate` in the correct OSPF part of the configuration (`router ospf` or `router ospf6`, depending on which one you wish to re-enable).&lt;br /&gt;
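&lt;br /&gt;
Put together in `vtysh`, re-enabling everything looks like this (a sketch mirroring the commands above):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
config&lt;br /&gt;
router bgp&lt;br /&gt;
no neighbor &amp;lt;ipv4&amp;gt; shutdown&lt;br /&gt;
no neighbor &amp;lt;ipv6&amp;gt; shutdown&lt;br /&gt;
exit&lt;br /&gt;
router ospf&lt;br /&gt;
default-information originate&lt;br /&gt;
exit&lt;br /&gt;
router ospf6&lt;br /&gt;
default-information originate&lt;br /&gt;
end&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;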
&lt;br /&gt;
A reload or restart of the routing service will also restore the normal configuration.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=625</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=625"/>
		<updated>2025-07-08T10:31:46Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
For full context and team discussions, see the [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical/topic/.E2.9C.94.20banshee.2Ews.2Emaxmaton.2Enl Zulip conversation related to this incident].&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;br /&gt;
&lt;br /&gt;
== Confirmed it&#039;s a hardware issue ==&lt;br /&gt;
&lt;br /&gt;
In this case we received two alerts:&lt;br /&gt;
&lt;br /&gt;
1.&lt;br /&gt;
* iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
&lt;br /&gt;
2.&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
* Problem with memory in slot DIMM.Socket.A1&lt;br /&gt;
&lt;br /&gt;
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
journalctl -b | grep -i memory&lt;br /&gt;
journalctl -k | grep -i error&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We saw multiple entries reporting Hardware Error.&lt;br /&gt;
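&lt;br /&gt;
If the kernel&#039;s EDAC driver is loaded, the per-memory-controller error counters in sysfs can provide additional confirmation (a sketch; the exact paths depend on the platform&#039;s EDAC support):&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Corrected (ce) and uncorrected (ue) error counts per memory controller&lt;br /&gt;
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;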
&lt;br /&gt;
This was also confirmed by checking hardware health on the iDRAC interface:&lt;br /&gt;
 &lt;br /&gt;
[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|center|alt=Banshee&#039;s hardware health on iDRAC|Banshee&#039;s hardware health on iDRAC]]&lt;br /&gt;
&lt;br /&gt;
[[File:Image(1).png|thumb|center|alt=Banshee iDRAC logs|Banshee iDRAC logs]]&lt;br /&gt;
&lt;br /&gt;
== Came up with a plan ==&lt;br /&gt;
&lt;br /&gt;
Once it was confirmed that the memory issue on banshee was a genuine hardware failure, it was decided that a physical intervention was necessary to replace the faulty memory module. The first step was to migrate all VMs running on banshee to other available nodes in the Proxmox cluster to avoid service interruption. After ensuring that no critical workloads were running on banshee, the server could be safely shut down in preparation for hardware replacement at the datacenter.&lt;br /&gt;
&lt;br /&gt;
== VMs Migration and Shutdown ==&lt;br /&gt;
&lt;br /&gt;
While Proxmox HA is designed to automatically handle VM migrations in the event of node failures, in this case the degraded state of the memory made the bulk migration process unstable, causing the host to crash mid-way through and sending VMs into fencing mode. In hindsight, migrating VMs manually one by one would likely have been a safer strategy. Further technical details of the incident and recovery process can be found in the Zulip conversation.&lt;br /&gt;
&lt;br /&gt;
Things to consider next time:&lt;br /&gt;
* Avoid bulk HA-triggered migration if the server is already unstable — migrate VMs manually one at a time&lt;br /&gt;
* Verify HA master node is responsive before initiating HA operations&lt;br /&gt;
* Test migration procedures on a non-critical VM first&lt;br /&gt;
&lt;br /&gt;
== Intervention at the Datacenter ==&lt;br /&gt;
&lt;br /&gt;
Once the replacement memory stick was delivered, we scheduled a physical intervention at the datacenter to carry out the replacement. The goal was to bring the banshee node back online without any hardware issues and reintegrate it into the cluster safely.&lt;br /&gt;
&lt;br /&gt;
To guide the intervention, we used a checklist outlining all the necessary steps — from powering down the machine and replacing the faulty DIMM to validating the memory installation and carefully reintroducing VMs to the node. This helped ensure that each task was executed in the correct order and nothing critical was overlooked.&lt;br /&gt;
&lt;br /&gt;
Note: The checklist we followed is not intended to be a definitive or one-size-fits-all procedure. It should instead be seen as a practical example — a starting point that can be adapted depending on the specific hardware issue, node role, and service criticality involved in future incidents.&lt;br /&gt;
&lt;br /&gt;
* Remove Banshee from HA groups to prevent automatic VM migrations&lt;br /&gt;
* Set up maintenance period for Banshee&lt;br /&gt;
* Turn Banshee off&lt;br /&gt;
* Disconnect Banshee&lt;br /&gt;
* Open up Banshee&lt;br /&gt;
* Locate and remove the faulty memory stick&lt;br /&gt;
* Install the new memory stick in the correct slot&lt;br /&gt;
* Record serial numbers of memory sticks in Netbox&lt;br /&gt;
* Close Banshee, reconnect power and network, power it on, and connect a monitor&lt;br /&gt;
* Enter the Lifecycle Controller and confirm that the memory is detected&lt;br /&gt;
* In the Lifecycle Controller, run a memory test: if the test fails, repeat the previous steps with another memory stick&lt;br /&gt;
* Migrate a few selected test VMs back to Banshee&lt;br /&gt;
* Once the system is stable and VMs are confirmed to run correctly, migrate all intended VMs to Banshee&lt;br /&gt;
* Add Banshee back to the original HA groups&lt;br /&gt;
* Make sure OSDs come back online&lt;br /&gt;
* Remove maintenance period&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=624</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=624"/>
		<updated>2025-07-08T10:00:49Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
For full context and team discussions, see the [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical/topic/.E2.9C.94.20banshee.2Ews.2Emaxmaton.2Enl Zulip conversation related to this incident].&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;br /&gt;
&lt;br /&gt;
== Confirmed it&#039;s a hardware issue ==&lt;br /&gt;
&lt;br /&gt;
In this case we received two alerts:&lt;br /&gt;
&lt;br /&gt;
1.&lt;br /&gt;
* iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
&lt;br /&gt;
2.&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
* Problem with memory in slot DIMM.Socket.A1&lt;br /&gt;
&lt;br /&gt;
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
journalctl -b | grep -i memory&lt;br /&gt;
journalctl -k | grep -i error&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We saw multiple entries reporting Hardware Error.&lt;br /&gt;
&lt;br /&gt;
This was also confirmed by checking hardware health on the iDRAC interface:&lt;br /&gt;
 &lt;br /&gt;
[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|center|alt=Banshee&#039;s hardware health on iDRAC|Banshee&#039;s hardware health on iDRAC]]&lt;br /&gt;
&lt;br /&gt;
[[File:Image(1).png|thumb|center|alt=Banshee iDRAC logs|Banshee iDRAC logs]]&lt;br /&gt;
&lt;br /&gt;
== Came up with a plan ==&lt;br /&gt;
&lt;br /&gt;
Once it was confirmed that the memory issue on banshee was a genuine hardware failure, it was decided that a physical intervention was necessary to replace the faulty memory module. The first step was to migrate all VMs running on banshee to other available nodes in the Proxmox cluster to avoid service interruption. After ensuring that no critical workloads were running on banshee, the server could be safely shut down in preparation for hardware replacement at the datacenter.&lt;br /&gt;
&lt;br /&gt;
== VMs Migration and Shutdown ==&lt;br /&gt;
&lt;br /&gt;
While Proxmox HA is designed to automatically handle VM migrations in the event of node failures, in this case the degraded state of the memory made the bulk migration process unstable, causing the host to crash mid-way through and sending VMs into fencing mode. In hindsight, migrating VMs manually one by one would likely have been a safer strategy. Further technical details of the incident and recovery process can be found in the Zulip conversation.&lt;br /&gt;
&lt;br /&gt;
Things to consider next time:&lt;br /&gt;
* Avoid bulk HA-triggered migration if the server is already unstable — migrate VMs manually one at a time&lt;br /&gt;
* Verify HA master node is responsive before initiating HA operations&lt;br /&gt;
* Test migration procedures on a non-critical VM first&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=623</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=623"/>
		<updated>2025-07-08T09:28:58Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Confirmed it&amp;#039;s a hardware issue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
For full context and team discussions, see the [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical/topic/.E2.9C.94.20banshee.2Ews.2Emaxmaton.2Enl Zulip conversation related to this incident].&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;br /&gt;
&lt;br /&gt;
== Confirmed it&#039;s a hardware issue ==&lt;br /&gt;
&lt;br /&gt;
In this case we received two alerts:&lt;br /&gt;
&lt;br /&gt;
1.&lt;br /&gt;
* iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
&lt;br /&gt;
2.&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
* Problem with memory in slot DIMM.Socket.A1&lt;br /&gt;
&lt;br /&gt;
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
journalctl -b | grep -i memory&lt;br /&gt;
journalctl -k | grep -i error&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We saw multiple entries reporting Hardware Error.&lt;br /&gt;
&lt;br /&gt;
This was also confirmed by checking hardware health on the iDRAC interface:&lt;br /&gt;
 &lt;br /&gt;
[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|center|alt=Banshee&#039;s hardware health on iDRAC|Banshee&#039;s hardware health on iDRAC]]&lt;br /&gt;
&lt;br /&gt;
[[File:Image(1).png|thumb|center|alt=Banshee iDRAC logs|Banshee iDRAC logs]]&lt;br /&gt;
&lt;br /&gt;
== Came up with a plan ==&lt;br /&gt;
&lt;br /&gt;
Once it was confirmed that the memory issue on banshee was a genuine hardware failure, it was decided that a physical intervention was necessary to replace the faulty memory module. The first step was to migrate all VMs running on banshee to other available nodes in the Proxmox cluster to avoid service interruption. After ensuring that no critical workloads were running on banshee, the server could be safely shut down in preparation for hardware replacement at the datacenter.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=621</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=621"/>
		<updated>2025-06-12T09:42:04Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Confirm it&amp;#039;s a hardware issue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
For full context and team discussions, see the [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical/topic/.E2.9C.94.20banshee.2Ews.2Emaxmaton.2Enl Zulip conversation related to this incident].&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;br /&gt;
&lt;br /&gt;
== Confirmed it&#039;s a hardware issue ==&lt;br /&gt;
&lt;br /&gt;
In this case we received two alerts:&lt;br /&gt;
1.&lt;br /&gt;
* iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
&lt;br /&gt;
2.&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
* Problem with memory in slot DIMM.Socket.A1&lt;br /&gt;
&lt;br /&gt;
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
journalctl -b | grep -i memory&lt;br /&gt;
journalctl -k | grep -i error&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We saw multiple entries reporting Hardware Error.&lt;br /&gt;
&lt;br /&gt;
This was also confirmed by checking hardware health on the iDRAC interface:&lt;br /&gt;
 &lt;br /&gt;
[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|center|alt=Banshee&#039;s hardware health on iDRAC|Banshee&#039;s hardware health on iDRAC]]&lt;br /&gt;
&lt;br /&gt;
[[File:Image(1).png|thumb|center|alt=Banshee iDRAC logs|Banshee iDRAC logs]]&lt;br /&gt;
&lt;br /&gt;
== Came up with a plan ==&lt;br /&gt;
&lt;br /&gt;
Once it was confirmed that the memory issue on banshee was a genuine hardware failure, it was decided that a physical intervention was necessary to replace the faulty memory module. The first step was to migrate all VMs running on banshee to other available nodes in the Proxmox cluster to avoid service interruption. After ensuring that no critical workloads were running on banshee, the server could be safely shut down in preparation for hardware replacement at the datacenter.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=620</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=620"/>
		<updated>2025-06-12T09:31:30Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Confirm it&amp;#039;s a hardware issue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
For full context and team discussions, see the [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical/topic/.E2.9C.94.20banshee.2Ews.2Emaxmaton.2Enl Zulip conversation related to this incident].&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;br /&gt;
&lt;br /&gt;
== Confirm it&#039;s a hardware issue ==&lt;br /&gt;
&lt;br /&gt;
In this case we received two alerts:&lt;br /&gt;
1.&lt;br /&gt;
* iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
&lt;br /&gt;
2.&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
* Problem with memory in slot DIMM.Socket.A1&lt;br /&gt;
&lt;br /&gt;
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
journalctl -b | grep -i memory&lt;br /&gt;
journalctl -k | grep -i error&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We saw multiple entries reporting Hardware Error.&lt;br /&gt;
&lt;br /&gt;
This was also confirmed by checking hardware health on the iDRAC interface:&lt;br /&gt;
 &lt;br /&gt;
[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|left|alt=Banshee&#039;s hardware health on iDRAC|Banshee&#039;s hardware health on iDRAC]]&lt;br /&gt;
&lt;br /&gt;
[[File:Image(1).png|thumb|left|alt=Banshee iDRAC logs|Banshee iDRAC logs]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Image(1).png&amp;diff=619</id>
		<title>File:Image(1).png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Image(1).png&amp;diff=619"/>
		<updated>2025-06-12T09:30:59Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Banshee iDRAC logs&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=618</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=618"/>
		<updated>2025-06-12T09:29:37Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Confirm it&amp;#039;s a hardware issue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
For full context and team discussions, see the [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical/topic/.E2.9C.94.20banshee.2Ews.2Emaxmaton.2Enl Zulip conversation related to this incident].&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;br /&gt;
&lt;br /&gt;
== Confirm it&#039;s a hardware issue ==&lt;br /&gt;
&lt;br /&gt;
In this case we received two alerts:&lt;br /&gt;
1.&lt;br /&gt;
* iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
&lt;br /&gt;
2.&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
* Problem with memory in slot DIMM.Socket.A1&lt;br /&gt;
&lt;br /&gt;
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
journalctl -b | grep -i memory&lt;br /&gt;
journalctl -k | grep -i error&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We saw multiple entries reporting Hardware Error.&lt;br /&gt;
&lt;br /&gt;
This was also confirmed by checking hardware health on the iDRAC interface:&lt;br /&gt;
 &lt;br /&gt;
[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|left|alt=Banshee&#039;s hardware health on iDRAC|Banshee&#039;s hardware health on iDRAC]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=617</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=617"/>
		<updated>2025-06-12T09:27:08Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
For full context and team discussions, see the [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical/topic/.E2.9C.94.20banshee.2Ews.2Emaxmaton.2Enl Zulip conversation related to this incident].&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;br /&gt;
&lt;br /&gt;
== Confirm it&#039;s a hardware issue ==&lt;br /&gt;
&lt;br /&gt;
In this case we received two alerts:&lt;br /&gt;
1.&lt;br /&gt;
* iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
&lt;br /&gt;
2.&lt;br /&gt;
* Overall System Status is Critical (5)&lt;br /&gt;
* Problem with memory in slot DIMM.Socket.A1&lt;br /&gt;
&lt;br /&gt;
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
journalctl -b | grep -i memory&lt;br /&gt;
journalctl -k | grep -i error&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We saw multiple entries reporting Hardware Error.&lt;br /&gt;
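As a rough sketch of the filtering above, here it is run against a hypothetical log excerpt (the sample lines are illustrative, not taken from banshee; on the real host you would pipe `journalctl -k` instead of the sample):&lt;br /&gt;

```shell
# Hypothetical kernel log excerpt; on the affected host the input
# would come from `journalctl -k` instead of this sample.
sample_log='mce: [Hardware Error]: Machine check events logged
EDAC MC0: CE memory read error on DIMM.Socket.A1
kernel: [Hardware Error]: event severity: corrected'

# Count the lines reporting a hardware error, case-insensitively.
printf '%s\n' "$sample_log" | grep -ic 'hardware error'   # prints 2
```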
&lt;br /&gt;
This was also confirmed by checking hardware health on the iDRAC interface:&lt;br /&gt;
 &lt;br /&gt;
[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|alt=Banshee&#039;s hardware health on iDRAC|Banshee&#039;s hardware health on iDRAC]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Banshee.idrac.ws.maxmaton.nl_restgui_index.html_8ce2fb21ce62c14bc4975f040b973a5f(1).png&amp;diff=616</id>
		<title>File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Banshee.idrac.ws.maxmaton.nl_restgui_index.html_8ce2fb21ce62c14bc4975f040b973a5f(1).png&amp;diff=616"/>
		<updated>2025-06-12T09:26:04Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Banshee Hardware health iDrac&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=615</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=615"/>
		<updated>2025-06-12T09:00:12Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
For full context and team discussions, see the [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical/topic/.E2.9C.94.20banshee.2Ews.2Emaxmaton.2Enl Zulip conversation related to this incident].&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=614</id>
		<title>Hardware Incident Response: Memory Slot Failure on banshee</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Hardware_Incident_Response:_Memory_Slot_Failure_on_banshee&amp;diff=614"/>
		<updated>2025-06-12T08:57:20Z</updated>

		<summary type="html">&lt;p&gt;Alois: Created page with &amp;quot;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.  ⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.  This write-up should be seen as...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions.&lt;br /&gt;
&lt;br /&gt;
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.&lt;br /&gt;
&lt;br /&gt;
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Internal&amp;diff=613</id>
		<title>Internal</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Internal&amp;diff=613"/>
		<updated>2025-06-12T08:54:44Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* SRE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Strategy ==&lt;br /&gt;
* [[Flow metrics|Flow metrics]]&lt;br /&gt;
* [[12%-time|12%-time]]&lt;br /&gt;
&lt;br /&gt;
== Finance ==&lt;br /&gt;
&lt;br /&gt;
=== Exact ===&lt;br /&gt;
&lt;br /&gt;
* [[booking bonus|Booking bonus]]&lt;br /&gt;
* [[booking wages|Booking wages]]&lt;br /&gt;
* [[booking quarterly hosting invoice|Booking quarterly hosting invoice]]&lt;br /&gt;
* [[new receipt|Enter a new receipt]]&lt;br /&gt;
* [[reconciliation|Reconciliation of transaction]]&lt;br /&gt;
* [[invoicing|Send an invoice]]&lt;br /&gt;
* [[payment reminders|Send payment reminder]]&lt;br /&gt;
* [[invoice approval|Process for approving invoices (/filed receipts)]]&lt;br /&gt;
&lt;br /&gt;
=== Bunq ===&lt;br /&gt;
&lt;br /&gt;
* [[top up account|Top up expense account]]&lt;br /&gt;
&lt;br /&gt;
== Work Process ==&lt;br /&gt;
&lt;br /&gt;
* [[Definition of done|Definition of Done]]&lt;br /&gt;
* [[Incident Handling|Incident Handling]]&lt;br /&gt;
* [[SRE Maintenance|SRE Maintenance]]&lt;br /&gt;
* [[Release checklist|Release checklist]]&lt;br /&gt;
&lt;br /&gt;
== Internal Process ==&lt;br /&gt;
* [[timetracking|Timetracking process]]&lt;br /&gt;
* [[Starting work for a new client]]&lt;br /&gt;
* [[12 percent|12% time]]&lt;br /&gt;
* [[Annual leave|Annual leave]]&lt;br /&gt;
* [[Bonus allocation|Bonus allocation]]&lt;br /&gt;
* [[Calamity leave|Calamity leave]]&lt;br /&gt;
* [[Overtime|Overtime]]&lt;br /&gt;
* [[Retrospectives|Retrospectives]]&lt;br /&gt;
* [[Sick leave|Sick leave]]&lt;br /&gt;
* [[Training and self-study|Training and Self-Study]]&lt;br /&gt;
* [[Daily|Daily]]&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* Era Inventory [[project_era_inventory_api|API Description]]&lt;br /&gt;
&lt;br /&gt;
== SRE ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;To be further populated with guide from drive&#039;&#039;&lt;br /&gt;
* [[create gitlab runner host|Create a GitLab runner host]]&lt;br /&gt;
* [[vm setup|Create a (Debian) VM]]&lt;br /&gt;
* [[border update|Process for updating a border]]&lt;br /&gt;
* [[border reboot|Reboot border without downtime]]&lt;br /&gt;
* [[WS Proxmox node reboot|Reboot WS Proxmox node without downtime]]&lt;br /&gt;
* [[Resize VM Disk]]&lt;br /&gt;
* [[SRE tools]]&lt;br /&gt;
* [[Enroll Mac in Kerberos]]&lt;br /&gt;
* [[New Mac Setup]]&lt;br /&gt;
* [[Creating a VM on Hetzner]]&lt;br /&gt;
* [[Rebooting VM]]&lt;br /&gt;
* [[Rebooting Offsite]]&lt;br /&gt;
* [[ssh-fingerprints|Verifying SSH fingerprints]]&lt;br /&gt;
* [[Removing VM]]&lt;br /&gt;
* [[Install a new Disk in Server]]&lt;br /&gt;
* [[Setting Up Wildcard Subdomains with SSL on a Debian Application]]&lt;br /&gt;
* [[Hardware Incident Response: Memory Slot Failure on banshee]]&lt;br /&gt;
=== SLA ===&lt;br /&gt;
* [[Response for Backup Service being down for an extended period of time]]&lt;br /&gt;
&lt;br /&gt;
== Other ==&lt;br /&gt;
&lt;br /&gt;
* [[stack|Greenfield stack]]&lt;br /&gt;
* [[standard tools|Standard Tools]]&lt;br /&gt;
* [[list of unfurl debuggers|List of unfurl debuggers]]&lt;br /&gt;
* [[Recommended suppliers]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Removing_VM&amp;diff=612</id>
		<title>Removing VM</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Removing_VM&amp;diff=612"/>
		<updated>2025-06-10T13:56:44Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This guide was written with the general idea of undoing the steps from our [https://docs.google.com/document/d/1UWESupnHD5jVHTAhdngAH5_HGViwDL_7G5xTX4b9CYs VM Setup guide] going in reverse order.&lt;br /&gt;
&lt;br /&gt;
=== Step 1 : Disable the host on Zabbix ===&lt;br /&gt;
This avoids having to create a maintenance period for the next step.&lt;br /&gt;
This can be done in two ways:&lt;br /&gt;
*1st way : In Zabbix &amp;gt; Configuration &amp;gt; Hosts, find the VM you want to disable. In the “Status” column you should see “Enabled”; click “Enabled” to disable the VM. You’ll get a confirmation prompt.&lt;br /&gt;
[[File:Screenshot 2024-09-02 at 11.04.36.png|200px|thumb|center|disabling host confirmation prompt]]&lt;br /&gt;
&lt;br /&gt;
*2nd way : In Zabbix &amp;gt; Configuration &amp;gt; Hosts, find the VM you want to disable and click its name. A configuration panel appears; uncheck the “Enabled” checkbox.&lt;br /&gt;
[[File:Screenshot 2024-09-02 at 11.07.07.png|200px|thumb|center|alt=Zabbix configuration panel|configuration panel]]&lt;br /&gt;
&lt;br /&gt;
=== Step 2 : Remove VM from HA group ===&lt;br /&gt;
In Proxmox&lt;br /&gt;
*1. From your VM panel go to More &amp;gt; Manage HA in the top right corner&lt;br /&gt;
[[File:Proxmox-more-ha.png|200px|thumb|center|alt=location of the &amp;quot;Manage HA&amp;quot; option|location of the &amp;quot;Manage HA&amp;quot; option]]&lt;br /&gt;
*2. Set Request State to “stopped”&lt;br /&gt;
[[File:Screenshot 2024-09-02 at 11.33.16.png|200px|thumb|center|alt=HA config panel|HA config panel]]&lt;br /&gt;
&lt;br /&gt;
=== Step 3 : Undo IPA enrollment ===&lt;br /&gt;
&#039;&#039;&#039;This step can only be performed by the First Responder &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
*1. Select the desired hostname and click delete &lt;br /&gt;
[[File:IPA-Hosts-Panel.png|200px|thumb|center|alt=IPA hosts panel|IPA hosts panel]]&lt;br /&gt;
&lt;br /&gt;
*2. This will open a modal; make sure to check the option to remove all A, AAAA, SSHFP and PTR records.&lt;br /&gt;
[[File:IPA-host-removal.png|200px|thumb|center|alt=Host removal confirmation prompt|Host removal confirmation prompt]]&lt;br /&gt;
&lt;br /&gt;
=== Step 4 : Remove all references ===&lt;br /&gt;
Remove any references to your VM from our [https://git.dsinternal.net/delft-solutions/infrastructure/dns DNS repo].&lt;br /&gt;
&lt;br /&gt;
=== Step 5 : Remove the VM on Proxmox ===&lt;br /&gt;
Quick note: make sure to note down the VM&#039;s ID, as it will be needed for step 6.&lt;br /&gt;
&lt;br /&gt;
*1. From your VM panel go to More &amp;gt; Remove in the top right corner&lt;br /&gt;
[[File:Proxmox-vm-removal-location.png|200px|thumb|center|alt=Proxmox vm removal option location|Proxmox vm removal option location]]&lt;br /&gt;
&lt;br /&gt;
*2. This will open a modal; make sure to check all boxes and confirm the ID of the VM you’re removing.&lt;br /&gt;
[[File:VM-removal-confirmation-prompt.png|200px|thumb|center|alt=VM removal confirmation panel|VM removal confirmation panel]]&lt;br /&gt;
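For reference, steps 2 and 5 can also be sketched from the node&#039;s command line with Proxmox&#039;s `ha-manager` and `qm` tools. This is a dry-run sketch, not a tested procedure: VMID 123 is a placeholder, and you would drop the `echo`s to actually run the commands. The GUI steps above remain the reference.&lt;br /&gt;

```shell
# Placeholder VM ID; substitute the ID you noted down for step 6.
VMID=123

# Dry run: print the commands instead of executing them (remove `echo` to run).
echo ha-manager remove "vm:$VMID"                              # step 2: drop the HA resource
echo qm stop "$VMID"                                           # make sure the VM is stopped
echo qm destroy "$VMID" --purge --destroy-unreferenced-disks   # step 5: remove the VM and its disks
```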
&lt;br /&gt;
=== Step 6 : Purge any backup data ===&lt;br /&gt;
&lt;br /&gt;
*1. At the time of writing, backups are stored on zombie.maxmaton.nl. Because this is a third-party service, we cannot purge the data ourselves. Ask Max to purge the backup data, providing him with the ID of the VM you want to purge backup data for.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Removing_VM&amp;diff=611</id>
		<title>Removing VM</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Removing_VM&amp;diff=611"/>
		<updated>2025-06-10T13:52:46Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Step 5 : Remove the vm on proxmox */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This guide was written with the general idea of undoing the steps from our [https://docs.google.com/document/d/1UWESupnHD5jVHTAhdngAH5_HGViwDL_7G5xTX4b9CYs VM Setup guide] going in reverse order.&lt;br /&gt;
&lt;br /&gt;
=== Step 1 : Disable the host on Zabbix ===&lt;br /&gt;
This avoids having to create a maintenance period for the next step.&lt;br /&gt;
This can be done in two ways:&lt;br /&gt;
*1st way : In Zabbix &amp;gt; Configuration &amp;gt; Hosts, find the VM you want to disable. In the “Status” column you should see “Enabled”; click “Enabled” to disable the VM. You’ll get a confirmation prompt.&lt;br /&gt;
[[File:Screenshot 2024-09-02 at 11.04.36.png|200px|thumb|center|disabling host confirmation prompt]]&lt;br /&gt;
&lt;br /&gt;
*2nd way : In Zabbix &amp;gt; Configuration &amp;gt; Hosts, find the VM you want to disable and click its name. A configuration panel appears; uncheck the “Enabled” checkbox.&lt;br /&gt;
[[File:Screenshot 2024-09-02 at 11.07.07.png|200px|thumb|center|alt=Zabbix configuration panel|configuration panel]]&lt;br /&gt;
&lt;br /&gt;
=== Step 2 : Remove VM from HA group ===&lt;br /&gt;
In Proxmox&lt;br /&gt;
*1. From your VM panel go to More &amp;gt; Manage HA in the top right corner&lt;br /&gt;
[[File:Proxmox-more-ha.png|200px|thumb|center|alt=location of the &amp;quot;Manage HA&amp;quot; option|location of the &amp;quot;Manage HA&amp;quot; option]]&lt;br /&gt;
*2. Set Request State to “stopped”&lt;br /&gt;
[[File:Screenshot 2024-09-02 at 11.33.16.png|200px|thumb|center|alt=HA config panel|HA config panel]]&lt;br /&gt;
&lt;br /&gt;
=== Step 3 : Undo IPA enrollment ===&lt;br /&gt;
&#039;&#039;&#039;This step can only be performed by the First Responder &#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
*1. Select the desired hostname and click delete &lt;br /&gt;
[[File:IPA-Hosts-Panel.png|200px|thumb|center|alt=IPA hosts panel|IPA hosts panel]]&lt;br /&gt;
&lt;br /&gt;
*2. This will open a modal; make sure to check the option to remove all A, AAAA, SSHFP and PTR records.&lt;br /&gt;
[[File:IPA-host-removal.png|200px|thumb|center|alt=Host removal confirmation prompt|Host removal confirmation prompt]]&lt;br /&gt;
&lt;br /&gt;
=== Step 4 : Remove all references ===&lt;br /&gt;
Remove any references to your VM from our [https://git.dsinternal.net/delft-solutions/infrastructure/dns DNS repo].&lt;br /&gt;
&lt;br /&gt;
=== Step 5 : Remove the VM on Proxmox ===&lt;br /&gt;
Quick note: make sure to note down the VM&#039;s ID, as it will be needed for step 6.&lt;br /&gt;
&lt;br /&gt;
*1. From your VM panel go to More &amp;gt; Remove in the top right corner&lt;br /&gt;
[[File:Proxmox-vm-removal-location.png|200px|thumb|center|alt=Proxmox vm removal option location|Proxmox vm removal option location]]&lt;br /&gt;
&lt;br /&gt;
*2. This will open a modal; make sure to check all boxes and confirm the ID of the VM you’re removing.&lt;br /&gt;
[[File:VM-removal-confirmation-prompt.png|200px|thumb|center|alt=VM removal confirmation panel|VM removal confirmation panel]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=606</id>
		<title>Incident Handling</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=606"/>
		<updated>2025-05-02T14:15:27Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Checklist =&lt;br /&gt;
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You&#039;re encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material.&lt;br /&gt;
&lt;br /&gt;
=== Critical Incidents ===&lt;br /&gt;
Critical incidents must be resolved within 16 hours. &lt;br /&gt;
&lt;br /&gt;
# Acknowledge trigger in Zabbix.&lt;br /&gt;
# Check if incident is still ongoing.&lt;br /&gt;
# If ongoing and clients are potentially affected, notify the affected clients via Slack.&lt;br /&gt;
# Document all actions taken in Zulip topic.&lt;br /&gt;
# Create plan of action.&lt;br /&gt;
# Execute plan and document results in Zabbix thread. &lt;br /&gt;
# If unresolved, create new plan.&lt;br /&gt;
# When resolved:&lt;br /&gt;
## Verify trigger is no longer firing.&lt;br /&gt;
## Mark Zulip topic as resolved if no other incidents for host.&lt;br /&gt;
## Check for related triggers and resolve them.&lt;br /&gt;
&lt;br /&gt;
Common issues that have occurred previously, and &#039;&#039;could&#039;&#039; occur again:&lt;br /&gt;
* SSH down: Check MaxStartups throttling, apply custom SSH config&lt;br /&gt;
* No backup: Verify backup process is running, check devteam email&lt;br /&gt;
* HTTPS down on Sunday: this can be due to Gitlab updates&lt;br /&gt;
&lt;br /&gt;
=== Non-Critical Incidents ===&lt;br /&gt;
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.&lt;br /&gt;
&lt;br /&gt;
# Acknowledge in Zabbix thread&lt;br /&gt;
# Check metrics sheet for existing milestone&lt;br /&gt;
## If a milestone exists:&lt;br /&gt;
### Add Lynx project ID to Zulip topic&lt;br /&gt;
### Add 🔁 emoji if ID already reported&lt;br /&gt;
## If no milestone exists:&lt;br /&gt;
### Add to metrics sheet&lt;br /&gt;
### Create Lynx project (priority 99, then 20 after estimation)&lt;br /&gt;
### Create Kimai activity&lt;br /&gt;
### Document IDs in Zulip topic&lt;br /&gt;
&lt;br /&gt;
=== Informational Incidents ===&lt;br /&gt;
Informational incidents must be acknowledged within 72 hours.&lt;br /&gt;
&lt;br /&gt;
# Acknowledge in Zabbix&lt;br /&gt;
# Verify issue&lt;br /&gt;
# Take action if needed&lt;br /&gt;
&lt;br /&gt;
=== External Reports ===&lt;br /&gt;
&lt;br /&gt;
# Acknowledge receipt&lt;br /&gt;
# Classify report as critical, non-critical or informational. &lt;br /&gt;
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details. &lt;br /&gt;
# Proceed with checklist above for the type of incident.&lt;br /&gt;
&lt;br /&gt;
= Full procedure =&lt;br /&gt;
&lt;br /&gt;
== General Rules ==&lt;br /&gt;
&lt;br /&gt;
# When an incident is in progress and person A is handling it, all incidents in area X are handled by person A rather than the FR, unless person A&#039;s working day ends. Person A should communicate clearly to the FR when their day is over.&lt;br /&gt;
# FR always has the last word on what solution to apply for resolving an incident.&lt;br /&gt;
&lt;br /&gt;
== Zulip migration ==&lt;br /&gt;
Due to a migration to Zulip, the integration as was available on Mattermost is not available yet on Zulip. This leads to the following process changes:&lt;br /&gt;
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix&lt;br /&gt;
* Triggers are grouped in a topic on Zulip per host&lt;br /&gt;
* When an incident has been fully resolved, mark the topic as resolved once all other incidents reported for the host are also resolved&lt;br /&gt;
* There&#039;s no `?ongoing`, instead for now we can track open incidents by checking for unresolved topics&lt;br /&gt;
* The posting of incidents is less smart (only posting when not posted yet), so to prevent an incident from going unreported due to network issues or the like, a message is posted after an interval (8 hours for non-critical and lower, 1 hour for critical and above) while the incident has not been acknowledged.&lt;br /&gt;
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.&lt;br /&gt;
* There is no automatic gitlab issue creation or syncing anymore.&lt;br /&gt;
&lt;br /&gt;
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.&lt;br /&gt;
&lt;br /&gt;
== Critical incidents ==&lt;br /&gt;
&#039;&#039;&#039;Critical incidents are resolved within 16 hours.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are required to do all the work yourself. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.&lt;br /&gt;
Involving multiple people can quickly become necessary if multiple critical incidents with different causes occur simultaneously. In that case, the First Responder usually takes on more of an information-management role and steers those who are brought on towards resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quickly be determined to be a single issue, the crashed server, so you wouldn&#039;t need to call in people to manage each incident. But a client&#039;s service being down in one cluster while in a different cluster a different VM no longer boots is likely to be two different issues, so you&#039;d want to call in help to resolve the incidents in time.)&lt;br /&gt;
&lt;br /&gt;
=== Process ===&lt;br /&gt;
The general process is made up of the following steps. Each step has additional information on how to handle/execute them in the sections below.&lt;br /&gt;
# Take responsibility for seeing the incident resolved&lt;br /&gt;
# Determine if incident is still ongoing&lt;br /&gt;
# If ongoing: Communicate to affected clients that the issue is being investigated&lt;br /&gt;
# Communicate plan/next steps (even if that is gathering information)&lt;br /&gt;
# Communicate findings/results of executed plan, go back to previous step if not resolved&lt;br /&gt;
# Resolve incident + cleanup&lt;br /&gt;
&lt;br /&gt;
While working on an incident it is expected that all communication happens in the incident&#039;s thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident is resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident&#039;s thread with a comment that the resolution is handled there.&lt;br /&gt;
&lt;br /&gt;
==== Acknowledge the incident on Zabbix ====&lt;br /&gt;
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is, however, entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or it can be because of a shared cause of failure. For example, a server crashing will cause its VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly survey what&#039;s ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to handle first.&lt;br /&gt;
&lt;br /&gt;
* Acknowledging an incident on Zabbix stops Zabbix from calling the First Responder to notify them of the ongoing incident, and stops it from posting reminders on Zulip.&lt;br /&gt;
&lt;br /&gt;
==== Determine if incident is still ongoing ====&lt;br /&gt;
The next step is to check if the reported problem is still ongoing. Depending on the observations made here, the process to follow and the steps needed to resolve the incident can change. There are three options:&lt;br /&gt;
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.&lt;br /&gt;
# The trigger resolved itself and the problem can still be observed.&lt;br /&gt;
# The trigger is still firing but the problem cannot be observed: our triggers might not be perfect, so something else could be causing the trigger to fire. A simple example would be Zabbix reporting that the DNS for a site can&#039;t be resolved, while in reality there&#039;s a bug in the script we wrote to check DNS resolution and the DNS resolves fine. Final note: keep in mind that &#039;it works on my machine&#039; does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice.&lt;br /&gt;
&lt;br /&gt;
In order to make sure you are actually trying to observe the same thing as the trigger is looking for, check the trigger definition and the current data of the associated item(s). Some triggers fire if one of multiple conditions is met (such as a trigger monitoring ping response time that fires if the value exceeds a certain threshold, or if no data was observed for a certain period of time).&lt;br /&gt;
&lt;br /&gt;
Make sure to report your findings in the incident&#039;s thread. It&#039;s advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion why the trigger is firing, and your own observations/pings)&lt;br /&gt;
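Continuing the ping example, a trigger with multiple firing conditions can be sketched like this (the threshold value and the helper function are illustrative; the real values live in the Zabbix trigger definition):&lt;br /&gt;

```shell
# Illustrative threshold; the real value lives in the Zabbix trigger definition.
threshold_ms=100

evaluate_trigger() {
  # $1 is the latest ping time in ms, or "nodata" if the item stopped reporting.
  # The trigger fires on either condition: value too high, or no data at all.
  if [ "$1" = nodata ] || [ "$1" -gt "$threshold_ms" ]; then
    echo FIRING
  else
    echo OK
  fi
}

evaluate_trigger 250      # prints FIRING
evaluate_trigger 40       # prints OK
evaluate_trigger nodata   # prints FIRING
```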
&lt;br /&gt;
==== Communicate to affected clients ====&lt;br /&gt;
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and are investigating it. Critical incidents usually mean the service is down, something the clients notice or are affected by, so we want to be transparent that something is going on. There are some additional notes to this though:&lt;br /&gt;
* If an incident has already resolved itself and the problem is no longer observable, we don&#039;t communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.&lt;br /&gt;
* Although a critical incident generally means that the client&#039;s service is down or degraded, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing there is no exhaustive list, but here are those that come to mind:&lt;br /&gt;
** SSH service is down: We don&#039;t have any clients that SSH into their services, so it&#039;s generally not a client-facing problem. SSH is mostly used for SRE maintenance and for publishing new builds. SRE maintenance is an internal concern, so no need to communicate to the client. Publishing is done to Kaboom and the two SM VMs, so an SSH outage there prevents new builds from being published.&lt;br /&gt;
** No backup for x days: Clients don&#039;t notice if a backup is running late, so there is no need to communicate with clients; just make sure the backup gets completed.&lt;br /&gt;
** SSL certificate is expiring in &amp;lt; 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly the certificate never actually expires and there is no disruption to the client&#039;s service, so there is no need to communicate about it.&lt;br /&gt;
* Determining which clients are affected can be done by looking at the host&#039;s DNS name in the trigger, and/or looking up the VM in Proxmox and checking its tags for client names. If the issue is causing multiple other critical triggers to fire, you also have to check which clients are affected by those incidents.&lt;br /&gt;
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.&lt;br /&gt;
&lt;br /&gt;
As always, report the decisions taken and actions performed in the incident thread. (e.g.: I&#039;ve sent a message in Slack to let Kaboom know that we are aware of problem x and that we are investigating it.)&lt;br /&gt;
&lt;br /&gt;
==== Communicate plan/next steps + Communicate findings/results of executed plan ====&lt;br /&gt;
This is the main part of handling an incident. There are several actions you can take in these steps, but at their core they consist of sharing your next steps, performing them, and reporting the results. The reason all of this needs to be reported is to ensure that all known information about a problem is logged, making it easier to onboard someone else into the issue, for later reference if a similar issue is encountered, and even for use during the incident itself in case an older configuration needs to be referenced after you changed it.&lt;br /&gt;
The objective of these steps is determining what is actually wrong and how to resolve it. Depending on the observations made earlier on whether the incident is still ongoing and (still) observable, your investigation can go in different directions. (e.g. finding the underlying cause for a trigger, or determining why the trigger is firing while it likely shouldn&#039;t, and then how to resolve that underlying cause or how to update the trigger to work better)&lt;br /&gt;
&lt;br /&gt;
There are three main types of steps defined, but you are not limited to these:&lt;br /&gt;
# Hypothesis: If you have an idea what could be causing it, state your hypothesis; your next step would then be to prove it. For example, for an incident &#039;SSH service is down on X&#039; your hypothesis could be that this is due to &#039;MaxStartups&#039; throttling, which can be proven by grep&#039;ing journalctl for it and comparing the start and end times of the throttling with the timestamps of the item reporting the status of the SSH service.&lt;br /&gt;
# Information gathering: Sometimes it helps to collect some facts about the situation. What information is useful depends on the trigger, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to an underlying problem in various levels of explicitness), or the ping response from several hosts on the route to a host or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what&#039;s wrong.&lt;br /&gt;
# Investigative: The most rigorous process. The full process is originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize: when you don&#039;t know why something is failing and/or don&#039;t have any decent hypotheses to follow up, you can follow this process to systematically find the problem.&lt;br /&gt;
&lt;br /&gt;
Regarding the resolution to an incident: The resolution to any incident is usually one of two things:&lt;br /&gt;
# Fix the underlying problem.&lt;br /&gt;
# Fix the trigger itself.&lt;br /&gt;
Fixing the trigger is relatively straightforward, but make sure to document in the thread what you changed in which trigger.&lt;br /&gt;
Fixing the underlying problem can be more complex. Sometimes a trade-off needs to be made between resolving technical debt and simply patching the current system. We usually look for a resolution that ensures the problem won&#039;t re-occur soon, or at least makes it unlikely to re-occur, taking into account the time frame available for resolving the incident. An example: normal backups of VMs are failing because the Proxmox Backup Server is down/unreachable, and it is determined that this cannot be resolved at that moment. We can temporarily set up automatic backups to local storage to resolve the immediate problem and keep our SLOs, versus setting up a new Proxmox Backup Server at a different location. Since we don&#039;t have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and to set up a new Proxmox Backup Server later as a separate issue.&lt;br /&gt;
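As an illustration of the local-backup workaround, a temporary cron entry might look like the sketch below (`vzdump` is Proxmox&#039;s built-in backup tool; the VM ID, storage name, and schedule are assumptions for the example):&lt;br /&gt;

```shell
# /etc/cron.d/temp-local-backup -- temporary measure while the Proxmox
# Backup Server is unreachable; remove once PBS is restored.
# Nightly snapshot-mode backup of VM 100 to local storage on the node.
30 2 * * * root vzdump 100 --storage local --mode snapshot --compress zstd --quiet 1
```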
&lt;br /&gt;
===== Some known issues and their resolutions =====&lt;br /&gt;
* SSH service is down: The internet is a vile place. There&#039;s constant port scanning, and hacking attempts are made against any machine connected to the internet (mostly over IPv4). Because of this, SSH has a throttling mechanism built in to prevent a system from being DDoSed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. The hypothesis can be proven with `journalctl -u ssh | grep &#039;MaxStartups throttling&#039;` (you probably want to select a relevant time period with `--since &amp;quot;2 hours ago&amp;quot;` or similar, to avoid processing a month of logging). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution is to add our custom SSH configuration: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].&lt;br /&gt;
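The journalctl check described above can be sketched as a small shell helper. The sample log lines below are illustrative stand-ins for real journalctl output; on a live host you would pipe in `journalctl -u ssh --since &amp;quot;2 hours ago&amp;quot;` instead:&lt;br /&gt;

```shell
# Print the timestamps of MaxStartups throttling events from sshd log lines.
extract_throttling() {
  # Keep only throttling lines; the first three fields form the syslog timestamp.
  grep 'MaxStartups throttling' | awk '{print $1, $2, $3}'
}

# Illustrative sample input standing in for real journalctl output:
printf '%s\n' \
  'Jan 10 03:10:49 web01 sshd[123]: beginning MaxStartups throttling' \
  'Jan 10 03:12:01 web01 sshd[123]: exited MaxStartups throttling after 00:01:12' \
  'Jan 10 03:15:00 web01 sshd[456]: Accepted publickey for deploy' \
  | extract_throttling
```

Compare the printed start and end timestamps with the item data in Zabbix to confirm or reject the hypothesis.&lt;br /&gt;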
* No backup for 3 days: Our S3 backup is very slow, so there is not much to prove as an underlying issue here. What needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running, by confirming that that day&#039;s backups were done for the smaller buckets. The devteam email can be checked to see whether the backup process could not start that day because it was still running (it takes 24+ hours, and cron attempts to start it each day).&lt;br /&gt;
* git.* HTTPS is down: Mostly on Sundays, Gitlab gets automatically updated, which incurs some downtime while the service restarts. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it takes longer. If the service does not stay down, the issue can simply be resolved.&lt;br /&gt;
&lt;br /&gt;
==== Resolve incident + cleanup ====&lt;br /&gt;
When you&#039;ve executed and verified the resolution in the previous steps, you can proceed to resolve the issue in our Mattermost integration. Resolving an incident is done as follows:&lt;br /&gt;
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved while it is. If the trigger is still firing but you&#039;re sure you&#039;ve resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host&#039;s configuration on Zabbix and selecting &#039;Execute Now&#039;; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to verify that it was updated.&lt;br /&gt;
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.&lt;br /&gt;
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.&lt;br /&gt;
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).&lt;br /&gt;
## Ensuring the mentioned problem is no longer observable&lt;br /&gt;
## The trigger has resolved (You might need to force an update with `Execute Now`).&lt;br /&gt;
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.&lt;br /&gt;
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
When you are done, there should be no more critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that no-one has taken responsibility for, or that you have taken responsibility for but are not actively handling.&lt;br /&gt;
&lt;br /&gt;
===Additional context===&lt;br /&gt;
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical &#039;&#039;&#039;SLA - Critical&#039;&#039;&#039;].&lt;br /&gt;
* &amp;lt;s&amp;gt;When it is being tracked on GitLab a heavy check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;Responses on the thread and on GitLab are automatically synced (to some extend)&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;When you reply with &#039;&#039;&#039;I agree that this has been fully resolved&#039;&#039;&#039; eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Non-Critical incidents ==&lt;br /&gt;
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.&lt;br /&gt;
&lt;br /&gt;
=== Acknowledging ===&lt;br /&gt;
Fully acknowledging a non-critical incident requires the following tasks to have been completed:&lt;br /&gt;
* Acknowledging the incident on Zabbix, which means you take responsibility for completing the steps listed below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The next steps don&#039;t have to be done immediately, as they have dependencies, but they should be started and scheduled for completion by the next work day.&lt;br /&gt;
&lt;br /&gt;
Check if there&#039;s already an uncompleted milestone for this host with this issue in the metrics sheet.&lt;br /&gt;
If a milestone is already present:&lt;br /&gt;
* Report in the topic the Lynx project ID for resolving this issue.&lt;br /&gt;
* If the ID has already been reported in the topic, we don&#039;t want to report it again; instead, add the 🔁 emoji (:repeat:) under the Zabbix bot alert&lt;br /&gt;
&lt;br /&gt;
If a milestone is NOT already present:&lt;br /&gt;
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention&lt;br /&gt;
** Start date is the date of the incident&lt;br /&gt;
** DoD states what needs to be true for the non-critical incident to be considered resolved&lt;br /&gt;
* Add the non-critical incident to Lynx as a project&lt;br /&gt;
** Follow the naming convention below for the title &amp;amp; project ID&lt;br /&gt;
** Tasks need to be added&lt;br /&gt;
** The final task needs to have the SLO deadline set as &#039;constraint&#039;&lt;br /&gt;
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20&lt;br /&gt;
** The tasks are estimated for SP&lt;br /&gt;
* The Lynx project ID is reported in the non-critical incident&#039;s topic on Zulip, and logged in the metrics sheet&lt;br /&gt;
* A Kimai activity is created for the non-critical incident, following the naming convention&lt;br /&gt;
&lt;br /&gt;
==== Naming convention ====&lt;br /&gt;
* Kimai activity name needs to follow the pattern: &#039;&amp;lt;YYYY-MM&amp;gt; &amp;lt;problem_title&amp;gt;&#039;. For &amp;lt;problem_title&amp;gt;, incorporate the trigger title and hostname for clarity.&lt;br /&gt;
* Milestone name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project ID needs to follow the pattern: &#039;SRE&amp;lt;YYMM&amp;gt;&amp;lt;XXX&amp;gt;&#039;, where &amp;lt;XXX&amp;gt; is some three letter shorthand that relates to the problem/host&lt;br /&gt;
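The conventions above can be made concrete with a small sketch (the incident values and variable names are illustrative, not an existing tool):&lt;br /&gt;

```shell
# Build the Kimai activity name, milestone/Lynx project name, and Lynx
# project ID from an incident's details (example values, not a real incident).
month='2024-07'                          # YYYY-MM of the incident
title='SSH service is down on web01'     # trigger title plus hostname
shorthand='SSH'                          # XXX: three-letter shorthand

kimai_activity="$month $title"
milestone="Delft Solutions Hosting Incident response work $kimai_activity"
lynx_project="$milestone"                # milestone and Lynx project share a name
yymm="$(printf '%s' "$month" | sed 's/-//; s/^..//')"   # 2024-07 becomes 2407
lynx_id="SRE${yymm}${shorthand}"

echo "$kimai_activity"                   # 2024-07 SSH service is down on web01
echo "$lynx_id"                          # SRE2407SSH
```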
&lt;br /&gt;
== Informational incidents ==&lt;br /&gt;
* Informational incidents are acknowledged within 72 hours&lt;br /&gt;
&lt;br /&gt;
Checklist&lt;br /&gt;
# Acknowledge on Zabbix&lt;br /&gt;
# Sanity check the event, post result in thread&lt;br /&gt;
# If action needed, perform action&lt;br /&gt;
&lt;br /&gt;
== If an incident is reported by other means than the Zabbix-Zulip integration ==&lt;br /&gt;
Besides the automated Zabbix-Zulip integration, incidents can also be reported through emails from cron jobs, direct emails from customers, or topics in SRE General (such as alerts about Zulip updates or issues raised by colleagues), etc.&lt;br /&gt;
# Acknowledge receipt.&lt;br /&gt;
# Classify the incident as critical, non-critical, or informational.&lt;br /&gt;
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.&lt;br /&gt;
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)&lt;br /&gt;
&lt;br /&gt;
== Handover ==&lt;br /&gt;
When handing over the responsibility of &#039;&#039;&#039;first responder&#039;&#039;&#039; (FR), the following needs to happen:&lt;br /&gt;
* The handover can be initiated by either the upcoming FR or the acting FR&lt;br /&gt;
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group and enables Zabbix calling for the upcoming FR if they have that set by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;br /&gt;
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails or opened topics in SRE General, etc...), updated with the latest status, and properly documented.&lt;br /&gt;
* The upcoming FR makes sure they are aware of the state of the SLA and knows what questions they want to ask the acting FR.&lt;br /&gt;
* The upcoming FR makes sure they are subscribed to the right channels.&lt;br /&gt;
&lt;br /&gt;
The following steps can be done async or in person:&lt;br /&gt;
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).&lt;br /&gt;
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.&lt;br /&gt;
* If there are any particularities the upcoming FR needs to be aware of, those are shared.&lt;br /&gt;
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role&lt;br /&gt;
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].&lt;br /&gt;
* The upcoming FR announces/informs that they are now the acting FR over Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel]&lt;br /&gt;
* The now acting FR removes the previous FR from the IPA sla-first-responder user group and disables Zabbix calling for the previous FR if they had that enabled, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=605</id>
		<title>Incident Handling</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=605"/>
		<updated>2025-05-02T14:09:09Z</updated>

		<summary type="html">&lt;p&gt;Alois: Add general rules&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Checklist =&lt;br /&gt;
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You&#039;re encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material.&lt;br /&gt;
 &lt;br /&gt;
=== General Rules ===&lt;br /&gt;
&lt;br /&gt;
# When an incident is in progress and person A is handling it, all incidents in area X are handled by person A rather than the FR, unless person A&#039;s working day ends. Person A should communicate clearly to the FR when their day is over.&lt;br /&gt;
# FR always has the last word on what solution to apply for resolving an incident.&lt;br /&gt;
&lt;br /&gt;
=== Critical Incidents ===&lt;br /&gt;
Critical incidents must be resolved within 16 hours. &lt;br /&gt;
&lt;br /&gt;
# Acknowledge trigger in Zabbix.&lt;br /&gt;
# Check if incident is still ongoing.&lt;br /&gt;
# If ongoing and clients are potentially affected, notify the affected clients via Slack.&lt;br /&gt;
# Document all actions taken in Zulip topic.&lt;br /&gt;
# Create plan of action.&lt;br /&gt;
# Execute plan and document results in Zabbix thread. &lt;br /&gt;
# If unresolved, create new plan.&lt;br /&gt;
# When resolved:&lt;br /&gt;
## Verify trigger is no longer firing.&lt;br /&gt;
## Mark Zulip topic as resolved if no other incidents for host.&lt;br /&gt;
## Check for related triggers and resolve them.&lt;br /&gt;
&lt;br /&gt;
Common issues that have occurred previously, and &#039;&#039;could&#039;&#039; occur again:&lt;br /&gt;
* SSH down: Check MaxStartups throttling, apply custom SSH config&lt;br /&gt;
* No backup: Verify backup process is running, check devteam email&lt;br /&gt;
* HTTPS down on Sunday: this can be due to Gitlab updates&lt;br /&gt;
&lt;br /&gt;
=== Non-Critical Incidents ===&lt;br /&gt;
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.&lt;br /&gt;
&lt;br /&gt;
# Acknowledge in Zabbix thread&lt;br /&gt;
# Check metrics sheet for existing milestone&lt;br /&gt;
## If a milestone exists:&lt;br /&gt;
### Add Lynx project ID to Zulip topic&lt;br /&gt;
### Add 🔁 emoji if ID already reported&lt;br /&gt;
## If no milestone exists:&lt;br /&gt;
### Add to metrics sheet&lt;br /&gt;
### Create Lynx project (priority 99, then 20 after estimation)&lt;br /&gt;
### Create Kimai activity&lt;br /&gt;
### Document IDs in Zulip topic&lt;br /&gt;
&lt;br /&gt;
=== Informational Incidents ===&lt;br /&gt;
Informational incidents must be acknowledged within 72 hours.&lt;br /&gt;
&lt;br /&gt;
# Acknowledge in Zabbix&lt;br /&gt;
# Verify issue&lt;br /&gt;
# Take action if needed&lt;br /&gt;
&lt;br /&gt;
=== External Reports ===&lt;br /&gt;
&lt;br /&gt;
# Acknowledge receipt&lt;br /&gt;
# Classify report as critical, non-critical or informational. &lt;br /&gt;
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details. &lt;br /&gt;
# Proceed with checklist above for the type of incident.&lt;br /&gt;
&lt;br /&gt;
= Full procedure =&lt;br /&gt;
&lt;br /&gt;
== Zulip migration ==&lt;br /&gt;
Due to a migration to Zulip, the integration as it was available on Mattermost is not yet available on Zulip. This leads to the following process changes:&lt;br /&gt;
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix&lt;br /&gt;
* Triggers are grouped in a topic on Zulip per host&lt;br /&gt;
* When an incident has been fully resolved, mark the topic as resolved, once any other incidents reported for the host are also resolved&lt;br /&gt;
* There&#039;s no `?ongoing`; instead, for now we can track open incidents by checking for unresolved topics&lt;br /&gt;
* The posting of incidents is less smart (only posting when not posted yet), so to prevent an incident from going unreported due to network issues or the like, a message is posted at an interval (8 hours for non-critical and lower, 1 hour for critical and above) while the incident has not been acknowledged.&lt;br /&gt;
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.&lt;br /&gt;
* There is no automatic GitLab issue creation or syncing anymore.&lt;br /&gt;
&lt;br /&gt;
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.&lt;br /&gt;
&lt;br /&gt;
== Critical incidents ==&lt;br /&gt;
&#039;&#039;&#039;Critical incidents are resolved within 16 hours.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are required to do all the work yourself. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.&lt;br /&gt;
Involving multiple people can quickly become necessary if multiple critical incidents with different causes occur simultaneously. In that case, the First Responder usually takes on more of an information-management role and steers those who are brought in towards resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quickly be determined to be a single issue, the crashed server, so you wouldn&#039;t need to call in people to manage each incident. But a client&#039;s service being down in one cluster while a different VM in a different cluster no longer boots is likely due to two different issues, so you&#039;d want to call in help to resolve the incidents in time.)&lt;br /&gt;
&lt;br /&gt;
=== Process ===&lt;br /&gt;
The general process is made up of the following steps. Each step has additional information on how to handle/execute them in the sections below.&lt;br /&gt;
# Take responsibility for seeing the incident resolved&lt;br /&gt;
# Determine if incident is still ongoing&lt;br /&gt;
# If ongoing: Communicate to affected clients that the issue is being investigated&lt;br /&gt;
# Communicate plan/next steps (even if that is gathering information)&lt;br /&gt;
# Communicate findings/results of executed plan, go back to previous step if not resolved&lt;br /&gt;
# Resolve incident + cleanup&lt;br /&gt;
&lt;br /&gt;
While working on an incident it is expected that all communication is done in the incident&#039;s thread. This means all information on a problem can be found in a clear and predictable place. Sometimes an incident is resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident&#039;s thread with the comment that the resolution is done there.&lt;br /&gt;
&lt;br /&gt;
==== Acknowledge the incident on Zabbix ====&lt;br /&gt;
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is however entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or can be because of a shared cause of failure. For example, a server crashing will cause that server&#039;s VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what&#039;s ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to work on first.&lt;br /&gt;
&lt;br /&gt;
* Acknowledging an incident on Zabbix stops Zabbix from calling the First Responder to notify them of the ongoing incident, and stops Zabbix from posting reminders on Zulip.&lt;br /&gt;
&lt;br /&gt;
==== Determine if incident is still ongoing ====&lt;br /&gt;
The next step is to check if the reported problem is still ongoing. Depending on the observations made here, the process to follow and the steps needed to resolve the incident can change. There are three options:&lt;br /&gt;
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.&lt;br /&gt;
# The trigger resolved itself and the problem can still be observed.&lt;br /&gt;
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing one to fire. A simple example would be Zabbix reporting that the DNS for a site can&#039;t be resolved, while in reality there&#039;s a bug in the script we wrote to check DNS resolution and the DNS resolves fine. Final note: keep in mind that &#039;it works on my machine&#039; does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice.&lt;br /&gt;
&lt;br /&gt;
In order to make sure you are actually trying to observe the same thing the trigger is looking for, check the trigger definition and the current data of the associated item(s). Some triggers fire if any one of multiple conditions is met (such as a trigger monitoring the ping response time that fires if the value exceeds a certain threshold, or if no data was observed for a certain period of time).&lt;br /&gt;
&lt;br /&gt;
Make sure to report your findings in the incident&#039;s thread. It&#039;s advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion why the trigger is firing, and your own observations/pings)&lt;br /&gt;
&lt;br /&gt;
==== Communicate to affected clients ====&lt;br /&gt;
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and are investigating it. Critical incidents usually mean the service is down, something clients can notice or are affected by, so we want to be transparent that something is going on. There are some additional notes to this though:&lt;br /&gt;
* If an incident has already resolved itself and the problem is no longer observable, we don&#039;t communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.&lt;br /&gt;
* Although a critical incident generally means that the client&#039;s service is down or degraded, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don&#039;t have an exhaustive list, but here are the ones I can think of:&lt;br /&gt;
** SSH Service is down: We don&#039;t have any clients that SSH into their services, so it&#039;s generally not a client-facing problem. SSH is mostly used for SRE maintenance and for publishing new builds. The SRE maintenance is an internal problem, so there is no need to communicate to the client. Publishing is done to Kaboom and the two SM VMs, so SSH being down there prevents new builds from being published.&lt;br /&gt;
** No backup for x days: Clients don&#039;t notice it if a backup is running late, so no need to communicate with clients. Just need to make sure the backup gets completed&lt;br /&gt;
** SSL certificate is expiring in &amp;lt; 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly the certificate never actually expires and there has been no disruption to the client&#039;s service, so there is no need to communicate about it.&lt;br /&gt;
* Determining which clients are affected can be done by looking at the host&#039;s DNS in the trigger, and/or looking up the VM in Proxmox and checking its tags for client names. If this issue is causing multiple other critical triggers to fire, you will also have to check which clients are affected by those incidents.&lt;br /&gt;
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.&lt;br /&gt;
&lt;br /&gt;
As always, report the decisions taken and the actions performed in the incident thread. (e.g.: I&#039;ve sent a message in Slack to let Kaboom know that we are aware of problem x and that we are investigating it.)&lt;br /&gt;
&lt;br /&gt;
==== Communicate plan/next steps + Communicate findings/results of executed plan ====&lt;br /&gt;
This is the main part of handling an incident. There are several actions you can take in these steps, but at their core they consist of sharing your next steps, performing them, and reporting the results. The reason all of this needs to be reported is to ensure that all known information about a problem is logged, which makes it easier to onboard someone else into the issue, helps later reference if a similar issue is encountered, and can even help during the incident itself in case an older configuration needs to be referenced after you changed it.&lt;br /&gt;
The objective of these steps is to determine what is actually wrong and how to resolve it. Depending on your earlier observations of whether the incident is still ongoing and (still) observable, your investigation can go in different directions (e.g. finding the underlying cause of a trigger, or determining why the trigger is firing when it likely shouldn&#039;t, and then how to resolve that underlying cause or how to update the trigger to work better).&lt;br /&gt;
&lt;br /&gt;
There are three main types of steps defined, but you are not limited to these:&lt;br /&gt;
# Hypothesis: If you have an idea of what could be causing the problem, state your hypothesis; your next step is then to prove it. For example, for an incident &#039;SSH service is down on X&#039; your hypothesis could be that this is due to &#039;MaxStartups&#039; throttling, which can be proven by grepping journalctl for it and comparing the start and end times of the throttling with the timestamps of the item reporting the status of the SSH service.&lt;br /&gt;
# Information gathering: Sometimes it just helps to collect some facts about the situation. What information is useful depends on the triggers, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to an underlying problem at various levels of explicitness), or the ping responses from several hosts on the route to a host or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what&#039;s wrong.&lt;br /&gt;
# Investigative: The most rigorous process. The full process is originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize: when you don&#039;t know why something is failing and/or don&#039;t have any decent hypotheses to follow up, you can follow this process to systematically find the problem.&lt;br /&gt;
&lt;br /&gt;
The resolution to any incident is usually one of two things:&lt;br /&gt;
# Fix the underlying problem.&lt;br /&gt;
# Fix the trigger itself.&lt;br /&gt;
Fixing the trigger is relatively straightforward, but make sure to document in the thread what you changed in which trigger.&lt;br /&gt;
Fixing the underlying problem can be more complex. Sometimes a trade-off needs to be made between resolving technical debt and simply patching the current system. We usually look for a resolution that ensures the problem won&#039;t re-occur soon, or at least makes it unlikely to re-occur, taking into account the time frame available for resolving the incident. An example: normal backups of VMs are failing because the Proxmox Backup Server is down/unreachable, and it is determined that this cannot be resolved at that moment. We can temporarily set up automatic backups to local storage to resolve the immediate problem and keep our SLOs, versus setting up a new Proxmox Backup Server at a different location. Since we don&#039;t have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and to set up a new Proxmox Backup Server later as a separate issue.&lt;br /&gt;
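As an illustration of the local-backup workaround, a temporary cron entry might look like the sketch below (`vzdump` is Proxmox&#039;s built-in backup tool; the VM ID, storage name, and schedule are assumptions for the example):&lt;br /&gt;

```shell
# /etc/cron.d/temp-local-backup -- temporary measure while the Proxmox
# Backup Server is unreachable; remove once PBS is restored.
# Nightly snapshot-mode backup of VM 100 to local storage on the node.
30 2 * * * root vzdump 100 --storage local --mode snapshot --compress zstd --quiet 1
```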
&lt;br /&gt;
===== Some known issues and their resolutions =====&lt;br /&gt;
* SSH service is down: The internet is a vile place. There&#039;s constant port scanning, and hacking attempts are made against any machine connected to the internet (mostly over IPv4). Because of this, SSH has a throttling mechanism built in to prevent a system from being DDoSed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. The hypothesis can be proven with `journalctl -u ssh | grep &#039;MaxStartups throttling&#039;` (you probably want to select a relevant time period with `--since &amp;quot;2 hours ago&amp;quot;` or similar, to avoid processing a month of logging). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution is to add our custom SSH configuration: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].&lt;br /&gt;
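The journalctl check described above can be sketched as a small shell helper. The sample log lines below are illustrative stand-ins for real journalctl output; on a live host you would pipe in `journalctl -u ssh --since &amp;quot;2 hours ago&amp;quot;` instead:&lt;br /&gt;

```shell
# Print the timestamps of MaxStartups throttling events from sshd log lines.
extract_throttling() {
  # Keep only throttling lines; the first three fields form the syslog timestamp.
  grep 'MaxStartups throttling' | awk '{print $1, $2, $3}'
}

# Illustrative sample input standing in for real journalctl output:
printf '%s\n' \
  'Jan 10 03:10:49 web01 sshd[123]: beginning MaxStartups throttling' \
  'Jan 10 03:12:01 web01 sshd[123]: exited MaxStartups throttling after 00:01:12' \
  'Jan 10 03:15:00 web01 sshd[456]: Accepted publickey for deploy' \
  | extract_throttling
```

Compare the printed start and end timestamps with the item data in Zabbix to confirm or reject the hypothesis.&lt;br /&gt;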
* No backup for 3 days: Our S3 backup is very slow, so there is not much to prove as an underlying issue here. What needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running, by confirming that that day&#039;s backups were done for the smaller buckets. The devteam email can be checked to see whether the backup process could not start that day because it was still running (it takes 24+ hours, and cron attempts to start it each day).&lt;br /&gt;
* git.* HTTPS is down: Mostly on Sundays, Gitlab gets automatically updated, which incurs some downtime while the service restarts. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it takes longer. If the service does not stay down, the issue can simply be resolved.&lt;br /&gt;
&lt;br /&gt;
==== Resolve incident + cleanup ====&lt;br /&gt;
When you&#039;ve executed and verified the resolution in the previous steps, we can proceed to resolve the issue in our Mattermost integration. Resolving an incident is done as follows:&lt;br /&gt;
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved while it is. If the trigger is still firing but you&#039;re sure that you&#039;ve resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host&#039;s configuration on Zabbix and selecting &#039;Execute Now&#039;; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to see if it was updated.&lt;br /&gt;
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.&lt;br /&gt;
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.&lt;br /&gt;
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).&lt;br /&gt;
## Ensuring the mentioned problem is no longer observable&lt;br /&gt;
## Verifying that the trigger has resolved (you might need to force an update with `Execute Now`).&lt;br /&gt;
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.&lt;br /&gt;
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
When you are done, there should be no more critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that no-one has taken responsibility for, or that you have taken responsibility for but are not actively handling.&lt;br /&gt;
&lt;br /&gt;
===Additional context===&lt;br /&gt;
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical &#039;&#039;&#039;SLA - Critical&#039;&#039;&#039;].&lt;br /&gt;
* &amp;lt;s&amp;gt;When it is being tracked on GitLab a heavy check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;Responses on the thread and on GitLab are automatically synced (to some extent)&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;When you reply with &#039;&#039;&#039;I agree that this has been fully resolved&#039;&#039;&#039; eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Non-Critical incidents ==&lt;br /&gt;
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.&lt;br /&gt;
&lt;br /&gt;
=== Acknowledging ===&lt;br /&gt;
Fully acknowledging a non-critical incident requires the following tasks to have been completed:&lt;br /&gt;
* Acknowledging the incident on Zabbix, which means you take responsibility for completing the steps listed below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The next steps don&#039;t have to be done immediately, as they have dependencies, but they must be started and scheduled for completion the next work day.&lt;br /&gt;
&lt;br /&gt;
Check if there&#039;s already an uncompleted milestone for this host with this issue in the metrics sheet.&lt;br /&gt;
If a milestone is already present:&lt;br /&gt;
* Report in the topic the Lynx project ID for resolving this issue.&lt;br /&gt;
* If the ID has already been reported in the topic, we don&#039;t want to report it again and again; instead, add the 🔁 emoji (:repeat:) under the Zabbix bot alert&lt;br /&gt;
&lt;br /&gt;
If a milestone is NOT already present:&lt;br /&gt;
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention&lt;br /&gt;
** Start date is the date of the incident&lt;br /&gt;
** DoD states what needs to be true for the non-critical incident to be considered resolved&lt;br /&gt;
* Add the non-critical incident to Lynx as a project&lt;br /&gt;
** Follow the naming convention below for the title &amp;amp; project ID&lt;br /&gt;
** Tasks need to be added&lt;br /&gt;
** The final task needs to have the SLO deadline set as &#039;constraint&#039;&lt;br /&gt;
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20&lt;br /&gt;
** The tasks are estimated in SP&lt;br /&gt;
* The Lynx project ID is reported in the non-critical incident&#039;s topic on Zulip, and logged in the metrics sheet&lt;br /&gt;
* A Kimai activity is created in Kimai for the non-critical incident, following the naming convention&lt;br /&gt;
&lt;br /&gt;
==== Naming convention ====&lt;br /&gt;
* Kimai activity name needs to follow the pattern: &#039;&amp;lt;YYYY-MM&amp;gt; &amp;lt;problem_title&amp;gt;&#039;. For &amp;lt;problem_title&amp;gt;, incorporate the trigger title and hostname for clarity.&lt;br /&gt;
* Milestone name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project ID needs to follow the pattern: &#039;SRE&amp;lt;YYMM&amp;gt;&amp;lt;XXX&amp;gt;&#039;, where &amp;lt;XXX&amp;gt; is some three letter shorthand that relates to the problem/host&lt;br /&gt;
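As a non-authoritative illustration of the conventions above (every value here is invented):&lt;br /&gt;

```shell
# All values below are made up for illustration.
month="2024-05"
problem_title="SSH service is down on web01"   # trigger title + hostname
kimai_activity="$month $problem_title"
milestone="Delft Solutions Hosting Incident response work $kimai_activity"
lynx_project="$milestone"    # Lynx project name follows the same pattern
lynx_id="SRE2405SSH"         # SRE + YYMM + three-letter shorthand
printf "%s\n" "$kimai_activity" "$milestone" "$lynx_id"
```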
&lt;br /&gt;
== Informational incidents ==&lt;br /&gt;
* Informational incidents are acknowledged within 72 hours&lt;br /&gt;
&lt;br /&gt;
Checklist&lt;br /&gt;
# Acknowledge on Zabbix&lt;br /&gt;
# Sanity check the event, post result in thread&lt;br /&gt;
# If action needed, perform action&lt;br /&gt;
&lt;br /&gt;
== If an incident is reported by other means than the Zabbix-Zulip integration ==&lt;br /&gt;
Besides the automated Zabbix-Zulip integration, incidents can also be reported through emails from cron jobs, direct emails from customers, or topics in SRE General (such as alerts about Zulip updates or issues raised by colleagues).&lt;br /&gt;
# Acknowledge receipt.&lt;br /&gt;
# Classify the incident as critical, non-critical, or informational.&lt;br /&gt;
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.&lt;br /&gt;
# Proceed to treat the incident according to the criticality you just assigned. (For a critical incident, this means you now start the critical incident handling process.)&lt;br /&gt;
&lt;br /&gt;
== Handover ==&lt;br /&gt;
When handing over the responsibility of &#039;&#039;&#039;first responder&#039;&#039;&#039; (FR), the following needs to happen:&lt;br /&gt;
* The handover can be initiated by either the upcoming FR or the acting FR&lt;br /&gt;
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group and enables Zabbix calling for the upcoming FR if they have that set, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;br /&gt;
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails or opened topics in SRE General, etc...), updated with the latest status, and properly documented.&lt;br /&gt;
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.&lt;br /&gt;
* The upcoming FR makes sure they are subscribed to the right channels.&lt;br /&gt;
&lt;br /&gt;
The following steps can be done async or in person:&lt;br /&gt;
* The acting FR announces/informs that the upcoming FR has been added to the sla-first-responder group (in Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).&lt;br /&gt;
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.&lt;br /&gt;
* If there are any particularities the upcoming FR needs to be aware of, those are shared.&lt;br /&gt;
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role&lt;br /&gt;
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].&lt;br /&gt;
* The upcoming FR announces/informs that they are now the acting FR over Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel]&lt;br /&gt;
* The now acting FR removes the previous FR from the IPA sla-first-responder user group and disables Zabbix calling for the previous FR if they had that enabled by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=604</id>
		<title>Incident Handling</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=604"/>
		<updated>2025-05-02T13:45:01Z</updated>

		<summary type="html">&lt;p&gt;Alois: Fix typos&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Checklist =&lt;br /&gt;
This checklist is a shorter, imperative version of [[Incident Handling#Full_procedure|the longer procedure below]]. You&#039;re encouraged to read the [[Incident Handling#Full_procedure|full procedure]] at least once to improve your understanding of the core material.&lt;br /&gt;
 &lt;br /&gt;
=== Critical Incidents ===&lt;br /&gt;
Critical incidents must be resolved within 16 hours. &lt;br /&gt;
&lt;br /&gt;
# Acknowledge trigger in Zabbix.&lt;br /&gt;
# Check if incident is still ongoing.&lt;br /&gt;
# If ongoing and clients are potentially affected, notify the affected clients via Slack.&lt;br /&gt;
# Document all actions taken in Zulip topic.&lt;br /&gt;
# Create plan of action.&lt;br /&gt;
# Execute plan and document results in Zabbix thread. &lt;br /&gt;
# If unresolved, create new plan.&lt;br /&gt;
# When resolved:&lt;br /&gt;
## Verify trigger is no longer firing.&lt;br /&gt;
## Mark Zulip topic as resolved if no other incidents for host.&lt;br /&gt;
## Check for related triggers and resolve them.&lt;br /&gt;
&lt;br /&gt;
Common issues that have occurred previously, and &#039;&#039;could&#039;&#039; occur again:&lt;br /&gt;
* SSH down: Check MaxStartups throttling, apply custom SSH config&lt;br /&gt;
* No backup: Verify backup process is running, check devteam email&lt;br /&gt;
* HTTPS down on Sunday: this can be due to Gitlab updates&lt;br /&gt;
&lt;br /&gt;
=== Non-Critical Incidents ===&lt;br /&gt;
Non-critical incidents must be acknowledged within 9 hours and resolved within 1 week.&lt;br /&gt;
&lt;br /&gt;
# Acknowledge in Zabbix thread&lt;br /&gt;
# Check metrics sheet for existing milestone&lt;br /&gt;
## If a milestone exists:&lt;br /&gt;
### Add Lynx project ID to Zulip topic&lt;br /&gt;
### Add 🔁 emoji if ID already reported&lt;br /&gt;
## If no milestone exists:&lt;br /&gt;
### Add to metrics sheet&lt;br /&gt;
### Create Lynx project (priority 99, then 20 after estimation)&lt;br /&gt;
### Create Kimai activity&lt;br /&gt;
### Document IDs in Zulip topic&lt;br /&gt;
&lt;br /&gt;
=== Informational Incidents ===&lt;br /&gt;
Informational incidents must be acknowledged within 72 hours.&lt;br /&gt;
&lt;br /&gt;
# Acknowledge in Zabbix&lt;br /&gt;
# Verify issue&lt;br /&gt;
# Take action if needed&lt;br /&gt;
&lt;br /&gt;
=== External Reports ===&lt;br /&gt;
&lt;br /&gt;
# Acknowledge receipt&lt;br /&gt;
# Classify report as critical, non-critical or informational. &lt;br /&gt;
# Create a Zulip topic in SRE # Critical, SRE ## Non-critical or SRE ### Informational (depending on classification) and add sufficient details. &lt;br /&gt;
# Proceed with checklist above for the type of incident.&lt;br /&gt;
&lt;br /&gt;
= Full procedure =&lt;br /&gt;
&lt;br /&gt;
== Zulip migration ==&lt;br /&gt;
Due to a migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:&lt;br /&gt;
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix&lt;br /&gt;
* Triggers are grouped in a topic on Zulip per host&lt;br /&gt;
* When an incident has been fully resolved, mark the topic as resolved, once any other incidents reported for the host are also resolved&lt;br /&gt;
* There&#039;s no `?ongoing`; instead, for now we can track open incidents by checking for unresolved topics&lt;br /&gt;
* The posting of incidents is less smart (it only posts when not posted yet), so to prevent an incident from going unreported due to network issues or the like, a message is posted at an interval (8 hours for non-critical and lower, 1 hour for critical and above) while the incident has not been acknowledged.&lt;br /&gt;
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.&lt;br /&gt;
* There is no automatic gitlab issue creation or syncing anymore.&lt;br /&gt;
&lt;br /&gt;
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.&lt;br /&gt;
&lt;br /&gt;
== Critical incidents ==&lt;br /&gt;
&#039;&#039;&#039;Critical incidents are resolved within 16 hours.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are required to do all the work yourself. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.&lt;br /&gt;
Involving multiple people can quickly become necessary if multiple critical incidents with different causes occur simultaneously. In that case, the First Responder usually takes on more of an information-management role and steers those who are brought on towards resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quite quickly be determined to be a single issue, the crashed server, so you wouldn&#039;t need to call in people to manage each incident. But a client&#039;s service being down in one cluster while in a different cluster a different VM no longer boots is likely to be two different issues, so you&#039;d want to call in help to resolve the incidents in time.)&lt;br /&gt;
&lt;br /&gt;
=== Process ===&lt;br /&gt;
The general process is made up of the following steps. Each step has additional information on how to handle/execute them in the sections below.&lt;br /&gt;
# Take responsibility for seeing the incident resolved&lt;br /&gt;
# Determine if incident is still ongoing&lt;br /&gt;
# If ongoing: Communicate to affected clients that the issue is being investigated&lt;br /&gt;
# Communicate plan/next steps (even if that is gathering information)&lt;br /&gt;
# Communicate findings/results of executed plan, go back to previous step if not resolved&lt;br /&gt;
# Resolve incident + cleanup&lt;br /&gt;
&lt;br /&gt;
While working on an incident it is expected that all communication is done in the incident&#039;s thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident can be resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident&#039;s thread with the comment that the resolution is done there.&lt;br /&gt;
&lt;br /&gt;
==== Acknowledge the incident on Zabbix ====&lt;br /&gt;
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is however entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or it can be because of a shared cause of failure. For example, a server crashing will cause its VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what&#039;s ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to handle first.&lt;br /&gt;
&lt;br /&gt;
* Acknowledging an incident on Zabbix stops Zabbix from calling the First Responder to notify them of the ongoing incident, and stops Zabbix from posting reminders on Zulip.&lt;br /&gt;
&lt;br /&gt;
==== Determine if incident is still ongoing ====&lt;br /&gt;
The next step is to check if the reported problem is still ongoing. Depending on the observations made here, the process to follow and the steps needed to resolve the incident can change. There are three options:&lt;br /&gt;
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.&lt;br /&gt;
# The trigger resolved itself and the problem can still be observed.&lt;br /&gt;
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing it to fire. A simple example would be that Zabbix reports that the DNS for a site can&#039;t be resolved, but in reality there&#039;s a bug in the script we wrote that checks if the DNS resolves, and the DNS resolves fine. Final note: keep in mind that &#039;it works on my machine&#039; does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice.&lt;br /&gt;
&lt;br /&gt;
In order to make sure you are actually trying to observe the same thing as the trigger is looking for, make sure to check the trigger definition and the current data of the associated item(s). Some triggers might fire if one of multiple conditions is met (Such as a trigger that monitors the ping response time firing if the value exceeds a certain threshold, or if no data for a certain period of time was observed).&lt;br /&gt;
&lt;br /&gt;
Make sure to report your findings in the incident&#039;s thread. It&#039;s advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion why the trigger is firing, and your own observations/pings)&lt;br /&gt;
&lt;br /&gt;
==== Communicate to affected clients ====&lt;br /&gt;
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. This is because critical incidents usually mean the service is down, something the clients can notice/are affected by, so we want to be transparent that something is going on. There are some additional notes to this though:&lt;br /&gt;
* If an incident has already resolved itself and the problem is no longer observable, we don&#039;t communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.&lt;br /&gt;
* Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don&#039;t have an exhaustive list, but here are those I can think of:&lt;br /&gt;
** SSH Service is down: We don&#039;t have any clients that SSH into their services, so it&#039;s generally not a problem. SSH is mostly used for SRE maintenance and for publishing new builds. The SRE maintenance is an internal problem, so no need to communicate to the client. Publishing affects Kaboom (preventing new builds from being published) and the two SM VMs.&lt;br /&gt;
** No backup for x days: Clients don&#039;t notice if a backup is running late, so no need to communicate with clients. Just make sure the backup gets completed&lt;br /&gt;
** SSL certificate is expiring in &amp;lt; 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly, the certificate never actually expired and there has not been any disruption to the client&#039;s service, so no need to communicate about it.&lt;br /&gt;
* Determining which clients are being affected can be done by looking at the host&#039;s DNS in the trigger, and/or looking up the VM in Proxmox and checking the tags of the VMs for client names. In the case that this issue is causing multiple other critical triggers to fire, you would have to check for which clients are affected by those incidents.&lt;br /&gt;
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.&lt;br /&gt;
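For the expiring-certificate case above, a quick way to verify the actual expiry date is the openssl -enddate check sketched below; the hostname in the commented live check is a placeholder, and the offline part only demonstrates the output format on a throwaway certificate:&lt;br /&gt;

```shell
# Live check (hostname is a placeholder, substitute the real site):
#   echo | openssl s_client -servername example.org -connect example.org:443 2>/dev/null \
#     | openssl x509 -noout -enddate
# Offline demonstration of the same -enddate output on a throwaway self-signed cert:
openssl req -x509 -newkey rsa:2048 -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem \
  -days 1 -nodes -subj "/CN=demo" 2>/dev/null
openssl x509 -noout -enddate -in /tmp/demo-cert.pem
```

If the printed notAfter date is comfortably in the future, the trigger has likely already been resolved by a renewal.&lt;br /&gt;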
&lt;br /&gt;
As always, report the decisions taken and actions made in the incident thread. (e.g.: I&#039;ve sent a message in Slack to let Kaboom know that we are aware of problem x, and that we are investigating it)&lt;br /&gt;
&lt;br /&gt;
==== Communicate plan/next steps + Communicate findings/results of executed plan ====&lt;br /&gt;
This is the main part of handling an incident. There are several actions you can take in these steps, but at their core they consist of sharing your next steps, performing them, and reporting the results. The reason all this needs to be reported is to ensure that all known information about a problem is logged, making it easier to onboard someone else into the issue, for later reference if a similar issue is encountered, and even for use during the incident itself in case an older configuration needs to be referenced after you changed it.&lt;br /&gt;
The objective of these steps is determining what is actually wrong and how to resolve it. Depending on the observations made earlier on whether the incident is still ongoing and is (still) observable, your investigation can go in different directions. (e.g. find the underlying cause for a trigger, or determine why the trigger is firing while it likely shouldn&#039;t, and then how to resolve that underlying cause or how to update the trigger to work better)&lt;br /&gt;
&lt;br /&gt;
There are three main types of steps defined, but you are not limited to these:&lt;br /&gt;
# Hypothesis: If you have an idea what could be causing it, you would state your hypothesis and your next step would be to prove that hypothesis. For example, for an incident &#039;SSH service is down on X&#039; your hypothesis could be that this is due to &#039;MaxStartups&#039; throttling, which can be proven by &#039;grep&#039;ing journalctl for that, and compare the start and end times of throttling with the timestamps of the item reporting the status of the SSH service.&lt;br /&gt;
# Information gathering: Sometimes it just helps to collect some facts about the situation. What information is useful depends on the trigger, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to the underlying problem at various levels of explicitness), or the ping response from several hosts on the route to a host or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what&#039;s wrong.&lt;br /&gt;
# Investigative: The most rigorous process. The full process is originally described here: [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize: when you don&#039;t know why something is failing, and/or don&#039;t have any decent hypotheses to follow up, you can follow this process to systematically find the problem.&lt;br /&gt;
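The information-gathering step above can be sketched as a small shell routine; the host address is a placeholder and the fallbacks only exist so the sketch runs anywhere:&lt;br /&gt;

```shell
# Placeholder host; the commands are the usual first round of fact gathering.
HOST=203.0.113.10
gather_info() {
  echo "== logs around the incident =="
  journalctl --since "2 hours ago" --lines 50 2>/dev/null || echo "(journalctl unavailable here)"
  echo "== reachability =="
  ping -c 1 -W 1 "$1" 2>/dev/null || echo "(ping to $1 failed)"
  echo "== path to host =="
  traceroute "$1" 2>/dev/null || echo "(traceroute unavailable)"
}
gather_info "$HOST"
```

Paste the relevant parts of the output into the incident thread, so the facts that led to your hypothesis are logged.&lt;br /&gt;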
&lt;br /&gt;
Regarding the resolution to an incident: The resolution to any incident is usually one of two things:&lt;br /&gt;
# Fix the underlying problem.&lt;br /&gt;
# Fix the trigger itself.&lt;br /&gt;
Fixing the trigger is relatively straightforward, but do make sure to document in the thread what you changed in which trigger.&lt;br /&gt;
Fixing the underlying problem can be more complex. Sometimes a trade-off needs to be made between resolving technical debt and simply patching the current system to resolve the issue. We usually look for a resolution that ensures the problem won&#039;t recur soon, or that makes recurrence unexpected/unlikely. Taking into account the time frame available to resolve the incident, you can make some trade-offs. An example: normal backups of VMs are failing because the Proxmox Backup Server is down/unreachable, and it is determined that this cannot be resolved at that moment. We can set up automatic backups to local storage temporarily to resolve the immediate problem and keep our SLOs, versus setting up a new Proxmox Backup Server at a different location. Since we don&#039;t have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.&lt;br /&gt;
&lt;br /&gt;
===== Some known issues and their resolutions =====&lt;br /&gt;
* SSH service is down: The internet is a vile place. There&#039;s constant port scanning and there are constant hacking attempts against any machine connected to the internet (mostly IPv4). Because of this, SSH has a throttling functionality built in to prevent a system from being DDoS&#039;ed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with `journalctl -u ssh | grep &#039;MaxStartups throttling&#039;` (you probably want to select a relevant time period with `--since &amp;quot;2 hours ago&amp;quot;` or something similar to avoid having to process a month of logging). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].&lt;br /&gt;
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here. What needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running by confirming that that day&#039;s backups were done for the smaller buckets. The devteam email can be checked to see if the backup process could not start on a day because it was already running (it takes 24+ hours, and cron attempts to start it each day).&lt;br /&gt;
* git.* HTTPS is down: Mostly on Sundays, GitLab gets automatically updated, which incurs some downtime while the service restarts. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it&#039;s longer. If the service does not stay down, the issue can simply be resolved.&lt;br /&gt;
&lt;br /&gt;
==== Resolve incident + cleanup ====&lt;br /&gt;
When you&#039;ve executed and verified the resolution in the previous steps, we can proceed to resolve the issue in our Mattermost integration. Resolving an incident is done as follows:&lt;br /&gt;
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved while it is. If the trigger is still firing but you&#039;re sure that you&#039;ve resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host&#039;s configuration on Zabbix and selecting &#039;Execute Now&#039;; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to see if it was updated.&lt;br /&gt;
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.&lt;br /&gt;
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.&lt;br /&gt;
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).&lt;br /&gt;
## Ensuring the mentioned problem is no longer observable&lt;br /&gt;
## Verifying that the trigger has resolved (you might need to force an update with `Execute Now`).&lt;br /&gt;
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.&lt;br /&gt;
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
When you are done, there should be no more critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that no-one has taken responsibility for, or that you have taken responsibility for but are not actively handling.&lt;br /&gt;
&lt;br /&gt;
===Additional context===&lt;br /&gt;
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical &#039;&#039;&#039;SLA - Critical&#039;&#039;&#039;].&lt;br /&gt;
* &amp;lt;s&amp;gt;When it is being tracked on GitLab a heavy check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;Responses on the thread and on GitLab are automatically synced (to some extent)&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;When you reply with &#039;&#039;&#039;I agree that this has been fully resolved&#039;&#039;&#039; eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Non-Critical incidents ==&lt;br /&gt;
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.&lt;br /&gt;
&lt;br /&gt;
=== Acknowledging ===&lt;br /&gt;
Fully acknowledging a non-critical incident requires the following tasks to have been completed:&lt;br /&gt;
* Acknowledging the incident on Zabbix, which means you take responsibility for completing the steps listed below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The next steps don&#039;t have to be done immediately, as they have dependencies, but they must be started and scheduled for completion by the next work day.&lt;br /&gt;
&lt;br /&gt;
Check if there&#039;s already an uncompleted milestone for this host with this issue in the metrics sheet.&lt;br /&gt;
If a milestone is already present:&lt;br /&gt;
* Report in the topic the Lynx project ID for resolving this issue.&lt;br /&gt;
* If the ID has already been reported in the topic, do not report it again; instead, add the 🔁 emoji (:repeat:) under the zabbix bot alert&lt;br /&gt;
&lt;br /&gt;
If a milestone is NOT already present:&lt;br /&gt;
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention&lt;br /&gt;
** Start date is the date of the incident&lt;br /&gt;
** DoD states what needs to be true for the non-critical incident to be considered resolved&lt;br /&gt;
* Add the non-critical incident to Lynx as a project&lt;br /&gt;
** Follow the naming convention below for the title &amp;amp; project ID&lt;br /&gt;
** Tasks need to be added&lt;br /&gt;
** The final task needs to have the SLO deadline set as &#039;constraint&#039;&lt;br /&gt;
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20&lt;br /&gt;
** The tasks are estimated for SP&lt;br /&gt;
* The Lynx project ID is reported in the non-critical incident&#039;s topic on Zulip, and logged in the metrics sheet&lt;br /&gt;
* An activity is created in Kimai for the non-critical incident, following the naming convention&lt;br /&gt;
&lt;br /&gt;
==== Naming convention ====&lt;br /&gt;
* Kimai activity name needs to follow the pattern: &#039;&amp;lt;YYYY-MM&amp;gt; &amp;lt;problem_title&amp;gt;&#039;. For &amp;lt;problem_title&amp;gt;, incorporate the trigger title and hostname for clarity.&lt;br /&gt;
* Milestone name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project ID needs to follow the pattern: &#039;SRE&amp;lt;YYMM&amp;gt;&amp;lt;XXX&amp;gt;&#039;, where &amp;lt;XXX&amp;gt; is some three letter shorthand that relates to the problem/host&lt;br /&gt;
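As a sketch, the conventions above can be composed mechanically from the incident details. The date, trigger title, hostname, and three-letter shorthand below are hypothetical examples, not real values:&lt;br /&gt;

```shell
# Hypothetical example values; substitute the real date, trigger title, and host.
YEAR_MONTH="2025-03"                        # YYYY-MM of the incident
PROBLEM_TITLE="Disk space low on web01"     # trigger title + hostname, for clarity
KIMAI_ACTIVITY="$YEAR_MONTH $PROBLEM_TITLE"
MILESTONE="Delft Solutions Hosting Incident response work $KIMAI_ACTIVITY"
LYNX_PROJECT="Delft Solutions Hosting Incident response work $KIMAI_ACTIVITY"
LYNX_ID="SRE2503WEB"                        # SRE + YYMM + three-letter shorthand
echo "$KIMAI_ACTIVITY"
echo "$MILESTONE"
echo "$LYNX_ID"
```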
&lt;br /&gt;
== Informational incidents ==&lt;br /&gt;
* Informational incidents are acknowledged within 72 hours&lt;br /&gt;
&lt;br /&gt;
Checklist&lt;br /&gt;
# Acknowledge on Zabbix&lt;br /&gt;
# Sanity check the event, post result in thread&lt;br /&gt;
# If action needed, perform action&lt;br /&gt;
&lt;br /&gt;
== If an incident is reported by other means than the Zabbix-Zulip integration ==&lt;br /&gt;
Besides the automated Zabbix-Zulip integration, incidents can also be reported through emails from cron jobs, direct emails from customers, or topics in SRE General (such as alerts about Zulip updates or issues raised by colleagues).&lt;br /&gt;
# Acknowledge receipt.&lt;br /&gt;
# Classify the incident as critical, non-critical, or informational.&lt;br /&gt;
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.&lt;br /&gt;
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)&lt;br /&gt;
&lt;br /&gt;
== Handover ==&lt;br /&gt;
When handing over the responsibility of &#039;&#039;&#039;first responder&#039;&#039;&#039; (FR), the following needs to happen:&lt;br /&gt;
* The handover can be initiated by both the upcoming FR or the acting FR&lt;br /&gt;
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group and enables Zabbix calling for the upcoming FR if they have that set by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;br /&gt;
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails or opened topics in SRE General, etc...), updated with the latest status, and properly documented.&lt;br /&gt;
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.&lt;br /&gt;
* The upcoming FR makes sure they are subscribed to the right channels.&lt;br /&gt;
&lt;br /&gt;
The following steps can be done async or in person:&lt;br /&gt;
* The acting FR announces/informs that the upcoming FR has been added to the sla-first-responder group (in Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).&lt;br /&gt;
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.&lt;br /&gt;
* If there are any particularities the upcoming FR needs to be aware of, those are shared.&lt;br /&gt;
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role&lt;br /&gt;
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].&lt;br /&gt;
* The upcoming FR announces/informs that they are now the acting FR over Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel]&lt;br /&gt;
* The now acting FR removes the previous FR from the IPA sla-first-responder user group and disables Zabbix calling for the previous FR if they had that enabled by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=603</id>
		<title>Resize VM Disk</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=603"/>
		<updated>2025-03-19T14:59:28Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Step 3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Draft =&lt;br /&gt;
This article is a draft. It only tells you how to increase a disk, and only if the disk is partitioned in a certain way. It has also not yet been tested by anyone other than the author of this article (Wouter).&lt;br /&gt;
= Intro =&lt;br /&gt;
When a virtual machine (VM) needs more disk space, you can expand its storage by resizing the disk in the Proxmox GUI and then adjusting the partition inside the VM using Linux commands.&lt;br /&gt;
These steps only work if the root partition is at the end of the disk. (The automated install config/preseed ensures this)&lt;br /&gt;
== Step 1 ==&lt;br /&gt;
Before making any changes, you can confirm the current size of your disk.&lt;br /&gt;
Run `fdisk -l /dev/sda`; this lists the partition table details of /dev/sda, showing its total size and partitions.&lt;br /&gt;
&lt;br /&gt;
The output should look something like this:&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.29.14.png|thumb|center|`fdisk -l /dev/sda`]]&lt;br /&gt;
&lt;br /&gt;
Next, to increase the disk &#039;physical&#039; size:&lt;br /&gt;
* On Proxmox web gui: Select your VM then Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize &lt;br /&gt;
[[File:Screenshot 2025-03-19 at 14.54.13.png|thumb|center|Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize]]&lt;br /&gt;
* Enter the new disk size and click OK&lt;br /&gt;
&lt;br /&gt;
After resizing in Proxmox, check whether the size change is recognized by the system by running `fdisk -l /dev/sda` again.&lt;br /&gt;
The total disk size should reflect the increased value, but the partitions remain unchanged. This means the top layer is aware of the size increase;&lt;br /&gt;
now we need to inform the other layers.&lt;br /&gt;
&lt;br /&gt;
If the total disk size does not show an increase as expected, you may need to run `partprobe` to force the system to re-read the partition table without requiring a reboot.&lt;br /&gt;
&lt;br /&gt;
== Step 2 ==&lt;br /&gt;
Resize the Partition to Use the Extra Space:&lt;br /&gt;
&lt;br /&gt;
* Resize using &#039;parted&#039;:&lt;br /&gt;
This assumes you need to resize /dev/sda2 (the &#039;print&#039; command will help you determine this)&lt;br /&gt;
  * `parted /dev/sda`, opens the partition table editor for /dev/sda&lt;br /&gt;
  * `print`, double-checks the disk layout&lt;br /&gt;
  * `resizepart 2 100%`, resizes the chosen partition; here partition 2 is resized to use all available space&lt;br /&gt;
  * `quit`, exit parted&lt;br /&gt;
&lt;br /&gt;
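The interactive parted session above can also be expressed non-interactively. A minimal sketch, assuming the target is partition 2 on /dev/sda (confirm with `print` first); the command is only echoed here so nothing is modified:&lt;br /&gt;

```shell
# Assumed target disk and partition number; verify with `parted /dev/sda print` first.
DISK=/dev/sda
PART=2
# Non-interactive equivalent of the interactive parted session above.
CMD="parted --script $DISK resizepart $PART 100%"
# Echoed rather than executed in this sketch; run it manually once verified.
echo "$CMD"
```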
You can verify the updated partition table by running `lsblk`, which lists all storage devices, showing partitions and their sizes.&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.49.02.png|thumb|center|`lsblk`]]&lt;br /&gt;
&lt;br /&gt;
== Step 3 ==&lt;br /&gt;
To increase the file system&#039;s size:&lt;br /&gt;
* `resize2fs /dev/sda2` (assuming you need to grow /dev/sda2); this expands the filesystem on /dev/sda2 to use the full partition&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.53.12.png|thumb|center|`resize2fs /dev/sda2`]]&lt;br /&gt;
&lt;br /&gt;
Verify the available space by running `df -h`, which displays disk usage in a human-readable format.&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.55.56.png|thumb|center|`df -h`]]&lt;br /&gt;
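Steps 2 and 3 together amount to growing the partition and then the filesystem. A minimal sketch, assuming an ext4 filesystem on /dev/sda2; the commands are only echoed here so nothing is modified:&lt;br /&gt;

```shell
# Assumed target: an ext4 filesystem on /dev/sda2 (resize2fs handles ext2/3/4 only;
# an XFS filesystem would need xfs_growfs instead).
FS_DEV=/dev/sda2
GROW_CMD="resize2fs $FS_DEV"
CHECK_CMD="df -h"
# Echoed rather than executed in this sketch; run them manually once verified.
echo "$GROW_CMD"
echo "$CHECK_CMD"
```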
&lt;br /&gt;
As a final check, you can also look in Zabbix under Latest Data &amp;gt; Total Space for your VM.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=602</id>
		<title>Resize VM Disk</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=602"/>
		<updated>2025-03-19T14:59:12Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Step 3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Draft =&lt;br /&gt;
This article is a draft. It only tells you how to increase a disk, and only if the disk is partitioned in a certain way. It has also not yet been tested by anyone other than the author of this article (Wouter).&lt;br /&gt;
= Intro =&lt;br /&gt;
When a virtual machine (VM) needs more disk space, you can expand its storage by resizing the disk in the Proxmox GUI and then adjusting the partition inside the VM using Linux commands.&lt;br /&gt;
These steps only work if the root partition is at the end of the disk. (The automated install config/preseed ensures this)&lt;br /&gt;
== Step 1 ==&lt;br /&gt;
Before making any changes, you can confirm the current size of your disk.&lt;br /&gt;
Run `fdisk -l /dev/sda`; this lists the partition table details of /dev/sda, showing its total size and partitions.&lt;br /&gt;
&lt;br /&gt;
The output should look something like this:&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.29.14.png|thumb|center|`fdisk -l /dev/sda`]]&lt;br /&gt;
&lt;br /&gt;
Next, to increase the disk &#039;physical&#039; size:&lt;br /&gt;
* On Proxmox web gui: Select your VM then Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize &lt;br /&gt;
[[File:Screenshot 2025-03-19 at 14.54.13.png|thumb|center|Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize]]&lt;br /&gt;
* Enter the new disk size and click OK&lt;br /&gt;
&lt;br /&gt;
After resizing in Proxmox, check whether the size change is recognized by the system by running `fdisk -l /dev/sda` again.&lt;br /&gt;
The total disk size should reflect the increased value, but the partitions remain unchanged. This means the top layer is aware of the size increase;&lt;br /&gt;
now we need to inform the other layers.&lt;br /&gt;
&lt;br /&gt;
If the total disk size does not show an increase as expected, you may need to run `partprobe` to force the system to re-read the partition table without requiring a reboot.&lt;br /&gt;
&lt;br /&gt;
== Step 2 ==&lt;br /&gt;
Resize the Partition to Use the Extra Space:&lt;br /&gt;
&lt;br /&gt;
* Resize using &#039;parted&#039;:&lt;br /&gt;
This assumes you need to resize /dev/sda2 (the &#039;print&#039; command will help you determine this)&lt;br /&gt;
  * `parted /dev/sda`, opens the partition table editor for /dev/sda&lt;br /&gt;
  * `print`, double-checks the disk layout&lt;br /&gt;
  * `resizepart 2 100%`, resizes the chosen partition; here partition 2 is resized to use all available space&lt;br /&gt;
  * `quit`, exit parted&lt;br /&gt;
&lt;br /&gt;
You can verify the updated partition table by running `lsblk`, which lists all storage devices, showing partitions and their sizes.&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.49.02.png|thumb|center|`lsblk`]]&lt;br /&gt;
&lt;br /&gt;
== Step 3 ==&lt;br /&gt;
To increase the file system&#039;s size:&lt;br /&gt;
* `resize2fs /dev/sda2` (assuming you need to grow /dev/sda2); this expands the filesystem on /dev/sda2 to use the full partition&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.53.12.png|thumb|center|`resize2fs /dev/sda2`]]&lt;br /&gt;
&lt;br /&gt;
Verify the available space by running `df -h`, which displays disk usage in a human-readable format.&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.55.56.png|thumb|`df -h`]]&lt;br /&gt;
&lt;br /&gt;
As a final check, you can also look in Zabbix under Latest Data &amp;gt; Total Space for your VM.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_15.55.56.png&amp;diff=601</id>
		<title>File:Screenshot 2025-03-19 at 15.55.56.png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_15.55.56.png&amp;diff=601"/>
		<updated>2025-03-19T14:56:41Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;df -h output&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=600</id>
		<title>Resize VM Disk</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=600"/>
		<updated>2025-03-19T14:54:14Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Step 3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Draft =&lt;br /&gt;
This article is a draft. It only tells you how to increase a disk, and only if the disk is partitioned in a certain way. It has also not yet been tested by anyone other than the author of this article (Wouter).&lt;br /&gt;
= Intro =&lt;br /&gt;
When a virtual machine (VM) needs more disk space, you can expand its storage by resizing the disk in the Proxmox GUI and then adjusting the partition inside the VM using Linux commands.&lt;br /&gt;
These steps only work if the root partition is at the end of the disk. (The automated install config/preseed ensures this)&lt;br /&gt;
== Step 1 ==&lt;br /&gt;
Before making any changes, you can confirm the current size of your disk.&lt;br /&gt;
Run `fdisk -l /dev/sda`; this lists the partition table details of /dev/sda, showing its total size and partitions.&lt;br /&gt;
&lt;br /&gt;
The output should look something like this:&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.29.14.png|thumb|center|`fdisk -l /dev/sda`]]&lt;br /&gt;
&lt;br /&gt;
Next, to increase the disk &#039;physical&#039; size:&lt;br /&gt;
* On Proxmox web gui: Select your VM then Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize &lt;br /&gt;
[[File:Screenshot 2025-03-19 at 14.54.13.png|thumb|center|Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize]]&lt;br /&gt;
* Enter the new disk size and click OK&lt;br /&gt;
&lt;br /&gt;
After resizing in Proxmox, check whether the size change is recognized by the system by running `fdisk -l /dev/sda` again.&lt;br /&gt;
The total disk size should reflect the increased value, but the partitions remain unchanged. This means the top layer is aware of the size increase;&lt;br /&gt;
now we need to inform the other layers.&lt;br /&gt;
&lt;br /&gt;
If the total disk size does not show an increase as expected, you may need to run `partprobe` to force the system to re-read the partition table without requiring a reboot.&lt;br /&gt;
&lt;br /&gt;
== Step 2 ==&lt;br /&gt;
Resize the Partition to Use the Extra Space:&lt;br /&gt;
&lt;br /&gt;
* Resize using &#039;parted&#039;:&lt;br /&gt;
This assumes you need to resize /dev/sda2 (the &#039;print&#039; command will help you determine this)&lt;br /&gt;
  * `parted /dev/sda`, opens the partition table editor for /dev/sda&lt;br /&gt;
  * `print`, double-checks the disk layout&lt;br /&gt;
  * `resizepart 2 100%`, resizes the chosen partition; here partition 2 is resized to use all available space&lt;br /&gt;
  * `quit`, exit parted&lt;br /&gt;
&lt;br /&gt;
You can verify the updated partition table by running `lsblk`, which lists all storage devices, showing partitions and their sizes.&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.49.02.png|thumb|center|`lsblk`]]&lt;br /&gt;
&lt;br /&gt;
== Step 3 ==&lt;br /&gt;
To increase the file system&#039;s size:&lt;br /&gt;
* `resize2fs /dev/sda2` (assuming you need to grow /dev/sda2); this expands the filesystem on /dev/sda2 to use the full partition&lt;br /&gt;
The output should look like this:&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.53.12.png|thumb|center|`resize2fs /dev/sda2`]]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_15.53.12.png&amp;diff=599</id>
		<title>File:Screenshot 2025-03-19 at 15.53.12.png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_15.53.12.png&amp;diff=599"/>
		<updated>2025-03-19T14:53:56Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;resize2fs /dev/sda2 output&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=598</id>
		<title>Resize VM Disk</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=598"/>
		<updated>2025-03-19T14:51:29Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Step 2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Draft =&lt;br /&gt;
This article is a draft. It only tells you how to increase a disk, and only if the disk is partitioned in a certain way. It has also not yet been tested by anyone other than the author of this article (Wouter).&lt;br /&gt;
= Intro =&lt;br /&gt;
When a virtual machine (VM) needs more disk space, you can expand its storage by resizing the disk in the Proxmox GUI and then adjusting the partition inside the VM using Linux commands.&lt;br /&gt;
These steps only work if the root partition is at the end of the disk. (The automated install config/preseed ensures this)&lt;br /&gt;
== Step 1 ==&lt;br /&gt;
Before making any changes, you can confirm the current size of your disk.&lt;br /&gt;
Run `fdisk -l /dev/sda`; this lists the partition table details of /dev/sda, showing its total size and partitions.&lt;br /&gt;
&lt;br /&gt;
The output should look something like this:&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.29.14.png|thumb|center|`fdisk -l /dev/sda`]]&lt;br /&gt;
&lt;br /&gt;
Next, to increase the disk &#039;physical&#039; size:&lt;br /&gt;
* On Proxmox web gui: Select your VM then Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize &lt;br /&gt;
[[File:Screenshot 2025-03-19 at 14.54.13.png|thumb|center|Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize]]&lt;br /&gt;
* Enter the new disk size and click OK&lt;br /&gt;
&lt;br /&gt;
After resizing in Proxmox, check whether the size change is recognized by the system by running `fdisk -l /dev/sda` again.&lt;br /&gt;
The total disk size should reflect the increased value, but the partitions remain unchanged. This means the top layer is aware of the size increase;&lt;br /&gt;
now we need to inform the other layers.&lt;br /&gt;
&lt;br /&gt;
If the total disk size does not show an increase as expected, you may need to run `partprobe` to force the system to re-read the partition table without requiring a reboot.&lt;br /&gt;
&lt;br /&gt;
== Step 2 ==&lt;br /&gt;
Resize the Partition to Use the Extra Space:&lt;br /&gt;
&lt;br /&gt;
* Resize using &#039;parted&#039;:&lt;br /&gt;
This assumes you need to resize /dev/sda2 (the &#039;print&#039; command will help you determine this)&lt;br /&gt;
  * `parted /dev/sda`, opens the partition table editor for /dev/sda&lt;br /&gt;
  * `print`, double-checks the disk layout&lt;br /&gt;
  * `resizepart 2 100%`, resizes the chosen partition; here partition 2 is resized to use all available space&lt;br /&gt;
  * `quit`, exit parted&lt;br /&gt;
&lt;br /&gt;
You can verify the updated partition table by running `lsblk`, which lists all storage devices, showing partitions and their sizes.&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.49.02.png|thumb|center|`lsblk`]]&lt;br /&gt;
&lt;br /&gt;
== Step 3 ==&lt;br /&gt;
To increase the file system&#039;s size:&lt;br /&gt;
* `resize2fs /dev/sda2` (assuming you need to grow /dev/sda2)&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_15.49.02.png&amp;diff=597</id>
		<title>File:Screenshot 2025-03-19 at 15.49.02.png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_15.49.02.png&amp;diff=597"/>
		<updated>2025-03-19T14:49:44Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;lsblk: Lists all storage devices, showing partitions and their sizes.&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=596</id>
		<title>Resize VM Disk</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=596"/>
		<updated>2025-03-19T14:31:49Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Step 2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Draft =&lt;br /&gt;
This article is a draft. It only tells you how to increase a disk, and only if the disk is partitioned in a certain way. It has also not yet been tested by anyone other than the author of this article (Wouter).&lt;br /&gt;
= Intro =&lt;br /&gt;
When a virtual machine (VM) needs more disk space, you can expand its storage by resizing the disk in the Proxmox GUI and then adjusting the partition inside the VM using Linux commands.&lt;br /&gt;
These steps only work if the root partition is at the end of the disk. (The automated install config/preseed ensures this)&lt;br /&gt;
== Step 1 ==&lt;br /&gt;
Before making any changes, you can confirm the current size of your disk.&lt;br /&gt;
Run `fdisk -l /dev/sda`; this lists the partition table details of /dev/sda, showing its total size and partitions.&lt;br /&gt;
&lt;br /&gt;
The output should look something like this:&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.29.14.png|thumb|center|`fdisk -l /dev/sda`]]&lt;br /&gt;
&lt;br /&gt;
Next, to increase the disk &#039;physical&#039; size:&lt;br /&gt;
* On Proxmox web gui: Select your VM then Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize &lt;br /&gt;
[[File:Screenshot 2025-03-19 at 14.54.13.png|thumb|center|Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize]]&lt;br /&gt;
* Enter the new disk size and click OK&lt;br /&gt;
&lt;br /&gt;
After resizing in Proxmox, check whether the size change is recognized by the system by running `fdisk -l /dev/sda` again.&lt;br /&gt;
The total disk size should reflect the increased value, but the partitions remain unchanged. This means the top layer is aware of the size increase;&lt;br /&gt;
now we need to inform the other layers.&lt;br /&gt;
&lt;br /&gt;
If the total disk size does not show an increase as expected, you may need to run `partprobe` to force the system to re-read the partition table without requiring a reboot.&lt;br /&gt;
&lt;br /&gt;
== Step 2 ==&lt;br /&gt;
Resize the Partition to Use the Extra Space:&lt;br /&gt;
&lt;br /&gt;
* Resize using &#039;parted&#039;:&lt;br /&gt;
This assumes you need to resize /dev/sda2 (the &#039;print&#039; command will help you determine this)&lt;br /&gt;
  * `parted /dev/sda`, opens the partition table editor for /dev/sda&lt;br /&gt;
  * `print`, double-checks the disk layout&lt;br /&gt;
  * `resizepart 2 100%`, resizes the chosen partition; here partition 2 is resized to use all available space&lt;br /&gt;
  * `quit`, exit parted&lt;br /&gt;
&lt;br /&gt;
== Step 3 ==&lt;br /&gt;
To increase the file system&#039;s size:&lt;br /&gt;
* `resize2fs /dev/sda2` (assuming you need to grow /dev/sda2)&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=595</id>
		<title>Resize VM Disk</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=595"/>
		<updated>2025-03-19T14:31:39Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Draft =&lt;br /&gt;
This article is a draft. It only tells you how to increase a disk, and only if the disk is partitioned in a certain way. It has also not yet been tested by anyone other than the author of this article (Wouter).&lt;br /&gt;
= Intro =&lt;br /&gt;
When a virtual machine (VM) needs more disk space, you can expand its storage by resizing the disk in the Proxmox GUI and then adjusting the partition inside the VM using Linux commands.&lt;br /&gt;
These steps only work if the root partition is at the end of the disk. (The automated install config/preseed ensures this)&lt;br /&gt;
== Step 1 ==&lt;br /&gt;
Before making any changes, you can confirm the current size of your disk.&lt;br /&gt;
Run `fdisk -l /dev/sda`; this lists the partition table details of /dev/sda, showing its total size and partitions.&lt;br /&gt;
&lt;br /&gt;
The output should look something like this:&lt;br /&gt;
[[File:Screenshot 2025-03-19 at 15.29.14.png|thumb|center|`fdisk -l /dev/sda`]]&lt;br /&gt;
&lt;br /&gt;
Next, to increase the disk &#039;physical&#039; size:&lt;br /&gt;
* On Proxmox web gui: Select your VM then Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize &lt;br /&gt;
[[File:Screenshot 2025-03-19 at 14.54.13.png|thumb|center|Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize]]&lt;br /&gt;
* Enter the new disk size and click OK&lt;br /&gt;
&lt;br /&gt;
After resizing in Proxmox, check whether the size change is recognized by the system by running `fdisk -l /dev/sda` again.&lt;br /&gt;
The total disk size should reflect the increased value, but the partitions remain unchanged. This means the top layer is aware of the size increase;&lt;br /&gt;
now we need to inform the other layers.&lt;br /&gt;
&lt;br /&gt;
If the total disk size does not show an increase as expected, you may need to run `partprobe` to force the system to re-read the partition table without requiring a reboot.&lt;br /&gt;
&lt;br /&gt;
== Step 2 ==&lt;br /&gt;
To increase the disk partition:&lt;br /&gt;
* In the VM, first get the partition table for info (Assuming the disk is /dev/sda) with `fdisk -l /dev/sda | grep ^/dev`&lt;br /&gt;
The output should look something like this:&lt;br /&gt;
[[File:Fdisk-output.png|thumb]]&lt;br /&gt;
* Resize using &#039;parted&#039;:&lt;br /&gt;
  * parted /dev/sda&lt;br /&gt;
  * print&lt;br /&gt;
  * resizepart 2 100%&lt;br /&gt;
  * quit&lt;br /&gt;
This assumes you need to resize /dev/sda2 (the &#039;print&#039; output will help you determine this)&lt;br /&gt;
== Step 3 ==&lt;br /&gt;
To increase the file system&#039;s size:&lt;br /&gt;
* `resize2fs /dev/sda2` (assuming you need to grow /dev/sda2)&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_15.29.14.png&amp;diff=594</id>
		<title>File:Screenshot 2025-03-19 at 15.29.14.png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_15.29.14.png&amp;diff=594"/>
		<updated>2025-03-19T14:30:27Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;`fdisk -l /dev/sda` output&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=593</id>
		<title>Resize VM Disk</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Resize_VM_Disk&amp;diff=593"/>
		<updated>2025-03-19T14:12:41Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Draft =&lt;br /&gt;
This article is a draft. It only tells you how to increase a disk, and only if the disk is partitioned in a certain way. It has also not yet been tested by anyone other than the author of this article (Wouter).&lt;br /&gt;
= Intro =&lt;br /&gt;
When a virtual machine (VM) needs more disk space, you can expand its storage by resizing the disk in the Proxmox GUI and then adjusting the partition inside the VM using Linux commands.&lt;br /&gt;
These steps only work if the root partition is at the end of the disk. (The automated install config/preseed ensures this)&lt;br /&gt;
== Step 1 ==&lt;br /&gt;
Before making any changes, you can confirm the current size of your disk.&lt;br /&gt;
Run `fdisk -l /dev/sda`; this lists the partition table details of /dev/sda, showing its total size and partitions.&lt;br /&gt;
&lt;br /&gt;
Next, to increase the disk &#039;physical&#039; size:&lt;br /&gt;
* On Proxmox web gui: Select your VM then Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize &lt;br /&gt;
[[File:Screenshot 2025-03-19 at 14.54.13.png|thumb|center|Hardware &amp;gt; Hard Disk &amp;gt; Disk Action &amp;gt; Resize]]&lt;br /&gt;
* Enter the new disk size and click OK&lt;br /&gt;
&lt;br /&gt;
After resizing in Proxmox, check whether the size change is recognized by the system by running `fdisk -l /dev/sda` again.&lt;br /&gt;
The total disk size should reflect the increased value, but the partitions remain unchanged. This means the top layer is aware of the size increase;&lt;br /&gt;
now we need to inform the other layers.&lt;br /&gt;
&lt;br /&gt;
If the total disk size does not show an increase as expected, you may need to run `partprobe` to force the system to re-read the partition table without requiring a reboot.&lt;br /&gt;
&lt;br /&gt;
== Step 2 ==&lt;br /&gt;
To increase the disk partition:&lt;br /&gt;
* In the VM, first get the partition table info (assuming the disk is /dev/sda) with `fdisk -l /dev/sda | grep ^/dev`&lt;br /&gt;
The output should look something like this:&lt;br /&gt;
[[File:Fdisk-output.png|thumb]]&lt;br /&gt;
* Resize using &#039;parted&#039;:&lt;br /&gt;
  * parted /dev/sda&lt;br /&gt;
  * print&lt;br /&gt;
  * resizepart 2 100%&lt;br /&gt;
  * quit&lt;br /&gt;
This is assuming that you need to resize /dev/sda2 (which the &#039;print&#039; will help you determine)&lt;br /&gt;
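The interactive &#039;parted&#039; session above can also be run non-interactively with parted&#039;s `--script` mode (a sketch, under the same assumption that /dev/sda2 is the partition to grow):&lt;br /&gt;

```shell
# Print the partition table to confirm which partition number to grow.
parted --script /dev/sda print

# Grow partition 2 to use all remaining space on the disk
# (equivalent to the interactive: resizepart 2 100%).
parted --script /dev/sda resizepart 2 100%
```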
== Step 3 ==&lt;br /&gt;
To increase the file system&#039;s size:&lt;br /&gt;
* `resize2fs /dev/sda2` (assuming the file system you need to grow is on /dev/sda2)&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_14.54.13.png&amp;diff=592</id>
		<title>File:Screenshot 2025-03-19 at 14.54.13.png</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=File:Screenshot_2025-03-19_at_14.54.13.png&amp;diff=592"/>
		<updated>2025-03-19T13:57:25Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;how to increase disk space&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=514</id>
		<title>Incident Handling</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=514"/>
		<updated>2025-01-27T08:42:14Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Handover */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Zulip migration ==&lt;br /&gt;
Due to a migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:&lt;br /&gt;
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix&lt;br /&gt;
* Triggers are grouped in a topic on Zulip per host&lt;br /&gt;
* When an incident has been fully resolved, mark the topic as resolved, provided all other incidents reported for the host are also resolved&lt;br /&gt;
* There&#039;s no `?ongoing`; instead, for now we can track open incidents by checking for unresolved topics&lt;br /&gt;
* The posting of incidents is less smart (only posting when not posted yet), so to prevent an incident from going unreported due to network issues or the like, a message is posted after an interval (8 hours for non-critical and lower, 1 hour for critical and above) while the incident has not been acknowledged.&lt;br /&gt;
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.&lt;br /&gt;
* There is no automatic gitlab issue creation or syncing anymore.&lt;br /&gt;
&lt;br /&gt;
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.&lt;br /&gt;
&lt;br /&gt;
== Critical incidents ==&lt;br /&gt;
&#039;&#039;&#039;Critical incidents are resolved within 16 hours.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are the person required to do all the work. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.&lt;br /&gt;
Involving multiple people can quickly be required if multiple critical incidents with different causes occur simultaneously. In that case, the First Responder usually takes on a more information-management role and steers those that are brought on into resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quite quickly be determined to be a single issue, the crashed server, so you wouldn&#039;t need to call in people to manage each incident. But a client&#039;s service being down in one cluster while in a different cluster a different VM no longer boots is likely to be two different issues, so you&#039;d want to call in help to resolve them in time.)&lt;br /&gt;
&lt;br /&gt;
=== Process ===&lt;br /&gt;
The general process is made up of the following steps. Each step has additional information on how to handle/execute them in the sections below.&lt;br /&gt;
# Take responsibility for seeing the incident resolved&lt;br /&gt;
# Determine if incident is still ongoing&lt;br /&gt;
# If ongoing: Communicate to affected clients that the issue is being investigated&lt;br /&gt;
# Communicate plan/next steps (even if that is gathering information)&lt;br /&gt;
# Communicate findings/results of executed plan, go back to previous step if not resolved&lt;br /&gt;
# Resolve incident + cleanup&lt;br /&gt;
&lt;br /&gt;
While working on an incident it is expected that all communication is done in the incident&#039;s thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident can be resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident&#039;s thread with the comment that the resolution is done in that thread.&lt;br /&gt;
&lt;br /&gt;
==== Acknowledge the incident on Zabbix ====&lt;br /&gt;
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is however entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or can be because of a shared cause of failure. For example, a server crashing will cause its VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what&#039;s ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to handle first.&lt;br /&gt;
&lt;br /&gt;
* Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident, and stops Zabbix from posting reminders on Zulip.&lt;br /&gt;
&lt;br /&gt;
==== Determine if incident is still ongoing ====&lt;br /&gt;
The next step is to check if the reported problem is still ongoing. Depending on the observations made here, the process to follow and the steps needed to resolve the incident can change. There are three options:&lt;br /&gt;
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.&lt;br /&gt;
# The trigger resolved itself and the problem can still be observed.&lt;br /&gt;
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing it to fire. A simple example would be that Zabbix reports that the DNS for a site can&#039;t be resolved, but in reality there&#039;s a bug in the script we wrote that checks if the DNS resolves, and the DNS resolves fine. Final note: keep in mind that an &#039;it works on my machine&#039; does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice.&lt;br /&gt;
&lt;br /&gt;
In order to make sure you are actually trying to observe the same thing as the trigger is looking for, make sure to check the trigger definition and the current data of the associated item(s). Some triggers might fire if one of multiple conditions is met (Such as a trigger that monitors the ping response time firing if the value exceeds a certain threshold, or if no data for a certain period of time was observed).&lt;br /&gt;
&lt;br /&gt;
Make sure to report your findings in the incident&#039;s thread. It&#039;s advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion why the trigger is firing, and your own observations/pings)&lt;br /&gt;
&lt;br /&gt;
==== Communicate to affected clients ====&lt;br /&gt;
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. This is because critical incidents usually mean the service is down, something the clients notice or are affected by, so we want to be transparent that something is going on. There are some additional notes to this though:&lt;br /&gt;
* If an incident has already resolved itself and the problem is no longer observable, we don&#039;t communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.&lt;br /&gt;
* Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don&#039;t have an exhaustive list, but here are those I can think of:&lt;br /&gt;
** SSH Service is down: We don&#039;t have any clients that SSH into their services, so it&#039;s generally not a problem. SSH is mostly used for SRE maintenance and publishing new builds. The SRE maintenance is an internal problem, so no need to communicate to the client. Publishing is done to Kaboom and the two SM VMs, so an SSH outage there prevents new builds from being published.&lt;br /&gt;
** No backup for x days: Clients don&#039;t notice it if a backup is running late, so no need to communicate with clients. We just need to make sure the backup gets completed&lt;br /&gt;
** SSL certificate is expiring in &amp;lt; 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly, the certificate never actually expired and there has been no disruption to the client&#039;s service, so there is no need to communicate about it.&lt;br /&gt;
* Determining which clients are affected can be done by looking at the host&#039;s DNS in the trigger, and/or looking up the VM in Proxmox and checking the tags of the VMs for client names. In case this issue is causing multiple other critical triggers to fire, you would also have to check which clients are affected by those incidents.&lt;br /&gt;
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.&lt;br /&gt;
&lt;br /&gt;
As always, report the decisions made and actions taken in the incident thread. (e.g.: I&#039;ve sent a message in the Slack to let Kaboom know that we are aware of problem x, and that we are investigating it)&lt;br /&gt;
&lt;br /&gt;
==== Communicate plan/next steps + Communicate findings/results of executed plan ====&lt;br /&gt;
This is the main part of handling an incident. There are several actions you can take in these steps, but at the basis they consist of sharing your next steps, performing those, and reporting the results. The reason all this needs to be reported is to ensure that all known information about a problem is logged, making it easier for someone else to be onboarded into the issue, for later reference if a similar issue is encountered, and even for use during the incident itself in case an older configuration needs to be referenced after you changed it.&lt;br /&gt;
The objective of these steps is determining what is actually wrong and how to resolve it. Depending on the observations made earlier on whether the incident is still ongoing and is (still) observable, your investigation can go in different directions. (e.g. finding the underlying cause for a trigger, or determining why the trigger is firing while it likely shouldn&#039;t, and then how to resolve that underlying cause or how to update the trigger to work better)&lt;br /&gt;
&lt;br /&gt;
There are three main types of steps defined, but you are not limited to these:&lt;br /&gt;
# Hypothesis: If you have an idea what could be causing it, you would state your hypothesis and your next step would be to prove that hypothesis. For example, for an incident &#039;SSH service is down on X&#039; your hypothesis could be that this is due to &#039;MaxStartups&#039; throttling, which can be proven by &#039;grep&#039;ing journalctl for that and comparing the start and end times of throttling with the timestamps of the item reporting the status of the SSH service.&lt;br /&gt;
# Information gathering: Sometimes it just helps to collect some facts about the situation. What information is useful and relevant depends on the trigger, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to an underlying problem in various levels of explicitness), or the ping response from several hosts on the route to a host or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what&#039;s wrong.&lt;br /&gt;
# Investigative: The most rigorous process. The full process is originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize, when you don&#039;t know why something is failing, and/or don&#039;t have any decent hypotheses to follow up, you can follow this process to systematically find the problem.&lt;br /&gt;
&lt;br /&gt;
Regarding the resolution to an incident: The resolution to any incident is usually one of two things:&lt;br /&gt;
# Fix the underlying problem.&lt;br /&gt;
# Fix the trigger itself.&lt;br /&gt;
Fixing the trigger is relatively straightforward, but do make sure to document in the thread what you changed in which trigger.&lt;br /&gt;
Fixing the underlying problem can be more complex. A trade-off sometimes needs to be made between resolving technical debt, or simply patching the current system to resolve the issue. We usually look for a resolution that ensures that the problem won&#039;t re-occur soon, or makes it unexpected/unlikely for the problem to re-occur. Taking into account the timeframe that is available to resolve the incident you can make some trade-offs. An example would be: normal backups of VMs are failing due to the Proxmox Backup Server being down/unreachable, and it is determined that this cannot be resolved at that moment. We can set up automatic backups to local storage temporarily to resolve the immediate problem and ensure we keep our SLOs, versus setting up a new Proxmox Backup Server at a different location. Since we don&#039;t have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.&lt;br /&gt;
&lt;br /&gt;
===== Some known issues and their resolutions =====&lt;br /&gt;
* SSH service is down: The internet is a vile place. There&#039;s constant port scanning and there are hacking attempts against any machine connected to the internet (mostly IPv4). Because of this, SSH has a throttling functionality built in to prevent a system from being DDoSed by the amount of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with a `journalctl -u ssh | grep &#039;MaxStartups throttling&#039;` (you probably want to select a relevant time period with `--since &amp;quot;2 hours ago&amp;quot;` or something similar to prevent having to process a month of logging). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].&lt;br /&gt;
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here. What needs to be done is check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running, by checking that that day&#039;s backups were done for the smaller buckets. The devteam email can be checked to see if the backup process could not start on a day because it was already running (it takes 24+ hours, and an attempt to start it is made each day by cron).&lt;br /&gt;
* git.* HTTPS is down: Mostly on Sundays, GitLab gets automatically updated, which incurs some downtime as the service is restarted. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it&#039;s longer. If the service does not stay down, the incident can simply be resolved.&lt;br /&gt;
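The SSH throttling check described above can be sketched as follows (the two-hour window is just an example; adjust it to the incident&#039;s timeframe):&lt;br /&gt;

```shell
# Look for MaxStartups throttling around the time of the incident.
# Limiting the window keeps journalctl from processing weeks of logs.
journalctl -u ssh --since '2 hours ago' | grep 'MaxStartups throttling'
```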
&lt;br /&gt;
==== Resolve incident + cleanup ====&lt;br /&gt;
When you&#039;ve executed and verified the resolution in the previous steps, we can proceed to resolve the issue in our Mattermost integration. Resolving an incident can be done by doing the following:&lt;br /&gt;
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved if the trigger is still firing. If the trigger is still firing but you&#039;re sure that you&#039;ve resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host&#039;s configuration on Zabbix and selecting &#039;Execute Now&#039;; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to check if it was updated.&lt;br /&gt;
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.&lt;br /&gt;
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.&lt;br /&gt;
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).&lt;br /&gt;
## Ensuring the mentioned problem is no longer observable&lt;br /&gt;
## The trigger has resolved (You might need to force an update with `Execute Now`).&lt;br /&gt;
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.&lt;br /&gt;
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
When you are done, there should be no more critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration for which no one has taken responsibility, or which you have taken responsibility for but are not actively handling.&lt;br /&gt;
&lt;br /&gt;
===Additional context===&lt;br /&gt;
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical &#039;&#039;&#039;SLA - Critical&#039;&#039;&#039;].&lt;br /&gt;
* &amp;lt;s&amp;gt;When it is being tracked on GitLab a heavy check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;Responses on the thread and on GitLab are automatically synced (to some extend)&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;When you reply with &#039;&#039;&#039;I agree that this has been fully resolved&#039;&#039;&#039; eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Non-Critical incidents ==&lt;br /&gt;
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.&lt;br /&gt;
&lt;br /&gt;
=== Acknowledging ===&lt;br /&gt;
Fully acknowledging a non-critical incident requires the following tasks to have been completed:&lt;br /&gt;
* Acknowledging the incident on Zabbix, which means you take responsibility of completing the steps listed below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The next steps don&#039;t have to be done immediately, as they have dependencies, but they must be started and scheduled for completion the next work day.&lt;br /&gt;
&lt;br /&gt;
Check if there&#039;s already an uncompleted milestone for this host with this issue in the metrics sheet.&lt;br /&gt;
If a milestone is already present:&lt;br /&gt;
* Report in the topic the Lynx project ID for resolving this issue.&lt;br /&gt;
* If the ID has already been reported in the topic, we don&#039;t want to report it again and again; instead, add the 🔁 emoji (:repeat:) under the Zabbix bot alert&lt;br /&gt;
&lt;br /&gt;
If a milestone is NOT already present:&lt;br /&gt;
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention&lt;br /&gt;
** Start date is the date of the incident&lt;br /&gt;
** The DoD states what needs to be true for the non-critical incident to be considered resolved&lt;br /&gt;
* Add the non-critical incident to Lynx as a project&lt;br /&gt;
** Follow the naming convention below for the title &amp;amp; project ID&lt;br /&gt;
** Tasks need to be added&lt;br /&gt;
** The final task needs to have the SLO deadline set as &#039;constraint&#039;&lt;br /&gt;
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20&lt;br /&gt;
** The tasks are estimated for SP&lt;br /&gt;
* The Lynx project ID is reported in the non-critical incident&#039;s topic on Zulip, and logged in the metrics sheet&lt;br /&gt;
* An activity is created in Kimai for the non-critical incident, following the naming convention&lt;br /&gt;
&lt;br /&gt;
==== Naming convention ====&lt;br /&gt;
* Kimai activity name needs to follow the pattern: &#039;&amp;lt;YYYY-MM&amp;gt; &amp;lt;problem_title&amp;gt;&#039;. For &amp;lt;problem_title&amp;gt;, incorporate the trigger title and hostname for clarity.&lt;br /&gt;
* Milestone name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project ID needs to follow the pattern: &#039;SRE&amp;lt;YYMM&amp;gt;&amp;lt;XXX&amp;gt;&#039;, where &amp;lt;XXX&amp;gt; is some three letter shorthand that relates to the problem/host&lt;br /&gt;
&lt;br /&gt;
== Informational incidents ==&lt;br /&gt;
* Informational incidents are acknowledged within 72 hours&lt;br /&gt;
&lt;br /&gt;
Checklist&lt;br /&gt;
# Acknowledge on Zabbix&lt;br /&gt;
# Sanity check the event, post result in thread&lt;br /&gt;
# If action needed, perform action&lt;br /&gt;
&lt;br /&gt;
== If an incident is reported by other means than the Zabbix-Zulip integration ==&lt;br /&gt;
Besides the automated Zabbix-Zulip integration, incidents can also be reported through emails from cron jobs, direct emails from customers, or topics in SRE General (such as alerts about Zulip updates or issues raised by colleagues), etc.&lt;br /&gt;
# Acknowledge receipt.&lt;br /&gt;
# Classify the incident as critical, non-critical, or informational.&lt;br /&gt;
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.&lt;br /&gt;
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)&lt;br /&gt;
&lt;br /&gt;
== Handover ==&lt;br /&gt;
When handing over the responsibility of &#039;&#039;&#039;first responder&#039;&#039;&#039; (FR), the following needs to happen:&lt;br /&gt;
* The handover can be initiated by both the upcoming FR or the acting FR&lt;br /&gt;
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group and enables Zabbix calling for the upcoming FR if they have that set by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;br /&gt;
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails or opened topics in SRE General, etc...), updated with the latest status, and properly documented.&lt;br /&gt;
* The upcoming FR makes sure they are aware of the state of the SLA and knows what questions they want to ask the acting FR.&lt;br /&gt;
* The upcoming FR makes sure they are subscribed to the right channels.&lt;br /&gt;
&lt;br /&gt;
The following steps can be done async or in person:&lt;br /&gt;
* The acting FR announces/informs that the upcoming FR has been added to the sla-first-responder group (in Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).&lt;br /&gt;
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.&lt;br /&gt;
* If there are any particularities the upcoming FR needs to be aware of, those are shared.&lt;br /&gt;
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role&lt;br /&gt;
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].&lt;br /&gt;
* The upcoming FR announces/informs that they are now the acting FR over Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel]&lt;br /&gt;
* The now acting FR removes the previous FR from IPA the sla-first-responder user group and disables Zabbix calling for the previous FR if they had that enabled by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=503</id>
		<title>Incident Handling</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=503"/>
		<updated>2025-01-13T08:49:23Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Handover */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Zulip migration ==&lt;br /&gt;
Due to a migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:&lt;br /&gt;
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix&lt;br /&gt;
* Triggers are grouped in a topic on Zulip per host&lt;br /&gt;
* When an incident has been fully resolved, mark the topic as resolved, provided all other incidents reported for the host are also resolved&lt;br /&gt;
* There&#039;s no `?ongoing`; instead, for now we can track open incidents by checking for unresolved topics&lt;br /&gt;
* The posting of incidents is less smart (only posting when not posted yet), so to prevent an incident from going unreported due to network issues or the like, a message is posted after an interval (8 hours for non-critical and lower, 1 hour for critical and above) while the incident has not been acknowledged.&lt;br /&gt;
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.&lt;br /&gt;
* There is no automatic gitlab issue creation or syncing anymore.&lt;br /&gt;
&lt;br /&gt;
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.&lt;br /&gt;
&lt;br /&gt;
== Critical incidents ==&lt;br /&gt;
&#039;&#039;&#039;Critical incidents are resolved within 16 hours.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are the person required to do all the work. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.&lt;br /&gt;
Involving multiple people can quickly be required if multiple critical incidents with different causes occur simultaneously. In that case, the First Responder usually takes on a more information-management role and steers those that are brought on into resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quite quickly be determined to be a single issue, the crashed server, so you wouldn&#039;t need to call in people to manage each incident. But a client&#039;s service being down in one cluster while in a different cluster a different VM no longer boots is likely to be two different issues, so you&#039;d want to call in help to resolve them in time.)&lt;br /&gt;
&lt;br /&gt;
=== Process ===&lt;br /&gt;
The general process is made up of the following steps. Each step has additional information on how to handle/execute them in the sections below.&lt;br /&gt;
# Take responsibility for seeing the incident resolved&lt;br /&gt;
# Determine if incident is still ongoing&lt;br /&gt;
# If ongoing: Communicate to affected clients that the issue is being investigated&lt;br /&gt;
# Communicate plan/next steps (even if that is gathering information)&lt;br /&gt;
# Communicate findings/results of executed plan, go back to previous step if not resolved&lt;br /&gt;
# Resolve incident + cleanup&lt;br /&gt;
&lt;br /&gt;
While working on an incident it is expected that all communication is done in the incident&#039;s thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident can be resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident&#039;s thread with the comment that the resolution is done in that thread.&lt;br /&gt;
&lt;br /&gt;
==== Acknowledge the incident on Zabbix ====&lt;br /&gt;
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is however entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or can be because of a shared cause of failure. For example, a server crashing will cause its VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what&#039;s ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to handle first.&lt;br /&gt;
&lt;br /&gt;
* Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident, and stops Zabbix from posting reminders on Zulip.&lt;br /&gt;
&lt;br /&gt;
==== Determine if incident is still ongoing ====&lt;br /&gt;
The next step is to check if the reported problem is still ongoing. Depending on what you observe here, the process to follow and the steps needed to resolve the incident can change. There are three options:&lt;br /&gt;
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.&lt;br /&gt;
# The trigger resolved itself and the problem can still be observed.&lt;br /&gt;
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing the trigger to fire. A simple example would be that Zabbix reports that the DNS for a site can&#039;t be resolved, but in reality there&#039;s a bug in the script we wrote that checks if the DNS resolves, and the DNS resolves fine. Final note: keep in mind that &#039;it works on my machine&#039; does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice.&lt;br /&gt;
&lt;br /&gt;
In order to make sure you are actually trying to observe the same thing as the trigger is looking for, make sure to check the trigger definition and the current data of the associated item(s). Some triggers might fire if one of multiple conditions is met (such as a trigger that monitors the ping response time firing if the value exceeds a certain threshold, or if no data was observed for a certain period of time).&lt;br /&gt;
&lt;br /&gt;
Make sure to report your findings in the incident&#039;s thread. It&#039;s advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion on why the trigger is firing, and share your own observations/pings.)&lt;br /&gt;
&lt;br /&gt;
==== Communicate to affected clients ====&lt;br /&gt;
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. Critical incidents usually mean the service is down, something the clients can notice/are affected by, so we want to be transparent that something is going on. There are some additional notes to this though:&lt;br /&gt;
* If an incident has already resolved itself and the problem is no longer observable, we don&#039;t communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.&lt;br /&gt;
* Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don&#039;t have an exhaustive list, but here are the ones I can think of:&lt;br /&gt;
** SSH Service is down: We don&#039;t have any clients that SSH into their services, so it&#039;s generally not a problem. SSH is mostly used for SRE maintenance and publishing new builds. The SRE maintenance is an internal problem, so no need to communicate to the client. Publishing is done to Kaboom and the two SM VMs, so SSH being down there prevents new builds from being published.&lt;br /&gt;
** No backup for x days: Clients don&#039;t notice if a backup is running late, so no need to communicate with clients. Just make sure the backup gets completed.&lt;br /&gt;
** SSL certificate is expiring in &amp;lt; 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly, the certificate never actually expired and there has not been any disruption to the client&#039;s service, so no need to communicate about it.&lt;br /&gt;
* Determining which clients are affected can be done by looking at the host&#039;s DNS in the trigger, and/or looking up the VM in Proxmox and checking the tags of the VMs for client names. In the case that this issue is causing multiple other critical triggers to fire, you would also have to check which clients are affected by those incidents.&lt;br /&gt;
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.&lt;br /&gt;
&lt;br /&gt;
As always, report the decisions taken and actions performed in the incident thread. (e.g.: I&#039;ve sent a message on Slack to let Kaboom know that we are aware of problem x, and that we are investigating it)&lt;br /&gt;
&lt;br /&gt;
==== Communicate plan/next steps + Communicate findings/results of executed plan ====&lt;br /&gt;
This is the main part of handling an incident. There are several actions you can take in these steps, but at their core they consist of sharing your next steps, performing them, and reporting the results. The reason all this needs to be reported is to ensure that all known information about a problem is logged, making it easier for someone else to be onboarded into the issue, useful for later reference if a similar issue is encountered, and even for use during the incident itself in case an older configuration needs to be referenced after you changed it.&lt;br /&gt;
The objective of these steps is determining what is actually wrong and how to resolve it. Depending on the observations made earlier on whether the incident is still ongoing and is (still) observable, your investigation can go in different directions. (e.g. finding the underlying cause for a trigger, or determining why the trigger is firing while it likely shouldn&#039;t, and then how to resolve that underlying cause or how to update the trigger to work better)&lt;br /&gt;
&lt;br /&gt;
There are three main types of steps defined, but you are not limited to these:&lt;br /&gt;
# Hypothesis: If you have an idea what could be causing it, you would state your hypothesis and your next step would be to prove it. For example, for an incident &#039;SSH service is down on X&#039; your hypothesis could be that this is due to &#039;MaxStartups&#039; throttling, which can be proven by grep&#039;ing journalctl for that and comparing the start and end times of the throttling with the timestamps of the item reporting the status of the SSH service.&lt;br /&gt;
# Information gathering: Sometimes it just helps to get some facts about the situation collected. What information is useful depends on the trigger, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to the underlying problem in various levels of explicitness), or the ping response from several hosts on the route to a host or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what&#039;s wrong.&lt;br /&gt;
# Investigative: The most rigorous process. The full process is originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize, when you don&#039;t know why something is failing, and/or don&#039;t have any decent hypotheses to follow up on, you can follow this process to systematically find the problem.&lt;br /&gt;
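As a sketch of the hypothesis step, the &#039;MaxStartups&#039; check from step 1 could look roughly like this. The sample log lines below stand in for real `journalctl -u ssh` output so the pipe is self-contained, and the exact sshd message wording may differ per OpenSSH version:&lt;br /&gt;

```shell
# On the affected host you would run something like:
#   journalctl -u ssh --since "2 hours ago" | grep "MaxStartups throttling"
# Here a sample log stands in for the journalctl output (illustrative wording).
log="Jan 13 08:01:02 host sshd[101]: beginning MaxStartups throttling
Jan 13 08:09:45 host sshd[101]: exited MaxStartups throttling, 57 connections dropped"

# Extract the throttling window; compare these timestamps with the Zabbix
# item data for the SSH service to confirm or reject the hypothesis.
printf '%s\n' "$log" | grep "MaxStartups throttling"
```

If the throttling windows line up with the item&#039;s failure timestamps, the hypothesis is proven.&lt;br /&gt;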
&lt;br /&gt;
The resolution to an incident is usually one of two things:&lt;br /&gt;
# Fix the underlying problem.&lt;br /&gt;
# Fix the trigger itself.&lt;br /&gt;
Fixing the trigger is relatively straightforward, but do make sure to document in the thread what you changed in which trigger.&lt;br /&gt;
Fixing the underlying problem can be more complex. A trade-off sometimes needs to be made between resolving technical debt and simply patching the current system to resolve the issue. We usually look for a resolution that ensures the problem won&#039;t re-occur soon, or makes it unexpected/unlikely for the problem to re-occur. Taking into account the timeframe available to resolve the incident, you can make some trade-offs. An example would be: normal backups of VMs are failing because the Proxmox backup server is down/unreachable, and it is determined that this cannot be resolved at that moment. We can temporarily set up automatic backups to local storage to resolve the immediate problem and ensure we keep our SLO&#039;s, versus setting up a new Proxmox Backup Server at a different location. Since we don&#039;t have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.&lt;br /&gt;
&lt;br /&gt;
===== Some known issues and their resolutions =====&lt;br /&gt;
* SSH service is down: The internet is a vile place. There&#039;s constant port scanning and there are constant hacking attempts against any machine connected to the internet (mostly over IPv4). Because of this, SSH has throttling functionality built in to prevent a system from being overwhelmed (DDoS&#039;ed) by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with `journalctl -u ssh | grep &#039;MaxStartups throttling&#039;` (you probably want to select a relevant time period with `--since &amp;quot;2 hours ago&amp;quot;` or something similar to prevent having to process a month of logging). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].&lt;br /&gt;
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here. What needs to be done is check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running by checking that that day&#039;s backups were done for the smaller buckets. The devteam email can be checked to see if the backup process could not start on a given day because it was already running (it takes 24+ hours, and cron attempts to start it each day).&lt;br /&gt;
* git.* HTTPS is down: Mostly on Sundays, GitLab gets automatically updated, which incurs some downtime as the service is restarted. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it&#039;s longer. If the service does not stay down, the incident can simply be resolved.&lt;br /&gt;
&lt;br /&gt;
==== Resolve incident + cleanup ====&lt;br /&gt;
When you&#039;ve executed and verified the resolution in the previous steps, we can proceed to resolve the issue in our Mattermost integration. Resolving an incident can be done as follows:&lt;br /&gt;
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved in that case. If the trigger is still firing but you&#039;re sure that you&#039;ve resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host&#039;s configuration on Zabbix and selecting &#039;Execute Now&#039;; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to check if it was updated.&lt;br /&gt;
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
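The &#039;Execute Now&#039; step can also be scripted against the Zabbix JSON-RPC API if needed. This is only a sketch: the API token and item ID below are placeholders, and `task.create` with type 6 (&#039;check now&#039;) assumes a reasonably recent Zabbix version.&lt;br /&gt;

```shell
# Sketch: queue an immediate re-check of a Zabbix item via the JSON-RPC API.
# API_TOKEN and ITEMID are placeholders, not real values.
ZABBIX_URL=https://status.delftinfra.net/zabbix/api_jsonrpc.php
ITEMID=12345

# task.create with type 6 asks the server to re-execute the item now.
PAYLOAD='{"jsonrpc":"2.0","method":"task.create","params":[{"type":6,"request":{"itemid":"'$ITEMID'"}}],"auth":"API_TOKEN","id":1}'
echo "$PAYLOAD"

# Uncomment to actually send the request:
# curl -s -X POST "$ZABBIX_URL" -H 'Content-Type: application/json-rpc' -d "$PAYLOAD"
```

If this path is used, document in the incident thread that the item was forced to update, just as when using the UI.&lt;br /&gt;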
&lt;br /&gt;
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.&lt;br /&gt;
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.&lt;br /&gt;
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).&lt;br /&gt;
## Ensuring the mentioned problem is no longer observable&lt;br /&gt;
## Verifying that the trigger has resolved (you might need to force an update with `Execute Now`).&lt;br /&gt;
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.&lt;br /&gt;
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
When you are done, there should be no more critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration that no one has taken responsibility for, or that you have taken responsibility for but are not actively handling.&lt;br /&gt;
&lt;br /&gt;
===Additional context===&lt;br /&gt;
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical &#039;&#039;&#039;SLA - Critical&#039;&#039;&#039;].&lt;br /&gt;
* &amp;lt;s&amp;gt;When it is being tracked on GitLab a heavy check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;Responses on the thread and on GitLab are automatically synced (to some extend)&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;When you reply with &#039;&#039;&#039;I agree that this has been fully resolved&#039;&#039;&#039; eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Non-Critical incidents ==&lt;br /&gt;
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.&lt;br /&gt;
&lt;br /&gt;
=== Acknowledging ===&lt;br /&gt;
Fully acknowledging a non-critical incident requires the following tasks to have been completed:&lt;br /&gt;
* Acknowledging the incident on Zabbix, which means you take responsibility for completing the steps listed below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The next steps don&#039;t have to be done immediately, as they have dependencies, but they should be started and scheduled for completion by the next work day.&lt;br /&gt;
&lt;br /&gt;
Check if there&#039;s already an uncompleted milestone for this host with this issue in the metrics sheet.&lt;br /&gt;
If a milestone is already present:&lt;br /&gt;
* Report in the topic the Lynx project ID for resolving this issue.&lt;br /&gt;
* If the ID has already been reported in the topic, we don&#039;t want to report it again and again; instead, add the 🔁 emoji (:repeat:) under the Zabbix bot alert&lt;br /&gt;
&lt;br /&gt;
If a milestone is NOT already present:&lt;br /&gt;
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention&lt;br /&gt;
** Start date is the date of the incident&lt;br /&gt;
** The DoD states what needs to be true for the non-critical incident to be considered resolved&lt;br /&gt;
* Add the non-critical incident to Lynx as a project&lt;br /&gt;
** Follow the naming convention below for the title &amp;amp; project ID&lt;br /&gt;
** Tasks need to be added&lt;br /&gt;
** The final task needs to have the SLO deadline set as &#039;constraint&#039;&lt;br /&gt;
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20&lt;br /&gt;
** The tasks are estimated in SP&lt;br /&gt;
* The Lynx project ID is reported in the non-critical incident&#039;s topic on Zulip, and logged in the metrics sheet&lt;br /&gt;
* An activity is created in Kimai for the non-critical incident, following the naming convention&lt;br /&gt;
&lt;br /&gt;
==== Naming convention ====&lt;br /&gt;
* Kimai activity name needs to follow the pattern: &#039;&amp;lt;YYYY-MM&amp;gt; &amp;lt;problem_title&amp;gt;&#039;. For &amp;lt;problem_title&amp;gt;, incorporate the trigger title and hostname for clarity.&lt;br /&gt;
* Milestone name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project ID needs to follow the pattern: &#039;SRE&amp;lt;YYMM&amp;gt;&amp;lt;XXX&amp;gt;&#039;, where &amp;lt;XXX&amp;gt; is some three letter shorthand that relates to the problem/host&lt;br /&gt;
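As a sketch, the project ID pattern can be derived like this (the &#039;GIT&#039; shorthand is a hypothetical example of a three-letter shorthand relating to the problem/host):&lt;br /&gt;

```shell
# Sketch: build a Lynx project ID following the SRE + YYMM + shorthand pattern.
# "GIT" is a hypothetical shorthand relating to the problem/host.
yymm=$(date +%y%m)
project_id="SRE${yymm}GIT"
echo "$project_id"
```

The resulting string is what gets reported in the incident&#039;s topic and logged in the metrics sheet.&lt;br /&gt;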
&lt;br /&gt;
== Informational incidents ==&lt;br /&gt;
* Informational incidents are acknowledged within 72 hours&lt;br /&gt;
&lt;br /&gt;
Checklist&lt;br /&gt;
# Acknowledge on Zabbix&lt;br /&gt;
# Sanity check the event, post result in thread&lt;br /&gt;
# If action needed, perform action&lt;br /&gt;
&lt;br /&gt;
== If an incident is reported by other means than the Zabbix-Zulip integration ==&lt;br /&gt;
Besides the automated Zabbix-Zulip integration, incidents can also be reported through emails from cron jobs, direct emails from customers, topics in SRE General (such as alerts about Zulip updates or issues raised by colleagues), etc.&lt;br /&gt;
# Acknowledge receipt.&lt;br /&gt;
# Classify the incident as critical, non-critical, or informational.&lt;br /&gt;
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.&lt;br /&gt;
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)&lt;br /&gt;
&lt;br /&gt;
== Handover ==&lt;br /&gt;
When handing over the responsibility of &#039;&#039;&#039;first responder&#039;&#039;&#039; (FR), the following needs to happen:&lt;br /&gt;
* The handover can be initiated by either the upcoming FR or the acting FR&lt;br /&gt;
* The acting FR adds the upcoming FR to the IPA sla-first-responder user group, and enables Zabbix calling for the upcoming FR if they have that set, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;br /&gt;
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes alert emails or opened topics in SRE General, etc...), updated with the latest status, and properly documented.&lt;br /&gt;
* The upcoming FR makes sure they are aware of the state of the SLA and know what questions they want to ask the acting FR.&lt;br /&gt;
* The upcoming FR makes sure they are subscribed to the right channels.&lt;br /&gt;
&lt;br /&gt;
The following steps can be done async or in person:&lt;br /&gt;
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).&lt;br /&gt;
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.&lt;br /&gt;
* If there are any particularities the upcoming FR needs to be aware of, those are shared.&lt;br /&gt;
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role&lt;br /&gt;
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].&lt;br /&gt;
* The upcoming FR announces/informs that they are now the acting FR over Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel]&lt;br /&gt;
* The now acting FR removes the previous FR from the IPA sla-first-responder user group, and disables Zabbix calling for the previous FR if they had that enabled, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=502</id>
		<title>Incident Handling</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=502"/>
		<updated>2025-01-13T08:49:01Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* If an incident is reported by other means than the Zabbix-Zulip intergration */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Zulip migration ==&lt;br /&gt;
Due to a migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:&lt;br /&gt;
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix&lt;br /&gt;
* Triggers are grouped in a topic on Zulip per host&lt;br /&gt;
* When an incident has been fully resolved, mark the topic as resolved once any other incidents reported for the host are also resolved&lt;br /&gt;
* There&#039;s no `?ongoing`, instead for now we can track open incidents by checking for unresolved topics&lt;br /&gt;
* The posting of incidents is less smart (only posting when not posted yet), so in order to prevent an incident from going unreported due to network issues or the like, a message is posted after an interval (8 hours for non-critical and lower, 1 hour for critical and above) while the incident has not been acknowledged.&lt;br /&gt;
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.&lt;br /&gt;
* There is no automatic GitLab issue creation or syncing anymore.&lt;br /&gt;
&lt;br /&gt;
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.&lt;br /&gt;
&lt;br /&gt;
== Critical incidents ==&lt;br /&gt;
&#039;&#039;&#039;Critical incidents are resolved within 16 hours.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are the person required to do all the work. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.&lt;br /&gt;
Involving multiple people can quickly be required if multiple critical incidents with different causes occur simultaniously. In that case, the First Responder usually takes on a more information management role and steers those that are brought on into resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quite quickly be determined to be a single issue, the crashed server. So you wouldn&#039;t need to call in people to manage each incident. But a client&#039;s service being down in one cluster while in a different cluster a different VM no longer boots is likely to be to different issues, so in order to resolve them on time you&#039;d want to call in help to resolve the incident in time).&lt;br /&gt;
&lt;br /&gt;
=== Process ===&lt;br /&gt;
The general process is made up of the folowing steps. Each step has additional information on how to handle/execute them in the sections below.&lt;br /&gt;
# Take responsibility for seeing the incident resolved&lt;br /&gt;
# Determine if incident is still ongoing&lt;br /&gt;
# If ongoing: Communicate to affected clients that the issue is being investigated&lt;br /&gt;
# Communicate plan/next steps (even if that is gathering information)&lt;br /&gt;
# Communicate findings/results of executed plan, go back to previous step if not resolved&lt;br /&gt;
# Resolve incident + cleanup&lt;br /&gt;
&lt;br /&gt;
During working on an incident it is expected that all communication is done in the incident&#039;s thread. This means all information to a problem can be found in a clear a predictable place. Sometimes an incident can be resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident&#039;s thread with the comment that the resolution is done in that thread.&lt;br /&gt;
&lt;br /&gt;
==== Acknowledge the incident on Zabbix ====&lt;br /&gt;
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is however entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or can be because of a share cause of failure. For example, a server crashing will cause server VM&#039; to reboot, or the router having an connectivity issue will lead to most other VM&#039;s having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what&#039;s ongoing, Zabbix is the best source of firing triggers for this, and pick the incident that is likely the root cause to  &lt;br /&gt;
&lt;br /&gt;
* Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident. And stops Zabbix from posting reminders on Zulip.&lt;br /&gt;
&lt;br /&gt;
==== Determine if incident is still ongoing ====&lt;br /&gt;
The next step is to check if the reported problem is still ongoing. Depending on the observations made here your process to follow and steps needed to resolve the incident can change. There are three options:&lt;br /&gt;
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.&lt;br /&gt;
# The trigger resolved itself and the problem can still be observed.&lt;br /&gt;
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing it to fire. A simple example would be that Zabbix reports that the the DNS for a site can&#039;t be resolved, but in reality there&#039;s a bug in the script we wrote that checks if the DNS resolves and the DNS resolves fine. Final note: keep in mind that an &#039;it works on my machine&#039; does not necessarily mean it works for most other people, so depening on the trigger you need to do some evaluations if your tests suffice. &lt;br /&gt;
&lt;br /&gt;
In order to make sure you are actually trying to observe the same thing as the trigger is looking for, make sure to check the trigger definition and the current data of the associated item(s). Some triggers might fire if one of multiple conditions is met (Such as a trigger that monitors the ping response time firing if the value exceeds a certain threshold, or if no data for a certain period of time was observed).&lt;br /&gt;
&lt;br /&gt;
Make sure to report your findings in the incident&#039;s thread. It&#039;s advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion why the trigger is firing, and your own observations/pings)&lt;br /&gt;
&lt;br /&gt;
==== Communicate to affected clients ====&lt;br /&gt;
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. This is because critical incident usually mean the service is down, something the clients can notice/are affected by, so we to be transparent that something is going on. There are some additional notes to this though:&lt;br /&gt;
* If an incident has already resolved itself and the problem is no longer observable, we don&#039;t communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.&lt;br /&gt;
* Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don&#039;t have an exhaustive list, but here is those I can think of:&lt;br /&gt;
** SSH Service is down: We don&#039;t have any clients that SSH into their services, so it&#039;s generally not a problem. But SSH is mostly used for SRE maintenance and publishing new builds. The SRE maintenance is an internal problem, so no need to communicate to the client. The publishing is done to Kaboom, preventing new builds from being published, and the two SM VM&#039;s.&lt;br /&gt;
** No backup for x days: Clients don&#039;t notice it if a backup is running late, so no need to communicate with clients. Just need to make sure the backup gets completed&lt;br /&gt;
** SSL certificate is expiring in &amp;lt; 24 hours: This is a bit dependent on how soon this incident is being handled, but if it handled quickly, the certificate never actually expired, and there has not been any disruption to the client&#039;s service, so no need for communicating about it.&lt;br /&gt;
* Determining which clients are being affected can be done by looking at the host&#039;s DNS in the trigger, and/or looking up the VM in Proxmox and checking the tags of the VM&#039;s for client names. In the case that this issue is causing multiple other critical triggers to fire, you would have to check for which clients are affected by those incidents.&lt;br /&gt;
* Communicating to DS about ongoing incidents is usually assumed to be automaticly have been done by the fact that the incident was reported on Zulip.&lt;br /&gt;
&lt;br /&gt;
As always, report the decisions taken and actions maded in the incident thread. (e.g.: I&#039;ve sent a message in the Slack to let Kaboom know that we aware of problem x, and that we are investigating it)&lt;br /&gt;
&lt;br /&gt;
==== Communicate plan/next steps + Communicate findings/results of executed plan ====&lt;br /&gt;
This is the main part of handling an incident. There are several actions you can take in these steps, but at the basis they consist of sharing your next steps, performing those, and reporting the results. The reason all this needs to be reported is to ensure that all known information about a problem is logged, making it easier for someone else to be onboarded into the issue, for later reference if a similar issue is encountered, and even for use during the incident itself in case an older configuration needs to be referenced after you changed it.&lt;br /&gt;
The objective from these steps is determining what is actually wrong and how to resolve it. Depending on the observations made earlier on whether the incident is still ongoing and is (still) observable your investigation can go into different directions. (e.g. Find the underlying cause for a trigger, or determining why the trigger is firing while it likely shouldn&#039;t, and then how to resolve that underlying cause or how to update the trigger to work better)&lt;br /&gt;
&lt;br /&gt;
There are three main types of steps defined, but you are not limited to these:&lt;br /&gt;
# Hypothesis: If you have an idea what could be causing it, you state your hypothesis and your next step would be to prove that hypothesis. For example, for an incident &#039;SSH service is down on X&#039; your hypothesis could be that this is due to &#039;MaxStartups&#039; throttling, which can be proven by grep&#039;ing journalctl for it and comparing the start and end times of throttling with the timestamps of the item reporting the status of the SSH service.&lt;br /&gt;
# Information gathering: Sometimes it just helps to collect some facts about the situation. What information is useful depends on the trigger, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to the underlying problem in varying levels of explicitness), or the ping response from several hosts on the route to a host, or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what&#039;s wrong.&lt;br /&gt;
# Investigative: The most rigorous process. The full process is originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Countdown - General Investigative Process]. To summarize: when you don&#039;t know why something is failing, and/or don&#039;t have any decent hypotheses to follow up on, you can follow this process to systematically find the problem.&lt;br /&gt;
&lt;br /&gt;
The resolution to an incident is usually one of two things:&lt;br /&gt;
# Fix the underlying problem.&lt;br /&gt;
# Fix the trigger itself.&lt;br /&gt;
Fixing the trigger is relatively straightforward, but do make sure to document in the thread what you changed in which trigger.&lt;br /&gt;
Fixing the underlying problem can be more complex. A trade-off sometimes needs to be made between resolving technical debt and simply patching the current system to resolve the issue. We usually look for a resolution that ensures that the problem won&#039;t re-occur soon, or makes it unexpected/unlikely for the problem to re-occur. Taking into account the timeframe that is available to resolve the incident, you can make some trade-offs. An example: normal backups of VMs are failing because the Proxmox backup server is down/unreachable, and it is determined that this cannot be resolved at that moment. We can temporarily set up automatic backups to local storage to resolve the immediate problem and keep our SLO&#039;s, versus setting up a new Proxmox Backup server at a different location. Since we don&#039;t have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.&lt;br /&gt;
&lt;br /&gt;
===== Some known issues and their resolutions =====&lt;br /&gt;
* SSH service is down: The internet is a vile place. There is constant port scanning and there are ongoing hacking attempts against any machine connected to the internet (mostly over IPv4). Because of this, SSH has a throttling mechanism built in to prevent a system from being DDoS&#039;ed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with `journalctl -u ssh | grep &#039;MaxStartups throttling&#039;` (you probably want to select a relevant time period with `--since &amp;quot;2 hours ago&amp;quot;` or something similar to avoid processing a month of logs). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].&lt;br /&gt;
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here. What needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running, by checking that that day&#039;s backups were done for the smaller buckets. The devteam email can be checked to see if the backup process could not start on a given day because it was already running (it takes 24+ hours, and cron attempts to start it each day).&lt;br /&gt;
* git.* HTTPS is down: Mostly on Sundays, GitLab gets automatically updated, which incurs some downtime as the service is restarted. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it takes longer. If the service does not stay down, the issue can simply be resolved.&lt;br /&gt;
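&lt;br /&gt;
The MaxStartups check above can be sketched roughly as follows (a hypothetical helper, not existing tooling; the log lines are illustrative samples in `journalctl -o short-iso` style, with timezone offsets omitted for brevity):&lt;br /&gt;

```python
from datetime import datetime

# Illustrative sample of `journalctl -u ssh -o short-iso` output, filtered on
# MaxStartups (timestamps simplified: real short-iso output includes an offset)
LOG = [
    '2025-01-13T08:00:01 host sshd[311]: beginning MaxStartups throttling',
    '2025-01-13T08:02:41 host sshd[311]: exited MaxStartups throttling after 160 seconds',
]

def throttling_windows(lines):
    # Pair each 'beginning' timestamp with the next 'exited' timestamp
    windows, start = [], None
    for line in lines:
        stamp = datetime.fromisoformat(line.split()[0])
        if 'beginning MaxStartups' in line:
            start = stamp
        elif 'exited MaxStartups' in line and start is not None:
            windows.append((start, stamp))
            start = None
    return windows

def failure_explained(item_time, windows):
    # True if a failed Zabbix SSH check falls inside a throttling window;
    # sorted() stands in for a start/end range check
    return any(sorted([start, item_time, end]) == [start, item_time, end]
               for start, end in windows)

for window in throttling_windows(LOG):
    print('throttled:', window[0], 'to', window[1])
```

Comparing these windows against the timestamps of the failed SSH item checks in the Zabbix latest data either confirms or refutes the hypothesis.&lt;br /&gt;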
&lt;br /&gt;
==== Resolve incident + cleanup ====&lt;br /&gt;
When you&#039;ve executed and verified the resolution in the previous steps, we can proceed to resolve the issue in our Mattermost integration. Resolving an incident is done as follows:&lt;br /&gt;
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved if the trigger is still firing. If the trigger is still firing but you&#039;re sure that you&#039;ve resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host&#039;s configuration on Zabbix and selecting &#039;Execute Now&#039;; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to check if it was updated.&lt;br /&gt;
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify the critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.&lt;br /&gt;
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.&lt;br /&gt;
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).&lt;br /&gt;
## Ensuring the mentioned problem is no longer observable&lt;br /&gt;
## Ensuring the trigger has resolved (you might need to force an update with `Execute Now`).&lt;br /&gt;
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.&lt;br /&gt;
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
When you are done, there should be no more critical triggers firing in Zabbix or open in the Zabbix-Mattermost integration for which no one has taken responsibility, or which you have taken responsibility for but are not actively handling.&lt;br /&gt;
&lt;br /&gt;
===Additional context===&lt;br /&gt;
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical &#039;&#039;&#039;SLA - Critical&#039;&#039;&#039;].&lt;br /&gt;
* &amp;lt;s&amp;gt;When it is being tracked on GitLab a heavy check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;Responses on the thread and on GitLab are automatically synced (to some extent)&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;When you reply with &#039;&#039;&#039;I agree that this has been fully resolved&#039;&#039;&#039; eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Non-Critical incidents ==&lt;br /&gt;
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.&lt;br /&gt;
&lt;br /&gt;
=== Acknowledging ===&lt;br /&gt;
Fully acknowledging a non-critical incident requires the following tasks to have been completed:&lt;br /&gt;
* Acknowledging the incident on Zabbix, which means you take responsibility for completing the steps listed below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The next steps don&#039;t have to be done immediately, as they have dependencies, but they must be started and scheduled for completion the next work day.&lt;br /&gt;
&lt;br /&gt;
Check if there&#039;s already an uncompleted milestone for this host with this issue in the metrics sheet.&lt;br /&gt;
If a milestone is already present:&lt;br /&gt;
* Report in the topic the Lynx project ID for resolving this issue.&lt;br /&gt;
* If the ID has already been reported in the topic, we don&#039;t want to report it again; instead, add the 🔁 emoji (:repeat:) under the Zabbix bot alert&lt;br /&gt;
&lt;br /&gt;
If a milestone is NOT already present:&lt;br /&gt;
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention&lt;br /&gt;
** Start date is the date of the incident&lt;br /&gt;
** The DoD states what needs to be true for the non-critical incident to be considered resolved&lt;br /&gt;
* Add the non-critical incident to Lynx as a project&lt;br /&gt;
** Follow the naming convention below for the title &amp;amp; project ID&lt;br /&gt;
** Tasks need to be added&lt;br /&gt;
** The final task needs to have the SLO deadline set as &#039;constraint&#039;&lt;br /&gt;
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20&lt;br /&gt;
** The tasks are estimated for SP&lt;br /&gt;
* The Lynx project ID is reported in the non-critical incident&#039;s topic on Zulip, and logged in the metrics sheet&lt;br /&gt;
* A Kimai activity is created for the non-critical incident, following the naming convention&lt;br /&gt;
&lt;br /&gt;
==== Naming convention ====&lt;br /&gt;
* Kimai activity name needs to follow the pattern: &#039;&amp;lt;YYYY-MM&amp;gt; &amp;lt;problem_title&amp;gt;&#039;. For &amp;lt;problem_title&amp;gt;, incorporate the trigger title and hostname for clarity.&lt;br /&gt;
* Milestone name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project ID needs to follow the pattern: &#039;SRE&amp;lt;YYMM&amp;gt;&amp;lt;XXX&amp;gt;&#039;, where &amp;lt;XXX&amp;gt; is some three letter shorthand that relates to the problem/host&lt;br /&gt;
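&lt;br /&gt;
The naming convention above can be sketched as a small helper (a hypothetical script, not part of any existing tooling; the example problem title and three-letter shorthand are made up):&lt;br /&gt;

```python
from datetime import date

def kimai_activity_name(problem_title, when):
    # Pattern: YYYY-MM followed by the problem title
    # (incorporate the trigger title and hostname for clarity)
    return f'{when.year:04d}-{when.month:02d} {problem_title}'

def milestone_and_project_name(activity_name):
    # The milestone and the Lynx project share the same pattern
    return f'Delft Solutions Hosting Incident response work {activity_name}'

def lynx_project_id(when, shorthand):
    # Pattern: SRE, then YYMM, then a three-letter shorthand for the problem/host
    assert len(shorthand) == 3
    return f'SRE{when.year % 100:02d}{when.month:02d}{shorthand.upper()}'

when = date(2025, 1, 13)
activity = kimai_activity_name('SSH service is down on host-x', when)
print(activity)
print(milestone_and_project_name(activity))
print(lynx_project_id(when, 'ssh'))
```

The same activity name feeds both the milestone and the Lynx project name, so generating it once keeps the three records consistent.&lt;br /&gt;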
&lt;br /&gt;
== Informational incidents ==&lt;br /&gt;
* Informational incidents are acknowledged within 72 hours&lt;br /&gt;
&lt;br /&gt;
Checklist&lt;br /&gt;
# Acknowledge on Zabbix&lt;br /&gt;
# Sanity check the event, post result in thread&lt;br /&gt;
# If action needed, perform action&lt;br /&gt;
&lt;br /&gt;
== If an incident is reported by other means than the Zabbix-Zulip integration ==&lt;br /&gt;
Besides the automated Zabbix-Zulip integration, incidents can also be reported through emails from cron jobs, direct emails from customers, or topics in SRE General (such as alerts about Zulip updates or issues raised by colleagues), etc.&lt;br /&gt;
# Acknowledge receipt.&lt;br /&gt;
# Classify the incident as critical, non-critical, or informational.&lt;br /&gt;
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.&lt;br /&gt;
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)&lt;br /&gt;
&lt;br /&gt;
== Handover ==&lt;br /&gt;
When handing over the responsibility of &#039;&#039;&#039;first responder&#039;&#039;&#039; (FR), the following needs to happen:&lt;br /&gt;
* The handover can be initiated by either the upcoming FR or the acting FR&lt;br /&gt;
* The acting FR adds the upcoming FR to the IPA sla-first-responder user group and enables Zabbix calling for the upcoming FR if they have that set, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;br /&gt;
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes emails or opened topics in SRE General, etc...), updated with the latest status, and properly documented.&lt;br /&gt;
* The upcoming FR makes sure they are aware of the state of the SLA and knows what questions they want to ask the acting FR.&lt;br /&gt;
* The upcoming FR makes sure they are subscribed to the right channels.&lt;br /&gt;
&lt;br /&gt;
The following steps can be done async or in person:&lt;br /&gt;
* The acting FR announces/informs that the upcoming FR has been added to the sla-first-responder group (in Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).&lt;br /&gt;
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.&lt;br /&gt;
* If there are any particularities the upcoming FR needs to be aware of, those are shared.&lt;br /&gt;
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role&lt;br /&gt;
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].&lt;br /&gt;
* The upcoming FR announces/informs that they are now the acting FR over Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel]&lt;br /&gt;
* The now-acting FR removes the previous FR from the IPA sla-first-responder user group and disables Zabbix calling for the previous FR if they had that enabled, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=501</id>
		<title>Incident Handling</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=501"/>
		<updated>2025-01-13T08:47:08Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Handover */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Zulip migration ==&lt;br /&gt;
Due to a migration to Zulip, the integration as it was available on Mattermost is not yet available on Zulip. This leads to the following process changes:&lt;br /&gt;
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix&lt;br /&gt;
* Triggers are grouped in a topic on Zulip per host&lt;br /&gt;
* When an incident has been fully resolved, mark the topic as resolved once any other incidents reported for the host are also resolved&lt;br /&gt;
* There&#039;s no `?ongoing`; instead, for now we can track open incidents by checking for unresolved topics&lt;br /&gt;
* The posting of incidents is less smart (it only posts when not posted yet), so to prevent an incident from going unreported due to network issues or the like, a message is posted at an interval (8 hours for non-critical and lower, 1 hour for critical and above) while the incident has not been acknowledged.&lt;br /&gt;
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.&lt;br /&gt;
* There is no automatic GitLab issue creation or syncing anymore.&lt;br /&gt;
&lt;br /&gt;
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.&lt;br /&gt;
&lt;br /&gt;
== Critical incidents ==&lt;br /&gt;
&#039;&#039;&#039;Critical incidents are resolved within 16 hours.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are the person required to do all the work. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on-call, they are not obliged to help you, especially outside of normal working hours.&lt;br /&gt;
Involving multiple people can quickly become necessary if multiple critical incidents with different causes occur simultaneously. In that case, the First Responder usually takes on a more information-management role and steers those that are brought on towards resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quite quickly be determined to be a single issue, the crashed server, so you wouldn&#039;t need to call in people to manage each incident. But a client&#039;s service being down in one cluster while in a different cluster another VM no longer boots is likely to be two different issues, so you&#039;d want to call in help to resolve them in time.)&lt;br /&gt;
&lt;br /&gt;
=== Process ===&lt;br /&gt;
The general process is made up of the following steps. Each step has additional information on how to handle/execute it in the sections below.&lt;br /&gt;
# Take responsibility for seeing the incident resolved&lt;br /&gt;
# Determine if incident is still ongoing&lt;br /&gt;
# If ongoing: Communicate to affected clients that the issue is being investigated&lt;br /&gt;
# Communicate plan/next steps (even if that is gathering information)&lt;br /&gt;
# Communicate findings/results of executed plan, go back to previous step if not resolved&lt;br /&gt;
# Resolve incident + cleanup&lt;br /&gt;
&lt;br /&gt;
While working on an incident it is expected that all communication is done in the incident&#039;s thread. This means all information about a problem can be found in a clear and predictable place. Sometimes an incident can be resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident&#039;s thread with the comment that the resolution is done there.&lt;br /&gt;
&lt;br /&gt;
==== Acknowledge the incident on Zabbix ====&lt;br /&gt;
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is, however, entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or it can be because of a shared cause of failure. For example, a server crashing will cause that server&#039;s VMs to reboot, or the router having a connectivity issue will lead to most other VMs having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what&#039;s ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause.&lt;br /&gt;
&lt;br /&gt;
* Acknowledging an incident on Zabbix stops Zabbix from calling the First Responder to notify them of the ongoing incident, and stops Zabbix from posting reminders on Zulip.&lt;br /&gt;
&lt;br /&gt;
==== Determine if incident is still ongoing ====&lt;br /&gt;
The next step is to check if the reported problem is still ongoing. Depending on the observations made here, the process to follow and the steps needed to resolve the incident can change. There are three options:&lt;br /&gt;
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.&lt;br /&gt;
# The trigger resolved itself and the problem can still be observed.&lt;br /&gt;
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing it to fire. A simple example would be that Zabbix reports that the DNS for a site can&#039;t be resolved, but in reality there&#039;s a bug in the script we wrote that checks if the DNS resolves, and the DNS resolves fine. Final note: keep in mind that &#039;it works on my machine&#039; does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice.&lt;br /&gt;
&lt;br /&gt;
In order to make sure you are actually trying to observe the same thing as the trigger is looking for, make sure to check the trigger definition and the current data of the associated item(s). Some triggers might fire if one of multiple conditions is met (Such as a trigger that monitors the ping response time firing if the value exceeds a certain threshold, or if no data for a certain period of time was observed).&lt;br /&gt;
&lt;br /&gt;
Make sure to report your findings in the incident&#039;s thread. It&#039;s advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion why the trigger is firing, and your own observations/pings)&lt;br /&gt;
&lt;br /&gt;
==== Communicate to affected clients ====&lt;br /&gt;
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. This is because critical incidents usually mean the service is down, something the clients can notice/are affected by, so we want to be transparent that something is going on. There are some additional notes to this though:&lt;br /&gt;
* If an incident has already resolved itself and the problem is no longer observable, we don&#039;t communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.&lt;br /&gt;
* Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don&#039;t have an exhaustive list, but here are those I can think of:&lt;br /&gt;
** SSH Service is down: We don&#039;t have any clients that SSH into their services, so it&#039;s generally not a problem for them. SSH is mostly used for SRE maintenance and publishing new builds. The SRE maintenance is an internal problem, so no need to communicate to the client. Publishing is done to Kaboom and the two SM VM&#039;s, so an SSH outage there prevents new builds from being published.&lt;br /&gt;
** No backup for x days: Clients don&#039;t notice if a backup is running late, so there&#039;s no need to communicate with clients. We just need to make sure the backup gets completed.&lt;br /&gt;
** SSL certificate is expiring in &amp;lt; 24 hours: This depends a bit on how quickly the incident is handled, but if it is handled quickly, the certificate never actually expired and there has been no disruption to the client&#039;s service, so there is no need to communicate about it.&lt;br /&gt;
* Determining which clients are affected can be done by looking at the host&#039;s DNS in the trigger, and/or by looking up the VM in Proxmox and checking its tags for client names. If this issue is causing multiple other critical triggers to fire, you also have to check which clients are affected by those incidents.&lt;br /&gt;
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.&lt;br /&gt;
&lt;br /&gt;
As always, report the decisions taken and actions performed in the incident thread. (e.g.: I&#039;ve sent a message in the Slack to let Kaboom know that we are aware of problem x, and that we are investigating it)&lt;br /&gt;
&lt;br /&gt;
==== Communicate plan/next steps + Communicate findings/results of executed plan ====&lt;br /&gt;
This is the main part of handling an incident. There are several actions you can take in these steps, but at their core they consist of sharing your next steps, performing them, and reporting the results. The reason all this needs to be reported is to ensure that all known information about a problem is logged, making it easier for someone else to be onboarded into the issue, for later reference if a similar issue is encountered, and even for use during the incident itself in case an older configuration needs to be referenced after you changed it.&lt;br /&gt;
The objective of these steps is to determine what is actually wrong and how to resolve it. Depending on the observations made earlier on whether the incident is still ongoing and is (still) observable, your investigation can go in different directions. (e.g. finding the underlying cause for a trigger, or determining why the trigger is firing while it likely shouldn&#039;t, and then how to resolve that underlying cause or how to update the trigger to work better)&lt;br /&gt;
&lt;br /&gt;
There are three main types of steps defined, but you are not limited to these:&lt;br /&gt;
# Hypothesis: If you have an idea what could be causing it, you state your hypothesis and your next step would be to prove that hypothesis. For example, for an incident &#039;SSH service is down on X&#039; your hypothesis could be that this is due to &#039;MaxStartups&#039; throttling, which can be proven by grep&#039;ing journalctl for it and comparing the start and end times of throttling with the timestamps of the item reporting the status of the SSH service.&lt;br /&gt;
# Information gathering: Sometimes it just helps to collect some facts about the situation. What information is useful depends on the trigger, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to the underlying problem in varying levels of explicitness), or the ping response from several hosts on the route to a host, or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what&#039;s wrong.&lt;br /&gt;
# Investigative: The most rigorous process. The full process is originally described in [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Countdown - General Investigative Process]. To summarize: when you don&#039;t know why something is failing, and/or don&#039;t have any decent hypotheses to follow up on, you can follow this process to systematically find the problem.&lt;br /&gt;
&lt;br /&gt;
The resolution to an incident is usually one of two things:&lt;br /&gt;
# Fix the underlying problem.&lt;br /&gt;
# Fix the trigger itself.&lt;br /&gt;
Fixing the trigger is relatively straightforward, but do make sure to document in the thread what you changed in which trigger.&lt;br /&gt;
Fixing the underlying problem can be more complex. A trade-off sometimes needs to be made between resolving technical debt and simply patching the current system to resolve the issue. We usually look for a resolution that ensures that the problem won&#039;t re-occur soon, or makes it unexpected/unlikely for the problem to re-occur. Taking into account the timeframe that is available to resolve the incident, you can make some trade-offs. An example: normal backups of VMs are failing because the Proxmox backup server is down/unreachable, and it is determined that this cannot be resolved at that moment. We can temporarily set up automatic backups to local storage to resolve the immediate problem and keep our SLO&#039;s, versus setting up a new Proxmox Backup server at a different location. Since we don&#039;t have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.&lt;br /&gt;
&lt;br /&gt;
===== Some known issues and their resolutions =====&lt;br /&gt;
* SSH service is down: The internet is a vile place. There is constant port scanning and there are ongoing hacking attempts against any machine connected to the internet (mostly over IPv4). Because of this, SSH has a throttling mechanism built in to prevent a system from being DDoS&#039;ed by the volume of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with `journalctl -u ssh | grep &#039;MaxStartups throttling&#039;` (you probably want to select a relevant time period with `--since &amp;quot;2 hours ago&amp;quot;` or something similar to avoid processing a month of logs). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].&lt;br /&gt;
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here. What needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running, by checking that that day&#039;s backups were done for the smaller buckets. The devteam email can be checked to see if the backup process could not start on a given day because it was already running (it takes 24+ hours, and cron attempts to start it each day).&lt;br /&gt;
* git.* HTTPS is down: Mostly on Sundays, GitLab gets automatically updated, which incurs some downtime as the service is restarted. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it takes longer. If the service does not stay down, the issue can simply be resolved.&lt;br /&gt;
&lt;br /&gt;
==== Resolve incident + cleanup ====&lt;br /&gt;
When you&#039;ve executed and verified the resolution in the previous steps, we can proceed to resolve the issue in our Mattermost integration. Resolving an incident is done as follows:&lt;br /&gt;
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved if the trigger is still firing. If the trigger is still firing but you&#039;re sure that you&#039;ve resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host&#039;s configuration on Zabbix and selecting &#039;Execute Now&#039;; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to check if it was updated.&lt;br /&gt;
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify the critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.&lt;br /&gt;
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.&lt;br /&gt;
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, or the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).&lt;br /&gt;
## Ensuring the mentioned problem is no longer observable&lt;br /&gt;
## Ensuring the trigger has resolved (you might need to force an update with `Execute Now`).&lt;br /&gt;
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.&lt;br /&gt;
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
When you are done, there should be no more critical triggers firing in Zabbix, or open in the Zabbix-Mattermost integration, that no-one has taken responsibility for, or that you are responsible for but are not actively handling.&lt;br /&gt;
&lt;br /&gt;
===Additional context===&lt;br /&gt;
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical &#039;&#039;&#039;SLA - Critical&#039;&#039;&#039;].&lt;br /&gt;
* &amp;lt;s&amp;gt;When it is being tracked on GitLab a heavy check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;Responses on the thread and on GitLab are automatically synced (to some extent)&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;When you reply with &#039;&#039;&#039;I agree that this has been fully resolved&#039;&#039;&#039; eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Non-Critical incidents ==&lt;br /&gt;
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.&lt;br /&gt;
&lt;br /&gt;
=== Acknowledging ===&lt;br /&gt;
Fully acknowledging a non-critical incident requires the following tasks to have been completed:&lt;br /&gt;
* Acknowledging the incident on Zabbix, which means you take responsibility for completing the steps listed below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The next steps don&#039;t have to be done immediately, as they have dependencies, but they must be started and scheduled for completion by the next work day.&lt;br /&gt;
&lt;br /&gt;
Check if there&#039;s already an uncompleted milestone for this host with this issue in the metrics sheet.&lt;br /&gt;
If a milestone is already present:&lt;br /&gt;
* Report in the topic the Lynx project ID for resolving this issue.&lt;br /&gt;
* If the ID has already been reported in the topic, we don&#039;t want to report it again; instead, add the 🔁 emoji (:repeat:) under the Zabbix bot alert&lt;br /&gt;
&lt;br /&gt;
If a milestone is NOT already present:&lt;br /&gt;
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention&lt;br /&gt;
** Start date is the date of the incident&lt;br /&gt;
** DoD states what needs to be true for the non-critical incident to be considered resolved&lt;br /&gt;
* Add the non-critical incident to Lynx as a project&lt;br /&gt;
** Follow the naming convention below for the title &amp;amp; project ID&lt;br /&gt;
** Tasks need to be added&lt;br /&gt;
** The final task needs to have the SLO deadline set as &#039;constraint&#039;&lt;br /&gt;
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20&lt;br /&gt;
** The tasks are estimated for SP&lt;br /&gt;
* The Lynx project ID is reported in the non-critical incident&#039;s topic on Zulip, and logged in the metrics sheet&lt;br /&gt;
* An activity is created in Kimai for the non-critical incident, following the naming convention&lt;br /&gt;
&lt;br /&gt;
==== Naming convention ====&lt;br /&gt;
* Kimai activity name needs to follow the pattern: &#039;&amp;lt;YYYY-MM&amp;gt; &amp;lt;problem_title&amp;gt;&#039;. For &amp;lt;problem_title&amp;gt;, incorporate the trigger title and hostname for clarity.&lt;br /&gt;
* Milestone name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project ID needs to follow the pattern: &#039;SRE&amp;lt;YYMM&amp;gt;&amp;lt;XXX&amp;gt;&#039;, where &amp;lt;XXX&amp;gt; is a three-letter shorthand that relates to the problem/host&lt;br /&gt;
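As an illustration, the naming patterns above can be composed as in the following shell sketch; the incident values (the month, trigger title, and &#039;GIT&#039; shorthand) are made-up examples:&lt;br /&gt;

```shell
# Hypothetical incident values; substitute the real month, trigger title and host.
yyyymm='2024-12'
problem_title='git.example.net HTTPS is down'   # trigger title + hostname
kimai_activity="$yyyymm $problem_title"         # pattern: YYYY-MM followed by problem_title
milestone="Delft Solutions Hosting Incident response work $kimai_activity"
project_id='SRE2412GIT'                         # pattern: SRE, then YYMM, then XXX shorthand
echo "$kimai_activity"
echo "$milestone"
echo "$project_id"
```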
&lt;br /&gt;
== Informational incidents ==&lt;br /&gt;
* Informational incidents are acknowledged within 72 hours&lt;br /&gt;
&lt;br /&gt;
Checklist&lt;br /&gt;
# Acknowledge on Zabbix&lt;br /&gt;
# Sanity check the event, post result in thread&lt;br /&gt;
# If action needed, perform action&lt;br /&gt;
&lt;br /&gt;
== If an incident is reported through means other than the Zabbix-Zulip integration ==&lt;br /&gt;
# Acknowledge receipt.&lt;br /&gt;
# Classify the incident as critical, non-critical, or informational.&lt;br /&gt;
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.&lt;br /&gt;
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)&lt;br /&gt;
&lt;br /&gt;
== Handover ==&lt;br /&gt;
When handing over the responsibility of &#039;&#039;&#039;first responder&#039;&#039;&#039; (FR), the following needs to happen:&lt;br /&gt;
* The handover can be initiated by both the upcoming FR or the acting FR&lt;br /&gt;
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group and enables Zabbix calling for the upcoming FR if they have that configured, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;br /&gt;
* Before the handover, the acting FR must ensure that all active incidents are acknowledged (this includes emails or opened topics in SRE General, etc...), updated with the latest status, and properly documented.&lt;br /&gt;
* The upcoming FR makes sure they are aware of the state of the SLA and knows what questions they want to ask the acting FR.&lt;br /&gt;
* The upcoming FR makes sure they are subscribed to the right channels.&lt;br /&gt;
&lt;br /&gt;
The following steps can be done async or in person:&lt;br /&gt;
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).&lt;br /&gt;
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.&lt;br /&gt;
* If there are any particularities the upcoming FR needs to be aware of, those are shared.&lt;br /&gt;
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role&lt;br /&gt;
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].&lt;br /&gt;
* The upcoming FR announces/informs that they are now the acting FR over Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel]&lt;br /&gt;
* The now acting FR removes the previous FR from the IPA sla-first-responder user group and disables Zabbix calling for the previous FR if they had that enabled, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=495</id>
		<title>Incident Handling</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Incident_Handling&amp;diff=495"/>
		<updated>2024-12-17T14:38:45Z</updated>

		<summary type="html">&lt;p&gt;Alois: /* Acknowledging */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Zulip migration ==&lt;br /&gt;
Due to a migration to Zulip, the integration that was available on Mattermost is not yet available on Zulip. This leads to the following process changes:&lt;br /&gt;
* Acknowledgements and trigger resolutions are not posted to Zulip by Zabbix&lt;br /&gt;
* Triggers are grouped in a topic on Zulip per host&lt;br /&gt;
* When an incident has been fully resolved, mark the topic as resolved once any other incidents reported for the host are also resolved&lt;br /&gt;
* There&#039;s no `?ongoing`, instead for now we can track open incidents by checking for unresolved topics&lt;br /&gt;
* The posting of incidents is less smart (it only posts when not yet posted), so to prevent an incident from going unreported due to network issues or the like, a message is posted at an interval (8 hours for non-critical and lower, 1 hour for critical and above) while the incident remains unacknowledged.&lt;br /&gt;
* Incidents can be manually tracked by creating a topic by hand and reporting the problem.&lt;br /&gt;
* There is no automatic GitLab issue creation or syncing anymore.&lt;br /&gt;
&lt;br /&gt;
Finally, where this process says to do something on Mattermost, you should now do so on Zulip. The updates in the process chapters themselves are WIP.&lt;br /&gt;
&lt;br /&gt;
== Critical incidents ==&lt;br /&gt;
&#039;&#039;&#039;Critical incidents are resolved within 16 hours.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
As first responder you take on the responsibility of seeing an incident resolved. This does not mean that you are the person required to do all the work. You can attempt to involve others to help you (often referred to as escalating the incident), but since others are not on call, they are not obliged to help you, especially outside of normal working hours.&lt;br /&gt;
Involving multiple people can quickly be required if multiple critical incidents with different causes occur simultaneously. In that case, the First Responder usually takes on a more information-management role and steers those that are brought on into resolving the issues. (Example: if a server crashes, several critical triggers can fire, but the underlying cause can quickly be determined to be a single issue, the crashed server, so you wouldn&#039;t need to call in people to manage each incident. But a client&#039;s service being down in one cluster while in a different cluster a different VM no longer boots is likely to be two different issues, so you&#039;d want to call in help to resolve the incidents in time.)&lt;br /&gt;
&lt;br /&gt;
=== Process ===&lt;br /&gt;
The general process is made up of the following steps. Each step has additional information on how to handle/execute it in the sections below.&lt;br /&gt;
# Take responsibility for seeing the incident resolved&lt;br /&gt;
# Determine if incident is still ongoing&lt;br /&gt;
# If ongoing: Communicate to affected clients that the issue is being investigated&lt;br /&gt;
# Communicate plan/next steps (even if that is gathering information)&lt;br /&gt;
# Communicate findings/results of executed plan, go back to previous step if not resolved&lt;br /&gt;
# Resolve incident + cleanup&lt;br /&gt;
&lt;br /&gt;
While working on an incident it is expected that all communication is done in the incident&#039;s thread. This means all information related to a problem can be found in a clear and predictable place. Sometimes an incident can be resolved by work done in another incident. In that case, it is required to post a link to that thread in the incident&#039;s thread with the comment that the resolution is done there.&lt;br /&gt;
&lt;br /&gt;
==== Acknowledge the incident on Zabbix ====&lt;br /&gt;
The first step is to take responsibility for seeing the incident resolved by acknowledging the incident on Zabbix. Simply acknowledging the trigger suffices. It is however entirely possible that multiple critical incidents are firing at the same time. This can be a coincidence, or can be because of a shared cause of failure. For example, a server crashing will cause several VM&#039;s to reboot, or the router having a connectivity issue will lead to most other VM&#039;s having connectivity issues as well. If there are multiple critical incidents, it is advised to quickly observe what&#039;s ongoing (Zabbix is the best source of firing triggers for this) and pick the incident that is likely the root cause to tackle first.&lt;br /&gt;
&lt;br /&gt;
* Acknowledging an incident on Zabbix will stop Zabbix from calling the First Responder to notify them of the ongoing incident. And stops Zabbix from posting reminders on Zulip.&lt;br /&gt;
&lt;br /&gt;
==== Determine if incident is still ongoing ====&lt;br /&gt;
The next step is to check if the reported problem is still ongoing. Depending on the observations made here, the process to follow and the steps needed to resolve the incident can change. There are three options:&lt;br /&gt;
# The trigger resolved itself and the problem cannot be observed. Example: HTTPS is down for a site, but the FR can access the site through HTTPS without incident.&lt;br /&gt;
# The trigger resolved itself and the problem can still be observed.&lt;br /&gt;
# The trigger is still firing but the problem cannot be observed: Our triggers might not be perfect, so it could be that something else is causing one to fire. A simple example would be that Zabbix reports that the DNS for a site can&#039;t be resolved, but in reality there&#039;s a bug in the script we wrote that checks if the DNS resolves, and the DNS resolves fine. Final note: keep in mind that &#039;it works on my machine&#039; does not necessarily mean it works for most other people, so depending on the trigger you need to evaluate whether your tests suffice.&lt;br /&gt;
&lt;br /&gt;
In order to make sure you are actually trying to observe the same thing as the trigger is looking for, make sure to check the trigger definition and the current data of the associated item(s). Some triggers might fire if one of multiple conditions is met (Such as a trigger that monitors the ping response time firing if the value exceeds a certain threshold, or if no data for a certain period of time was observed).&lt;br /&gt;
&lt;br /&gt;
Make sure to report your findings in the incident&#039;s thread. It&#039;s advised to post a screenshot of the relevant item(s) and your own observations. (Continuing the ping example, you would post a screenshot of the relevant values, state your conclusion why the trigger is firing, and your own observations/pings)&lt;br /&gt;
&lt;br /&gt;
==== Communicate to affected clients ====&lt;br /&gt;
If the incident is still ongoing and the service is down, we need to communicate to affected clients that we are aware of the problem and that we are investigating it. Critical incidents usually mean the service is down, something the clients can notice/are affected by, so we want to be transparent that something is going on. There are some additional notes to this though:&lt;br /&gt;
* If an incident has already resolved itself and the problem is no longer observable, we don&#039;t communicate anything. Doing so might only cause confusion, and since the client has not reported any issues, they have not had a noticeable problem with it themselves.&lt;br /&gt;
* Although a critical incident generally means that the client service is down or experiencing reduced service, not all critical incidents are of that nature. Some are more administrative, or are only an issue for Delft Solutions itself. As of writing I don&#039;t have an exhaustive list, but here are those I can think of:&lt;br /&gt;
** SSH Service is down: We don&#039;t have any clients that SSH into their services, so it&#039;s generally not a problem. SSH is mostly used for SRE maintenance and publishing new builds. The SRE maintenance is an internal problem, so no need to communicate to the client. Publishing is done to Kaboom and the two SM VM&#039;s, so an SSH outage there prevents new builds from being published.&lt;br /&gt;
** No backup for x days: Clients don&#039;t notice if a backup is running late, so no need to communicate with clients. We just need to make sure the backup gets completed.&lt;br /&gt;
** SSL certificate is expiring in &amp;lt; 24 hours: This depends a bit on how soon the incident is handled, but if it is handled quickly, the certificate never actually expires and there has been no disruption to the client&#039;s service, so there is no need to communicate about it.&lt;br /&gt;
* Determining which clients are being affected can be done by looking at the host&#039;s DNS in the trigger, and/or looking up the VM in Proxmox and checking the tags of the VM&#039;s for client names. In the case that this issue is causing multiple other critical triggers to fire, you would have to check for which clients are affected by those incidents.&lt;br /&gt;
* Communicating to DS about ongoing incidents is usually assumed to have been done automatically by the fact that the incident was reported on Zulip.&lt;br /&gt;
&lt;br /&gt;
As always, report the decisions taken and actions made in the incident thread. (e.g.: I&#039;ve sent a message in Slack to let Kaboom know that we are aware of problem x, and that we are investigating it)&lt;br /&gt;
&lt;br /&gt;
==== Communicate plan/next steps + Communicate findings/results of executed plan ====&lt;br /&gt;
This is the main part of handling an incident. There are several actions you can take in these steps, but at the basis they consist of sharing your next steps, performing those, and reporting the results. The reason all this needs to be reported is to ensure that all known information about a problem is logged, making it easier for someone else to be onboarded into the issue, for later reference if a similar issue is encountered, and even for use during the incident itself in case an older configuration needs to be referenced after you changed it.&lt;br /&gt;
The objective of these steps is determining what is actually wrong and how to resolve it. Depending on the observations made earlier on whether the incident is still ongoing and is (still) observable, your investigation can go in different directions. (e.g. finding the underlying cause for a trigger, or determining why the trigger is firing while it likely shouldn&#039;t, and then how to resolve that underlying cause or how to update the trigger to work better)&lt;br /&gt;
&lt;br /&gt;
There are three main types of steps defined, but you are not limited to these:&lt;br /&gt;
# Hypothesis: If you have an idea what could be causing it, you would state your hypothesis and your next step would be to prove that hypothesis. For example, for an incident &#039;SSH service is down on X&#039; your hypothesis could be that this is due to &#039;MaxStartups&#039; throttling, which can be proven by &#039;grep&#039;ing journalctl for that, and compare the start and end times of throttling with the timestamps of the item reporting the status of the SSH service.&lt;br /&gt;
# Information gathering: Sometimes it just helps to get some facts about the situation collected. What information is useful depends on the trigger, but some examples are: the syslog/journalctl of the host from around the time of the incident (it can contain a reference to an underlying problem with various levels of explicitness), or the ping response from several hosts on the route to a host or a traceroute (this helps with networking issues). The gathered information is usually intended to help you come up with a hypothesis on what&#039;s wrong.&lt;br /&gt;
# Investigative: The most rigorous process. The full process is originally described here: [https://docs.google.com/document/d/1AQYJM1Q9l2Tyk6zfCVaQ2aEq-dpbfUH5okE88bpKkhw/edit#heading=h.5fq2skijqbdc Drive - Final Coundown - General Investigative Process]. To summarize, when you don&#039;t know why something is failing, and/or don&#039;t have any decent hypotheses to follow up, you can follow this process to systematically find the problem.&lt;br /&gt;
&lt;br /&gt;
Regarding the resolution to an incident: The resolution to any incident is usually one of two things:&lt;br /&gt;
# Fix the underlying problem.&lt;br /&gt;
# Fix the trigger itself.&lt;br /&gt;
Fixing the trigger is relatively straightforward, but do make sure to document in the thread what you changed in which trigger.&lt;br /&gt;
Fixing the underlying problem can be more complex. A trade-off sometimes needs to be made between resolving technical debt and simply patching the current system to resolve the issue. We usually look for a resolution that ensures the problem won&#039;t re-occur soon, or makes it unlikely to re-occur. Taking into account the timeframe available to resolve the incident, you can make some trade-offs. An example: normal backups of VM&#039;s are failing because the Proxmox Backup Server is down/unreachable, and it is determined that this cannot be resolved at that moment. We can set up automatic backups to local storage temporarily to resolve the immediate problem and ensure we keep our SLO&#039;s, versus setting up a new Proxmox Backup Server at a different location. Since we don&#039;t have much time to resolve the problem, the resolution would be to set up the automatic backups to local storage, and set up a new Proxmox Backup Server later as a separate issue.&lt;br /&gt;
&lt;br /&gt;
Some known issues and their resolutions:&lt;br /&gt;
* SSH service is down: The internet is a vile place. There&#039;s constant port scanning and there are hacking attempts ongoing against any machine connected to the internet (mostly IPv4). Due to this, SSH has a throttling functionality built in to prevent a system from being DDOS&#039;ed by the amount of malicious SSH requests. This throttling can cause the Zabbix server to be denied an SSH connection, and several such failures fire this trigger. This hypothesis can be proven with `journalctl -u ssh | grep &#039;MaxStartupsThrottling&#039;` (you probably want to select a relevant time period with `--since &amp;quot;2 hours ago&amp;quot;` or similar to avoid processing a month of logging). You can then compare the throttling start and end times with the timestamps of the item data itself. The resolution for the issue is to add our custom SSH configuration [https://chat.dsinternal.net/#narrow/stream/23-SRE---General/topic/DS.20Whitelisted.20Custom.20SSH.20configuration/near/1620 Custom SSH Configuration].&lt;br /&gt;
* No backup for 3 days: Our S3 backup is very slow. There is not much to prove as an underlying issue here; what needs to be done is to check that the backup process is ongoing. The Zabbix latest data can be checked to verify that backups are running, by checking that that day&#039;s backups were done for the smaller buckets. The devteam email can be checked for whether the backup process could not start on a given day because it was already running (it takes 24+ hours, and an attempt to start it is made each day by cron).&lt;br /&gt;
* git.* HTTPS is down: Mostly on Sundays, GitLab gets automatically updated, which incurs some downtime as the service is restarted. This is usually short enough not to be reported to Zulip as per our settings, but sometimes it&#039;s longer. If the service does not stay down, the incident can simply be resolved.&lt;br /&gt;
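The throttling check for the &#039;SSH service is down&#039; issue above can be sketched as a small shell snippet; the sample journal lines below (and the exact sshd message wording) are hypothetical:&lt;br /&gt;

```shell
# On the affected host you would normally capture the journal, e.g.:
#   journalctl -u ssh --since "2 hours ago"
# Here hypothetical sample output is used so the filtering step is illustrated.
log='Feb 12 10:01:02 host sshd[71]: drop connection #11 from [192.0.2.10]:50022 past MaxStartups
Feb 12 10:05:09 host sshd[71]: beginning MaxStartups throttling'
# Count the throttling-related lines; compare their timestamps with the
# timestamps of the Zabbix item data to confirm the hypothesis.
echo "$log" | grep -c 'MaxStartups'
```

If the count is zero for the incident window, the throttling hypothesis is likely wrong and another cause should be investigated.&lt;br /&gt;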
&lt;br /&gt;
==== Resolve incident + cleanup ====&lt;br /&gt;
When you&#039;ve executed and verified the resolution in the previous steps, we can proceed to resolve the issue in our Mattermost integration. Resolving an incident is done as follows:&lt;br /&gt;
# Verify that the trigger is no longer firing. An incident will be immediately re-opened if the trigger is still firing, and the incident cannot be considered resolved while that is the case. If the trigger is still firing but you&#039;re sure that you&#039;ve resolved the problem, you might need to force the item the trigger depends on to update. This can be done by finding the item in the host&#039;s configuration on Zabbix and selecting &#039;Execute Now&#039;; after a short period this should force Zabbix to re-execute the item. You can check the timestamps in the latest data of an item to confirm it was updated.&lt;br /&gt;
# Close the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
Unfortunately, some problems cause multiple critical and non-critical triggers to fire. This means we have to check Zabbix and Zulip for other fired triggers and ongoing incidents. The goal is to identify critical and non-critical incidents that were caused by the incident/underlying issue you just resolved.&lt;br /&gt;
# First, these incidents need to be acknowledged on Zabbix, and in the acknowledgement message you mention the incident/problem that caused this.&lt;br /&gt;
# Next, check the incidents tracked by the integration on Mattermost using the `?ongoing` command. Resolve incidents that were (re-)opened by this incident by executing the following steps. If the first two fail (the problem still persists, or the trigger is still firing), the incident needs to be considered its own issue and the relevant process needs to be followed (critical or non-critical depending on criticality).&lt;br /&gt;
## Ensuring the mentioned problem is no longer observable&lt;br /&gt;
## Ensuring the trigger has resolved (you might need to force an update with `Execute Now`).&lt;br /&gt;
## Posting a link to the main incident you resolved with the comment that the underlying problem was resolved in that topic.&lt;br /&gt;
## Closing the incident by marking the topic as resolved, when there are no other triggers firing for the host.&lt;br /&gt;
&lt;br /&gt;
When you are done, there should be no more critical triggers firing in Zabbix, or open in the Zabbix-Mattermost integration, that no-one has taken responsibility for, or that you are responsible for but are not actively handling.&lt;br /&gt;
&lt;br /&gt;
===Additional context===&lt;br /&gt;
* Critical incidents are posted in [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical &#039;&#039;&#039;SLA - Critical&#039;&#039;&#039;].&lt;br /&gt;
* &amp;lt;s&amp;gt;When it is being tracked on GitLab a heavy check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;Responses on the thread and on GitLab are automatically synced (to some extent)&amp;lt;/s&amp;gt;&lt;br /&gt;
* &amp;lt;s&amp;gt;When you reply with &#039;&#039;&#039;I agree that this has been fully resolved&#039;&#039;&#039; eventually our Zabbix-Mattermost integration will pick this up and a green check mark is added to the message.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Non-Critical incidents ==&lt;br /&gt;
* Non-critical incidents are acknowledged within 9 hours and resolved within one week.&lt;br /&gt;
&lt;br /&gt;
=== Acknowledging ===&lt;br /&gt;
Fully acknowledging a non-critical incident requires the following tasks to have been completed:&lt;br /&gt;
* Acknowledging the incident on Zabbix, which means you take responsibility for completing the steps listed below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The next steps don&#039;t have to be done immediately, as they have dependencies, but they must be started and scheduled for completion by the next work day.&lt;br /&gt;
&lt;br /&gt;
Check if there&#039;s already an uncompleted milestone for this host with this issue in the metrics sheet.&lt;br /&gt;
If a milestone is already present:&lt;br /&gt;
* Report in the topic the Lynx project ID for resolving this issue.&lt;br /&gt;
* If the ID has already been reported in the topic, we don&#039;t want to report it again; instead, add the 🔁 emoji (:repeat:) under the Zabbix bot alert&lt;br /&gt;
&lt;br /&gt;
If a milestone is NOT already present:&lt;br /&gt;
* Add the non-critical incident as a milestone in the metrics sheet, following the naming convention&lt;br /&gt;
** Start date is the date of the incident&lt;br /&gt;
** DoD states what needs to be true for the non-critical incident to be considered resolved&lt;br /&gt;
* Add the non-critical incident to Lynx as a project&lt;br /&gt;
** Follow the naming convention below for the title &amp;amp; project ID&lt;br /&gt;
** Tasks need to be added&lt;br /&gt;
** The final task needs to have the SLO deadline set as &#039;constraint&#039;&lt;br /&gt;
** Project priority is set to 99 while not estimated yet. After the estimation is done, the priority should be set to 20&lt;br /&gt;
** The tasks are estimated for SP&lt;br /&gt;
* The Lynx project ID is reported in the non-critical incident&#039;s topic on Zulip, and logged in the metrics sheet&lt;br /&gt;
* An activity is created in Kimai for the non-critical incident, following the naming convention&lt;br /&gt;
&lt;br /&gt;
==== Naming convention ====&lt;br /&gt;
* Kimai activity name needs to follow the pattern: &#039;&amp;lt;YYYY-MM&amp;gt; &amp;lt;problem_title&amp;gt;&#039;. For &amp;lt;problem_title&amp;gt;, incorporate the trigger title and hostname for clarity.&lt;br /&gt;
* Milestone name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project name needs to follow the pattern: &#039;Delft Solutions Hosting Incident response work &amp;lt;kimai_activity_name&amp;gt;&#039;&lt;br /&gt;
* Lynx project ID needs to follow the pattern: &#039;SRE&amp;lt;YYMM&amp;gt;&amp;lt;XXX&amp;gt;&#039;, where &amp;lt;XXX&amp;gt; is a three-letter shorthand that relates to the problem/host&lt;br /&gt;
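As an illustration, the naming patterns above can be composed as in the following shell sketch; the incident values (the month, trigger title, and &#039;GIT&#039; shorthand) are made-up examples:&lt;br /&gt;

```shell
# Hypothetical incident values; substitute the real month, trigger title and host.
yyyymm='2024-12'
problem_title='git.example.net HTTPS is down'   # trigger title + hostname
kimai_activity="$yyyymm $problem_title"         # pattern: YYYY-MM followed by problem_title
milestone="Delft Solutions Hosting Incident response work $kimai_activity"
project_id='SRE2412GIT'                         # pattern: SRE, then YYMM, then XXX shorthand
echo "$kimai_activity"
echo "$milestone"
echo "$project_id"
```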
&lt;br /&gt;
== Informational incidents ==&lt;br /&gt;
* Informational incidents are acknowledged within 72 hours&lt;br /&gt;
&lt;br /&gt;
Checklist&lt;br /&gt;
# Acknowledge on Zabbix&lt;br /&gt;
# Sanity check the event, post result in thread&lt;br /&gt;
# If action needed, perform action&lt;br /&gt;
&lt;br /&gt;
== If an incident is reported through means other than the Zabbix-Zulip integration ==&lt;br /&gt;
# Acknowledge receipt.&lt;br /&gt;
# Classify the incident as critical, non-critical, or informational.&lt;br /&gt;
# Create a topic in the relevant SRE channel, stating the problem and that you are responsible for resolving it.&lt;br /&gt;
# Proceed to treat the incident according to the criticality you just classified it as. (So for a critical incident, it means you now start the critical incident handling process)&lt;br /&gt;
&lt;br /&gt;
== Handover ==&lt;br /&gt;
When handing over the responsibility of &#039;&#039;&#039;first responder&#039;&#039;&#039; (FR), the following needs to happen:&lt;br /&gt;
* The handover can be initiated by both the upcoming FR or the acting FR&lt;br /&gt;
* Acting FR adds the upcoming FR to the IPA sla-first-responder user group and enables Zabbix calling for the upcoming FR if they have that configured, by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;br /&gt;
* The upcoming FR makes sure they are aware of the state of the SLA and knows what questions they want to ask the acting FR.&lt;br /&gt;
* The upcoming FR makes sure they are subscribed to the right channels.&lt;br /&gt;
&lt;br /&gt;
The following steps can be done async or in person:&lt;br /&gt;
* The acting FR announces that the upcoming FR has been added to the sla-first-responder group (in Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel] if async).&lt;br /&gt;
* If the acting FR wants to hand over responsibility for any ongoing incident they also state which incidents they want the upcoming FR to take over.&lt;br /&gt;
* If there are any particularities the upcoming FR needs to be aware of, those are shared.&lt;br /&gt;
* The upcoming FR asks their questions until they are satisfied and able to take over the FR role.&lt;br /&gt;
* The upcoming FR ensures they are subscribed to the following channels on Zulip: [https://chat.dsinternal.net/#narrow/stream/23-SRE---General SRE - General], [https://chat.dsinternal.net/#narrow/stream/24-SRE-.23-Critical SRE # Critical] and if part of the SRE team [https://chat.dsinternal.net/#streams/4/SRE%20##%20Non-critical SRE ## Non-Critical] and [https://chat.dsinternal.net/#streams/5/SRE%20###%20Informational SRE ### Informational].&lt;br /&gt;
* The upcoming FR announces/informs that they are now the acting FR over Zulip&#039;s [https://chat.dsinternal.net/#narrow/stream/13-Organisational Organisational channel]&lt;br /&gt;
* The now acting FR removes the previous FR from the IPA sla-first-responder user group and disables Zabbix calling for the previous FR if they had that enabled by going to Zabbix &amp;gt; Configuration &amp;gt; Actions &amp;gt; [https://status.delftinfra.net/zabbix/actionconf.php?eventsource=0# Trigger actions]&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
	<entry>
		<id>https://docs.delftsolutions.nl/index.php?title=Setting_Up_Wildcard_Subdomains_with_SSL_on_a_Debian_Application&amp;diff=484</id>
		<title>Setting Up Wildcard Subdomains with SSL on a Debian Application</title>
		<link rel="alternate" type="text/html" href="https://docs.delftsolutions.nl/index.php?title=Setting_Up_Wildcard_Subdomains_with_SSL_on_a_Debian_Application&amp;diff=484"/>
		<updated>2024-10-31T14:04:30Z</updated>

		<summary type="html">&lt;p&gt;Alois: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This guide provides a step-by-step approach to setting up wildcard subdomains with SSL on a Debian-based application. Wildcard subdomains allow applications to dynamically support multiple subdomains (e.g. abc.example.com, xyz.example.com) under a single SSL certificate.&lt;br /&gt;
&lt;br /&gt;
In this guide, the focus is on configuring a Debian package to handle wildcard subdomains primarily through updates to the debian/postinst file. The guide uses the &#039;&#039;kaboom-api&#039;&#039; app as an example, with &amp;lt;code&amp;gt;staging-elearning.nl&amp;lt;/code&amp;gt; as the domain name on which we’re setting up wildcard subdomains.&lt;br /&gt;
&lt;br /&gt;
If you do not need or prefer not to modify the application code itself, you can still follow the key steps and commands described in this guide directly from the terminal. This will allow you to set up SSL for wildcard subdomains without diving into the application’s Debian packaging configuration.&lt;br /&gt;
&lt;br /&gt;
By the end of this guide, you will have a fully automated process for configuring and renewing SSL certificates for wildcard subdomains, leveraging tools like Certbot and DNS authentication.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
=== 1. DNS Configuration for Wildcard Subdomains ===&lt;br /&gt;
&lt;br /&gt;
Access your DNS repository and add the necessary A and AAAA records for the wildcard subdomain you plan to use. This typically involves adding entries like &amp;lt;code&amp;gt;*.example.com&amp;lt;/code&amp;gt; pointing to your server’s IP address.&lt;br /&gt;
&lt;br /&gt;
The following is an example of what has been done for the domain name &amp;lt;code&amp;gt;staging-elearning.nl&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;staging-elearning.nl.zone&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
$ORIGIN staging-elearning.nl.&lt;br /&gt;
$TTL 3600&lt;br /&gt;
@	IN	3600	SOA	delftsolutions.ns1.signaldomain.nl.	info.signaldomain.nl. (&lt;br /&gt;
		&amp;lt;serial&amp;gt;	; don&#039;t modify, auto incremented&lt;br /&gt;
		86400		; secondary refresh&lt;br /&gt;
		7200		; secondary retry&lt;br /&gt;
		3600000		; secondary expiry&lt;br /&gt;
		600			; negative response ttl&lt;br /&gt;
	)&lt;br /&gt;
&lt;br /&gt;
@	3600	IN	NS	ns2.signaldomain.net.&lt;br /&gt;
@	3600	IN	NS	delftsolutions.ns1.signaldomain.nl.&lt;br /&gt;
&lt;br /&gt;
*   IN  3600 A     193.5.147.172&lt;br /&gt;
*   IN  3600 AAAA  2a0c:8187:0:201::196&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 2. Required Packages ===&lt;br /&gt;
&lt;br /&gt;
* python3-certbot-dns-rfc2136&lt;br /&gt;
* openssl&lt;br /&gt;
* nginx&lt;br /&gt;
* dnsutils&lt;br /&gt;
* certbot&lt;br /&gt;
&lt;br /&gt;
== Configuring SSL and Wildcard Subdomains ==&lt;br /&gt;
&lt;br /&gt;
=== 1. Handling Environment Variables ===&lt;br /&gt;
&lt;br /&gt;
We’ll first add three environment variables to capture essential information: DNS_AUTHENTICATION, CERTBOT_EMAIL, and FQDN. These variables will be defined using Debconf, which allows us to prompt for values during installation and configuration.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;DNS_AUTHENTICATION&#039;&#039;&#039;: This string is required for Certbot’s DNS-based challenge verification. The format includes a keyname, algorithm, and secret for authentication, followed by the authoritative DNS hostname.&lt;br /&gt;
This string is currently obtained using the signaldomain-api package, by running the command:&lt;br /&gt;
&amp;lt;code&amp;gt;signaldomain-api key certbot create &amp;lt;domain_name&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The expected format is: &amp;lt;code&amp;gt;dns://&amp;lt;key_name&amp;gt;:&amp;lt;key_algorithm&amp;gt;~&amp;lt;key_secret_base64&amp;gt;@&amp;lt;authoritative_nameserver_domainname&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
example: &amp;lt;code&amp;gt;dns://staging-elearning_nl__certbot._keys.delftsolutions.signaldomain._internal.usersignal.nl.:hmac-sha256~&amp;lt;key_secret&amp;gt;@ns1.signaldomain.nl/staging-elearning_nl__certbot._keys.delftsolutions.signaldomain._internal.usersignal.nl.&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CERTBOT_EMAIL&#039;&#039;&#039;: This email address is used when registering an account with Let’s Encrypt. Important notifications about certificate issues will be sent to this address.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FQDN&#039;&#039;&#039;: This is the fully qualified domain name of the primary domain for which wildcard SSL certificates will be issued. &amp;lt;code&amp;gt;staging-elearning.nl&amp;lt;/code&amp;gt; for this guide example.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Here’s how to set these variables in &amp;lt;code&amp;gt;debian/templates&amp;lt;/code&amp;gt;, define prompts in &amp;lt;code&amp;gt;debian/config&amp;lt;/code&amp;gt;, and retrieve them in &amp;lt;code&amp;gt;debian/postinst&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In &amp;lt;code&amp;gt;debian/templates&amp;lt;/code&amp;gt;, add the following entries to create prompt templates for each variable:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
…&lt;br /&gt;
Template: &amp;lt;PKG_NAME&amp;gt;/DNS_AUTHENTICATION&lt;br /&gt;
Type: string&lt;br /&gt;
Default:&lt;br /&gt;
Description: DNS authentication string in the following format: dns://&amp;lt;key_name&amp;gt;:&amp;lt;key_algorithm&amp;gt;~&amp;lt;key_secret_base64&amp;gt;@&amp;lt;authoritative_nameserver_domainname&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Template: &amp;lt;PKG_NAME&amp;gt;/CERTBOT_EMAIL&lt;br /&gt;
Type: string&lt;br /&gt;
Default:&lt;br /&gt;
Description: Enter the email that certificate issues should be reported to. Entering this will result in accepting the Let&#039;s Encrypt terms and conditions.&lt;br /&gt;
&lt;br /&gt;
Template: &amp;lt;PKG_NAME&amp;gt;/FQDN&lt;br /&gt;
Type: string&lt;br /&gt;
Default:&lt;br /&gt;
Description: Enter the fully qualified domain name for ...&lt;br /&gt;
…&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;lt;PKG_NAME&amp;gt;&amp;lt;/code&amp;gt; is the name of your Debian package, &amp;lt;code&amp;gt;kaboom-api&amp;lt;/code&amp;gt; in our case. &lt;br /&gt;
&lt;br /&gt;
Add the following lines to your &amp;lt;code&amp;gt;debian/config&amp;lt;/code&amp;gt; file to prompt for these variables during configuration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
…&lt;br /&gt;
db_input medium &amp;lt;PKG_NAME&amp;gt;/CERTBOT_EMAIL || true&lt;br /&gt;
db_input medium &amp;lt;PKG_NAME&amp;gt;/DNS_AUTHENTICATION || true&lt;br /&gt;
db_input medium &amp;lt;PKG_NAME&amp;gt;/FQDN || true&lt;br /&gt;
…&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In &amp;lt;code&amp;gt;debian/postinst&amp;lt;/code&amp;gt;, retrieve the stored values with the following lines:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
…&lt;br /&gt;
db_get &amp;lt;PKG_NAME&amp;gt;/DNS_AUTHENTICATION&lt;br /&gt;
DNS_AUTHENTICATION=&amp;quot;$RET&amp;quot;&lt;br /&gt;
&lt;br /&gt;
db_get &amp;lt;PKG_NAME&amp;gt;/CERTBOT_EMAIL&lt;br /&gt;
CERTBOT_EMAIL=&amp;quot;$RET&amp;quot;&lt;br /&gt;
&lt;br /&gt;
db_get &amp;lt;PKG_NAME&amp;gt;/FQDN&lt;br /&gt;
FQDN=&amp;quot;$RET&amp;quot;&lt;br /&gt;
…&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
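Note that db_input and db_get only become available after the debconf shell library has been sourced, so both debian/config and debian/postinst need a preamble like the following (this is the standard debconf location; shown here as a minimal sketch):

```bash
#!/bin/sh
set -e

# Make the debconf db_* helpers (db_input, db_get, ...) available.
. /usr/share/debconf/confmodule
```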
&lt;br /&gt;
&lt;br /&gt;
These values can be set or updated once the whole configuration is in place, by running the following command:&lt;br /&gt;
&amp;lt;code&amp;gt;sudo dpkg-reconfigure &amp;lt;PKG_NAME&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== 2. Automating SSL and Wildcard Domain Setup in postinst ===&lt;br /&gt;
Here we will break down, concern by concern, how to configure the &amp;lt;code&amp;gt;debian/postinst&amp;lt;/code&amp;gt; file.&lt;br /&gt;
&lt;br /&gt;
==== a. Creating the dns-auth.conf File ====&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;dns-auth.conf&amp;lt;/code&amp;gt; file will be generated from the DNS_AUTHENTICATION variable, which contains the details for Certbot’s DNS challenge configuration. Add the following to the &amp;lt;code&amp;gt;debian/postinst&amp;lt;/code&amp;gt; file to create this file:&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
dns_hostname_path=&amp;quot;$(cut -d&#039;@&#039; -f2- &amp;lt;&amp;lt;&amp;lt;&amp;quot;$DNS_AUTHENTICATION&amp;quot;)&amp;quot;&lt;br /&gt;
dns_schema_auth=&amp;quot;$(cut -d&#039;@&#039; -f1 &amp;lt;&amp;lt;&amp;lt;&amp;quot;$DNS_AUTHENTICATION&amp;quot;)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
dns_hostname=&amp;quot;$(cut -d&#039;/&#039; -f1 &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_hostname_path&amp;quot;)&amp;quot;&lt;br /&gt;
dns_auth=&amp;quot;$(cut -d&#039;/&#039; -f3- &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_schema_auth&amp;quot;)&amp;quot;&lt;br /&gt;
dns_auth_keyname=&amp;quot;$(cut -d&#039;:&#039; -f1 &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_auth&amp;quot;)&amp;quot;&lt;br /&gt;
dns_auth_algorithm=&amp;quot;$(cut -d&#039;:&#039; -f2- &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_auth&amp;quot; | cut -d&#039;~&#039; -f1 | tr &#039;[:lower:]&#039; &#039;[:upper:]&#039;)&amp;quot;&lt;br /&gt;
dns_auth_secret=&amp;quot;$(cut -d&#039;:&#039; -f2- &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_auth&amp;quot; | cut -d&#039;~&#039; -f2-)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
dns_host_aaaa=&amp;quot;$(dig +short AAAA &amp;quot;$dns_hostname&amp;quot; | head -n1)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
[ -d /etc/&amp;lt;DNS_CONF_DIR&amp;gt; ] || mkdir -p /etc/&amp;lt;DNS_CONF_DIR&amp;gt;&lt;br /&gt;
&lt;br /&gt;
umask 266&lt;br /&gt;
cat &amp;gt; /etc/&amp;lt;DNS_CONF_DIR&amp;gt;/dns-auth.conf &amp;lt;&amp;lt;CONF&lt;br /&gt;
# Managed by apt, please use dpkg-reconfigure &amp;lt;PKG_NAME&amp;gt; to modify&lt;br /&gt;
dns_rfc2136_server = $dns_host_aaaa&lt;br /&gt;
dns_rfc2136_port = 53&lt;br /&gt;
dns_rfc2136_name = $dns_auth_keyname&lt;br /&gt;
dns_rfc2136_secret = $dns_auth_secret&lt;br /&gt;
dns_rfc2136_algorithm = $dns_auth_algorithm&lt;br /&gt;
CONF&lt;br /&gt;
umask 022&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This configuration file will be used by Certbot to authenticate and verify domain ownership via DNS challenges.&lt;br /&gt;
In the case of our guide with the kaboom-api example, &amp;lt;code&amp;gt;&amp;lt;DNS_CONF_DIR&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;kaboom&amp;lt;/code&amp;gt;, it&#039;s up to you to select the right naming for your case.&lt;br /&gt;
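To see how the cut pipeline above decomposes the authentication string, here is a self-contained sketch run against a placeholder string (the key name, algorithm, and secret are made up). The original postinst uses bash here-strings (&amp;lt;&amp;lt;&amp;lt;); plain pipes are used here so the sketch also runs under POSIX sh:

```shell
# Placeholder DNS_AUTHENTICATION value; every component is a made-up example.
DNS_AUTHENTICATION='dns://example._keys.example.nl.:hmac-sha256~c2VjcmV0@ns1.example.nl/example._keys.example.nl.'

# Split on '@': left side holds scheme + key material, right side the nameserver path.
dns_hostname_path="$(printf '%s' "$DNS_AUTHENTICATION" | cut -d'@' -f2-)"
dns_schema_auth="$(printf '%s' "$DNS_AUTHENTICATION" | cut -d'@' -f1)"

dns_hostname="$(printf '%s' "$dns_hostname_path" | cut -d'/' -f1)"
dns_auth="$(printf '%s' "$dns_schema_auth" | cut -d'/' -f3-)"
dns_auth_keyname="$(printf '%s' "$dns_auth" | cut -d':' -f1)"
dns_auth_algorithm="$(printf '%s' "$dns_auth" | cut -d':' -f2- | cut -d'~' -f1 | tr '[:lower:]' '[:upper:]')"
dns_auth_secret="$(printf '%s' "$dns_auth" | cut -d':' -f2- | cut -d'~' -f2-)"

echo "server:    $dns_hostname"        # ns1.example.nl
echo "key name:  $dns_auth_keyname"    # example._keys.example.nl.
echo "algorithm: $dns_auth_algorithm"  # HMAC-SHA256
echo "secret:    $dns_auth_secret"     # c2VjcmV0
```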
&lt;br /&gt;
Once the script has been executed, the &amp;lt;code&amp;gt;dns-auth.conf&amp;lt;/code&amp;gt; file should look something like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
dns_rfc2136_server = 2a0c:8187::120&lt;br /&gt;
dns_rfc2136_port = 53&lt;br /&gt;
dns_rfc2136_name = staging-elearning_nl__certbot._keys.delftsolutions.signaldomain._internal.usersignal.nl.&lt;br /&gt;
dns_rfc2136_secret = &amp;lt;secret-key&amp;gt;&lt;br /&gt;
dns_rfc2136_algorithm = HMAC-SHA256 &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Make sure that the proper letter case is observed; incorrect casing will cause the script to fail with unclear error messages.&lt;br /&gt;
&lt;br /&gt;
==== b. Setting Up Certbot and Requesting Certificates ====&lt;br /&gt;
&lt;br /&gt;
To handle SSL certificates, Certbot needs to register an account (if not already registered) and request a certificate for the primary domain and wildcard subdomain. Add the following to postinst to check and register Certbot, then request the certificate:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
certbot_account_count=&amp;quot;$(find /etc/letsencrypt/accounts/acme-v02.api.letsencrypt.org/directory/ -maxdepth 1 -mindepth 1 | wc -l)&amp;quot;&lt;br /&gt;
if [ &amp;quot;z$certbot_account_count&amp;quot; = &amp;quot;z0&amp;quot; ]; then&lt;br /&gt;
    certbot register --non-interactive --email &amp;quot;$CERTBOT_EMAIL&amp;quot; --no-eff-email --agree-tos&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
[ -f &amp;quot;/etc/letsencrypt/live/&amp;lt;CERT_NAME&amp;gt;/fullchain.pem&amp;quot; ] || certbot certonly --non-interactive --cert-name &amp;lt;CERT_NAME&amp;gt; --dns-rfc2136 --dns-rfc2136-credentials /etc/&amp;lt;DNS_CONF_DIR&amp;gt;/dns-auth.conf --domain &amp;quot;$FQDN&amp;quot; --domain &amp;quot;*.$FQDN&amp;quot; --deploy-hook /usr/share/&amp;lt;PKG_NAME&amp;gt;/bin/cert-deploy&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This block registers Certbot, checks for an existing certificate, and if none exists, requests a new certificate using DNS authentication with the specified dns-auth.conf file. The --deploy-hook option calls the cert-deploy file after each certificate issuance or renewal. We will create the cert-deploy in &#039;&#039;&#039;step f&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
In the case of our guide with the kaboom-api example, &amp;lt;code&amp;gt;&amp;lt;CERT_NAME&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;kaboom-elearning&amp;lt;/code&amp;gt;, again it&#039;s up to you to select the right naming for your case.&lt;br /&gt;
&lt;br /&gt;
==== c. Generating Diffie-Hellman Parameters for SSL ====&lt;br /&gt;
&lt;br /&gt;
Diffie-Hellman parameters enhance SSL security. To ensure this file exists, add the following to &amp;lt;code&amp;gt;debian/postinst&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
[ -f &amp;quot;/etc/&amp;lt;DNS_CONF_DIR&amp;gt;/ssl-dhparams.pem&amp;quot; ] || openssl dhparam -out /etc/&amp;lt;DNS_CONF_DIR&amp;gt;/ssl-dhparams.pem 2048&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This code checks for an existing &amp;lt;code&amp;gt;ssl-dhparams.pem&amp;lt;/code&amp;gt; file and generates a 2048-bit parameter file if it doesn&#039;t exist.&lt;br /&gt;
&lt;br /&gt;
==== d. Configuring Nginx for Wildcard SSL ====&lt;br /&gt;
&lt;br /&gt;
Finally, configure Nginx to handle requests for the wildcard domain and apply SSL settings. Here’s the code to create a new Nginx server block for the wildcard domain:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat &amp;gt;/etc/nginx/sites-available/&amp;lt;CERT_NAME&amp;gt; &amp;lt;&amp;lt;CONF&lt;br /&gt;
server {&lt;br /&gt;
	root /usr/share/&amp;lt;PKG_NAME&amp;gt;/public;&lt;br /&gt;
	server_name *.$FQDN;&lt;br /&gt;
&lt;br /&gt;
	location / {&lt;br /&gt;
    	proxy_pass http://&amp;lt;APP_UPSTREAM&amp;gt;/;&lt;br /&gt;
    	proxy_set_header Host \$host;&lt;br /&gt;
    	proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;&lt;br /&gt;
    	proxy_set_header X-Forwarded-Proto \$scheme;&lt;br /&gt;
    	proxy_buffers 8 32k;&lt;br /&gt;
    	proxy_buffer_size 64k;&lt;br /&gt;
    	client_max_body_size 0;&lt;br /&gt;
    	proxy_redirect off;&lt;br /&gt;
    	proxy_buffering off;&lt;br /&gt;
	}&lt;br /&gt;
&lt;br /&gt;
	listen [::]:443 ssl http2;&lt;br /&gt;
	listen 443 ssl http2;&lt;br /&gt;
&lt;br /&gt;
	ssl_certificate /etc/letsencrypt/live/&amp;lt;CERT_NAME&amp;gt;/fullchain.pem;&lt;br /&gt;
	ssl_certificate_key /etc/letsencrypt/live/&amp;lt;CERT_NAME&amp;gt;/privkey.pem;&lt;br /&gt;
	ssl_dhparam /etc/&amp;lt;DNS_CONF_DIR&amp;gt;/ssl-dhparams.pem;&lt;br /&gt;
&lt;br /&gt;
	ssl_session_cache shared:le_nginx_SSL:10m;&lt;br /&gt;
	ssl_session_timeout 1440m;&lt;br /&gt;
	ssl_session_tickets off;&lt;br /&gt;
&lt;br /&gt;
	ssl_protocols TLSv1.2 TLSv1.3;&lt;br /&gt;
	ssl_prefer_server_ciphers off;&lt;br /&gt;
	ssl_ciphers &amp;quot;ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384&amp;quot;;&lt;br /&gt;
}&lt;br /&gt;
CONF&lt;br /&gt;
&lt;br /&gt;
[ -L /etc/nginx/sites-enabled/&amp;lt;CERT_NAME&amp;gt; ] || ln -s /etc/nginx/sites-available/&amp;lt;CERT_NAME&amp;gt; /etc/nginx/sites-enabled&lt;br /&gt;
&lt;br /&gt;
nginx -q -t &amp;amp;&amp;amp; service nginx reload&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the case of our guide with the kaboom-api example, &amp;lt;code&amp;gt;&amp;lt;APP_UPSTREAM&amp;gt;&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;kaboom_api&amp;lt;/code&amp;gt;; it refers to the backend application server(s) receiving proxied requests.&lt;br /&gt;
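The proxy_pass target must match an upstream block defined elsewhere in your Nginx configuration. A minimal sketch, assuming the application listens on a local port (the address and port below are placeholders, not part of the original setup):

```nginx
# Hypothetical upstream; point it at wherever your application actually listens.
upstream kaboom_api {
    server 127.0.0.1:3000;
}
```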
&lt;br /&gt;
==== e. Wrapping everything in an if statement ====&lt;br /&gt;
&lt;br /&gt;
You do not want to run that part of the postinst script if you do not have the DNS_AUTHENTICATION, CERTBOT_EMAIL, and FQDN variables set.&lt;br /&gt;
&lt;br /&gt;
This is why we’ll wrap everything we just covered inside an if statement:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
if [ -n &amp;quot;$DNS_AUTHENTICATION&amp;quot; ] &amp;amp;&amp;amp; [ -n &amp;quot;$CERTBOT_EMAIL&amp;quot; ] &amp;amp;&amp;amp; [ -n &amp;quot;$FQDN&amp;quot; ] ; then&lt;br /&gt;
  #Everything we just wrote&lt;br /&gt;
else&lt;br /&gt;
  echo &amp;quot;one or more of DNS_AUTHENTICATION, CERTBOT_EMAIL, FQDN are missing, skipping wildcard subdomains SSL certificate setup.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is what the finished code looks like in the kaboom-api example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
if [ -n &amp;quot;$DNS_AUTHENTICATION&amp;quot; ] &amp;amp;&amp;amp; [ -n &amp;quot;$CERTBOT_EMAIL&amp;quot; ] &amp;amp;&amp;amp; [ -n &amp;quot;$FQDN&amp;quot; ] ; then&lt;br /&gt;
&lt;br /&gt;
	dns_hostname_path=&amp;quot;$(cut -d&#039;@&#039; -f2- &amp;lt;&amp;lt;&amp;lt;&amp;quot;$DNS_AUTHENTICATION&amp;quot;)&amp;quot;&lt;br /&gt;
	dns_schema_auth=&amp;quot;$(cut -d&#039;@&#039; -f1 &amp;lt;&amp;lt;&amp;lt;&amp;quot;$DNS_AUTHENTICATION&amp;quot;)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
	dns_hostname=&amp;quot;$(cut -d&#039;/&#039; -f1 &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_hostname_path&amp;quot;)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
	dns_auth=&amp;quot;$(cut -d&#039;/&#039; -f3- &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_schema_auth&amp;quot;)&amp;quot;&lt;br /&gt;
	dns_auth_keyname=&amp;quot;$(cut -d&#039;:&#039; -f1 &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_auth&amp;quot;)&amp;quot;&lt;br /&gt;
	dns_auth_algorithm=&amp;quot;$(cut -d&#039;:&#039; -f2- &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_auth&amp;quot; | cut -d&#039;~&#039; -f1 | tr &#039;[:lower:]&#039; &#039;[:upper:]&#039;)&amp;quot;&lt;br /&gt;
	dns_auth_secret=&amp;quot;$(cut -d&#039;:&#039; -f2- &amp;lt;&amp;lt;&amp;lt;&amp;quot;$dns_auth&amp;quot; | cut -d&#039;~&#039; -f2-)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
	dns_host_aaaa=&amp;quot;$(dig +short AAAA &amp;quot;$dns_hostname&amp;quot; | head -n1)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
	[ -d /etc/kaboom ] || mkdir -p /etc/kaboom&lt;br /&gt;
&lt;br /&gt;
	umask 266&lt;br /&gt;
	cat &amp;gt;/etc/kaboom/dns-auth.conf &amp;lt;&amp;lt;CONF&lt;br /&gt;
# Managed by apt, please use dpkg-reconfigure kaboom-api to modify&lt;br /&gt;
dns_rfc2136_server = $dns_host_aaaa&lt;br /&gt;
dns_rfc2136_port = 53&lt;br /&gt;
dns_rfc2136_name = $dns_auth_keyname&lt;br /&gt;
dns_rfc2136_secret = $dns_auth_secret&lt;br /&gt;
dns_rfc2136_algorithm = $dns_auth_algorithm&lt;br /&gt;
CONF&lt;br /&gt;
	umask 022&lt;br /&gt;
&lt;br /&gt;
	certbot_account_count=&amp;quot;$(find /etc/letsencrypt/accounts/acme-v02.api.letsencrypt.org/directory/ -maxdepth 1 -mindepth 1 | wc -l)&amp;quot;&lt;br /&gt;
	if [ &amp;quot;z$certbot_account_count&amp;quot; = &amp;quot;z0&amp;quot; ]; then&lt;br /&gt;
    	certbot register --non-interactive --email &amp;quot;$CERTBOT_EMAIL&amp;quot; --no-eff-email --agree-tos&lt;br /&gt;
	fi&lt;br /&gt;
&lt;br /&gt;
	echo &amp;quot;Checking if SSL certificate already exists&amp;quot;&lt;br /&gt;
	if [ ! -f &amp;quot;/etc/letsencrypt/live/kaboom-elearning/fullchain.pem&amp;quot; ]; then&lt;br /&gt;
            	echo &amp;quot;Requesting new certificate for $FQDN and *.$FQDN&amp;quot;&lt;br /&gt;
            	certbot certonly --non-interactive --cert-name kaboom-elearning --dns-rfc2136 --dns-rfc2136-credentials /etc/kaboom/dns-auth.conf --domain &amp;quot;$FQDN&amp;quot; --domain &amp;quot;*.$FQDN&amp;quot; --deploy-hook /usr/share/kaboom-api/bin/cert-deploy&lt;br /&gt;
    	if [ $? -eq 0 ]; then&lt;br /&gt;
        	echo &amp;quot;Certificate obtained successfully&amp;quot;&lt;br /&gt;
    	else&lt;br /&gt;
        	echo &amp;quot;Error obtaining certificate&amp;quot;&lt;br /&gt;
    	fi&lt;br /&gt;
	else&lt;br /&gt;
    	echo &amp;quot;Certificate already exists&amp;quot;&lt;br /&gt;
	fi&lt;br /&gt;
&lt;br /&gt;
	echo &amp;quot;Checking if SSL DHParams file already exists&amp;quot;&lt;br /&gt;
	if [ ! -f &amp;quot;/etc/kaboom/ssl-dhparams.pem&amp;quot; ]; then&lt;br /&gt;
            	openssl dhparam -out /etc/kaboom/ssl-dhparams.pem 2048&lt;br /&gt;
    	if [ $? -eq 0 ]; then&lt;br /&gt;
        	echo &amp;quot;DHParams generated successfully&amp;quot;&lt;br /&gt;
    	else&lt;br /&gt;
        	echo &amp;quot;Error generating DHParams&amp;quot;&lt;br /&gt;
    	fi&lt;br /&gt;
	else&lt;br /&gt;
    	echo &amp;quot;DHParams file already exists&amp;quot;&lt;br /&gt;
	fi&lt;br /&gt;
&lt;br /&gt;
	cat &amp;gt;/etc/nginx/sites-available/kaboom-elearning &amp;lt;&amp;lt;CONF&lt;br /&gt;
server {&lt;br /&gt;
	root /usr/share/kaboom-api/public;&lt;br /&gt;
	server_name *.$FQDN;&lt;br /&gt;
&lt;br /&gt;
	location / {&lt;br /&gt;
        	proxy_pass http://kaboom_api/;&lt;br /&gt;
        	proxy_set_header Host \$host;&lt;br /&gt;
        	proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;&lt;br /&gt;
        	proxy_set_header X-Forwarded-Proto \$scheme;&lt;br /&gt;
        	proxy_buffers 8 32k;&lt;br /&gt;
        	proxy_buffer_size 64k;&lt;br /&gt;
        	client_max_body_size 0;&lt;br /&gt;
        	proxy_redirect off;&lt;br /&gt;
        	proxy_buffering off;&lt;br /&gt;
	}&lt;br /&gt;
&lt;br /&gt;
	listen [::]:443 ssl http2;&lt;br /&gt;
	listen 443 ssl http2;&lt;br /&gt;
&lt;br /&gt;
	ssl_certificate /etc/letsencrypt/live/kaboom-elearning/fullchain.pem;&lt;br /&gt;
	ssl_certificate_key /etc/letsencrypt/live/kaboom-elearning/privkey.pem;&lt;br /&gt;
	ssl_dhparam /etc/kaboom/ssl-dhparams.pem;&lt;br /&gt;
&lt;br /&gt;
	ssl_session_cache shared:le_nginx_SSL:10m;&lt;br /&gt;
	ssl_session_timeout 1440m;&lt;br /&gt;
	ssl_session_tickets off;&lt;br /&gt;
&lt;br /&gt;
	ssl_protocols TLSv1.2 TLSv1.3;&lt;br /&gt;
	ssl_prefer_server_ciphers off;&lt;br /&gt;
	ssl_ciphers &amp;quot;ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384&amp;quot;;&lt;br /&gt;
}&lt;br /&gt;
CONF&lt;br /&gt;
&lt;br /&gt;
	[ -L /etc/nginx/sites-enabled/kaboom-elearning ] || ln -s /etc/nginx/sites-available/kaboom-elearning /etc/nginx/sites-enabled&lt;br /&gt;
&lt;br /&gt;
	nginx -q -t &amp;amp;&amp;amp; service nginx reload&lt;br /&gt;
&lt;br /&gt;
else&lt;br /&gt;
	echo &amp;quot;one or more of DNS_AUTHENTICATION, CERTBOT_EMAIL, FQDN are missing, skipping wildcard subdomains SSL certificate setup.&amp;quot;&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== f. Creating the cert-deploy Deploy Hook ====&lt;br /&gt;
&lt;br /&gt;
The certbot command invokes the &amp;lt;code&amp;gt;cert-deploy&amp;lt;/code&amp;gt; script via the --deploy-hook flag. This script should be created in &amp;lt;code&amp;gt;/usr/share/&amp;lt;PKG_NAME&amp;gt;/bin&amp;lt;/code&amp;gt;, made executable, and runs after each certificate issuance or renewal.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
set -euo pipefail&lt;br /&gt;
&lt;br /&gt;
if [ &amp;quot;z$RENEWED_LINEAGE&amp;quot; != &amp;quot;z/etc/letsencrypt/live/&amp;lt;CERT_NAME&amp;gt;&amp;quot; ]; then&lt;br /&gt;
    echo &amp;quot;Unknown certificate renewed, ignoring&amp;quot; 1&amp;gt;&amp;amp;2&lt;br /&gt;
    exit 1&lt;br /&gt;
fi&lt;br /&gt;
&lt;br /&gt;
nginx -q -t &amp;amp;&amp;amp; service nginx reload&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== g. Final Step: Applying Configuration Changes ====&lt;br /&gt;
&lt;br /&gt;
To complete the setup and ensure all configurations take effect, set the required variables (DNS_AUTHENTICATION, CERTBOT_EMAIL, FQDN) by running the following command:&lt;br /&gt;
&amp;lt;code&amp;gt;sudo dpkg-reconfigure &amp;lt;PKG_NAME&amp;gt;&amp;lt;/code&amp;gt;&lt;/div&gt;</summary>
		<author><name>Alois</name></author>
	</entry>
</feed>