WS Proxmox node reboot: Difference between revisions

Revision as of 07:33, 1 March 2024

* If you expect to reboot every node in the cluster, reboot the node that hosts the containers last, to limit their downtime and the number of times they are rebooted.
* To update a node, run `apt update` followed by `apt full-upgrade`.
* Make sure all VMs are actually migratable before adding them to an HA group.
* If there are containers on the node you are going to reboot, you also need to create a maintenance mode that covers them (for example teamspeak or stats).
* Containers inherit the OS of their host, so you also need to handle triggers related to their OS updating, where appropriate.
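
A minimal sketch of the update step, assuming the usual Debian layout in which every installed kernel has a `/boot/vmlinuz-<version>` file. The function names (`update_node`, `newest_installed_kernel`, `kernel_reboot_needed`) are hypothetical helpers and not part of any Proxmox tooling. Comparing against `uname -r` is only a rough way to tell whether a newly installed kernel is still waiting for a reboot.

```shell
#!/bin/sh
# Refresh the package lists and apply all pending upgrades.
update_node() {
    apt update && apt full-upgrade
}

# Hypothetical helper: print the newest kernel version found in a boot
# directory (default /boot), based on the vmlinuz-<version> file names.
newest_installed_kernel() {
    boot_dir="${1:-/boot}"
    for f in "$boot_dir"/vmlinuz-*; do
        # Skip the unexpanded pattern when no kernel image is present.
        if [ -e "$f" ]; then
            printf '%s\n' "${f##*/vmlinuz-}"
        fi
    done | sort -V | tail -n 1
}

# Hypothetical helper: succeed when the running kernel (default: uname -r)
# differs from the newest installed one, i.e. a reboot would change kernels.
kernel_reboot_needed() {
    running="${1:-$(uname -r)}"
    newest="$(newest_installed_kernel "${2:-/boot}")"
    [ -n "$newest" ] && [ "$running" != "$newest" ]
}
```

After `update_node`, running `kernel_reboot_needed && echo "reboot pending"` gives a quick hint. Note that `sort -V` is a GNU extension, which is an assumption that holds on Debian-based nodes.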

== Pre-Work ==
* If a VM or container is going to incur downtime, you must let the affected parties know in advance. Ideally they should be informed the previous day.

* Check that all Ceph pools are running with at least 3/2 replication (size 3, min_size 2).
* Check that all running VMs on the node you want to reboot are in HA (if not, add them or migrate them away manually).
* Check that Ceph is healthy: no remapped PGs and no degraded data redundancy.
* Confirm that you have told the users who will be affected to expect downtime (ideally one day in advance).
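
The technical checks above could be scripted roughly as follows. This is a sketch under assumptions: it parses the plain-text output of `ceph osd pool ls detail`, `ceph health`, `qm list` and `ha-manager status`, and the helper names (`pools_below_3_2`, `health_is_ok`, `running_vms_not_in_ha`) are hypothetical. `health_is_ok` is stricter than the check described above, because it rejects any warning, not only remapped PGs or degraded redundancy.

```shell
#!/bin/sh
# Read `ceph osd pool ls detail` output on stdin and print the name of every
# pool whose size is below 3 or whose min_size is below 2.
# Pools whose values cannot be parsed are reported as well.
pools_below_3_2() {
    awk -v q="'" '$1 == "pool" {
        size = ""; min = ""
        for (i = 1; i < NF; i++) {
            if ($i == "size") size = $(i + 1)
            if ($i == "min_size") min = $(i + 1)
        }
        if (size + 0 < 3 || min + 0 < 2) {
            gsub(q, "", $3)   # strip the quotes around the pool name
            print $3
        }
    }'
}

# Succeed only when the given `ceph health` output starts with HEALTH_OK.
health_is_ok() {
    case "$1" in
        HEALTH_OK*) return 0 ;;
        *) return 1 ;;
    esac
}

# $1: file with `qm list` output, $2: file with `ha-manager status` output.
# Print the VMID of every running VM that has no "service vm:<id>" HA entry.
running_vms_not_in_ha() {
    awk 'FILENAME == ARGV[1] {
             if ($1 == "service" && $2 ~ /^vm:/) { sub(/^vm:/, "", $2); ha[$2] = 1 }
             next
         }
         $3 == "running" && !($1 in ha) { print $1 }' "$2" "$1"
}
```

On the node to be rebooted, `ceph osd pool ls detail | pools_below_3_2` and `health_is_ok "$(ceph health)"` cover the Ceph checks. For the HA check, save `qm list` and `ha-manager status` to two temporary files and pass them to `running_vms_not_in_ha`. Empty output from both listing helpers means nothing was flagged.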

* Start maintenance mode for the Proxmox node and any containers running on the node.
* Start maintenance mode for Ceph. Specify that only the trigger for the health state being in warning should be suppressed, by setting the tag `ceph_health` equals `warning`.
* Let the affected parties know that the maintenance period you told them about in the preflight checks is about to start.

* If a kernel update was done, manually execute the `Operating system` item so that the update is detected. Manually executing the two items that indicate a reboot is also useful if they were firing, to stop them and to check that no further reboots are needed.
* Acknowledge and close triggers.
* Remove maintenance modes.

* Ensure that Kaboom API is running on Screwdriver or Paloma, so that the VM gets the best performance.
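
One way to see where a VM ended up is to read its node from `ha-manager status`. The sketch below assumes the VM is HA-managed. The helper name `ha_service_node` is hypothetical, and the VMID `100` in the usage note is a placeholder, since the real VMID of the Kaboom API VM is not given here.

```shell
#!/bin/sh
# $1: VMID. Reads `ha-manager status` output on stdin and prints the node
# named in the matching "service vm:<VMID> (<node>, <state>)" line.
ha_service_node() {
    awk -v sid="vm:$1" '$1 == "service" && $2 == sid {
        gsub(/[(,]/, "", $3)   # "(node," becomes "node"
        print $3
    }'
}
```

For example, `ha-manager status | ha_service_node 100` prints the node name, which can then be compared against Screwdriver and Paloma.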
