WS Proxmox node reboot: Difference between revisions

Revision as of 06:58, 27 February 2024

If you expect to reboot every node in the cluster, reboot the node with the containers last, to limit the downtime and the number of reboots for them.
Updating a node: `apt update` and `apt full-upgrade`

Check all Ceph pools are running on at least 3/2 replication
Check that all running VMs on the node you want to reboot are in HA (if not, add them or migrate them away manually)
Check that Ceph is healthy: no remapped PGs and no degraded data redundancy
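
The replication check above can be scripted. The following is a minimal sketch, not part of the documented procedure. It assumes that "3/2" means a pool `size` of 3 and a `min_size` of 2, and that `ceph osd pool ls detail` prints one line per pool of the form `pool 1 'name' replicated size 3 min_size 2 ...`. The function name `check_pool_replication` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: flags Ceph pools that are below 3/2 replication.

# Reads `ceph osd pool ls detail` output on stdin and prints every pool whose
# size is below 3 or whose min_size is below 2. Returns non-zero if any pool
# is flagged. The line format is an assumption; verify it on your cluster.
check_pool_replication() {
  awk '
    $1 == "pool" {
      size = ""; minsz = ""
      for (i = 1; i < NF; i++) {
        if ($i == "size") size = $(i + 1)
        if ($i == "min_size") minsz = $(i + 1)
      }
      if (size + 0 < 3 || minsz + 0 < 2) {
        print $3 " size=" size " min_size=" minsz
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage on a live cluster (left as comments so the sketch has no side effects):
#   ceph osd pool ls detail | check_pool_replication
#   ceph health detail   # look for remapped PGs or degraded data redundancy
#   ha-manager status    # lists the resources currently managed by HA
```

No output and a zero exit status from `check_pool_replication` means every listed pool meets 3/2. The HA and health checks still need a human look at the command output.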

Start maintenance mode for the Proxmox node and any containers running on the node
Start maintenance mode for Ceph, specifying that only the trigger for the health state being in warning should be suppressed, by setting the tag `ceph_health` equal to `warning`

Set noout flag on host: `ceph osd set-group noout <node>`
Reboot node through web GUI
Wait for node to come back up
Wait for OSDs to be back online
Remove noout flag on host: `ceph osd unset-group noout <node>`
If a kernel update was done, manually execute the `Operating system` item to detect the update. Manually executing the two items that indicate a reboot is also useful if they were firing, to stop them and to check that no further reboots are needed.
Acknowledge & close triggers
Remove maintenance modes
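
For the "wait for OSDs to be back online" step above, a small helper can list the OSDs of the rebooted host that are not yet up, so the noout flag is only removed once the list is empty. This is a sketch under assumptions: it expects the usual `ceph osd tree` layout, with a `host <name>` line followed by that host's `osd.N <status>` lines. The function name `osds_not_up` and the node name `pve3` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: reports OSDs of one host that are not "up".

# osds_not_up <node>: reads `ceph osd tree` output on stdin and prints
# "<osd> <status>" for every OSD under host <node> whose status is not "up".
# Returns non-zero if any such OSD is found.
osds_not_up() {
  awk -v node="$1" '
    {
      for (i = 1; i < NF; i++) {
        # Remember which host bucket the following OSD lines belong to.
        if ($i == "host") host = $(i + 1)
        if ($i ~ /^osd\.[0-9]+$/ && host == node && $(i + 1) != "up") {
          print $i " " $(i + 1)
          found = 1
        }
      }
    }
    END { exit (found ? 1 : 0) }
  '
}

# Usage around a reboot (left as comments; pve3 is a placeholder node name):
#   ceph osd set-group noout pve3
#   ... reboot the node through the web GUI and wait for it to come back ...
#   ceph osd tree | osds_not_up pve3   # repeat until it prints nothing
#   ceph osd unset-group noout pve3
```

The helper only reads command output and changes nothing itself, so setting and removing the flag remain explicit operator actions.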

Ensure that Kaboom API is running on Screwdriver or Paloma. This is to get the best performance for the VM.

@@ Line 4: / Line 4: @@
 == Pre flight checks ==
-* Check all Ceph pools are running on at least 2/3 replication
+* Check all Ceph pools are running on at least 3/2 replication
 * Check that all running VM's on the node you want to reboot are in HA (if not, add them or migrate them away manually)
 * Check that Ceph is healthy -> No remapped PG's, or degraded data redundancy