WS Proxmox node reboot: Difference between revisions

Latest revision as of 05:32, 12 February 2026

Tips & Notes

If you're expecting to reboot every node in the cluster, reboot the node with the containers last, to limit the amount of downtime and the number of reboots for them
Updating a node: `apt update` and `apt full-upgrade`
Make sure all VMs are actually migratable before adding them to an HA group
If there are containers on the node you are about to reboot, you also need to create a maintenance mode that covers them (for example teamspeak or awstats)
Containers will inherit the OS of their host, so you will also need to handle triggers related to their OS updating, where appropriate

Pre-Work

If a VM or container is going to incur downtime, you must let the affected parties know in advance. Ideally they should be informed the previous day.

Pre-flight checks

Check that all Ceph pools are running with at least 3/2 replication (size 3, min_size 2)
Check that all running VMs on the node you want to reboot are in HA (if not, add them or migrate them away manually)
- The `compute.*` VMs are not to be migrated! Rebooting a node with such a VM present requires shutting down the VM!
Check that Ceph is healthy: no remapped PGs and no degraded data redundancy
Confirm that you have communicated the expected downtime to the users who will be affected (ideally one day in advance)
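
The Ceph and HA checks above can be sketched as a few shell helpers. This is a minimal sketch rather than existing tooling: the function names are hypothetical, and the parsing assumes the usual output shapes of `ceph osd pool ls detail`, `ceph health`, `ceph pg stat`, `qm list` and `ha-manager status`, which may differ between versions. Each helper reads command output instead of calling the cluster itself, so the parsing can be tried on saved output first.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers (names are not from the wiki page).

# stdin: output of `ceph osd pool ls detail`
# Prints every pool whose size is below 3 or whose min_size is below 2.
pools_below_3_2() {
  awk '$1 == "pool" {
    sz = 0; msz = 0
    for (i = 1; i < NF; i++) {
      if ($i == "size") sz = $(i + 1) + 0
      if ($i == "min_size") msz = $(i + 1) + 0
    }
    if (sz < 3 || msz < 2) print $3, "size=" sz, "min_size=" msz
  }'
}

# stdin: output of `ceph health`
# Succeeds only when the first word is HEALTH_OK.
ceph_health_ok() {
  read -r state _rest
  [ "$state" = "HEALTH_OK" ]
}

# stdin: output of `ceph pg stat`
# Succeeds when no PG state mentions remapped or degraded.
pgs_clean() {
  ! grep -Eq 'remapped|degraded'
}

# $1: output of `qm list`, $2: output of `ha-manager status`
# Prints the VMID of every running VM that has no "service vm:<id>" HA entry.
vms_not_in_ha() {
  printf '%s\n' "$1" | awk '$3 == "running" { print $1 }' |
    while read -r vmid; do
      printf '%s\n' "$2" | grep -q "service vm:${vmid} " || echo "$vmid"
    done
}

# Example usage on the node:
#   ceph osd pool ls detail | pools_below_3_2
#   ceph health | ceph_health_ok && echo "Ceph healthy"
#   ceph pg stat | pgs_clean && echo "No remapped or degraded PGs"
#   vms_not_in_ha "$(qm list)" "$(ha-manager status)"
```

Any VMID printed by `vms_not_in_ha` still has to be added to HA or migrated away manually, except the `compute.*` VMs, which are shut down instead of migrated.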

Update Process

Update the node: `apt update` and `apt full-upgrade`
Check that the packages being removed, updated or installed are the ones you expect and that the changes make sense
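
One way to review the package changes before applying them is to summarise a simulated upgrade. The sketch below is an assumption, not part of the documented process: it uses `apt-get --simulate dist-upgrade` (the `apt-get` counterpart of `apt full-upgrade`) and a hypothetical helper, `summarize_upgrade`, that counts the `Inst` and `Remv` lines of the simulation output and lists any packages that would be removed.

```shell
#!/bin/sh
# Hypothetical helper: summarise the output of a simulated upgrade.
# stdin: output of `apt-get --simulate dist-upgrade`
summarize_upgrade() {
  awk '
    $1 == "Inst" { inst++ }                         # installed or upgraded
    $1 == "Remv" { remv++; removed = removed " " $2 }  # would be removed
    END {
      printf "install/upgrade: %d, remove: %d\n", inst, remv
      if (remv > 0) printf "removed:%s\n", removed
    }'
}

# Example usage on the node:
#   apt update
#   apt-get --simulate dist-upgrade | summarize_upgrade
```

Unexpected removals are the main thing to look for before running the real `apt full-upgrade`.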

Reboot process

Complete the pre-flight checks
If you are rebooting for a kernel update, make sure the kernel has been updated by following the Update Process above
Start maintenance mode for the Proxmox node and any containers running on the node
Start maintenance mode for Ceph, and specify that we only want to suppress the trigger for the health state being in warning by setting the tag `ceph_health` equal to `warning`
Let affected parties know that the maintenance period you told them about in the pre-flight checks is about to take place.

Set noout flag on host: `ceph osd set-group noout <node>`

Gain SSH access to the host
Log in through IPA
Run the command

Place the node in HA maintenance mode so the cluster does not trigger failover or recovery actions: `ha-manager crm-command node-maintenance enable <node>`
Reboot the node through the web GUI
Wait for the node to come back up
Wait for the OSDs to be back online
Disable node maintenance mode: `ha-manager crm-command node-maintenance disable <node>`
Remove noout flag on host: `ceph osd unset-group noout <node>`
If a kernel update was done, manually execute the `Operating system` item to detect the update. Manually executing the two items that indicate a reboot is also useful if they were firing, to clear them and check that no further reboots are needed.
Acknowledge and close triggers
Remove maintenance modes
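
The command-line part of the steps above can be sketched as two shell functions, one to run before the reboot and one after. This is a minimal sketch under stated assumptions: the function names, the `DRY_RUN` switch and the `POLL_INTERVAL` variable are hypothetical, and the OSD check assumes the first line of `ceph osd stat` looks like `12 osds: 12 up ..., 12 in ...`. The `ceph osd set-group` / `unset-group` and `ha-manager crm-command node-maintenance` commands are the ones listed above. The reboot itself and the monitoring maintenance modes are still handled by hand.

```shell
#!/bin/sh
# Hypothetical wrappers around the commands from the reboot steps.

# Run a command, or only print it when DRY_RUN is set (hypothetical switch).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# stdin: output of `ceph osd stat`; succeeds when every OSD is reported up.
all_osds_up() {
  awk 'NR == 1 { ok = ($1 > 0 && $1 == $3) } END { exit (ok ? 0 : 1) }'
}

# Before the reboot: set noout for the node, then enable HA maintenance.
pre_reboot() {
  node=$1
  run ceph osd set-group noout "$node" &&
    run ha-manager crm-command node-maintenance enable "$node"
}

# Poll until all OSDs are up again (read-only, so not wrapped in run).
wait_for_osds() {
  until ceph osd stat | all_osds_up; do
    sleep "${POLL_INTERVAL:-10}"
  done
}

# After the node is back: wait for OSDs, leave HA maintenance, unset noout.
post_reboot() {
  node=$1
  wait_for_osds &&
    run ha-manager crm-command node-maintenance disable "$node" &&
    run ceph osd unset-group noout "$node"
}

# Example usage:
#   DRY_RUN=1 pre_reboot <node>    # print the commands without running them
#   pre_reboot <node>              # then reboot through the web GUI
#   post_reboot <node>             # once the node is back up
```

Running with `DRY_RUN=1` first is a cheap way to confirm the node name and command order before touching the cluster.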

Aftercare

Ensure that Kaboom API is running on Screwdriver or Paloma. This is to get the best performance for the VM.
