Hardware Incident Response: Memory Slot Failure on banshee



Once it was confirmed that the memory issue on banshee was a genuine hardware failure, it was decided that a physical intervention was necessary to replace the faulty memory module. The first step was to migrate all VMs running on banshee to other available nodes in the Proxmox cluster to avoid service interruption. After ensuring that no critical workloads were running on banshee, the server could be safely shut down in preparation for hardware replacement at the datacenter.
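Before powering off, it is worth double-checking that the node is actually empty. The following is a minimal sketch of such a pre-shutdown check, assuming direct shell access to banshee and the standard qm CLI; it is an illustration, not the exact procedure used during the incident.

<syntaxhighlight lang="python">
#!/usr/bin/env python3
"""Minimal sketch: confirm a Proxmox node has no running VMs before shutdown.

Assumes it is executed directly on the node being retired (banshee) and that
the standard `qm` CLI is available in PATH.
"""
import subprocess


def running_vms() -> list[str]:
    # `qm list` prints a header row followed by one line per VM:
    # VMID, name, status, memory, bootdisk, PID.
    out = subprocess.run(["qm", "list"], capture_output=True, text=True, check=True)
    rows = out.stdout.strip().splitlines()[1:]  # skip the header row
    return [row.split()[0] for row in rows if "running" in row]


if __name__ == "__main__":
    vms = running_vms()
    if vms:
        print(f"Refusing to shut down: VMs still running on this node: {vms}")
    else:
        print("No running VMs left; safe to power off for the memory replacement.")
</syntaxhighlight>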
== VM Migration and Shutdown ==
While Proxmox HA is designed to handle VM migrations automatically when a node fails, in this case the degraded memory made the bulk migration process unstable, causing the host to crash midway through and putting the affected VMs into a fence state. In hindsight, migrating the VMs manually one by one would likely have been a safer strategy. Further technical details of the incident and recovery process can be found in the Zulip conversation.
Things to consider next time:
* If the server is already unstable, avoid bulk HA-triggered migration and migrate VMs manually one at a time (see the sketch after this list)
* Verify HA master node is responsive before initiating HA operations
* Test migration procedures on a non-critical VM first
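A minimal sketch of the one-at-a-time approach is shown below. It assumes direct shell access to the degraded node and the standard qm CLI; the target node name and the pause between migrations are placeholder assumptions, and HA-managed guests may need ha-manager migrate instead of qm migrate.

<syntaxhighlight lang="python">
#!/usr/bin/env python3
"""Minimal sketch: evacuate a degraded node one VM at a time instead of
relying on a bulk HA-triggered migration.

Runs on the source node and shells out to the standard `qm` CLI. The target
node name and pause interval below are illustrative, not values from the
actual incident; HA-managed guests may need `ha-manager migrate` instead.
"""
import subprocess
import time

TARGET_NODE = "node2"    # hypothetical healthy cluster member
PAUSE_SECONDS = 60       # give the unstable host time to settle between moves


def list_vmids() -> list[str]:
    out = subprocess.run(["qm", "list"], capture_output=True, text=True, check=True)
    return [row.split()[0] for row in out.stdout.strip().splitlines()[1:]]


def migrate(vmid: str) -> None:
    # Online (live) migration keeps the guest running while it moves.
    subprocess.run(["qm", "migrate", vmid, TARGET_NODE, "--online"], check=True)


if __name__ == "__main__":
    for vmid in list_vmids():
        print(f"Migrating VM {vmid} to {TARGET_NODE} ...")
        migrate(vmid)
        time.sleep(PAUSE_SECONDS)  # strictly one at a time, with breathing room
</syntaxhighlight>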