Hardware Incident Response: Memory Slot Failure on banshee
This document outlines how we handled a memory module (DIMM) failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions. For full context and team discussion, see the Zulip conversation related to this incident.
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.
Confirmed it's a hardware issue
In this case we received two alerts:
1. iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure
   - Overall System Status is Critical (5)
2. Overall System Status is Critical (5)
   - Problem with memory in slot DIMM.Socket.A1
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:
journalctl -b | grep -i memory   # messages from the current boot that mention memory
journalctl -k | grep -i error    # kernel ring buffer entries reporting errors
We saw multiple entries reporting Hardware Error.
This was also confirmed by checking the hardware health status in the iDRAC web interface.
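For additional confirmation, and to map the alert to a physical slot, a couple of further checks can be run on the host. These were not part of the commands we used during this incident; they assume ipmitool and dmidecode are installed and that the iDRAC exposes the system event log over IPMI:

ipmitool sel elist | grep -iE 'memory|ecc'                        # BMC system event log entries for ECC/memory faults
dmidecode -t memory | grep -E 'Locator|Size|Error Information'    # map reported DIMMs to physical socket labels (e.g. DIMM.Socket.A1)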
Came up with a plan
Once it was confirmed that the memory issue on banshee was a genuine hardware failure, we decided that a physical intervention was necessary to replace the faulty memory module. The first step was to migrate all VMs running on banshee to other available nodes in the Proxmox cluster to avoid service interruption. After ensuring that no critical workloads were still running on banshee, the server could be safely shut down in preparation for the hardware replacement at the datacenter.
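A rough sketch of the pre-checks and the final shutdown from the Proxmox CLI; the exact commands were not recorded for this incident, so treat this as an illustration rather than the procedure we ran:

pvecm status        # confirm the cluster is quorate and see which nodes are available as migration targets
qm list             # list the VMs still running on banshee
# ...migrate the VMs off (see the next section)...
shutdown -h now     # once banshee is empty, power it off for the DIMM swap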
VM Migration and Shutdown
Proxmox HA is designed to handle VM migrations automatically when a node fails. In this case, however, the degraded memory made the bulk migration unstable: the host crashed partway through and the remaining HA-managed VMs were put into the fence state. In hindsight, migrating the VMs manually, one at a time, would likely have been the safer strategy. Further technical details of the incident and the recovery process can be found in the Zulip conversation.
Things to consider next time:
- Avoid bulk HA-triggered migration if the server is already unstable; migrate VMs manually one at a time (see the sketch after this list)
- Verify HA master node is responsive before initiating HA operations
- Test migration procedures on a non-critical VM first
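A minimal sketch of the manual, one-at-a-time approach. The VM ID (101) and target node name (othernode) are placeholders, and the commands are assumed to be run on banshee itself:

ha-manager status                      # verify the CRM master and the local LRM are responsive before starting
qm migrate 101 othernode --online      # live-migrate a single, non-critical VM first as a test
ha-manager migrate vm:101 othernode    # for HA-managed VMs, request the move through the HA stack instead
# repeat per VM, waiting for each migration to finish before starting the next

Starting with a low-priority VM doubles as the migration test mentioned in the last point above, and keeps the load on the degraded host as predictable as possible.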