Hardware Incident Response: Memory Slot Failure on banshee
This document outlines how we handled a memory stick failure on the server banshee. It details the actual steps we took, the tools and commands we used, and the reasoning behind our decisions. For full context and team discussions, see the Zulip conversation related to this incident.
⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.
This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.
Confirmed it's a hardware issue
In this case we received two alerts:
1. iDRAC on banshee.idrac.ws.maxmaton.nl reporting a critical failure
   - Overall System Status is Critical (5)
2. Overall System Status is Critical (5)
   - Problem with memory in slot DIMM.Socket.A1
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:
journalctl -b | grep -i memory   # all log messages from the current boot, filtered for "memory"
journalctl -k | grep -i error    # kernel messages only, filtered for "error"
We saw multiple entries reporting Hardware Error.
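A few additional read-only checks can help narrow the failure down to a specific DIMM. These were not part of the commands we ran during this incident; they assume Dell's racadm tooling, dmidecode, and the edac-utils package are available on or reachable from the host:

racadm getsel          # iDRAC System Event Log, which records the DIMM error events
dmidecode -t memory    # DIMM locators, sizes, part and serial numbers as reported by the BIOS
edac-util -v           # corrected/uncorrected memory error counters exposed by the kernel EDAC driver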
This was also confirmed by checking hardware health on the iDRAC interface:
[Screenshot: Banshee's hardware health on iDRAC]
[Screenshot: Banshee iDRAC logs]
Came up with a plan
Once it was confirmed that the memory issue on banshee was a genuine hardware failure, it was decided that a physical intervention was necessary to replace the faulty memory module. The first step was to migrate all VMs running on banshee to other available nodes in the Proxmox cluster to avoid service interruption. After ensuring that no critical workloads were running on banshee, the server could be safely shut down in preparation for hardware replacement at the datacenter.
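As a quick sanity check before starting any migration (a sketch, not commands taken from the incident), the Proxmox CLI can list the VMs defined on the node and confirm the cluster is quorate:

qm list        # VMs currently defined on this node; run on banshee itself
pvecm status   # cluster membership and quorum state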
VM Migration and Shutdown
While Proxmox HA is designed to automatically handle VM migrations in the event of node failures, in this case the degraded state of the memory made the bulk migration process unstable, causing the host to crash mid-way through and sending VMs into fencing mode. In hindsight, migrating VMs manually one by one would likely have been a safer strategy. Further technical details of the incident and recovery process can be found in the Zulip conversation.
Things to consider next time:
- Avoid bulk HA-triggered migration if the server is already unstable — migrate VMs manually one at a time (see the sketch after this list)
- Verify HA master node is responsive before initiating HA operations
- Test migration procedures on a non-critical VM first
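A minimal sketch of the one-at-a-time approach recommended above. The VM ID 101 and the target node name are placeholders, not values from this incident:

pvecm status                            # confirm the cluster is quorate
ha-manager status                       # confirm the HA stack, including its master, is responsive
qm migrate 101 <target-node> --online   # live-migrate a single VM, then wait before starting the next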
Intervention at the Datacenter
Once the replacement memory stick was delivered, we scheduled a physical intervention at the datacenter to carry out the replacement. The goal was to bring the banshee node back online without any hardware issues and reintegrate it into the cluster safely.
To guide the intervention, we used a checklist outlining all the necessary steps — from powering down the machine and replacing the faulty DIMM to validating the memory installation and carefully reintroducing VMs to the node. This helped ensure that each task was executed in the correct order and nothing critical was overlooked.
Note: The checklist we followed is not intended to be a definitive or one-size-fits-all procedure. It should instead be seen as a practical example — a starting point that can be adapted depending on the specific hardware issue, node role, and service criticality involved in future incidents.
- Remove Banshee from HA groups to prevent automatic VM migrations (see the ha-manager sketch after this checklist)
- Set up maintenance period for Banshee
- Turn Banshee off
- Disconnect Banshee
- Open up Banshee
- Locate and remove the faulty memory stick
- Install the new memory stick in the correct slot
- Record serial numbers of memory sticks in Netbox
- Close Banshee, reconnect power and network, power it on, and connect a monitor
- Enter the Lifecycle Controller and confirm that the memory is detected
- In the Lifecycle Controller, run a memory test; if the test fails, repeat the previous steps with another memory stick
- Migrate a few selected test VMs back to Banshee
- Once the system is stable and VMs are confirmed to run correctly, migrate all intended VMs to Banshee
- Add Banshee back to the original HA groups
- Make sure OSDs come back online (see the Ceph status check after this checklist)
- Remove maintenance period
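For the HA-related items above, one possible way to handle them from the Proxmox CLI is sketched below. Group names, node lists, and VM IDs are placeholders, and the commands assume classic HA groups (the exact syntax may differ on newer Proxmox VE releases):

ha-manager status                                              # list HA services and their current nodes
ha-manager groupset prod-group --nodes "node1,node2"           # drop banshee from the group's node list before the intervention
ha-manager groupset prod-group --nodes "node1,node2,banshee"   # add it back once the node is healthy again
qm migrate 101 banshee --online                                # migrate a test VM back first, then the remaining ones

For the final OSD check, the standard Ceph status commands are enough to confirm that the OSDs hosted on banshee are back up and in:

ceph -s         # overall cluster health and the "N osds: N up, N in" summary
ceph osd tree   # per-OSD up/down status, grouped by host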