This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.

== Confirmed it's a hardware issue ==
In this case we received 2 alerts
This was also confirmed by checking hardware health on the iDRAC interface:

[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|center|alt=Banshee's hardware health on iDRAC|Banshee's hardware health on iDRAC]]
[[File:Image(1).png|thumb|center|alt=Banshee iDRAC logs|Banshee iDRAC logs]]
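Besides the web UI, the same health information can be pulled from the iDRAC's Redfish API, which is convenient for monitoring or for re-checking after the DIMM swap. The sketch below is a minimal, hypothetical example: the credentials are placeholders, and while <code>/redfish/v1/Systems/System.Embedded.1</code> is the standard Redfish path on Dell iDRACs, the exact fields may differ per firmware version.

<syntaxhighlight lang="python">
# Minimal sketch: read overall and memory health from the iDRAC Redfish API.
# The credentials below are placeholders (assumptions, not the real ones).
import requests

IDRAC_HOST = "banshee.idrac.ws.maxmaton.nl"  # iDRAC hostname as seen in the screenshots
AUTH = ("root", "changeme")                  # placeholder credentials

resp = requests.get(
    f"https://{IDRAC_HOST}/redfish/v1/Systems/System.Embedded.1",
    auth=AUTH,
    verify=False,  # iDRACs commonly ship with self-signed certificates
)
resp.raise_for_status()
system = resp.json()

# "Status" is the rolled-up system health; "MemorySummary" narrows it to the DIMMs.
print("Overall health:", system.get("Status", {}).get("Health"))
print("Memory health: ", system.get("MemorySummary", {}).get("Status", {}).get("HealthRollup"))
</syntaxhighlight>

A degraded DIMM would typically show up here as a "Warning" or "Critical" memory rollup, matching what the iDRAC log entries report.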
== Came up with a plan ==
Once it was confirmed that the memory issue on Banshee was a genuine hardware failure, it was decided that a physical intervention was necessary to replace the faulty memory module. The first step was to migrate all VMs running on Banshee to other available nodes in the Proxmox cluster to avoid service interruption. After ensuring that no critical workloads were still running on Banshee, the server could be safely shut down in preparation for the hardware replacement at the datacenter.
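As a rough illustration of the migration step, the sketch below drains all VMs from Banshee through the Proxmox VE HTTP API. The cluster endpoint, credentials, and target node are assumptions; the <code>/api2/json</code> endpoints used are part of the documented Proxmox API, but the same thing can of course also be done from the web UI or with <code>qm migrate</code> on the node itself.

<syntaxhighlight lang="python">
# Minimal sketch: live-migrate every VM off the failing node before shutdown.
# Cluster address, credentials, and target node are placeholders (assumptions).
import requests

PVE_HOST = "pve.example:8006"   # placeholder cluster API endpoint
SOURCE_NODE = "banshee"
TARGET_NODE = "other-node"      # placeholder migration target

session = requests.Session()
session.verify = False          # assuming a self-signed cluster certificate

# Authenticate and pick up the ticket cookie plus CSRF token.
ticket = session.post(
    f"https://{PVE_HOST}/api2/json/access/ticket",
    data={"username": "root@pam", "password": "changeme"},
).json()["data"]
session.cookies.set("PVEAuthCookie", ticket["ticket"])
session.headers["CSRFPreventionToken"] = ticket["CSRFPreventionToken"]

# List every VM still on the failing node and live-migrate it away.
vms = session.get(f"https://{PVE_HOST}/api2/json/nodes/{SOURCE_NODE}/qemu").json()["data"]
for vm in vms:
    print(f"Migrating VM {vm['vmid']} to {TARGET_NODE}")
    session.post(
        f"https://{PVE_HOST}/api2/json/nodes/{SOURCE_NODE}/qemu/{vm['vmid']}/migrate",
        data={"target": TARGET_NODE, "online": 1},
    ).raise_for_status()
</syntaxhighlight>

Once the node reports no remaining guests, it can be shut down cleanly before heading to the datacenter.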