Hardware Incident Response: Memory Slot Failure on banshee: Difference between revisions



Revision as of 03:31, 8 July 2025

1,765 bytes added , 8 July 2025

no edit summary

Alois


@@ Line 44: / Line 44: @@
 * Verify HA master node is responsive before initiating HA operations
 * Test migration procedures on a non-critical VM first
+== Intervention at the Datacenter ==
+Once the replacement memory stick had been delivered, we scheduled a physical intervention at the datacenter to install it. The goal was to bring the banshee node back online with fully working memory and to reintegrate it into the cluster safely.
+To guide the intervention, we used a checklist covering every step, from powering down the machine and replacing the faulty DIMM to validating the memory installation and gradually reintroducing VMs to the node. This ensured that each task was executed in the correct order and that nothing critical was overlooked.
+Note: The checklist we followed is not a definitive or one-size-fits-all procedure. Treat it as a practical example: a starting point to adapt to the specific hardware issue, node role, and service criticality of future incidents.
+* Remove banshee from its HA groups to prevent automatic VM migrations
+* Set up a maintenance period for banshee
+* Turn banshee off
+* Disconnect banshee
+* Open up banshee
+* Locate and remove the faulty memory stick
+* Install the new memory stick in the correct slot
+* Record the serial numbers of the memory sticks in NetBox
+* Close banshee, reconnect power and network, power it on, and connect a monitor
+* Enter the Lifecycle Controller and confirm that the memory is detected
+* In the Lifecycle Controller, run a memory test; if the test fails, repeat the previous steps with another memory stick
+* Migrate a few selected test VMs back to banshee
+* Once the system is stable and the test VMs are confirmed to run correctly, migrate all intended VMs to banshee
+* Add banshee back to its original HA groups
+* Make sure the OSDs come back online
+* Remove the maintenance period
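
The ordering and the "repeat on failed memory test" rule in the checklist above can be sketched as a small Python runner. This is an illustration only, not tooling that was used during the incident. The function `run_intervention`, the `confirm` and `memory_test_passes` callbacks, and the `spare_sticks` parameter are hypothetical names introduced here. The grouping into three phases is also an assumption: "repeat previous steps" is read as everything from powering off through the memory test.

```python
from typing import Callable, List

# Step labels paraphrase the checklist; the three-phase grouping is an assumption.
PREPARE = [
    "Remove banshee from HA groups",
    "Set up maintenance period",
]
# Assumed to be the steps repeated with another stick if the memory test fails.
REPLACE = [
    "Turn banshee off",
    "Disconnect banshee",
    "Open up banshee",
    "Remove the suspect memory stick",
    "Install a replacement stick in the correct slot",
    "Record serial numbers in NetBox",
    "Close, reconnect, power on, connect a monitor",
    "Confirm memory is detected in the Lifecycle Controller",
    "Run the memory test in the Lifecycle Controller",
]
REINTEGRATE = [
    "Migrate a few test VMs back to banshee",
    "Migrate all intended VMs to banshee",
    "Add banshee back to the original HA groups",
    "Make sure OSDs come back online",
    "Remove maintenance period",
]


def run_intervention(
    confirm: Callable[[str], bool],
    memory_test_passes: Callable[[int], bool],
    spare_sticks: int = 2,
) -> List[str]:
    """Walk the checklist in order and return a log of completed steps.

    confirm(step) is asked before each step counts as done (hypothetical hook,
    e.g. an operator typing "y"). memory_test_passes(attempt) reports the
    outcome of the memory test for the given attempt number (1-based).
    """
    log: List[str] = []

    def do(step: str) -> None:
        if not confirm(step):
            raise RuntimeError(f"step not confirmed: {step}")
        log.append(step)

    for step in PREPARE:
        do(step)

    # Repeat the hardware phase until a stick passes the memory test.
    for attempt in range(1, spare_sticks + 1):
        for step in REPLACE:
            do(f"{step} (attempt {attempt})")
        if memory_test_passes(attempt):
            break
    else:
        raise RuntimeError("memory test failed with every available stick")

    # VMs are only reintroduced after a passing memory test.
    for step in REINTEGRATE:
        do(step)
    return log


if __name__ == "__main__":
    completed = run_intervention(
        confirm=lambda step: True,
        memory_test_passes=lambda attempt: True,
    )
    print("\n".join(completed))
```

Keeping the phases as plain lists makes the sketch easy to adapt, which matches the note above that the checklist is a starting point. A different hardware fault would mostly change the contents of `REPLACE`, while the preparation and reintegration phases stay the same.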
