Hardware Incident Response: Memory Slot Failure on banshee

From Delft Solutions

Revision as of 01:29, 12 June 2025

This document outlines how we handled a memory module failure on the server banshee. It details the steps we took, the tools and commands we used, and the reasoning behind our decisions. For full context and team discussion, see the Zulip conversation related to this incident.

⚠️ Important: This is not a universal or comprehensive guide. Hardware failures — including memory issues — can vary widely in symptoms and impact. There may be multiple valid ways to respond depending on the urgency or available resources.

This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.

Confirm it's a hardware issue

In this case we received two alerts:

1.

  • iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure
  • Overall System Status is Critical (5)

2.

  • Overall System Status is Critical (5)
  • Problem with memory in slot DIMM.Socket.A1

To confirm the issue, we logged into the affected server (banshee) and ran the following commands:

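# -b: messages from the current boot, filtered for "memory" (case-insensitive)
journalctl -b | grep -i memory
# -k: kernel messages only, filtered for "error" (case-insensitive)
journalctl -k | grep -i error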

We saw multiple entries reporting Hardware Error.
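
For a quick count instead of scrolling through the output, the log check can be wrapped in a small helper. This is a minimal sketch and not part of what we ran during the incident: the function name count_hw_errors is hypothetical, and it assumes the relevant lines contain the text "Hardware Error", as the entries we saw did.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): count log lines that mention a hardware error.
# Reads log text on stdin, so it works with journalctl output or a saved log file.
count_hw_errors() {
    # grep -c prints the number of matching lines; -i ignores case.
    # grep exits non-zero when nothing matches (it still prints 0),
    # so "|| true" keeps the function usable under "set -e".
    grep -ci 'hardware error' || true
}
```

It can be fed from the kernel log with: journalctl -k | count_hw_errors. A non-zero count is only a hint that the alert is real. The iDRAC hardware health page remains the place to confirm which component is affected.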

This was also confirmed by checking hardware health on the iDRAC interface:

Banshee's hardware health on iDRAC