Hardware Incident Response: Memory Slot Failure on banshee



This write-up should be seen as one practical example that may help guide similar interventions in the future or serve as a starting point when assessing next steps in a hardware-related incident.
== Confirm it's a hardware issue ==
In this case we received two alerts:
1.
* iDRAC on banshee.idrac.ws.maxmaton.nl reporting critical failure
* Overall System Status is Critical (5)
2.
* Overall System Status is Critical (5)
* Problem with memory in slot DIMM.Socket.A1
To confirm the issue, we logged into the affected server (banshee) and ran the following commands:
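<pre lang="bash">
# Search all messages from the current boot for memory-related entries
journalctl -b | grep -i memory
# Search kernel messages from the current boot for errors
journalctl -k | grep -i error
</pre>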
We saw multiple entries reporting Hardware Error.
This was also confirmed by checking hardware health on the iDRAC interface:
[[File:Banshee.idrac.ws.maxmaton.nl restgui index.html 8ce2fb21ce62c14bc4975f040b973a5f(1).png|thumb|alt=Banshee's hardware health on iDRAC|Banshee's hardware health on iDRAC]]
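For future incidents, the log check above can be condensed into a single count. The following is a minimal sketch, not part of the original intervention: <code>count_hw_errors</code> is a hypothetical helper name, and it assumes the kernel reports the fault with the phrase "Hardware Error", as it did on banshee.

```bash
#!/usr/bin/env bash

# count_hw_errors (hypothetical helper): reads log lines on stdin and prints
# the number of lines that mention "Hardware Error", ignoring case.
count_hw_errors() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # function usable in scripts that run with "set -e". The count (0) is
    # still printed in that case.
    grep -ci 'hardware error' || true
}

# Usage on the affected host (assumes journalctl is available and the user
# may read the kernel log):
#   journalctl -k --no-pager | count_hw_errors
```

A non-zero count only indicates that the kernel logged hardware errors during the current boot. It does not identify the failing component, so the iDRAC hardware health page remains the place to confirm which slot is affected.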