Here's a post from the "cool story, bro" department. Maybe relevant to those who think ECC doesn't matter with today's RAM and its error rates (not that anybody knows what those are).
Got a bunch of "ECC error corrected" messages every morning after some cronjob ran. No hard errors, and it stayed quiet for the rest of the day, so I kicked and screamed for a couple of days, thought about using the opportunity to go registered RAM on at least one CPU, or a new board combo, or buy a guitar or whatever, you know how it goes. I finally ripped a 3-pack of modules out of a different machine and got to changing that bank.
Turns out it was just one DIMM that had come unnotched and wasn't quite seated anymore. Popped it back in, no more soft errors. Without ECC that could have scrambled my filesystem contents and caused general reboot madness, and you could have diagnosed it only with the machine offline (in memtest).
It even pointed out the correct DIMM right in dmesg. Supermicro FTW.
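If you'd rather poll the counters than grep dmesg, here's a rough sketch of how that could look on Linux. It assumes the kernel's EDAC driver is loaded and exposes per-DIMM counters under /sys/devices/system/edac/mc (older kernels or other drivers may only offer a csrow layout, in which case this finds nothing). The function name read_corrected_counts is made up for this example, and it's not what I actually ran.

```python
from pathlib import Path

# Assumed default location of the EDAC sysfs tree; adjust if your kernel differs.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_corrected_counts(edac_root=EDAC_ROOT):
    """Return {(controller, dimm_label): corrected_error_count} for each DIMM found."""
    counts = {}
    for counter in sorted(Path(edac_root).glob("mc*/dimm*/dimm_ce_count")):
        dimm_dir = counter.parent
        label_file = dimm_dir / "dimm_label"
        label = label_file.read_text().strip() if label_file.exists() else ""
        # Fall back to the directory name (e.g. "dimm0") when no label is set.
        counts[(dimm_dir.parent.name, label or dimm_dir.name)] = int(counter.read_text().strip())
    return counts


if __name__ == "__main__":
    for (controller, label), count in read_corrected_counts().items():
        if count:
            print(f"{controller} {label}: {count} corrected error(s)")
```

Run it from cron right after the job that triggers the errors and you get a per-DIMM tally without wading through the log. Whether the label matches the silkscreen on the board depends on the board and driver, so treat it as a hint.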