Dealing with bad memory ranges on Linux.

My desktop machine has been randomly locking up for the last couple of weeks. I started troubleshooting by refreshing my coolant.

This did not seem to help at all. I then installed memtest86 and tested my memory. I found a failure at a specific address. The failure occurred whether the CPU was overclocked or not.

I started to look for a way to skip this memory range, and I stumbled upon BadRAM. Unfortunately, I could not get this patch to compile on my system (AMD64). I tried various patches and various kernel versions, and always got the following error:

mm/page_alloc.c: In function 'badram_markpages':
mm/page_alloc.c:3982: error: 'mem_map' undeclared (first use in this function)
mm/page_alloc.c:3982: error: (Each undeclared identifier is reported only once
mm/page_alloc.c:3982: error: for each function it appears in.)
mm/page_alloc.c: In function 'badram_setup':
mm/page_alloc.c:4008: error: 'mem_map' undeclared (first use in this function)
make[1]: *** [mm/page_alloc.o] Error 1
make: *** [mm] Error 2

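Finally, I found out that the kernel already has a memmap boot parameter that can mask out holes (broken ranges) of memory. In the form memmap=nn[KMG]$ss[KMG], it marks nn bytes starting at physical address ss as reserved, so the kernel never hands that range out. I really don’t want to trash a 4 GB DIMM over a single bad byte of memory.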
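Working out the size and start values by hand is error-prone, so here is a rough Python sketch of the arithmetic. The address below is made up (I am not reproducing my real one here), the 4 KiB page size is an assumption that holds on typical x86-64 systems, and the function name and the one-page padding on each side are my own choices, not anything the kernel requires.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86-64 default


def memmap_reserve_arg(bad_addr, pad_pages=1, page_size=PAGE_SIZE):
    """Build a memmap=SIZE$START argument that reserves the page
    containing bad_addr, plus pad_pages extra pages on each side."""
    # Round the failing address down to the start of its page.
    page_start = (bad_addr // page_size) * page_size
    # Pad below the bad page, but never go below address 0.
    start = max(page_start - pad_pages * page_size, 0)
    # End is exclusive: the bad page itself plus the padding above it.
    end = page_start + (pad_pages + 1) * page_size
    size_kib = (end - start) // 1024
    return "memmap={}K${:#x}".format(size_kib, start)


if __name__ == "__main__":
    # Hypothetical failing address; substitute the one memtest86 reported.
    print(memmap_reserve_arg(0x7C5A1234))
```

With the made-up address this prints memmap=12K$0x7c5a0000. The result goes on the kernel command line in the boot loader configuration. Depending on the boot loader, the $ character may need escaping there, so check that it actually survives into the booted command line.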

At this point I am not sure how to tell whether the reservation is active. I guess I will wait to see whether my system continues to lock up.
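One partial check is to see whether the parameter at least reached the kernel. The sketch below reads /proc/cmdline, which holds the command line the running kernel was booted with, and picks out any memmap=SIZE$START entries. The function names are my own, and this only shows that the option was passed, not that the kernel honored it. The reserved regions listed in /proc/iomem, or the memory map printed in the boot log, should be closer evidence.

```python
import re

# Matches the reserve form only: memmap=SIZE$START, with optional K/M/G suffixes.
MEMMAP_RE = re.compile(
    r"^memmap=([0-9A-Fa-fx]+[KMGkmg]?)\$([0-9A-Fa-fx]+[KMGkmg]?)$"
)


def find_memmap_reservations(cmdline):
    """Return a list of (size, start) strings for each memmap=SIZE$START
    token found in a kernel command line string."""
    found = []
    for token in cmdline.split():
        match = MEMMAP_RE.match(token)
        if match:
            found.append(match.groups())
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            entries = find_memmap_reservations(f.read())
    except OSError:
        entries = []  # not on Linux, or /proc is not mounted
    if entries:
        for size, start in entries:
            print("reserve requested: {} at {}".format(size, start))
    else:
        print("no memmap=SIZE$START entry on the kernel command line")
```

If nothing shows up, the boot loader most likely dropped or mangled the option, for example by swallowing the $ and everything after it.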

