For When You Can't Have The Real Thing

Silent Northbridge MCE

Created by dave. Last edited by dave, 12 years and 305 days ago. Viewed 3,400 times. #2

Problem:

Gratuitous errors from an SMP Opteron system:

Feb 10 10:10:04 v20z-01 kernel: CPU 1: Silent Northbridge MCE 
Feb 10 10:10:04 v20z-01 kernel: Northbridge status a60000010005001b 
Feb 10 10:10:04 v20z-01 kernel:     GART TLB error generic level generic 
Feb 10 10:10:04 v20z-01 kernel:     extended error gart error 
Feb 10 10:10:04 v20z-01 kernel:     link number 0 
Feb 10 10:10:04 v20z-01 kernel:     err cpu1 
Feb 10 10:10:04 v20z-01 kernel:     processor context corrupt 
Feb 10 10:10:04 v20z-01 kernel:     error address valid 
Feb 10 10:10:04 v20z-01 kernel:     error uncorrected 
Feb 10 10:10:04 v20z-01 kernel:     previous error lost 
Feb 10 10:10:04 v20z-01 kernel:     error address 00000001fffe8118

Discussion:

Several different errors share the "Silent Northbridge MCE" header. The third line of the log entry tells you which one you have.

In the case of

ECC syndrome bits ce74
>>Link claims that these are non-fatal memory errors that the system's ECC memory is correcting. The kernel understands the system's ECC Assertion Interrupts, so it logs each event. To fix, replace the memory, or possibly disable the ECC Assert Interrupts in the system's BIOS.

In the case of

GART TLB error generic level generic
>>Link suggests that on Red Hat Enterprise Linux 3 systems the culprit is the AGP driver that Red Hat uses. If these errors do not appear to be fatal, you can generally ignore them. For example, they do not appear on the same machines running CentOS 4.2.
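
To sort a long log by the third line of each entry, something like the following might help. It is only a rough sketch: it assumes the syslog prefix looks exactly like the sample above, that detail lines are indented after "kernel:", and that the ECC variant puts "ECC syndrome bits" on the third line the same way the GART variant does. The function names (parse_mce_events, classify) are made up for this note and are not part of any tool.

```python
import re

# Assumed syslog layout, taken from the sample above:
#   "Feb 10 10:10:04 v20z-01 kernel: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+kernel:\s?(.*)$")
HEADER_RE = re.compile(r"CPU\s+(\d+): Silent Northbridge MCE")


def parse_mce_events(lines):
    """Group kernel log lines into one dict per Silent Northbridge MCE entry."""
    events = []
    current = None
    for raw in lines:
        match = LINE_RE.match(raw.rstrip())
        if not match:
            current = None
            continue
        stamp, host, msg = match.groups()
        header = HEADER_RE.search(msg)
        if header:
            current = {"time": stamp, "host": host,
                       "cpu": int(header.group(1)), "details": []}
            events.append(current)
        elif current is not None and (
            msg.startswith("Northbridge status") or msg.startswith(" ")
        ):
            # Status line or an indented detail line of the current entry.
            current["details"].append(msg.strip())
        else:
            # Unrelated kernel message: the entry is over.
            current = None
    return events


def classify(event):
    """Return 'ecc', 'gart', or 'unknown' based on the entry's third line."""
    # details[0] is the "Northbridge status" line (second line of the entry),
    # so details[1] is the third line.
    third = event["details"][1] if len(event["details"]) > 1 else ""
    if third.startswith("ECC syndrome bits"):
        return "ecc"
    if third.startswith("GART TLB error"):
        return "gart"
    return "unknown"


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python mce_sort.py < /var/log/messages
    for ev in parse_mce_events(sys.stdin):
        print(ev["time"], ev["host"], "cpu", ev["cpu"], classify(ev))
```

Entries that come out as "ecc" point at memory; entries that come out as "gart" are the AGP driver case and can generally be ignored if nothing else is wrong.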
This is a collection of technical information, much of it learned the hard way. Consider it a lab book or a /info directory. I doubt much of it will be of use to anyone else.
