Problem:
Gratuitous errors from an SMP Opteron system:
Feb 10 10:10:04 v20z-01 kernel: CPU 1: Silent Northbridge MCE
Feb 10 10:10:04 v20z-01 kernel: Northbridge status a60000010005001b
Feb 10 10:10:04 v20z-01 kernel: GART TLB error generic level generic
Feb 10 10:10:04 v20z-01 kernel: extended error gart error
Feb 10 10:10:04 v20z-01 kernel: link number 0
Feb 10 10:10:04 v20z-01 kernel: err cpu1
Feb 10 10:10:04 v20z-01 kernel: processor context corrupt
Feb 10 10:10:04 v20z-01 kernel: error address valid
Feb 10 10:10:04 v20z-01 kernel: error uncorrected
Feb 10 10:10:04 v20z-01 kernel: previous error lost
Feb 10 10:10:04 v20z-01 kernel: error address 00000001fffe8118
Discussion:
Several different errors produce this symptom. The third line of the log entry (here, "GART TLB error generic level generic") tells you which one you have.
In the case of
Link claims that these are non-fatal memory errors that the system's ECC memory is correcting. Your kernel understands the system's ECC Assertion Interrupts, so it logs each event. To fix, replace the memory, or possibly disable the ECC Assert Interrupts in the system's BIOS.
In the case of
GART TLB error generic level generic
Link suggests that on Red Hat Enterprise Linux 3 systems the culprit is the AGP driver that Red Hat ships. If these errors do not appear to be fatal, you can generally ignore them. They do not appear on the same machines running CentOS 4.2, for example.
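
The third-line check described in the discussion can be scripted when there are many reports to sort through. The following is a minimal Python sketch, not a definitive tool. It assumes the syslog layout shown in the sample above (everything up to "kernel: " is a prefix) and that each report starts with a line containing "Northbridge MCE". The function names (mce_messages, third_lines, is_gart_error) are hypothetical and chosen only for illustration.

```python
import re

# Assumption: every relevant line carries a syslog prefix ending in "kernel: ",
# as in the sample log above.
KERNEL_PREFIX = re.compile(r"^.*?\bkernel: ")


def mce_messages(lines):
    """Return the kernel message part of each log line, with the prefix stripped."""
    messages = []
    for line in lines:
        match = KERNEL_PREFIX.match(line)
        if match:
            messages.append(line[match.end():].strip())
    return messages


def third_lines(lines):
    """Group messages into reports and return the third line of each report.

    Assumption: a report begins at a line containing "Northbridge MCE".
    """
    reports = []
    for msg in mce_messages(lines):
        if "Northbridge MCE" in msg:
            reports.append([msg])       # start a new report
        elif reports:
            reports[-1].append(msg)     # continuation of the current report
    return [report[2] for report in reports if len(report) >= 3]


def is_gart_error(third_line):
    """True for the GART TLB case covered in the discussion."""
    return third_line.startswith("GART TLB error")


SAMPLE = """\
Feb 10 10:10:04 v20z-01 kernel: CPU 1: Silent Northbridge MCE
Feb 10 10:10:04 v20z-01 kernel: Northbridge status a60000010005001b
Feb 10 10:10:04 v20z-01 kernel: GART TLB error generic level generic
Feb 10 10:10:04 v20z-01 kernel: extended error gart error
""".splitlines()

for line in third_lines(SAMPLE):
    kind = "GART case" if is_gart_error(line) else "other case"
    print(line, "->", kind)
```

To use it on a real log, pass the lines of the file (for example, open(path).read().splitlines()) to third_lines instead of SAMPLE. Anything not flagged as the GART case needs to be matched by hand against the other cases in the discussion.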