May 17 12:39:44 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:39:44 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0xa97802b50
May 17 12:39:44 oldweb kernel: EDAC MC2: CE page 0xa97802, offset 0xb50, grain 0, syndrome 0x11c1, row 0, channel 1, label "": amd64_edac
May 17 12:39:44 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:39:44 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc60c00011080a13
May 17 12:39:44 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x0000000a97802b50
May 17 12:39:44 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
May 17 12:42:14 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:42:14 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0xaa880c350
May 17 12:42:14 oldweb kernel: EDAC MC2: CE page 0xaa880c, offset 0x350, grain 0, syndrome 0x11c1, row 0, channel 1, label "": amd64_edac
May 17 12:42:14 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:42:14 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc60c00011080a13
May 17 12:42:14 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x0000000aa880c350
May 17 12:42:14 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
May 17 12:43:29 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:43:29 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0xbe1283db0
May 17 12:43:29 oldweb kernel: EDAC MC2: CE page 0xbe1283, offset 0xdb0, grain 0, syndrome 0x11c1, row 4, channel 1, label "": amd64_edac
May 17 12:43:29 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:43:29 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc60c00011080a13
May 17 12:43:29 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x0000000be1283db0
May 17 12:43:29 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
May 17 12:44:06 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:44:06 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x97eeb47a0
May 17 12:44:06 oldweb kernel: EDAC MC2: CE page 0x97eeb4, offset 0x7a0, grain 0, syndrome 0x11c1, row 5, channel 1, label "": amd64_edac
May 17 12:44:06 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:44:06 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc60c00011080a13
May 17 12:44:06 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x000000097eeb47a0
May 17 12:44:06 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
May 17 12:44:25 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:44:25 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8fe32c0c0
May 17 12:44:25 oldweb kernel: EDAC MC2: CE page 0x8fe32c, offset 0xc0, grain 0, syndrome 0x2242, row 1, channel 1, label "": amd64_edac
May 17 12:44:25 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:44:25 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc21400022080a13
May 17 12:44:25 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x00000008fe32c0c0
May 17 12:44:25 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
May 17 12:44:34 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:44:34 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8e2f2b0c0
May 17 12:44:34 oldweb kernel: EDAC MC2: CE page 0x8e2f2b, offset 0xc0, grain 0, syndrome 0x11c1, row 1, channel 1, label "": amd64_edac
May 17 12:44:34 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:44:34 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc60c00011080813
May 17 12:44:34 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x00000008e2f2b0c0
May 17 12:44:34 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
May 17 12:44:39 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:44:39 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d1504000
May 17 12:44:39 oldweb kernel: EDAC MC2: CE page 0x8d1504, offset 0x0, grain 0, syndrome 0x2242, row 0, channel 1, label "": amd64_edac
May 17 12:44:39 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:44:39 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc21400022080a13
May 17 12:44:39 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x00000008d1504000
May 17 12:44:39 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
May 17 12:44:41 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:44:41 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8b1012c80
May 17 12:44:41 oldweb kernel: EDAC MC2: CE page 0x8b1012, offset 0xc80, grain 0, syndrome 0x3383, row 0, channel 1, label "": amd64_edac
May 17 12:44:41 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:44:41 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc41c00033080a13
May 17 12:44:41 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x00000008b1012c80
May 17 12:44:41 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
May 17 12:44:42 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:44:42 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0xb66a9caf0
May 17 12:44:42 oldweb kernel: EDAC MC2: CE page 0xb66a9c, offset 0xaf0, grain 0, syndrome 0x11c1, row 4, channel 1, label "": amd64_edac
May 17 12:44:42 oldweb kernel: [Hardware Error]: Error Status: Corrected error, no action required.
May 17 12:44:42 oldweb kernel: [Hardware Error]: CPU:8 (10:2:3) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc60c00011080a13
May 17 12:44:42 oldweb kernel: [Hardware Error]: MC4_ADDR: 0x0000000b66a9caf0
May 17 12:44:42 oldweb kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
May 17 12:44:43 oldweb kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
May 17 12:44:43 oldweb kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8923a3f90
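Before going to sysfs, the corrected-error lines in the log can be counted per row and channel. In the excerpt above, every error is on channel 1 of MC2, spread over rows 0, 1, 4 and 5. The following is a minimal sketch. It assumes the `EDAC MCx: CE page ..., row N, channel M` line format shown above, and other kernel versions may word these lines differently. The function name `tally_ce` is hypothetical.

```shell
# Hypothetical helper: count EDAC corrected-error (CE) summary lines per
# memory controller / row / channel. Reads kernel log text on stdin.
# Assumes the "EDAC MCx: CE page ..., row N, channel M, ..." format;
# the "EDAC amd64 MCx: CE ERROR_ADDRESS=" lines do not match and are skipped.
tally_ce() {
    sed -n 's/.*EDAC \(MC[0-9]*\): CE .* row \([0-9]*\), channel \([0-9]*\).*/\1 row \2 channel \3/p' |
        sort | uniq -c | sort -rn    # most frequent location first
}
```

For example, use `journalctl -k | tally_ce` on a systemd host, or `tally_ce < /var/log/messages` where kernel messages go to syslog.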
Sources online say that "node 2" in these messages refers to CPU2. The EDAC lines report the same controller as MC2.
So I checked the per-channel corrected-error counters with:
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
(other output omitted...)
/sys/devices/system/edac/mc/mc2/csrow1/ch1_ce_count:4414   (this counter is not zero, so this should be the culprit)
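The `grep "[0-9]"` above prints every counter, because a value of 0 also matches `[0-9]`. That is why the rest of the output had to be trimmed by hand. The sketch below prints only the nonzero counters. It assumes the legacy `csrow*/ch*_ce_count` layout used here; newer kernels may also expose per-DIMM `dimm*/` directories, which this sketch does not read. `nonzero_ce` is a hypothetical helper name, and its optional argument exists only so it can be pointed at a test directory.

```shell
# Hypothetical helper: list only the EDAC CE counters that are nonzero.
# $1 (optional) overrides the sysfs root, e.g. for testing.
nonzero_ce() {
    root="${1:-/sys/devices/system/edac/mc}"
    # -H forces "file:value" output; "." matches any non-empty line, so
    # every counter is printed once, then the ":0" lines are dropped.
    grep -H . "$root"/mc*/csrow*/ch*_ce_count 2>/dev/null | grep -v ':0$'
}
```

Run without arguments, it reads the real sysfs tree. On this machine its output should include the mc2 `csrow0` and `csrow1` `ch1_ce_count` entries.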
Narrowing it down with more specific commands:
[root@oldweb ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc2/csrow0/ch*_ce_count
/sys/devices/system/edac/mc/mc2/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:4976
[root@oldweb ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc2/csrow1/ch*_ce_count
/sys/devices/system/edac/mc/mc2/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow1/ch1_ce_count:4414
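Any counter with a nonzero value means memory errors have been recorded at that location. ce_count counts corrected errors.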
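mc*: which CPU the memory controller belongs to. Check how the sockets are labeled on the motherboard, for example whether numbering starts at CPU0. Don't assume CPU1 is the second processor; go by the actual markings.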
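csrow*: chip-select row, i.e. one rank of a DIMM. A dual-rank DIMM usually occupies two consecutive csrows, so csrow0 and csrow1 here most likely belong to the same module.
ch*: the DRAM channel on that memory controller.
From this analysis I concluded that the faulty module was on channel 1 (ch1) of CPU2. It is the DIMM backing csrow0 and csrow1, which I identified as DIMM 1. I removed that module.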
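Each csrow is a rank rather than a whole module, so it can help to add up the counts per channel and per pair of csrows. The sketch below does that. It makes one loud assumption: csrows 2k and 2k+1 are the two ranks of the same dual-rank DIMM. That holds on many boards but not all, and single-rank DIMMs break it. `ce_by_dimm` is a hypothetical name.

```shell
# Hypothetical helper: sum CE counts per memory controller, channel and
# csrow pair. ASSUMPTION: csrow 2k and 2k+1 are ranks of one DIMM.
# $1 (optional) overrides the sysfs root, e.g. for testing.
ce_by_dimm() {
    root="${1:-/sys/devices/system/edac/mc}"
    grep -H . "$root"/mc*/csrow*/ch*_ce_count 2>/dev/null |
    awk -F: '
    {
        # $1 = counter path, $2 = count; take mc/csrow/ch from the path
        n = split($1, p, "/")
        mc = p[n-2]; cs = p[n-1]; ch = p[n]
        sub(/^mc/, "", mc); sub(/^csrow/, "", cs)
        sub(/_ce_count$/, "", ch); sub(/^ch/, "", ch)
        lo = cs - cs % 2                 # first csrow of the assumed pair
        sum["mc" mc " ch" ch " csrow" lo "/" (lo + 1)] += $2
    }
    END { for (k in sum) if (sum[k] > 0) print k ": " sum[k] }'
}
```

Given the csrow0 and csrow1 counts above, the matching line would be `mc2 ch1 csrow0/1: 9390` (4976 + 4414). If rows 4 and 5 from the log also have nonzero counters, they would appear as a separate `csrow4/5` line.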