What are Machine Check Exceptions (or MCE)?
A machine check exception is an error detected by your system's processor. There are two major types of MCE: a notice or warning, and a fatal exception. A warning is recorded as a "Machine Check Event logged" notice in your system logs and can be viewed later with Linux utilities. A fatal MCE causes the machine to stop responding, and the details of the MCE are printed to the system console.
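Key points of this article. Case 2 is the one to read; it gives a good picture of how mcelog works.
- What does mcelog do?
- mcelog is the tool used on x86 Linux systems to check for hardware errors, especially memory and CPU errors.
- How is mcelog run? What are the pros and cons of the three modes?
- There are three ways to run it: cron, daemon, and trigger.
- cron is the crudest mode and can lose events; trigger is a more advanced, event-driven mode. On el6 and el7 we generally use daemon mode.
- In production: how does it run on el6 and el7?
- On el6 the default should be cron, running once an hour. Daemon mode is also possible (you have to run mcelog --daemon by hand). By default, logs go to /var/log/mcelog and /var/log/messages.
- On el7 it is started by default through mcelog.service, which is equivalent to daemon mode. However, by default logs go only to /var/log/messages, and /var/log/mcelog does not exist. To get that file, add --logfile=/var/log/mcelog to the start command.
- How can a hardware error be simulated to verify that mcelog is working?
- There are tools for this, mce-inject and mce-test; see Case 2 below.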
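Introduction to mcelog
Correctable and uncorrectable hardware errors are collectively called machine check exceptions (MCE). The CPU can correct errors by itself and notify the underlying operating system of problems related to the CPU or cache. The CPU can also recover from certain errors. Oracle Linux can use mcelog as the logging subsystem for machine checks. First, the package must be installed on the server with the following commands.
yum install mcelog.x86_64
service mcelogd start
chkconfig mcelogd on
or
mcelog --daemon
or
systemctl restart mcelog.service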
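mcelog is the tool used on x86 Linux systems to check for hardware errors, especially memory and CPU errors.
The mcelog package can work in more than one way, from cron or as a daemon, depending on your operating system version.
The cron job in /etc/cron.hourly/mcelog.cron checks for MCEs every hour and saves them to /var/log/mcelog. Running mcelog as a daemon is the better method, because hardware errors are detected sooner and logged immediately instead of waiting for the cron job to run. mcelog can detect errors such as bus errors, memory errors, and CPU cache errors, and can give advance notice when a hardware failure is imminent.
mcelog captures two kinds of errors: corrected and uncorrected. Corrected errors are events handled by the CPU; they can be used to identify trends that may predict larger problems.
Uncorrected errors are critical exceptions. If the CPU cannot recover, they usually lead to a kernel panic on the system, which causes application resets and outages. For uncorrected errors, whether mcelog can capture the error depends on whether the error leads to a warm reboot or a hard reset. After a warm reboot, the information is captured by mcelog and is visible once the system is back. A hard reset loses the data, and mcelog may not capture the event.
The following example mcelog message shows a corrected error on CPU 1: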
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 2
ADDR 1234
TIME 1364535025 Fri Mar 29 01:30:25 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS 9400000000000000 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
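The "MCi status" lines in the record above are decoded from the top bits of the STATUS value. The following is a minimal shell sketch of that decoding, not mcelog's own code. It assumes the IA32_MCi_STATUS bit layout from the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC) and a full 16-digit hex value; the function name decode_mci_status is made up for illustration.

```shell
# Hypothetical helper: print the flag names set in the top byte of a
# 16-digit hex STATUS value (the number printed after "STATUS").
decode_mci_status() {
    [ "${#1}" -eq 16 ] || { echo "need 16 hex digits" >&2; return 1; }
    # The top byte holds bits 63..56 of the register.
    top=$(printf '%s' "$1" | cut -c1-2)
    top=$((0x$top))
    out=""
    # mask:name pairs for bits 63 (VAL) down to 57 (PCC)
    for pair in 128:VAL 64:OVER 32:UC 16:EN 8:MISCV 4:ADDRV 2:PCC; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $((top & mask)) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}
```

For the record above, decode_mci_status 9400000000000000 prints VAL EN ADDRV. UC is clear, so the error is a corrected one, and EN and ADDRV correspond to "Error enabled" and "MCi_ADDR register valid".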
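For testing and troubleshooting, the mce-test package can be used to generate fake hardware MCE events and run system tests.
mce-test ships with a rich set of default tests that simulate real hardware failures and can even cause a kernel panic. A few configuration steps are needed before running such tests on a system.
First, several supporting packages must be installed to set up mce-test on the test system. Use the following command: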
yum install gcc.x86_64 gcc-c++.x86_64 flex.x86_64 dialog.x86_64 ras-utils.x86_64 git.x86_64
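More on mcelog
How mcelog is started
- cron: the oldest mode. It has drawbacks: it is a scheduled job, so some events are lost.
- daemon: the mode used on el7.
- trigger: a somewhat more advanced mode that runs when an event fires; see man mcelog.
Files related to mcelog
/dev/mcelog device file
/var/log/mcelog log file for mcelog messages
/etc/mcelog/mcelog.conf configuration file
/var/run/mcelog.pid
By default, fault logs are recorded only in /var/log/mcelog and are not written to the system log.
If they should also appear in the system log, edit /etc/mcelog/mcelog.conf, remove the leading # from the relevant option, and save.
# log output options
# Log decoded machine checks in syslog (default stdout or syslog for daemon)
#syslog = yes
# Log decoded machine checks in syslog with error level
#syslog-error = yes
# Never log anything to syslog
#no-syslog = yes
# Append log output to logfile instead of stdout. Only when no syslog logging is active
#logfile = filename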
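The edit described above can also be scripted. This is a small sketch, assuming the option appears exactly as the commented line "#syslog = yes" shown in the excerpt; the function name enable_mcelog_syslog is hypothetical, and it is safer to try it on a copy of the file first.

```shell
# Hypothetical helper: uncomment "#syslog = yes" in an mcelog.conf file.
# $1 = path to the config file (defaults to the path named in the text).
enable_mcelog_syslog() {
    conf=${1:-/etc/mcelog/mcelog.conf}
    tmp=$(mktemp) || return 1
    # Match only the exact syslog option line, so "#syslog-error" and
    # "#no-syslog" stay commented out.
    sed 's/^#[[:space:]]*syslog[[:space:]]*=[[:space:]]*yes[[:space:]]*$/syslog = yes/' \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

After running enable_mcelog_syslog as root, mcelog has to be restarted for the change to take effect.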
mcelog on el6
How mcelog runs on el6
On el6 mcelog is run from cron. Installing mcelog automatically creates the following file:
/etc/cron.hourly/mcelog.cron
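With the default configuration, /etc/cron.hourly/mcelog.cron runs once an hour.
This cron script is installed by the mcelog package. mcelog is still actively developed and maintained, and can be obtained from the kernel tools site or from GitHub (andikleen/mcelog).
#ps aux | grep mcelog
root 4177 0.0 0.0 6756 616 ? Ss Aug18 0:00 /usr/sbin/mcelog --daemon
Running mcelog manually on el6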
# mcelog --daemon
Viewing the mcelog log on el6
#tail /var/log/mcelog
No output means everything is normal.
Check whether the mcelog daemon has detected any errors:
# mcelog --client
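No output means everything is normal.
Decoding mcelog output captured when the system misbehaved:
# mcelog --ascii < file.log
or
# mcelog --ascii --file file.log
Case 1: output that is completely unreadable
mcelog --ascii --file /var/log/mcelog |tail and sudo tail /var/log/mcelog show the same result.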
[root@zxl /home/ahao.mah]$mcelog --ascii --file /var/log/mcelog |tail
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
STATUS cc000140000800c2 MCGSTATUS 0
MCGCAP 1000c19 APICID 0 SOCKETID 0
Hardware event. This is not a software error.
CPU 0 BANK 0
MISC 0 ADDR 0
STATUS cc000140000800c2 MCGSTATUS 0
MCGCAP 1000c19 APICID 0 SOCKETID 0
(Fields were incomplete)
[root@zxl /home/ahao.mah]$sudo tail /var/log/mcelog
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
STATUS cc000140000800c2 MCGSTATUS 0
MCGCAP 1000c19 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 62
[root@zxl /home/ahao.mah]$sudo mcelog --client
Memory errors
SOCKET 0 CHANNEL any DIMM any
corrected memory errors: 13 total 0 in 24h
uncorrected memory errors: 0 total 0 in 24h
SOCKET 0 CHANNEL 2 DIMM any
corrected memory errors: 4 total 4 in 24h
uncorrected memory errors: 0 total 0 in 24h
Per page corrected memory statistics:
198f0b2000: total 1 seen "1 in 24h" online
198f0b9000: total 1 seen "1 in 24h" online
1b8f0b5000: total 1 seen "1 in 24h" online
1b8f0be000: total 1 seen "1 in 24h" online
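One way to make the mcelog --client output above easier to read is to keep only the locations that saw corrected errors in the last 24 hours. The sketch below does that with awk. It assumes the layout shown above, and also tolerates the counts being printed on separate lines, which some mcelog versions may do. The function name mcelog_client_recent_ce is invented for this example.

```shell
# Hypothetical filter: read `mcelog --client` output on stdin and print
# "<location>: <count>" for each location with corrected errors in 24h.
mcelog_client_recent_ce() {
    awk '
        /^SOCKET /                    { loc = $0; kind = "" }
        /^corrected memory errors:/   { kind = "ce" }
        /^uncorrected memory errors:/ { kind = "ue" }
        /^Per page/                   { kind = "" }
        kind == "ce" {
            # Look for the "<N> in 24h" triple anywhere on the line.
            for (i = 2; i < NF; i++)
                if ($i == "in" && $(i + 1) == "24h" && $(i - 1) + 0 > 0)
                    print loc ": " $(i - 1)
        }
    '
}
```

A typical call would be mcelog --client | mcelog_client_recent_ce, run as root. For the output above it leaves a single line, SOCKET 0 CHANNEL 2 DIMM any: 4, which points at channel 2 of socket 0.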
mcelog on el7
Running mcelog on el7
- Started at boot by default: mcelog.service
#systemctl is-enabled mcelog.service
enabled
- el7 does not run mcelog from cron; it is managed by mcelog.service
#systemctl cat mcelog.service
# /usr/lib/systemd/system/mcelog.service
[Unit]
Description=Machine Check Exception Logging Daemon
After=syslog.target
# FIXME - due to upstream kernel bug always start the mcelog process
# twice using the following ExecStartPre hack. This needs fixing.
# There is a bug filed against systemd for the ExecStartPre bit
# since it is not possible to specify that the ExecStarPre bit
# is allowed and expected to fail without aborting the daemon.
[Service]
Type=forking
ExecStartPre=/etc/mcelog/mcelog.setup
ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --syslog
StandardOutput=syslog
[Install]
WantedBy=multi-user.target
On RHEL 7.x, running mcelog from cron has been dropped; the mcelog.service process is started at boot instead. ps shows the following mcelog service running:
/usr/sbin/mcelog --ignorenodev --daemon --syslog
--ignorenodev Exit silently when the device cannot be opened
--daemon Run in background waiting for events (needs newer kernel)
--syslog Log decoded machine checks in syslog (default stdout or syslog for daemon)
Viewing the mcelog log on el7
Configuration related to mcelog
#grep MCE /boot/config-2.6.32-220.23.2.ali878.el6.x86_64
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m
CONFIG_EDAC_DECODE_MCE=m
# CONFIG_EDAC_MCE_INJ is not set
CONFIG_EDAC_MCE=y
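Before trying error injection it can help to check a single option in that config file, since the mce-inject module is only available when CONFIG_X86_MCE_INJECT is y or m. A small sketch of such a lookup follows; the function name kconfig_value and its output words ("n", "missing") are choices made for this example.

```shell
# Hypothetical helper: print the value of one kernel config option.
# $1 = option name, $2 = config file (for example /boot/config-<version>).
# Prints y or m as found, "n" for "is not set", "missing" otherwise.
kconfig_value() {
    awk -F= -v opt="$1" '
        $1 == opt                         { print $2; found = 1; exit }
        $0 == ("# " opt " is not set")    { print "n"; found = 1; exit }
        END { if (!found) print "missing" }
    ' "$2"
}
```

On the machine above, kconfig_value CONFIG_X86_MCE_INJECT /boot/config-2.6.32-220.23.2.ali878.el6.x86_64 would print m, so the injector is built as a loadable module.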
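Security
mcelog has to run as root because it needs to trigger actions such as page-offlining, which require CAP_SYS_ADMIN. It also needs to open the /dev/mcelog device and a unix socket used for client support.
When mcelog runs in daemon mode, it listens on a unix socket and handles requests from mcelog --client. By default it checks the uid/gid of the request, and the default allowed values are 0/0; this is configurable. Client handling and responses are carried out with the daemon's full privileges.
Testing
Using mce-inject
mce-inject is used to test whether mcelog picks up hardware error information and decodes it correctly. mce-inject injects a specified error into the kernel, which makes it easy to see whether mcelog is working properly.
Note that when mce-inject is used to inject an unrecoverable error (for example, fatal) into the kernel, the machine may hang or reboot. This can be avoided by changing the tolerant file under the sys filesystem.
Installing mce-inject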
#yum install -y ras-utils
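Configuring the tolerant file
Location: /sys/devices/system/machinecheck/machinecheck*/
Note: the * in machinecheck* depends on the number of CPUs. On a dual-core machine there are two directories, machinecheck0 and machinecheck1, and each contains a tolerant file that stores the tolerance level.
Function: lets the user choose how tolerant the system is when a given hardware error occurs. For example, with tolerant set to 1, a fatal error makes the machine hang and reboot, and the error is not recorded. With tolerant set to 3 (this value is for testing only), the machine tolerates a fatal error and does not react to it, so it does not hang or reboot, and the error information is recorded.
Viewing tolerant
As root, go to the corresponding directory and read the file. For example:
#cd /sys/devices/system/machinecheck/machinecheck0
#cat tolerant
This shows the tolerant value for CPU0.
Setting tolerant
As root, go to the corresponding directory and change the value. There are many ways to set tolerant, for example:
#cd /sys/devices/system/machinecheck/machinecheck0
#echo 3 > tolerant
Meaning of the values
tolerant can take the values 0, 1, 2, and 3.
0: always panic on uncorrected errors, log corrected errors
1: panic or SIGBUS on uncorrected errors, log corrected errors
2: SIGBUS or log uncorrected errors (if possible), log corrected errors
3: never panic or SIGBUS, log all errors (for testing only)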
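Viewing and setting the value can be wrapped in two small shell functions. This is only a sketch under a few assumptions: the file is named tolerant (as in the echo command above), it lives under the machinecheck* directories listed earlier, and the names MC_BASE, show_tolerant and set_tolerant are invented here. The functions are only defined, nothing is changed until they are called as root.

```shell
# Base directory from the text; overridable so the functions can be
# tried against a scratch directory instead of the real sysfs.
MC_BASE=${MC_BASE:-/sys/devices/system/machinecheck}

# Print "<machinecheckN> <value>" for every CPU directory.
show_tolerant() {
    for f in "$MC_BASE"/machinecheck*/tolerant; do
        [ -r "$f" ] || continue
        cpu=${f%/tolerant}
        printf '%s %s\n' "${cpu##*/}" "$(cat "$f")"
    done
}

# Write one of the documented values (0-3) to every tolerant file.
set_tolerant() {
    case $1 in
        0|1|2|3) ;;
        *) echo "tolerant must be 0, 1, 2 or 3" >&2; return 1 ;;
    esac
    for f in "$MC_BASE"/machinecheck*/tolerant; do
        [ -w "$f" ] || continue
        echo "$1" > "$f"
    done
}
```

A test session would call show_tolerant to note the current value, set_tolerant 3 before injecting, and then set_tolerant with the noted value afterwards, since 3 is meant for testing only.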
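Case 2: using mce-inject
mce-inject is also simple to use, but before using it, set tolerant to 3 to prevent a hang or reboot. Then, as root in a terminal, run:
mce-inject filename ...
filename holds the specific error type to inject
1. Install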
yum install gcc.x86_64 gcc-c++.x86_64 flex.x86_64 dialog.x86_64 ras-utils.x86_64 git.x86_64
2. Create the input file
For example, an mce-inject input file named correct contains:
#cat correct
CPU 1 BANK 2
STATUS corrected
RIP 0x12341234
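The same file can be generated for another CPU or bank instead of being typed by hand. A tiny sketch follows: the three-line layout and the RIP value are copied from the example above, and the function name make_mce_input is made up.

```shell
# Hypothetical helper: print a corrected-error input file for mce-inject.
# $1 = CPU number, $2 = bank number (both plain decimal numbers).
make_mce_input() {
    for n in "$1" "$2"; do
        case $n in
            ''|*[!0-9]*) echo "usage: make_mce_input CPU BANK" >&2; return 1 ;;
        esac
    done
    # Same layout as the "correct" file shown in the text.
    printf 'CPU %s BANK %s\nSTATUS corrected\nRIP 0x12341234\n' "$1" "$2"
}
```

Running make_mce_input 1 2 > correct reproduces the file shown above.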
3. Load the mce-inject module
# modprobe mce-inject
#modprobe -l | grep mce-inject
kernel/arch/x86/kernel/cpu/mcheck/mce-inject.ko
4. In a terminal, run
#mce-inject ./correct
The error is then injected; the detailed output can be seen in /var/log/mcelog.
5. Check /var/log/mcelog and /var/log/messages
#tail /var/log/mcelog
CPU 1 BANK 2
TIME 1475065726 Wed Sep 28 20:28:46 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: No Error
STATUS 9000000000000000 MCGSTATUS 0
MCGCAP 1000c12 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 45
#cat /var/log/messages
Sep 28 20:41:24 dnstest08.tbc kernel: : [16423350.358386] Starting machine check poll CPU 1
Sep 28 20:41:24 dnstest08.tbc kernel: : [16423350.371252] [Hardware Error]: Machine check events logged
6. The same approach works on el7
tail /var/log/messages shows the log entries, but on el7 the file /var/log/mcelog does not exist by default.
The reason is that by default the output goes to /var/log/messages and not to /var/log/mcelog. To have it written to /var/log/mcelog as well, add the option --logfile=/var/log/mcelog in the mcelog service file and then restart mcelog.
ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --syslog --logfile=/var/log/mcelog
#tail /var/log/messages -f
Sep 28 20:57:53 jiangyi02 kernel: Starting machine check poll CPU 1
Sep 28 20:57:53 jiangyi02 kernel: Machine check poll done on CPU 1
Sep 28 20:57:53 jiangyi02 mcelog: Hardware event. This is not a software error.
Sep 28 20:57:53 jiangyi02 mcelog: MCE 0
Sep 28 20:57:53 jiangyi02 mcelog: CPU 1 BANK 2
Sep 28 20:57:53 jiangyi02 mcelog: TIME 1475067473 Wed Sep 28 20:57:53 2016
Sep 28 20:57:53 jiangyi02 mcelog: MCG status:
Sep 28 20:57:53 jiangyi02 mcelog: MCi status:
Sep 28 20:57:53 jiangyi02 mcelog: Corrected error
Sep 28 20:57:53 jiangyi02 mcelog: Error enabled
Sep 28 20:57:53 jiangyi02 mcelog: MCA: No Error
Sep 28 20:57:53 jiangyi02 mcelog: STATUS 9000000000000000 MCGSTATUS 0
Sep 28 20:57:53 jiangyi02 mcelog: MCGCAP 1000c12 APICID 2 SOCKETID 0
Sep 28 20:57:53 jiangyi02 mcelog: CPUID Vendor Intel Family 6 Model 45
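Instead of editing the packaged unit file, the ExecStart change above can also be placed in a systemd drop-in, which survives package updates. This is an alternative to what the article does, sketched under the assumption that the unit is named mcelog.service as shown earlier; the function name write_mcelog_dropin and the file name logfile.conf are arbitrary.

```shell
# Hypothetical helper: write a drop-in that replaces ExecStart so that
# mcelog also logs to /var/log/mcelog.
# $1 = drop-in directory (defaults to the standard systemd location).
write_mcelog_dropin() {
    dir=${1:-/etc/systemd/system/mcelog.service.d}
    mkdir -p "$dir" || return 1
    # The empty "ExecStart=" clears the command from the packaged unit
    # before the new one is set.
    cat > "$dir/logfile.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --syslog --logfile=/var/log/mcelog
EOF
}
```

After write_mcelog_dropin has been run as root, systemctl daemon-reload followed by systemctl restart mcelog.service picks up the new command line.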
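The mce-inject executable can be used directly by feeding it text files, but for testing on a system the more capable approach is the mce-test program.
#git clone https://github.com/andikleen/mce-test.git
Cloning into 'mce-test'...
remote: Counting objects: 2197, done.
remote: Total 2197 (delta 0), reused 0 (delta 0), pack-reused 2197
Receiving objects: 100% (2197/2197), 409.06 KiB | 57.00 KiB/s, done.
Resolving deltas: 100% (1220/1220), done.
Checking connectivity... done.
After cloning the git repository, change to the mce-test directory and run mcemenu, which opens the main menu of the mce-test tool.
The first thing to do is compile the test suite, so choose the Compile option to build all the executables the suite needs. Tests can then be run from the Execute menu. Once the tests have run, the Results menu shows the test results. The documents in the mce-test/doc directory contain all the information about the tests and about getting the most out of the suite.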
Hardware troubleshooting
- Log errors: /var/log/messages or /var/log/mcelog shows errors. Is there a way to work out which memory DIMM mc0: csrow6: CPU_SrcID#0_Ha#0_Channel#3 refers to, and what do channel and csrow mean here?
[30200989.742558] { 1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[30200989.742562] { 1}[Hardware Error]: It has been corrected by h/w and requires no further action
[30200989.742566] { 1}[Hardware Error]: event severity: corrected
[30200989.742568] { 1}[Hardware Error]: Error 0, type: corrected
[30200989.742571] { 1}[Hardware Error]: section type: unknown,330f1140-72a5-11df-9690-0002a5d5c51b
[30200989.742578] { 2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[30200989.742580] { 2}[Hardware Error]: It has been corrected by h/w and requires no further action
[30200989.742608] { 2}[Hardware Error]: event severity: corrected
[30200989.742609] { 2}[Hardware Error]: Error 0, type: corrected
[30200989.742610] { 2}[Hardware Error]: fru_text: A5
[30200989.742614] { 2}[Hardware Error]: section_type: memory error
[30200989.742615] { 2}[Hardware Error]: error_status:0x0000000000000400
[30200989.742617] { 2}[Hardware Error]: physical_address:0x0000000f98cf5fc0
[30200989.742619] { 2}[Hardware Error]: node: 1 card: 1 module: 0 rank: 1 bank: 1 row: 42861 column: 192
[30200989.742621] { 2}[Hardware Error]: error_type: 13, scrub corrected error
[30200989.742623] { 2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
[30200989.742655] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[30200989.742661] EDAC skx MC1: CPU 0: Machine Check Event: 0 Bank1: 940000000000009f
[30200989.742672] EDAC skx MC1: TSC 105192b3d65b124
[30200989.742674] EDAC skx MC1: ADDR f98cf5fc0
[30200989.742675] EDAC skx MC1: MISC 0
[30200989.742677] EDAC skx MC1: PROCESSOR 0:50654 TIME 1557145393 SOCKET 0 APIC 0
[30200989.742694] EDAC MC1: 0 CE memory read error on CPU_SrcID#0_MC#1_Chan#1_DIMM#0 (channel:1 slot:0 page:0xf98cf5 offset:0xfc0 grain:32 syndrome:0x0 - err_code:0000:009f socket:0 imc:1 rank:1 bg:3 ba:1 row:a362 col:1a8)
[30200989.744952] __get_any_page: 0xf98cf5 free huge page
[30201088.985651] mce: [Hardware Error]: Machine check events logged
On Linux, the output of edac-util can be used to confirm where the faulty hardware is, for example mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 1 Corrected Errors
[root@zxl]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 1 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
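On machines with many DIMMs the listing above gets long, so it can help to filter it down to the entries with a non-zero corrected error count. The sketch below assumes the four-field "mcN: csrowN: label: N Corrected Errors" layout shown above; the function name edac_nonzero_ce is invented for this example.

```shell
# Hypothetical filter: read `edac-util -v` output on stdin and print
# "<mc> <csrow> <label> <count>" for each DIMM with corrected errors.
edac_nonzero_ce() {
    awk -F': ' '
        NF == 4 && $4 ~ /^[0-9]+ Corrected Errors/ {
            split($4, a, " ")
            if (a[1] + 0 > 0)
                print $1, $2, $3, a[1]
        }
    '
}
```

Used as edac-util -v | edac_nonzero_ce, the output above reduces to one line, mc1 csrow0 CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 1, that is, the DIMM in slot 0 of channel 0 on the second socket's memory controller.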
Related links