Machine check - hardware errors - Intel 4790K
Posted: 21. Jan 2016, 17:39
Running Scientific Linux 7.2
I have run VirtualBox for several years (to run Windows 2000 to support a system I use), previously on an Intel E6500 CPU (Wolfdale family based hardware on a Gigabyte EP43-UD3L).
I have now set up a new Linux system based on an Intel 4790K CPU (Haswell family based hardware on a Gigabyte GA-Z97X-UD5H).
I now get, for example:
The kernel log indicates that hardware errors were detected.
:System log may have more information.
:The last 20 mcelog lines of system log are:
:==========================================
:Dec 21 17:59:13 fax mcelog: MCG status:
:Dec 21 17:59:13 fax mcelog: MCi status:
:Dec 21 17:59:13 fax mcelog: Corrected error
:Dec 21 17:59:13 fax mcelog: Error enabled
:Dec 21 17:59:13 fax mcelog: MCA: Internal parity error
:Dec 21 17:59:13 fax mcelog: STATUS 90000040000f0005 MCGSTATUS 0
:Dec 21 17:59:13 fax mcelog: MCGCAP c09 APICID 6 SOCKETID 0
:Dec 21 17:59:13 fax mcelog: CPUID Vendor Intel Family 6 Model 60
:Dec 22 09:42:37 fax mcelog: Hardware event. This is not a software error.
:Dec 22 09:42:37 fax mcelog: MCE 0
:Dec 22 09:42:37 fax mcelog: CPU 0 BANK 0
:Dec 22 09:42:37 fax mcelog: TIME 1450777357 Tue Dec 22 09:42:37 2015
:Dec 22 09:42:37 fax mcelog: MCG status:
:Dec 22 09:42:37 fax mcelog: MCi status:
:Dec 22 09:42:37 fax mcelog: Corrected error
:Dec 22 09:42:37 fax mcelog: Error enabled
:Dec 22 09:42:37 fax mcelog: MCA: Internal parity error
:Dec 22 09:42:37 fax mcelog: STATUS 90000040000f0005 MCGSTATUS 0
:Dec 22 09:42:37 fax mcelog: MCGCAP c09 APICID 0 SOCKETID 0
:Dec 22 09:42:37 fax mcelog: CPUID Vendor Intel Family 6 Model 60
and
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCG status:
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCi status:
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: Corrected error
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: Error enabled
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCA: Internal parity error
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: STATUS 90000040000f0005 MCGSTATUS 0
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCGCAP c09 APICID 6 SOCKETID 0
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: CPUID Vendor Intel Family 6 Model 60
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: Hardware event. This is not a software error.
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCE 0
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: CPU 1 BANK 0
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: TIME 1453383350 Thu Jan 21 13:35:50 2016
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCG status:
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCi status:
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: Corrected error
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: Error enabled
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCA: Internal parity error
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: STATUS 90000040000f0005 MCGSTATUS 0
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: MCGCAP c09 APICID 2 SOCKETID 0
Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: CPUID Vendor Intel Family 6 Model 60
Although they are reported as hardware errors and are not fatal, they are annoying, and they can upset a backup: if a "hardware error" occurs while backing up, it causes an rsync error.
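For anyone wanting to check the STATUS value by hand, here is a minimal Python sketch that splits 90000040000f0005 into its fields. It assumes the architectural IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual; the function name decode_mci_status is my own and is not part of mcelog. The corrected-error count field is only meaningful when bit 10 of MCGCAP is set, which is the case for c09.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields.

    Assumption: standard bit layout from the Intel SDM (not taken from mcelog).
    """
    return {
        "valid": bool((status >> 63) & 1),            # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),         # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),      # UC: 0 means the error was corrected
        "enabled": bool((status >> 60) & 1),          # EN: reporting enabled for this error
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38, valid if MCGCAP bit 10 is set
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,            # 0x0005 is "internal parity error"
    }


fields = decode_mci_status(0x90000040000F0005)
for name, value in fields.items():
    print(f"{name}: {value}")
```

For this value the decoded fields agree with what mcelog prints: valid, enabled, not uncorrected, with MCA error code 0x0005 (internal parity error).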
These errors only occur when a virtual machine is running under VirtualBox. This happens irrespective of which virtual machine is running. These virtual machines were created on the previous Intel systems and imported to the new 4790K system via an appliance.
I disabled ACPI in these Virtual Machines but the problem persists.
I cannot relate the frequency of the errors to activity on the virtual machines: on days when the machine is just idle, I can have more errors than when it is in active use.
What I have noticed is that since updating VirtualBox to VirtualBox-5.0-5.0.12_104815_el7-1.x86_64, the frequency of machine checks has more than doubled, from 12 to over 30 a day. The machine check dumps rapidly fill up /var/log, which means pruning on a regular basis.
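To put numbers on the frequency, a small sketch like the following can tally events per day. It assumes each event produces exactly one "Hardware event" line from mcelog, as in the excerpts above, and that the log uses the usual syslog text format starting with month and day; the function name count_events_per_day is illustrative only.

```python
from collections import Counter


def count_events_per_day(lines):
    """Count mcelog 'Hardware event' lines, grouped by syslog month and day."""
    counts = Counter()
    for line in lines:
        if "mcelog" in line and "Hardware event" in line:
            # Quoted report output prefixes each line with ":", so strip it first.
            parts = line.lstrip(":").split()
            counts[" ".join(parts[:2])] += 1
    return counts


# Sample lines copied from the log excerpts in this post.
sample = [
    ":Dec 22 09:42:37 fax mcelog: Hardware event. This is not a software error.",
    ":Dec 22 09:42:37 fax mcelog: MCE 0",
    "Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: Hardware event. This is not a software error.",
    "Jan 21 13:35:50 fax.whealvor.co.uk mcelog[988]: CPU 1 BANK 0",
]

for day, n in count_events_per_day(sample).items():
    print(day, n)
```

The same function can be fed the lines of a real log file opened with open(); which file holds the mcelog output depends on the syslog configuration.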
I have updated the BIOS on the motherboard to the latest stable version for the Gigabyte GA-Z97X-UD5H.
I have browsed the web and I am aware of problems with the Haswell chipset and virtual machines. I have contacted Intel, who have replaced the CPU, but the problem still exists. I have suggested a microcode (BIOS) fix, but this has been met with silence.
Any ideas as to how to overcome this problem?
Meriel