Virtual SATA/IDE controller errors
Posted: 7. Dec 2009, 22:27
Summary: disk controller errors inside linux guests
In another thread I brought up this issue but I think it deserves it's own thread and a better explanation.
Essentially, I have experienced persistent issues with various disk controller errors being reported by my linux guests under virtualbox 3.0.12 and earlier, and also as well vmware server 1.0, across many many different physical hosts and host kernel versions and guest kernel versions and scsi, ide and sata guest disk controllers. These errors are not always the same exact message, but most of them result in aborted journals on my ext3 partitions which then cause machine malfunction since the disks all remount readonly on errors and these guests can't get much useful work accomplished when their disk is readonly.
For example, the following two dmesg outputs are the same physical hosts running vbox 3.0.8, and identical guest kernels:
Guest #1:
[1710985.943387] sd 0:0:0:0: [sda] Unhandled error code
[1710985.944568] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[1710985.946582] end_request: I/O error, dev sda, sector 272831
[1710985.947830] sd 0:0:0:0: [sda] Unhandled error code
[1710985.948932] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[1710985.950989] end_request: I/O error, dev sda, sector 272839
[1710985.952191] Aborting journal on device sda1.
[1710987.312987] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[1710987.315182] Remounting filesystem read-only
[1710987.316344] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[1710987.557810] EXT3-fs error (device sda1) in ext3_dirty_inode: Journal has aborted
Guest #2:
[1690993.020333] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[1690993.022422] ata1.00: cmd 61/02:00:53:b2:11/00:00:00:00:00/40 tag 0 ncq 1024 out
[1690993.022423] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[1690993.546246] ata1.00: status: { DRDY }
[1690993.547266] ata1.00: cmd 61/04:08:37:09:00/00:00:00:00:00/40 tag 1 ncq 2048 out
[1690993.547268] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[1690993.551319] ata1.00: status: { DRDY }
[1690993.552356] ata1: hard resetting link
[1690994.262256] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.260099] ata1.00: qc timeout (cmd 0xec)
[1690999.261215] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[1690999.262451] ata1.00: revalidation failed (errno=-5)
[1690999.263587] ata1: hard resetting link
[1690999.610271] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.612215] ata1.00: configured for UDMA/133
[1690999.613285] ata1.00: device reported invalid CHS sector 0
[1690999.614481] ata1.00: device reported invalid CHS sector 0
[1690999.615753] ata1: EH complete
[1690999.616946] ata1: illegal qc_active transition (00000001->00000002)
[1690999.618252] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1690999.620277] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1690999.620278] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1690999.624394] ata1.00: status: { DRDY }
[1690999.626730] ata1: hard resetting link
[1690999.970284] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.972034] ata1.00: configured for UDMA/133
[1690999.973121] ata1: EH complete
[1690999.974210] ata1: illegal qc_active transition (00000001->00000002)
[1690999.975617] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1690999.977603] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1690999.977605] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1690999.981698] ata1.00: status: { DRDY }
[1690999.982731] ata1: hard resetting link
[1691000.330264] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1691000.331988] ata1.00: configured for UDMA/133
[1691000.333072] ata1: EH complete
[1691000.334154] ata1: illegal qc_active transition (00000001->00000002)
[1691000.335545] ata1.00: NCQ disabled due to excessive errors
[1691000.336715] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1691000.338694] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1691000.338696] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1691000.343365] ata1.00: status: { DRDY }
[1691000.344391] ata1: hard resetting link
[1691000.691380] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1691000.693144] ata1.00: configured for UDMA/133
[1691000.694234] ata1: EH complete
The host in this case is a big box with redundant power, hardware raid5 and about 700gb online. It hasn't reported any disk i/o problems, odd kernel messages, and I have seen no evidence of file corruption on this box ever. These occur in this case at different times and don't appear related to one another. vmware server would behave similarly with ide and both flavors of scsi controllers as well but I quit using it about 1 1/2 years ago to explore other options such as qemu and now virtualbox. Qemu never had this problem as far as I am aware (although it has other issues which make it a no go for my application).
My testing and guessing leads me down the road of thinking that it's host loading which is the root cause of this problem. This box runs big backup jobs and gets quite loaded during the night when the backup is running. I don't think however, that any amount of loading should result in errors like this in guests, and I am really interested in a solution here if one is available.
thanks.
In another thread I brought up this issue but I think it deserves it's own thread and a better explanation.
Essentially, I have experienced persistent issues with various disk controller errors being reported by my linux guests under virtualbox 3.0.12 and earlier, and also as well vmware server 1.0, across many many different physical hosts and host kernel versions and guest kernel versions and scsi, ide and sata guest disk controllers. These errors are not always the same exact message, but most of them result in aborted journals on my ext3 partitions which then cause machine malfunction since the disks all remount readonly on errors and these guests can't get much useful work accomplished when their disk is readonly.
For example, the following two dmesg outputs are the same physical hosts running vbox 3.0.8, and identical guest kernels:
Guest #1:
[1710985.943387] sd 0:0:0:0: [sda] Unhandled error code
[1710985.944568] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[1710985.946582] end_request: I/O error, dev sda, sector 272831
[1710985.947830] sd 0:0:0:0: [sda] Unhandled error code
[1710985.948932] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[1710985.950989] end_request: I/O error, dev sda, sector 272839
[1710985.952191] Aborting journal on device sda1.
[1710987.312987] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[1710987.315182] Remounting filesystem read-only
[1710987.316344] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[1710987.557810] EXT3-fs error (device sda1) in ext3_dirty_inode: Journal has aborted
Guest #2:
[1690993.020333] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[1690993.022422] ata1.00: cmd 61/02:00:53:b2:11/00:00:00:00:00/40 tag 0 ncq 1024 out
[1690993.022423] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[1690993.546246] ata1.00: status: { DRDY }
[1690993.547266] ata1.00: cmd 61/04:08:37:09:00/00:00:00:00:00/40 tag 1 ncq 2048 out
[1690993.547268] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[1690993.551319] ata1.00: status: { DRDY }
[1690993.552356] ata1: hard resetting link
[1690994.262256] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.260099] ata1.00: qc timeout (cmd 0xec)
[1690999.261215] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[1690999.262451] ata1.00: revalidation failed (errno=-5)
[1690999.263587] ata1: hard resetting link
[1690999.610271] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.612215] ata1.00: configured for UDMA/133
[1690999.613285] ata1.00: device reported invalid CHS sector 0
[1690999.614481] ata1.00: device reported invalid CHS sector 0
[1690999.615753] ata1: EH complete
[1690999.616946] ata1: illegal qc_active transition (00000001->00000002)
[1690999.618252] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1690999.620277] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1690999.620278] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1690999.624394] ata1.00: status: { DRDY }
[1690999.626730] ata1: hard resetting link
[1690999.970284] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.972034] ata1.00: configured for UDMA/133
[1690999.973121] ata1: EH complete
[1690999.974210] ata1: illegal qc_active transition (00000001->00000002)
[1690999.975617] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1690999.977603] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1690999.977605] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1690999.981698] ata1.00: status: { DRDY }
[1690999.982731] ata1: hard resetting link
[1691000.330264] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1691000.331988] ata1.00: configured for UDMA/133
[1691000.333072] ata1: EH complete
[1691000.334154] ata1: illegal qc_active transition (00000001->00000002)
[1691000.335545] ata1.00: NCQ disabled due to excessive errors
[1691000.336715] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1691000.338694] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1691000.338696] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1691000.343365] ata1.00: status: { DRDY }
[1691000.344391] ata1: hard resetting link
[1691000.691380] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1691000.693144] ata1.00: configured for UDMA/133
[1691000.694234] ata1: EH complete
The host in this case is a big box with redundant power, hardware raid5 and about 700gb online. It hasn't reported any disk i/o problems, odd kernel messages, and I have seen no evidence of file corruption on this box ever. These occur in this case at different times and don't appear related to one another. vmware server would behave similarly with ide and both flavors of scsi controllers as well but I quit using it about 1 1/2 years ago to explore other options such as qemu and now virtualbox. Qemu never had this problem as far as I am aware (although it has other issues which make it a no go for my application).
My testing and guessing leads me down the road of thinking that it's host loading which is the root cause of this problem. This box runs big backup jobs and gets quite loaded during the night when the backup is running. I don't think however, that any amount of loading should result in errors like this in guests, and I am really interested in a solution here if one is available.
thanks.