virtualbox.org

Summary: disk controller errors inside linux guests

In another thread I brought up this issue but I think it deserves it's own thread and a better explanation.

Essentially, I have experienced persistent issues with various disk controller errors being reported by my linux guests under virtualbox 3.0.12 and earlier, and also as well vmware server 1.0, across many many different physical hosts and host kernel versions and guest kernel versions and scsi, ide and sata guest disk controllers. These errors are not always the same exact message, but most of them result in aborted journals on my ext3 partitions which then cause machine malfunction since the disks all remount readonly on errors and these guests can't get much useful work accomplished when their disk is readonly.

For example, the following two dmesg outputs are the same physical hosts running vbox 3.0.8, and identical guest kernels:

Guest #1:
[1710985.943387] sd 0:0:0:0: [sda] Unhandled error code
[1710985.944568] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[1710985.946582] end_request: I/O error, dev sda, sector 272831
[1710985.947830] sd 0:0:0:0: [sda] Unhandled error code
[1710985.948932] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[1710985.950989] end_request: I/O error, dev sda, sector 272839
[1710985.952191] Aborting journal on device sda1.
[1710987.312987] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[1710987.315182] Remounting filesystem read-only
[1710987.316344] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[1710987.557810] EXT3-fs error (device sda1) in ext3_dirty_inode: Journal has aborted

Guest #2:
[1690993.020333] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[1690993.022422] ata1.00: cmd 61/02:00:53:b2:11/00:00:00:00:00/40 tag 0 ncq 1024 out
[1690993.022423] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[1690993.546246] ata1.00: status: { DRDY }
[1690993.547266] ata1.00: cmd 61/04:08:37:09:00/00:00:00:00:00/40 tag 1 ncq 2048 out
[1690993.547268] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[1690993.551319] ata1.00: status: { DRDY }
[1690993.552356] ata1: hard resetting link
[1690994.262256] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.260099] ata1.00: qc timeout (cmd 0xec)
[1690999.261215] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[1690999.262451] ata1.00: revalidation failed (errno=-5)
[1690999.263587] ata1: hard resetting link
[1690999.610271] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.612215] ata1.00: configured for UDMA/133
[1690999.613285] ata1.00: device reported invalid CHS sector 0
[1690999.614481] ata1.00: device reported invalid CHS sector 0
[1690999.615753] ata1: EH complete
[1690999.616946] ata1: illegal qc_active transition (00000001->00000002)
[1690999.618252] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1690999.620277] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1690999.620278] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1690999.624394] ata1.00: status: { DRDY }
[1690999.626730] ata1: hard resetting link
[1690999.970284] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1690999.972034] ata1.00: configured for UDMA/133
[1690999.973121] ata1: EH complete
[1690999.974210] ata1: illegal qc_active transition (00000001->00000002)
[1690999.975617] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1690999.977603] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1690999.977605] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1690999.981698] ata1.00: status: { DRDY }
[1690999.982731] ata1: hard resetting link
[1691000.330264] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1691000.331988] ata1.00: configured for UDMA/133
[1691000.333072] ata1: EH complete
[1691000.334154] ata1: illegal qc_active transition (00000001->00000002)
[1691000.335545] ata1.00: NCQ disabled due to excessive errors
[1691000.336715] ata1.00: exception Emask 0x2 SAct 0x1 SErr 0x0 action 0x6 frozen
[1691000.338694] ata1.00: cmd 61/04:00:37:09:00/00:00:00:00:00/40 tag 0 ncq 2048 out
[1691000.338696] res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
[1691000.343365] ata1.00: status: { DRDY }
[1691000.344391] ata1: hard resetting link
[1691000.691380] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1691000.693144] ata1.00: configured for UDMA/133
[1691000.694234] ata1: EH complete

The host in this case is a big box with redundant power, hardware raid5 and about 700gb online. It hasn't reported any disk i/o problems, odd kernel messages, and I have seen no evidence of file corruption on this box ever. These occur in this case at different times and don't appear related to one another. vmware server would behave similarly with ide and both flavors of scsi controllers as well but I quit using it about 1 1/2 years ago to explore other options such as qemu and now virtualbox. Qemu never had this problem as far as I am aware (although it has other issues which make it a no go for my application).

My testing and guessing leads me down the road of thinking that it's host loading which is the root cause of this problem. This box runs big backup jobs and gets quite loaded during the night when the backup is running. I don't think however, that any amount of loading should result in errors like this in guests, and I am really interested in a solution here if one is available.

thanks.

Guest: openSuse 11.2

I have a bone to pick also! Never had this problem until I download the lastest image! After install I clicked on YaST and I got these error. Tried to update my installation, but unable to, keep failing because of these errors!

Dec 7 10:02:27 linux smartd[4397]: smartd 5.39 2009-08-08 r2872~ [x86_64-unknown-linux-gnu] (openSUSE RPM)#012Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net#012
Dec 7 10:02:27 linux smartd[4397]: Opened configuration file /etc/smartd.conf
Dec 7 10:02:27 linux smartd[4397]: Drive: DEVICESCAN, implied '-a' Directive on line 26 of file /etc/smartd.conf
Dec 7 10:02:27 linux smartd[4397]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Dec 7 10:02:27 linux smartd[4397]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Dec 7 10:02:27 linux smartd[4397]: Device: /dev/sda [SAT], opened
Dec 7 10:02:27 linux smartd[4397]: Device: /dev/sda [SAT], not found in smartd database.
Dec 7 10:02:27 linux smartd[4397]: Device: /dev/sda [SAT], lacks SMART capability
Dec 7 10:02:27 linux smartd[4397]: Device: /dev/sda [SAT], to proceed anyway, use '-T permissive' Directive.
Dec 7 10:02:27 linux smartd[4397]: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...
Dec 7 10:02:29 linux /usr/sbin/cron[4611]: (CRON) STARTUP (V5.0)
Dec 7 10:02:55 linux python: hp-systray[5013]: warning: hp-systray should not be run as root/superuser.
Dec 7 10:02:55 linux python: hp-systray[5013]: error: hp-systray cannot be run as root. Exiting.
Dec 7 10:05:02 linux kernel: [ 222.716054] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
Dec 7 10:05:02 linux kernel: [ 222.716142] IP: [<ffffffff81159834>] __link_path_walk+0x194/0x870
Dec 7 10:05:02 linux kernel: [ 222.716226] PGD 116be067 PUD 116bf067 PMD 0
Dec 7 10:05:02 linux kernel: [ 222.716240] Oops: 0000 [#1] PREEMPT SMP
Dec 7 10:05:02 linux kernel: [ 222.716257] last sysfs file: /sys/devices/system/cpu/cpu1/topology/core_siblings
Dec 7 10:05:02 linux kernel: [ 222.716309] CPU 1
Dec 7 10:05:02 linux kernel: [ 222.716309] Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd vboxvideo drm fuse loop dm_mod snd_intel8x0 snd_ac97_codec i2c_piix4 ac97_bus snd_pcm snd_timer snd snd_page_alloc ppdev sg sr_mod pcspkr floppy parport_pc parport cdrom ac e1000 button vboxadd ext4 jbd2 crc16 fan processor piix ide_core thermal thermal_sys [last unloaded: preloadtrace]
Dec 7 10:05:02 linux kernel: [ 222.716309] Pid: 5014, comm: krunner Not tainted 2.6.31.5-0.1-desktop #1 VirtualBox
Dec 7 10:05:02 linux kernel: [ 222.716309] RIP: 0010:[<ffffffff81159834>] [<ffffffff81159834>] __link_path_walk+0x194/0x870
Dec 7 10:05:02 linux kernel: [ 222.716309] RSP: 0018:ffff88001166dd08 EFLAGS: 00010286
Dec 7 10:05:02 linux kernel: [ 222.716309] RAX: 0000000000000000 RBX: ffff88001fb8900b RCX: 0000000000000000
Dec 7 10:05:02 linux kernel: [ 222.716309] RDX: ffff88001d4fac30 RSI: ffff88001d4f7965 RDI: 0000000000000000
Dec 7 10:05:02 linux kernel: [ 222.716309] RBP: ffff88001166dd88 R08: ffff88001fb89005 R09: 0000000000000005
Dec 7 10:05:02 linux kernel: [ 222.716309] R10: 0000000000000004 R11: 0000000000000202 R12: ffff88001166de28
Dec 7 10:05:02 linux kernel: [ 222.716309] R13: ffff88001166dd38 R14: ffff88001166dd28 R15: 0000000000000001
Dec 7 10:05:02 linux kernel: [ 222.716309] FS: 00007f160f24d750(0000) GS:ffff880001c9c000(0000) knlGS:0000000000000000
Dec 7 10:05:02 linux kernel: [ 222.716309] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 7 10:05:02 linux kernel: [ 222.716309] CR2: 0000000000000050 CR3: 00000000116bd000 CR4: 00000000000006e0
Dec 7 10:05:02 linux kernel: [ 222.716309] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 7 10:05:02 linux kernel: [ 222.716309] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Dec 7 10:05:02 linux kernel: [ 222.716309] Process krunner (pid: 5014, threadinfo ffff88001166c000, task ffff88001166a700)
Dec 7 10:05:02 linux kernel: [ 222.716309] Stack:
Dec 7 10:05:02 linux kernel: [ 222.716309] ffff88001d401220 ffff88001f618188 ffff88001166de48 ffffffff8116a9a7
Dec 7 10:05:02 linux kernel: [ 222.716309] <0> 00000005135346f6 ffff88001fb89005 ffff88001f6180c0 ffff88001d4f78c0
Dec 7 10:05:02 linux kernel: [ 222.716309] <0> ffff88001fb89000 000000003b79228d ffff88001166dd88 ffff88001166de28
Dec 7 10:05:02 linux kernel: [ 222.716309] Call Trace:
Dec 7 10:05:02 linux kernel: [ 222.716309] [<ffffffff8115a496>] path_walk+0x66/0xd0
Dec 7 10:05:02 linux kernel: [ 222.716309] [<ffffffff8115a56b>] do_path_lookup+0x6b/0xb0
Dec 7 10:05:02 linux kernel: [ 222.716309] [<ffffffff8115a771>] user_path_at+0x61/0xc0
Dec 7 10:05:02 linux kernel: [ 222.716309] [<ffffffff8114a5c8>] sys_faccessat+0xe8/0x1f0
Dec 7 10:05:02 linux kernel: [ 222.716309] [<ffffffff8114a6f6>] sys_access+0x26/0x50
Dec 7 10:05:02 linux kernel: [ 222.716309] [<ffffffff8100c682>] system_call_fastpath+0x16/0x1b
Dec 7 10:05:02 linux kernel: [ 222.716309] [<00007f160c417eb7>] 0x7f160c417eb7
Dec 7 10:05:02 linux kernel: [ 222.716309] Code: 4c 89 f6 4c 89 e7 e8 8c fd ff ff 85 c0 0f 85 16 ff ff ff 48 8b 45 b8 48 8b 50 10 48 85 d2 0f 84 7e 04 00 00 48 8b 82 f0 00 00 00 <48> 83 78 50 00 0f 84 89 01 00 00 4c 89 e6 4c 89 ef e8 c6 06 00
Dec 7 10:05:02 linux kernel: [ 222.716309] RIP [<ffffffff81159834>] __link_path_walk+0x194/0x870
Dec 7 10:05:02 linux kernel: [ 222.716309] RSP <ffff88001166dd08>
Dec 7 10:05:02 linux kernel: [ 222.716309] CR2: 0000000000000050
Dec 7 10:05:02 linux kernel: [ 222.717026] ---[ end trace 3eb6c2ca19120dd7 ]---

going back to 3.0.10 or my OSE version!

As a followup I have had another vm die with yet more (and different) disk errors. This vm is configured with an IDE and not sata interface and has traditionally be extrememly stable, but not tonight. I dug around and looked finially at the Vbox.log file for the machine and found many errors of the form:

PIIX3 ATA: execution time for ATA command 0xca was 8 seconds
TM: Giving up catch-up attempt at a 60 000 573 714 ns lag; new total: 3 420 065 211 825 ns

I am attaching the full output. There's something screwy in saint louie (as my favorite rabbit used to say).

In looking at another vm, disk errors caused the aborted journal/remounted disk read only problem.

[1204764.596437] sd 0:0:0:0: [sda] Unhandled error code
[1204764.600906] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[1204764.604567] end_request: I/O error, dev sda, sector 215215
[1204764.605780] Aborting journal on device sda1.
[1204764.606934] EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
[1204764.609223] Remounting filesystem read-only

The VBox.log file for this vm is clean, containing only startup messages.

I think Im going to just file a bug report at this point.

I use vbox as a "sandbox" for testing various things at work, and all our systems are Centos 4 or 5, updated off the base repos to full current ... which is 4.8 and 5.4, currently.

I cannot get SATA to work with either of these distros at all; they crash at boot.

Both distros see and load the AHCI driver, then fail attempting to mount the attached drive.

The same exact image, if put on a hdd, will boot an Intel motherboard w/the same chipset SATA, so it's not the distro, it's definitely something buggy in Vbox.

So .... Me Too on the 'something screwy in saint louie' ...

- _litz

I would consider filing a ticket with bugtracker about these. version 3.1.0 had some pretty heavy changes in it and if you did not see these problems previously it might be that the new changes are aiding in the problem. Link is below my post. Please provide them with as much information as you can to assist in finding the problem.

I've filed multiple bug reports of several various issues (included log files as requested) ....

Several have now been sitting untouched for weeks.

So .... yeah, done that. Dunno what good it will do ...

- _litz

I'm still seeing this with 3.1.4. I searched the bug tracker, but couldn't find anything. Have there been bugs for this filed? If so, could you let me know the bug numbers?

OS: OpenSuse 11.2, VBox 3.1.4

Everything here is working perfectly so far in the new version. Keep up the good work and leave it alone!

Hello all:

Still a bit new to VirtualBox, I'm running several VM's(3 Ubuntu,1 Cent,1 Arch) with no issues.

I also have 2 Ubuntu server installs that are giving me disk errors like these inside the guest machines:

ata1.00: exception Emask 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd ca/00:10:37:52:63/00:00:00:00:00/e0 tag 0 dma 8192 out

well as errors like these in the messages log of the guest vm's:

Apr 29 20:31:37 eng-cacti-2 kernel: [806854.462234] ata1: EH complete
Apr 29 21:01:40 eng-cacti-2 kernel: [808657.040584] ata1: lost interrupt (Status 0x48)
Apr 29 21:01:40 eng-cacti-2 kernel: [808657.047159] ata1: soft resetting link
Apr 29 21:01:45 eng-cacti-2 kernel: [808662.240220] ata1: link is slow to respond, please be patient (ready=0)
Apr 29 21:01:50 eng-cacti-2 kernel: [808667.101989] ata1: soft resetting link
Apr 29 21:01:53 eng-cacti-2 kernel: [808670.751758] ata1.00: configured for UDMA/33
Apr 29 21:01:53 eng-cacti-2 kernel: [808670.751764] ata1.00: device reported invalid CHS sector 0
Apr 29 21:01:53 eng-cacti-2 kernel: [808670.751777] ata1: EH complete
...

These machines are running Cacti and much more disk intensive operations than the VM's that are working well.
The VM's eventually freeze, on a couple of occasions I've seen errors (in the vm) about dropping the Journal and mounting read-only

Host OS:OSX Leopard Server on a Mac Pro dual Quad Core Xeon, 32 gb ram, Apple raid card (Raid 5)
Guest OS's: Ubuntu Server 9.10 64 bit, IDE Controller: PIIX4
VirtualBox Version 3.1.6 r59351

Not quite sure what other logs info I should include, any help is greatly appreciated.

Log file - Cropped a bit in the middle to get the allowed file size

I'm on 3.2.2 - Centos 5.4 X86_64 and I have two virtual hosts on my Dual Quad Core Xeon Server.

One virtual host is 1Tb in size and the other is 500Gb in size. I'm just in the testing phase and I keep getting a lot of ATA i/o exceptions with the 1Tb virtual host but not the 500Gb one....

If this is happening during testing - I don't think I can use in any mission critical capacity.....

Any ideas how to resolve?

mutface wrote:keep getting a lot of ATA i/o exceptions

maybe this one applies?
http://www.virtualbox.org/ticket/2524
http://www.virtualbox.org/manual/ch12.h ... iodicFlush

I'm getting the problems after I leave the virtual host running for 24 hours..... Loading is 0 - I don't have anything installed on it.

Host Server Spec
=============
Centos 5.4 - X86_64 (kernel - comes with 5.4)
Dual Quad Core - Xeon
Intel S5000VSA Server Board
6 Gb RAM
2 x 500Gb (Raid 1)
1 x 1.5Tb

Virtual Host 1
==========
420 Gb - 7 cores - Disk image on the Raid 1 partition
AHCI + SATA Port 0
2 Gb RAM
Centos 5.4 - X86_64

Virtual Host 2
==========
1 Tb - 7 cores - Disk image on 1.5 Tb HDD
AHCI + SATA Port 0
2 Gb RAM
Centos 5.4 - X86_64

/var/log/messages on Virtual Host 1 - seems normal but loads of error messages appear on console about ATA... and cannot even do basic read of /var/log/messages.

mutface wrote: Virtual Host 1
==========
420 Gb - 7 cores - Disk image on the Raid 1 partition
...
Centos 5.4 - X86_64

do you use divider=10 kernel parm with your CentOS guests?

and, RAID 1 is software RAID?

virtualbox.org

Virtual SATA/IDE controller errors

Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors

Re: Virtual SATA/IDE controller errors