
Regular ext4 corruption

Posted: 26. Oct 2010, 02:12
by paul.dorman
Hi all,

I'm experiencing regular ext4 file system corruption in my Ubuntu Maverick 32-bit VMs. The host system is my i7 laptop (Ubuntu Maverick 64-bit, 8GiB, 500GB BTRFS), and the VMs are Ubuntu Maverick 32-bit, 2VCPU, 2GiB, 8GB ext4. All machines are running 2.6.36-020636-generic kernels, and I'm running VirtualBox 3.2.10 r66523. I'm using the AHCI host adapter with Host I/O caching off.

Switching the VMs to a single CPU and no I/O APIC seems to improve things, but I haven't been able to test this conclusively. In addition to file system corruption, I get extremely high I/O and system loads, primarily from the ext4 jbd2 process.

If you are running similar systems, and have managed to avoid file system corruption and sluggish performance, what are your recommended settings? Are there bug reports, either with VirtualBox or with the Linux kernel which are tracking the underlying fault(s)?

I'm happy to provide a proper technical report if someone could tell me how to collect the right data for diagnosis.

- Paul

Re: Regular ext4 corruption

Posted: 26. Oct 2010, 03:36
by Perryg
Known problem with certain kernels and Ext4. Make sure that the host IO cache is enabled.

Re: Regular ext4 corruption

Posted: 26. Oct 2010, 04:22
by paul.dorman
It would be great to know what the exact problem is. Enabling host IO cache does not fix it. For instance, I just collected this from a system that's just faulted with host IO cache enabled:


[ 2209.993822] ata3.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[ 2209.993856] ata3.00: failed command: WRITE FPDMA QUEUED
[ 2209.993888] ata3.00: cmd 61/18:00:40:95:2a/00:00:00:00:00/40 tag 0 ncq 12288 out 
[ 2209.993910]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[ 2209.993933] ata3.00: status: { DRDY }
[ 2209.993954] ata3.00: failed command: WRITE FPDMA QUEUED
[ 2209.993978] ata3.00: cmd 61/08:08:a0:4d:5c/00:00:00:00:00/40 tag 1 ncq 4096 out 
[ 2209.993994]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 2209.994012] ata3.00: status: { DRDY }
[ 2209.994043] ata3: hard resetting link
[ 2210.323630] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2215.321060] ata3.00: qc timeout (cmd 0xec)
[ 2215.321134] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 2215.321158] ata3.00: revalidation failed (errno=-5)
[ 2215.321193] ata3: hard resetting link
[ 2215.656577] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2225.656640] ata3.00: qc timeout (cmd 0xec)
[ 2225.656670] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 2225.656673] ata3.00: revalidation failed (errno=-5)
[ 2225.656681] ata3: limiting SATA link speed to 1.5 Gbps
[ 2225.656689] ata3: hard resetting link
[ 2225.988292] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 2255.991484] ata3.00: qc timeout (cmd 0xec)
[ 2255.991541] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 2255.991548] ata3.00: revalidation failed (errno=-5)
[ 2255.991557] ata3.00: disabled
[ 2255.991574] ata3.00: device reported invalid CHS sector 0
[ 2255.991580] ata3.00: device reported invalid CHS sector 0
[ 2255.991609] ata3: hard resetting link
[ 2256.321651] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 2256.321695] ata3: EH complete
[ 2256.321718] sd 2:0:0:0: [sda] Unhandled error code
[ 2256.321721] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 2256.321725] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 5c 4d a0 00 00 08 00
[ 2256.321733] end_request: I/O error, dev sda, sector 6049184
[ 2256.321739] Buffer I/O error on device sda1, logical block 755892
[ 2256.321741] lost page write due to I/O error on sda1
[ 2256.321782] sd 2:0:0:0: [sda] Unhandled error code
[ 2256.321785] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 2256.321788] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2a 95 40 00 00 18 00
[ 2256.321796] end_request: I/O error, dev sda, sector 2790720
[ 2256.321799] Buffer I/O error on device sda1, logical block 348584
[ 2256.321801] lost page write due to I/O error on sda1
[ 2256.321808] Buffer I/O error on device sda1, logical block 348585
[ 2256.321810] lost page write due to I/O error on sda1
[ 2256.321813] Buffer I/O error on device sda1, logical block 348586
[ 2256.321815] lost page write due to I/O error on sda1
[ 2256.321837] sd 2:0:0:0: [sda] Unhandled error code
[ 2256.321839] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 2256.321842] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 45 67 88 00 00 28 00
[ 2256.321872] JBD2: Detected IO errors while flushing file data on sda1-8
[ 2256.321872] 
[ 2256.321874] end_request: I/O error, dev sda, sector 4548488
[ 2256.321881] sd 2:0:0:0: [sda] Unhandled error code
[ 2256.321883] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 2256.321886] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00
[ 2256.321894] Aborting journal on device sda1-8.
[ 2256.321892]  84 0a 70 00 00 08 00
[ 2256.321897] end_request: I/O error, dev sda, sector 8653424
[ 2256.321900] Buffer I/O error on device sda1, logical block 1081422
[ 2256.321902] lost page write due to I/O error on sda1
[ 2256.321914] sd 2:0:0:0: [sda] Unhandled error code
[ 2256.321916] sd 2:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 2256.321919] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 84 3d 78 00 00 08 00
[ 2256.321927] end_request: I/O error, dev sda, sector 8666488
...
...
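
The Write(10) CDB lines in the log above can be cross-checked against the sector and block numbers the kernel reports. The following is a minimal sketch, not part of the original post. The function names are made up for illustration, and the partition start (sector 2048), the 512-byte sector size and the 4096-byte file system block size are assumptions that the log does not state; they were chosen because they happen to reproduce the logged values.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, sector_count).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in sectors
    return lba, count


# ASSUMPTIONS (not stated in the log): sda1 starts at sector 2048,
# sectors are 512 bytes and the ext4 block size is 4096 bytes.
PART_START = 2048
SECTORS_PER_BLOCK = 4096 // 512


def fs_blocks(lba, count, part_start=PART_START, spb=SECTORS_PER_BLOCK):
    """Map a disk-level write to the partition-relative blocks it touches."""
    first = (lba - part_start) // spb
    n_blocks = -(-count // spb)  # ceiling division
    return list(range(first, first + n_blocks))


if __name__ == "__main__":
    for cdb in ("2a 00 00 5c 4d a0 00 00 08 00",
                "2a 00 00 2a 95 40 00 00 18 00"):
        lba, count = decode_write10(cdb)
        print(cdb, "-> sector", lba, "count", count,
              "blocks", fs_blocks(lba, count))
```

Under those assumptions the first CDB decodes to sector 6049184 and logical block 755892, and the second to sector 2790720 and logical blocks 348584 to 348586, which agrees with the "end_request" and "Buffer I/O error" lines in the log.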

Re: Regular ext4 corruption

Posted: 26. Oct 2010, 04:48
by Perryg
I don't know the exact problem with your install but http://www.google.com/search?q=etx4+file+corruption shows some things that are interesting. I know that turning on the host IO cache solved my problem. From what I hear the fix is in one of the newer kernels, but I don't remember which version it is off hand.

Re: Regular ext4 corruption

Posted: 26. Oct 2010, 21:30
by Sasquatch
So where are you seeing this corruption? On the VM side, or on the Host side? If it's the Host, then it's no wonder. BTRFS isn't final, it's beta. There isn't an fsck utility for it either. Get a file system that is thoroughly tested, out of development and with proper file system integrity checks.

Re: Regular ext4 corruption

Posted: 26. Oct 2010, 23:39
by paul.dorman
I've had no issues with my btrfs partition on the host. The issue is with the VM ext4 file systems. Documented evidence or a diagnostic procedure would be more useful than conjecture for resolving the issue. Thanks for your input though. Very helpful.

Re: Regular ext4 corruption

Posted: 26. Oct 2010, 23:49
by Sasquatch
Having file system corruption inside the VDI (i.e. guest) is very rare and AFAIK hasn't happened yet, unless the VDI itself got corrupted due to Host FS corruption. Can you please test this corruption case with a different Host FS, like ext3?

Re: Regular ext4 corruption

Posted: 16. Feb 2011, 12:08
by skestle
I had this problem ('failed command: WRITE FPDMA QUEUED' boot failures), with corruption on any change that I made to the hard drive. I tried all of the above solutions (different client FS, Host I/O cache) and nothing really worked until I moved all VBox files (the .VirtualBox directory, for hard drives, and the VirtualBox VMs folder for the snapshots, which also has the config) to an old ext2 drive.

I then linked the folders back, and everything has worked seamlessly since. (Well, after I'd found out that 'sudo fsck -n' will always return errors, since your drive is changing while you're running fsck.)

BTW, my host VM drive is ext2, and I didn't dare install ext4 on the VM, opting for ext3 instead.

Re: Regular ext4 corruption

Posted: 18. Mar 2011, 22:15
by jigglywiggly
I am on 4.0.4 and also experiencing this, with host I/O cache enabled.

Having huge trouble installing Debian in a VM (Ubuntu 10.04 host, 2.6.32.30 kernel or something; I even tried the natty 2.6.38 kernel, to no avail).
Can't install Windows 7 x64, as it just BSODs in the middle of the install files, or complains about corrupt files.

The storage the VMs are on is quite a stressful setup: 9x500GB HDs in RAID 5 with mdadm. But I didn't use to have this problem with the older VirtualBox versions. (HDs are fine, SMART reports fine, and I ran benchmarks on all of them, nothing unusual; the system is perfectly stable.)

Assigning more CPUs, like 4, makes it drastically less stable.
Server specs: q6600 @ 3.5 ghz (stable OC, I ran intel burn test for 5 hours straight)
8 gigs of ram 1066
6200 LE

Re: Regular ext4 corruption

Posted: 8. Apr 2011, 12:14
by java_artisan
Did you guys succeed in eliminating the corruption problems? I'm asking because I'm having them too, with the MySQL data files. I'm using ext4 for both the host and the guest. VB 4.0.4, 64-bit, Ubuntu 10.10.

And can anyone tell whether they're experiencing something comparable to this ticket? http://www.virtualbox.org/ticket/8511 My VMs are locking up for an as yet unknown reason, but I suspect it's file system corruption.

Thanks !

Jan

Re: Regular ext4 corruption

Posted: 9. Apr 2011, 13:19
by Sasquatch
If you want to avoid this while still using EXT4, then don't use SATA for the hard drive controller in the VM settings. The other option is to use a different file system for the VM's storage (host side, of course). This corruption only occurs on the Host side in special cases where a lot of I/O is involved. A database could cause that, but it should also cache a lot in memory to minimise the I/O.

Re: Regular ext4 corruption

Posted: 5. May 2011, 18:36
by frank
Note that we fixed this problem; see the public ticket 8773 and others. Affected are guests using SATA with guest RAM >= 2GB.
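
As a rough illustration of that condition (SATA controller plus guest RAM of 2GB or more), the sketch below checks a VM description in key=value form. It is not from the thread. It assumes the text comes from `VBoxManage showvminfo <vm> --machinereadable`, and that this output contains a `memory` key in MB and `storagecontrollertypeN` keys with the value `IntelAhci` for SATA; those key names and values should be verified against your own output. The function names and the 2048 MB threshold are likewise assumptions. It only flags a matching configuration and says nothing about whether the installed VirtualBox version already contains the fix.

```python
def parse_machinereadable(text):
    """Parse key=value lines (values optionally quoted) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip().strip('"')] = value.strip().strip('"')
    return info


def matches_reported_condition(info, ram_threshold_mb=2048):
    """True if the VM has a SATA (AHCI) controller and RAM >= threshold.

    ASSUMPTION: 'memory' and 'storagecontrollertypeN' are the key names
    used by the machine-readable output; check your own output.
    """
    ram_mb = int(info.get("memory", "0"))
    uses_sata = any(
        key.startswith("storagecontrollertype") and value == "IntelAhci"
        for key, value in info.items()
    )
    return uses_sata and ram_mb >= ram_threshold_mb


if __name__ == "__main__":
    # Hypothetical sample, modelled on the VM described in the first post.
    sample = "\n".join([
        'name="maverick32"',
        "memory=2048",
        'storagecontrollername0="SATA Controller"',
        'storagecontrollertype0="IntelAhci"',
    ])
    print(matches_reported_condition(parse_machinereadable(sample)))
```

The parsing is kept separate from the check so the same dictionary can be inspected for other settings mentioned in the thread, such as the CPU count.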

Re: Regular ext4 corruption

Posted: 7. May 2011, 13:16
by Sasquatch
Thanks for the fix Frank. That explains why I never got this issue, because none of my VMs have 2 GB of RAM. Lucky me I guess.