High I/O causing filesystem corruption

indigo42
Posts: 16
Joined: 18. Jan 2010, 09:41
Primary OS: Ubuntu other
VBox Version: OSE Debian
Guest OSses: kubuntu, windows, gOS

High I/O causing filesystem corruption

Post by indigo42 »

Hello,
We are running an Ubuntu 10.04 32-bit guest on a RHEL 64-bit host. The guest storage controller is SATA AHCI with host I/O caching enabled.
Host info:
OS Type: Linux (2.6.18-194.el5)
VirtualBox: 4.1.4 (74291)
Processors: Intel(R) Xeon(R) CPU 3060 @ 2.40GHz (2)
HWVirtEx, PAE, Long Mode (64-bit)

It seems that during periods of high I/O, such as backups, the guest filesystem gets remounted read-only. We have this issue with both ext3 and ext4 filesystems, although switching to ext3 seemed to reduce how often it happens.

Here is a snippet of the syslog from around the time of the corruption:

Dec 15 08:01:48 vault sm-mta[1437]: rejecting connections on daemon MTA-v4: load average: 12
Dec 15 08:01:48 vault sm-mta[1437]: rejecting connections on daemon MSP-v4: load average: 12
Dec 15 08:02:03 vault sm-mta[1437]: rejecting connections on daemon MTA-v4: load average: 12
Dec 15 08:02:03 vault sm-mta[1437]: rejecting connections on daemon MSP-v4: load average: 12
Dec 15 08:02:18 vault sm-mta[1437]: rejecting connections on daemon MTA-v4: load average: 12
Dec 15 08:02:18 vault sm-mta[1437]: rejecting connections on daemon MSP-v4: load average: 12
Dec 15 08:02:29 vault kernel: [ 2338.011698] ata4.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen
Dec 15 08:02:29 vault kernel: [ 2338.011714] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 15 08:02:29 vault kernel: [ 2338.011721] ata4.00: cmd 61/00:00:b7:5c:9d/04:00:09:00:00/40 tag 0 ncq 524288 out
Dec 15 08:02:29 vault kernel: [ 2338.011723]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec 15 08:02:29 vault kernel: [ 2338.011726] ata4.00: status: { DRDY }
Dec 15 08:02:29 vault kernel: [ 2338.011729] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 15 08:02:29 vault kernel: [ 2338.011736] ata4.00: cmd 61/00:08:77:31:9d/04:00:09:00:00/40 tag 1 ncq 524288 out
Dec 15 08:02:29 vault kernel: [ 2338.011737]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 15 08:02:29 vault kernel: [ 2338.011740] ata4.00: status: { DRDY }
Dec 15 08:02:29 vault kernel: [ 2338.011743] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 15 08:02:29 vault kernel: [ 2338.011749] ata4.00: cmd 61/00:10:77:35:9d/04:00:09:00:00/40 tag 2 ncq 524288 out
Dec 15 08:02:29 vault kernel: [ 2338.011751]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Then, later in the file, we see this:

Dec 15 08:02:29 vault kernel: [ 2338.011969] ata4: hard resetting link
Dec 15 08:02:29 vault kernel: [ 2338.332268] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 15 08:02:33 vault kernel: [ 2342.108804] ata4.00: configured for UDMA/133
Dec 15 08:02:33 vault kernel: [ 2342.108811] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108815] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108818] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108820] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108823] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108826] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108829] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108831] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108834] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108837] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108840] ata4.00: device reported invalid CHS sector 0

To fix the issue, we have to shut down the guest, mount the drive, and run fsck to repair it.
This issue seems similar, but it was never resolved: viewtopic.php?f=3&t=25568&p=145197&hili ... ly#p145197
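
The failure signature in the syslog excerpts above (a failed NCQ write, a timeout, then a hard link reset) can be counted mechanically when checking a guest's logs. Below is a minimal sketch in Python, assuming the libata message format shown in this thread. The function name and the layout of the summary dictionary are made up for illustration.

```python
import re

# Patterns for the three libata messages that appear in the excerpts above.
FAILED_CMD = re.compile(r"(ata\d+\.\d+): failed command: (.+)$")
TIMEOUT = re.compile(r"Emask 0x4 \(timeout\)")
HARD_RESET = re.compile(r"(ata\d+): hard resetting link")


def summarize_ata_errors(lines):
    """Count failed commands, timeouts and link resets in syslog lines.

    Hypothetical helper: it only recognizes the message formats quoted
    in this thread, not every libata error.
    """
    summary = {"failed_commands": {}, "timeouts": 0, "hard_resets": 0}
    for line in lines:
        line = line.rstrip()
        match = FAILED_CMD.search(line)
        if match:
            key = (match.group(1), match.group(2))  # (device, command)
            counts = summary["failed_commands"]
            counts[key] = counts.get(key, 0) + 1
        if TIMEOUT.search(line):
            summary["timeouts"] += 1
        if HARD_RESET.search(line):
            summary["hard_resets"] += 1
    return summary


# Three lines copied from the syslog excerpts in this post.
SAMPLE = """\
Dec 15 08:02:29 vault kernel: [ 2338.011714] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 15 08:02:29 vault kernel: [ 2338.011723]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec 15 08:02:29 vault kernel: [ 2338.011969] ata4: hard resetting link
""".splitlines()

if __name__ == "__main__":
    print(summarize_ata_errors(SAMPLE))
```

To scan a real log, pass an open file object instead of SAMPLE, for example summarize_ata_errors(open("/var/log/syslog")); the path is an assumption and varies by distribution.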
Perryg
Site Moderator
Posts: 34369
Joined: 6. Sep 2008, 22:55
Primary OS: Linux other
VBox Version: OSE self-compiled
Guest OSses: *NIX

Re: High I/O causing filesystem corruption

Post by Perryg »

This should be reported to the bugtracker. You will need to set up an account there, as it is on a different system.
It would also help if you could post the ticket number here so others can follow the progress, add information, or see the final results.

Don't forget to attach the guest's log file and the information you posted here.
indigo42
Posts: 16
Joined: 18. Jan 2010, 09:41
Primary OS: Ubuntu other
VBox Version: OSE Debian
Guest OSses: kubuntu, windows, gOS

Re: High I/O causing filesystem corruption

Post by indigo42 »

@Perryg
Thanks. I have created Ticket #10031.
Perryg
Site Moderator
Posts: 34369
Joined: 6. Sep 2008, 22:55
Primary OS: Linux other
VBox Version: OSE self-compiled
Guest OSses: *NIX

Re: High I/O causing filesystem corruption

Post by Perryg »

You forgot to attach the guest's log file. They are really strict about that.
indigo42
Posts: 16
Joined: 18. Jan 2010, 09:41
Primary OS: Ubuntu other
VBox Version: OSE Debian
Guest OSses: kubuntu, windows, gOS

Re: High I/O causing filesystem corruption

Post by indigo42 »

Sorry.
I attached 2 log files.
LocalHostArgentina
Posts: 2
Joined: 26. Dec 2011, 07:10
Primary OS: Other
VBox Version: OSE self-compiled
Guest OSses: Gentoo

Re: High I/O causing filesystem corruption

Post by LocalHostArgentina »

We have had exactly the same problem for over a year. We always keep VirtualBox and the kernel up to date and still have the same problem: the Linux VMs always go into "read-only" mode, and we have to restart them and repair the filesystem.
We also tried ext3, ext4 and reiserfs, with all the flags on and off, and have exactly the same problem.

Please, it's time to fix this bug.
indigo42
Posts: 16
Joined: 18. Jan 2010, 09:41
Primary OS: Ubuntu other
VBox Version: OSE Debian
Guest OSses: kubuntu, windows, gOS

Re: High I/O causing filesystem corruption

Post by indigo42 »

@LocalHostArgentina
Thanks for the reply. I'd like to ask you something.

What hardware and OS are you using for the host? I had stumbled across a similar issue posted somewhere, and I thought they were using Intel Xeons and Red Hat. Of course, I can't find the post now. I have been running VirtualBox on AMD and Ubuntu for several years and have never had this issue there.

Thanks! Let's help get this thing fixed!

J
LocalHostArgentina
Posts: 2
Joined: 26. Dec 2011, 07:10
Primary OS: Other
VBox Version: OSE self-compiled
Guest OSses: Gentoo

Re: High I/O causing filesystem corruption

Post by LocalHostArgentina »

We use both Intel and AMD hardware, such as:

Motherboard: Intel DQ67SW with all VT extensions turned on
CPU: Intel Core i7
Memory: Kingston DDR3
Disk: WD
Power supply: CoolerMaster

and on AMD:

Motherboard: MSI
CPU: AMD Phenom X2
Memory: Kingston DDR3
Disk: WD
Power supply: Thermaltake

We have the same problem on all the hardware we tested.

We use Gentoo Linux for the host and various Linux distributions as guests, and all of them have exactly this problem (read-only remount and filesystem crash).

Could some option in the host's kernel config be causing this, such as the CFQ or Deadline I/O scheduler?
Is there someone on the VirtualBox team who is an expert in this and could help us?

Regards
Perryg
Site Moderator
Posts: 34369
Joined: 6. Sep 2008, 22:55
Primary OS: Linux other
VBox Version: OSE self-compiled
Guest OSses: *NIX

Re: High I/O causing filesystem corruption

Post by Perryg »

@LocalHostArgentina,

You need to add your issue to the trouble ticket. This goes directly to the devs and is the fastest way to get this issue resolved.

Here is the clickable link to the ticket: https://www.virtualbox.org/ticket/10031
indigo42
Posts: 16
Joined: 18. Jan 2010, 09:41
Primary OS: Ubuntu other
VBox Version: OSE Debian
Guest OSses: kubuntu, windows, gOS

Re: High I/O causing filesystem corruption

Post by indigo42 »

All,
I got some good feedback from msurkova, who posted a workaround to the ticket. I'm posting it here as well. I'm planning to try some of the suggestions and will post my results.
comment (by msurkova):

We also sometimes see similar messages in the guest logs during high I/O load (sometimes caused by guests, sometimes caused by host applications):

Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771243] ata4.00: status: { DRDY }
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771247] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771253] ata4.00: cmd 61/08:08:a7:43:48/00:00:07:00:00/40 tag 1 ncq 4096 out
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771257] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771261] ata4.00: status: { DRDY }
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771265] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771270] ata4.00: cmd 61/10:10:2f:1b:87/00:00:07:00:00/40 tag 2 ncq 8192 out
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771273] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771277] ata4.00: status: { DRDY }
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771280] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771285] ata4.00: cmd 61/b0:18:87:a4:87/00:00:07:00:00/40 tag 3 ncq 90112 out
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771288] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

However, we do not see drastic consequences like filesystem corruption, maybe because the problem goes away before the guest OS retries the operation.
We are running 64-bit Ubuntu 10.04 Server on a 24-CPU server with 64 GB RAM as the host, and a mix of openSUSE 11.3 and Ubuntu 10.10 as guests.
The host cache is enabled since the VM disk images are on an ext4 filesystem.
After digging around, it seems that this may not be a bug in VirtualBox itself, but rather an I/O timeout issue that occurs when the host OS flushes dirty disk buffers to disk storage.
This may cause the VirtualBox disk I/O subsystem to block in a write() call for some time, which in turn causes timeouts in the guest SATA code.
So this may be a problem for any type 2 hypervisor running on top of the host file system cache.

If this is sensible, I would like advice on how to prevent such undesired behavior. The following approaches are possible:
1). Upgrade the host OS kernel to 2.6.36+ and disable "host I/O cache" for the virtual SATA adapters. This allows VirtualBox to bypass the host file system buffers, so it neither pollutes them with dirty pages nor gets stopped during a buffer flush. In fact, this is even recommended in the VirtualBox manual (apart from a mention of an issue with ext4 host file systems).
2). Try to tune the host OS page flusher to avoid accumulating a significant amount of dirty pages in the host file system buffers. For example, set vm.dirty_background_ratio to 5 or below (this is still a large size, about 1 GB) or set vm.dirty_background_bytes to something smaller like 20971520 to make the page flusher more active. Note that this may make the host file system cache less effective, but it prevents write operations from getting stuck during a buffer flush.
3). Tune VirtualBox to flush the VM image files more often, to prevent a lot of pages in the host buffer cache from becoming dirty, as described here:
http://www.virtualbox.org/manual/ch12.h ... iodicFlush

Could the VirtualBox developers tell us whether this makes sense and, if so, which approach to take?
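
The numbers in approach 2 are easier to compare once both settings are expressed in bytes. The Python sketch below is only illustrative arithmetic. It assumes, as the kernel documentation describes, that vm.dirty_background_ratio is a percentage of the memory available for dirty pages (free plus reclaimable) rather than of total RAM. The 20 GiB figure is a made-up value chosen so that 5% comes to 1 GiB; it is not a measurement from any host in this thread. The function names are hypothetical.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def ratio_threshold_bytes(dirtyable_bytes, ratio_percent):
    """Bytes of dirty data at which background writeback would start
    for a given vm.dirty_background_ratio (integer arithmetic)."""
    return dirtyable_bytes * ratio_percent // 100


def sysctl_conf_lines(background_bytes):
    """Lines that could go into /etc/sysctl.conf to use the byte-based
    setting; writing it makes the kernel ignore the ratio-based one."""
    return [f"vm.dirty_background_bytes = {background_bytes}"]


if __name__ == "__main__":
    # Hypothetical host with 20 GiB of dirtyable memory and ratio 5.
    by_ratio = ratio_threshold_bytes(20 * GIB, 5)
    print(by_ratio // MIB, "MiB before background flushing starts")

    # The value suggested in the comment above is exactly 20 MiB.
    print(20971520 // MIB, "MiB with the byte-based setting")
    print("\n".join(sysctl_conf_lines(20971520)))
```

The gap between the two thresholds is the point of the suggestion: a smaller threshold means the flusher starts earlier and each flush has less data to write.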
ziovanja
Posts: 7
Joined: 29. Oct 2011, 01:33
Primary OS: Debian other
VBox Version: OSE Debian
Guest OSses: Debian GNU/Linux

Re: High I/O causing filesystem corruption

Post by ziovanja »

LocalHostArgentina wrote:We have had exactly the same problem for over a year. We always keep VirtualBox and the kernel up to date and still have the same problem: the Linux VMs always go into "read-only" mode, and we have to restart them and repair the filesystem.
We also tried ext3, ext4 and reiserfs, with all the flags on and off, and have exactly the same problem.

Please, it's time to fix this bug.

Since we are experiencing the same situation, can you tell me whether you found a solution for this annoying bug? Some of our VirtualBox servers freeze at least once a week (causing serious damage to MySQL databases), and this is leading me to consider leaving VirtualBox for good. What did you do to solve this issue?
michaln
Oracle Corporation
Posts: 2973
Joined: 19. Dec 2007, 15:45
Primary OS: MS Windows 7
VBox Version: PUEL
Guest OSses: Any and all

Re: High I/O causing filesystem corruption

Post by michaln »

One of the most important answers is right there - if you're on a Linux host and doing heavy disk I/O, do not use the host cache for the VMs, ever. The Linux I/O subsystem is not very smart: it batches gobs of dirty pages in the filesystem cache and, when it runs out of free memory, flushes everything out to disk. That can take quite a long time (minutes), and there's nothing VirtualBox can do about it.

The asynchronous I/O in VirtualBox was designed explicitly to work around this host OS deficiency. The I/O doesn't go through the host's cache and is written to disk much more frequently in smaller chunks. However, VirtualBox isn't necessarily the only process running on the host and something else still may trigger the undesirable behavior.

The corollary to the above is obvious: If your host can't cope with the I/O load generated by the VMs plus the rest of the system, there will be trouble. Virtualization isn't magic and can't turn a slow disk into a fast one.
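
For reference, the host I/O cache is a per-controller setting that can be switched with VBoxManage storagectl and its --hostiocache option. The Python sketch below only builds the command line. The VM name "vault" and the controller name "SATA Controller" are placeholders rather than values confirmed in this thread, and the helper name is hypothetical. Check the real names with VBoxManage showvminfo before running anything.

```python
import shlex


def hostiocache_command(vm_name, controller_name, enabled):
    """Build the VBoxManage argument list that turns the host I/O cache
    on or off for one storage controller of one VM."""
    return [
        "VBoxManage", "storagectl", vm_name,
        "--name", controller_name,
        "--hostiocache", "on" if enabled else "off",
    ]


if __name__ == "__main__":
    # Placeholder names; substitute the real VM and controller names.
    cmd = hostiocache_command("vault", "SATA Controller", enabled=False)
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it; printing first keeps the sketch harmless on machines without VirtualBox.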
WK-VBOX
Posts: 4
Joined: 17. May 2012, 00:57

Re: High I/O causing filesystem corruption

Post by WK-VBOX »

michaln wrote:One of the most important answers is right there - if you're on a Linux host and doing heavy disk I/O, do not use the host cache for the VMs, ever. The Linux I/O subsystem is not very smart: it batches gobs of dirty pages in the filesystem cache and, when it runs out of free memory, flushes everything out to disk. That can take quite a long time (minutes), and there's nothing VirtualBox can do about it.

In my testing, I've found the performance hit of giving up the host cache to be too severe. Are there any tunables or techniques to reduce the above effect (such as running 'sync' out of cron)? The idea would be to flush things out earlier and more consistently.

-wk
jmalter
Posts: 2
Joined: 21. Jul 2012, 08:02
Primary OS: Ubuntu other
VBox Version: PUEL
Guest OSses: Ubuntu 10.0.4

Re: High I/O causing filesystem corruption

Post by jmalter »

michaln wrote:One of the most important answers is right there - if you're on a Linux host and doing heavy disk I/O, do not use the host cache for the VMs, ever. The Linux I/O subsystem is not very smart: it batches gobs of dirty pages in the filesystem cache and, when it runs out of free memory, flushes everything out to disk. That can take quite a long time (minutes), and there's nothing VirtualBox can do about it.

The asynchronous I/O in VirtualBox was designed explicitly to work around this host OS deficiency. The I/O doesn't go through the host's cache and is written to disk much more frequently in smaller chunks. However, VirtualBox isn't necessarily the only process running on the host and something else still may trigger the undesirable behavior.

The corollary to the above is obvious: If your host can't cope with the I/O load generated by the VMs plus the rest of the system, there will be trouble. Virtualization isn't magic and can't turn a slow disk into a fast one.

We have exactly the same situation, but with fast disks (SSDs in a RAID 10 configuration).
Since this error is affecting our work (one reboot per day is needed), I am looking for a fix for this issue.
We are currently evaluating KVM as an alternative to VirtualBox.
None of the workarounds (IDE instead of SATA, changing parameters with sysctl, and so on) has any effect.

The workaround in #10031 is not a solution either, because it doesn't help.
The VM runs MySQL, SQLite and an SVN repository.
martyscholes
Posts: 202
Joined: 11. Sep 2011, 00:24
Primary OS: Solaris
VBox Version: PUEL
Guest OSses: Win 7, Ubuntu, Win XP, Vista, Win 8, Mint, Pear, Several Linux Virtual Appliances

Re: High I/O causing filesystem corruption

Post by martyscholes »

jmalter wrote:
We have exactly the same situation, but with fast disks (SSDs in a RAID 10 configuration).
Since this error is affecting our work (one reboot per day is needed), I am looking for a fix for this issue.
We are currently evaluating KVM as an alternative to VirtualBox.
None of the workarounds (IDE instead of SATA, changing parameters with sysctl, and so on) has any effect.

The workaround in #10031 is not a solution either, because it doesn't help.
The VM runs MySQL, SQLite and an SVN repository.

Are you saying that disabling host I/O caching still causes the filesystems to become corrupt? Honestly, I don't understand why anyone would use host I/O caching. If I understand it correctly, the guest thinks it is flushing to disk and this option allows the host to lie about the flush. It's no wonder that the guest disk becomes corrupt. In a production environment, I would think it insane to enable the caching. Disabling the caching would force the data to disk as often as it would on bare metal, so there is no performance loss. Enabling caching is merely smoke and mirrors at the expense of your data.

If you truly want some sort of write caching without investing money in dedicated hardware, on the host use something like ZFS with a mirrored SSD log device. While I am sure there are other options, this approach will provide blinding synchronous write speed and provable data integrity. Solaris is a wonderful platform for hosting VMs.