Here's a synopsis of the problem. Windows 7 guest boots up. Sometimes it works fine for a few minutes, sometimes it'll run for an hour or two, then the load average on the Linux box starts increasing. I've let it go to 250 before power cycling. When the load average starts increasing the sync command [sync() system call according to strace] hangs.
- Host specs:
Fedora 21
Linux proxy 3.18.3-201.fc21.x86_64 #1 SMP Mon Jan 19 15:59:31 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz
32GB RAM
VirtualBox 4.3.20 r96996
Seagate SSHD 2TB storage (ST2000DX001) set up in a RAID-1 configuration (/dev/md)
XFS file system
Guest: 64-bit Windows 7 with 2 CPUs and 4 GB memory; guest I/O hangs after a random period of time
I've tried various combinations of VirtualBox (chipset) settings for Windows 7, but can't find a stable combination. I tried these because of numerous reports of Windows 7 guest instability on this forum. I used VBoxManage clonehd to move from a dynamically allocated .vdi file to a fixed-size one. I've defragmented the XFS file system using xfs_fsr.
Yesterday, I tried an experiment and got some very interesting results. I moved my .vdi file from local storage to another server running ext4 and NFS-mounted the .vdi file from the remote server on the VirtualBox host. I am able to boot into the 64-bit Windows 7 guest and it works just fine. However, after some period of time, I/O on the host Linux box hangs and the Windows 7 guest continues to function! If I have a root shell open to the host OS from the guest, I can still run some commands for troubleshooting. Networking and NFS continue to function, and I can cleanly shut down Windows 7 before power cycling the host box.
This latest finding, that the host hangs while the guest continues to work, led to this forum post and will probably lead to a bug tracker request. Does anyone have any ideas as to what may be going on? I suspect there may be some kernel interaction with VirtualBox 4.3.20, and I have tried kernels 3.17.4, 3.17.6, 3.17.7, 3.17.8 and 3.18.3. I was running stable on 3.17.7-300.fc21 for two weeks before I ran another yum update and broke everything again.
I'm looking for troubleshooting tips where I can track the root cause of this problem down. How do I find out where I/O is getting hung in the kernel?
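To make the question concrete, here is a minimal sketch of the kind of check I have in mind, assuming Python 3 on the host and a standard /proc layout. It lists processes in uninterruptible sleep (state D, which is what drives the load average up) along with the kernel symbol each one is waiting in, as reported by /proc/<pid>/wchan. The function names (parse_stat, blocked_tasks) are my own and purely illustrative, and the script looks only at processes, not at individual threads under /proc/<pid>/task.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def read_text(path):
    """Read a small /proc file; return '' if it vanished or is unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm, wchan) tuples for processes in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        base = os.path.join(proc_root, entry)
        stat = read_text(os.path.join(base, "stat"))
        if not stat:
            continue  # process exited between listdir() and the read
        pid, comm, state = parse_stat(stat)
        if state == "D":
            wchan = read_text(os.path.join(base, "wchan")) or "?"
            found.append((pid, comm, wchan))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm, wchan in blocked_tasks():
        print(f"{pid:>7}  {comm:<20}  {wchan}")
```

Running this from the surviving root shell while the load average is climbing should show which processes are stuck and roughly where. If the kernel was built with CONFIG_STACKTRACE, reading /proc/<pid>/stack as root for one of the listed PIDs gives the full kernel stack rather than a single symbol.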
Oh, I've also tried running 'smartctl --test=long' on the hard disks and they came back clean. When I copy my 128GB .vdi file from one location to another, it takes about an hour and I/O is very slow on the host during this operation.
I think that covers everything. If you have any questions, feel free to ask.
Regards,
-Seann