Semi-solved: Fedora 13 I/O Freeze
Posted: 1. Sep 2010, 06:45
Hi; I have a problem with a fedora 13 guest freezing on I/O. In more detail, the guest boots OK, but freezes after about an hour.
In the scenario I'm clearest about, I boot the machine and open a terminal for a large install I want to do. I start the terminal and system monitor, and everything looks normal, four cpus heavily loaded from parallel installation tasks (the same problem happens if I run the install in synchronous mode), plenty of available memory etc. After an hour or so, though, when I come back I see the following symptoms:
.the install job is frozen (different place every time). If I click on the terminal it will probably progress a few steps, but then freeze again. After a couple more times, the terminal freezes permanently
.the system monitor is also frozen. If I click on it, it continues on as if nothing had happened (but the cpu usage has dropped to close to zero). There's plenty of physical memory still available, and zero swap showing
.The virtualbox network icon is blinking furiously - much more than in its unfrozen state - though I'm not running anything obvious that would cause this
.I can open new terminal windows (i.e. the gui is OK), but they are also frozen - can't type anything into them
.In this state, the shutdown dialogue appears to work OK, but the shutdown also freezes, so I can only stop the machine by a hard stop
.On reboot, there is very little in the logs. At around the time I guess the freeze started, I sometimes see some ntpd messages in /var/log/messages (no_sys_peer and spike_detect, not sure if they're abnormal), then /var/log/messages is completely empty until I reboot (even if it's overnight)
.On the host, while the job is running successfully, I see around 350% cpu use, which is about what I expect (4 cpus allocated to the guest). Once it has frozen, the host sees around 4% cpu use from the guest.
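Since /var/log/messages goes silent from the freeze until the reboot, one way to pin down when the freeze started is to look for the largest jump between consecutive timestamps. A rough sketch, assuming Python 3, the traditional syslog timestamp format ("Sep  1 06:45:12 ...") and an English locale; the function name, the 10-minute threshold and the default year are my own choices, not anything from Fedora or VirtualBox:

```python
from datetime import datetime


def find_gaps(lines, min_gap_seconds=600, year=2010):
    """Return (last_before, first_after) timestamp pairs around silent periods.

    Classic syslog lines carry no year, so one is assumed and prepended.
    Lines whose first 15 characters are not a timestamp are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            stamp = datetime.strptime(
                "{} {}".format(year, line[:15]), "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # continuation line or other unparsable entry
        if previous is not None:
            silent_for = (stamp - previous).total_seconds()
            if silent_for >= min_gap_seconds:
                gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Feeding it the lines of /var/log/messages (for example `find_gaps(open("/var/log/messages"))`, run as root) should return one pair per silent stretch; the first element of each pair is the last entry written before the guest stopped logging.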
I think the only thing I'm doing unusual network-wise is using bridged networking (I need a fixed IP address so my students can login remotely); this is using a separate physical network adapter to the host. The same configuration has worked perfectly well before, and there's nothing in the host logs about network problems, so I doubt it's the network configuration (besides, it's hard to see how a network-based problem would freeze a terminal running in the guest console).
Oh, and I guess the writethrough of swap to a raw disk partition is unusual too; don't think it's implicated here, though, because the system monitor seems to indicate that swap space hasn't been used by the time the crash occurs.
Any thoughts on how to go about further diagnosing this would be greatly appreciated. At the moment, I haven't the faintest idea where to look next. Previously, I saw some messages in the guest logs about fprintd that didn't look good. Since I don't have a fingerprint scanner, I removed fprintd. This didn't seem to fix anything, so I'm stuck, and term starts today....
Best Wishes
Bob
System setup:
Host:
Hardware: dual 4-core Intel X5472, 16GB
Allocated to guest: 4 cpus, 8GB memory
OS: Fedora 13 64 bit (fully updated)
VirtualBox: 3.2.8
VT-x enabled
nested paging disabled
PAE disabled
video memory: 64MB, 1 monitor, 3D/2D/remote display disabled
Main install disk: fixed size normal disk, 44GB over ext4 partition, on lvm per normal fedora install
Swap: writethrough to raw partition (16GB)
Network: adapter 1, PCnet-FAST III (Bridged, host eth1 <-> guest eth0)
Guest:
Fedora 13 64 bit (fully updated)
Resolution: it turned out to be a combination of two separate problems, very slow terminal I/O (presumably some kind of problem in the virtualisation of I/O, though I'm not too clear why this affected even console use) combined with random crashes due to smp. So perryg was right (thank you!). I was finally able to get the install to complete by a combination of:
1. setting the number of virtualbox cpus down to 2 (i.e. the number of cpus, not cores)
2. redirecting all output from the install to files
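For anyone wanting to reproduce step 2 from a script rather than with shell redirection, here is a minimal sketch, assuming Python 3.5 or later is available on the guest. The helper name `run_logged` and the log path are placeholders of mine, and the command list stands in for whatever the real install invocation is:

```python
import subprocess


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr written to log_path, not the terminal.

    Returns the exit status of the command.
    """
    with open(log_path, "wb") as log:
        # stderr is merged into stdout so both end up in the same file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

A call like `run_logged(["make", "install"], "install.log")` (hypothetical command) keeps the terminal quiet while the job runs; progress can still be checked from another window with `tail -f install.log`. Step 1 can be done in the VM settings dialog, or, with the VM powered off, with something like `VBoxManage modifyvm <vmname> --cpus 2`.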
I'm still very puzzled by this buffering issue. We can run multiple vnc sessions on this guest quite OK (this is a shared machine for a whole class) - not fast, but not unacceptably slow. I can't understand how the I/O load of one install in a terminal session - certainly spewing out a fair bit of output, but not unbelievable amounts - could be so much worse than multiple vnc sessions.
Anyway, it's largely solved, so thank you for all your help with this. If you hear any rumours about the smp issues being solved in subsequent versions, it would be great if it could be posted - I'd really like to be able to devote more resources to the teaching machine.