Network problems with VirtualBox 3.1.4 on OpenSolaris

randshuntzinger
Posts: 23
Joined: 14. Aug 2008, 19:18

Post by randshuntzinger »

I have a server running OpenSolaris Developer build 133 with four Solaris and OpenSolaris guests. I'm using VirtualBox 3.1.4 on the host and the matching guest additions (SUNWvboxguest) in the guests. Three of these guests use bridged networking over Crossbow VNICs (virtual network interfaces) and the fourth uses NAT. Here is what I'm observing:
tray1$ scp -p /netopt/solaris.systems/intel/indy school:/tmp/x
indy                 100% |*****************************|   156 KB    00:00
tray1$ scp -p school:/tmp/x /tmp
x                      0% |                             |     0       --:-- ETA^CKilled by signal 2.

tray1$
I was able to use scp to copy a 156 KB file from tray1 to school (a VirtualBox guest); however, when I try to copy the file back, it hangs. I've found that I can transfer small files back:
tray1$ ssh school cat /etc/release
                       Solaris 10 10/09 s10x_u8wos_08a X86
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 16 September 2009
tray1$
but anything greater than a few KB hangs every time. As far as I can tell, this only happens with VirtualBox guests. I can use a Solaris zone with a VNIC and it is fine. It makes no difference whether I pull the file from outside or push it from the guest: as long as the data is coming from the guest and going out, the transfer hangs. I do not have this problem on the NAT connection.

If I log in to the guest and try transferring out, it appears that some data gets transferred before the connection stalls:
indy$ scp /tmp/indy tray1:/tmp/x
Password: 
indy                  92% |**************************   |   144 KB  - stalled -^CKilled by signal 2.

indy$ scp /tmp/indy tray1:/tmp/x
Password: 
indy                  99% |**************************** |   156 KB  - stalled -^CKilled by signal 2.

indy$ ^C
indy$ scp /tmp/indy tray1:/tmp/x
Password: 
indy                  92% |**************************   |   144 KB  - stalled -Read from remote host tray1: Connection timed out
indy                  92% |**************************   |   144 KB  - stalled -^CKilled by signal 2.

indy$

This problem seems to be independent of the release of the guest OS. Indy is an OpenSolaris build 133 machine and school is running Solaris 10. Both have this problem. VirtualBox with bridged networking seems to be needed for the problem to occur. The NAT guest is OK.

This is all rather bizarre. Has anyone else observed anything similar?

-- Rand
Ramshankar
Oracle Corporation
Posts: 793
Joined: 7. Jan 2008, 16:17

Re: Network problems with VirtualBox 3.1.4 on OpenSolaris

Post by Ramshankar »

So copying anything more than a few KB stalls the network transfer on the guest when using bridged networking on OpenSolaris b133, right? I'll try to reproduce this behaviour tomorrow.
Oracle Corp.
jimklimov
Posts: 83
Joined: 7. Jul 2009, 08:28
Primary OS: OpenSolaris other
VBox Version: PUEL
Guest OSses: Linux, OSOL, Windows

Re: Network problems with VirtualBox 3.1.4 on OpenSolaris

Post by jimklimov »

I've had very similar behavior on physical Solaris 10u4 with its bundled IPFilter. There was something about a TCP window size mismatch, fixed in open-source IPFilter 4.1.13. IIRC I built and used the open-source branch of IPFilter current at that time (approx. 4.1.28) to make the problem go away. I did not see it with other setups, newer or older, though, and we use IPFilter extensively, including the "native" one in SXCE snv_129.

A similar-looking problem happened at a remote site with an MTU mismatch at an intermediate ISP (i.e. most of the net used 1500 bytes and they only passed 1492 bytes). The problem was reproducible with stalled scp's, web page downloads, and even by forcing the shell to flood with data (listing large directories, cat'ing files or even pressing and holding SPACE in prstat or top to refresh as fast as they could). Tuning that rack's firewall to send packets with a smaller MTU to the internet and to be kind to fragmented packets fixed the problem.
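
To make the ISP case above concrete, here is a minimal Python sketch. It is only an illustration: the hop list and function names are made up, and it assumes packets cannot be fragmented along the way and that the sender never learns about the smaller hop.

```python
def path_mtu(hop_mtus):
    """Return the smallest MTU along a path (the effective path MTU)."""
    return min(hop_mtus)


def passes(packet_size, hop_mtus):
    """True if an unfragmentable packet of this size fits through every hop."""
    return packet_size <= path_mtu(hop_mtus)


# Hypothetical path: local LAN, the ISP link that only passed 1492 bytes, far side.
hops = [1500, 1492, 1500]

for size in (576, 1492, 1500):
    verdict = "passes" if passes(size, hops) else "dropped"
    print(size, verdict)
```

Under those assumptions, small packets (interactive typing, short replies) get through while full-size 1500-byte packets silently disappear, which matches the "works until you send a lot of data" pattern.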

My point:
1) If your host and/or guests do use IPFilter (including just loading the module and service), try to disable that and see if it helps. Whatever the result, you'd have some more ideas to pursue.
2) Also try tuning some IPF-related settings, such as disabling hardware checksum offload.
3) Maybe some similar problem crept into VBox networking, and/or OpenSolaris, between SXCE 129 (the last I used) and OSOL 133?

PS: you do use same MACs in host VNIC configs and VM interfaces, right?

PPS: am I correct to assume that your problems don't happen for scp's within the physical machine - between guests and their host, just between guests?
randshuntzinger
Posts: 23
Joined: 14. Aug 2008, 19:18

Re: Network problems with VirtualBox 3.1.4 on OpenSolaris

Post by randshuntzinger »

Thank you for your comments. I checked and IPF is disabled everywhere, so that is not it. I suspect that #3 may be the case, but I'm not yet sure enough to file a bug.

Yes - I used the same MACs in the host VNIC configs and the VM interfaces. If this were not the case, I doubt the networking would have worked at all.

As for transfers within the physical machine: things are much better, but I did have hangs between two of the guests (an OpenSolaris guest, now b134, and a Solaris 10 guest). Transfers to an outside host still hang reliably. This is breaking one of my maintenance scripts rather badly. The (now) b134 guests also hang periodically, and I wonder whether that is a cumulative effect of the network issues. For now, I think it is better to consider only the problem stated here.

Hopefully the person who said they'd try to recreate this was successful.

-- Rand
Ramshankar
Oracle Corporation
Posts: 793
Joined: 7. Jan 2008, 16:17

Re: Network problems with VirtualBox 3.1.4 on OpenSolaris

Post by Ramshankar »

I'm really sorry for the late reply, I've been caught up with other things. This does not sound like a problem I've experienced recently.

Can you please open a ticket on http://www.virtualbox.org/report/13 with the relevant VBox.log files and the ifconfig and dladm output from your host?

The quickest way to help us solve such non-trivial setup problems is to give us the simplest way to reproduce them (i.e. the minimum number of VMs that have to run in parallel, and so forth).

Thank you.
Oracle Corp.
randshuntzinger
Posts: 23
Joined: 14. Aug 2008, 19:18

Re: Network problems with VirtualBox 3.1.4 on OpenSolaris

Post by randshuntzinger »

It has been some time since I posted my original problem. I think that I may now know what the problem is. It appears to affect both VirtualBox and xVM on OpenSolaris. The problem seems to relate to MTU sizes. Here is what I have.

Host machine: Dell 1950, bnx0 network interface, MTU 1496 (not 1500, and I'm not sure why)
Four VirtualBox guests, three of them with bridged networking on Crossbow VNICs (these are the relevant ones).

There was an MTU mismatch between the VB guests (1500) and the bnx0 interface (1496). This caused problems with connections between two VB guests (even though they both had the same MTU) and with outgoing transfers from a VBox guest to outside hosts. Incoming connections worked OK, and since that was the most frequent case, things seemed to work fine.

If I reduce the MTU on the VB guest interface, the problem seems to go away. I'm no expert, but shouldn't the VBox guest be able to negotiate an MTU between itself and the host that would work? Is this a bug or not? I had a similar problem on xVM but not with OpenSolaris ipkg zones.
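
The arithmetic behind the mismatch described above can be sketched in Python. This is only an illustration, assuming IPv4 with no IP or TCP options (20 bytes of header each); the function names are made up for the example.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def oversize_by(sender_mtu, link_mtu):
    """Bytes by which a full-size segment from the sender exceeds the link MTU (0 if it fits)."""
    full_packet = mss_for_mtu(sender_mtu) + IPV4_HEADER + TCP_HEADER
    return max(0, full_packet - link_mtu)


guest_mtu = 1500  # VB guest interface, as reported in the post
host_mtu = 1496   # bnx0 on the host, as reported in the post

print("guest MSS:", mss_for_mtu(guest_mtu))
print("host-link MSS:", mss_for_mtu(host_mtu))
print("full-size guest packet too big by:", oversize_by(guest_mtu, host_mtu))
print("after lowering guest MTU:", oversize_by(host_mtu, host_mtu))
```

With the guest at 1500 and bnx0 at 1496, a full-size segment from the guest is 4 bytes too large for the host link, while a short reply such as the /etc/release output fits easily. That would be consistent with small transfers working and larger ones stalling, and with the stall disappearing once the guest MTU is lowered to 1496.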

Anyway, I have a workaround. Now I'll see which of the many problems I've had on VirtualBox this will resolve.

-- Rand