NAT Network Creates Spurious Translation for IPSEC VPN

ImCallingYouDreamer
Posts: 3
Joined: 22. Mar 2023, 23:33

NAT Network Creates Spurious Translation for IPSEC VPN

Post by ImCallingYouDreamer »

I'm running a VPN client from within a Windows 11 VM running on a Windows 11 host using VirtualBox version 7.0.6 r155176. Both Windows machines are patched and up to date as of 20230322.

My VPN client connects using IPSEC to a test VPN server on a local network. Once the VPN session is established, it disconnects after 15 to 50 minutes. If I change the Dead Peer Detection settings in the VPN client from a 30-second check interval, 5 retries, and 15 seconds between retries to a 15-second check interval, 5 retries, and 7 seconds between retries, the connection stays up for hours, but may still fail.

IPSEC VPN traffic first moves over UDP port 500 (normally source port 500 to destination port 500 for IKE) and, once negotiated, moves over UDP port 4500 (normally source port 4500 to destination port 4500). When the traffic passes through a NAT, the NAT device creates a port translation, and the source port on the outside of the NAT has a random high value. After collecting Wireshark captures, I have concluded that the NAT Network interface on my VirtualBox VM creates a new port translation for IPSEC traffic on UDP port 4500 while an existing port translation is still in use. When the NAT Network does this, it stops delivering some or all traffic to the VM, and the IPSEC VPN client times out and drops the connection.
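
The effect described above can be illustrated with a toy model. This is a minimal sketch, not VirtualBox code: the `NatTable` class, the `outside_ports` helper, the sequential port numbers starting at 50000, and the sample send times are all assumptions made for illustration. It shows how a UDP mapping that idles past the NAT's timeout is replaced by a new one, so the same inner flow suddenly appears from a different outside source port.

```python
import itertools


class NatTable:
    """Toy UDP NAT: maps an inner (ip, port) pair to an outside source port."""

    def __init__(self, idle_timeout, first_port=50000):
        self.idle_timeout = idle_timeout
        # Sequential ports keep the output readable; real NATs often pick
        # pseudo-random high ports (assumption for illustration only).
        self._ports = itertools.count(first_port)
        self._map = {}  # inner -> (outside_port, time of last packet)

    def translate(self, inner, now):
        entry = self._map.get(inner)
        if entry is not None and now - entry[1] < self.idle_timeout:
            port = entry[0]            # mapping still alive: reuse it
        else:
            port = next(self._ports)   # mapping missing or expired: new port
        self._map[inner] = (port, now)
        return port


def outside_ports(send_times, idle_timeout):
    """Outside source port seen for each packet of one inner UDP flow."""
    nat = NatTable(idle_timeout)
    inner = ("192.168.133.4", 4500)
    return [nat.translate(inner, t) for t in send_times]


# Hypothetical send times in seconds; one gap is slightly longer than 20 s.
times = [0.0, 19.9, 39.8, 60.0, 79.9]
print(outside_ports(times, idle_timeout=20.0))
print(outside_ports(times, idle_timeout=30.0))
```

With a 20-second idle timeout, the fourth packet arrives 20.2 seconds after the third and gets a new outside port, which matches the split stream seen on the host side. With a 30-second timeout, all five packets share one port.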

I've checked VMware Workstation 15.7 with the same Windows 11 installation on the same host and this problem does not occur. All traffic runs as a single UDP stream and remains a single stream for the duration of the connection. For VirtualBox, the new port translation creates a new UDP stream on the outside of the NAT.

When we configure the VPN client for reduced timeouts as indicated above, we delay this problem. But this is not a problem with the VPN Client from what I can tell. I've included the Wireshark captures of traffic in the VM and traffic on the host. There is no other traffic between these hosts at the time of the failure. There is an ARP broadcast and reply both on the host and within the VM. They occur at roughly the same time, but they are occurring frequently before and after this failure.

The host is at address 10.30.3.102, and the VPN device is at address 10.30.3.100. The VM is at address 192.168.133.4. Times are in Eastern Daylight Time. This is a deal-breaker for me in using VirtualBox.
Attachments
VM UDP 500 4500 Single Stream.zip
Wireshark capture on VM (ip.src_host==192.168.133.4 && ip.dst_host==10.30.3.100 ) || (ip.src_host==10.30.3.100 && ip.dst_host==192.168.133.4)
(160.37 KiB) Downloaded 2 times
Host UDP 500 4500 Split Stream.zip
Wireshark capture on Host (ip.src_host==10.30.3.102 && ip.dst_host==10.30.3.100 ) || (ip.src_host==10.30.3.100 && ip.dst_host==10.30.3.102)
(170.87 KiB) Downloaded 2 times
Virtualbox files.zip
Configuration and log files
(99.36 KiB) Downloaded 2 times
fth0
Volunteer
Posts: 5668
Joined: 14. Feb 2019, 03:06
Primary OS: Mac OS X other
VBox Version: PUEL
Guest OSses: Linux, Windows 10, ...
Location: Germany

Re: NAT Network Creates Spurious Translation for IPSEC VPN

Post by fth0 »

I have a theory explaining your issue:

I think that the VirtualBox NAT implementation uses a UDP hold time of 20 seconds for the NAT mapping. Since the VPN client also sends its NAT keepalive packets every 20 seconds, it can happen that the NAT mapping runs into its timeout. In your Wireshark traces, you can see inter-packet intervals of slightly more than 20 seconds at ~1080, ~1200 and ~1500 seconds from the beginning, and at the last point in time the NAT mapping gets removed.
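
One way to look for such intervals is to export the packet timestamps from Wireshark and scan for gaps above a threshold. The following is a minimal sketch under stated assumptions: `long_gaps` is a hypothetical helper, and the sample timestamps are made up, not taken from the attached captures.

```python
def long_gaps(timestamps, threshold):
    """Return (start_time, gap) for each inter-packet gap longer than threshold."""
    ordered = sorted(timestamps)
    return [(a, b - a) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]


# Made-up timestamps in seconds for a single UDP flow.
sample = [0.0, 20.0, 40.0, 60.5, 80.5, 100.5]
for start, gap in long_gaps(sample, threshold=20.0):
    print(f"gap of {gap:.1f} s after t={start:.1f} s")
```

Only gaps strictly longer than the threshold are reported, so packets spaced exactly 20 seconds apart are not flagged.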

Reducing the DPD interval to 15 seconds prevented that from happening. If you can also reduce the NAT keepalive interval to 15 seconds, you could even turn off DPD if it's not needed. Your VPN client seems to have a well-behaved DPD implementation that only sends DPD requests when no data has been received during the last DPD timeout interval, so you could reduce the DPD interval to 10 seconds without generating too much extra traffic.
ImCallingYouDreamer

Re: NAT Network Creates Spurious Translation for IPSEC VPN

Post by ImCallingYouDreamer »

Interesting. I thought of this and looked for a way to configure the timeout in the VirtualBox NAT engine, but did not find one. I never imagined the timeout would be on the order of 20 seconds. I'm using TheGreenBow VPN client for VPN connections to lots of different clients on lots of different network installations. It has worked well with the default settings for NAT Keepalive. Unfortunately, I would have to open a support case to see if they have a registry entry that would change the NAT Keepalive frequency.

However, we also use many different VPN clients for other customers, and all of those would live in a VM like this one. Changing the application would therefore only partially solve the problem: another VPN client might not be configurable at all, and the cost of custom configuration on each user's VM for the particular customers they serve would be excessive. I couldn't find a configuration item for NAT Keepalive frequency in the VirtualBox documentation. If you know of one, I'd prefer to change that, since it would make the behavior equivalent to VMware Workstation, which we have 20 years of experience with and which works well.

Thanks!
fth0

Re: NAT Network Creates Spurious Translation for IPSEC VPN

Post by fth0 »

It looks like you've found a bug in VirtualBox:

The VirtualBox NAT engine tries to ensure a (non-configurable) 21-second timeout; the source code cites RFC 3948 (UDP Encapsulation of IPsec ESP Packets) with its 20-second keepalive as the reason. But the implementation uses a 3-second timer that has to elapse 7 times and that is global for all UDP sessions, so a mapping can time out anywhere between 18 and 21 seconds after each data packet.
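
The 18-to-21-second window follows from the tick arithmetic. The following is a minimal sketch of that arithmetic, not the VirtualBox source: the function name, the constants' names, and the assumption that ticks fire at 0, 3, 6, ... seconds are illustrative only.

```python
import math

TICK_SECONDS = 3.0   # period of the global timer, per the description above
TICKS_TO_EXPIRE = 7  # ticks without traffic before a mapping is dropped


def effective_timeout(packet_time, tick=TICK_SECONDS, ticks=TICKS_TO_EXPIRE):
    """Seconds between a packet and the expiry of its mapping.

    Assumes the global timer fires at 0, tick, 2*tick, ... and that the
    mapping is dropped on the 7th tick strictly after the packet.
    """
    ticks_fired = math.floor(packet_time / tick)  # last tick at or before the packet
    expiry_time = (ticks_fired + ticks) * tick
    return expiry_time - packet_time


for t in (0.0, 1.5, 2.9):
    print(f"packet at {t:.1f} s -> mapping lives {effective_timeout(t):.1f} s")
```

A packet that arrives right on a tick gets the full 21 seconds, while one that arrives just before the next tick gets only slightly more than 18. Under this model, any gap of more than 18 seconds between packets can be enough to lose the mapping, depending on where the packet lands relative to the global tick.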
fth0

Re: NAT Network Creates Spurious Translation for IPSEC VPN

Post by fth0 »

I've created ticket 21560, including a proposal for an easy bugfix.
ImCallingYouDreamer

Re: NAT Network Creates Spurious Translation for IPSEC VPN

Post by ImCallingYouDreamer »

Thanks, fth0. I appreciate the effort you put into understanding this issue. However, I'm an old engineer and I'm constantly thinking about margins. If the client sends a 20-second keep-alive signal to the server to prevent any intervening NAT translation from expiring, then as the builder of a firewall it would be foolish of me to make my NAT translation timer 20 seconds. The margin for error is zero, and it's bad practice to build something with zero margin. A construction built with zero margin is brittle and tends to break easily. That margin helps if the client clock is not exactly accurate or stable and varies for any of a host of reasons (for example, because it is on a VM that was suspended for an extended period).

So if I were building a firewall with a NAT translation engine, making the NAT translation timer 21 seconds gives me a 1 second margin. That's not a great margin either. This makes me think about what the concerns of the firewall builder would be when building a NAT translation engine. The main issues are memory consumption and source port resource exhaustion with some consideration for the CPU requirements associated with managing a large number of translations. If the NAT translation timer was infinite, the number of translations would grow without bound and the engine would consume all available resources and break the firewall it was a part of. So we'd want to choose a timer that gives us great margin while not risking port exhaustion or memory exhaustion.

Lots of firewalls use a UDP timer of 30 seconds. That gives the 20-second keep-alive signals a 10-second margin should there be delays. Windows Firewall, which must also allow reflexive UDP traffic, has a UDP timeout of 65 seconds (it seems). Unlike a firewall that serves mixed traffic from a large number of devices, the Windows Firewall can afford a much larger margin of 45 seconds without the threat of port or memory exhaustion, because there is a finite number of Windows applications on the machine that can open UDP connections. And if an application breaks and creates thousands of connections, other applications on the machine will be affected, but users elsewhere on the network will not (because the upstream hardware firewall will be more aggressive in protecting its resources). I think this is true for the VirtualBox NAT installation as well. So I would argue for a NAT translation timer of around 65 seconds, because the implications for VMs behind the NAT are minor and the value is hard-coded and unchangeable, unlike the equivalent parameter in a hardware firewall.
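
The margins compared above can be put side by side with a little arithmetic. This is a minimal sketch: the `margin` helper is hypothetical, and the timeout values are simply the ones mentioned in this thread (21, 30 and 65 seconds against a 20-second keep-alive).

```python
KEEPALIVE_SECONDS = 20  # NAT keep-alive interval discussed in this thread


def margin(nat_timeout, keepalive=KEEPALIVE_SECONDS):
    """Return (slack in seconds, slack as a fraction of the keep-alive interval)."""
    slack = nat_timeout - keepalive
    return slack, slack / keepalive


for timeout in (21, 30, 65):
    slack, fraction = margin(timeout)
    print(f"timeout {timeout:2d} s: margin {slack:2d} s ({fraction:.0%} of the keep-alive)")
```

The 21-second timer leaves a margin of 5% of the keep-alive interval, the 30-second timer 50%, and the 65-second timer 225%.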

Thanks
fth0

Re: NAT Network Creates Spurious Translation for IPSEC VPN

Post by fth0 »

Thank you for sharing your thoughts, appreciated! :)

You and I seem to have a similar background. I'm also aware of the common (non-DNS) UDP timeouts of 30, 90 and 300 seconds (e.g. on Cisco devices). I'd also agree that the 5% margin (1 of 20 seconds) is quite small. On the other hand, on today's multi-gigabit networks, 1 second can also be regarded as a long time. But I'll agree that VirtualBox is not a high-performance network participant. ;)

Since I don't have much influence on VirtualBox development, I just proposed the bugfix to make the implementation do what its author meant it to do. Additionally, I heard that the days of the LWIP implementation currently used in VirtualBox are numbered. When I get to see the new implementation, I'll pay attention to the timeouts ...
scottgus1
Site Moderator
Posts: 20965
Joined: 30. Dec 2009, 20:14
Primary OS: MS Windows 10
VBox Version: PUEL
Guest OSses: Windows, Linux

Re: NAT Network Creates Spurious Translation for IPSEC VPN

Post by scottgus1 »

fth0 wrote: "1 second can also be regarded as a long time."
0.68 seconds is an eternity. For an android, that is, according to Commander Data... :lol: