Why does my VM become slower during CPU-heavy tasks after some days of runtime?

ams_tschoening
Posts: 24
Joined: 24. May 2017, 16:30

Post by ams_tschoening »

I've been dealing with a strange problem for some time now: after some days of runtime, one of my VMs seems to become slower on CPU-heavy tasks. One example where this happens is reading virus signature databases in ClamD, triggered either by restarting the daemon, by sending the signal USR2 to reload the signatures, or because the configured signature check timeout has elapsed.

After restarting the VM, reading the virus databases is fast: it takes about 35 seconds and is pretty constant if repeated. After some days of runtime "something" happens which makes loading those signatures very slow, up to the point where it takes 15 or even 20 minutes(!) if the VM additionally handles its normal daytime workload. At night it's a bit faster, maybe half the time, but that's still many minutes, whereas before that "something" happened it was always far less than a minute.

My problem is that I can't find out what that "something" is. Once that strange event has happened, it doesn't only affect signature loading in ClamD; that scenario just shows the problem most clearly. It seems to affect everything that is CPU-bound. It feels like there's a handbrake applied to the CPUs: whenever something CPU-bound is in progress, all other processes seem to pile up as well, putting a very high load on the system and making it slow, up to the point where even simple cursor key navigation in e.g. Midnight Commander (mc) is no longer usable. Restarting Apache Tomcat, which serves multiple different web applications, triggers the effect as well once that "something" has happened: the restart takes far longer than before.

Those effects can easily be seen in htop:
[Attachment: htop 01.png]
That high load is caused only by the ClamD process. Normally the load is not that high, especially as requests to Tomcat are usually served pretty fast. Once ClamD finishes, the overall load drops again. Also note that ClamD uses >100% CPU, which is normally not the case, because reading signatures is done on only one CPU. The next picture is interesting as well:
[Attachment: htop 02.png]
After Tomcat has processed the earlier requests, the load on all CPUs drops and ClamD gets back to what looks like a normal ~100%. But it isn't normal: ClamD takes too long, it had already been working for minutes, and the other top processes like htop itself shouldn't create such a high load either. Without ClamD running, it's ~2-3%.

So it seems that short tasks get slower but stay "fast enough", while everything that consumes a lot of CPU, like ClamD or Tomcat, gets very slow and slows down other processes as well. This can even be seen in the logs of ClamD: reloads start out fast and become slower over the following days:

Tue May  1 11:56:26 2018 -> Reading databases from /var/lib/clamav
Tue May  1 11:57:01 2018 -> Database correctly reloaded (10566159 signatures)
Tue May  1 19:11:07 2018 -> Reading databases from /var/lib/clamav
Tue May  1 19:11:47 2018 -> Database correctly reloaded (10566159 signatures)
Wed May  2 00:51:15 2018 -> Reading databases from /var/lib/clamav
Wed May  2 00:51:53 2018 -> Database correctly reloaded (10578504 signatures)
Wed May  2 03:41:56 2018 -> Reading databases from /var/lib/clamav
Wed May  2 03:42:31 2018 -> Database correctly reloaded (10579770 signatures)
Wed May  2 20:45:32 2018 -> Reading databases from /var/lib/clamav
Wed May  2 20:46:07 2018 -> Database correctly reloaded (10579770 signatures)
Thu May  3 00:52:29 2018 -> Reading databases from /var/lib/clamav
Thu May  3 00:53:08 2018 -> Database correctly reloaded (10584928 signatures)
Thu May  3 03:42:07 2018 -> Reading databases from /var/lib/clamav
Thu May  3 03:42:46 2018 -> Database correctly reloaded (10586235 signatures)
Thu May  3 08:52:18 2018 -> Reading databases from /var/lib/clamav
Thu May  3 08:53:06 2018 -> Database correctly reloaded (10586235 signatures)
Fri May  4 01:00:30 2018 -> Reading databases from /var/lib/clamav
Fri May  4 01:01:53 2018 -> Database correctly reloaded (10586721 signatures)
Fri May  4 03:42:43 2018 -> Reading databases from /var/lib/clamav
Fri May  4 03:44:01 2018 -> Database correctly reloaded (10588026 signatures)
[...]
Sat May  5 00:56:17 2018 -> Reading databases from /var/lib/clamav
Sat May  5 00:59:48 2018 -> Database correctly reloaded (10589668 signatures)
Sat May  5 03:47:01 2018 -> Reading databases from /var/lib/clamav
Sat May  5 03:53:47 2018 -> Database correctly reloaded (10590874 signatures)
Sat May  5 13:40:49 2018 -> Reading databases from /var/lib/clamav
Sat May  5 13:56:33 2018 -> Database correctly reloaded (10590874 signatures)
Sun May  6 01:00:20 2018 -> Reading databases from /var/lib/clamav
Sun May  6 01:09:27 2018 -> Database correctly reloaded (10597394 signatures)
Sun May  6 03:51:45 2018 -> Reading databases from /var/lib/clamav
Sun May  6 03:59:11 2018 -> Database correctly reloaded (10598555 signatures)
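
The growth in reload time is easier to follow when the gap between each "Reading databases" line and the next "Database correctly reloaded" line is computed. The following is only a sketch in Python: the function name reload_durations and the embedded sample (three reloads copied from the log above) are illustrative, and it assumes the timestamp format shown above with English day and month names.

```python
from datetime import datetime

# Three reloads copied from the ClamD log above; a real run would read the log file instead.
SAMPLE_LOG = """\
Tue May  1 11:56:26 2018 -> Reading databases from /var/lib/clamav
Tue May  1 11:57:01 2018 -> Database correctly reloaded (10566159 signatures)
Fri May  4 01:00:30 2018 -> Reading databases from /var/lib/clamav
Fri May  4 01:01:53 2018 -> Database correctly reloaded (10586721 signatures)
Sat May  5 13:40:49 2018 -> Reading databases from /var/lib/clamav
Sat May  5 13:56:33 2018 -> Database correctly reloaded (10590874 signatures)
"""

TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"  # assumption: English day/month names


def reload_durations(log_text):
    """Return (start_time, seconds) for every completed signature reload."""
    durations = []
    start = None
    for line in log_text.splitlines():
        stamp, sep, message = line.partition(" -> ")
        if not sep:
            continue  # skip lines such as "[...]" that carry no timestamp
        when = datetime.strptime(stamp.strip(), TIMESTAMP_FORMAT)
        if message.startswith("Reading databases"):
            start = when
        elif message.startswith("Database correctly reloaded") and start is not None:
            durations.append((start, int((when - start).total_seconds())))
            start = None
    return durations


for started, seconds in reload_durations(SAMPLE_LOG):
    print(f"{started:%Y-%m-%d %H:%M:%S}  reload took {seconds} s")
```

Run against the complete log, a listing like this shows on which day the reload time first leaves the usual 35 to 40 second range.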

To make things even worse, I was unable to reproduce the problem on a very similar VM with pretty much the same hardware and software settings. I'm using ClamD with the same version, settings and signatures in 3 other VMs with the same OS but different load and software, and the problem doesn't occur in those, even though ClamD reloads almost every hour there, so it would have been far easier to spot in the logs. Additionally, when the VM is slow, there's no heavy I/O load (iostat) and no heavy context switching (mpstat), the VM host itself is not running out of resources, and re-creating the VM from scratch and installing a new OS did not solve the problem. I'm also pretty sure that it's not a pure performance bottleneck, because 1. the problem only starts after some event and everything is fast before, and 2. I tried to reproduce the problem using a VM with far fewer resources and it didn't occur.

The VM itself runs Ubuntu 16.04 with 8 vCPUs and 48 GB of RAM. The VM host runs Ubuntu 16.04 on 2 Intel(R) Xeon(R) CPU X5675 @ 3.07 GHz with Hyper-Threading enabled, so a total of 24 logical CPUs, and 148 GB of RAM. Normally those are enough resources to serve my apps fast. The hypervisor is VirtualBox 5.2.10.

Any more ideas how to debug this, what could be the "something" creating the trouble? Thanks!
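
One way to narrow down when the "something" happens is to time a fixed, purely CPU-bound workload at regular intervals and log the result with a timestamp. The following is only a sketch in Python under that assumption: the function name cpu_probe and the iteration count are arbitrary illustrative choices, and scheduling (for example via cron) is left out.

```python
import time
from datetime import datetime


def cpu_probe(iterations=2_000_000):
    """Time a fixed, purely CPU-bound loop and return the elapsed seconds."""
    started = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += (i * i) % 7  # integer arithmetic only, no I/O involved
    return time.perf_counter() - started


# One timestamped measurement; appending such lines to a file over several days
# would show when the same amount of work starts to take longer.
print(f"{datetime.now():%Y-%m-%d %H:%M:%S} cpu_probe took {cpu_probe():.3f} s")
```

Because the workload never changes, a jump in the logged duration points at the environment rather than at ClamD or Tomcat.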
JEBjames
Posts: 58
Joined: 26. Jan 2017, 18:27
Primary OS: MS Windows other
VBox Version: OSE other
Guest OSses: Centos, Ubuntu, Debian, Various Windows

Post by JEBjames »

This may be unrelated...

I noticed updating clamav's definitions can take a few seconds...to a very very long time to complete.

ClamAV's default config randomly resolved db.local.clamav.net to the IP of a "local" mirror. For me, it resolved to several different IPs. All were good, except one which was very very slow.

Run "netstat -anp | grep clam" when it's slow and see if it's stuck communicating with one of the ClamAV update servers.
ams_tschoening

Post by ams_tschoening »

Sounds like you didn't see high CPU consumption like I do, right? Latencies in updating the signatures themselves can be seen in the logs as well, but in my scenario the files have already been updated, and it's really only loading them into memory afterwards that takes so long after some runtime. And it's not only ClamAV, but other CPU-bound tasks as well, like restarting Tomcat or sorting and comparing lots of data in PostgreSQL. I have the feeling that the more I/O waits are involved, the less of a problem I have, and that the more purely CPU-bound things get, the more serious the problem becomes.

So thanks for the idea, but I doubt that latencies during sig updates are the problem for me.
JEBjames

Post by JEBjames »

I doubt it's your case either. Just thought it was worth a guess. I wasted too much time on a weird clamd/freshclam log jam with ultra slow mirrors.
ams_tschoening

VM becomes slow after some days of runtime with 48 GB of RAM, not with 6 GB.

Post by ams_tschoening »

Hi all,

I've been dealing for some weeks now with a problem that results in a very slow VM guest after the VM has run for some days.
[ModEdit: Threads have been merged]
"Slow" means that CPU-bound operations take more time than before and that those operations seem to pile up over time. Reloading ClamD signatures, for example, normally takes ~35 seconds at 100 % on one core. With the problem, this increases to 1 minute or more without any other load, and can easily take 10 or 15 minutes with some other load. That other load might be database queries by some web app, which already create 100 % load on a core by themselves. It seems that without the problem both operations simply run as fast as the CPU is capable of, while with the problem both CPU-bound tasks get slower individually and at the same time raise the overall load on the system. Every other little operation, like running "htop", creates an abnormally high load as well. Additionally, processes like ClamD that normally create 100 % load on one core are now shown as creating 150 % load or more. In theory, and as the ClamAV people said, that is impossible for reloading signatures, because that is simply not multi-threaded. So it seems that some overhead is introduced which heavily reduces overall system performance. At the same time, neither the VM host itself nor other VMs on the same host suffer from any performance problems.

This happened with a guest OS of Ubuntu 14.04 LTS in the past and also with 16.04 LTS after a fresh install, including recreating the VM. I think I was able to track this down to one difference: if the VM is used with 48 GB of RAM, the problem occurs after some days of runtime; if it is used with only 6 GB of RAM, it doesn't. I'm very sure that the amount of RAM really is the only difference between the two cases; the workload tested is the same and is provided by automatically running tests using Jenkins and by signature updates in ClamD. It is very likely that the problem doesn't occur with 8 GB of RAM either, because I have another VM with that much memory which doesn't show the problem, but I currently don't know at which amount of RAM the problem starts to occur. It's pretty time-consuming to test this, because the problem isn't there right from the start; it only starts happening after some time.

My server is an HP DL380 G7 with 2 Intel Xeon X5675 @ 3.07 GHz and 144 GB of RAM, evenly distributed over all sockets and RAM slots. It runs Ubuntu 16.04 LTS and hosts the VMs on ZFS; the VM tested has 8 vCPUs and either 48 GB or 6 GB of RAM assigned. The server's resources should be more than enough for my needs: the G6 I used before was a bit slower with a bit less RAM and didn't show these problems. And as long as the problem hasn't occurred, the VM with 48 GB of RAM behaves as expected as well. I'm pretty much certain that there's no swapping or memory overcommitting on the host:

top - 11:49:38 up 28 days, 13:54,  1 user,  load average: 0.26, 0.33, 0.35
Tasks: 904 total,   1 running, 899 sleeping,   0 stopped,   4 zombie
%Cpu(s):  0.1 us,  0.5 sy,  0.0 ni, 99.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 14853158+total,  5032192 free, 13115475+used, 12344644 buff/cache
KiB Swap:  5852156 total,  5852144 free,       12 used. 11533812 avail Mem
Do you have any idea what could create the problem here?

I'm currently looking at NUMA vs. "Node Interleaving", but am fairly sure that NUMA is enabled. Additionally, from what I've read, the performance impact might be around 20 % or even 40 %, but not so heavy that some operations, like connecting to the database, time out entirely. I've also read that in most cases one should simply not deal with NUMA specifics at all, but keep the OS defaults and let the kernel decide where to schedule which thread. I don't need the last bit of performance anyway; it's only that things currently get unacceptably slow after some time.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 72477 MB
node 0 free: 14758 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 72572 MB
node 1 free: 11046 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
$ dmesg | grep -i numa
[    0.000000] NUMA: Node 0 [mem 0x00000000-0xdfffffff] + [mem 0x100000000-0x121fffffff] -> [mem 0x00000000-0x121fffffff]
[    0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
$ sysctl -a | grep numa_
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256

What makes me wonder is that 48 GB of RAM isn't that much memory at all: I've read about other users running into problems only after more than 128 GB had been assigned, and about developers saying that they successfully tested with 1 TB of RAM. So I'm open to ideas, thanks!
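
The per-node figures in the numactl output above can also be compared mechanically, for example to watch how much memory remains free on each NUMA node over time. The following is only a sketch in Python: the function name node_memory is made up for illustration, and the sample text is copied from the output above; on a live host the text would come from running numactl --hardware.

```python
import re

# Copied from the `numactl --hardware` output above.
NUMACTL_OUTPUT = """\
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 72477 MB
node 0 free: 14758 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 72572 MB
node 1 free: 11046 MB
"""


def node_memory(text):
    """Return {node: {"size": MB, "free": MB}} parsed from numactl --hardware output."""
    nodes = {}
    pattern = r"^node (\d+) (size|free): (\d+) MB$"
    for node, field, megabytes in re.findall(pattern, text, re.MULTILINE):
        nodes.setdefault(int(node), {})[field] = int(megabytes)
    return nodes


for node, mem in sorted(node_memory(NUMACTL_OUTPUT).items()):
    free_pct = 100 * mem["free"] / mem["size"]
    print(f"node {node}: {mem['free']} MB of {mem['size']} MB free ({free_pct:.0f} %)")
```

Note that the snapshot above was taken with the VMs running, so the free values already reflect the memory the guests have touched so far.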
socratis
Site Moderator
Posts: 27330
Joined: 22. Oct 2010, 11:03
Primary OS: Mac OS X other
VBox Version: PUEL
Guest OSses: Win(*>98), Linux*, OSX>10.5
Location: Greece

Post by socratis »

I didn't see the need for a new thread, on the contrary, your previous description ties in with your new message and provides context. I have therefore merged your two threads.

I can't test your setup, for obvious reasons, so I can't comment on the actual problem.
Do NOT send me Personal Messages (PMs) for troubleshooting, they are simply deleted.
Do NOT reply with the "QUOTE" button, please use the "POST REPLY", at the bottom of the form.
If you obfuscate any information requested, I will obfuscate my response. These are virtual UUIDs, not real ones.
ams_tschoening

Post by ams_tschoening »

socratis wrote:I didn't see the need for a new thread, on the contrary, your previous description ties in with your new message and provides context.
I simply hoped to get more ideas/attention with a more concrete focus of my problem I wasn't aware of when creating the first thread.
ams_tschoening

What is the largest RAM size for a VM you used WITHOUT enabling "largepages" in VirtualBox?

Post by ams_tschoening »


[ModEdit; Threads merged again, because 1) it's the same VM/problem, 2) because the OP asked that the largepages question be answered in the 1st thread (this one), in the mailing list.]
I'm suffering from heavy performance issues, explained in more detail in another thread already. From what I've tested so far, there's a direct relationship to the amount of memory assigned to the VM (the problem occurs with 48 GB of RAM, but not with 6 GB) and, it seems, to the "largepages" setting of VirtualBox itself. Current tests show that with 48 GB of RAM in the VM and "largepages" enabled, the problem doesn't seem to occur either. That is interesting, because that setting doesn't seem to be enabled by default, the docs only mention an improvement of ~5 % and not that it's necessary for decent performance at some RAM size, and additionally there are circumstances in which "largepages" is ignored by VirtualBox altogether:
00:00:42.866663 PGMR3PhysAllocateLargePage: allocating large pages takes too long (last attempt 103 ms; nr of timeouts 11); DISABLE
https://www.virtualbox.org/attachment/t ... .log#L1154

So currently, in my opinion, it is unclear under which circumstances "largepages" is not only suggested but required for VirtualBox to operate properly. To find that out, one needs to know which RAM sizes for VMs have been used without "largepages" in the past and which haven't because of performance issues like those I'm seeing.
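
Whether VirtualBox silently gave up on large pages for a particular VM can be checked by searching its VBox.log for the message quoted above. The following is only a sketch in Python: the function name large_page_disable_events is made up for illustration, and the pattern is derived from the single log line quoted above, so other VirtualBox versions may word the message differently.

```python
import re

# The one log line quoted above; a real check would read the VM's VBox.log instead.
SAMPLE_LOG = (
    "00:00:42.866663 PGMR3PhysAllocateLargePage: allocating large pages takes too long "
    "(last attempt 103 ms; nr of timeouts 11); DISABLE\n"
)

# Assumption: the message always has the shape of the quoted line.
DISABLE_PATTERN = re.compile(
    r"^(?P<uptime>\d+:\d+:\d+\.\d+) PGMR3PhysAllocateLargePage: .*"
    r"last attempt (?P<ms>\d+) ms; nr of timeouts (?P<timeouts>\d+)\); DISABLE",
    re.MULTILINE,
)


def large_page_disable_events(log_text):
    """Return (vm_uptime, last_attempt_ms, timeouts) for each large-page DISABLE message."""
    return [
        (match["uptime"], int(match["ms"]), int(match["timeouts"]))
        for match in DISABLE_PATTERN.finditer(log_text)
    ]


for uptime, last_ms, timeouts in large_page_disable_events(SAMPLE_LOG):
    print(f"large pages disabled at VM uptime {uptime} "
          f"(last attempt {last_ms} ms, {timeouts} timeouts)")
```

An empty result for a VM that has "largepages" enabled would suggest that the setting stayed in effect for that run.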
klaus
Oracle Corporation
Posts: 1111
Joined: 10. May 2007, 14:57

Post by klaus »

Where's your VBox.log? It's declared a vital resource in all troubleshooting docs. Pointing to a log from a somewhat related ticket on a different host OS isn't going to help much. Would be much better if you provided a log from the setup where you observe problems.

IIRC on Linux the use of large pages is disabled, because they're too unreliable, in the sense that it depends too much on the exact kernel version if they work properly at all, not only straight after system boot. The issue is that it can take minutes (!) to allocate a single large page on a busy Linux system. We've seen cases where it started swapping despite having plenty of free memory (would just need a plan to defragment/relocate memory to free up a large page).

I wonder if your observation is just deteriorating memory management performance of the guest Linux kernel (which can be aggravated by the Meltdown mitigation stuff, re-shuffling the page tables all the time - something you didn't mention at all) with large amounts of memory, postponed by the VirtualBox default of allocating VM memory lazily. This lazy memory allocation can take a long time with Linux VMs, as Linux touches new memory pages only if needed. It can take many hours to fill the filesystem cache enough to reach 48GB.
ams_tschoening

Post by ams_tschoening »

Could you please stop merging my threads for no reason?! I DID NOT ask on the mailing list for answers to be posted in the linked thread! The question about "largepages" IS independently useful and should NOT be merged here! Please recreate my thread regarding "largepages" and stop mixing independent topics. I think carefully about the questions I ask and WHERE I ask them. With your merges you make it unnecessarily hard for me to get the answers I'm looking for.
[...]2) because the OP asked that the largepages question be answered in the 1st thread (this one), in the mailing list.
That is simply NOT true.
ams_tschoening

Post by ams_tschoening »

klaus wrote:Where's your VBox.log?
And that's exactly why the question about "largepages" shouldn't have been merged here: I was interested in people using "largepages", and none of the "VBox.log" files I have is relevant to THAT question. The thread it was merged into was already dead; no one cared anymore or asked for a log file, which is why none has been uploaded.
klaus wrote:It's declared a vital resource in all troubleshooting docs.
I was asking for experiences and facts of other users regarding one aspect of VirtualBox VMs to understand that particular feature better.
klaus wrote:Pointing to a log from a somewhat related ticket on a different host OS isn't going to help much.
Of course it is very helpful to understand the particular feature better I was asking about and that was the whole purpose of the link.
klaus wrote:Would be much better if you provided a log from the setup where you observe problems.
That might be true for the thread where my question has been merged into, but that didn't get much attention anymore anyway. That's why I was asking some other, more specialized question. It is not my fault that topics get mixed now, the question for "largepages" was accurate and independent enough for an own topic.
klaus wrote:IIRC on Linux the use of large pages is disabled, because they're too unreliable, in the sense that it depends too much on the exact kernel version if they work properly at all, not only straight after system boot.
But it's not disabled, it's only not enabled by default and there's absolutely no hint in the docs that one shouldn't use it at all. The opposite is the case, it's even recommended to use it for maximum performance.
If nested paging is enabled, the VirtualBox hypervisor can also use large pages to reduce TLB usage and overhead. This can yield a performance improvement of up to 5%.
But there's no hint anywhere that/if it's required to use "largepages" with some VMs.
klaus wrote:The issue is that it can take minutes (!) to allocate a single large page on a busy Linux system. We've seen cases where it started swapping despite having plenty of free memory (would just need a plan to defragment/relocate memory to free up a large page).
That sounds like a good answer for one of the questions I asked on the specialized developer mailing list, but "somebody" thought it was a good idea to discuss everything here in that now mixed-up thread:
ams_tschoening wrote:If "largepages" is necessary to get usable performance for some amount
of RAM, why is it allowed to fail while continuing the VM?
Of course my question on the mailing list was mentioning additional relevant aspects of that decision.
klaus wrote:I wonder if your observation is just deteriorating memory management performance of the guest Linux kernel (which can be aggravated by the Meltdown mitigation stuff, re-shuffling the page tables all the time - something you didn't mention at all) with large amounts of memory, postponed by the VirtualBox default of allocating VM memory lazily.
That's one of the reasons I was asking for experiences of other users with/without large-memory VMs and with/without "largepages": that problem would then always occur, not only in my setup. And I already found other people claiming that 128 GB of RAM worked for them, but without saying whether they used "largepages" or not.
klaus wrote:This lazy memory allocation can take a long time with Linux VMs, as Linux touches new memory pages only if needed. It can take many hours to fill the filesystem cache enough to reach 48GB.
I could easily fill the cache "instantly" by running "tar" on some files of roughly 20 GB in size, and yet my problem only happens after a day or so. In production it takes considerably longer, because I too guess that the cache fills more slowly there.
Last edited by ams_tschoening on 28. May 2018, 18:11, edited 2 times in total.
socratis

Post by socratis »

ams_tschoening wrote:That is simply NOT true.
Can you think of any good reason why I would lie about something like that? Please check the thread you're in (the thread number, t=87904). Then check what you listed in the mailing list as reference, on your second email, where you just added the references. Then check the numbers again...
In the email to the vbox-dev mailing list, you wrote: Thanks for your time!

[1]: viewtopic.php?f=3&t=87904
[2]: https://www.virtualbox.org/manual/ch10. ... stedpaging
[3]: https://www.virtualbox.org/ticket/16518#comment:2

Kind regards,
As for the merging of the thread, I told you again, you might have a legit question about the large pages, but you're isolating parts of the same problem in separate threads. It is my impression that the whole problem belongs in one thread, because it deals with the same fundamental question.
ams_tschoening

Post by ams_tschoening »

socratis wrote:Can you think of any good reason why I would lie about something like that?
Of course, you simply misunderstood things.
socratis wrote: Please check the thread you're
Could you please QUOTE the exact text in which I asked to answer here?! You can't. The links you quoted are referenced within my e-mail as the source for claims I make and that is standard on ALL mailing lists. If you think otherwise, you are simply wrong and have completely misunderstood things.
socratis wrote: As for the merging of the thread, I told you again, you might have a legit question about the large pages, but you're isolating parts of the same problem in separate threads.
That was on purpose, because the question about "largepages" is independently useful and I didn't want people to care too much about my problem in the other thread. I could have not mentioned/linked my problem at all and it would STILL MAKE SENSE on its own! The link was just a side note, because some people care about additional context, and to prevent "Which performance problem?" answers.
socratis wrote: It is my impression that the whole problem belongs in one thread, because it deals with the same fundamental question.
You are wrong, just like you already misinterpreted the links in my mail. You are simply taking away my chance to get answers to specialized questions. Just look at how many people answered in this thread before, and consider whether they will scroll down past all the initial text and images to find some very unrelated question... It's also interesting to see that, while you merged my second, far more specialized and RAM-oriented question into the first thread some days ago already, you didn't even care to mention that it might be a good idea to provide some logs. It's all about some very strange editing policy decisions, not about the questions I ask.