Frequent freezes when using a raw disk

Post by **DoubleHP** » 25. Nov 2016, 15:14

Hello.

I wanted to try using a raw disk: the host uses sda2, and I wanted the guest to use sda3, so I declared a raw disk.

Once or twice a day, the guest freezes, and I am now certain it's related to the raw disk. I asked cron to touch a file every 5mn on every mount point it has; when a freeze happens, it cannot write to its /, but it can still write to the NFS mount points. It's not a kernel panic: some processes stay alive for 3mn to 1h (apache, munin, and other network daemons can answer requests for some time).
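
A minimal sketch of the kind of heartbeat check described above, assuming a POSIX shell; the function name `heartbeat` and the `.heartbeat` file name are made up for illustration and are not taken from the actual cron job. Note that on a truly hung filesystem `touch` may block instead of failing, so this only catches the read-only or error case.

```shell
#!/bin/sh
# Try to write a heartbeat file on each mount point passed as an argument.
# Prints one line per mount point: "OK <dir>" or "FAIL <dir>".
# (Hypothetical helper; names are illustrative only.)
heartbeat() {
    for dir in "$@"; do
        if touch "$dir/.heartbeat" 2>/dev/null; then
            echo "OK $dir"
        else
            echo "FAIL $dir"
        fi
    done
}
```

Called from cron as something like `heartbeat / /mnt/nfs` (paths are placeholders), the output shows which mount points still accept writes at each run.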

I have not been able to correlate the freezes with any service, daemon, or process. Freezes happen at random times of the day.

Other guests work fine.

I have not changed the position of sda3 on the disk since installing the guest. I wonder whether there is a bug in VB, or what could make the guest remount the partition RO. I have checked my disk 10 times, and it's good (I mean, the physical layer is 100% good, OK, checked). It's NOT a disk problem.

During the bug, no message appears on screen (RDP).

I need help to track down the bug and fix it.

I have also checked that the vmdk file points to the correct addresses.

I am running out of ideas on how to track this; all I can state with 100% certainty is that it's not a disk failure. I have not yet done the destructive test, and I will do it only after triple-checking all other possibilities.

- How can I duplicate the kernel logs to my NFS mount point? Maybe ask syslog to write logs to two files?
- Make a list of disk operations?
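
One possible way to get the "two files" behaviour, sketched under assumptions: the helper name `duplicate_log` and both paths in the usage line are hypothetical, and it relies only on the standard `tee -a`. It is not a syslog configuration, just a pipe that appends the same lines to two places.

```shell
# Append every line read from stdin to two log files at once.
# $1 = first log file (e.g. on the local disk)
# $2 = second log file (e.g. on the NFS mount point)
# (Hypothetical helper; names are illustrative only.)
duplicate_log() {
    tee -a "$1" >> "$2"
}
```

For example, where `dmesg --follow` is supported, `dmesg --follow | duplicate_log /var/log/kern-copy.log /mnt/nfs/kern-copy.log` would keep a second copy of the kernel messages on a filesystem that stays writable during a freeze.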

Once in my life, I found an email in an MBR, on a very fresh install; it was a classic installation, and I have never understood how an email header could end up in my MBR ... I mean, various bugs can cause unimaginably strange situations. I could be facing a similar issue where, for some reason, my guest is trying to write outside sda3.

Does VB keep track of disk requests?

Post by **Perryg** » 25. Nov 2016, 15:59

I would start with the host if it were me. Things like a green drive that powers down when idle, power settings that could suspend the drive during non-use, etc.

Post by **DoubleHP** » 26. Nov 2016, 00:45

1: it's a true SSD, so no default setting would ever power it down

2: it's very busy

3: it has never gone into sleep mode in the last 3 years.

Post by **Perryg** » 26. Nov 2016, 00:55

I would need to see the guest's log file after a freeze, as well as dmesg from the host after the same freeze, to see if it shows the actual error.

Post by **DoubleHP** » 26. Nov 2016, 01:53

Host: the lockd errors are from a time when I was using NFS; NFS is now abandoned in favour of sshfs:

Code:

[279665.065880] lockd: cannot monitor leon-03
[279676.692376] lockd: cannot monitor leon-03
[279688.784214] lockd: cannot monitor leon-03
[279908.345565] HPET: Using timer above configured range: 3
[279908.677390] usb 1-2.3: reset low-speed USB device number 4 using xhci_hcd
[279911.289201] hid-generic 0003:0463:FFFF.0003: hiddev0,hidraw0: USB HID v1.10 Device [MGE UPS SYSTEMS PULSAR] on usb-0000:00:14.0-2.3/input0
[279918.321043] HPET: Using timer above configured range: 3
[279918.321061] HPET: Using timer above configured range: 3
[279918.903495] HPET: Using timer above configured range: 3
[280001.014276] lockd: cannot monitor leon-03
[280016.855961] lockd: cannot monitor leon-03
[280027.052183] lockd: cannot monitor leon-03
[280037.224525] lockd: cannot monitor leon-03
[280047.416920] lockd: cannot monitor leon-03
[280057.569842] lockd: cannot monitor leon-03
[280067.730280] lockd: cannot monitor leon-03
[280077.866811] lockd: cannot monitor leon-03
[280088.010343] lockd: cannot monitor leon-03
[280098.155184] lockd: cannot monitor leon-03
[280108.331571] lockd: cannot monitor leon-03
[280118.483529] lockd: cannot monitor leon-03
[281642.596866] lockd: cannot monitor leon-03
[282418.975235] HPET: Using timer above configured range: 3
[282419.274487] usb 1-2.3: reset low-speed USB device number 4 using xhci_hcd
[282421.902358] hid-generic 0003:0463:FFFF.0004: hiddev0,hidraw0: USB HID v1.10 Device [MGE UPS SYSTEMS PULSAR] on usb-0000:00:14.0-2.3/input0
[282429.773071] HPET: Using timer above configured range: 3
[282429.773089] HPET: Using timer above configured range: 3
[282430.375396] HPET: Using timer above configured range: 3
[282845.369816] nfsd: peername failed (err 107)!
[283932.405755] nfsd: peername failed (err 107)!
[283985.350814] nfsd: peername failed (err 107)!
[284024.690382] nfsd: peername failed (err 107)!
[303673.569681] VBoxNetFlt: Failed to allocate packet buffer, dropping the packet.
[420219.328766] HPET: Using timer above configured range: 3
[421213.488445] HPET: Using timer above configured range: 3
[935603.280633] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[935603.284079] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1022153.534123] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1022153.536542] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1108680.516130] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1108680.517685] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1195221.432277] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1195221.435449] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1281767.431249] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1281767.433466] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1303919.697878] usb 1-2.3: reset low-speed USB device number 4 using xhci_hcd
[1303922.334760] hid-generic 0003:0463:FFFF.0005: hiddev0,hidraw0: USB HID v1.10 Device [MGE UPS SYSTEMS PULSAR] on usb-0000:00:14.0-2.3/input0
[1303952.481379] HPET: Using timer above configured range: 3
[1368299.550769] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1368299.551967] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1941135.127698] usb 1-2.3: reset low-speed USB device number 4 using xhci_hcd
[1941137.744113] hid-generic 0003:0463:FFFF.0006: hiddev0,hidraw0: USB HID v1.10 Device [MGE UPS SYSTEMS PULSAR] on usb-0000:00:14.0-2.3/input0
[1941150.069046] HPET: Using timer above configured range: 3
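
To pick the filesystem errors out of a long dmesg like the one above, a small filter can help. This is only a sketch: the function name `ext4_error_summary` is made up, and it assumes the messages keep the `EXT4-fs error (device X)` wording seen in this log.

```shell
# Read dmesg-style lines on stdin and print "<device> <count>" for each
# device named in an "EXT4-fs error (device X)" message.
# (Hypothetical helper; name is illustrative only.)
ext4_error_summary() {
    sed -n 's/.*EXT4-fs error (device \([^)]*\)).*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

It would be used as `dmesg | ext4_error_summary` on the host.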

On the host, I have mounted sda3 RO to try to debug; but it started before I mounted it.

Code:

# mount | grep sda3
/dev/sda3 on /mnt/sda3 type ext4 (ro,relatime,data=ordered)

The USB UPS messages are about guest 4; it's also freezing, but more rarely (less than once a week, and probably due to timer bugs; we will dig into this later). Both guests were restarted a few hours ago. In the end, there is no line left about disk issues.

(I am too new here to be allowed to post links; replace the spaces with dots and slashes)

Guest 3: the last crash seems related to the timer:
slexy org view s20I6iOBQZ
The crash detector took 6h to make it reboot.

So let me also paste the two previous logs, because one of them is related to the disk:
slexy org view s2fmF0MW6P
slexy org view s21KsUfLP8
The crash detector took about half an hour to trigger the reboot.

What happens in the last 3 mn of the logs is irrelevant; my host has a hard job trying to detect guest crashes and restart the guests as nicely as possible. It's so frequent that I had to automate the process.

***

If I don't install NTP on all guests, some will drift a bit. Installing it seems to make VB bug a bit more often. It may depend on the guest's kernel version, and on whether the VB drivers are installed.

Post by **Perryg** » 26. Nov 2016, 02:08

I would prefer a new dmesg and log after a freeze, and one where the partition has not been mounted RO, but I do see something that you need to address:

"EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found" usually indicates a hardware issue. You should perform a full fsck and see if it indicates anything, and/or run a SMART test. When posting log files, use the attach feature and upload them here. Compress them if they are too large, but I really need them here.

Post by **DoubleHP** » 26. Nov 2016, 10:23

This forum really hates me. Yesterday I could not insert URLs because I am too new; now I can't attach 4 files at the same time; I hope I can attach the 4th one in a new message. Anyway, have you seen that I edited my message and inserted disguised links?

SMART is good

Code:

SMART overall-health self-assessment test result: PASSED
[...]
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0002   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       30623
 12 Power_Cycle_Count       0x0002   100   100   000    Old_age   Always       -       137
171 Program_Fail_Count      0x0002   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0002   100   100   000    Old_age   Always       -       0
173 Avg_Write/Erase_Count   0x0002   100   100   000    Old_age   Always       -       578
174 Unexpect_Power_Loss_Ct  0x0002   100   100   000    Old_age   Always       -       106
187 Reported_Uncorrect      0x0002   100   100   000    Old_age   Always       -       0
230 Perc_Write/Erase_Count  0x0002   100   100   000    Old_age   Always       -       1926
232 Perc_Avail_Resrvd_Space 0x0003   100   100   005    Pre-fail  Always       -       0
234 Perc_Write/Erase_Ct_BC  0x0002   100   100   000    Old_age   Always       -       2636
241 Total_LBAs_Written      0x0002   100   100   000    Old_age   Always       -       18029031516
242 Total_LBAs_Read         0x0002   100   100   000    Old_age   Always       -       5938804581
[...]
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               90%     30622         -
# 2  Short offline       Aborted by host               90%     29519         -
# 3  Short offline       Aborted by host               90%     29519         -
# 4  Short offline       Completed without error       00%      6763         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing

And you did not understand my explanation: the ext4 messages on the host are due to the fact that the host has mounted sda3 RO, and they occur at the precise moment when the guest reboots and runs its RW fsck on the partition. Due to caching, and to the structure of ext4 and of the Linux drivers, the host is not aware that the guest has modified the filesystem (and it does; I see it in the boot logs over RDP; the crashes are so hard and so frequent that at each reboot the guest has several errors to fix). But if I umount it from the host, shut down the guest, and perform several consecutive cold fscks from the host, the partition becomes sane. The errors are due to the fact that ext* filesystems are not designed to be mounted by several kernels at the same time.
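
Since the double mount is what produces these host-side errors, a guard that checks the host's mount table before the guest is started can rule it out. The following is a sketch under assumptions: `is_mounted` is a made-up name, and it only expects input in the column layout of `/proc/mounts` (device in the first field).

```shell
# Succeed (exit status 0) if device $1 appears as a mounted source in the
# mount table read from stdin (same column layout as /proc/mounts).
# (Hypothetical helper; name is illustrative only.)
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }'
}
```

For example, `is_mounted /dev/sda3 < /proc/mounts && echo "sda3 is mounted on the host"` would flag the situation before the guest gets to run its fsck on the same partition.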

A full badblocks test would require shutting down the guest for a day and making a full copy of the partition; I could do it, but I am certain it would not find any error. I will do it before the end of the year if no other explanation can be found.

Post by **DoubleHP** » 26. Nov 2016, 10:27

I could not attach the 4th file to the previous message and was about to get angry, but now I can. This is the log that should talk about disk issues:

VBox.log.gz should be the current, uncrashed log (probably useless to attach it).
VBox.log.1.gz or VBox.log.2.gz (or both) may be about a TIMER issue.
VBox.log.3.gz should be about the disk issue.

Post by **DoubleHP** » 26. Nov 2016, 10:35

There is no "new" dmesg for the host, since it never reboots. It now has 22 d of uptime. Meanwhile, the guest has had an average uptime of 0.5 d over the last 4 weeks (munin).

Post by **Perryg** » 26. Nov 2016, 15:28

How many other guests do you run at the same time? Also is the host used for anything else?
Another thing I see which might be a problem is you are running the distro fork of VirtualBox and we do not support that since they can and do modify the source code.

Post by **DoubleHP** » 26. Nov 2016, 23:24

Perryg wrote: How many other guests do you run at the same time?

3 guests for now; maybe more if I can't fix this issue and have to spread services among dedicated hosts.

Perryg wrote: Also is the host used for anything else?

Other than what?

Perryg wrote: Another thing I see which might be a problem is you are running the distro fork of VirtualBox and we do not support that since they can and do modify the source code.

I found a bug in VB the first week I used it, and the bug was in upstream code. There is less than a 1% chance that my disk issue is related to Debian-specific code. The timer issue could have a very large variety of causes, but it also has a large variety of solutions ... So if you don't wanna help, I'll convert my partition into a classic virtual disk and we forget about bugs.

Post by **Perryg** » 26. Nov 2016, 23:36

So if you don't wanna help

I don't think I said anything about not wanting to help. All I said is that we do not support forks, and I gave you the reason why. So, back to the discussion. It could be that you are requesting way too much from your host. You are assigning all of the host's cores to the guest, which is usually a bad idea, since it leaves no resources for the host to do its thing, including the host side of the VBox code. If you run anything else besides the guest that is using all of the cores, you will have issues like the ones you are seeing, because of timing issues and disk I/O, which is also tied to the available resources. Anyway, I don't know what else to explain, so I will monitor the situation and wait for others to see if they can piece this all together.

Post by **DoubleHP** » 27. Nov 2016, 01:02

Perryg wrote: It could be that you are requesting way too much from your host. You are assigning all of the host's cores to the guest, which is usually a bad idea

Which is why I am doing that.

I have assigned ... all my cores with an execution cap of 85%.

And the crash times do not match the busy times of the machine at all; in fact, 50% of crashes happen during idle moments of the day, with more than 180% idle CPU (out of 400% total).

So the guest's crash times have no statistical relation to the host's load or idleness.

***

Copy done in 2mn, topic closed for now. Either the guest will keep crashing, and it will not be related to raw mode, or it will stop crashing due to the disk, and I won't mind anymore.

https://nfolamp.wordpress.com/2010/06/1 ... boxmanage/

Post by **socratis** » 27. Nov 2016, 01:15

<Off topic>
Do you mind adding an "i" between the "m" and the "n" when you're referring to minutes? It's not too much. Pet peeve... Thanks.
</Off topic>

virtualbox.org
