Frequent freezes when using a raw disk
Hello.
I wanted to try using a raw disk; the host is using sda2, and I wanted the guest to use sda3, so I declared a raw disk.
Once or twice a day, the guest freezes, and I am now certain it's related to the raw disk. I asked cron to touch a file every 5mn on every mount point the guest has; when a freeze happens, it can no longer write to its /, but it can still write to its NFS mount points. It's not a kernel panic: some processes stay alive for 3mn to 1h (apache, munin, and other network daemons can answer requests for some time).
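A check like the one described above could look roughly like this. It is only a sketch: the function name `heartbeat`, the marker file name `.heartbeat`, and the 10-second limit are assumptions, not what is actually running on the guest.

```shell
# Hypothetical heartbeat helper (names and the 10 s limit are assumptions).
# Tries to touch a marker file in every directory given as an argument and
# prints "<epoch> OK <dir>" or "<epoch> FAIL <dir>" for each one.
# Returns non-zero if at least one directory could not be written.
heartbeat() {
    hb_status=0
    # Use timeout(1) when available so a stalled write does not block the run
    # in most cases; a process stuck in uninterruptible I/O can still hang.
    if command -v timeout >/dev/null 2>&1; then
        hb_limit="timeout 10"
    else
        hb_limit=""
    fi
    for hb_dir in "$@"; do
        if $hb_limit touch "$hb_dir/.heartbeat" 2>/dev/null; then
            echo "$(date +%s) OK $hb_dir"
        else
            echo "$(date +%s) FAIL $hb_dir"
            hb_status=1
        fi
    done
    return "$hb_status"
}
```

Run from a script that cron starts every 5mn, for example as `heartbeat / /mnt/nfs >> /mnt/nfs/heartbeat.log` (placeholder paths), the last OK line for / gives the approximate time of the freeze.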
I have not been able to correlate the freezes with any service, daemon, or process. They happen at random times of the day.
Other guests work fine.
I have not changed the position of sda3 on the disk since installing the guest. I wonder whether there is a bug in VB, or what else could make the guest remount the partition RO. I have checked my disk 10 times and it's good (I mean, the physical layer is 100% good, checked). It's NOT a disk problem.
During the bug, no message appears on screen (RDP).
I need help to track down the bug and fix it.
I have also checked that the vmdk file points to the correct addresses.
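For anyone wanting to repeat that check: the descriptor of a raw-disk vmdk is a small text file, and its extent sizes can be summed to see at which guest sector each extent starts. A rough sketch, assuming extent lines of the form `RW <sectors> <type> "<target>" <offset>` and target paths without spaces; `vmdk_extents` is a made-up helper name.

```shell
# List the extents of a VMDK descriptor file (first argument).
# Output columns:
#   guest_start_sector size_in_sectors type target target_offset
# Missing target or offset (for example on ZERO extents) is shown as "-".
# Assumption: target paths contain no whitespace.
vmdk_extents() {
    awk '
        $1 ~ /^(RW|RDONLY|NOACCESS)$/ && $2 ~ /^[0-9]+$/ {
            gsub(/"/, "", $4)    # strip the quotes around the target name
            printf "%.0f %.0f %s %s %s\n", start, $2, $3,
                   ($4 == "" ? "-" : $4), ($5 == "" ? "-" : $5)
            start += $2          # next extent begins where this one ends
        }
    ' "$1"
}
```

For an extent mapped onto the whole disk device, the first column (start sector as seen by the guest) should normally equal the last one (offset on the target), and both can be compared with the start of sda3 reported by `fdisk -l`.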
I am running out of ideas on how to track this; all I can state with 100% certainty is that it's not a disk failure. I have not yet done the destructive test, and I will do it only after triple-checking all other possibilities.
- How can I duplicate the kernel logs to my NFS mount point? Maybe ask syslog to write logs to two files?
- Make a list of disk operations?
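On the first question, one low-tech option is to follow the kernel ring buffer and append each line to two files at once. A minimal sketch (`mirror_lines` and the paths in the usage comment are placeholders); a second `kern.*` rule in the syslog configuration would be the more usual route.

```shell
# Append every line read from stdin to each file named as an argument.
# Hypothetical usage (paths are placeholders):
#   dmesg --follow | mirror_lines /var/log/kern-copy.log /mnt/nfs/guest3-kern.log
mirror_lines() {
    tee -a "$@" >/dev/null
}
```

If a write to the local file fails once / has gone read-only, tee reports the error and keeps writing to the other file; a write that blocks instead of failing would stall both copies.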
Once in my life, I found an email in an MBR on a very fresh install; it was a classic installation, and I have never understood how an email header could end up in my MBR ... I mean, various bugs can cause unimaginably strange situations. I could be facing a similar issue where, for some reason, my guest is trying to write outside of sda3.
Does VB keep track of disk requests?
-
- Site Moderator
- Posts: 34369
- Joined: 6. Sep 2008, 22:55
- Primary OS: Linux other
- VBox Version: OSE self-compiled
- Guest OSses: *NIX
Re: Frequent freezes when using a raw disk
I would start with the host if it were me. Things like a green drive that powers down when idle, power settings that could suspend the drive when it is not in use, etc.
Re: Frequent freezes when using a raw disk
1: it's a true SSD, so no default setting would ever power it down
2: it's very busy
3: it has never gone into sleep mode in the last 3 years.
-
- Site Moderator
- Posts: 34369
- Joined: 6. Sep 2008, 22:55
- Primary OS: Linux other
- VBox Version: OSE self-compiled
- Guest OSses: *NIX
Re: Frequent freezes when using a raw disk
I would need to see the guest's log file after a freeze, as well as dmesg from the host after the same freeze, to see if they show the actual error.
Re: Frequent freezes when using a raw disk
Host: the lockd errors are from a time when I was using NFS; NFS is now abandoned in favour of sshfs.
On host, I have mounted sda3 RO to try to debug, but it started before I mounted it.
The USB UPS messages are about guest 4; it is also freezing, but more rarely (less than once a week, and probably due to timer bugs; we will dig into this later). Both guests restarted a few hours ago. In the end, there is no line left about disk issues.
(I am too new here to be allowed to post links; replace the spaces with dots and slashes.)
Guest 3: the last crash seems related to the timer:
slexy org view s20I6iOBQZ
The crash detector took 6h to make it reboot.
So let me also paste the two previous logs, because one of them is related to the disk:
slexy org view s2fmF0MW6P
slexy org view s21KsUfLP8
The crash detector took about half an hour to trigger the reboot.
What happens in the last 3 mn of the logs is irrelevant; my host has a hard job trying to detect guest crashes and restart the guests as nicely as possible. Crashes are so frequent that I had to automate the process.
***
If I don't install NTP on all guests, some of their clocks drift a bit. Installing it seems to make VB misbehave a bit more often. It may depend on the guest's kernel version, and on whether the VB drivers are installed.
Code: Select all
[279665.065880] lockd: cannot monitor leon-03
[279676.692376] lockd: cannot monitor leon-03
[279688.784214] lockd: cannot monitor leon-03
[279908.345565] HPET: Using timer above configured range: 3
[279908.677390] usb 1-2.3: reset low-speed USB device number 4 using xhci_hcd
[279911.289201] hid-generic 0003:0463:FFFF.0003: hiddev0,hidraw0: USB HID v1.10 Device [MGE UPS SYSTEMS PULSAR] on usb-0000:00:14.0-2.3/input0
[279918.321043] HPET: Using timer above configured range: 3
[279918.321061] HPET: Using timer above configured range: 3
[279918.903495] HPET: Using timer above configured range: 3
[280001.014276] lockd: cannot monitor leon-03
[280016.855961] lockd: cannot monitor leon-03
[280027.052183] lockd: cannot monitor leon-03
[280037.224525] lockd: cannot monitor leon-03
[280047.416920] lockd: cannot monitor leon-03
[280057.569842] lockd: cannot monitor leon-03
[280067.730280] lockd: cannot monitor leon-03
[280077.866811] lockd: cannot monitor leon-03
[280088.010343] lockd: cannot monitor leon-03
[280098.155184] lockd: cannot monitor leon-03
[280108.331571] lockd: cannot monitor leon-03
[280118.483529] lockd: cannot monitor leon-03
[281642.596866] lockd: cannot monitor leon-03
[282418.975235] HPET: Using timer above configured range: 3
[282419.274487] usb 1-2.3: reset low-speed USB device number 4 using xhci_hcd
[282421.902358] hid-generic 0003:0463:FFFF.0004: hiddev0,hidraw0: USB HID v1.10 Device [MGE UPS SYSTEMS PULSAR] on usb-0000:00:14.0-2.3/input0
[282429.773071] HPET: Using timer above configured range: 3
[282429.773089] HPET: Using timer above configured range: 3
[282430.375396] HPET: Using timer above configured range: 3
[282845.369816] nfsd: peername failed (err 107)!
[283932.405755] nfsd: peername failed (err 107)!
[283985.350814] nfsd: peername failed (err 107)!
[284024.690382] nfsd: peername failed (err 107)!
[303673.569681] VBoxNetFlt: Failed to allocate packet buffer, dropping the packet.
[420219.328766] HPET: Using timer above configured range: 3
[421213.488445] HPET: Using timer above configured range: 3
[935603.280633] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[935603.284079] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1022153.534123] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1022153.536542] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1108680.516130] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1108680.517685] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1195221.432277] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1195221.435449] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1281767.431249] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1281767.433466] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1303919.697878] usb 1-2.3: reset low-speed USB device number 4 using xhci_hcd
[1303922.334760] hid-generic 0003:0463:FFFF.0005: hiddev0,hidraw0: USB HID v1.10 Device [MGE UPS SYSTEMS PULSAR] on usb-0000:00:14.0-2.3/input0
[1303952.481379] HPET: Using timer above configured range: 3
[1368299.550769] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1368299.551967] EXT4-fs error (device sda3): ext4_dx_find_entry:1524: inode #131418: block 4: comm updatedb.mlocat: Directory hole found
[1941135.127698] usb 1-2.3: reset low-speed USB device number 4 using xhci_hcd
[1941137.744113] hid-generic 0003:0463:FFFF.0006: hiddev0,hidraw0: USB HID v1.10 Device [MGE UPS SYSTEMS PULSAR] on usb-0000:00:14.0-2.3/input0
[1941150.069046] HPET: Using timer above configured range: 3
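One way to see whether these errors line up with guest reboots or recur on a schedule is to print the time between consecutive `EXT4-fs error` lines. A small sketch, assuming dmesg lines that start with a `[seconds.micros]` timestamp as above; `ext4_error_gaps` is a made-up name.

```shell
# Read dmesg output on stdin and print, in whole seconds, the time elapsed
# between each "EXT4-fs error" line and the previous one.
# Hypothetical usage:
#   dmesg | ext4_error_gaps
ext4_error_gaps() {
    awk '
        /EXT4-fs error/ {
            t = $0
            sub(/^\[ */, "", t)   # drop the leading "[" and any padding
            sub(/\].*/, "", t)    # drop everything after the timestamp
            if (seen) printf "%.0f\n", t - prev
            prev = t
            seen = 1
        }
    '
}
```

On the excerpt above, the gaps alternate between 0 (the two lines logged together) and values of roughly 86500 s, which is about one day.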
Code: Select all
# mount | grep sda3
/dev/sda3 on /mnt/sda3 type ext4 (ro,relatime,data=ordered)
-
- Site Moderator
- Posts: 34369
- Joined: 6. Sep 2008, 22:55
- Primary OS: Linux other
- VBox Version: OSE self-compiled
- Guest OSses: *NIX
Re: Frequent freezes when using a raw disk
I would prefer a new dmesg and log after a freeze, from a run where the partition has not been mounted RO, but I do see something that you need to address.
"EXT4-fs error (device sda3): ext4_dx_find_entry inode #131418: block 4: comm updatedb.mlocat: Directory hole found" usually indicates a hardware issue. You should perform a full fsck and/or a SMART test and see if they indicate anything. When posting log files, use the attach feature and upload them here. Compress them if they are too large, but I really need them here.
Re: Frequent freezes when using a raw disk
This forum really hates me. Yesterday I could not insert URLs because I am too new; now I can't attach 4 files at the same time. I hope I can attach the 4th one in a new message. Anyway, have you seen that I edited my message and inserted disguised links?
And you did not understand my explanation: the ext4 messages on the host are due to the fact that the host has mounted sda3 RO. They occur at the precise moment when the guest reboots and runs its RW fsck on the partition. Because of caching, and the structure of ext4 and of the Linux drivers, the host is not aware that the guest has modified the filesystem (and it does; I see it in the boot logs over RDP; the crashes are so hard and so frequent that at each reboot the guest has several errors to fix). But if I umount it from the host, shut down the guest, and perform several consecutive cold fsck runs from the host, the partition becomes sane. The errors are due to the fact that ext* filesystems are not designed to be mounted by several kernels at the same time.
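Since a partition handed to a guest as a raw disk should not be mounted on the host at the same time, a start-up wrapper could refuse to launch the guest while the host still has it mounted. A sketch under assumptions: `is_mounted` is a made-up helper, `guest3` is a placeholder VM name, and the check only matches the exact device path listed in the mount table.

```shell
# Succeeds (exit 0) if the block device given as first argument appears as a
# mount source. The mount table defaults to /proc/mounts; an optional second
# argument points at another file in the same format.
# Hypothetical usage ("guest3" is a placeholder VM name):
#   if is_mounted /dev/sda3; then
#       echo "sda3 is mounted on the host, not starting the guest" >&2
#   else
#       VBoxManage startvm "guest3" --type headless
#   fi
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```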
A full bad-block test would require shutting down the guest for a day and making a full copy of the partition. I could do it, but I am certain it would not find any error. I will do it before the end of the year if no other explanation can be found.
SMART is good
Code: Select all
SMART overall-health self-assessment test result: PASSED
[...]
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0002 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0002 100 100 000 Old_age Always - 30623
12 Power_Cycle_Count 0x0002 100 100 000 Old_age Always - 137
171 Program_Fail_Count 0x0002 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0002 100 100 000 Old_age Always - 0
173 Avg_Write/Erase_Count 0x0002 100 100 000 Old_age Always - 578
174 Unexpect_Power_Loss_Ct 0x0002 100 100 000 Old_age Always - 106
187 Reported_Uncorrect 0x0002 100 100 000 Old_age Always - 0
230 Perc_Write/Erase_Count 0x0002 100 100 000 Old_age Always - 1926
232 Perc_Avail_Resrvd_Space 0x0003 100 100 005 Pre-fail Always - 0
234 Perc_Write/Erase_Ct_BC 0x0002 100 100 000 Old_age Always - 2636
241 Total_LBAs_Written 0x0002 100 100 000 Old_age Always - 18029031516
242 Total_LBAs_Read 0x0002 100 100 000 Old_age Always - 5938804581
[...]
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Aborted by host 90% 30622 -
# 2 Short offline Aborted by host 90% 29519 -
# 3 Short offline Aborted by host 90% 29519 -
# 4 Short offline Completed without error 00% 6763 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
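To keep an eye on the counters that usually matter for media errors, the attribute table can be filtered for non-zero raw values. A rough sketch, assuming `smartctl -A` output in the column layout shown above; the attribute list and the name `smart_failures` are choices made for this example.

```shell
# Read a SMART attribute table on stdin and print "<attribute> <raw value>"
# for a few failure counters whose raw value is not zero.
# Returns 1 if at least one such counter was found, 0 otherwise.
# Hypothetical usage:
#   smartctl -A /dev/sda | smart_failures
smart_failures() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Program_Fail_Count|Erase_Fail_Count|Reported_Uncorrect)$/ && $NF + 0 != 0 {
            print $2, $NF
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

For the table above it prints nothing, since all four counters have a raw value of 0.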
- Attachments
-
- VBox.log.2.gz
- (28.78 KiB) Downloaded 8 times
-
- VBox.log.1.gz
- (33.58 KiB) Downloaded 10 times
-
- VBox.log.gz
- (24.71 KiB) Downloaded 11 times
Re: Frequent freezes when using a raw disk
I could not attach the 4th file to the previous message and was about to get angry, but now I can. This is the log that should show the disk issues.
VBox.log.gz should be the current log, with no crash (probably useless to attach it).
VBox.log.1.gz or VBox.log.2.gz (or both) may be about a TIMER issue.
VBox.log.3.gz should be about the disk issue.
- Attachments
-
- VBox.log.3.gz
- (62.25 KiB) Downloaded 9 times
Re: Frequent freezes when using a raw disk
There is no "new" dmesg for the host, since it never reboots. It now has 22d of uptime. Meanwhile, the guest has had an average uptime of 0.5d over the last 4 weeks (munin).
- Attachments
-
- dmesg.gz
- (20.11 KiB) Downloaded 10 times
-
- Site Moderator
- Posts: 34369
- Joined: 6. Sep 2008, 22:55
- Primary OS: Linux other
- VBox Version: OSE self-compiled
- Guest OSses: *NIX
Re: Frequent freezes when using a raw disk
How many other guests do you run at the same time? Also, is the host used for anything else?
Another thing I see which might be a problem is you are running the distro fork of VirtualBox and we do not support that since they can and do modify the source code.
Re: Frequent freezes when using a raw disk
Perryg wrote: How many other guests do you run at the same time?
3 guests for now; maybe more if I can't fix this issue and have to spread services among dedicated hosts.
Perryg wrote: Also is the host used for anything else?
Other than what?
Perryg wrote: Another thing I see which might be a problem is you are running the distro fork of VirtualBox and we do not support that since they can and do modify the source code.
I found a bug in VB the first week I used it, and the bug was in upstream code. There is less than a 1% chance that my disk issue is related to Debian-specific code. The timer issue could have a very large variety of causes, but it also has a large variety of solutions ... So if you don't wanna help, I will convert my partition into a classic virtual disk and we forget about the bugs.
-
- Site Moderator
- Posts: 34369
- Joined: 6. Sep 2008, 22:55
- Primary OS: Linux other
- VBox Version: OSE self-compiled
- Guest OSses: *NIX
Re: Frequent freezes when using a raw disk
Quote: So if you don't wanna help
I don't think I said anything about not wanting to help. All I said is that we do not support forks, and I gave you the reason why. So, back to the discussion. It could be that you are requesting way too much from your host. You are assigning all of the host's cores to the guest, which is usually a bad idea since it does not leave any resources for the host to do its thing, including the host side of the VBox code. If you run anything else besides the guest that is using all of the cores, you will have issues like the ones you are seeing, because of timing issues and disk I/O, which is also tied to the available resources. Anyway, I don't know what else to explain, so I will monitor the situation and wait for others to see if they can piece this all out.
Re: Frequent freezes when using a raw disk
Perryg wrote: It could be that you are requesting way too much from your host. You are assigning all of the hosts cores to the guest which is usually a bad idea
Which is why I am doing that.
I have assigned ... all my cores with an execution cap of 85%.
And crash times do not match the busy times of the machine at all; in fact, 50% of crashes happen during idle moments of the day, with more than 180% idle CPU (out of 400% total).
So the guest's crash times have no statistical relation to the load or idleness of the host.
***
Copy done in 2mn, topic closed for now. Either the guest will keep crashing, and then it is not related to raw mode, or the disk-related crashes will stop, and I won't mind anymore.
https://nfolamp.wordpress.com/2010/06/1 ... boxmanage/
-
- Site Moderator
- Posts: 27329
- Joined: 22. Oct 2010, 11:03
- Primary OS: Mac OS X other
- VBox Version: PUEL
- Guest OSses: Win(*>98), Linux*, OSX>10.5
- Location: Greece
Re: Frequent freezes when using a raw disk
<Off topic>
Do you mind adding an "i" between the "m" and the "n" when you're referring to minutes? It's not too much. Pet peeve... Thanks.
</Off topic>