Reviving this thread after having to set it aside for other priorities ...
While running said vm (often with a second vm also up and running), everything on the vm and the host seems to grind to a halt. This doesn't start immediately, but only after the vm has been up for a while and I've done some fairly intensive work on it: moving data files within an Oracle database, taking database backups with RMAN, etc.

But when it hits, everything goes south. Most other apps on the host (Outlook, Word, Firefox) start going into the "Not responding" state, and things like Notepad and PuTTY are very slow to respond to even a keystroke. Even launching Task Manager can take a couple of minutes (yes, minutes) to complete, and Resource Monitor takes even longer to come up. Once they're up, they don't show any particular stress on memory or CPU; "System Idle" still shows over 90% of the CPU. But a VBox task will be completely saturating the disk - with anywhere from 500 thousand to over 10 million total bytes/sec being reported. Eventually the vm spews out this (recovered on the next restart from /var/log/messages.1):
Code:
May 22 08:19:55 vbdwprod avahi-daemon[3266]: Server startup complete. Host name is vbdwprod.local. Local service cookie is 3055222426.
May 22 09:42:51 vbdwprod kernel: INFO: task syslogd:2446 blocked for more than 120 seconds.
May 22 09:43:21 vbdwprod kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 22 09:43:21 vbdwprod kernel: syslogd D ffff8100261413d8 0 2446 1 2449 2412 (NOTLB)
May 22 09:43:21 vbdwprod kernel: ffff81006444dd88 0000000000000082 ffff8100199f5e50 ffffffff88032ca2
May 22 09:43:21 vbdwprod kernel: 0000000000001000 0000000000000009 ffff81006e13c080 ffff8100490d2860
May 22 09:43:21 vbdwprod kernel: 0000046eae4222ac 0000000000003515 ffff81006e13c268 00000000261413d8
May 22 09:43:21 vbdwprod kernel: Call Trace:
May 22 09:43:21 vbdwprod kernel: [<ffffffff88032ca2>] :jbd:journal_dirty_data+0x1fa/0x205
May 22 09:43:21 vbdwprod kernel: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
May 22 09:43:21 vbdwprod kernel: [<ffffffff800a28bf>] autoremove_wake_function+0x0/0x2e
May 22 09:43:21 vbdwprod kernel: [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
May 22 09:43:21 vbdwprod kernel: [<ffffffff8002fcdc>] __writeback_single_inode+0x1d9/0x318
May 22 09:43:21 vbdwprod kernel: [<ffffffff800e2c72>] do_readv_writev+0x26e/0x291
May 22 09:46:24 vbdwprod kernel: [<ffffffff800f5a4d>] sync_inode+0x24/0x33
May 22 09:49:54 vbdwprod kernel: [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8
May 22 09:49:59 vbdwprod kernel: [<ffffffff8005030e>] do_fsync+0x52/0xa4
May 22 09:49:59 vbdwprod kernel: [<ffffffff800e34f7>] __do_fsync+0x23/0x36
May 22 09:50:16 vbdwprod kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
May 22 09:54:22 vbdwprod kernel:
May 22 10:00:22 vbdwprod kernel: sd 2:0:0:0: SCSI error: return code = 0x06000000
May 22 10:02:10 vbdwprod kernel: end_request: I/O error, dev sdc, sector 2438208
May 22 10:02:10 vbdwprod kernel: INFO: task syslogd:2446 blocked for more than 120 seconds.
May 22 10:02:13 vbdwprod kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 22 10:02:13 vbdwprod kernel: syslogd D ffff8100261413d8 0 2446 1 2449 2412 (NOTLB)
May 22 10:02:13 vbdwprod kernel: ffff81006444dd88 0000000000000082 ffff81002c779430 ffffffff88032ca2
May 22 10:02:13 vbdwprod kernel: 0000000000001000 0000000000000009 ffff81006e13c080 ffff81000826a080
May 22 10:02:13 vbdwprod kernel: 000004c8359f3ce6 0000000000017f9b ffff81006e13c268 00000000261413d8
May 22 10:02:16 vbdwprod kernel: Call Trace:
May 22 10:02:16 vbdwprod kernel: [<ffffffff88032ca2>] :jbd:journal_dirty_data+0x1fa/0x205
May 22 10:02:16 vbdwprod kernel: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
May 22 10:02:16 vbdwprod kernel: [<ffffffff800a28bf>] autoremove_wake_function+0x0/0x2e
May 22 10:02:16 vbdwprod kernel: [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
May 22 10:02:17 vbdwprod kernel: [<ffffffff8002fcdc>] __writeback_single_inode+0x1d9/0x318
May 22 10:02:17 vbdwprod kernel: [<ffffffff800e2c72>] do_readv_writev+0x26e/0x291
May 22 10:02:17 vbdwprod kernel: [<ffffffff800f5a4d>] sync_inode+0x24/0x33
May 22 10:02:18 vbdwprod kernel: [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8
May 22 10:02:24 vbdwprod kernel: [<ffffffff8005030e>] do_fsync+0x52/0xa4
May 22 10:02:24 vbdwprod kernel: [<ffffffff800e34f7>] __do_fsync+0x23/0x36
May 22 10:02:24 vbdwprod kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
May 22 10:02:24 vbdwprod kernel:
At this point, even issuing an ACPI shutdown from the vm's window menu is painfully slow: an operation that normally completes in less than a minute takes 10 minutes or more. But once the vm is down, all other operations resume. No more "Not responding", etc.
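Next time it starts to degrade, I'll try to capture disk I/O stats from inside the guest before it locks up completely. A minimal capture, assuming the sysstat package is installed in the guest (the log paths are just my choice):

Code:
# run inside the guest while the slowdown is building:
# extended per-device stats every 5 seconds, written somewhere
# that should survive the next reboot
iostat -dxk 5 >> /var/tmp/io-degrade.log &
# memory/swap/io summary alongside it
vmstat 5 >> /var/tmp/vmstat-degrade.log &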
Does this look like some sort of "disk" corruption, as earlier discussed?
If so, would this be logical corruption of the data in the disk file - not seen as corruption by the host?
Or would it be "physical" corruption that could be detected as such by the host?
In either case, would/could the act of zipping the files for a periodic backup involve some internal error checking that would effectively correct it? Would cloning the vm possibly do the same?
If there is corruption and the internal error checking of creating a zip does not correct it, then my backups are pretty much worthless, as they would also have the corruption.
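On the zip question, the only check I can think of is verifying that an archive round-trip reproduces the image byte-for-byte. A sketch with Unix-style tools and hypothetical file names (on a Windows host, fc /b would stand in for cmp); note this only proves the backup matches the current on-disk file, not that the file itself is internally sound:

Code:
# with the vm shut down, archive the disk image
zip backup.zip vbdwprod.vdi
# zip stores a CRC-32 per archived file; -t re-reads the archive
# and verifies each member against it, so this catches corruption
# of the archive itself
unzip -t backup.zip
# a test extraction plus byte-for-byte compare proves the round
# trip is lossless; it cannot detect corruption that was already
# inside the .vdi before zipping
unzip -o backup.zip -d /tmp/ziptest
cmp vbdwprod.vdi /tmp/ziptest/vbdwprod.vdi && echo "round trip OK"

In other words, as I understand it, zip's CRC check would flag corruption introduced into the archive after the fact, but if the .vdi was already logically corrupt when zipped, the backup would faithfully preserve that corruption.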