Reviving this thread after having to set it aside for other priorities ...
While running said vm (often with a second vm also up and running), everything on the vm and the host seems to grind to a halt. This doesn't start immediately, but only after the vm has been up for a while and I've done some fairly intensive work on it: moving data files within an Oracle database, taking database backups with RMAN, etc.

But when it hits, everything goes south. Most other apps on the host (Outlook, Word, Firefox) start going into the "Not responding" state, and things like Notepad and PuTTY are very slow to respond to even a keystroke. Even launching Task Manager can take a couple of minutes (yes, minutes) to complete, and Resource Monitor takes even longer to come up. Once they're up, they don't show any particular stress on memory or CPU; "System Idle" still shows over 90% of the CPU. But a VBox task will be completely saturating the disk - with anywhere from 500 thousand to over 10 million total bytes/sec being reported. Eventually the vm spews out this (recovered on the next restart from /var/log/messages.1):
Code:
May 22 08:19:55 vbdwprod avahi-daemon[3266]: Server startup complete. Host name is vbdwprod.local. Local service cookie is 3055222426.
May 22 09:42:51 vbdwprod kernel: INFO: task syslogd:2446 blocked for more than 120 seconds.
May 22 09:43:21 vbdwprod kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 22 09:43:21 vbdwprod kernel: syslogd D ffff8100261413d8 0 2446 1 2449 2412 (NOTLB)
May 22 09:43:21 vbdwprod kernel: ffff81006444dd88 0000000000000082 ffff8100199f5e50 ffffffff88032ca2
May 22 09:43:21 vbdwprod kernel: 0000000000001000 0000000000000009 ffff81006e13c080 ffff8100490d2860
May 22 09:43:21 vbdwprod kernel: 0000046eae4222ac 0000000000003515 ffff81006e13c268 00000000261413d8
May 22 09:43:21 vbdwprod kernel: Call Trace:
May 22 09:43:21 vbdwprod kernel: [<ffffffff88032ca2>] :jbd:journal_dirty_data+0x1fa/0x205
May 22 09:43:21 vbdwprod kernel: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
May 22 09:43:21 vbdwprod kernel: [<ffffffff800a28bf>] autoremove_wake_function+0x0/0x2e
May 22 09:43:21 vbdwprod kernel: [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
May 22 09:43:21 vbdwprod kernel: [<ffffffff8002fcdc>] __writeback_single_inode+0x1d9/0x318
May 22 09:43:21 vbdwprod kernel: [<ffffffff800e2c72>] do_readv_writev+0x26e/0x291
May 22 09:46:24 vbdwprod kernel: [<ffffffff800f5a4d>] sync_inode+0x24/0x33
May 22 09:49:54 vbdwprod kernel: [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8
May 22 09:49:59 vbdwprod kernel: [<ffffffff8005030e>] do_fsync+0x52/0xa4
May 22 09:49:59 vbdwprod kernel: [<ffffffff800e34f7>] __do_fsync+0x23/0x36
May 22 09:50:16 vbdwprod kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
May 22 09:54:22 vbdwprod kernel:
May 22 10:00:22 vbdwprod kernel: sd 2:0:0:0: SCSI error: return code = 0x06000000
May 22 10:02:10 vbdwprod kernel: end_request: I/O error, dev sdc, sector 2438208
May 22 10:02:10 vbdwprod kernel: INFO: task syslogd:2446 blocked for more than 120 seconds.
May 22 10:02:13 vbdwprod kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 22 10:02:13 vbdwprod kernel: syslogd D ffff8100261413d8 0 2446 1 2449 2412 (NOTLB)
May 22 10:02:13 vbdwprod kernel: ffff81006444dd88 0000000000000082 ffff81002c779430 ffffffff88032ca2
May 22 10:02:13 vbdwprod kernel: 0000000000001000 0000000000000009 ffff81006e13c080 ffff81000826a080
May 22 10:02:13 vbdwprod kernel: 000004c8359f3ce6 0000000000017f9b ffff81006e13c268 00000000261413d8
May 22 10:02:16 vbdwprod kernel: Call Trace:
May 22 10:02:16 vbdwprod kernel: [<ffffffff88032ca2>] :jbd:journal_dirty_data+0x1fa/0x205
May 22 10:02:16 vbdwprod kernel: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
May 22 10:02:16 vbdwprod kernel: [<ffffffff800a28bf>] autoremove_wake_function+0x0/0x2e
May 22 10:02:16 vbdwprod kernel: [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
May 22 10:02:17 vbdwprod kernel: [<ffffffff8002fcdc>] __writeback_single_inode+0x1d9/0x318
May 22 10:02:17 vbdwprod kernel: [<ffffffff800e2c72>] do_readv_writev+0x26e/0x291
May 22 10:02:17 vbdwprod kernel: [<ffffffff800f5a4d>] sync_inode+0x24/0x33
May 22 10:02:18 vbdwprod kernel: [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8
May 22 10:02:24 vbdwprod kernel: [<ffffffff8005030e>] do_fsync+0x52/0xa4
May 22 10:02:24 vbdwprod kernel: [<ffffffff800e34f7>] __do_fsync+0x23/0x36
May 22 10:02:24 vbdwprod kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
May 22 10:02:24 vbdwprod kernel:
At this point, even issuing an ACPI shutdown from the vm's window menu is painfully slow: an operation that normally completes in less than a minute takes 10 minutes or more. But once the vm is down, all other operations resume. No more "Not responding", etc.
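Next time it starts to degrade, I'll try to capture disk I/O stats from inside the guest before it locks up completely. A minimal capture, assuming the sysstat package is installed in the guest (the log paths are just my choice):

Code:
# run inside the guest while the slowdown is building:
# extended per-device stats every 5 seconds, written somewhere
# that should survive the next reboot
iostat -dxk 5 >> /var/tmp/io-degrade.log &
# memory/swap/io summary alongside it
vmstat 5 >> /var/tmp/vmstat-degrade.log &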
Does this look like some sort of "disk" corruption, as earlier discussed?
If so, would this be logical corruption of the data in the disk file - not seen as corruption by the host?
Or would it be "physical" corruption that could be detected as such by the host?
In either case, would/could the act of zipping the files for a periodic backup involve some internal error checking that would effectively correct it? Would cloning the vm possibly do the same?
If there is corruption and the internal error checking of creating a zip does not correct it, then my backups are pretty much worthless, as they would also have the corruption.
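On the zip question, the only check I can think of is verifying that an archive round-trip reproduces the image byte-for-byte. A sketch with Unix-style tools and hypothetical file names (on a Windows host, fc /b would stand in for cmp); note this only proves the backup matches the current on-disk file, not that the file itself is internally sound:

Code:
# with the vm shut down, archive the disk image
zip backup.zip vbdwprod.vdi
# zip stores a CRC-32 per archived file; -t re-reads the archive
# and verifies each member against it, so this catches corruption
# of the archive itself
unzip -t backup.zip
# a test extraction plus byte-for-byte compare proves the round
# trip is lossless; it cannot detect corruption that was already
# inside the .vdi before zipping
unzip -o backup.zip -d /tmp/ziptest
cmp vbdwprod.vdi /tmp/ziptest/vbdwprod.vdi && echo "round trip OK"

In other words, as I understand it, zip's CRC check would flag corruption introduced into the archive after the fact, but if the .vdi was already logically corrupt when zipped, the backup would faithfully preserve that corruption.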