Save disk space during backups?

Cubytus32
Posts: 40
Joined: 28. Aug 2013, 03:41

Save disk space during backups?

Post by Cubytus32 »

Hello all,

I have multiple VMs running off a RAID0 external HDD. As such, the risk of drive failure is increased, and I want to make regular backups of all the machines. Almost all the virtual HDDs are in VDI format. However, with the current setup (mostly VDI virtual HDDs), the rsync-based backup takes a great deal of space on the backup server.

I would like to know what would be the way to automatically register changes in the VMs in the smallest possible differencing image.

The documentation isn't THAT clear, and abundantly mentions snapshots as a way to create differencing images, but I'd like this to be automated. The doc also mentions Immutable images, but the same issue would appear as the differencing image grows larger.

Would VMDK disk format with 2G strip work? Or, how can I achieve a low file size without hindering performance too much?
socratis
Site Moderator
Posts: 27329
Joined: 22. Oct 2010, 11:03
Primary OS: Mac OS X other
VBox Version: PUEL
Guest OSses: Win(*>98), Linux*, OSX>10.5
Location: Greece

Re: Save disk space during backups?

Post by socratis »

'rsync' works at the file level. A VDI is a single file that represents a guest's hard drive at the sector level, not at the file level. So when something changes in the guest [1], pretty much the whole file has to be backed up. Switching to split VMDK might save you a couple of "splits", but that's not certain either.

Don't even think about snapshots. They are not your usual snapshots, they depend on each other as a chain [2]. Same for immutable with differencing images (actually it's the same idea).

How about running 'rsync' from within the guest? You can have your cake and eat it too.

[1]: Actually, nothing initiated by the user has to change in the guest. The mere fact that it is running (system logs, temp files, etc.) will modify sectors, even if there is no file modification per se.

[2]: Read the following nice explanation about differencing disks and snapshots (which are based on the concept of differencing disks) and you'll pretty easily figure out why they can be really bad.
ChipMcK, in a recent post (https://forums.virtualbox.org/posting.php?mode=quote&f=1&p=276859#pr276859), wrote: When a virtual disk is first created for a new virtual machine, it is considered as the base disk for the guest - data for the guest is read from and written to that disk image.

The differencing disk records changes sector-by-sector to the whole disk image, not changes to any file in the disk. VirtualBox does not know what file system is employed on the disk image and therefore can not access any individual file of/on the disk image; only the guest OS is aware of that information.

First SnapShot creates a differencing disk for read/write access while the base disk becomes read-only - as the guest modifies its data, the data is written to the differencing disk and the base disk is untouched.

Second SnapShot creates another, new, differencing disk for read/write access while the first differencing disk becomes read-only along with the base disk.

Subsequent SnapShots create additional differencing disks, with the preceding differencing disk joining the hierarchy (pecking order/chain) of read-only disks.

Keep in mind that access to/from the virtual disks is sector-by-sector, not file-by-file.

When the guest requests that a sector be read, the latest SnapShot is read first. If the sector is not found there (Sector-Not-Found is returned), the next SnapShot in the chain is tried (youngest to oldest), and so on until the base virtual disk is reached. Then the sector on/in the base virtual disk is either read or Sector-Not-Found is returned.
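
To make the chain behaviour described in the quote concrete, here is a minimal Python sketch. It is only a toy model: the names DiskImage, DiskChain and take_snapshot are made up for illustration and are not VirtualBox APIs, and a real image tracks allocated blocks on disk rather than a Python dict.

```python
from typing import Dict, List, Optional


class DiskImage:
    """One image in the chain: a sparse map of sector number -> data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sectors: Dict[int, bytes] = {}
        self.read_only = False

    def write(self, sector: int, data: bytes) -> None:
        if self.read_only:
            raise PermissionError(f"{self.name} is read-only")
        self.sectors[sector] = data


class DiskChain:
    """A base disk followed by zero or more differencing disks (hypothetical model)."""

    def __init__(self) -> None:
        self.images: List[DiskImage] = [DiskImage("base")]

    def take_snapshot(self) -> None:
        # The current top of the chain is frozen; a new differencing
        # disk receives all further writes.
        self.images[-1].read_only = True
        self.images.append(DiskImage(f"diff{len(self.images)}"))

    def write(self, sector: int, data: bytes) -> None:
        # Writes always land in the youngest image.
        self.images[-1].write(sector, data)

    def read(self, sector: int) -> Optional[bytes]:
        # Walk the chain youngest to oldest; None stands in for Sector-Not-Found.
        for image in reversed(self.images):
            if sector in image.sectors:
                return image.sectors[sector]
        return None
```

The point for backups is that every image in the chain is needed to read the disk: lose or skip one and everything younger than it is useless.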
Do NOT send me Personal Messages (PMs) for troubleshooting, they are simply deleted.
Do NOT reply with the "QUOTE" button, please use the "POST REPLY", at the bottom of the form.
If you obfuscate any information requested, I will obfuscate my response. These are virtual UUIDs, not real ones.
Cubytus32

Re: Save disk space during backups?

Post by Cubytus32 »

That is indeed the issue, I don't want the full VDI to be backed up each time.

Running rsync in each guest would be great if I had the power to run the 8-10 VMs at the same time, but I don't, and some of them don't even have native rsync capability. Moreover, I don't want to depend on the guest supporting rsync, and I typically do the backup at night, unattended.

With fragmentation, I guess the "splits" would grow unnecessarily large, but still smaller than backing up a full HDD. Would there be another solution?
socratis

Re: Save disk space during backups?

Post by socratis »

Not unless you do it from within the guest.
Cubytus32

Re: Save disk space during backups?

Post by Cubytus32 »

The thing is, most VMs are already backed up with iDrive. Still, I'd like to have a complete backup of all virtual drives, snapshots and settings in case the external drive goes bad.

If using stripes, once a stripe goes over 2GB, does it stay unchanged while new changes are recorded in a new snapshot?

Would making the disk "immutable" + striped allow such space savings?
socratis

Re: Save disk space during backups?

Post by socratis »

By "stripe" I take it that you mean the split-variant form of VMDK. Just so that we are on the same page.

I don't know if it's going to be better, and frankly I don't think that anyone besides the hardcore software engineers who invented the hard drive write algorithms could tell you. Even that is stretching it. You'll never know which sector the OS is going to decide to write to. The only real solution would be to split the hard drive into block sizes that mirror your HD's internal structure and back up only the changed chunks.

I'd really hate to be there during the restore (knock on wood).
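
For what "back up only the changed chunks" could look like, here is a rough Python sketch. It assumes fixed-size blocks and a list of block hashes kept from the previous run; the function names and the 1 MiB default are my own choices for illustration, not anything VirtualBox or rsync provides.

```python
import hashlib
from typing import List


def block_hashes(path: str, block_size: int = 1024 * 1024) -> List[str]:
    """Return one SHA-256 hex digest per fixed-size block of the file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes


def changed_blocks(old: List[str], new: List[str]) -> List[int]:
    """Indexes of blocks in the new image that differ from, or did not exist in, the old one.

    Note: a file that shrank is not reported here; the caller would have
    to compare len(old) and len(new) separately.
    """
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]
```

Only the blocks listed by changed_blocks would need to be copied, but the restore then has to stitch a full image back together from the last full copy plus every set of changed blocks, which is exactly the part that deserves testing.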
mpack
Site Moderator
Posts: 39134
Joined: 4. Sep 2008, 17:09
Primary OS: MS Windows 10
VBox Version: PUEL
Guest OSses: Mostly XP

Re: Save disk space during backups?

Post by mpack »

The "split2g" variant is not about snapshots, it's only about breaking the host file into smaller chunks, i.e. so that the chunks can be stored on a FAT drive. Each chunk is as changeable as it would be as part of a humungous image, if the OS writes to that part of the disk image. The fact that the file is chunkified makes no difference to anything really, except at the user level (it opens the possibility of the user screwing things up by using the host to delete parts of a disk image, failing to copy all parts of a disk image when moving the VM, etc.).

It won't make any difference to the volume of data which needs to be backed up.
scottgus1
Site Moderator
Posts: 20945
Joined: 30. Dec 2009, 20:14
Primary OS: MS Windows 10
VBox Version: PUEL
Guest OSses: Windows, Linux

Re: Save disk space during backups?

Post by scottgus1 »

I do love talking about backups, wish I'd been in here at the start.

I rsync a 280GB vdi to two offsite backup locations each week. I used to try to rsync the entire file but almost inevitably there would be a network glitch or some such, and the rsync would have to start all over again. I never could figure out if there was a way to resume; all the info the rsync forums gave me to try to pick up where the previous run left off didn't work. So I went to uncompressed-zipping the vdi into 25GB chunks and rsyncing the zips. A glitch in the rsyncing only results in 25GB of data to try again instead of all 280GB. (Boy I used to hate it when it crashed at 97%.) I have a script which reassembles the zips and SHA256-hashes the restored file as a sort of remote file-compare.
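
A rough Python sketch of the split / reassemble / hash-compare idea, for anyone who wants to try it. It writes plain numbered chunk files rather than the uncompressed zips mentioned above, and the function names and the ".0000" suffix scheme are invented for illustration.

```python
import hashlib
import os
import shutil
from typing import List

BUF = 1024 * 1024  # read/write buffer size in bytes


def sha256_of(path: str) -> str:
    """SHA-256 of a file, read in BUF-sized pieces so large images fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for buf in iter(lambda: f.read(BUF), b""):
            h.update(buf)
    return h.hexdigest()


def split_file(path: str, out_dir: str, chunk_size: int) -> List[str]:
    """Split path into numbered chunk files of at most chunk_size bytes each."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            remaining = chunk_size
            buf = src.read(min(BUF, remaining))
            if not buf:
                break  # end of source file
            part = os.path.join(out_dir, f"{os.path.basename(path)}.{index:04d}")
            with open(part, "wb") as dst:
                while buf:
                    dst.write(buf)
                    remaining -= len(buf)
                    if remaining <= 0:
                        break  # this chunk is full
                    buf = src.read(min(BUF, remaining))
            parts.append(part)
            index += 1
    return parts


def join_chunks(parts: List[str], dest: str) -> None:
    """Concatenate the chunk files, in the given order, into dest."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out, BUF)
```

After a restore, comparing sha256_of(restored) with the hash recorded from the original file plays the role of the remote file-compare.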

The pertinent point of the above novel is that in each chunk I see several hundred MB of change each week. I suspect that splitting your vdi into split-vmdk would be like split-zipping my vdi and result in the same situation - a good portion of change will migrate throughout all the files. Of course it doesn't cost anything but time to try. Just clone the vdi to a split-vmdk and see what happens. But don't expect smaller changes in the backup just because the vdi is split.

Do keep in mind that host-initiated backups taken while the guest is running (the idrive?) leave the backup of the guest in a "crash-consistent" just-pulled-the-plug state, since the guest didn't shut down properly before the backup was taken, and the guest didn't know it was being backed up, so no shadow-copy-like processes prevented backing up files that were still being written to. Restoring might be harder, especially if restored to a different host.

Also remember that the backup needs to be restorable. Be sure to test the restore for any backup routine to see if it actually works. (Consider Socratis' "I'd really hate to be there during the restore" as an experienced warning.)

The method for a shut-down guest described in "Moving a VM", re-interpreted as "Backing Up a VM", is the recommended way to back up a Virtualbox guest. You can test the validity of the backup with File-Compare (FC command).

Some host software can back up a full folder on the host then keep track of incremental changes in the files. Run that on the shut-down guest's folder. You may be able to run such software from the command line, and put tests in the batch file to only run the backup on shut-down guests. If you have a Windows host, see "Dynamic Windows CMD to run Vboxmanage on all guests" to get a start on the batch file.

Running a backup on a live guest really requires in-the-guest software for the guest to be backed up properly. If you want to keep using rsync, I have used Deltacopy on Windows hosts and guests, up to Win10, and that's where I got the rsync-on-Windows commands to run rsync directly on the command line on the Windows 7 host that handles the 280GB vdi.
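
As a starting point for the "only back up guests that are not running" test, here is a small Python sketch instead of a batch file. It relies on the "VBoxManage list vms" and "VBoxManage list runningvms" commands and assumes their usual one-line-per-guest output of a quoted name followed by a UUID in braces; the helper names are made up. Note that a guest missing from the running list may be powered off or in a saved state, so this is a first filter only.

```python
import re
import subprocess
from typing import Dict, List

# Assumed line format: "Guest name" {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
_VM_LINE = re.compile(r'^"(?P<name>.*)"\s+\{(?P<uuid>[0-9a-fA-F-]{36})\}\s*$')


def parse_vm_list(output: str) -> Dict[str, str]:
    """Map UUID -> guest name for every line that matches the assumed format."""
    vms = {}
    for line in output.splitlines():
        match = _VM_LINE.match(line.strip())
        if match:
            vms[match.group("uuid")] = match.group("name")
    return vms


def vboxmanage_list(what: str) -> str:
    """Run 'VBoxManage list <what>' (e.g. 'vms' or 'runningvms') and return its output."""
    result = subprocess.run(
        ["VBoxManage", "list", what],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def not_running(all_vms: Dict[str, str], running_vms: Dict[str, str]) -> List[str]:
    """Names of registered guests that do not appear in the running list."""
    return sorted(name for uuid, name in all_vms.items() if uuid not in running_vms)
```

On a host with VirtualBox installed, not_running(parse_vm_list(vboxmanage_list("vms")), parse_vm_list(vboxmanage_list("runningvms"))) would give the candidates for tonight's backup.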
Cubytus32

Re: Save disk space during backups?

Post by Cubytus32 »

I know that it's bad practice to try to back up an actively accessed file. In theory, a working backup could be made if all the disks and execution states were captured at the exact same nanosecond, but that's outside the scope of my post.

For this reason, I want to back up only shutdown or saved VMs, not running ones. The main advantage being that if a disk within the RAID0 array fails, I can simply turn on the backup server, clone the VDI back to the new disk, change its name to the same one recognized by VirtualBox, and restart from the last backup without further handling.

If I back up from within a VM, it has to support rsync (not always the case), and that would also lose saved snapshots I keep before playing different scenarios on these VMs. In case a restore is needed, I would have to reconfigure the VM again (I assume).

"Moving a VM" page is good, but from what I understand, requires more manual work. Ideally, I would like to issue only one command when I know the backup server is online.
scottgus1 wrote: Some host software can back up a full folder on the host then keep track of incremental changes in the files. Run that on the shut-down guest's folder.
Isn't rsync designed to do just that? Or does it do it inefficiently?
scottgus1

Re: Save disk space during backups?

Post by scottgus1 »

You can back up a saved VM, but it isn't recommended, since the ability to start that saved-state guest is tied to the specific host it came from and the version of Virtualbox it was saved under. An upgrade of the VB version or a different host will require deleting the saved state, leaving the guest data in a "power-pulled" state on restart.

Backing up within a live guest is possible without rsync: you can simply copy changed files to a shared folder, too. Or run other in-the-guest software. What guests are you running that aren't Linux (rsync usually built-in) or Windows (Deltacopy-rsync-capable)? Backing up within a live guest will only give you the latest state of the data, but it does allow you to recover with less lost data. A backup within the live guest along with a shut-down backup can get you really close to where you were before. Of course if you're not backing up live guests then the suggestion would be moot.

The "Moving a VM" suggestion is a simple folder-copy while the guest is shut down. But don't lose sight of the forest for the trees - folder copies are scriptable, as are the file-compares to confirm good backups, the logging of successful backups, and the re-attempts of failed backups. My backup routine for the office host, which runs Virtualbox guests including the office domain controller, file server, and LOB apps, is all scripted. It runs automatically at 12:30 Sunday morning, launched from the Task Scheduler, while I'm asleep. It reports to me after I wake up, and at other times through the day, on how the backups to two other Virtualbox-capable office PCs and to two offsite locations, in my house and my boss's house, have progressed. All without lifting a finger, virtual or physical.

Knowing when the backup server is available is as simple as checking for the existence of a known file in a shared folder.
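
To show how little is needed for a one-command routine, here is a bare-bones Python sketch of the pieces mentioned above: check for a known file on the backup share, copy the shut-down guest's folder, file-compare each copy and log the result. This is not my actual script; the marker file name "backup-server-online.txt" and all function names are invented for the example.

```python
import filecmp
import logging
import os
import shutil
import time
from typing import List

log = logging.getLogger("vm_backup")


def backup_server_available(marker_path: str, attempts: int = 3,
                            delay_seconds: float = 5.0) -> bool:
    """True once the known marker file is visible on the backup share."""
    for attempt in range(attempts):
        if os.path.isfile(marker_path):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


def backup_folder(src_dir: str, dest_dir: str) -> List[str]:
    """Copy every file under src_dir to dest_dir; return relative paths that failed the compare."""
    failed = []
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dest_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            shutil.copy2(src, dst)
            # shallow=False forces a byte-by-byte comparison, like FC.
            if filecmp.cmp(src, dst, shallow=False):
                log.info("verified %s", dst)
            else:
                log.error("mismatch after copy: %s", dst)
                failed.append(os.path.relpath(src, src_dir))
    return failed


def run_backup(vm_folder: str, share_root: str,
               marker_name: str = "backup-server-online.txt",
               attempts: int = 3, delay_seconds: float = 5.0) -> bool:
    """Back up one shut-down guest folder; True only if every file verified."""
    marker = os.path.join(share_root, marker_name)
    if not backup_server_available(marker, attempts, delay_seconds):
        log.warning("backup share not available, skipping %s", vm_folder)
        return False
    dest = os.path.join(share_root, os.path.basename(os.path.normpath(vm_folder)))
    return not backup_folder(vm_folder, dest)
```

Calling run_backup once per guest folder from a scheduled task gives the single-command behaviour, with the log telling you afterwards which guests made it across.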

I'm not skilled enough with rsync to tell if it can take incremental backups of a large file. My limited experience is that if a portion of the file has changed, the whole file is backed up.
Cubytus32

Re: Save disk space during backups?

Post by Cubytus32 »

scottgus1 wrote: Backing up within a live guest is possible without rsync: you can simply copy changed files to a shared folder, too. Or run other in-the-guest software. What guests are you running that aren't Linux (rsync usually built-in) or Windows (Deltacopy-rsync-capable)?
Hmm, Android, Chrome OS, Mikrotik OS (Also called CHR), XPenology, NAS4free, some other command-line-only Linuxes I keep for learning purposes.

I tend to keep most of these VMs for testing purposes, rarely for productivity. I assume they are here to be discarded if they go awry — but only when I decide it, and one by one, not because the HDD holding them decides to die!

Maybe I should ask the question more simply: knowing the virtual HDDs reside on an external drive, and machine settings are on the internal drive, how would I back up all of this so that I can re-create the VMs on any VirtualBox-capable computer without requiring multiple TBs of backup space?
socratis

Re: Save disk space during backups?

Post by socratis »

I did some research about 'rsync' and I think I should revise my thinking around this issue. Here's what I found. From "The rsync algorithm" (https://rsync.samba.org/tech_report/):
The algorithm identifies parts of the source file which are identical to some part of the destination file, and only sends those parts which cannot be matched in this way.
That to me sounds like 'incremental' backup at a file level, so not the whole file has to be copied, only the chunks that have changed. Also of note are two options in 'rsync':
--inplace
      This causes rsync not to create a new copy of the file and then
      move it into place. Instead rsync will overwrite the existing
      file, meaning that the rsync algorithm can't accomplish the full
      amount of network reduction it might be able to otherwise (since it
      does not yet try to sort data matches). One exception to this is
      if you combine the option with --backup, since rsync is smart
      enough to use the backup file as the basis file for the transfer.

      This option is useful for transfer of large files with block-based
      changes or appended data, and also on systems that are disk bound,
      not network bound.

      The option implies --partial (since an interrupted transfer does
      not delete the file), but conflicts with --partial-dir and
      --delay-updates. Prior to rsync 2.6.4 --inplace was also
      incompatible with --compare-dest and --link-dest.

      WARNING: The file's data will be in an inconsistent state during
      the transfer (and possibly afterward if the transfer gets
      interrupted), so you should not use this option to update files
      that are in use. Also note that rsync will be unable to update a
      file in-place that is not writable by the receiving user.

-W, --whole-file
      With this option the incremental rsync algorithm is not used and
      the whole file is sent as-is instead. The transfer may be faster
      if this option is used when the bandwidth between the source and
      destination machines is higher than the bandwidth to disk
      (especially when the "disk" is actually a networked filesystem).
      This is the default when both the source and destination are
      specified as local paths.
Note that prefixing an option with "no" negates its meaning, so "--no-whole-file" (or "--no-W") makes rsync transfer only the chunks of the file that have changed. According to the description of the feature, I believe this to be the default unless both source and destination are local paths.
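
Until I get to test it, here is how I would expect the two options to be combined, written as a small Python wrapper so it can be scheduled. Treat it as an untested sketch: the flags are the ones quoted above plus --archive, while the function names and the paths in the usage note are placeholders.

```python
import subprocess
from typing import List


def build_rsync_command(source: str, destination: str, delta: bool = True) -> List[str]:
    """Assemble an rsync command line for a large disk image.

    --inplace updates the existing destination file instead of writing a
    new copy; per the man page it must not be used on files that are in
    use, so this is meant for shut-down guests only.
    """
    cmd = ["rsync", "--archive", "--inplace"]
    # --no-whole-file asks for the delta algorithm even between local paths.
    cmd.append("--no-whole-file" if delta else "--whole-file")
    cmd += [source, destination]
    return cmd


def run_rsync(source: str, destination: str) -> int:
    """Run rsync and return its exit code (0 means success)."""
    return subprocess.run(build_rsync_command(source, destination), check=False).returncode
```

For example, run_rsync("/path/to/MyVM/", "backupserver:/backups/MyVM/") would be the shape of the call, with both paths being made up here.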

See a good discussion at http://superuser.com/questions/576035/d ... at-need-to

I don't have time right now to test this, but within the week I might run a test or two.
scottgus1

Re: Save disk space during backups?

Post by scottgus1 »

My experience with rsync shows what you've seen, Socratis. Rsync has the ability to crawl through the files in the source and destination, and send just the changed parts of a file. That means less network bandwidth for a backup, but the final file still exists in full.
scottgus1

Re: Save disk space during backups?

Post by scottgus1 »

Cuby, that's quite a list of non-rsync-capable OS's! Point taken, you may not be able to run typical in-the-guest backup software on those.

You mention the setup of your guests:
Cubytus32 wrote: virtual HDDs reside on an external drive, and machine settings are on the internal drive
This isn't the recommended way to configure a guest, although there are reasons for doing so. The reason for not keeping a guest in this way is that since the vdi is not in the same folder as the .vbox file (the "recipe" for the guest, so to speak), the .vbox must lay out the exact disk and path to the vdi. In the event the guest needs to be restored to different hardware, or in the case of an external drive the drive letter or designation changes, then the guest will not be able to boot up until the exact drive and path are recreated. When a guest drive is in the same folder as the .vbox, relative paths are used and the guest can be put anywhere and will always find its disks.

Put simply, it is easier for restoring and backup purposes to put the .vbox and the .vdi in the same folder.

That said, if one keeps careful notes on the drive letter/designation and path to the vdi and any snapshots for a particular guest, it is almost as easy to restore, just have to recreate the paths. Or you could modify the .vbox file and change the paths before you register the guest - it is an XML text file with a .vbox extension, and can be edited in Wordpad.
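
If you would rather check the paths with a script than by eye, something like the following Python sketch could list them. It assumes the disks appear in the .vbox XML as HardDisk elements carrying a "location" attribute, which is how the files I have looked at are laid out; verify against your own .vbox before relying on it. The function names are invented.

```python
import ntpath
import posixpath
import xml.etree.ElementTree as ET
from typing import List


def disk_locations(vbox_xml: str) -> List[str]:
    """Return the 'location' of every HardDisk element, in document order."""
    root = ET.fromstring(vbox_xml)
    locations = []
    for element in root.iter():
        # Strip any XML namespace prefix of the form {namespace}Tag.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "HardDisk" and "location" in element.attrib:
            locations.append(element.attrib["location"])
    return locations


def is_absolute(location: str) -> bool:
    """True for paths tied to a drive letter or filesystem root (Windows or POSIX style)."""
    return ntpath.isabs(location) or posixpath.isabs(location)
```

Any location for which is_absolute returns True is one that would have to be recreated exactly, or edited, when restoring to a different host.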

Since your guests are not laid out in the "all in the same folder" configuration, the "Moving a VM" backup method will not work for you. But simple file-copies and folder-copies of the guest folder and the vdi's, along with notes on where everything goes, will still work as a backup and will be just as good.
Cubytus32

Re: Save disk space during backups?

Post by Cubytus32 »

scottgus1 wrote: This isn't the recommended way to configure a guest, although there are reasons for doing so. The reason for not keeping a guest in this way is that since the vdi is not in the same folder as the .vbox file (the "recipe" for the guest, so to speak), the .vbox must lay out the exact disk and path to the vdi. In the event the guest needs to be restored to different hardware, or in the case of an external drive the drive letter or designation changes, then the guest will not be able to boot up until the exact drive and path are recreated. When a guest drive is in the same folder as the .vbox, relative paths are used and the guest can be put anywhere and will always find its disks.
I had a look in my drive and my assumption was wrong. I have a .vdi, .vbox and .vbox-prev at the root of each VM folder on the external drive.
scottgus1 wrote: That said, if one keeps careful notes on the drive letter/designation and path to the vdi and any snapshots for a particular guest, it is almost as easy to restore, just have to recreate the paths. Or you could modify the .vbox file and change the paths before you register the guest - it is an XML text file with a .vbox extension, and can be edited in Wordpad.
When moving from the single drive I used previously, I simply cloned the original partition to the RAID0 array, and kept the exact same name as the previous drive. It worked, and I guess I could rename the external drive and re-register the .vbox file?

So basically the backup method I was trying to optimize is already the best one? Keep the big, fat VDIs, copy them in their entirety, and skip unchanged files? I thought I'd be able to keep different snapshots, not just one or two at best.