When discussing file-level backup types, there are two main schools of thought: differential and incremental.
Differential backups start with a full backup, and each backup after that transfers everything that has changed since that full backup, until you decide to do a full backup again. For example, if I make a full backup on Monday and run a differential backup on Thursday, it will contain every change I made between Monday and Thursday. If I then run another differential backup on Sunday, it will contain all changes made from Monday to Sunday. This resets when I do a new full backup.
Incremental backups improve on this idea. With incremental backups, I also start with a full backup. Every backup after that transfers only what has changed since the last backup, be that a full backup or an incremental backup. This means that if I run a full backup on Monday, my Thursday incremental backup contains only the changes from Monday to Thursday, and my Sunday incremental backup contains only the changes from Thursday to Sunday.
Simply put, differential backups are more encompassing, take longer to back up, and are quick to restore, whereas incremental backups are more precise and quicker to back up, but take longer to restore.
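To make the Monday/Thursday/Sunday example concrete, here is a minimal sketch in Python. The file names, dates, and the `files_to_transfer` helper are hypothetical and only illustrate the selection rule; real backup tools track changes in more robust ways than comparing modification times.

```python
from datetime import datetime

def files_to_transfer(mtimes, reference_time):
    """Return the names of files modified after reference_time, sorted."""
    return sorted(name for name, mtime in mtimes.items() if mtime > reference_time)

# Hypothetical backup schedule for one week.
monday = datetime(2024, 1, 1)    # full backup
thursday = datetime(2024, 1, 4)  # second backup
sunday = datetime(2024, 1, 7)    # third backup (the one we are planning)

# Hypothetical modification times as seen on Sunday.
mtimes = {
    "report.doc": datetime(2024, 1, 2),   # changed on Tuesday
    "budget.xls": datetime(2024, 1, 5),   # changed on Friday
    "notes.txt": datetime(2023, 12, 20),  # untouched since before the full backup
}

# Differential: always compare against the last FULL backup (Monday).
differential_sunday = files_to_transfer(mtimes, monday)

# Incremental: compare against the last backup of ANY kind (Thursday).
incremental_sunday = files_to_transfer(mtimes, thursday)

print("differential:", differential_sunday)
print("incremental: ", incremental_sunday)
```

The differential run on Sunday picks up both changed files again, while the incremental run picks up only the file changed after Thursday. That is the trade-off described above: more data per backup, but fewer pieces to combine at restore time.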
Algorithm | Step 1 | Step 2 | Step 3 |
---|---|---|---|
rdiff | Calculate Delta | Transfer Delta | N/A |
rsync | Calculate Delta | Transfer Delta | Merge Delta |
Of course, there are many other differences between the two, but the general concept holds. As you can see from the table above, they are virtually identical except for the final step. However, this final step is very important once you understand its ramifications. If you look at the destination of each of these backups, you will see very different results.
An rdiff backup does not merge the delta during backup; it does so during restore. This means that the destination will contain only delta files. With rsync backups the delta is merged at the time of backup, meaning that the destination will contain complete files that you can open and use.
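The three steps can be sketched as follows. This is a deliberately simplified stand-in, not the real rsync or rdiff implementation: real rsync pairs a rolling weak checksum with a strong hash, whereas this sketch hashes a block at every offset. The block size, function names, and sample data are all assumptions made for illustration, and the transfer step is not modeled.

```python
import hashlib

BLOCK = 4  # tiny block size so the example is easy to follow

def signature(old: bytes) -> dict:
    """Map the hash of each full fixed-size block of the old file to its offset."""
    sig = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        sig.setdefault(hashlib.sha256(old[off:off + BLOCK]).hexdigest(), off)
    return sig

def calculate_delta(sig: dict, new: bytes) -> list:
    """Step 1: describe the new file as ('copy', offset) and ('data', bytes) operations."""
    delta, literal, i = [], bytearray(), 0
    while i < len(new):
        block = new[i:i + BLOCK]
        off = sig.get(hashlib.sha256(block).hexdigest()) if len(block) == BLOCK else None
        if off is not None:
            if literal:  # flush unmatched bytes collected so far
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", off))
            i += BLOCK
        else:
            literal.append(new[i])
            i += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta

def merge_delta(old: bytes, delta: list) -> bytes:
    """Step 3: rebuild the complete new file from the old file plus the delta."""
    out = bytearray()
    for op, arg in delta:
        out += old[arg:arg + BLOCK] if op == "copy" else arg
    return bytes(out)

old = b"the quick brown fox"
new = b"the quick red brown fox"

delta = calculate_delta(signature(old), new)  # Step 1 (Step 2 would send `delta`)
rdiff_destination = delta                     # rdiff-style: only the delta is stored
rsync_destination = merge_delta(old, delta)   # rsync-style: merged into a usable file
```

In this sketch the rdiff-style destination holds only a list of copy and data operations, which is meaningless without the earlier file, while the rsync-style destination holds the complete new file.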
For the purpose of this article I will compare three backup scenarios: an rdiff backup, an rsync backup, and an rsync backup over HTTP. I think rsync over HTTP is worth examining, since it opens up many possibilities for improving backups as a whole. The metrics we will use are Manageability, Usability, Reliability, Security, and Performance. For all of these, we will assume a generic piece of software built on these algorithms.
Criterion | rdiff | rsync | rsync over HTTP |
---|---|---|---|
Configuration | Source Configurable | Source & destination configurable | Source & destination configurable |
User Management | Limited to no User management | Limited to no User management | High User management |
Destination Differences | Incomplete Destination Files | Complete Destination Files | Complete Destination Files |
Versioning | Versioning by Default | Versioning with Custom Scripts | Versioning via Software |
With rdiff, you can configure a source to select which files to back up and direct them to a storage destination. When the files get to the destination, they are simply stored. With rsync you can configure both the source and the destination. With rsync over HTTP you can configure both of these via a web portal.
A benefit of rdiff over rsync is that rdiff always maintains versions of a file, whereas rsync on its own maintains only the most recent version. With rsync over HTTP there are methods that can be implemented to add versioning as a feature.
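As an illustration of what "versioning with custom scripts" can look like for plain rsync, here is one common approach, sketched in Python: write each run into its own dated directory and pass rsync's `--link-dest` option so that unchanged files are hard-linked to the previous snapshot instead of being copied again. This is not necessarily the method the comparison above has in mind. The `snapshot_command` helper, the paths, and the date-stamped directory names are hypothetical, and the sketch only builds the command rather than running it.

```python
from pathlib import PurePosixPath
from typing import List, Optional

def snapshot_command(source: str, backup_root: str, stamp: str,
                     previous: Optional[str]) -> List[str]:
    """Build an rsync command that writes a new dated snapshot directory.

    Files unchanged since `previous` are hard-linked via --link-dest,
    so every snapshot looks complete but only changed files use new space.
    """
    root = PurePosixPath(backup_root)
    cmd = ["rsync", "-a"]
    if previous is not None:
        cmd.append(f"--link-dest={root / previous}")
    # Trailing slash on the source copies its contents, not the directory itself.
    cmd += [source.rstrip("/") + "/", str(root / stamp) + "/"]
    return cmd

# Hypothetical paths: first run has no previous snapshot, second run links to it.
first = snapshot_command("/home/user/docs", "/backups", "2024-01-01", None)
second = snapshot_command("/home/user/docs", "/backups", "2024-01-04", "2024-01-01")
print(" ".join(second))
```

To actually run a snapshot, the resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where rsync is installed.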
Criterion | rdiff | rsync | rsync over HTTP |
---|---|---|---|
Network Requirement | Can run over a slow network | Can run over a slow network | Can run over a slow network |
Flexibility | Not Flexible | Flexible | Very Flexible |
Portability | Source Files not Portable | Portable Source Files | Portable Source and Destination Files |
File Interactiveness | Can Only interact from Source | Can Interact from Source and Destination from their locations | Can Interact from Source and Destination from any location |
Synchronization | No Synchronization | Synchronization Possible | Multi-level Synchronization |
When it comes to usability there are many factors. For an average user, rdiff and rsync are fairly similar when it comes to triggering backups from, say, a command line. Software built on these algorithms adds features and ease of use for the average user, but what users can do with each is different. rdiff is less flexible than rsync because it lacks the destination-side receiver that rsync requires, which is also why rdiff cannot merge at the destination.
The biggest downfall for rdiff in this category is the loss of synchronization. With rdiff, one side has meaningful and usable files, whereas the other has only the delta files. With rsync, we can keep A and B identical across a network; with rsync over HTTP you add the possibility of keeping multiple machines in sync with each other.
Criterion | rdiff | rsync | rsync over HTTP |
---|---|---|---|
File Corruption | Corruption is a Major Issue | Corruption is a Minor Issue | Corruption is a Minor Issue |
Half Finished Backups | Can Recognize Partial Backups | Can Recognize Partial Backups | Can Recognize Partial Backups |
With respect to reliability, there is a major difference between rdiff and rsync. The entire point of a backup is to be able to restore data when needed. If your backup software maintains versions of files, this is even more true. The restore process differs between the two algorithms, again because of the merging process.
When you restore via rdiff, the source machine pulls the necessary files from the destination machine and rebuilds the file starting from the base, applying each version in increasing order until it reaches the most recent copy of the file. The rsync algorithm does not do versioning by default, since it merges the deltas at the destination as they arrive. Software that uses rsync over HTTP can offer features that maintain versions of files as well, with the difference being that restoration happens in the reverse order compared to rdiff.
For example, say that in all three scenarios we back up a previously backed-up file five times with changes, and now we need to restore the most recent copy. However, in this situation version 2 is corrupt. You can see the effect below; remember that rsync by itself does not maintain these versions.
Green = We can successfully restore this
Red = We can not restore this
Blue = Can only restore this if it is the most current version.
rdiff | rsync | rsync over HTTP |
---|---|---|
Original File | New Current File | Original File |
Version 1 | New Current File | Version 1 |
Version 2 | New Current File | Version 2 |
Version 3 | New Current File | Version 3 |
Version 4 | New Current File | Version 4 |
Most Current File | New Current File | Most Current File |
This may sound a bit confusing, but think of the restore process for each as follows:
In most cases, you will not need to restore the original copy of a file, but a version somewhere in between. In this situation, the longer a backup runs, the more reliable rsync over HTTP becomes, since you can still retrieve the newer versions after a corruption occurs.
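The effect of the corrupt version 2 can be modeled with a short sketch. It rests on an assumption drawn from the description above: the rdiff-style scheme rebuilds forward from the original file, while the rsync-over-HTTP-style scheme keeps the newest copy complete and walks backwards through older versions. The function names are hypothetical, and this models only which versions are reachable, not any real tool's storage format.

```python
def restorable_forward(versions, corrupt):
    """Forward chain: rebuild from the original upward.

    A corrupt link blocks that version and every later one.
    """
    result, chain_intact = {}, True
    for version in versions:
        if version == corrupt:
            chain_intact = False
        result[version] = chain_intact
    return result

def restorable_reverse(versions, corrupt):
    """Reverse chain: start from the newest complete copy and walk back.

    A corrupt link blocks that version and every older one.
    """
    result, chain_intact = {}, True
    for version in reversed(versions):
        if version == corrupt:
            chain_intact = False
        result[version] = chain_intact
    return result

versions = ["Original File", "Version 1", "Version 2",
            "Version 3", "Version 4", "Most Current File"]

forward = restorable_forward(versions, "Version 2")  # rdiff-style
reverse = restorable_reverse(versions, "Version 2")  # rsync-over-HTTP-style

for version in versions:
    print(f"{version:18} forward={forward[version]!s:5} reverse={reverse[version]}")
```

Under these assumptions the forward chain can still produce only the original file and version 1, so the most recent copy is lost, while the reverse chain can still produce the most current file, version 4, and version 3. Plain rsync has only the current file to offer in either case.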
rdiff | rsync | rsync over HTTP |
---|---|---|
SSH | SSH | No SSH |
No SSL | SSL | SSL |
No Authentication / 2FA | Authentication / No 2FA | Authentication + 2FA |
Security is a major concern with backups, especially with data breaches becoming more and more commonplace. Typical rdiff and rsync backups don't provide much security during the transfer, but they let you secure both machines in whatever way you wish. Rsync over HTTP, on the other hand, offers a variety of additional security measures that apply on the fly. There are many methods of transferring data over HTTP with rsync, and software that uses them provides a range of additional security measures, such as requiring SSL, requiring 2FA on accounts, and using built-in AES encryption.
Performance is a slightly different metric from the others. On backup time alone, rdiff should be faster than both rsync and rsync over HTTP, because it skips the merge step. Restoration should be faster with rsync and rsync over HTTP than with rdiff, since rdiff still has to merge at that point, so taken end to end they are very similar.
Overall, the choice of which algorithm, or software, to use when backing up files is different for everyone. Some situations may not need the advantages of rsync over HTTP, such as a backup on a completely closed network. However, once you introduce a network into the picture, the overall benefits and scalability provided by rsync over HTTP are unmatched.