Notes for Week 8

  1. Most software is distributed via CD-ROM, which constitutes its own backup. The lifetime of a manufactured CD-ROM is almost certainly longer than the period of usefulness of its contents, and even CD-Rs are expected to last up to 20 years when kept away from extremes of temperature and sunlight. It therefore behooves the system administrator to segregate, whenever possible, the files that are created or modified after software installation is complete. By carefully backing up system configuration files and ensuring that all non-system files reside in /home (for example), the job of determining what needs to be backed up is made much easier.
  2. All backups should be multi-generational: you should never have just one copy of the backup media (disk or tape). If a system failure should occur while writing to your only backup, both the backup and the original are potentially lost. If you back up to, for example, flash drives, keep some number (n) of them and number them; after using #1, use #2 the next time you back up, #3 the time after that, etc. After using #n, the next backup will be on #1, and so on. This not only protects previous backups while creating new ones, but also gives you a recent file history, so that problems can be tracked, or files deleted since the last backup rotation can be restored.
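
    The rotation described above can be sketched in shell. This is only an illustration: the state file (which remembers the number of the medium used last) and the function name next_slot are hypothetical, and n is set to 3 for the demonstration.

```shell
#!/bin/sh
# Sketch: pick the next backup medium in an n-way rotation.
# The state file and the function name are assumptions made for this example.
n=3
state_file=$(mktemp)        # stand-in for a file kept somewhere permanent
echo 0 > "$state_file"      # 0 means "no backup has been made yet"

next_slot() {
    last=$(cat "$state_file")
    slot=$(( last % n + 1 ))    # 1, 2, ..., n, then back to 1
    echo "$slot" > "$state_file"
    echo "$slot"
}

for run in 1 2 3 4 5; do
    echo "backup run $run uses medium #$(next_slot)"
done
```

    With n=3, the five runs use media #1, #2, #3, #1 and #2 in that order.
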
  3. The amount of disk space taken up by files modified or created since software installation may be larger than you want to back up every day. In that case, you can back up incrementally on a daily basis, backing up only those files that have been modified since the previous backup, and then do a complete backup on, for example, a weekly basis. When using incremental backups, your partial backups form one multi-generational set, and your complete backups form another. Typically, complete backups might be to tape, CD-R or DVD-R, while incremental backups might be to flash drive or tape. Floppy disks are not stable enough to be relied upon for backup purposes.
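
    One way a backup script might choose between the two kinds of backup is by day of the week. The sketch below assumes that the complete backup is made on Sunday; the function name backup_kind is hypothetical.

```shell
#!/bin/sh
# Sketch: decide between a complete and an incremental backup.
# Assumption: complete backups are made on Sundays, incrementals otherwise.
backup_kind() {
    # $1 is the day of the week as printed by `date +%u` (1 = Monday ... 7 = Sunday)
    if [ "$1" -eq 7 ]; then
        echo full
    else
        echo incremental
    fi
}

today=$(date +%u)
echo "day $today: $(backup_kind "$today") backup"
```
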
  4. Backups can be made remotely via a network connection, but keep in mind that network transmission is not foolproof: CRCs and checksums are good, but errors can still pass undetected (a CRC-N lets roughly one randomly corrupted frame in two to the N through, which for Ethernet's CRC-32 is about one in four billion corrupted frames). Off-site storage of complete backups is a good idea, since the possibility of fire or natural disaster is always present, but the odds of multiple sites incurring simultaneous loss are extremely small.
  5. The tar command is used to make backups; tar output is called a "tarball" (with typical suffix ".tar"), and if compressed, is called a "zipped tarball" (with typical suffix ".tar.gz" or ".tgz"). Several variants of the tar command are useful; note that all use the z option to compress the backup (the j option will do slightly better compression but takes considerably longer):

    • tar -czf /mnt/backup/home.tgz /home/* (to back up the contents of /home to the device mounted at /mnt/backup)
    • tar -czf backup.tgz /home/* (to back up the contents of /home to a file on disk, which can later be copied to the backup media; this gives an added measure of redundancy, if enough disk space is available)
    • tar -czf files.tgz -T list_of_files.txt (to back up those files listed in the file list_of_files.txt)
    • tar -czPf backup.tgz /home/* (-P retains the leading "/" in each file name, so that restored files may overwrite the originals; note that to extract the files into their original locations, you will need -xzPf)
    • tar -tzf backup.tgz (to list the files in the backup)
    • tar -xvzf backup.tgz -T list_of_files.txt (to extract some files from the backup)
    • tar -xzpf backup.tgz (extracts permissions as well as file contents)
    • tar -xzmf backup.tgz (sets the modification time stamp of each extracted file to the time of extraction, rather than restoring the archived time stamp)
    • tar -xjf backup.tar.bz2 (extracts a tarball compressed using bzip2)
    The v(erbose) option above causes tar to list the files that are being (un)tarred.
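
    A minimal round trip using these variants might look like the sketch below. It works in a scratch directory instead of /home so that it can be tried safely; the directory and file names are invented for the example, and it uses tar's -C option (change directory), which does not appear in the list above.

```shell
#!/bin/sh
# Sketch: create, list and extract a gzipped tarball in a scratch directory.
# All paths below are made up for the example.
work=$(mktemp -d)
mkdir -p "$work/home/alice" "$work/restore"
echo "hello" > "$work/home/alice/notes.txt"

# -C changes directory before archiving, so stored names are relative ("home/...").
tar -czf "$work/backup.tgz" -C "$work" home

# List the files in the backup.
tar -tzf "$work/backup.tgz"

# Extract into a separate directory, restoring permissions (-p).
tar -xzpf "$work/backup.tgz" -C "$work/restore"
cat "$work/restore/home/alice/notes.txt"
```
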
  6. The find command can be used to create a list of files for input to another program (e.g., tar). It has many options, but a few of the most useful are demonstrated here:

    • find starting-directory -name pattern
      Here, starting-directory is the place in the directory tree at which to begin the search (the search continues through the entire sub-tree below that point), and pattern is a file name pattern (either a full file or directory name, or one that uses wild cards, e.g. "*.mpg" or "X*"). Quote wild-card patterns so that the shell passes them to find unexpanded.
    • find starting-directory -iname pattern
      In this example, the file name search will be case insensitive.
    • find starting-directory -newer path
      Here path is a file or directory, and only those files are found which have been modified more recently than path.

      The newer option is very convenient in designing backup scripts, in conjunction with the touch command: if you "touch /root/.backup" at the end of each backup, "find / -newer /root/.backup" will find all files and directories modified since the last backup.

  7. Any of the above find commands will find only files (and not directories) if "-type f" is used. This is helpful in backup scripts, since a directory which has only been modified with a single file deletion will be found without using "-type f", but such a directory obviously does not need backing up. On the other hand, a full backup will want to include directories, so that all existing directories are recreatable, even if they are empty.

    Any of the find commands will produce output more like "ls -l" by using the option "-ls". This is useful if information about the found files is needed.
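
    The marker-file technique from item 6, combined with "-type f", can be tried out safely in a scratch directory. In the sketch below the file names are invented, and "touch -t" is used only to give the older file and the marker time stamps in the past.

```shell
#!/bin/sh
# Sketch: find the regular files modified since the last backup marker was touched.
# File names are made up; touch -t sets explicit (old) time stamps for the demo.
work=$(mktemp -d)
mkdir "$work/data"
touch -t 202001010000 "$work/data/old.txt"   # modified before the last backup
touch -t 202006010000 "$work/.backup"        # marker: time of the last backup
echo "changed" > "$work/data/new.txt"        # modified after the last backup

# -type f skips the directory itself, although it too is newer than the marker.
changed=$(find "$work/data" -newer "$work/.backup" -type f)
echo "$changed"

# At the end of a real backup, refresh the marker for the next run.
touch "$work/.backup"
```

    Only new.txt is reported; old.txt predates the marker, and the data directory is excluded by "-type f".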

  8. A lone "-" in place of a file name often means stdin, allowing commands to be piped:
    find / -newer /root/.backup -type f | tar -czf backup.tgz -T -
    Here, the files modified since the last backup will be placed in the gzipped tarball "backup.tgz". The "-T" option to tar tells it to get the list of files to place in the tarball from a file, and "-T -" specifies that stdin is the file to be used. By piping stdout from the find command (which is of course the list of file names) into stdin for the tar command, we avoid the necessity of creating a temporary file on disk with the list of file names to be backed up. Note that stdin and stdout (as well as stderr) are "streams": a stream of data can only be scanned once (the water in a stream passes by only once; it never returns as long as you ignore evaporation and rain). This means that some tar options won't work, since the list of file names cannot be rescanned (e.g., -W to verify the tarball).

    Since tar stores path names it is sometimes hard to remember exactly what filename to use when extracting a specific file. You can use a pipe to make the job easier:

    tar -tzf backup.tgz | grep -e 'filename' | tar -xvzf backup.tgz -T -
    will list the table of contents of the tarball, select with grep the entries that match the filename you are looking for (complete with the path information tar needs to find them), and extract them.
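
    Both pipelines can be exercised together in a scratch directory. The sketch below is an illustration under stated assumptions: the directory layout and file names are invented, and relative paths are used so that nothing outside the scratch directory is touched.

```shell
#!/bin/sh
# Sketch: incremental backup through a pipe, then restore of one file by name.
# Directory and file names are made up for the example.
work=$(mktemp -d)
mkdir -p "$work/src/docs" "$work/restore"
touch -t 202001010000 "$work/src/docs/old.txt"
touch -t 202006010000 "$work/.backup"            # time of the last backup
echo "new data" > "$work/src/docs/report.txt"    # changed since then

# find writes the file names to stdout; tar reads them from stdin (-T -).
( cd "$work" && find src -newer .backup -type f | tar -czf incr.tgz -T - )

# Look up the stored path with grep and hand it back to tar for extraction.
( cd "$work/restore" &&
  tar -tzf ../incr.tgz | grep -e 'report.txt' | tar -xzf ../incr.tgz -T - )

cat "$work/restore/src/docs/report.txt"
```
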
  9. There are two commands which can be used in or out of scripts to compare files: diff, which compares two text files, and cmp, which compares two binary files. They are extremely useful when checking to see what has changed between (for instance) a configuration file and its backup copy, or to see if two binaries (programs) are the same.

    Note that if you write a file to disk and then use cmp to see if it has been written correctly, you must first umount and re-mount the disk. This guarantees that you are checking the actual file on the disk, and not just the copy in cache.
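
    In scripts, the exit status of these commands is usually what matters: both return 0 when the files are identical and 1 when they differ. A small sketch, using an invented configuration file and its backup copy:

```shell
#!/bin/sh
# Sketch: compare a configuration file with its backup copy.
# The file names and contents are made up for the example.
work=$(mktemp -d)
printf 'port=22\nroot_login=no\n' > "$work/sshd.conf"
cp "$work/sshd.conf" "$work/sshd.conf.bak"
printf 'port=22\nroot_login=yes\n' > "$work/sshd.conf"   # someone edits the file

# cmp -s prints nothing; only its exit status reports the result.
if cmp -s "$work/sshd.conf" "$work/sshd.conf.bak"; then
    status="identical"
else
    status="different"
fi
echo "$status"

# diff shows which lines changed; it exits with 1 when differences exist.
diff "$work/sshd.conf.bak" "$work/sshd.conf" > "$work/changes.txt" || true
cat "$work/changes.txt"
```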

  10. You can use md5sum to detect when files have been modified (after transmission across a network, or if searching for possible intrusion). "md5sum (filename) > (filename).md5sum" will compute the checksum and "md5sum -c (filename).md5sum" will check the file against the previously computed checksum.
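
    The two md5sum commands can be tried on a throwaway file, as in the sketch below. The file name and contents are invented; the output of "md5sum -c" is discarded so that only its exit status is used.

```shell
#!/bin/sh
# Sketch: record a checksum, then detect a later modification.
# The file name and contents are made up for the example.
work=$(mktemp -d)
echo "original contents" > "$work/program.bin"
md5sum "$work/program.bin" > "$work/program.bin.md5sum"

# md5sum -c exits with 0 if the file still matches, non-zero otherwise.
if md5sum -c "$work/program.bin.md5sum" >/dev/null 2>&1; then
    before="OK"
else
    before="FAILED"
fi

echo "tampered" >> "$work/program.bin"    # simulate a modification

if md5sum -c "$work/program.bin.md5sum" >/dev/null 2>&1; then
    after="OK"
else
    after="FAILED"
fi
echo "before: $before, after: $after"
```
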
  11. EXERCISES for Week 8:

    1. Write and debug a shell script to back up your home directory to the media of your choice. Be sure to handle all mounting and unmounting. The script should also check the backup tarball against the original to make sure that it was backed up correctly.

      Submit the script to your instructor.

    2. Find all of the files in /var which have been changed since you installed your system.
    3. Once a hacker breaks into your system, he or she immediately replaces several key programs with modified versions which will hide the hacker's tracks. These include login, ls and ps. Compute checksums for these files and verify them. Why would you do this for a root filesystem which is mounted read-only?


©2008, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.

Please send comments or suggestions to the author.