TA: the Transparent Archivist

(Version 1.9)

Copyright © 2005 David Flater.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program.  If not, see <https://www.gnu.org/licenses/>.

THE TA SOFTWARE DISTRIBUTION IS AVAILABLE FROM: https://flaterco.com/files/ta/

Contents

  1. What TA is
  2. Requirements
  3. Summary of supported file systems
  4. Limitations of supported file systems
  5. Limitations of TA
  6. Installation
  7. Building archives
  8. Burning to disc
  9. Mounting archives
  10. Validating archives
  11. Restoring archives
  12. Adding hashes to other things
  13. The -l switch
  14. Windows portability problems
  15. Troubleshooting
  16. Accessing data from obsoleted formats
  17. Support
  18. Changelog

What TA is

TA, the Transparent Archivist, is a front-end program that reduces the fuss and muss involved in building archives under Linux.  It is "transparent" in that the archives that it produces can be examined with ls and restored with cp.  You do not need TA to recover an archive built by TA.

TA is not a disc burning program.  TA builds disc images but does not burn them.  The value added by TA is:

  1. automatically determines where to break archives across disc boundaries ("simple disc spanning");
  2. automatically stores and validates SHA-512 hashes for all files;
  3. simplifies the process of building disc images.

TA offers eight choices of file system for the archives:  three variants of ISO 9660, three variants of ext2, squashfs, and UDF.

Requirements

Not all packages are required in all modes (e.g., you don't need squashfs-tools unless you are making squashfs discs).

Package / program                Version tested
Linux kernel                     6.11.2
GNU coreutils (cp and rm)        9.5
e2fsprogs (mke2fs and tune2fs)   1.46.5
libmagic                         5.41
GNU find                         4.8.0
libdstr                          1.0
mhash (libmhash)                 0.9.9.9
XZ Utils (liblzma)               5.2.5
cdrtools (mkisofs)               3.02a09 *
udftools (mkudffs)               2.3
util-linux (losetup, mount)      2.37.4
squashfs-tools (mksquashfs)      4.5

* In certain Linux distributions, mkisofs is a symbolic link to, or has simply been replaced by, a program called genisoimage.  This is not cdrtools but a forked project known as cdrkit.  Genisoimage has issues and its use with TA is not supported.

To run TA you must have plenty of free disk space.  In addition to the room needed for the final images, iso, isoj, isorr and squashfs require room for a temporary copy of all of the files on a given disc.

To build ext2, le2a, le2f and udf archives you must have sufficient privileges to mount and unmount file systems.

Preserving ownership on archived files or archiving files with unfriendly permissions requires TA to be run as root.

Your kernel must support whatever file systems you are using.
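
One quick way to see whether the running kernel currently has a given file system type registered is to look in /proc/filesystems.  The following is a minimal sketch and not part of TA; the function name fs_registered is made up for illustration.  The type names to look for are the ones passed to mount -t (iso9660, ext2, squashfs, udf).  A driver built as a module may not be listed until it has been loaded, so a negative answer is not conclusive.

```shell
# fs_registered TYPE [FILE]
# Succeed if TYPE appears as a file system name in FILE
# (default /proc/filesystems).  Illustrative helper, not part of TA.
fs_registered() {
    # Lines look like "nodev<TAB>proc" or "<TAB>ext2"; the file system
    # name is always the last field on the line.
    awk -v want="$1" '$NF == want { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}
```

For example, fs_registered squashfs || echo "squashfs is not registered" reports a missing (or not yet loaded) squashfs driver.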

Summary of supported file systems

Identifier  Description
isoj        ISO-9660:1988 Level 1 extended by Joliet and deep directories.  Unless you need compatibility with Windows XP, avoid this legacy file system.
iso         ISO-9660:1999 version 2, a standard for CD-ROMs and other optical media.
isorr       ISO-9660:1999 version 2 plus Rock Ridge protocol, an extension to preserve more of the features of Unix file systems.
ext2        A normal Linux file system.  Although ext4 now is prevalent, ext2 suffices for disc images and avoids needless overhead.
le2a        LUKS, ext2 plus 256-bit AES (standard encryption).
le2f        LUKS, ext2 plus 256-bit Twofish (deprecated encryption).  Retained only to avoid a major version bump of TA.
squashfs    Compressed Linux file system.
udf         Universal Disk Format, a standard for DVDs and Blu-rays.

Limitations of supported file systems

The limits stated below for the various file systems are not standard or theoretical limits but actual limits determined by testing under Linux.  Of course, the same disc may be read differently by a different operating system or a different version of Linux.

"8-bit agnostic" file name encoding means that file names are recorded as strings of 8-bit characters with no translation.  Whether your ambient codeset is ISO 8859-1, UTF-8, or whatever, that is what goes on the disc.  As long as you read the disc in the same context in which it was recorded, all file names should survive intact.

Identifier  File name encoding  File name length limit  File size limit              Year limits  File ownership, permissions, symbolic links?  Compression?  Windows problems
isoj        UTF-16*             206 B (103 char.)       (2^32 − 2) B = 4294967294 B  1902–2037    No                                            No            Lower file size limit, dates messed up
iso         8-bit agnostic      207 B                   > 9 GiB                      1902–2037    No                                            No            Charset mismatch, forbidden characters, lower file size limit, dates messed up
isorr       8-bit agnostic      248 B                   > 9 GiB                      1902–2037    Yes                                           No            Rock Ridge ignored, names truncated, charset mismatch, forbidden characters, lower file size limit, dates messed up
ext2        8-bit agnostic      255 B                   > 9 GiB                      1902–2445    Yes                                           No            No support
le2a        8-bit agnostic      255 B                   > 9 GiB                      1902–2445    Yes                                           No            No support
le2f        8-bit agnostic      255 B                   > 9 GiB                      1902–2445    Yes                                           No            No support
squashfs    8-bit agnostic      255 B                   > 9 GiB                      1970–2105    Yes                                           xz            No support
udf         UTF-8               254 B                   > 9 GiB                      1–65535      Yes                                           No            Unreliable, dates messed up

* The following characters, which are allowed in Linux file names, are lost in translation to Joliet:  *:;?\
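
Before building an isoj archive it can be worth listing the names that would be affected.  The following is a minimal sketch using GNU find; the function name joliet_unsafe_names is made up for illustration and is not part of TA.

```shell
# joliet_unsafe_names DIR
# Print every path under DIR whose final component contains one of the
# characters * : ; ? \   (illustrative helper, not part of TA).
joliet_unsafe_names() {
    # The bracket expression lists the five characters.  The backslash is
    # written twice because it is the escape character in find patterns.
    find "$1" -name '*[*:;?\\]*' -print
}
```

If joliet_unsafe_names src prints nothing, no names under src contain any of those five characters.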

Limitations of TA

TA does not handle any file types other than regular files, directories, and symbolic links.

TA does not preserve empty directories.

TA does not preserve timestamps on directories.

TA does not handle files that are too big to fit on the target media in one piece.  TA does not split files and it does not reorder files when filling up discs.
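
The consequence of "no splitting, no reordering" can be modeled in a few lines of shell.  This is only a sketch of the behavior described above, not TA's actual code: the function name span_discs is made up, sizes are plain byte counts, and file system overhead is ignored.

```shell
# span_discs CAPACITY SIZE...
# Assign files, given by size in bytes and taken strictly in order, to
# numbered discs.  Prints "DISC SIZE" for each file.
span_discs() {
    capacity=$1; shift
    disc=1; used=0
    for size in "$@"; do
        # A file bigger than a whole disc cannot be archived at all.
        if [ "$size" -gt "$capacity" ]; then
            echo "file of $size bytes cannot fit on one disc" >&2
            return 1
        fi
        # No splitting and no reordering: if the next file does not fit,
        # close the current disc and start a new one.
        if [ $((used + size)) -gt "$capacity" ]; then
            disc=$((disc + 1)); used=0
        fi
        used=$((used + size))
        echo "$disc $size"
    done
}
```

For example, span_discs 10 6 5 4 5 puts the files on three discs (6 | 5, 4 | 5), even though a packer that reordered files could manage with two (6, 4 | 5, 5).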

TA does not handle file system race conditions.  If a file to be archived changes while TA is running, its hash will be wrong.  If a file to be archived is deleted while TA is running, TA will exit with an error.

You may not archive a file in the root directory called ta-hashes.txt because that is where TA stores the hashes.

Installation

TA is packaged with the popular and portable GNU automake, so all usual GNU tricks should work.  Help on configuration options can be found in the INSTALL file or obtained by entering ./configure --help.

Normally, one should only need to do the following to compile and install the programs ta, tahash, and taval:

$ ./configure
$ make
$ su
# make install

The distribution includes source for a program called createfile that is useful for testing TA.  It is not normally built.  If for some reason you want to build it, configure with --enable-test-progs.

Building archives

Usage: ta [options] discsize filesystem workingdir imagedir src [src...]

Options:
-l      Tweak ext2/le2a/le2f to maximize usable space.
-nornd  Don't initialize encrypted volumes with random data.
-p      Force file permissions to reasonable defaults.
-r src  Replicate file src in *every* image.
-w      Wait for confirmation after completing each image.

discsize:  cd74, cd80, dvd+r, dvd+rdl, bd-r, or an arbitrary size specified in
  bytes.
filesystem:  iso, isoj, isorr, ext2, le2a, le2f, squashfs or udf.
workingdir:  for ext2, le2a, le2f or udf this is just a mount point that we can
  use.  For others, this must be an existing, empty directory that we can fill
  up and then wipe clean.
imagedir:  disc images will be written here, overwriting any files that are in
  the way.  Make sure it is on a file system that can handle big files if you
  are creating DVD-sized images.
src:  stuff to archive.  Should usually be a directory, but you can do single
  files if you want.

cd74 and cd80 refer to 74- and 80-minute CD-Rs or RWs.  dvd+r and dvd+rdl refer to DVD+R/RW and DVD+R DL.  Standard capacities are not available for DVD-R/RW or DVD-R DL.  bd-r is single-layer Blu-ray.

The -p option will set the permissions on directories and executable files to rwxr-xr-x and on non-executable files to rw-r--r--.  (Although the archive will be read-only, making files unwritable by owner creates more trouble than it is worth.)  For iso and isoj this option has no effect.

The -w option is useful if you have inadequate disk space to store all of the images being produced.  TA will wait for you to burn and delete the previous image before beginning the next one.

The -l option is useful if you need a few more megabytes to fit a few large files onto a DVD.  See details below.

The -nornd option will speed up the creation of encrypted volumes for le2a and le2f file systems at the cost of not obfuscating the location of encrypted data on a less-than-full volume.

The translation of src paths into paths within the image is done more or less the way that tar does it:  /mumble/foo (absolute) and mumble/foo (relative) both translate to mumble/foo in the image.  However, references to "." are removed from the final file names, and references to ".." are not allowed.
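
The rules above can be modeled as follows.  This is an illustrative sketch only; the function name image_path is made up and TA's actual implementation may differ in details.

```shell
# image_path SRC
# Print the path that SRC would have inside the image under the rules
# described above; fail if SRC contains a ".." component.
# The body runs in a subshell so the IFS and globbing changes do not leak.
image_path() (
    IFS=/            # split the argument on slashes
    set -f           # no globbing while splitting
    out=
    for part in $1; do
        case $part in
            ''|.) ;;                             # leading or doubled slash, or "."
            ..)   echo "'..' is not allowed: $1" >&2; exit 1 ;;
            *)    out=${out:+$out/}$part ;;      # append component
        esac
    done
    printf '%s\n' "$out"
)
```

With this model, image_path /mumble/foo, image_path mumble/foo and image_path ./mumble/./foo all print mumble/foo, while image_path ../foo fails.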

TA leaves the disc images in the directory that you specify as imagedir.  Disc images are named image001.iso, image002.iso, and so forth.  Even non-iso images are called .iso because anything else can confuse disc-burning applications.

If you are building DVD images, imagedir must be on a file system that can support files larger than 4 GiB (i.e., not vfat).

It is not a good idea to do other work on the side while TA is archiving.  If you modify a file in TA's list, the hash will be wrong and the file will not validate.  If you delete a file in TA's list, TA will fail.

Burning to disc

You can use whatever disc burning application you like to burn the images to disc.  Following are sample commands that seem to work under Linux.  Your mileage may vary.

Target media   Burning command
CD-R           cdrecord -v dev=/dev/cdrom -dao image001.iso
CD-RW          cdrecord -v blank=fast dev=/dev/cdrom -dao image001.iso
DVD+R/RW/DL    growisofs -dvd-compat -Z /dev/dvd=image001.iso
BD-R           growisofs -Z /dev/dvd=image001.iso

(N.B., the sao and dao options to cdrecord are completely equivalent.)
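
When an archive spans several discs it can help to print the command for each image before burning anything.  The sketch below only echoes the sample commands from this section; it does not run them.  The function name burn_commands and the media labels cd-r, cd-rw, dvd and bd-r are made up for illustration, and the three-digit image numbering is assumed from the names image001.iso, image002.iso.

```shell
# burn_commands IMAGEDIR MEDIA
# Print, without running, one burning command per TA image in IMAGEDIR.
# MEDIA is one of cd-r, cd-rw, dvd, bd-r (labels used only in this sketch).
burn_commands() {
    for img in "$1"/image[0-9][0-9][0-9].iso; do
        [ -e "$img" ] || continue      # the glob matched nothing
        case $2 in
            cd-r)  echo "cdrecord -v dev=/dev/cdrom -dao $img" ;;
            cd-rw) echo "cdrecord -v blank=fast dev=/dev/cdrom -dao $img" ;;
            dvd)   echo "growisofs -dvd-compat -Z /dev/dvd=$img" ;;
            bd-r)  echo "growisofs -Z /dev/dvd=$img" ;;
            *)     echo "unknown media: $2" >&2; return 1 ;;
        esac
    done
}
```

The images are listed in name order, so the output doubles as a checklist for labeling the discs.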

Mounting archives

In most cases, a GNOME- or KDE-based desktop should figure out how to mount a disc automatically.  However, when the iso file system is used, discs must be mounted with the map=o option to avoid case-smashing file names:

iso:          mount -t iso9660 -o ro,map=o /dev/cdrom /mnt
isoj, isorr:  mount -t iso9660 -o ro /dev/cdrom /mnt
ext2:         mount -t ext2 -o ro /dev/cdrom /mnt
udf:          mount -t udf -o ro /dev/cdrom /mnt
squashfs:     mount -t squashfs -o ro /dev/cdrom /mnt
le2a, le2f:   MAPNAME=`date +%N` # Pick a unique map name
              cryptsetup -r luksOpen /dev/cdrom $MAPNAME
              mount -t ext2 -o ro /dev/mapper/$MAPNAME /mnt

Unmounting, all file systems:  umount /mnt
le2a, le2f only:               cryptsetup luksClose $MAPNAME
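
If you can never remember which options go with which identifier, the list above is easy to wrap in a small lookup.  The sketch below only prints the commands and does not mount anything; the function name mount_commands is made up and is not part of TA.

```shell
# mount_commands FS DEVICE MOUNTPOINT
# Print (not run) the mount commands given above for TA file system FS.
mount_commands() {
    case $1 in
        iso)        echo "mount -t iso9660 -o ro,map=o $2 $3" ;;
        isoj|isorr) echo "mount -t iso9660 -o ro $2 $3" ;;
        ext2|udf|squashfs)
                    echo "mount -t $1 -o ro $2 $3" ;;
        le2a|le2f)
            # Same steps as above; $(...) is equivalent to the backticks.
            echo 'MAPNAME=$(date +%N)'
            echo "cryptsetup -r luksOpen $2 \$MAPNAME"
            echo "mount -t ext2 -o ro /dev/mapper/\$MAPNAME $3" ;;
        *)  echo "unknown file system: $1" >&2; return 1 ;;
    esac
}
```

For example, mount_commands iso /dev/cdrom /mnt prints the iso9660 command with map=o, which is the one that is easiest to forget.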

Validating archives

To validate a disc, mount it on some directory (/mnt in this example) and run taval /mnt.  Taval will check the hashes on all regular files.

For a second opinion on the validity of a given file, you can manually compare the contents of /ta-hashes.txt with the output of gpg --print-md sha512.

Taval only checks the contents of files that are listed in /ta-hashes.txt.  It does not ensure that the dates, permissions, or other metadata were correctly preserved, nor does it notice if other files were added.

Using MD5 hashes

As a convenience, taval can also validate an archive against a file of MD5 hashes that was produced by some other program, e.g., md5sum.  To validate an archive against MD5 hashes instead of ta-hashes.txt, use the -md5 switch of taval:  taval -md5 md5file /mnt.

Each line of the MD5 hashfile must be 32 hexadecimal digits (the MD5 hash), two spaces, and a filename:

d41d8cd98f00b204e9800998ecf8427e  null
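
A hash file from an unfamiliar source can be checked against this layout before it is handed to taval.  The following is a minimal sketch; the function name md5file_ok is made up, and the restriction to lowercase digits is an assumption based on what md5sum writes, not a documented requirement of taval.

```shell
# md5file_ok FILE
# Succeed if every line of FILE is 32 lowercase hexadecimal digits,
# two spaces, and a non-empty file name.
md5file_ok() {
    # grep -v selects the lines that do NOT match the layout;
    # finding any such line means the file is malformed.
    if grep -Evq '^[0-9a-f]{32}  .+$' "$1"; then
        return 1
    fi
    return 0
}
```

For example, md5file_ok md5file && taval -md5 md5file /mnt runs the validation only if the hash file looks well-formed.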

Restoring archives

Since the archives are completely transparent, you can go directly to the disc(s) and directories that you want if you are in a hurry to retrieve something specific.

mount -o ro /dev/cdrom /mnt
cd /mnt
ls

Otherwise, repeat for each disc:

GLOBIGNORE=".:.." # Needed for * to match hidden files
mount -o ro /dev/cdrom /mnt
cp -a /mnt/* /
umount /mnt

If not running as root, you might have to change some permissions in order to get all of the files to copy in.  When done, delete the extraneous file /ta-hashes.txt.

Adding hashes to other things

Sometimes it is handy to generate hashes without getting involved in making disc images.  You can do this with tahash dir, and the directory's contents can subsequently be validated with taval dir.

The -l switch

The -l switch causes TA to create ext2/le2a/le2f file systems with options tweaked to maximize usable space.  On a single-sided DVD+R, this saves about 78.8 MiB and reduces the overhead of an empty file system to a mere 460 KiB.  However, it limits the number of files that can be placed on a single disc to around 566 (576 inodes).

The command used is mke2fs -m 0 -N 1 -O none,sparse_super2,filetype -I 256 ….  As usual, no space is reserved for super-user, and the lost+found directory is removed.

Windows portability problems

The behaviors of non-Linux operating systems are no longer tested.  The following issues that were found with Windows XP may or may not persist in more recent versions of Windows.

Charset mismatch:  Windows XP interprets agnostic characters according to its own default code page, which unfortunately is usually cp437.  In theory, you should be able to say chcp 1252 and then access a disc encoded as ISO 8859-1 with no trouble.  In practice, that doesn't work.  The problem is avoided by using Joliet or UDF, which specify unambiguous encodings for file names.

Lower file size limit:  For ISO 9660 discs, the file size limit under Windows XP is (2^32 − 2048) B = 4294965248 B.  Files larger than this produce an "Input/Output error" on attempt to open.
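
The limit is 2^32 bytes less one 2048-byte block.  If a disc has to be readable under Windows XP, the source tree can be checked for oversized files before building.  The following is a sketch using GNU find; the names xp_iso_limit and files_over are made up for illustration.

```shell
# Largest file size, in bytes, that Windows XP could open on ISO 9660.
xp_iso_limit=$(( (1 << 32) - 2048 ))    # 4294965248

# files_over LIMIT DIR
# Print regular files under DIR that are larger than LIMIT bytes.
files_over() {
    # With the "c" suffix, -size +N means "more than N bytes".
    find "$2" -type f -size +"$1"c -print
}
```

For example, files_over "$xp_iso_limit" src lists the files under src that XP would refuse to open.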

Dates messed up:  Dates in the archive are often wrong by an hour (apparently Daylight Saving Time run amok), and years before 1980 are not supported.

Forbidden characters:  Windows XP has different rules for what characters are legal in file names.  Files whose names contain an asterisk, backslash, or question mark appear to be inaccessible under Windows XP.  Files whose names contain a colon or semicolon are accessible from a Cygwin command line, but they cannot be opened in Windows Explorer.  If Joliet is used, forbidden characters are suppressed; consequently, the files are accessible under Windows but they do not validate because their names were changed.

Unreliable:  Windows XP sometimes has problems reading UDF discs.

Troubleshooting

Permission denied

Even when -p is used, TA can fail with "permission denied" if a source directory is not writable by owner.  This problem is a consequence of how GNU cp propagates permissions and is not efficiently fixable in TA.  Running TA as root avoids the problem.

Validation errors—files have incorrect names on disc

Cause #1:  Wrong mount options for iso format.  To prevent file names from being case-smashed, you must mount with the option map=o; e.g., mount -t iso9660 -o ro,map=o /dev/cdrom /mnt.  This does not apply to isoj or isorr.

Cause #2:  If your working directory is on a vfat partition, using the wrong mount options will result in a corrupt disc.  To prevent short file names from being case-smashed, you must mount the vfat partition with the option shortname=winnt.

Cause #3:  If you are working in a UTF-8 locale and your file names have UTF-8 encoding errors (broken characters), these may get changed to '?' or '_' when making udf or isoj images.  Any affected files will fail validation.  If you wish to record the broken filenames as-is, run TA in an ISO 8859-1 locale, where every string is valid (even if it is not readable).

Cause #4:  The following characters, which are allowed in Linux file names, are lost in translation to Joliet:  *:;?\  Any affected files will fail validation.

Cause #5:  If file name lengths exceed the limits shown above, the names will be truncated, assuming that the files get written at all.
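
Over-long names can be found before building.  The sketch below counts bytes, which matches how the limits are stated for the 8-bit agnostic file systems and udf; it is not suitable for isoj, whose limit is in UTF-16 units.  The function name names_longer_than is made up, and file names containing newlines are not handled.

```shell
# names_longer_than LIMIT DIR
# Print paths under DIR whose final component is longer than LIMIT bytes.
names_longer_than() {
    # LC_ALL=C makes awk's length() count bytes rather than characters.
    find "$2" -print | LC_ALL=C awk -F/ -v max="$1" 'length($NF) > max'
}
```

For example, names_longer_than 207 src lists the names under src that exceed the iso limit shown above.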

Files all validate, but file names appear wrong on the screen

Your terminal or desktop is using a different codeset than is assumed for the file names being listed.  For isoj and udf, you can enable translation to your current codeset using the iocharset or utf8 mount options; e.g., to read a UDF disc in a Latin-1 locale, mount it with -o ro,iocharset=iso8859-1.  Otherwise, you have to either change the locale of your terminal or desktop to agree with the data or run the file names through iconv to make them readable.

/dev/mapper/TA contains `PGP Secret Sub-key -' data

mke2fs 1.46.5 (30-Dec-2021)
/dev/mapper/TA contains `PGP Secret Sub-key -' data
Proceed anyway? (y,N)

This is a known problem with mke2fs that will not be fixed.  Encrypted data or random filler can be misinterpreted by libmagic, causing a safety check in mke2fs to issue that spurious warning.  Just say y and continue.

Endless head cycling of optical drive when reading large files

The kernel sometimes thrashes the head of an optical drive when reading large files from burned images.  It might have something to do with the format differences made by the -l switch.

The problem can be mitigated by increasing the read_ahead_kb parameter for the optical drive (e.g., /dev/sr0) and, if applicable, the block device used by cryptsetup (e.g., /dev/dm-2), after the disc is mounted:

# echo 102400 > /sys/block/sr0/queue/read_ahead_kb
# echo 102400 > /sys/block/dm-2/queue/read_ahead_kb

The default of read_ahead_kb is 128 (128 KiB).  Try a larger value, such as 102400 (100 MiB).
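
The sysfs path is built from the device's kernel name, so it can be derived from a /dev path.  The following is a small sketch; the names read_ahead_path and read_ahead_kib are made up for illustration.

```shell
# read_ahead_path DEVICE
# Print the sysfs file that holds read_ahead_kb for a block device
# given either as /dev/NAME or as NAME.
read_ahead_path() {
    echo "/sys/block/${1##*/}/queue/read_ahead_kb"
}

# 100 MiB expressed in KiB, the unit that read_ahead_kb uses.
read_ahead_kib=$((100 * 1024))          # 102400
```

As root, echo "$read_ahead_kib" > "$(read_ahead_path /dev/sr0)" then applies the setting.  For an encrypted disc, on systems where /dev/mapper entries are symbolic links, readlink -f /dev/mapper/$MAPNAME should show which dm-N device to use.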

Accessing data from obsoleted formats

How to mount an ext2aes disc

Versions of TA prior to version 1.4 supported a file system called ext2aes, which was ext2 plus 256-bit AES encryption via the cryptoloop module of the Linux kernel.  ext2aes has no LUKS header to tell you what it is; you just have to know.

With old versions of the kernel and the mount command, you could mount an ext2aes disc with mount -t ext2 -o ro,encryption=aes /dev/dvd /mnt.  That no longer works since the cryptoloop kernel module has been completely deleted.  An ext2aes disc can still be mounted as follows:

MAPNAME=`date +%N` # Pick a unique map name
cryptsetup -r -c aes-cbc-plain -s 256 -h plain create $MAPNAME /dev/dvd
mount -t ext2 -o ro /dev/mapper/$MAPNAME /mnt

To unmount:

umount /mnt
cryptsetup remove $MAPNAME

How to mount a ziso disc

Versions of TA prior to version 1.5 supported a file system called ziso, which was isorr plus a Linux-specific transparent decompression extension.  In version 1.5, ziso was replaced by squashfs.

ziso archives can be mounted and unmounted using the same commands as isorr.  However, to read ziso archives with transparent decompression, you must have a Linux kernel that was compiled with support for transparent decompression.  As of kernel 6.11.3, the relevant option appears in make menuconfig as File systems → CD-ROM/DVD Filesystems → ISO 9660 CDROM file system support → Transparent decompression extension.

If kernel support is lacking, the content can be non-transparently decompressed using the mkzftree program included in the zisofs-tools package (with the -u option for uncompress).

Support

Any questions, problems, or bug reports for TA should be directed to dave@flaterco.com.

Changelog

To do if there is a future version:

Version 1.9 r2, 2024-12-04:  Updated docs to add nornd to the to-do list and add the head cycling workaround to the troubleshooting section.  Repacked the 1.9 distribution with updated ta.html but no software changes.

Version 1.9, 2024-10-20:

Version 1.8, 2013-02-07:

Version 1.7.2, 2011-09-03:

Version 1.7.1, 2011-04-10:

Version 1.7, 2011-03-04:

Version 1.6.1, 2010-12-24:

Version 1.6, 2010-12-15:

Version 1.5.1, 2010-03-29:

Version 1.5, 2010-03-03:

Version 1.4, 2008-08-11:

Version 1.3.1, 2008-03-06:

Version 1.3, 2008-02-29:

Version 1.2.2, 2008-01-25:

Version 1.2.1, 2006-08-25:

Version 1.2, 2006-08-23:

Documentation rev. 2006-07-23:  Noted UDF troubles with XP.

Documentation rev. 2006-07-22:  Updated troubleshooting info for disk thrashing.  Added -v to CD burning command.

Version 1.1.2, 2006-07-04:

Version 1.1.1, 2006-07-04:

Version 1.1, 2006-07-03:

Documentation rev. 2006-05-27:  Added example command for burning DVD image.  Removed statement about Nero.

Version 1.0, 2006-01-02
