
I receive tar archives containing a lot of very small JSON files. Each day I get a new tar archive. Now I want to combine the daily tar archives into a yearly tar archive and compress it. I do that with the following bash script:

tar -cf "/mnt/archive/archive - 2020.tar" --files-from /dev/null
for f in /mnt/data/logs/2020/logs-main-2020-??-??.tar
do
    tar -n --concatenate --file="/mnt/archive/archive - 2020.tar" "$f"
done

pxz -T6 -c "/mnt/archive/archive - 2020.tar" > "/mnt/archive/archive - 2020.tar.xz"
rm "/mnt/archive/archive - 2020.tar"

This works, but the concatenation of the tar files is getting slower the bigger the main tar gets.

I could use a cat command to simply concatenate all the tars, but the resulting archive then contains the end-of-archive null markers of each original tar. Thus, the resulting tar has to be opened with the -i option, which is not possible for the system that works with the resulting tar.
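For illustration, this behavior can be reproduced with two throwaway archives (the file names here are made up):

```shell
# Build two tiny archives and concatenate them naively.
echo one > a.txt; tar -cf a.tar a.txt
echo two > b.txt; tar -cf b.tar b.txt
cat a.tar b.tar > naive.tar

# A plain listing stops at the first end-of-archive marker,
# so only a.txt is shown.
tar -tf naive.tar

# Only with -i (--ignore-zeros) does tar read past the null
# blocks and list both files.
tar -itf naive.tar
```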

How can I concatenate the tar files without the slow tar concatenation and still create a valid tar without the null blocks in between? Can I do some cat, un-tar, re-tar, compress pipe?

  • I do not have any special characters such as newlines in the JSON file names in the input tars
  • I work with GNU tar v1.26 on CentOS 7
  • Each input tar is about 1GB, so keeping them in memory is no option
  • There is no need to check the output tar for duplicate entries. The way the input tars are created ensures that they do not contain duplicate JSON files

2 Answers


  1. A couple of perl-based approaches:

    First, a script using the core Archive::Tar module to read existing tar files and create a new one (due to limitations of the module, it has to hold all of the data for the combined destination tar file in memory before writing it, which might be an issue with a huge amount of data):

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use feature qw/say/;
    use Archive::Tar;
    
    # First argument is the new tar file to create, rest are ones to
    # copy files from.
    
    die "Usage: $0 DESTFILE SOURCEFILE ...\n" unless @ARGV >= 2;
    
    my $destfile = shift;
    my $dest = Archive::Tar->new;
    
    foreach my $file (@ARGV) {
      my $src = Archive::Tar->iter($file) or exit 1;
      say "Adding contents of $file";
      while (my $entry = $src->() ) {
        my $name = $entry->full_path;
        say "\t$name";
        $dest->add_data($name, $entry->get_content,
                        { mtime => $entry->mtime,
                          size => $entry->size,
                          mode => $entry->mode,
                          uid => $entry->uid,
                          gid => $entry->gid,
                          type => $entry->type,
                          devmajor => $entry->devmajor,
                          devminor => $entry->devminor,
                          linkname => $entry->linkname
                        })
          or exit 1;
      }
    }
    
    $dest->write($destfile) or exit 1;
    say "Wrote $destfile";
    

    Usage:

    perl tarcat.pl "/mnt/archive/archive - 2020.tar" /mnt/data/logs/2020/logs-main-2020-??-??.tar
    

    Or a one-liner using Archive::Tar::Merge (install it through your OS package manager if provided, or through your favorite CPAN client; I am not sure about its memory limitations):

    perl -MArchive::Tar::Merge -e '
        Archive::Tar::Merge->new(dest_tarball => $ARGV[0],
                                 source_tarballs => [ @ARGV[1..$#ARGV] ])->merge
    ' "/mnt/archive/archive - 2020.tar" /mnt/data/logs/2020/logs-main-2020-??-??.tar
    
  2. without the nulls in-between

    This is the main problem. We need to determine exactly how many zeros we need to chop off the end. And then, we can simply use cat to concatenate the remaining data.

    Unfortunately, there is no reliable way to determine where the actual TAR data ends without reading the TAR archive from the beginning. But for each file inside the TAR, it is enough to know its size so that we can simply skip over it. This speeds up processing the archive a lot! Here is some short Python code extracted from my pet project ratarmount. There are many different TAR format flavors, but this should work for most of them. To be even more generic, the base-256 size format would have to be supported, too.

    import io
    import sys
    
    with open(sys.argv[1], 'rb') as file:
        while True:
            blockContents = file.read(512)
            if len(blockContents) < 512:
                sys.exit(1)
    
            # https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html#tag_20_92_13_01
            # > At the end of the archive file there shall be two 512-byte blocks filled with binary zeros,
            # > interpreted as an end-of-archive indicator.
            if blockContents == b"\0" * 512:
                blockContents = file.read(512)
                if blockContents == b"\0" * 512:
                    print(file.tell() - 2 * 512)
                    sys.exit(0)
                sys.exit(1)
    
            rawSize = blockContents[124 : 124 + 12].strip(b"\0 ")
            # TODO This might fail for non-POSIX GNU tar base 256 encoded sizes
            #      https://www.gnu.org/software/tar/manual/html_node/Extensions.html#Extensions
            size = int(rawSize, 8) if rawSize else 0
            file.seek(size if size % 512 == 0 else size + ( 512 - size % 512 ), io.SEEK_CUR)
    

    This script prints the size of the TAR archive excluding the trailing zero-byte blocks. We can use this value to truncate the TAR.
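    As the TODO in the script notes, GNU base-256 encoded sizes are not handled. A hedged sketch of a more general size-field parser (the function name is mine, not from ratarmount) could look like this:

```python
def parse_tar_size(field: bytes) -> int:
    """Parse a 12-byte TAR size field: octal ASCII, or GNU base-256
    when the high bit of the first byte is set."""
    if field[0] & 0x80:
        # GNU base-256 extension: a big-endian binary number stored in
        # the remaining bits of the field.
        result = field[0] & 0x7F
        for byte in field[1:]:
            result = (result << 8) | byte
        return result
    stripped = field.strip(b"\0 ")
    return int(stripped, 8) if stripped else 0
```

    The line computing `size` in the loop above would then become `size = parse_tar_size(blockContents[124 : 124 + 12])`.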

    function tarcat()
    {
        local FIND_TAR_FILE_END_SCRIPT
        read -r -d '' FIND_TAR_FILE_END_SCRIPT <<'EOF'
    <COPY PASTE THE ABOVE PYTHON SCRIPT HERE!>
    EOF
    
        local realDataSize
        while [[ "$#" -gt 0 ]]; do
            if [[ "$#" -gt 1 ]]; then
                realDataSize=$( python3 -c "$FIND_TAR_FILE_END_SCRIPT" "$1" )
                if [[ $? -eq 0 ]]; then
                    head -c "$realDataSize" -- "$1"
                fi
            else
                cat -- "$1"
            fi
            shift
        done
    }
    

    This bash function can be used like this:

    for i in $( seq 3 ); do
        echo "foo$i" > "bar$i"
        tar -cf "tar$i.tar" "bar$i"
    done
    
    ls -l tar[0-9].tar
    # -rwx------ 1 user group 10240 Mar 30 00:17 tar1.tar
    # -rwx------ 1 user group 10240 Mar 30 00:17 tar2.tar
    # -rwx------ 1 user group 10240 Mar 30 00:17 tar3.tar
    tar tvlf tar3.tar
    # -rwx------ user/group   5 2022-03-30 00:16 bar3
    
    tarcat tar1.tar tar2.tar tar3.tar > concatenated-without-zeros.tar
    
    ls -l concatenated-without-zeros.tar
    # -rwx------ 1 user group 12288 Mar 30 00:18 concatenated-without-zeros.tar
    tar tvlf concatenated-without-zeros.tar
    # -rwx------ user/group   5 2022-03-30 00:16 bar1
    # -rwx------ user/group   5 2022-03-30 00:16 bar2
    # -rwx------ user/group   5 2022-03-30 00:16 bar3
    

    As can be seen, all three files in the resulting concatenated TAR are readable with tar even without specifying -i and the archive size (12 KiB) is less than the sum of the concatenated archives (30 KiB) because the trailing zero blocks were removed from the first two archives (not from the last because they act as an EOF indicator).

    Be aware that this code has not been extensively tested yet. With a bit more work, you could also turn tarcat into a Python-only script.
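    For reference, a Python-only tarcat along those lines might look like the following (a lightly tested sketch; the helper names are mine):

```python
#!/usr/bin/env python3
# tarcat.py - concatenate TAR archives by copying every input verbatim,
# but stripping the trailing zero blocks of all inputs except the last.
import io
import shutil
import sys

def tar_data_size(path):
    """Return the archive size without the trailing zero-byte blocks,
    found by skipping from header to header as in the script above."""
    with open(path, 'rb') as f:
        while True:
            block = f.read(512)
            if len(block) < 512:
                raise ValueError(f"{path}: unexpected end of file")
            if block == b"\0" * 512:
                if f.read(512) == b"\0" * 512:
                    return f.tell() - 2 * 512
                raise ValueError(f"{path}: lone zero block")
            raw = block[124:124 + 12].strip(b"\0 ")
            size = int(raw, 8) if raw else 0
            # Skip the file data, rounded up to whole 512-byte blocks.
            f.seek((size + 511) // 512 * 512, io.SEEK_CUR)

def copy_bytes(src, dst, count):
    """Copy exactly count bytes from src to dst in chunks."""
    while count > 0:
        chunk = src.read(min(count, 1 << 20))
        if not chunk:
            break
        dst.write(chunk)
        count -= len(chunk)

def tarcat(paths, out):
    for i, path in enumerate(paths):
        with open(path, 'rb') as f:
            if i == len(paths) - 1:
                # Last input: copy verbatim, keeping its EOF marker.
                shutil.copyfileobj(f, out)
            else:
                copy_bytes(f, out, tar_data_size(path))

if __name__ == '__main__':
    tarcat(sys.argv[1:], sys.stdout.buffer)
```

    Usage would mirror the bash function: `python3 tarcat.py tar1.tar tar2.tar tar3.tar > concatenated-without-zeros.tar`.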
