
I receive tar archives containing a lot of very small JSON files. Each day I get a new tar archive. Now I want to combine the daily tar archives into a yearly tar archive and compress it. I do that with the following bash script:

tar -cf "/mnt/archive/archive - 2020.tar" --files-from /dev/null
for f in /mnt/data/logs/2020/logs-main-2020-??-??.tar
do
    tar -n --concatenate --file="/mnt/archive/archive - 2020.tar" "$f"
done

pxz -T6 -c "/mnt/archive/archive - 2020.tar" > "/mnt/archive/archive - 2020.tar.xz"
rm "/mnt/archive/archive - 2020.tar"

This works, but the concatenation of the tar files is getting slower the bigger the main tar gets.

I could use a cat command to simply concatenate all the tars, but the resulting archive then contains the end-of-archive null markers of each original tar. Thus, the resulting tar has to be opened with the -i option, which is not possible for the system that works with the resulting tar.
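For illustration, this behavior can be reproduced with two throwaway archives (the file names here are made up):

```shell
# Build two tiny archives and concatenate them naively.
echo one > a.txt; tar -cf a.tar a.txt
echo two > b.txt; tar -cf b.tar b.txt
cat a.tar b.tar > naive.tar

# A plain listing stops at the first end-of-archive marker,
# so only a.txt is shown.
tar -tf naive.tar

# Only with -i (--ignore-zeros) does tar read past the null
# blocks and list both files.
tar -itf naive.tar
```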

How can I concatenate the tar files without the slow tar concatenation and still create a valid tar without the null blocks in between? Can I do some cat, un-tar, re-tar, compress pipe?

  • I do not have any special characters such as newlines in the JSON file names in the input tars
  • I work with GNU tar v1.26 on CentOS 7
  • Each input tar is about 1GB, so keeping them in memory is no option
  • There is no need to check the output tar for duplicate entries. The way the input tars are created ensures that they do not contain duplicate JSON files

2 Answers


  1. A couple of perl-based approaches:

    First, a script using the core Archive::Tar module to read existing tar files and create a new one (due to limitations of the module, it has to hold all of the data for the combined destination tar file in memory before writing it, which might be an issue with a huge amount of data):

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use feature qw/say/;
    use Archive::Tar;
    
    # First argument is the new tar file to create, rest are ones to
    # copy files from.
    
    die "Usage: $0 DESTFILE SOURCEFILE ...\n" unless @ARGV >= 2;
    
    my $destfile = shift;
    my $dest = Archive::Tar->new;
    
    foreach my $file (@ARGV) {
      my $src = Archive::Tar->iter($file) or exit 1;
      say "Adding contents of $file";
      while (my $entry = $src->() ) {
        my $name = $entry->full_path;
        say "\t$name";
        $dest->add_data($name, $entry->get_content,
                        { mtime => $entry->mtime,
                          size => $entry->size,
                          mode => $entry->mode,
                          uid => $entry->uid,
                          gid => $entry->gid,
                          type => $entry->type,
                          devmajor => $entry->devmajor,
                          devminor => $entry->devminor,
                          linkname => $entry->linkname
                        })
          or exit 1;
      }
    }
    
    $dest->write($destfile) or exit 1;
    say "Wrote $destfile";
    

    Usage:

    perl tarcat.pl "/mnt/archive/archive - 2020.tar" /mnt/data/logs/2020/logs-main-2020-??-??.tar
    

    Or a one-liner using Archive::Tar::Merge (install it through your OS package manager if provided, or through your favorite CPAN client; I am not sure about its memory limitations):

    perl -MArchive::Tar::Merge -e '
        Archive::Tar::Merge->new(dest_tarball => $ARGV[0],
                                 source_tarballs => [ @ARGV[1..$#ARGV] ])->merge
    ' "/mnt/archive/archive - 2020.tar" /mnt/data/logs/2020/logs-main-2020-??-??.tar
    
  2. without the nulls in-between

    This is the main problem. We need to determine exactly how many zeros we need to chop off the end. And then, we can simply use cat to concatenate the remaining data.

    Unfortunately, there is no reliable way to determine where the actual TAR data ends without reading the TAR archive from the beginning. But for each file inside the TAR, it is enough to know its size so that we can simply skip over it. This speeds up processing the archive a lot! Here is some short Python code extracted from my pet project ratarmount. There are many different TAR format flavors, but this should work for most of them. To be even more generic, the base-256 size format would have to be supported, too.

    import io
    import sys
    
    with open(sys.argv[1], 'rb') as file:
        while True:
            blockContents = file.read(512)
            if len(blockContents) < 512:
                sys.exit(1)
    
            # https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html#tag_20_92_13_01
            # > At the end of the archive file there shall be two 512-byte blocks filled with binary zeros,
            # > interpreted as an end-of-archive indicator.
            if blockContents == b"\0" * 512:
                blockContents = file.read(512)
                if blockContents == b"\0" * 512:
                    print(file.tell() - 2 * 512)
                    sys.exit(0)
                sys.exit(1)
    
            rawSize = blockContents[124 : 124 + 12].strip(b"\0 ")
            # TODO This might fail for non-POSIX GNU tar base 256 encoded sizes
            #      https://www.gnu.org/software/tar/manual/html_node/Extensions.html#Extensions
            size = int(rawSize, 8) if rawSize else 0
            file.seek(size if size % 512 == 0 else size + ( 512 - size % 512 ), io.SEEK_CUR)
    

    This script prints the size of the TAR archive excluding the trailing zero-byte blocks. We can use this value to truncate the TAR.
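    As the TODO in the script notes, GNU base-256 encoded sizes are not handled. A hedged sketch of a more general size-field parser (the function name is mine, not from ratarmount) could look like this:

```python
def parse_tar_size(field: bytes) -> int:
    """Parse a 12-byte TAR size field: octal ASCII, or GNU base-256
    when the high bit of the first byte is set."""
    if field[0] & 0x80:
        # GNU base-256 extension: a big-endian binary number stored in
        # the remaining bits of the field.
        result = field[0] & 0x7F
        for byte in field[1:]:
            result = (result << 8) | byte
        return result
    stripped = field.strip(b"\0 ")
    return int(stripped, 8) if stripped else 0
```

    The line computing `size` in the loop above would then become `size = parse_tar_size(blockContents[124 : 124 + 12])`.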

    function tarcat()
    {
        local FIND_TAR_FILE_END_SCRIPT
        read -r -d '' FIND_TAR_FILE_END_SCRIPT <<'EOF'
    <COPY PASTE THE ABOVE PYTHON SCRIPT HERE!>
    EOF
    
        local realDataSize
        while [[ "$#" -gt 0 ]]; do
            if [[ "$#" -gt 1 ]]; then
                realDataSize=$( python3 -c "$FIND_TAR_FILE_END_SCRIPT" "$1" )
                if [[ $? -eq 0 ]]; then
                    head -c "$realDataSize" -- "$1"
                fi
            else
                cat -- "$1"
            fi
            shift
        done
    }
    

    This bash function can be used like this:

    for i in $( seq 3 ); do
        echo "foo$i" > "bar$i"
        tar -cf "tar$i.tar" "bar$i"
    done
    
    ls -l tar[0-9].tar
    # -rwx------ 1 user group 10240 Mar 30 00:17 tar1.tar
    # -rwx------ 1 user group 10240 Mar 30 00:17 tar2.tar
    # -rwx------ 1 user group 10240 Mar 30 00:17 tar3.tar
    tar tvlf tar3.tar
    # -rwx------ user/group   5 2022-03-30 00:16 bar3
    
    tarcat tar1.tar tar2.tar tar3.tar > concatenated-without-zeros.tar
    
    ls -l concatenated-without-zeros.tar
    # -rwx------ 1 user group 12288 Mar 30 00:18 concatenated-without-zeros.tar
    tar tvlf concatenated-without-zeros.tar
    # -rwx------ user/group   5 2022-03-30 00:16 bar1
    # -rwx------ user/group   5 2022-03-30 00:16 bar2
    # -rwx------ user/group   5 2022-03-30 00:16 bar3
    

    As can be seen, all three files in the resulting concatenated TAR are readable with tar even without specifying -i and the archive size (12 KiB) is less than the sum of the concatenated archives (30 KiB) because the trailing zero blocks were removed from the first two archives (not from the last because they act as an EOF indicator).

    Be aware that this code has not been extensively tested yet. With a bit more work, you could also turn tarcat into a Python-only script.
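    For reference, a Python-only tarcat along those lines might look like the following (a lightly tested sketch; the helper names are mine):

```python
#!/usr/bin/env python3
# tarcat.py - concatenate TAR archives by copying every input verbatim,
# but stripping the trailing zero blocks of all inputs except the last.
import io
import shutil
import sys

def tar_data_size(path):
    """Return the archive size without the trailing zero-byte blocks,
    found by skipping from header to header as in the script above."""
    with open(path, 'rb') as f:
        while True:
            block = f.read(512)
            if len(block) < 512:
                raise ValueError(f"{path}: unexpected end of file")
            if block == b"\0" * 512:
                if f.read(512) == b"\0" * 512:
                    return f.tell() - 2 * 512
                raise ValueError(f"{path}: lone zero block")
            raw = block[124:124 + 12].strip(b"\0 ")
            size = int(raw, 8) if raw else 0
            # Skip the file data, rounded up to whole 512-byte blocks.
            f.seek((size + 511) // 512 * 512, io.SEEK_CUR)

def copy_bytes(src, dst, count):
    """Copy exactly count bytes from src to dst in chunks."""
    while count > 0:
        chunk = src.read(min(count, 1 << 20))
        if not chunk:
            break
        dst.write(chunk)
        count -= len(chunk)

def tarcat(paths, out):
    for i, path in enumerate(paths):
        with open(path, 'rb') as f:
            if i == len(paths) - 1:
                # Last input: copy verbatim, keeping its EOF marker.
                shutil.copyfileobj(f, out)
            else:
                copy_bytes(f, out, tar_data_size(path))

if __name__ == '__main__':
    tarcat(sys.argv[1:], sys.stdout.buffer)
```

    Usage would mirror the bash function: `python3 tarcat.py tar1.tar tar2.tar tar3.tar > concatenated-without-zeros.tar`.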
