I have tar archives containing many very small JSON files. Each day I receive a new tar archive. Now I want to combine the daily tar archives into one yearly tar archive and compress it. I do that with the following bash script:
tar -cf "/mnt/archive/archive - 2020.tar" --files-from /dev/null
for f in /mnt/data/logs/2020/logs-main-2020-??-??.tar
do
tar -n --concatenate --file="/mnt/archive/archive - 2020.tar" "$f"
done
pxz -T6 -c "/mnt/archive/archive - 2020.tar" > "/mnt/archive/archive - 2020.tar.xz"
rm "/mnt/archive/archive - 2020.tar"
This works, but concatenating the tar files gets slower the bigger the main tar becomes.
I could use cat to simply add all tars together, but the resulting archive would then contain all the end-of-archive null markers of the original tars. Thus, the resulting tar has to be opened with the -i option, which is not an option for the system working with the resulting tar.
How can I concatenate the tar files without slow tar concatenation and still create a valid tar without the null markers in between? Can I do some cat, un-tar, re-tar, compress pipe?
- I do not have any special characters such as newlines in the JSON file names inside the input tars
- I work with GNU tar v1.26 on CentOS 7
- Each input tar is about 1GB, so keeping them in memory is no option
- There is no need to check the output tar for duplicate entries. The way the input tars are created ensures that they do not contain duplicate JSON files
2 Answers
A couple of Perl-based approaches:
First, a script using the core Archive::Tar module to read the existing tar files and create a new one. Due to limitations of the module, it has to hold the data of the combined destination tar file in memory all at once before writing it, which might be an issue with a huge amount of data.
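The Perl script itself is not reproduced in this copy of the answer. As a rough, hypothetical sketch of the same read-and-rewrite approach, here is a Python equivalent using the standard tarfile module (function and file names are illustrative, not from the original answer):

```python
import sys
import tarfile

def merge_tars(output_path, input_paths):
    """Read every member of the input tars and re-add it to a fresh
    output tar, so no end-of-archive null blocks end up in the middle."""
    with tarfile.open(output_path, "w") as out:
        for path in input_paths:
            with tarfile.open(path, "r") as src:
                for member in src:
                    if member.isfile():
                        # Stream the member's contents into the output tar.
                        out.addfile(member, src.extractfile(member))
                    else:
                        # Directories, symlinks, etc. carry no data blocks.
                        out.addfile(member)

if __name__ == "__main__":
    merge_tars(sys.argv[1], sys.argv[2:])
```

Note that, unlike Archive::Tar, this streams one member at a time, so it does not need to hold the whole combined archive in memory.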
Or a one-liner using the Archive::Tar::Merge module (install it through your OS package manager if provided, or through your favorite CPAN client; I am not sure about its memory limitations):
This is the main problem: we need to determine exactly how many zeros to chop off the end. Then we can simply use cat to concatenate the remaining data. Unfortunately, there is no sure way to determine the actual end of the TAR data without reading the TAR archive from the beginning. But for each file inside the TAR it is enough to know its size, so that we can simply skip over it. This speeds up processing the archive a lot! This is some short Python code I extracted from my pet project ratarmount. There are many different TAR format flavors, but this should work for most of them. To be even more generic, the base-256 size format would have to be supported, too.