I have a number of directories, each containing many files plus one metadata file with important data such as author and title.

/data/unorganised_texts/a-long-story

Each directory holds many files, but most importantly it includes a Data.yaml with contents like this:

Category:
  Name: Space
Author: Jôëlle Frankschiff
References:
  Title: Historical
  Title: Future
Title: A “long” story!

I need to match these lines as variables $category, $author, $title and make an appropriate structure and copy the directory like so:

/data/organised_texts/$category/$author/$title

Here is my attempt in bash, but probably going wrong in multiple places and as suggested would be better in python.

#!/bin/bash
for dir in /data/unorganised_texts/*/
while IFS= read -r line || [[ $category ]]; do
    [[ $category =~ “Category:” ]] && echo "$category" && mkdir /data/organised_texts/$category
[[ $author ]]; do
    [[ $author =~ “Author:” ]] && echo "$Author"
    [[ $title ]]; do
        [[ $title =~ “Title:” ]] && echo "$title" && mkdir /data/organised_texts/$category/$title && cp $dir/* /data/organised_texts/$category/$title/
done <"$dir/Data.yaml"

Here is my bash version, as I was experimenting with readarray and command eval and bash version was important:

ubuntu:~# bash --version
GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)

Thanks!

3 Answers


  1. One bash idea:

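    unset cat auth title section

    while IFS= read -r line
    do
        read -r label value <<< "${line}"                # trims the indent, splits on the first space

        if [[ "${line}" == [[:space:]]* ]]               # indented line: child of the last top-level label
        then
            [[ "${section}" == "Category:" && "${label}" == "Name:" ]] && cat="${value}"
            continue
        fi

        section="${label}"                               # remember the current top-level label
        case "${label}" in
            "Category:")  [[ -n "${value}" ]] && cat="${value}" ;;   # flat "Category: Space" form
            "Author:")    auth="${value}" ;;
            "Title:")     title="${value}" ;;
        esac
    done < Data.yaml

    if [[ -n "${cat}" && -n "${auth}" && -n "${title}" ]]
    then
        mkdir -p "${cat}/${auth}/${title}"
        # cp ...                                # OP can add the desired `cp` command at this point
    fi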
    

    NOTE: assumes none of the values include linefeeds

    Results:

    $ find . -type d
    .
    ./Space
    ./Space/Jôëlle Frankschiff
    ./Space/Jôëlle Frankschiff/A “long” story!
    
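  2. A few notes on the script in the question:

    • It looks like you have unmatched do-done pairs.
    • [[ $varname ]] is a valid test on its own, but following it with "; do" without a for, while or until in front causes a syntax error.
    • mkdir -p creates any missing parent directories in one call.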

    Then would you please try the following:

    #!/bin/bash
    
    shopt -s dotglob                                                # copy dotfiles in the directories as well
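    for dir in /data/unorganised_texts/*/; do
        [[ -f ${dir}Data.yaml ]] || continue                        # skip directories without Data.yaml
        category="" author="" title="" section=""                   # reset for each directory
        while IFS= read -r line; do                                 # read a line of yaml file in "$dir"
            read -r key val <<< "$line"                             # trim the indent, split on the 1st space into key and val
            val=${val//\//_}                                        # replace slash with underscore, just in case
            if [[ $line =~ ^[[:space:]] ]]; then                    # indented line: child of the last top-level key
                [[ $section = "Category:" && $key = "Name:" ]] && category="$val"
                continue
            fi
            section="$key"                                          # remember the current top-level key
            if [[ $key = "Category:" && -n $val ]]; then category="$val"   # also accept a flat "Category: Space"
            elif [[ $key = "Author:" ]]; then author="$val"
            elif [[ $key = "Title:" ]]; then title="$val"
            fi
        done < "${dir}Data.yaml"

        if [[ -z $category || -z $author || -z $title ]]; then      # incomplete metadata
            echo "$dir: Category, Author or Title is missing. skipped."
            continue
        fi
        destdir="/data/organised_texts/$category/$author/$title"    # destination directory
        if [[ -d $destdir ]]; then                                  # check the duplication
            echo "$destdir already exists. skipped."
        else
            mkdir -p "$destdir"                                     # create the destination directory
            cp -a -- "$dir"/* "$destdir"                            # copy the contents to the destination
    #       echo "$destdir"                                         # remove "#" to see the progress
        fi
    done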
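
Since the question also asks about Python, the same line-based idea can be sketched there without third-party packages. This is a minimal sketch and not a YAML parser: it assumes top-level keys start in column 0, that the category sits in an indented Name: line under Category:, and that no value contains a line break. The function name read_metadata is hypothetical.

```python
from pathlib import Path


def read_metadata(yaml_path: Path) -> dict:
    """Pick Category/Name, Author and Title out of a Data.yaml-style file."""
    found = {}
    section = ""  # most recent top-level key, e.g. "Category"
    for line in yaml_path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # blank line or something that is not "key: value"
        value = value.strip().replace("/", "_")  # keep slashes out of path parts
        if line[:1].isspace():
            # indented line: only "Name" under "Category" is of interest
            if section == "Category" and key == "Name":
                found["category"] = value
            continue
        section = key
        if key == "Author":
            found["author"] = value
        elif key == "Title":
            found["title"] = value
    return found
```

Indented lines under References: are ignored, so only the top-level Title: ends up in the result.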
    
  3. Since the OP was interested in a Python solution…

    First let's make some test dirs:

    pushd /tmp
    mkdir t
    pushd t
    mkdir a-long-story
    vim a-long-story/Data.yaml # fill in here, or cp.
    mkdir irrelevant_dir
    mkdir -p irrelevant_dir/subdir
    touch notadir
    

    Then a simple Python script. Python doesn’t (yet) have a built-in YAML parser, so pip install pyyaml is needed before this:

    from pathlib import Path
    from shutil import copytree
    
    from yaml import Loader, load  # pip install pyyaml
    
    
    def parse_yaml(path: Path) -> dict:
        # Data.yaml contains non-ASCII text, so read it as UTF-8 explicitly
        with path.open(encoding="utf-8") as fh:
            return load(fh, Loader)
    
    
    # ROOT = Path("/data/unorganised_texts")
    ROOT = Path("/tmp/t")
    
    for subdir in (d for d in ROOT.iterdir() if d.is_dir()):
        yamlf = subdir / "Data.yaml"
        if yamlf.is_file():
            print("Processing", yamlf)
            data = parse_yaml(yamlf)
            other_dirs = ROOT / data["Category"]["Name"] / data["Author"]
            other_dirs.mkdir(exist_ok=True, parents=True)
            outdir = other_dirs / data["Title"]
            if outdir.exists():
                print("skipping as would overwrite.")
            else:
                copytree(subdir, outdir)
    

    This code probably doesn’t need any explanation even for someone new to python. But for completeness:

    • we import a stdlib class (Path) and fn (copytree)
    • we import a 3rd party fn (load) and class (Loader)
    • we define a function to parse yaml. This is probably redundant, but it does add a level of commentary, and lets us easily add more logic here if required later.
    • ROOT.iterdir() yields every entry directly inside ROOT, files and directories alike. We filter these with a generator expression to keep only the directories.
    • if we find the YAML file we’re expecting, we make the output dirs, and then, if we’re not going to overwrite anything, copy our current directory into the output.
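
The directory filter and the overwrite check described in the list above can be tried in isolation. A minimal sketch using only the standard library; the helper names source_dirs and copy_if_absent are hypothetical, and the metadata file name Data.yaml is taken from the question.

```python
from pathlib import Path
from shutil import copytree


def source_dirs(root: Path, marker: str = "Data.yaml") -> list[Path]:
    """Return the direct subdirectories of root that contain the marker file."""
    return sorted(d for d in root.iterdir() if d.is_dir() and (d / marker).is_file())


def copy_if_absent(src: Path, dest: Path) -> bool:
    """Copy src to dest unless dest already exists; report whether a copy was made."""
    if dest.exists():
        return False  # never overwrite an existing destination
    dest.parent.mkdir(parents=True, exist_ok=True)
    copytree(src, dest)
    return True
```

Returning a bool instead of printing lets the caller decide how to report skipped directories.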

    There is nothing remotely wrong with doing this in bash. These days I’d have written this python version instead, because a. I know python much better than my very rusty bash, and b. it solves the problem ‘properly’ (e.g. we parse the YAML with a yaml parser), which sometimes makes things more robust.
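
One robustness gap remains even with a real YAML parser: a missing key raises KeyError, and a value containing a slash would be treated as extra path levels. A minimal sketch of a guard, operating on the dictionary that the YAML parser returns; the helper name destination_for and the choice of underscore as the replacement character are assumptions, not part of the script above.

```python
from pathlib import Path
from typing import Optional


def destination_for(data: dict, dest_root: Path) -> Optional[Path]:
    """Build dest_root/category/author/title, or return None if a field is missing."""
    category = data.get("Category")
    if isinstance(category, dict):
        category = category.get("Name")  # nested form: Category -> Name
    parts = [category, data.get("Author"), data.get("Title")]
    if not all(isinstance(p, str) and p.strip() for p in parts):
        return None  # incomplete or non-text metadata: let the caller skip it
    # a slash inside a value would otherwise create extra directory levels
    return dest_root.joinpath(*(p.strip().replace("/", "_") for p in parts))
```

The caller can then skip a directory when the function returns None rather than stopping on the first malformed Data.yaml.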

    Note btw that the type hints are optional and ignored at runtime.
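
A small illustration of that last point, using a made-up function parse_count: the return annotation says int, the function returns a str, and Python runs it without complaint. Only an external static checker would flag the mismatch.

```python
def parse_count(text: str) -> int:
    # the annotations document intent; Python does not check them at run time
    return text  # returns a str despite "-> int", and no error is raised


result = parse_count("42")
print(type(result).__name__)  # prints "str"
```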
