I have some files whose content consists of multiple key-value pairs, and I’d like to transform them into a JSON array with one object per file.

Let me illustrate what I mean with some made-up examples. First, the content of the files:

# cat content/1.yaml
time: "2020-09-14T22:33:40Z"
id: ed1d4321
name: One
description: 'Here is number "one"
  this is good'

# cat content/2yaml
time: "2021-09-14T22:33:40Z"
id: eg134841
name: Two
description: 'Here is number "two"
  best of all'
newkey: value

In the next step I merge these files into one blob that also contains the filenames, which I want to keep:

# for file in $(ls content/*yaml); do echo filename: $file; cat $file; done
filename: content/1.yaml
time: "2020-09-14T22:33:40Z"
id: ed1d4321
name: One
description: 'Here is number "one"
  this is good'
filename: content/2yaml
time: "2021-09-14T22:33:40Z"
id: eg134841
name: Two
description: 'Here is number "two"
  best of all'
newkey: value

This is where the problem starts: how do I turn this into a JSON array?

This is what I have come up with so far:

# for file in $(ls content/*yaml); do echo filename: $file; cat $file; done | jq -Rn '[inputs|split(": ")] | map({(.[0]): .[1]})'
[
  {
    "filename": "content/1.yaml"
  },
  {
    "time": ""2020-09-14T22:33:40Z""
  },
  {
    "id": "ed1d4321"
  },
  {
    "name": "One"
  },
  {
    "description": "'Here is number \"one\""
  },
  {
    "  this is good'": null
  },
  {
    "filename": "content/2yaml"
  },
  {
    "time": "\"2021-09-14T22:33:40Z\""
  },
  {
    "id": "eg134841"
  },
  {
    "name": "Two"
  },
  {
    "description": "'Here is number \"two\""
  },
  {
    "  best of all'": null
  },
  {
    "newkey": "value"
  }
]

That is already close, but there are still some issues I can’t find a solution for:

  1. The entries are not grouped into one object per filename.
  2. The time field should not contain the escaped quotes. I’d like a solution that iterates over all fields and strips such surrounding quotes, for example "time": "2021-09-14T22:33:40Z".
  3. The description value is spread over multiple lines. I’d like the lines merged into one value, which is not what happens now. It should look like this: "description": "Here is number \"two\" best of all". The single quotes should not be kept.

So in the end the result should look more like this:

[
  {
    "filename": "content/1.yaml",
    "time": "2020-09-14T22:33:40Z",
    "id": "ed1d4321",
    "name": "One",
    "description": "Here is number \"one\"  this is good"
  },
  {
    "filename": "content/2yaml",
    "time": "2021-09-14T22:33:40Z",
    "id": "eg134841",
    "name": "Two",
    "description": "Here is number \"two\"  best of all",
    "newkey": "value"
  }
]

4 Answers


  1. Chosen as BEST ANSWER

    OK, I found another solution, one that uses Python rather than yq. Python is most probably installed on lots of machines (the yaml module it imports comes from the PyYAML package):

    # for file in $(ls content/*yaml); do (echo filename: $file; cat $file) | python -c 'import yaml; import json; import sys; print(json.dumps(yaml.safe_load(sys.stdin)));' ; done | jq -s
    [
      {
        "filename": "content/1.yaml",
        "time": "2020-09-14T22:33:40Z",
        "id": "ed1d4321",
        "name": "One",
        "description": "Here is number \"one\" this is good"
      },
      {
        "filename": "content/2yaml",
        "time": "2021-09-14T22:33:40Z",
        "id": "eg134841",
        "name": "Two",
        "description": "Here is number \"two\" best of all",
        "newkey": "value"
      }
    ]
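
The same idea can be written as a small script that starts Python once instead of once per file. This is only a sketch under some assumptions: the names load_with_filename and main are made up for illustration, the parser is passed in as an argument so the helper can be tried without PyYAML, and main assumes PyYAML is installed, just like the one-liner does.

```python
import json
from pathlib import Path


def load_with_filename(path, parse):
    """Parse one file with `parse` and return a dict whose first key is the filename.

    `parse` is any callable that turns YAML text into a dict,
    for example yaml.safe_load from PyYAML.
    """
    text = Path(path).read_text(encoding="utf-8")
    record = {"filename": str(path)}
    # An empty YAML document parses to None, so fall back to an empty dict.
    record.update(parse(text) or {})
    return record


def main(paths):
    # Third-party dependency (PyYAML), imported here so the helper above
    # stays usable with any other parser.
    import yaml

    records = [load_with_filename(p, yaml.safe_load) for p in paths]
    print(json.dumps(records, indent=2))
```

Calling main(sys.argv[1:]) from a script and running it as python script.py content/*yaml should print one JSON array, so the trailing jq -s is no longer needed.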
    
    

  2. This is something probably better suited for yq instead of trying to re-implement a YAML parser.

    Something like this would work:

    yq eval-all -o=json '[{"filename": filename} + .]' *.yaml
    

    resulting in

    [
      {
        "filename": "1.yaml",
        "time": "2020-09-14T22:33:40Z",
        "id": "ed1d4321",
        "name": "One",
        "description": "Here is number \"one\" this is good"
      },
      {
        "filename": "2.yaml",
        "time": "2021-09-14T22:33:40Z",
        "id": "eg134841",
        "name": "Two",
        "description": "Here is number \"two\" best of all",
        "newkey": "value"
      }
    ]
    
  3. This is a partial solution: the values are not yet "cleaned". That is left as an exercise for the reader 🙂

    Start with jq --slurp --raw-input:

    # split lines
    split("\n")
    # join lines starting with whitespace with previous line
    | reduce .[] as $l (
        null;
        if $l | startswith(" ") then .[-1] += $l else . += [$l] end
    )
    # split on first colon, returning an array of objects like {key: X, value: Y}
    | map(capture("^(?<key>[^:]+):\\s*(?<value>.*)$"))
    # combine these simple objects into bigger objects, but begin a new object when encountering "filename"
    | reduce .[] as $e (null; 
        if $e.key == "filename" then . += [{}] else . end
        | .[-1][$e.key] = $e.value
    )
    

    The output is this:

    [
      {
        "filename": "content/1.yaml",
        "time": "\"2020-09-14T22:33:40Z\"",
        "id": "ed1d4321",
        "name": "One",
        "description": "'Here is number \"one\"  this is good'"
      },
      {
        "filename": "content/2yaml",
        "time": "\"2021-09-14T22:33:40Z\"",
        "id": "eg134841",
        "name": "Two",
        "description": "'Here is number \"two\"  best of all'",
        "newkey": "value"
      }
    ]
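
For comparison, here is a rough Python sketch of the same three steps (join continuation lines, split on the first colon, start a new object at each "filename" key) with the missing cleaning step added. It is not a YAML parser and only assumes the simple "key: value" shape of the merged blob from the question. The names unquote and blob_to_records are made up for illustration. Unlike the jq version, it joins continuation lines with a single space and tolerates a blob that does not start with "filename".

```python
import json
import re


def unquote(value):
    """Strip one pair of matching outer quotes (single or double), if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


def blob_to_records(blob):
    # Step 1: join lines that start with whitespace onto the previous line.
    logical = []
    for line in blob.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace() and logical:
            logical[-1] += " " + line.strip()
        else:
            logical.append(line)

    # Steps 2 and 3: split on the first colon and group by "filename".
    records = []
    for line in logical:
        match = re.match(r"^([^:]+):\s*(.*)$", line)
        if match is None:
            continue  # not a "key: value" line; skipped in this sketch
        key, value = match.group(1), unquote(match.group(2))
        if key == "filename" or not records:
            records.append({})
        records[-1][key] = value
    return records
```

Passing the merged blob to blob_to_records and then calling json.dumps(records, indent=2) should give output close to the array asked for in the question.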
    
  4. The following handles one file at a time, and presupposes an invocation of jq using the -R and -s command-line options (jq -Rs). Combining the results for more than one file is left as an exercise.

      def objectify:
        capture("(?<key>[^:]+): *(?<value>.*)")
        | .value = (.value | (fromjson? // .))
        | [.]
        | from_entries;
    
      gsub("\n  *"; " ")        # join dangling text
      | . / "\n"                # split
      | map(select(length>0))   # ignore ""
      | map(objectify)          # {key, value}
      | add
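
As one possible take on the exercise, the sketch below mirrors the per-file logic in Python and then combines several files. The helper names (parse_value, objectify_text, combine) and the filename-to-text mapping are assumptions made for illustration, not part of the jq program. Note that json.loads rejects single-quoted strings, so in this sketch the description values keep their single quotes.

```python
import json
import re


def parse_value(raw):
    """Decode raw as JSON when possible, otherwise keep the raw text.

    This mirrors `fromjson? // .`, including the fallback to the raw text
    when the decoded value is null or false.
    """
    try:
        decoded = json.loads(raw)
    except ValueError:
        return raw
    return raw if decoded is None or decoded is False else decoded


def objectify_text(text):
    # Join dangling continuation lines (newline followed by spaces) with one space.
    joined = re.sub(r"\n +", " ", text)
    result = {}
    for line in joined.split("\n"):
        match = re.search(r"([^:]+): *(.*)", line)
        if match:  # empty lines do not match and are ignored
            result[match.group(1)] = parse_value(match.group(2))
    return result


def combine(files):
    """files: mapping of filename to file text (a hypothetical input shape)."""
    return [{"filename": name, **objectify_text(text)} for name, text in files.items()]
```

The mapping could be built by reading each path with open(path).read(), and the result can be serialised with json.dumps.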
    