Can text from a txt file be converted to json using jq tool?

Marco
May 11, 2023
2 Answers

I have a txt file containing the output of the following command, which runs gsutil stat on every object in a bucket:

gsutil ls -r gs://bucket-test/** | while IFS= read -r key; do gsutil stat "$key"; done

It looks like this:

gs://bucket-test/4e123978-8eed-43ae-f521-8fba54c704ea.zip:
    Creation time:          Wed, 21 Dec 2022 10:39:27 GMT
    Update time:            Wed, 21 Dec 2022 10:39:27 GMT
    Storage class:          STANDARD
    Content-Length:         0
    Content-Type:           application/zip
    Hash (crc32c):          AAAAAA==
    Hash (md5):             1B2M2Y8AsgTpgAmY7PhCfg==
    ETag:                   CM30q9XCivwCEAE=
    Generation:             1671619167320653
    Metageneration:         1
gs://bucket-test/GKiSQMZ5rAqrSWwur/uploads/GENERAL/SNrQD97nzQN9eDLeA/AAZYefiL5CT8pxe4L:
    Creation time:          Mon, 10 Apr 2023 19:09:41 GMT
    Update time:            Mon, 10 Apr 2023 19:09:41 GMT
    Storage class:          STANDARD
    Content-Disposition:    inline; filename=James_INGREDIENTS_A3.pdf
    Content-Length:         4381797
    Content-Type:           application/pdf
    Hash (crc32c):          GOzitA==
    Hash (md5):             eUSLC/z70gjDB2WQKIPOuQ==
    ETag:                   CLGPvu+BoP4CEAE=
    Generation:             1681153781106609
    Metageneration:         1
gs://bucket-test/prova.pdf:
    Creation time:          Mon, 08 May 2023 15:37:26 GMT
    Update time:            Mon, 08 May 2023 15:40:12 GMT
    Storage class:          STANDARD
    Content-Disposition:    inline; filename=James_KEY_VISUAL_A3.pdf
    Content-Language:       ace
    Content-Length:         15407
    Content-Type:           application/pdf
    Metadata:               
        meta-1:             prova 1
        meta-2:             prova 2
    Hash (crc32c):          ZIrHPA==
    Hash (md5):             oZbD+S8y35spkNozW3hUDA==
    ETag:                   CNDj09OG5v4CEAM=
    Generation:             1683560246604240
    Metageneration:         3

I need to convert this output to JSON. Each group should become one object: the value on the first row of the group goes into a "Key" field, and the indented rows become the remaining fields. Some fields may have subfields, for example under "Metadata":

{
  "Key": "gs://bucket-test/4e123978-8eed-43ae-f521-8fba54c704ea.zip",
  "Creation time": "Wed, 21 Dec 2022 10:39:27 GMT",
  "Update time": "Wed, 21 Dec 2022 10:39:27 GMT",
  "Storage class": "STANDARD",
  "Content-Length": "0",
  "Content-Type": "application/zip",
  "Hash (crc32c)": "AAAAAA==",
  "Hash (md5)": "1B2M2Y8AsgTpgAmY7PhCfg==",
  "ETag": "CM30q9XCivwCEAE=",
  "Generation": "1671619167320653",
  "Metageneration": "1"
},
{
  "Key": "gs://bucket-test/GKiSQMZ5rAqrSWwur/uploads/GENERAL/SNrQD97nzQN9eDLeA/AAZYefiL5CT8pxe4L",
  "Creation time": "Mon, 10 Apr 2023 19:09:41 GMT",
  "Update time": "Mon, 10 Apr 2023 19:09:41 GMT",
  "Storage class": "STANDARD",
  "Content-Disposition": "inline; filename=James_INGREDIENTS_A3.pdf",
  "Content-Length": "4381797",
  "Content-Type": "application/pdf",
  "Hash (crc32c)": "GOzitA==",
  "Hash (md5)": "eUSLC/z70gjDB2WQKIPOuQ==",
  "ETag": "CLGPvu+BoP4CEAE=",
  "Generation": "1681153781106609",
  "Metageneration": "1"
},
{
  "Key": "gs://bucket-test/prova.pdf",
  "Creation time": "Mon, 08 May 2023 15:37:26 GMT",
  "Update time": "Mon, 08 May 2023 15:40:12 GMT",
  "Storage class": "STANDARD",
  "Content-Disposition": "inline; filename=James_KEY_VISUAL_A3.pdf",
  "Content-Language": "ace",
  "Content-Length": "15407",
  "Content-Type": "application/pdf",
  "Metadata": {
    "meta-1": "prova 1",
    "meta-2": "prova 2"
  },
  "Hash (crc32c)": "ZIrHPA==",
  "Hash (md5)": "oZbD+S8y35spkNozW3hUDA==",
  "ETag": "CNDj09OG5v4CEAM=",
  "Generation": "1683560246604240",
  "Metageneration": "3"
}

I tried this command on a single group, but without success:
gsutil stat gs://bucket-test/prova.pdf | printf %s "$(cat)" | jq -R -s 'split("\n") | map({key: split(": ")[0], value: split(": ")[1]})'

The result is an array of separate key/value pairs with untrimmed whitespace, rather than a single object:

[
  {
    "key": "gs://spin8-test/prova.pdf:",
    "value": null
  },
  {
    "key": "    Creation time",
    "value": "         Mon, 08 May 2023 15:37:26 GMT"
  },
  {
    "key": "    Update time",
    "value": "           Mon, 08 May 2023 15:40:12 GMT"
  },
  {
    "key": "    Storage class",
    "value": "         STANDARD"
  },
  {
    "key": "    Content-Disposition",
    "value": "   inline; filename=James_KEY_VISUAL_A3.pdf"
  },
  {
    "key": "    Content-Language",
    "value": "      ace"
  },
  {
    "key": "    Content-Length",
    "value": "        15407"
  },
  {
    "key": "    Content-Type",
    "value": "          application/pdf"
  },
  {
    "key": "    Metadata",
    "value": "              "
  },
  {
    "key": "        meta-1",
    "value": "            prova 1"
  },
  {
    "key": "        meta-2",
    "value": "            prova 2"
  },
  {
    "key": "    Hash (crc32c)",
    "value": "         ZIrHPA=="
  },
  {
    "key": "    Hash (md5)",
    "value": "            oZbD+S8y35spkNozW3hUDA=="
  },
  {
    "key": "    ETag",
    "value": "                  CNDj09OG5v4CEAM="
  },
  {
    "key": "    Generation",
    "value": "            1683560246604240"
  },
  {
    "key": "    Metageneration",
    "value": "        3"
  }
]

Any suggestions? Thanks

Answers

I’ve no idea how you’d use jq to parse regular text to turn it into JSON or if that’s even something jq is designed to do but here’s a start using awk to just handle the input/output you show:

$ cat tst.awk
/^[^[:space:]]/ { if (NR>1) prtRec() }
{ rec = rec $0 RS }
END { prtRec(); print "" }

function prtRec(        lines,numLines,lineNr,line,tag,val,depth) {
    printf "%s", recSep
    recSep = "," ORS

    print "{"

    numLines = split(rec,lines,RS) - 1
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        line = lines[lineNr]
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",line)
        tag = val = line
        if ( lineNr == 1 ) {
            tag = "Key"
            sub(/:$/,"",val)
        }
        else {
            sub(/[[:space:]]*:.*/,"",tag)
            sub(/[^:]+:[[:space:]]*/,"",val)
        }
        printf "  \"%s\": \"%s\"%s\n", tag, val, (lineNr<numLines ? "," : "")
    }

    printf "}"

    rec = ""
}

$ awk -f tst.awk file
{
  "Key": "gs://bucket-test/4e123978-8eed-43ae-f521-8fba54c704ea.zip",
  "Creation time": "Wed, 21 Dec 2022 10:39:27 GMT",
  "Update time": "Wed, 21 Dec 2022 10:39:27 GMT",
  "Storage class": "STANDARD",
  "Content-Length": "0",
  "Content-Type": "application/zip",
  "Hash (crc32c)": "AAAAAA==",
  "Hash (md5)": "1B2M2Y8AsgTpgAmY7PhCfg==",
  "ETag": "CM30q9XCivwCEAE=",
  "Generation": "1671619167320653",
  "Metageneration": "1"
},
{
  "Key": "gs://bucket-test/GKiSQMZ5rAqrSWwur/uploads/GENERAL/SNrQD97nzQN9eDLeA/AAZYefiL5CT8pxe4L",
  "Creation time": "Mon, 10 Apr 2023 19:09:41 GMT",
  "Update time": "Mon, 10 Apr 2023 19:09:41 GMT",
  "Storage class": "STANDARD",
  "Content-Disposition": "inline; filename=James_INGREDIENTS_A3.pdf",
  "Content-Length": "4381797",
  "Content-Type": "application/pdf",
  "Hash (crc32c)": "GOzitA==",
  "Hash (md5)": "eUSLC/z70gjDB2WQKIPOuQ==",
  "ETag": "CLGPvu+BoP4CEAE=",
  "Generation": "1681153781106609",
  "Metageneration": "1"
},
{
  "Key": "gs://bucket-test/prova.pdf",
  "Creation time": "Mon, 08 May 2023 15:37:26 GMT",
  "Update time": "Mon, 08 May 2023 15:40:12 GMT",
  "Storage class": "STANDARD",
  "Content-Disposition": "inline; filename=James_KEY_VISUAL_A3.pdf",
  "Content-Language": "ace",
  "Content-Length": "15407",
  "Content-Type": "application/pdf",
  "Metadata": "",
  "meta-1": "prova 1",
  "meta-2": "prova 2",
  "Hash (crc32c)": "ZIrHPA==",
  "Hash (md5)": "oZbD+S8y35spkNozW3hUDA==",
  "ETag": "CNDj09OG5v4CEAM=",
  "Generation": "1683560246604240",
  "Metageneration": "3"
}

You’d just have to modify it to spot the missing val on the Metadata line and use the increase/decrease of the indent on the subsequent lines to add the necessary additional { and }.
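
The indent tracking described above could be sketched as follows. This is written in Python rather than awk, purely for brevity, and it is not the answer's own code. The function name parse_stat_output is hypothetical, and the sketch assumes that any field with an empty value (such as Metadata) opens a nested object whose members are the more deeply indented lines that follow.

```python
import json


def parse_stat_output(text):
    """Turn `gsutil stat`-style text into a list of dicts (hypothetical helper)."""
    records = []
    stack = []  # (indent, dict) pairs: the current record plus any open nested objects
    for line in text.splitlines():
        body = line.strip()
        if not body:
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            # An unindented line starts a new record; its text, minus the
            # trailing colon, becomes the "Key" field.
            record = {"Key": body[:-1] if body.endswith(":") else body}
            records.append(record)
            stack = [(0, record)]
            continue
        if not stack:
            continue  # indented line before any record header: skip it
        # Close nested objects that this line is no longer inside.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        # Split at the first colon, so colons inside values (times) survive.
        tag, _, val = body.partition(":")
        tag, val = tag.strip(), val.strip()
        if val == "":
            # Assumption: an empty value introduces a nested object.
            child = {}
            parent[tag] = child
            stack.append((indent, child))
        else:
            parent[tag] = val
    return records


sample = """\
gs://bucket-test/prova.pdf:
    Content-Type:           application/pdf
    Metadata:
        meta-1:             prova 1
        meta-2:             prova 2
    Metageneration:         3
"""

print(json.dumps(parse_stat_output(sample), indent=2))
```

Because the stack compares indent widths rather than fixed column numbers, the same loop should cope with deeper nesting, although the sample input only ever shows one extra level.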

With jq, you can read in raw text using the -R flag, and iterate through the lines using reduce. Start out with an empty array [], then, based on the indentation, add a new item, append to the last one, or append to last one’s .Metadata field. Checking the indentation and parsing the line’s content is done using regular expressions with match and capture, respectively:

jq -Rn '
  reduce (inputs | {
    ind: match("^\\s*").length,
    cap: capture("\\s*(?<key>.*):(\\s+(?<value>.*))?$")
  }) as {$ind, $cap} ([];
    if $ind == 0 then . + [$cap | {key}]
    elif $ind == 4 then last += ([$cap | select(.key == "Metadata").value = {}] | from_entries)
    elif $ind == 8 then last.Metadata += ([$cap] | from_entries)
    else . end
  )
'

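This produces a valid JSON array. The enclosing brackets are needed because a bare sequence of objects separated by commas is not valid JSON. Note that the first field of each item is named key here; to get Key as in the question, replace {key} with {Key: .key}: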

[
  {
    "key": "gs://bucket-test/4e123978-8eed-43ae-f521-8fba54c704ea.zip",
    "Creation time": "Wed, 21 Dec 2022 10:39:27 GMT",
    "Update time": "Wed, 21 Dec 2022 10:39:27 GMT",
    "Storage class": "STANDARD",
    "Content-Length": "0",
    "Content-Type": "application/zip",
    "Hash (crc32c)": "AAAAAA==",
    "Hash (md5)": "1B2M2Y8AsgTpgAmY7PhCfg==",
    "ETag": "CM30q9XCivwCEAE=",
    "Generation": "1671619167320653",
    "Metageneration": "1"
  },
  {
    "key": "gs://bucket-test/GKiSQMZ5rAqrSWwur/uploads/GENERAL/SNrQD97nzQN9eDLeA/AAZYefiL5CT8pxe4L",
    "Creation time": "Mon, 10 Apr 2023 19:09:41 GMT",
    "Update time": "Mon, 10 Apr 2023 19:09:41 GMT",
    "Storage class": "STANDARD",
    "Content-Disposition": "inline; filename=James_INGREDIENTS_A3.pdf",
    "Content-Length": "4381797",
    "Content-Type": "application/pdf",
    "Hash (crc32c)": "GOzitA==",
    "Hash (md5)": "eUSLC/z70gjDB2WQKIPOuQ==",
    "ETag": "CLGPvu+BoP4CEAE=",
    "Generation": "1681153781106609",
    "Metageneration": "1"
  },
  {
    "key": "gs://bucket-test/prova.pdf",
    "Creation time": "Mon, 08 May 2023 15:37:26 GMT",
    "Update time": "Mon, 08 May 2023 15:40:12 GMT",
    "Storage class": "STANDARD",
    "Content-Disposition": "inline; filename=James_KEY_VISUAL_A3.pdf",
    "Content-Language": "ace",
    "Content-Length": "15407",
    "Content-Type": "application/pdf",
    "Metadata": {
      "meta-1": "prova 1",
      "meta-2": "prova 2"
    },
    "Hash (crc32c)": "ZIrHPA==",
    "Hash (md5)": "oZbD+S8y35spkNozW3hUDA==",
    "ETag": "CNDj09OG5v4CEAM=",
    "Generation": "1683560246604240",
    "Metageneration": "3"
  }
]
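
To see how the two regular expressions in the jq filter classify a line, here is a small Python sketch of the same indentation check and key/value capture. The helper name classify is hypothetical, and Python's re module is only a stand-in for jq's regex engine: the named groups are written (?P<name>...) instead of (?<name>...), but the patterns are otherwise the same.

```python
import re

# Same patterns as in the jq filter, in Python's named-group syntax.
INDENT = re.compile(r"^\s*")
FIELD = re.compile(r"\s*(?P<key>.*):(\s+(?P<value>.*))?$")


def classify(line):
    """Return (indent, key, value) for one line, or None if it has no colon."""
    indent = len(INDENT.match(line).group(0))
    m = FIELD.search(line)
    if m is None:
        return None  # the jq filter likewise skips lines that capture nothing
    return indent, m.group("key"), m.group("value")


for line in [
    "gs://bucket-test/prova.pdf:",
    "    Update time:            Mon, 08 May 2023 15:40:12 GMT",
    "    Metadata:               ",
    "        meta-1:             prova 1",
]:
    print(classify(line))
```

The header line yields indent 0 and no value, the Metadata line yields indent 4 and an empty value, and the meta-1 line yields indent 8, which are the three cases the if/elif chain branches on. One caveat of the greedy key group: a value that itself contains a colon followed by whitespace would be split at that later colon, so the pattern relies on values like the timestamps having no space after their colons.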
