
I tried

$obj = [System.IO.File]::ReadLines((Convert-Path -LiteralPath names.json)) | ConvertFrom-Json
 
$keys = @() 
 
foreach ($key in $obj.GetEnumerator()) { 
  $keys += $key.Key 
} 
 
Write-Output $keys

But after over 24 hours it had not completed.

I need the key names so I can

  1. Delete irrelevant info and make it smaller
  2. Convert it to CSV (the key names are required; otherwise PowerShell just uses the first object and ignores keys that are not present in the first object)

The JSON is a version of this one (though 200 megs smaller): https://kaikki.org/dictionary/All%20languages%20combined/by-pos-name/kaikki_dot_org-dictionary-all-by-pos-name.json
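The first-object behavior is easy to reproduce in miniature. In this sketch (sample records invented for illustration), ConvertTo-Csv shapes its columns after the first object unless an explicit property list is supplied via Select-Object:

```powershell
# Two records whose key sets differ: 'c' only appears in the second.
$records = [pscustomobject]@{ a = 1; b = 2 },
           [pscustomobject]@{ a = 3; c = 4 }

# Columns come from the first object only, so 'c' is silently dropped.
$records | ConvertTo-Csv -NoTypeInformation

# Supplying the full key list up front keeps every column.
$records | Select-Object a, b, c | ConvertTo-Csv -NoTypeInformation
```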

2 Answers


  1. You can stream the file and process each line, avoiding Get-Content for large files and minimizing the number of cmdlet invocations:

    $allKeys = @{}
    
    [System.IO.File]::ReadLines((Convert-Path -LiteralPath names.json)) | ForEach-Object {
        $obj = ConvertFrom-Json $_
        $obj.PSObject.Properties.Name | ForEach-Object {
            $allKeys[$_] = $true
        }
    }
    
    $uniqueKeys = $allKeys.Keys
    
    Write-Output $uniqueKeys
    

    This approach uses the .NET ReadLines method to efficiently stream the file and process each line.

    I’ve finally finished downloading your massive JSON file.

    • To process the JSON further, here’s a simple extraction of "glosses" from each line:
    $allGlosses = @()
    
    [System.IO.File]::ReadLines((Convert-Path -LiteralPath sample.json)) | ForEach-Object {
        $obj = ConvertFrom-Json $_
        $obj.senses.glosses | ForEach-Object {
            $allGlosses += $_
        }
    }
    
    Write-Output $allGlosses
    

    This will give you a list of glosses from each JSON object.

  2. You’re already using the most efficient way (in PowerShell) to parse your line-based JSONL input file (see the bottom section for an explanation).

    It seems that you are simply looking to discover the property names of the [pscustomobject] instances that ConvertFrom-Json parsed the JSON documents into.

    The simplest way to discover the immediate (top-level) properties is via the intrinsic psobject property:

    $propNames = $obj[0].psobject.Properties.Name
    

    Using Get-Member is another option:

    $obj[0] | Get-Member -Type Properties
    

    However, given that the [pscustomobject] instances stored in $obj are object graphs, i.e. each is a hierarchy of nested [pscustomobject]s, you may be interested in discovering the hierarchy of property names:

    • One option is simply to re-convert to JSON, using ConvertTo-Json, taking advantage of the fact that the resulting JSON representation is pretty-printed by default (opt-out requires use of -Compress):

      $obj[0] | ConvertTo-Json -Depth 5
      
      • Note the – unfortunate – need to specify a -Depth argument with a sufficiently high number of levels to serialize, so as to avoid truncation (the default depth is only 2; see this post for details).
    • Another option – informal, for display output only, but less noisy – is to Format-Custom:

      $obj[0] | Format-Custom
      
      • You’ll get a hierarchical display without quoting that is easy to parse visually. As with all Format-* cmdlets, the output is for display only, and not suitable for programmatic processing (see this answer).

      • Format-Custom's default depth is 5, so no -Depth argument is required in this case.

      • While usually not necessary to get a sense of the overall structure of an object graph and its property names, note that in general you may need to (temporarily) set the $FormatEnumerationLimit preference variable to ensure that all elements of a collection (array) among the property values are visualized; by default, only the first 4 elements are shown, and omitted elements are represented as ….
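    A minimal, self-contained sketch of lifting that limit temporarily (the sample object is invented for illustration; the save/restore pattern is just one way to do it):

```powershell
# Sample object with a 6-element array property (invented for illustration).
$sample = [pscustomobject]@{ nums = 1, 2, 3, 4, 5, 6 }

# Save, lift, then restore the display limit.
$savedLimit = $FormatEnumerationLimit
try {
    $FormatEnumerationLimit = -1   # -1 = show every element
    $sample | Format-Custom        # with the default limit of 4, '5' and '6' would be elided
}
finally {
    $FormatEnumerationLimit = $savedLimit
}
```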

    Sample Format-Custom output (abridged):

    class PSCustomObject
    {
      pos = name
      head_templates = 
        [
          class PSCustomObject
          {
            name = en-proper noun
            expansion = Abib
          }
        ]
        
      etymology_text = From Hebrew אָבִיב (avív, literally “ears of barley”), hence “the season of beginning barley-crop”, because the grains start ripening at that time of year.
      etymology_templates = 
        [
          class PSCustomObject
          {
            name = bor
            args = 
              class PSCustomObject
              {
                1 = en
                2 = he
                lit = ears of barley
              }
            expansion = Hebrew אָבִיב (avív, literally “ears of barley”)
          }
        ]
        
      word = Abib
      lang = English
      lang_code = en
    
       …
    
    }
    

    Based on the properties discovered above, here’s an example with a calculated property that creates CSV data from the input objects:

    # Sample CSV output - limited to 10 in this example.
    $obj | 
      Select-Object -First 10 -Property word, 
                                        lang, 
                                        @{ n='glosses'; e={ $_.senses.glosses } } |
      ConvertTo-Csv -NoTypeInformation
    

    Efficient processing of JSONL files:

    suchislife found the proper term to describe the line-based format of your JSON input data: JSONL; that is, each line of your input file is a self-contained JSON document.

    ConvertFrom-Json has built-in heuristics to support such input:

    • If the first input object (string) is a self-contained JSON document, it is parsed as such, and all subsequent lines are treated the same.

    • However, there are potential pitfalls with respect to empty lines and single-line comments at the start – see this answer.

    Therefore, as shown in your question
    ([System.IO.File]::ReadLines((Convert-Path -LiteralPath names.json)) | ConvertFrom-Json):

    • you can stream a JSONL file’s lines – one by one – directly to ConvertFrom-Json.

      • This is much more efficient than using a ForEach-Object call in which ConvertFrom-Json is invoked once for each line.
    • and to perform this streaming efficiently with a large input file, [System.IO.File]::ReadLines() is preferable.

      • Convert-Path is used to ensure that a full path is passed, because .NET’s working directory usually differs from PowerShell’s (see this answer).

      • While the PowerShell-idiomatic Get-Content -LiteralPath names.json works too, it is – unfortunately – much slower, because each line read is decorated with ETS (Extended Type System) properties.

      • Per GitHub issue #7537, introducing an opt-out of this costly decoration via a -NoExtendedMember switch has been green-lit, but no one has stepped up to implement it yet (as of PowerShell 7.3.10, current as of this writing).
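    Putting the pieces above together, here's a sketch of a single streaming pass that collects the union of top-level key names from a JSONL file (the file name names.json is taken from the question; the blank-line filter guards against the empty-line pitfall noted above):

```powershell
# One streaming parse pass: ReadLines piped directly to ConvertFrom-Json.
$allKeys = [System.Collections.Generic.HashSet[string]]::new()

[System.IO.File]::ReadLines((Convert-Path -LiteralPath names.json)) |
  Where-Object { $_.Trim() -ne '' } |   # skip blank lines
  ConvertFrom-Json |
  ForEach-Object {
    foreach ($name in $_.psobject.Properties.Name) {
      $null = $allKeys.Add($name)       # HashSet de-duplicates for free
    }
  }

$allKeys   # the unique top-level key names
```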
