I tried:
$obj = [System.IO.File]::ReadLines((Convert-Path -LiteralPath names.json)) | ConvertFrom-Json
$keys = @()
foreach ($key in $obj.GetEnumerator()) {
$keys += $key.Key
}
Write-Output $keys
But after over 24 hours it had not completed.
I need the key names so I can
- Delete irrelevant info and make it smaller
- Convert it to csv (the key names are required, otherwise PS just uses the first object and ignores keys which are not present in the first object)
The JSON is a version of this one (though 200 megs smaller): https://kaikki.org/dictionary/All%20languages%20combined/by-pos-name/kaikki_dot_org-dictionary-all-by-pos-name.json
2 Answers
You can stream the file and process each line, avoiding Get-Content for large files and minimizing the number of cmdlet invocations. This approach uses the .NET ReadLines method to efficiently stream the file and process each line.
I’ve finally finished downloading your massive JSON file.
This will give you a list of glosses from each JSON object.
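The code that originally accompanied this answer is not reproduced here, so the following is only a minimal sketch of the streaming approach it describes. It assumes that each JSON object carries a senses array whose elements have a glosses array; that layout, the sample lines, and the temporary file name are all assumptions made for illustration.

```powershell
# Hypothetical sample data standing in for the real file. The senses/glosses
# layout is an assumption about the input, not something confirmed above.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-names.jsonl'
$sample = [string[]] @(
    '{"word":"Ada","pos":"name","senses":[{"glosses":["A female given name."]}]}'
    '{"word":"Bo","pos":"name","senses":[{"glosses":["A male given name."]},{"glosses":["A surname."]}]}'
)
[System.IO.File]::WriteAllLines($path, $sample)

# Read the lines lazily with ReadLines and hand them to a single
# ConvertFrom-Json call; then pull the glosses out of every sense.
$glosses = [System.IO.File]::ReadLines($path) |
    ConvertFrom-Json |
    ForEach-Object { $_.senses.glosses }

$glosses
```

Because $path is already a full path here, no Convert-Path call is needed; with a relative path such as names.json it would be.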
You’re already using the most efficient way (in PowerShell) to parse your line-based JSONL input file (see the bottom section for an explanation).
It seems that you are simply looking to discover the property names of the [pscustomobject] instances that ConvertFrom-Json parsed the JSON documents into.
The simplest way to discover the immediate (top-level) properties is via the intrinsic psobject property; using Get-Member is another option.
However, given that the [pscustomobject] instances stored in $obj are object graphs, i.e. each is a hierarchy of nested [pscustomobject]s, you may be interested in discovering the hierarchy of property names:
- One option is simply to re-convert to JSON, using ConvertTo-Json, taking advantage of the fact that the resulting JSON representation is pretty-printed by default (opt-out requires use of -Compress). Pass a -Depth argument with a sufficiently high number of levels to serialize, so as to avoid truncation (the default depth is only 2; see this post for details).
- Another option, informal and for display output only but less noisy, is to use Format-Custom. You’ll get a hierarchical display without quoting that is easy to parse visually. As with all Format-* cmdlets, the output is for display only, and not suitable for programmatic processing (see this answer). Format-Custom’s default depth is 5, so no -Depth argument is required in this case.
While usually not necessary to get a sense of the overall structure of an object graph and its property names, note that in general you may need to (temporarily) set the $FormatEnumerationLimit preference variable to ensure that all elements of a collection (array) among the property values are visualized; by default, only 4 elements are, and omitted elements are represented as ….
Sample Format-Custom output (abridged):
Based on the properties discovered above, here’s an example with a calculated property that creates CSV data from the input objects:
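The original code blocks for this answer are not reproduced here. As a rough sketch of the techniques described, the following discovers top-level property names, prints a nested view, and then builds CSV data with a calculated property. The two sample objects and their property names (word, pos, lang, senses, glosses) are hypothetical stand-ins for $obj; adjust them to whatever the discovery step reports for the real data.

```powershell
# Hypothetical stand-in for $obj; the property names are illustrative only.
$obj = @(
    '{"word":"Ada","pos":"name","senses":[{"glosses":["A female given name."]}]}'
    '{"word":"Bo","lang":"English","senses":[{"glosses":["A male given name."]},{"glosses":["A surname."]}]}'
) | ConvertFrom-Json

# Top-level property names of one object: via psobject, then via Get-Member.
$obj[0].psobject.Properties.Name
($obj[0] | Get-Member -MemberType NoteProperty).Name

# Nested view of one object; -Depth is raised above the default of 2.
$obj[1] | ConvertTo-Json -Depth 5

# Union of top-level names across ALL objects, in first-seen order, so that
# properties missing from the first object are not lost.
$seen  = [System.Collections.Generic.HashSet[string]]::new()
$names = [System.Collections.Generic.List[string]]::new()
foreach ($o in $obj) {
    foreach ($n in $o.psobject.Properties.Name) {
        if ($seen.Add($n)) { $names.Add($n) }
    }
}

# Keep the scalar columns and flatten the nested glosses into one column
# with a calculated property.
$columns = @($names | Where-Object { $_ -ne 'senses' }) +
    @{ Name = 'glosses'; Expression = { $_.senses.glosses -join '; ' } }

$csv = $obj | Select-Object -Property $columns | ConvertTo-Csv -NoTypeInformation
$csv
```

Passing the full column list to Select-Object is what keeps properties that are absent from the first object; objects that lack a property simply get an empty field.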
Efficient processing of JSONL files:
suchislife found the proper term to describe the line-based format of your JSON input data: JSONL; that is, each line of your input file is a self-contained JSON document.
ConvertFrom-Json has built-in heuristics to support such input: If the first input object (string) is a self-contained JSON document, it is parsed as such, and all subsequent lines are treated the same.
However, there are potential pitfalls with respect to empty lines and single-line comments at the start – see this answer.
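To illustrate the heuristic just described, here is a small sketch with hypothetical in-memory lines instead of a file. Filtering out blank lines before parsing is a precaution added for this example, not something the linked answer is known to prescribe.

```powershell
# Hypothetical JSONL input: each non-empty line is a complete JSON document.
$lines = @(
    ''
    '{"word":"Ada","pos":"name"}'
    '{"word":"Bo","pos":"name"}'
)

# Drop empty or whitespace-only lines so that the first string reaching
# ConvertFrom-Json is a self-contained JSON document.
$objects = $lines | Where-Object { $_.Trim() } | ConvertFrom-Json

# One object per remaining input line.
$objects.Count
$objects | ForEach-Object { $_.word }
```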
Therefore, as shown in your question ([System.IO.File]::ReadLines((Convert-Path -LiteralPath names.json)) | ConvertFrom-Json):
- You can stream a JSONL file’s lines, one by one, directly to ConvertFrom-Json, which avoids a ForEach-Object call in which ConvertFrom-Json is invoked once for each line.
- To perform this streaming efficiently with a large input file, [System.IO.File]::ReadLines() is preferable. Convert-Path is used to ensure that a full path is passed, because .NET’s working directory usually differs from PowerShell’s (see this answer).
While the PowerShell-idiomatic Get-Content -LiteralPath names.json works too, it is, unfortunately, much slower, because each line read is decorated with ETS (Extended Type System) properties. Per GitHub issue #7537, introducing an opt-out of this costly decoration via a -NoExtendedMember switch has been green-lit, but no one has stepped up to implement it yet (as of PowerShell 7.3.10, current as of this writing).
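The decoration mentioned above can be made visible with a small sketch. The temporary file and its contents are hypothetical; the point is only that a line returned by Get-Content carries extra note properties such as PSPath and ReadCount, while a line returned by [System.IO.File]::ReadLines() is a plain string.

```powershell
# Hypothetical two-line sample file in the temp directory.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-ets.jsonl'
[System.IO.File]::WriteAllLines($path, [string[]] @('{"word":"Ada"}', '{"word":"Bo"}'))

# Convert-Path yields a full file-system path, which is what .NET methods need.
$fullPath = Convert-Path -LiteralPath $path

# First line as returned by Get-Content (decorated) and by ReadLines (plain).
# @(...) reads all lines, which is fine for a tiny sample file.
$decorated = Get-Content -LiteralPath $fullPath -TotalCount 1
$plain     = @([System.IO.File]::ReadLines($fullPath))[0]

# Note properties that Get-Content attached to the string.
$decorated.psobject.Properties |
    Where-Object { $_.MemberType -eq 'NoteProperty' } |
    ForEach-Object { $_.Name }

# The plain string has no such property, so this is $true.
$null -eq $plain.PSPath
```

Both variables hold the same text; only the attached metadata differs, and attaching it to every line is where the extra cost comes from on large files.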