
I have a list of dicts with a consistent structure, where each dict maps a key to a list of integers. However, I need to make sure each dict, when serialized to a JSON string, has a byte size below a specified threshold.

If a dict exceeds that byte-size threshold, I need to split its integer list into chunks.

Attempt:


import json

payload: list[dict] = [
    {"data1": [1,2,3,4]},
    {"data2": [8,9,10]},
    {"data3": [1,2,3,4,5,6,7]}
]

# Max size in bytes we can allow. This is static and a hard limit that is not variable.
MAX_SIZE: int = 25

def check_and_chunk(arr: list):

    def check_size_bytes(item) -> bool:
        # True when the JSON-serialized item exceeds the byte limit
        return len(json.dumps(item).encode("utf-8")) > MAX_SIZE

    def chunk(item, chunk_size: int = 2):
        # Yield consecutive slices of `item`, each `chunk_size` elements long
        for i in range(0, len(item), chunk_size):
            yield item[i:i + chunk_size]

    # First check whether the entire payload is already under MAX_SIZE
    if not check_size_bytes(arr):
        return arr

    # Partition the payload into items that are small enough and items that are too big
    small, big = [], []
    for item in arr:
        (big if check_size_bytes(item) else small).append(item)

    # Modify the big items until they are small enough to be moved to `small`
    for i in big:
        print(i)
    # This is where I am unsure how best to proceed. I'd like to split each
    # dict in `big` into pieces small enough that every resulting element
    # belongs in `small`.

Example of a possible desired result:

payload: list[dict] = [
    {"data1": [1,2,3,4]},
    {"data2": [8,9,10]},
    {"data3": [1,2,3,4]},
    {"data3": [5,6,7]}
]

2 Answers


  1. IIUC, you can use a generator to yield chunks of the right size:

    import json
    
    payload = [
        {"data1": [1, 2, 3, 4]},
        {"data2": [8, 9, 10]},
        {"data3": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]},
        {"data4": [100, 200, -1, -10, 200, 300, 12, 13]},
    ]
    
    MAX_SIZE = 25
    
    
    def get_chunks(lst):
        if len(lst) < 2:
            # `return lst` inside a generator yields nothing; yield instead
            if lst:
                yield lst
            return
    
        curr, curr_len = [], 0
        for v in lst:
            s = str(v)
            # current length of all numbers + length of current number + number of `, ` + `[]`
            if curr_len + len(s) + 2 * len(curr) + 2 > MAX_SIZE:
                yield curr
                curr = [v]
                curr_len = len(s)
            else:
                curr.append(v)
                curr_len += len(s)
    
        if curr:
            yield curr
    
    
    for d in payload:
        for k, v in d.items():
            for chunk in get_chunks(v):
                d = {k: chunk}
                print(f"{str(d):<40} {len(json.dumps(chunk).encode())=:<30}")
    

    Prints:

    {'data1': [1, 2, 3, 4]}                  len(json.dumps(chunk).encode())=12                            
    {'data2': [8, 9, 10]}                    len(json.dumps(chunk).encode())=10                            
    {'data3': [1, 2, 3, 4, 5, 6, 7, 8]}      len(json.dumps(chunk).encode())=24                            
    {'data3': [9, 10, 11, 12]}               len(json.dumps(chunk).encode())=15                            
    {'data4': [100, 200, -1, -10, 200]}      len(json.dumps(chunk).encode())=24                            
    {'data4': [300, 12, 13]}                 len(json.dumps(chunk).encode())=13                            
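
    The loop above only prints each chunked dict. To actually rebuild the payload, collect the chunks into a new list. Here is a minimal self-contained sketch; it uses a simpler (but less efficient, since it re-serializes on every step) chunker that measures the full serialized dict, key included, which matches the question's threshold directly. The helper name `get_chunks` and the split points are assumptions of this sketch:

    ```python
    import json

    MAX_SIZE = 25

    def get_chunks(lst, key):
        # Greedily grow the current chunk while the full serialized dict
        # (key included) stays within MAX_SIZE bytes
        curr = []
        for v in lst:
            if curr and len(json.dumps({key: curr + [v]}).encode("utf-8")) > MAX_SIZE:
                yield curr
                curr = [v]
            else:
                curr.append(v)
        if curr:
            yield curr

    payload = [
        {"data1": [1, 2, 3, 4]},
        {"data2": [8, 9, 10]},
        {"data3": [1, 2, 3, 4, 5, 6, 7]},
    ]

    new_payload = [
        {k: chunk}
        for d in payload
        for k, v in d.items()
        for chunk in get_chunks(v, k)
    ]
    print(new_payload)
    ```

    Re-serializing the candidate chunk on every element is O(n) per append; for long lists the incremental length estimate in the answer above is the cheaper approach.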
    
  2. My approach starts with the list of integers: I take one element at a time out of the existing list (which I call input_sequence) and place it into a new list (output_sequence) until I go over the length limit, at which point I back up one element and emit the chunk.

    import json
    import logging
    import pprint
    from collections import deque
    
    logging.basicConfig(level=logging.DEBUG)
    
    MAX_SIZE: int = 25
    
    
    def split(key, input_sequence, limit, out):
        """Split the `input_sequence` into several smaller ones.
    
        The result will be appended to the `out` list.    
        """
        input_sequence = deque(input_sequence)
        output_sequence = []
        
        while input_sequence:
            # Move an element from input_sequence to output_sequence
            element = input_sequence.popleft()
            output_sequence.append(element)
    
            # Build the dictionary in bytes
            dict_str = json.dumps({key: output_sequence})
            dict_binary = dict_str.encode("utf-8")
            actual_length = len(dict_binary)
            logging.debug("dict_binary=%r, len=%r", dict_binary, actual_length)
    
            # If the length is over the limit, then back off one element
            # And produce the result
            if actual_length > limit:
                logging.debug("Over the limit")
                output_sequence.pop()
                if not output_sequence:
                    # A lone element already exceeds the limit; fail fast
                    # instead of looping forever on the same element
                    raise ValueError(
                        f"single element {element!r} exceeds the {limit}-byte limit"
                    )
                input_sequence.appendleft(element)
                out.append({key: output_sequence})
                output_sequence = []
    
        # Left over
        if output_sequence:
            out.append({key: output_sequence})
    
    
    def check_and_chunk(arr: list, limit):
        out = []
        for dict_object in arr:
            for key, seq in dict_object.items():
                split(key, seq, limit, out)
        return out
    
    
    payload: list[dict] = [
        {"data1": [1, 2, 3, 4]},
        {"data2": [8, 9, 10]},
        {"data3": [1, 2, 3, 4, 5, 6, 7]},
        {"data4": list(range(20))},
    ]
    
    pprint.pprint(check_and_chunk(payload, MAX_SIZE))
    

    Here is the output.

    DEBUG:root:dict_binary=b'{"data1": [1]}', len=14
    DEBUG:root:dict_binary=b'{"data1": [1, 2]}', len=17
    DEBUG:root:dict_binary=b'{"data1": [1, 2, 3]}', len=20
    DEBUG:root:dict_binary=b'{"data1": [1, 2, 3, 4]}', len=23
    DEBUG:root:dict_binary=b'{"data2": [8]}', len=14
    DEBUG:root:dict_binary=b'{"data2": [8, 9]}', len=17
    DEBUG:root:dict_binary=b'{"data2": [8, 9, 10]}', len=21
    DEBUG:root:dict_binary=b'{"data3": [1]}', len=14
    DEBUG:root:dict_binary=b'{"data3": [1, 2]}', len=17
    DEBUG:root:dict_binary=b'{"data3": [1, 2, 3]}', len=20
    DEBUG:root:dict_binary=b'{"data3": [1, 2, 3, 4]}', len=23
    DEBUG:root:dict_binary=b'{"data3": [1, 2, 3, 4, 5]}', len=26
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data3": [5]}', len=14
    DEBUG:root:dict_binary=b'{"data3": [5, 6]}', len=17
    DEBUG:root:dict_binary=b'{"data3": [5, 6, 7]}', len=20
    DEBUG:root:dict_binary=b'{"data4": [0]}', len=14
    DEBUG:root:dict_binary=b'{"data4": [0, 1]}', len=17
    DEBUG:root:dict_binary=b'{"data4": [0, 1, 2]}', len=20
    DEBUG:root:dict_binary=b'{"data4": [0, 1, 2, 3]}', len=23
    DEBUG:root:dict_binary=b'{"data4": [0, 1, 2, 3, 4]}', len=26
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [4]}', len=14
    DEBUG:root:dict_binary=b'{"data4": [4, 5]}', len=17
    DEBUG:root:dict_binary=b'{"data4": [4, 5, 6]}', len=20
    DEBUG:root:dict_binary=b'{"data4": [4, 5, 6, 7]}', len=23
    DEBUG:root:dict_binary=b'{"data4": [4, 5, 6, 7, 8]}', len=26
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [8]}', len=14
    DEBUG:root:dict_binary=b'{"data4": [8, 9]}', len=17
    DEBUG:root:dict_binary=b'{"data4": [8, 9, 10]}', len=21
    DEBUG:root:dict_binary=b'{"data4": [8, 9, 10, 11]}', len=25
    DEBUG:root:dict_binary=b'{"data4": [8, 9, 10, 11, 12]}', len=29
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [12]}', len=15
    DEBUG:root:dict_binary=b'{"data4": [12, 13]}', len=19
    DEBUG:root:dict_binary=b'{"data4": [12, 13, 14]}', len=23
    DEBUG:root:dict_binary=b'{"data4": [12, 13, 14, 15]}', len=27
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [15]}', len=15
    DEBUG:root:dict_binary=b'{"data4": [15, 16]}', len=19
    DEBUG:root:dict_binary=b'{"data4": [15, 16, 17]}', len=23
    DEBUG:root:dict_binary=b'{"data4": [15, 16, 17, 18]}', len=27
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [18]}', len=15
    DEBUG:root:dict_binary=b'{"data4": [18, 19]}', len=19
    [{'data1': [1, 2, 3, 4]},
     {'data2': [8, 9, 10]},
     {'data3': [1, 2, 3, 4]},
     {'data3': [5, 6, 7]},
     {'data4': [0, 1, 2, 3]},
     {'data4': [4, 5, 6, 7]},
     {'data4': [8, 9, 10, 11]},
     {'data4': [12, 13, 14]},
     {'data4': [15, 16, 17]},
     {'data4': [18, 19]}]
    

    Notes

    • I use the logging library for debug output. To turn off debugging, replace logging.DEBUG with logging.WARNING.
    • I modified the signature of check_and_chunk to take the size limit as a parameter instead of relying on the global variable.
    • I use the deque data structure, which behaves like a list but with O(1) insert/remove from the left.
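
    The deque point in the last note can be seen in a tiny sketch (the variable names here are illustrative):

    ```python
    from collections import deque

    d = deque([1, 2, 3])
    first = d.popleft()   # O(1); list.pop(0) would shift every remaining element, O(n)
    d.appendleft(0)       # O(1) prepend; list.insert(0, ...) is likewise O(n)
    print(first, list(d))
    ```

    Since split pops from the front of input_sequence on every iteration, a plain list would make the loop quadratic in the sequence length.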