
I have a list of dicts with a consistent structure, where each dict maps a key to a list of integers. However, I need to make sure each dict, when serialized to a JSON string, has a byte size below a specified threshold.

If a dict exceeds that byte-size threshold, I need to split its integer list into chunks.

Attempt:


import json

payload: list[dict] = [
    {"data1": [1,2,3,4]},
    {"data2": [8,9,10]},
    {"data3": [1,2,3,4,5,6,7]}
]

# Max size in bytes we can allow. This is static and a hard limit that is not variable.
MAX_SIZE: int = 25

def check_and_chunk(arr: list):

    def check_size_bytes(item) -> bool:
        # True when the JSON-serialized item exceeds the byte limit
        return len(json.dumps(item).encode("utf-8")) > MAX_SIZE

    def chunk(item, chunk_size: int = 2):
        # Yield consecutive slices of `item`, each `chunk_size` elements long
        for i in range(0, len(item), chunk_size):
            yield item[i:i + chunk_size]

    # First check whether the entire payload is already under MAX_SIZE
    if not check_size_bytes(arr):
        return arr

    # Partition the payload into items that are small enough and items that are too big
    small, big = [], []
    for item in arr:
        (big if check_size_bytes(item) else small).append(item)

    # Modify the big items until they are small enough to be moved to `small`
    for i in big:
        print(i)
    # This is where I am unsure how best to proceed. I'd like to split each
    # dict in `big` into pieces small enough that every resulting element
    # belongs in `small`.

Example of a possible desired result:

payload: list[dict] = [
    {"data1": [1,2,3,4]},
    {"data2": [8,9,10]},
    {"data3": [1,2,3,4]},
    {"data3": [5,6,7]}
]

2 Answers


  1. IIUC, you can use a generator to yield chunks of the right size:

    import json
    
    payload = [
        {"data1": [1, 2, 3, 4]},
        {"data2": [8, 9, 10]},
        {"data3": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]},
        {"data4": [100, 200, -1, -10, 200, 300, 12, 13]},
    ]
    
    MAX_SIZE = 25
    
    
    def get_chunks(lst):
        if len(lst) < 2:
            # `return lst` inside a generator yields nothing; yield instead
            if lst:
                yield lst
            return
    
        curr, curr_len = [], 0
        for v in lst:
            s = str(v)
            # current length of all numbers + length of current number + number of `, ` + `[]`
            if curr_len + len(s) + 2 * len(curr) + 2 > MAX_SIZE:
                yield curr
                curr = [v]
                curr_len = len(s)
            else:
                curr.append(v)
                curr_len += len(s)
    
        if curr:
            yield curr
    
    
    for d in payload:
        for k, v in d.items():
            for chunk in get_chunks(v):
                d = {k: chunk}
                print(f"{str(d):<40} {len(json.dumps(chunk).encode())=:<30}")
    

    Prints:

    {'data1': [1, 2, 3, 4]}                  len(json.dumps(chunk).encode())=12                            
    {'data2': [8, 9, 10]}                    len(json.dumps(chunk).encode())=10                            
    {'data3': [1, 2, 3, 4, 5, 6, 7, 8]}      len(json.dumps(chunk).encode())=24                            
    {'data3': [9, 10, 11, 12]}               len(json.dumps(chunk).encode())=15                            
    {'data4': [100, 200, -1, -10, 200]}      len(json.dumps(chunk).encode())=24                            
    {'data4': [300, 12, 13]}                 len(json.dumps(chunk).encode())=13                            
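
    The loop above only prints each chunked dict. To actually rebuild the payload, collect the chunks into a new list. Here is a minimal self-contained sketch; it uses a simpler (but less efficient, since it re-serializes on every step) chunker that measures the full serialized dict, key included, which matches the question's threshold directly. The helper name `get_chunks` and the split points are assumptions of this sketch:

    ```python
    import json

    MAX_SIZE = 25

    def get_chunks(lst, key):
        # Greedily grow the current chunk while the full serialized dict
        # (key included) stays within MAX_SIZE bytes
        curr = []
        for v in lst:
            if curr and len(json.dumps({key: curr + [v]}).encode("utf-8")) > MAX_SIZE:
                yield curr
                curr = [v]
            else:
                curr.append(v)
        if curr:
            yield curr

    payload = [
        {"data1": [1, 2, 3, 4]},
        {"data2": [8, 9, 10]},
        {"data3": [1, 2, 3, 4, 5, 6, 7]},
    ]

    new_payload = [
        {k: chunk}
        for d in payload
        for k, v in d.items()
        for chunk in get_chunks(v, k)
    ]
    print(new_payload)
    ```

    Re-serializing the candidate chunk on every element is O(n) per append; for long lists the incremental length estimate in the answer above is the cheaper approach.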
    
  2. My approach starts with the list of integers: I take one element at a time out of the existing list (which I call input_sequence) and place it into a new list (output_sequence) until I go over the length limit, at which point I back up one element and emit the chunk.

    import json
    import logging
    import pprint
    from collections import deque
    
    logging.basicConfig(level=logging.DEBUG)
    
    MAX_SIZE: int = 25
    
    
    def split(key, input_sequence, limit, out):
        """Split the `input_sequence` into several smaller ones.
    
        The result will be appended to the `out` list.    
        """
        input_sequence = deque(input_sequence)
        output_sequence = []
        
        while input_sequence:
            # Move an element from input_sequence to output_sequence
            element = input_sequence.popleft()
            output_sequence.append(element)
    
            # Build the dictionary in bytes
            dict_str = json.dumps({key: output_sequence})
            dict_binary = dict_str.encode("utf-8")
            actual_length = len(dict_binary)
            logging.debug("dict_binary=%r, len=%r", dict_binary, actual_length)
    
            # If the length is over the limit, then back off one element
            # And produce the result
            if actual_length > limit:
                logging.debug("Over the limit")
                output_sequence.pop()
                if not output_sequence:
                    # A lone element already exceeds the limit; fail fast
                    # instead of looping forever on the same element
                    raise ValueError(
                        f"single element {element!r} exceeds the {limit}-byte limit"
                    )
                input_sequence.appendleft(element)
                out.append({key: output_sequence})
                output_sequence = []
    
        # Left over
        if output_sequence:
            out.append({key: output_sequence})
    
    
    def check_and_chunk(arr: list, limit):
        out = []
        for dict_object in arr:
            for key, seq in dict_object.items():
                split(key, seq, limit, out)
        return out
    
    
    payload: list[dict] = [
        {"data1": [1, 2, 3, 4]},
        {"data2": [8, 9, 10]},
        {"data3": [1, 2, 3, 4, 5, 6, 7]},
        {"data4": list(range(20))},
    ]
    
    pprint.pprint(check_and_chunk(payload, MAX_SIZE))
    

    Here is the output.

    DEBUG:root:dict_binary=b'{"data1": [1]}', len=14
    DEBUG:root:dict_binary=b'{"data1": [1, 2]}', len=17
    DEBUG:root:dict_binary=b'{"data1": [1, 2, 3]}', len=20
    DEBUG:root:dict_binary=b'{"data1": [1, 2, 3, 4]}', len=23
    DEBUG:root:dict_binary=b'{"data2": [8]}', len=14
    DEBUG:root:dict_binary=b'{"data2": [8, 9]}', len=17
    DEBUG:root:dict_binary=b'{"data2": [8, 9, 10]}', len=21
    DEBUG:root:dict_binary=b'{"data3": [1]}', len=14
    DEBUG:root:dict_binary=b'{"data3": [1, 2]}', len=17
    DEBUG:root:dict_binary=b'{"data3": [1, 2, 3]}', len=20
    DEBUG:root:dict_binary=b'{"data3": [1, 2, 3, 4]}', len=23
    DEBUG:root:dict_binary=b'{"data3": [1, 2, 3, 4, 5]}', len=26
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data3": [5]}', len=14
    DEBUG:root:dict_binary=b'{"data3": [5, 6]}', len=17
    DEBUG:root:dict_binary=b'{"data3": [5, 6, 7]}', len=20
    DEBUG:root:dict_binary=b'{"data4": [0]}', len=14
    DEBUG:root:dict_binary=b'{"data4": [0, 1]}', len=17
    DEBUG:root:dict_binary=b'{"data4": [0, 1, 2]}', len=20
    DEBUG:root:dict_binary=b'{"data4": [0, 1, 2, 3]}', len=23
    DEBUG:root:dict_binary=b'{"data4": [0, 1, 2, 3, 4]}', len=26
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [4]}', len=14
    DEBUG:root:dict_binary=b'{"data4": [4, 5]}', len=17
    DEBUG:root:dict_binary=b'{"data4": [4, 5, 6]}', len=20
    DEBUG:root:dict_binary=b'{"data4": [4, 5, 6, 7]}', len=23
    DEBUG:root:dict_binary=b'{"data4": [4, 5, 6, 7, 8]}', len=26
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [8]}', len=14
    DEBUG:root:dict_binary=b'{"data4": [8, 9]}', len=17
    DEBUG:root:dict_binary=b'{"data4": [8, 9, 10]}', len=21
    DEBUG:root:dict_binary=b'{"data4": [8, 9, 10, 11]}', len=25
    DEBUG:root:dict_binary=b'{"data4": [8, 9, 10, 11, 12]}', len=29
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [12]}', len=15
    DEBUG:root:dict_binary=b'{"data4": [12, 13]}', len=19
    DEBUG:root:dict_binary=b'{"data4": [12, 13, 14]}', len=23
    DEBUG:root:dict_binary=b'{"data4": [12, 13, 14, 15]}', len=27
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [15]}', len=15
    DEBUG:root:dict_binary=b'{"data4": [15, 16]}', len=19
    DEBUG:root:dict_binary=b'{"data4": [15, 16, 17]}', len=23
    DEBUG:root:dict_binary=b'{"data4": [15, 16, 17, 18]}', len=27
    DEBUG:root:Over the limit
    DEBUG:root:dict_binary=b'{"data4": [18]}', len=15
    DEBUG:root:dict_binary=b'{"data4": [18, 19]}', len=19
    [{'data1': [1, 2, 3, 4]},
     {'data2': [8, 9, 10]},
     {'data3': [1, 2, 3, 4]},
     {'data3': [5, 6, 7]},
     {'data4': [0, 1, 2, 3]},
     {'data4': [4, 5, 6, 7]},
     {'data4': [8, 9, 10, 11]},
     {'data4': [12, 13, 14]},
     {'data4': [15, 16, 17]},
     {'data4': [18, 19]}]
    

    Notes

    • I use the logging library for debug output. To turn off debugging, replace logging.DEBUG with logging.WARNING.
    • I modified the signature of check_and_chunk to take the size limit as a parameter instead of relying on the global variable.
    • I use the deque data structure, which behaves like a list but with O(1) insert/remove from the left.
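
    The deque point in the last note can be seen in a tiny sketch (the variable names here are illustrative):

    ```python
    from collections import deque

    d = deque([1, 2, 3])
    first = d.popleft()   # O(1); list.pop(0) would shift every remaining element, O(n)
    d.appendleft(0)       # O(1) prepend; list.insert(0, ...) is likewise O(n)
    print(first, list(d))
    ```

    Since split pops from the front of input_sequence on every iteration, a plain list would make the loop quadratic in the sequence length.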