
When the VGG16 transfer-learning model is being fitted, TensorFlow throws an error. Every image patch is 224×224 RGB, which is about 602 KB as float32, so the maximum batch size should be roughly 5,000 by the formula vRAM / patch size / 4. The GPU is an RTX 4070 Super with 12 GB of vRAM, running Ubuntu 24.
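
As a rough sanity check, here is that back-of-the-envelope estimate written out (this just reproduces the rule of thumb above; it is not an exact limit):

    patch_bytes = 224 * 224 * 3 * 4              # one float32 patch: 602,112 bytes (~602 KB)
    vram_bytes = 12 * 1024**3                    # 12 GB of vRAM
    max_batch = vram_bytes // patch_bytes // 4   # vRAM / patch size / 4 ≈ 5,349
    print(patch_bytes, max_batch)                # 602112 5349

Here is the error thrown during model.fit():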

    2024-09-23 05:49:25.162827: I external/local_tsl/tsl/framework/bfc_allocator.cc:1112] Sum Total of in-use chunks: 3.76GiB
2024-09-23 05:49:25.162840: I external/local_tsl/tsl/framework/bfc_allocator.cc:1114] Total bytes in pool: 10671489024 memory_limit_: 10671489024 available bytes: 0 curr_region_allocation_bytes_: 21342978048
2024-09-23 05:49:25.162860: I external/local_tsl/tsl/framework/bfc_allocator.cc:1119] Stats: 
Limit:                     10671489024
InUse:                      4039777024
MaxInUse:                   4091143936
NumAllocs:                         215
MaxAllocSize:               3507001344
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2024-09-23 05:49:25.162904: W external/local_tsl/tsl/framework/bfc_allocator.cc:499] ***************************************_____________________________________________________________
Traceback (most recent call last):
  File "/home/aiworker9/code/py/aimodels/common/prepare_data_vgg16.py", line 363, in <module>
    scores, history = model_fit_image_label_array(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aiworker9/code/py/aimodels/common/prepare_data_vgg16.py", line 260, in model_fit_image_label_array
    history = model.fit(
              ^^^^^^^^^^
  File "/home/aiworker9/code/py/myenv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/aiworker9/code/py/myenv/lib/python3.12/site-packages/tensorflow/python/framework/constant_op.py", line 108, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

Here is what I have tried so far to solve the issue:

  1. Reduced the batch size several times, down to 1, for both model.fit() and dataset.batch().

  2. Added some code for GPU memory management:

    import os
    import tensorflow as tf
    from tensorflow.keras import mixed_precision

    # Use the asynchronous CUDA allocator to reduce memory fragmentation
    # (must be set before TensorFlow initializes the GPU).
    os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

    # Clear any GPU state held over from a previous Keras session.
    tf.keras.backend.clear_session()

    # Reduce memory footprint of XLA convolution autotuning (NVIDIA GPUs).
    os.environ['XLA_FLAGS'] = '--xla_gpu_strict_conv_algorithm_picker=false'

    # Mixed precision roughly halves activation memory.
    mixed_precision.set_global_policy('mixed_float16')

    # Allocate GPU memory on demand instead of reserving it all up front.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        try:
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError as e:
            print(e)
    

Here is the code for model.fit():

def model_fit_image_label_array(model, train_images, train_label_user_id, train_label_binary, test_images, test_label_user_id, test_label_binary, epochs=10):

    # Set up callbacks
    earlyStopping = EarlyStopping(monitor='val_loss', patience=3, verbose=0, mode='min')  # Stop early when val_loss stops improving
    mcp_save = ModelCheckpoint('best_weights.keras', save_best_only=True, monitor='val_loss', mode='min')  # Save the best weights
    reduce_lr_loss = ReduceLROnPlateau(monitor='val_loss', factor=0.4, patience=7, verbose=1, min_delta=1e-4, mode='auto')  # Reduce the learning rate when val_loss plateaus

    # Train the model. epochs=1 because each epoch takes a long time; reaches about 99.6% accuracy
    history = model.fit(
        [train_images, train_label_user_id],
        [train_label_user_id, train_label_binary],
        batch_size=BATCH_SIZE,  # 1
        epochs=epochs,
        validation_data=([test_images, test_label_user_id], [test_label_user_id, test_label_binary]),
        callbacks=[earlyStopping, mcp_save, reduce_lr_loss]
    )

    # Evaluate the model
    eval_results = model.evaluate(
        [test_images, test_label_user_id],
        [test_label_user_id, test_label_binary],
        verbose=0
    )

    # Unpack based on the number of outputs and metrics
    print(f'\nmodel.fit history keys : {history.history.keys()}')
    print(f'\neval_results : {eval_results}')

    return eval_results, history
  3. Ran nvidia-smi. Here is the output from seconds before the crash:

[Screenshot: nvidia-smi output taken seconds before the crash]

I am kind of running out of ideas. What else can be done to solve the GPU memory issue? Is it possible that somehow the whole dataset was loaded into GPU memory, instead of only BATCH_SIZE (2 here) samples at a time as specified in the code?

UPDATE: portion of model summary:

│ (Embedding)         │                  │           │                   │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ dropout (Dropout)   │ (None, 256)      │         0 │ dense[0][0]       │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ flatten_1 (Flatten) │ (None, 150)      │         0 │ embedding[0][0]   │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ concatenate         │ (None, 406)      │         0 │ dropout[0][0],    │
│ (Concatenate)       │                  │           │ flatten_1[0][0]   │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ user_output (Dense) │ (None, 3)        │     1,221 │ concatenate[0][0] │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ binary_output       │ (None, 2)        │       814 │ concatenate[0][0] │
│ (Dense)             │                  │           │                   │
└─────────────────────┴──────────────────┴───────────┴───────────────────┘
 Total params: 21,139,657 (80.64 MB)
 Trainable params: 6,424,969 (24.51 MB)
 Non-trainable params: 14,714,688 (56.13 MB)

Here is how the dataset is created:

def return_dataset_image_embedding(data, window_size=(224, 224), step_size=112, shuffle_buffer_size=1000, prefetch_buffer_size=1):
    print("\nBatch size: ", BATCH_SIZE)
    print("\nShuffle buffer size: ", shuffle_buffer_size)
    print("\nPrefetch buffer size: ", prefetch_buffer_size)

    # Example DataFrame
    df = pd.DataFrame(data)

    image_paths = './data/' + df['image_id'].values
    label_user_ids = df['label_user_id'].values
    label_binary_flags = df['label_binary_flag'].values

    image_patches, patch_label_user_ids, patch_label_binary_flags = preprocess_image_patches_image_embedding(
        image_paths, label_user_ids, label_binary_flags, window_size, step_size)

    # Create TensorFlow dataset
    patch_dataset = tf.data.Dataset.from_tensor_slices((image_patches, patch_label_user_ids, patch_label_binary_flags))

    # Shuffle, augment, batch, and prefetch the dataset
    patch_dataset = patch_dataset.shuffle(buffer_size=shuffle_buffer_size)        # Shuffle data
    patch_dataset = patch_dataset.map(augment_data_image_embedding)               # Apply augmentation
    patch_dataset = patch_dataset.batch(batch_size=BATCH_SIZE)                    # Create batches
    patch_dataset = patch_dataset.prefetch(buffer_size=prefetch_buffer_size)      # Overlap the input pipeline with training

    return patch_dataset


Answers


  1. Chosen as BEST ANSWER

    The problem was that the tf.data.Dataset.from_generator() pipeline broke somewhere between generating the dataset and feeding it to model.fit() (I don't know exactly how). The fix that worked was to pass the dataset produced by from_generator() into model.fit() immediately after it is created, with no processing in between.
    In doing so, image augmentation now happens during image-patch preparation, before the dataset is generated. I also chain .batch().prefetch() directly onto from_generator() instead of applying them to the dataset variable afterwards (again, the pipeline broke in my case, though I don't know why).

    To avoid OOM on GPU memory, image augmentation is now done with the PIL Image library (which uses CPU memory; another Python imaging library would also work) instead of TensorFlow ops (which allocate GPU memory and likely caused the OOM).
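
    A minimal sketch of that arrangement (the generator, the augmentation step, and the label handling below are illustrative placeholders rather than the original code; it assumes the two-input / two-output model and the BATCH_SIZE constant from the question):

        import numpy as np
        import tensorflow as tf
        from PIL import Image, ImageOps

        def patch_generator(image_paths, user_ids, binary_flags):
            # All augmentation happens here on the CPU with PIL, before anything touches the GPU.
            for path, uid, flag in zip(image_paths, user_ids, binary_flags):
                img = Image.open(path).convert('RGB').resize((224, 224))
                img = ImageOps.mirror(img)                       # example CPU-side augmentation
                patch = np.asarray(img, dtype=np.float32) / 255.0
                # Two inputs (image, user id) and two targets (user id, binary flag).
                yield (patch, uid), (uid, flag)

        # Chain .batch().prefetch() directly onto from_generator() ...
        dataset = tf.data.Dataset.from_generator(
            lambda: patch_generator(image_paths, label_user_ids, label_binary_flags),
            output_signature=(
                (tf.TensorSpec(shape=(224, 224, 3), dtype=tf.float32),
                 tf.TensorSpec(shape=(), dtype=tf.int32)),
                (tf.TensorSpec(shape=(), dtype=tf.int32),
                 tf.TensorSpec(shape=(), dtype=tf.int32)),
            ),
        ).batch(BATCH_SIZE).prefetch(1)

        # ... and feed it straight into model.fit() with no steps in between.
        history = model.fit(dataset, epochs=1)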


  2. Note: You can get more information about each of these suggestions in the official TensorFlow guide – https://www.tensorflow.org/guide. I always find it really helpful.

    Dealing with memory issues is really challenging; you can try the suggested approaches below to identify and solve the problem.

    1. Try Reducing the Batch Size Further
      Since you are already trying a batch size of one, make sure that no other part of your code is increasing the batch size or otherwise adding to the memory load.

    2. Use Mixed Precision Training
      You are already using mixed precision with mixed_precision.set_global_policy('mixed_float16'). Ensure that your model and optimizer are compatible with mixed precision. If you encounter issues, consider disabling mixed precision temporarily to see how it affects memory usage.
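
      A small sketch of keeping the output heads in float32 under the mixed_float16 policy (the layer names come from the model summary in the question; treat this as illustrative, not the original model code):

          from tensorflow.keras import mixed_precision, layers

          mixed_precision.set_global_policy('mixed_float16')

          # dtype='float32' overrides the global policy for these layers only,
          # which is the usual recommendation for output layers under mixed precision.
          user_output_layer = layers.Dense(3, dtype='float32', name='user_output')
          binary_output_layer = layers.Dense(2, dtype='float32', name='binary_output')

          # To rule mixed precision out as the cause, temporarily switch back:
          # mixed_precision.set_global_policy('float32')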

    3. Clear GPU Memory Before Training
      You are already using tf.keras.backend.clear_session() to clear the session. You could call this at the beginning of your training code to free up memory before loading the model.

    4. Limit GPU Memory Growth
      You are already setting memory growth with:

          tf.config.experimental.set_memory_growth(gpu, True)

      It is also a good idea to make sure this code runs before any model processing.

    5. Check for Memory Leaks
      Ensure that your dataset is not loading all images into memory. Check the return_dataset_image_embedding function to make sure it only processes a batch of images at a time.
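
      One way to guarantee that is to build the dataset from file paths and decode images inside .map(), so they are loaded lazily, per batch. A sketch (it reuses the path and label arrays from the question, assumes JPEG files, and skips the patch-extraction step for brevity):

          import tensorflow as tf

          path_ds = tf.data.Dataset.from_tensor_slices((image_paths, label_user_ids, label_binary_flags))

          def load_image(path, user_id, binary_flag):
              # Decode and resize only when the element is actually consumed.
              image = tf.io.read_file(path)
              image = tf.image.decode_jpeg(image, channels=3)
              image = tf.image.resize(image, (224, 224)) / 255.0
              return (image, user_id), (user_id, binary_flag)

          lazy_ds = path_ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)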

    6. Adjust the Prefetch or Buffer Sizes
      You are currently using:

          patch_dataset = patch_dataset.prefetch(buffer_size=prefetch_buffer_size)

      Consider increasing prefetch_buffer_size to improve data pipeline performance.
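
      For example, a one-line variant that lets tf.data pick the buffer size automatically (using the patch_dataset variable from the question):

          patch_dataset = patch_dataset.prefetch(buffer_size=tf.data.AUTOTUNE)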

    7. Profile Memory Usage
      Use the TensorFlow profiling tools to monitor memory usage. You can enable profiling with TensorBoard.
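
      A sketch using the Keras TensorBoard callback (the log directory and batch range are arbitrary choices):

          # Profile batches 2-6 of the first epoch and write logs for TensorBoard.
          tb_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs', profile_batch=(2, 6))
          # Add tb_callback to the callbacks list passed to model.fit(), then inspect
          # the results with: tensorboard --logdir ./logs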

    8. Use tf.data correctly
      For example, use .cache() (typically before .shuffle()) if your dataset fits in memory, which might improve performance.
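
      A one-line sketch using the dataset and buffer-size variables from the question:

          # Cache elements in memory once, then reshuffle them on every epoch.
          patch_dataset = patch_dataset.cache().shuffle(buffer_size=shuffle_buffer_size)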

    Some popular tips:

    • Use nvidia-smi to monitor GPU memory usage while training is running.
    • Try smaller input images. VGG16 expects 224×224 input, but you can still experiment with smaller sizes (see the sketch below).
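
      A sketch of loading the VGG16 base with a smaller input (only possible with include_top=False):

          base = tf.keras.applications.VGG16(include_top=False, weights='imagenet', input_shape=(160, 160, 3))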

    Let me know if this helps; otherwise I can help you fix your code.
