
I am training a YOLOv5 "L6" model for an important project. I have a very large dataset of UAV and drone images, and I need to train with a large input size. (A few months ago I trained the "M" model at 640×640 on an RTX 3060.) That model performed unevenly: detection of some categories (vehicles, landing areas, etc.) is really good, but when it comes to small objects such as humans, the model gets stuck and confused. So I decided to train at 1280×1280, and a month ago I bought an RTX 3090 Ti. I run my code in WSL 2, which is fully configured for DL/ML.

The point is that when I run any YOLOv5 model with an input size larger than 640×640, I get the error below. In this example I ran the "M6" model with batch size 8 and a 1280×1280 input size, and VRAM usage was around 12 GB, so it is not just a problem with the larger models. It also does not look like an ordinary out-of-memory error: when I tried the "L6" model with batch size 16 and a 1280×1280 input size, VRAM usage went above 24 GB and it crashed instantly with a CUDA out-of-memory error that, as usual, reported the failed allocation. The error below is different.
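For reference, the run that produces the error below is roughly this configuration, expressed through YOLOv5's Python entry point; the dataset YAML name is a placeholder for my own data config:

```
# Rough equivalent of the failing run, assuming the standard ultralytics/yolov5
# repo layout. The dataset YAML is a placeholder for my UAV/drone data config.
import train  # yolov5/train.py

train.run(
    weights="yolov5m6.pt",    # "M6" checkpoint, pretrained at 1280
    data="uav_dataset.yaml",  # placeholder dataset config
    imgsz=1280,               # 1280x1280 input size
    batch_size=8,             # ~12 GB of VRAM on the RTX 3090 Ti
    device="0",               # single GPU
)
```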

  File "/mnt/d/Ubuntu-WSL-Workspace/Code_Space/Code Workspace/Python Projects/AI Workspace/Teknofest-AI-in-T/2023YOLOV5/Last-YOLOV5/yolov5/train.py", line 640, in <module>
    main(opt)
  File "/mnt/d/Ubuntu-WSL-Workspace/Code_Space/Code Workspace/Python Projects/AI Workspace/Teknofest-AI-in-T/2023YOLOV5/Last-YOLOV5/yolov5/train.py", line 529, in main
    train(opt.hyp, opt, device, callbacks)
  File "/mnt/d/Ubuntu-WSL-Workspace/Code_Space/Code Workspace/Python Projects/AI Workspace/Teknofest-AI-in-T/2023YOLOV5/Last-YOLOV5/yolov5/train.py", line 352, in train
    results, maps, _ = validate.run(data_dict,
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/d/Ubuntu-WSL-Workspace/Code_Space/Code Workspace/Python Projects/AI Workspace/Teknofest-AI-in-T/2023YOLOV5/Last-YOLOV5/yolov5/val.py", line 198, in run
    for batch_i, (im, targets, paths, shapes) in enumerate(pbar):
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/mnt/d/Ubuntu-WSL-Workspace/Code_Space/Code Workspace/Python Projects/AI Workspace/Teknofest-AI-in-T/2023YOLOV5/Last-YOLOV5/yolov5/utils/dataloaders.py", line 172, in __iter__
    yield next(self.iterator)
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in do_one_step
    data = pin_memory(data, device)
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 67, in pin_memory
    return [pin_memory(sample, device) for sample in data]  # Backwards compatibility.
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 67, in <listcomp>
    return [pin_memory(sample, device) for sample in data]  # Backwards compatibility.
  File "/home/yigit-ai-dev/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
    return data.pin_memory(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

3 Answers

  1. Chosen as BEST ANSWER

    I solved the problem by going back to a physical Linux installation. In my case I think the problem was WSL: the machine has to reserve resources for both Linux and Windows, so the computing power and memory available for training are more limited.


  2. I can think of a few ways to get your training to start:

    1. Decrease your batch size.
    2. You say you have an RTX 3060 and an RTX 3090. If your computer can handle both, use them both during training.
    3. Convert your data to fp16 precision.
    4. Crop your images and train on the cropped data; that shouldn’t hurt accuracy much (see the tiling sketch below).
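
    For point 4, here is a minimal tiling sketch for cutting large UAV frames into smaller crops before training. The paths, tile size, and overlap are placeholder assumptions, and the matching YOLO label files would have to be clipped and remapped to each tile in the same pass (omitted here):

```
# Minimal tiling sketch (assumption: plain .jpg frames on disk; label
# remapping is omitted). Tile size, overlap, and paths are placeholders.
from pathlib import Path
from PIL import Image

def tile_image(img_path: Path, out_dir: Path, tile: int = 1280, overlap: int = 256) -> None:
    """Cut one large frame into overlapping tile x tile crops."""
    img = Image.open(img_path)
    w, h = img.size
    step = tile - overlap
    out_dir.mkdir(parents=True, exist_ok=True)
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            box = (x, y, min(x + tile, w), min(y + tile, h))
            img.crop(box).save(out_dir / f"{img_path.stem}_{x}_{y}.jpg")

# Hypothetical folders: raw frames in, tiled crops out.
for p in Path("datasets/uav/images_raw").glob("*.jpg"):
    tile_image(p, Path("datasets/uav/images_tiled"))
```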
  3. It may be related to WSL2, which both prevents you from using most of your system’s RAM and constrains the memory available to a single application; that is one of WSL’s known limitations.

    See the NVIDIA CUDA on WSL user guide regarding these limitations:
    https://docs.nvidia.com/cuda/wsl-user-guide/index.html

    "Pinned system memory (example: System memory that an application makes resident for GPU accesses) availability for applications is limited."

    "For example, some deep learning training workloads, depending on the framework, model and dataset size used, can exceed this limit and may not work."

    Regarding how to fix this problem, the following thread provides some advice:

    https://github.com/huggingface/diffusers/issues/807

    Setting a higher limit on the system RAM available to WSL and updating it may help you make fuller use of your hardware resources.

    Modify the .wslconfig file to allow WSL a larger amount of system memory, and run wsl --update to update WSL itself.
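
    For example, a .wslconfig in your Windows user profile (%UserProfile%\.wslconfig) could look like the snippet below. The numbers are placeholders to be sized to your own machine; by default WSL2 caps memory at a fraction of the host RAM:

```
# Placeholder limits - adjust to your hardware, then run `wsl --shutdown`
# from Windows so the new settings apply on the next WSL start.
[wsl2]
memory=48GB
processors=12
swap=16GB
```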
