
My code involves slicing into 432x432x400 arrays a total of ~10 million times to generate batches of data for neural network training. As these are fairly large arrays (~75 million data points / ~300 MB), I was hoping to speed this up using CuPy (and perhaps also speed up training by generating the data on the same GPU used for training), but found it actually made the code about 5x slower.

Is this expected behaviour due to CuPy overheads or am I missing something?

Code to reproduce:

import cupy as cp
import numpy as np
import timeit
cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
np_arr = np.zeros((432, 432, 400), dtype=np.float32)

# numbers below are representative of my code
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120]'
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120]'

print(timeit.timeit(cp_code, number=8192*4, globals=globals()))  # 0.122 seconds
print(timeit.timeit(np_code, number=8192*4, globals=globals()))  # 0.027 seconds

Setup:

  • GPU: NVIDIA Quadro P4000

  • CuPy Version: 7.3.0

  • OS: CentOS Linux 7

  • CUDA Version: 10.1

  • cuDNN Version: 7.6.5


Answers


  1. I also confirmed that slicing is about 5x slower in CuPy, though there is a more precise way to measure the time (see e.g. https://github.com/cupy/cupy/pull/2740); a sketch of such a measurement follows below.
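
    For reference, here is a minimal sketch of one way to measure more precisely, bracketing the loop with CUDA events and synchronizing before reading the clock (the exact utility added in the linked PR may differ from this):

    import cupy as cp

    cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)

    start = cp.cuda.Event()
    end = cp.cuda.Event()

    start.record()
    for _ in range(8192 * 4):
        arr2 = cp_arr[100:120, 100:120, 100:120]
    end.record()
    end.synchronize()  # wait for all queued GPU work before reading the timer

    print(cp.cuda.get_elapsed_time(start, end), 'ms')  # elapsed time measured on the GPU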

    The size of the array does not matter, because slice operations do not copy the data but create views. The result is similar with the following:

    cp_arr = cp.zeros((4, 4, 4), dtype=cp.float32)
    cp_code = 'arr2 = cp_arr[1:3, 1:3, 1:3]'
    

    It is natural that “take the slice, then send it to the GPU” is faster, because it reduces the number of bytes to be transferred. Consider doing so if slicing is the first preprocessing step; a sketch follows below.
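
    A minimal sketch of that idea, assuming the source data starts out on the host as a NumPy array (the array and indices here just mirror the question):

    import cupy as cp
    import numpy as np

    np_arr = np.zeros((432, 432, 400), dtype=np.float32)

    # Slice on the CPU first (a cheap view), then transfer only the small
    # chunk to the GPU instead of the full ~300 MB array.
    chunk = np_arr[100:120, 100:120, 100:120]
    gpu_chunk = cp.asarray(chunk)  # copies ~32 KB from host to device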

  2. Slicing in NumPy and CuPy does not actually copy the data anywhere; it simply returns a new array where the data is the same, but with its pointer offset to the first element of the new slice and an adjusted shape. Note below how both the original array and the slice have the same strides:

    In [1]: import cupy as cp
    
    In [2]: a = cp.zeros((432, 432, 400), dtype=cp.float32)
    
    In [3]: b = a[100:120, 100:120, 100:120]
    
    In [4]: a.strides
    Out[4]: (691200, 1600, 4)
    
    In [5]: b.strides
    Out[5]: (691200, 1600, 4)
    

    The same can be verified by replacing CuPy with NumPy; a quick pointer-offset check is sketched below.
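
    As a quick check (a sketch; attribute access differs slightly between the two libraries), the slice's data pointer is simply an offset into the original buffer, and NumPy confirms the two arrays share memory:

    import cupy as cp
    import numpy as np

    a = cp.zeros((432, 432, 400), dtype=cp.float32)
    b = a[100:120, 100:120, 100:120]
    # Byte offset of the slice into the original GPU allocation:
    # 100*691200 + 100*1600 + 100*4 = 69280400
    print(b.data.ptr - a.data.ptr)

    a_np = np.zeros((432, 432, 400), dtype=np.float32)
    b_np = a_np[100:120, 100:120, 100:120]
    print(np.shares_memory(a_np, b_np))  # True: the slice is a view, not a copy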

    If you want to time the actual slicing operation, the most reliable way is to add a .copy() to each operation, thus forcing the memory to actually be accessed and copied:

    cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120].copy()'  # 0.771 seconds
    np_code = 'arr2 = np_arr[100:120, 100:120, 100:120].copy()'  # 0.154 seconds
    

    Unfortunately, for the case above the memory access pattern is bad for GPUs: the small chunks cannot saturate the memory channels, so it is still slower than NumPy. However, CuPy can be much faster when the chunks are large enough to get close to memory-channel saturation, for example:

    cp_code = 'arr2 = cp_arr[:, 100:120, 100:120].copy()'  # 0.786 seconds
    np_code = 'arr2 = np_arr[:, 100:120, 100:120].copy()'  # 2.911 seconds
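
    A quick look at the chunk sizes involved (using the same arrays as above) shows the difference between the two cases:

    small = cp_arr[100:120, 100:120, 100:120]
    large = cp_arr[:, 100:120, 100:120]
    print(small.nbytes)  # 32,000 bytes (~31 KiB) per copy -- too small to keep the GPU busy
    print(large.nbytes)  # 691,200 bytes (~675 KiB) per copy -- much closer to saturating memory bandwidth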
    