
I am using NumPy to do some downsampling on a point cloud file. As part of this process I use np.unique to get the unique values and their counts from the array. Mind you, this is a very large array, with around 36 million 3-D points.
Is there a more efficient alternative, or another data structure I should switch to, that does exactly what np.unique is doing here but faster? Right now it takes around 300 seconds. I understand that unique is already optimized, but is there an alternate data structure I could use to get a better result? The array just contains points in x, y, z format.

I am attaching a snippet of my code below. Thanks in advance for the help, and please feel free to ask for any other information you might need. I tried changing the precision level, with no effect. I am using numpy 1.24.4 on Ubuntu 22.04.

import numpy as np
import time

points = np.random.random(size = (38867362,3))*10000

# print(points)
# print("points bytes:",points.nbytes)

start_time = time.time()
# snap the points to a 3-unit voxel grid, then collect the unique voxels with counts
voxels = ((points - np.min(points, axis=0)) // 3).astype(int)
unique_points, inverse, counts = np.unique(voxels, axis=0, return_inverse=True, return_counts=True)
print("::INFO:: Total time taken: ", time.time()-start_time)

and so on.

2 Answers


  1. You can try creating a set from your data. The elements of a set won't contain any duplicates.

    data = [1,1,2,3]
    s = set(data)
    print(len(s))
    

    However, I have not tested whether this is faster than your approach. Also, this won't give you the other return values of np.unique().

    Edit

    As the OP gave an example, I measured the runtime for (1) np.unique, (2) the conversion to tuples, and (3) set():

    
    import numpy as np
    import time
    
    points = np.random.random(size = (38867362, 3)) * 10000
    points = ((points - np.min(points, axis=0)) // 3).astype(int)
    
    start_time = time.time()
    unique_points, inverse, counts = np.unique(points, axis=0, return_inverse=True, return_counts=True)
    print("::INFO:: Total time taken np.unique: ", time.time() - start_time)
    
    start_time = time.time()
    points = tuple(map(tuple, points))
    print("::INFO:: Total time tuple conversion: ", time.time() - start_time)
    
    start_time = time.time()
    unique_points = set(points)
    print("::INFO:: Total time taken set: ", time.time() - start_time)
    
    

    with output

    ::INFO:: Total time taken np.unique:  115.49143147468567
    ::INFO:: Total time tuple conversion:  36.052345991134644
    ::INFO:: Total time taken set:  9.044668436050415
    

    So set() is indeed an option, if the other return values can be dropped.
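
    If the counts are still needed, a small variation (just a sketch, not benchmarked on the full-size array) is to feed the same tuples into collections.Counter, which yields the unique points together with their counts in a single pass:

    from collections import Counter
    import numpy as np

    points = ((np.random.random(size=(100000, 3)) * 10000) // 3).astype(int)

    # Counter over the tuple-converted rows: keys are the unique points,
    # values are their occurrence counts (comparable to return_counts=True)
    counter = Counter(map(tuple, points))
    unique_points = np.array(list(counter.keys()))
    counts = np.array(list(counter.values()))
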

  2. Here is a version that uses pandas DataFrame.drop_duplicates. This is nice since it does not need the tuple conversion that set() requires.

    import numpy as np
    import time
    import pandas as pd
    
    points = np.random.random(size = (38867362, 3)) * 10000
    points = ((points - np.min(points, axis=0)) // 3).astype(int)
    
    df = pd.DataFrame(points)
    
    start_time = time.time()
    df.drop_duplicates()
    print("::INFO:: Total time taken by drop_duplicates: ", time.time() - start_time)
    

    output is

    ::INFO:: Total time taken by drop_duplicates:  10.315263986587524
    

    which is about 4 times faster than the tuple conversion plus set()!

    Note that df.drop_duplicates(...).index is a subset of the input indices and is thus directly comparable to the original rows.
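
    If the counts or a per-row label (similar to return_inverse) are also needed, a rough sketch (not benchmarked here) is to use groupby over all columns instead of drop_duplicates; note that, unlike np.unique, the groups below are kept in order of first appearance rather than sorted:

    import numpy as np
    import pandas as pd

    points = ((np.random.random(size=(100000, 3)) * 10000) // 3).astype(int)
    df = pd.DataFrame(points)

    grouped = df.groupby(list(df.columns), sort=False)
    counts = grouped.size()                        # one count per unique (x, y, z) voxel
    inverse = grouped.ngroup().to_numpy()          # group label per input row, similar to return_inverse
    unique_points = counts.index.to_frame(index=False).to_numpy()
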
