I have a script like this:

    for epoch in range(num_epochs):
        for bag in range(num_bags):
            feats = pd.read_csv(f"feats_{bag}.csv")
            ... # some logic
As you can see, it repeatedly reads data from a set of files: each "feats_{bag}.csv" file is read from disk num_epochs times, which slowed the program down. I preloaded all of the data at once, which helped significantly. In the following script, each "feats_{bag}.csv" is read only once:
    all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]
    for epoch in range(num_epochs):
        for bag in range(num_bags):
            feats = all_feats[bag]
The issue with the above program is memory usage, since it loads all the data at once: the all_feats variable takes roughly 20 GB of memory. I have about 64 GB of memory, so I am limited to executing the program 3 times simultaneously. Since all the runs use the same set of feats, I thought there must be a way to load the data (the all_feats variable) once and use it in all runs simultaneously, allowing more than 3 runs. In other words, I want the all_feats variable to occupy only 20 GB in total while many scripts (that all use all_feats) run at the same time and share it.
I've looked into mmap and Python's multiprocessing.shared_memory. Although both allow sharing data between processes, they seem unsuitable for my problem. For example, with shared_memory I tried the following:
    # SharedMemory server
    from multiprocessing import shared_memory

    all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]
    sl = shared_memory.ShareableList(all_feats, name='all_feats')

    # SharedMemory client
    from multiprocessing import shared_memory

    all_feats = shared_memory.ShareableList(name='all_feats')
    print(id(all_feats), all_feats)
However, after running the server, when I run the client multiple times, the id of all_feats is different on every run, which suggests to me that the clients use different memory locations and thus, again, take up more memory than I intended.
Some other ideas I had for speeding this up:
- Load the feats files into memory using Redis or some other in-memory database, and then use the first approach again, i.e. load only the currently needed feats at each iteration, but this time from Redis:
    for epoch in range(num_epochs):
        for bag in range(num_bags):
            feats = redis.get(f"bag_{bag}")
            ... # some logic
I'm hoping that reading from memory (Redis) instead of from disk gives a reasonable speed boost, although not as much as preloading all the data in Python.
- Divide the feats into chunks and preload them part by part; this won't work.
In summary, I'm looking for a way to use the same variable across different runs of the same (or different) Python scripts, without keeping duplicates of the variable. So putting a file in memory and reading from that file does not help me. All scripts only read the variable and never change it.
Can shared_memory solve this? If so, why did it assign different IDs to the all_feats variable in different runs?
2 Answers
Why not just reverse the order of the loops?
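A sketch of that reordering, reusing the names from the question (the placeholder values are assumptions), might look like this:

    import pandas as pd

    num_bags, num_epochs = 10, 5  # placeholders; use the question's actual values

    for bag in range(num_bags):
        # each feats file is now read from disk exactly once
        feats = pd.read_csv(f"feats_{bag}.csv")
        for epoch in range(num_epochs):
            ... # some logic

This reads each file only once per run, but note that it changes the order in which (epoch, bag) pairs are visited, which may or may not matter for the omitted logic.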
I would suggest using a manager for your data. According to the documentation, you can create a manager to hold the data and then provide the manager object to all sub-processes. The data is then shared across all processes but resides in the manager. So you need 2x your 20 GB: 20 GB for the main process creating the manager, and another 20 GB for the manager itself.
Use the following to share your data:
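A minimal sketch of such a setup, assuming a multiprocessing.Manager with a Namespace that holds the list under the shared_data name used below, could look like this:

    import pandas as pd
    from multiprocessing import Manager

    num_bags = 10  # placeholder; use the question's actual value

    manager = Manager()                # starts a separate manager process
    shared_data = manager.Namespace()  # proxy object; its attributes live in the manager
    # load the data once; the list is pickled over to the manager process
    shared_data.all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]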
Your processes are then started with the following lines, passing in the manager reference.
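Continuing the sketch above (and assuming a fork-based start method such as the Linux default; with spawn you would also need the usual if __name__ == '__main__' guard), the worker processes might be started roughly like this:

    from multiprocessing import Process

    num_epochs = 5   # placeholder, as in the question
    num_workers = 6  # how many simultaneous runs to launch

    def worker(shared_data):
        for epoch in range(num_epochs):
            for bag in range(num_bags):
                feats = shared_data.all_feats[bag]  # fetched via the manager proxy
                ... # some logic

    procs = [Process(target=worker, args=(shared_data,)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()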
Now you can read directly from the pandas DataFrames in shared_data.all_feats without having to copy them.