I have a script like this:

    for epoch in range(num_epochs):
        for bag in range(num_bags):
            feats = pd.read_csv(f"feats_{bag}.csv")
            ... # some logic
As you can see, it repeatedly reads data from a set of files: each "feats_{bag}.csv" file is read from disk num_epochs times, which slowed the program down. I preloaded all of the data at once, which helped significantly. In the following script, each "feats_{bag}.csv" is read only once:
    all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]
    for epoch in range(num_epochs):
        for bag in range(num_bags):
            feats = all_feats[bag]
The issue with the above program is memory usage, since it loads all the data at once: the all_feats variable takes roughly 20 GB of memory. I have about 64 GB of memory, so I am limited to executing the program 3 times simultaneously. Since all the runs use the same set of feats, I thought there must be a way to load the data (the all_feats variable) once and use it in all runs simultaneously, allowing more than 3 runs. In other words, I want the all_feats variable to occupy only 20 GB in total while many scripts (that all use all_feats) run at the same time and share it.
I've looked into mmap and Python's multiprocessing.shared_memory. Although both allow sharing data between processes, they seem unsuitable for my problem. For example, with shared_memory I tried the following:
    # SharedMemory server
    from multiprocessing import shared_memory

    all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]
    sl = shared_memory.ShareableList(all_feats, name='all_feats')

    # SharedMemory client
    from multiprocessing import shared_memory

    all_feats = shared_memory.ShareableList(name='all_feats')
    print(id(all_feats), all_feats)
However, after running the server, when I run the client multiple times, the id of all_feats is different on every run, which suggests to me that the clients use different memory locations and thus, again, take up more memory than I intended.
Some other ideas I had for speeding this up:
- Load the feats files into memory using Redis or some other in-memory database, and then use the first approach again, i.e. load only the currently needed feats at each iteration, but this time from Redis:
    for epoch in range(num_epochs):
        for bag in range(num_bags):
            feats = redis.get(f"bag_{bag}")
            ... # some logic
I'm hoping that reading from memory (Redis) instead of from disk gives a reasonable speed boost, although not as much as preloading all the data in Python.
- Divide the feats into chunks and preload them part by part; this won't work.
In summary, I'm looking for a way to use the same variable across different runs of the same (or different) Python scripts, without keeping duplicates of the variable. So putting a file in memory and reading from that file does not help me. All scripts only read the variable and never change it.
Can shared_memory solve this? If so, why did it assign different IDs to the all_feats variable in different runs?
2 Answers
Why not just reverse the order of the loops?
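A sketch of that reordering, reusing the names from the question (the placeholder values are assumptions), might look like this:

    import pandas as pd

    num_bags, num_epochs = 10, 5  # placeholders; use the question's actual values

    for bag in range(num_bags):
        # each feats file is now read from disk exactly once
        feats = pd.read_csv(f"feats_{bag}.csv")
        for epoch in range(num_epochs):
            ... # some logic

This reads each file only once per run, but note that it changes the order in which (epoch, bag) pairs are visited, which may or may not matter for the omitted logic.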
I would suggest using a manager for your data. According to the documentation, you can create a manager to hold the data and then provide the manager object to all sub-processes. The data is then shared across all processes but resides in the manager. So you need 2x your 20 GB: 20 GB for the main process creating the manager, and another 20 GB for the manager itself.
Use the following to share your data:
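A minimal sketch of such a setup, assuming a multiprocessing.Manager with a Namespace that holds the list under the shared_data name used below, could look like this:

    import pandas as pd
    from multiprocessing import Manager

    num_bags = 10  # placeholder; use the question's actual value

    manager = Manager()                # starts a separate manager process
    shared_data = manager.Namespace()  # proxy object; its attributes live in the manager
    # load the data once; the list is pickled over to the manager process
    shared_data.all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]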
Your processes are then started with the following lines, passing in the manager reference.
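Continuing the sketch above (and assuming a fork-based start method such as the Linux default; with spawn you would also need the usual if __name__ == '__main__' guard), the worker processes might be started roughly like this:

    from multiprocessing import Process

    num_epochs = 5   # placeholder, as in the question
    num_workers = 6  # how many simultaneous runs to launch

    def worker(shared_data):
        for epoch in range(num_epochs):
            for bag in range(num_bags):
                feats = shared_data.all_feats[bag]  # fetched via the manager proxy
                ... # some logic

    procs = [Process(target=worker, args=(shared_data,)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()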
Now you can read directly from the pandas DataFrames in shared_data.all_feats without having to copy them.