
I have a script like this:

import pandas as pd

for epoch in range(num_epochs):
    for bag in range(num_bags):
        feats = pd.read_csv(f"feats_{bag}.csv")
        ...  # some logic

As you can see, it repeatedly reads data from a set of files: each "feats_{bag}.csv" file is read from disk num_epochs times, which slowed the program down. Preloading all of the data at once helped significantly. In the following script, each "feats_{bag}.csv" is read only once.

all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]
for epoch in range(num_epochs):
    for bag in range(num_bags):
        feats = all_feats[bag]

The issue with the above program is memory usage, since it loads all the data at once: the all_feats variable takes roughly 20 GB of memory. I have about 64 GB of memory, so I am limited to executing the program three times simultaneously. Since all the runs use the same set of feats, I thought there must be a way to load the data (the all_feats variable) once and use it in all runs simultaneously, allowing more than three runs.

In other words, I want the all_feats variable to take only 20 GB of storage in total while I run many scripts (all of which use the all_feats variable), by sharing it between them.

I've looked into mmap and Python's multiprocessing.shared_memory. Although both allow sharing a variable between processes, they seem unsuitable for my problem. For example, with shared memory I tried the following:

# SharedMemory server
from multiprocessing import shared_memory

all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]
sl = shared_memory.ShareableList(all_feats, name='all_feats')

# SharedMemory client
all_feats = shared_memory.ShareableList(name='all_feats')
print(id(all_feats), all_feats)

However, after running the server, when I run the client multiple times, the id of all_feats is different on every run, meaning they used different memory locations, thus again taking up more memory than I intended.
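For reference, the attach-by-name pattern from the shared_memory docs, which I was trying to imitate, looks roughly like this (a minimal sketch with a NumPy array and a made-up block name 'feats_demo'; I use an array here because ShareableList only accepts primitive types such as int, float, str, and bytes, not DataFrames):

import numpy as np
from multiprocessing import shared_memory

# Server: copy an array into a named shared block.
arr = np.arange(10, dtype=np.int64)
shm = shared_memory.SharedMemory(name='feats_demo', create=True, size=arr.nbytes)
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
view[:] = arr[:]

# Client (separate process): attach to the existing block by name.
existing = shared_memory.SharedMemory(name='feats_demo')
client_view = np.ndarray((10,), dtype=np.int64, buffer=existing.buf)
print(id(client_view), client_view[:3])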

Some other ideas I had for speeding things up:

  • Load the feats files into memory using Redis or another in-memory database, then use the first approach again, i.e. load only the currently needed feats at each iteration, but this time from Redis:
for epoch in range(num_epochs):
    for bag in range(num_bags):
        feats = r.get(f"bag_{bag}")  # r is a connected Redis client
        ...  # some logic

I'm hoping that reading from memory (Redis) instead of from disk gives a reasonable speed boost, although not as much as preloading all the data in Python; a rough sketch follows after this list.

  • Divide the feats into chunks and preload them part by part. This won't work for me.
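Here is the sketch of the Redis idea, assuming a local Redis server and the redis-py package, with each DataFrame pickled under a bag_{bag} key of my own naming:

import pickle

import pandas as pd
import redis

r = redis.Redis(host='localhost', port=6379)  # assumes a local Redis server

# One-time load: serialize each DataFrame into Redis.
for bag in range(num_bags):
    r.set(f"bag_{bag}", pickle.dumps(pd.read_csv(f"feats_{bag}.csv")))

# Each run then fetches and unpickles only the bag it currently needs.
for epoch in range(num_epochs):
    for bag in range(num_bags):
        feats = pickle.loads(r.get(f"bag_{bag}"))
        ...  # some logic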

In summary, I'm looking for a way to use the same variable across different runs of the same or different Python scripts, without keeping duplicates of the variable. That is why putting a file in memory and having every script read it (each into its own copy) does not help me. All scripts only read the variable and never change it.

Can shared_memory solve this? If so, why did it assign different IDs to the all_feats variable in different runs?

2 Answers


  1. Why not just reverse the order of the loops:

        for bag in range(num_bags):
            feats = pd.read_csv(f"feats_{bag}.csv")
            for epoch in range(num_epochs):
                ...  # some logic
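    This reads each file exactly once with no extra memory, but note that it only works if the logic for one bag does not depend on the other bags within the same epoch; if each epoch must see every bag before the next epoch begins, the loops cannot simply be swapped.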
    
  2. I would suggest using a manager for your data. According to the documentation, you can create a manager to hold the data and then pass the manager object to all sub-processes. The data is then shared across all processes but resides in the manager. So you need 2x your 20 GB: 20 GB for the main process creating the data, and another 20 GB for the manager itself.

    Use the following to share your data:

    import multiprocessing

    import pandas as pd

    all_feats = [pd.read_csv(f"feats_{bag}.csv") for bag in range(num_bags)]
    manager = multiprocessing.Manager()
    shared_data = manager.Namespace()
    shared_data.all_feats = all_feats

    del all_feats  # might help free some memory
    

    Your processes are then started with the following lines, passing the manager-backed namespace:

    p = multiprocessing.Process(target=f, args=(shared_data,))
    p.start()
    p.join()
    

    Now the sub-processes can read the pandas DataFrames from shared_data.all_feats without each one having to load the CSV files from disk again.
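    For completeness, the worker f referenced above could look like this (a hypothetical sketch; the per-bag logic is your own):

    def f(shared_data):
        # Read-only worker: fetch the shared list of DataFrames and loop over it.
        for feats in shared_data.all_feats:
            ...  # some logic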
