
I have a database table holding 500k records. I need to load these records, do an API call, fetch data from this API endpoint, and save new data back to the database.

I tried to do it like this (simplified code):

task get_new_data: :environment do
  data = Model.where('column IS NOT NULL').order('id DESC')
  data.each do |d|
    puts '...info print...'
    api_call = API::Call(...)
    d.assign_attributes(attr1: api_call.data1,
                        attr2: api_call.data2,
                        ...)
    d.save!(validate: false)
    puts 'another info print...'
    sleep(0.5)
  end
end

When I ran this rake task on the server, for the first 3-4 minutes I didn't see anything in the terminal window. After that, I started seeing the puts output. A few minutes later (maybe 5), the terminal window froze. I could see in the database table that the records were being updated, so I assumed it was only the terminal window that "got broken".

After another 15-20 minutes, the server's CPU and RAM usage went up to 100% and I was no longer able to SSH into the server (expected, given the 100% CPU/RAM usage). Similarly, the Rails app was unreachable – blue error screen (also expected).

At this point, I was just waiting for the server to crash completely. After another 30 minutes or so, the CPU/RAM was released and the Rails app became reachable again – the rake task had finished. But it didn't finish completely: it processed "only" about 350k records, so I had to run the rake task again to process the remaining 150k records. That second run completed successfully.

The thing is that I'll need to run a similar task iterating over the 500k records at least 6 more times, and I don't want to go through a similar experience again.

What is the right way to process a dataset like this without killing the server? Should I batch the records into, say, 10k chunks? Or is there a better way to handle this? I currently don't use Sidekiq in the project.

2 Answers


  1. Rails supports batching records, and it’s easy to implement.

    Instead of

     data.each do |d|
    

    Use

     data.find_each do |d|
    

    The batch size is limited to 1,000 records by default, but you can override this with the :batch_size option.

    Official documentation is here…

    https://api.rubyonrails.org/classes/ActiveRecord/Batches.html
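
    Applied to the task in the question, a minimal sketch might look like this (Model, column, attr1/attr2 and API::Call are the placeholders from the simplified code above; the argument passed to API::Call is hypothetical, and the 10,000 batch size just matches the chunk size mentioned in the question). Note that find_each ignores a custom ORDER BY and iterates by primary key instead:

     task get_new_data: :environment do
       # Loads records in batches of 10,000 instead of pulling all 500k rows
       # into memory at once; iteration order is by primary key.
       Model.where('column IS NOT NULL').find_each(batch_size: 10_000) do |d|
         api_call = API::Call(d.column) # hypothetical argument; substitute your real API call
         d.assign_attributes(attr1: api_call.data1,
                             attr2: api_call.data2)
         d.save!(validate: false)
         sleep(0.5)
       end
     end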

  2. Using find_in_batches or find_each will solve your RAM and CPU problems. However, I still see a problem with the sheer number of API calls – one per record, 500k in total. Is there an API endpoint that retrieves information in batches, or can you create one on the data source side?
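
    If such a batch endpoint exists (or can be added), a rough sketch might look like this – API::BatchCall and the shape of its response are purely hypothetical, standing in for an endpoint that accepts many records per request:

     task get_new_data: :environment do
       Model.where('column IS NOT NULL').find_in_batches(batch_size: 1_000) do |batch|
         # One request per batch instead of one per record:
         # 500k individual calls become roughly 500 batched ones.
         results = API::BatchCall(batch.map(&:column)) # hypothetical batch endpoint

         batch.each do |d|
           result = results[d.column] # assumes the response is keyed by the value sent
           next if result.nil?

           d.assign_attributes(attr1: result.data1, attr2: result.data2)
           d.save!(validate: false)
         end
       end
     end

    Even without a batch endpoint, find_in_batches makes it easy to log the last processed id after each batch, so a failed run can resume from where it stopped instead of starting over.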
