I have a database table holding 500k records. I need to load these records, call an API endpoint for each one, and save the returned data back to the database.
I tried to do it like this (simplified code):
task get_new_data: :environment do
  data = Model.where('column IS NOT NULL').order('id DESC')

  data.each do |d|
    puts '...info print...'

    api_call = API::Call(...)

    d.assign_attributes(attr1: api_call.data1,
                        attr2: api_call.data2,
                        ...)
    d.save!(validate: false)

    puts 'another info print...'
    sleep(0.5)
  end
end
When I ran this rake task on the server, I didn't see anything in the terminal window for the first 3-4 minutes. After that, I started seeing the puts statements. A few minutes later (maybe 5), the terminal window froze. I could see in the database table that the records were being updated, so I assumed it was only the terminal window that "got broken".
After another 15-20 minutes, the server's CPU and RAM usage went up to 100% and I was no longer able to SSH into the server (expected, given the 100% CPU/RAM usage). Similarly, the Rails app was unreachable – blue error screen (also expected).
At this point, I was just waiting for the server to crash completely. After another 30 minutes or so, the CPU/RAM was released and the Rails app was reachable again – the rake task had finished. But it didn't finish completely: it processed "only" about 350k records, so I had to run the rake task again for the remaining 150k records. This second run completed successfully.
The thing is that I'll need to run a similar task iterating over the 500k records at least 6 more times, and I don't want to go through a similar experience to the one described above.
What is the right way to process such a dataset without killing the server? Should I batch the records into, say, 10k chunks? Or is there a better way to handle this? I currently don't use Sidekiq in the project.
2 Answers
Rails supports batching records, and it's easy to implement.
Instead of data.each, use data.find_each.
The batch size defaults to 1,000 records, but you can override this with the :batch_size argument. The official documentation is here:
https://api.rubyonrails.org/classes/ActiveRecord/Batches.html
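As a rough sketch, the original task rewritten with find_each might look like this (assuming the same Model, API::Call and attribute names from the question; the placeholder arguments are kept as-is):

task get_new_data: :environment do
  # find_each loads records in batches (here 10k per batch) instead of
  # materializing all 500k rows at once, keeping memory roughly constant
  Model.where('column IS NOT NULL').find_each(batch_size: 10_000) do |d|
    api_call = API::Call(...)   # same per-record API call as in the question

    d.assign_attributes(attr1: api_call.data1,
                        attr2: api_call.data2)
    d.save!(validate: false)

    sleep(0.5)                  # keep the throttle if the API requires it
  end
end

Note that find_each ignores an explicit order clause and iterates by primary key (newer Rails versions accept an order: :desc option), so the .order('id DESC') from the original query is dropped here.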
Using find_in_batches or find_each will solve your RAM and CPU problems. However, I see a problem with too many API calls. Is there an API that retrieves information in batches, or can you create one on the data source side?
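If the API (or the data source) can be made to accept many records per request, you could combine it with find_in_batches. The sketch below assumes a hypothetical API::BatchCall that takes a list of IDs and returns attributes keyed by ID – that endpoint does not exist in the question and is only there to illustrate the idea:

task get_new_data_batched: :environment do
  Model.where('column IS NOT NULL').find_in_batches(batch_size: 1_000) do |batch|
    # One request per 1,000 records: roughly 500 API calls instead of 500,000
    results = API::BatchCall(batch.map(&:id))   # hypothetical batch endpoint

    batch.each do |d|
      attrs = results[d.id]
      next unless attrs                         # skip records the API returned nothing for

      d.assign_attributes(attr1: attrs[:data1], attr2: attrs[:data2])
      d.save!(validate: false)
    end
  end
end

With a batch endpoint the per-record sleep can usually be dropped, or replaced by a single short pause between batches to respect rate limits.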