I need to merge several batches of a very large number of small files (tweets collected for each user in my project). By "large" I mean roughly 50,000 files 🙁
I don't have a particular issue with the code itself, but R just "freezes" during the merge even when I keep the batches down to 10,000 files. Any ideas?
The code I use is:
files <- list.files(
  path = file.path(path),
  pattern = pattern,
  recursive = TRUE,
  include.dirs = TRUE,
  full.names = TRUE
)

json.data_all <- data.frame()
for (i in seq_along(files)) {
  filename <- files[[i]]
  json.data <- jsonlite::read_json(filename, simplifyVector = TRUE)
  json.data_all <- dplyr::bind_rows(json.data_all, json.data)
}
I have several such folders. How can I solve this?
3 Answers
It is generally easier and faster to collect all of your data frames into a list and then, after the loop, call bind_rows() once to merge the entire list in one go. R does not handle memory very well here: the constant binding inside the for() loop keeps creating and discarding an ever-growing data structure, which becomes very slow. Something like this should work:
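A minimal sketch of that approach, reusing the files vector built in the question:

json.list <- vector("list", length(files))
for (i in seq_along(files)) {
  # read each file into its own list element; nothing grows inside the loop
  json.list[[i]] <- jsonlite::read_json(files[[i]], simplifyVector = TRUE)
}

# bind the whole list in a single call after the loop
json.data_all <- dplyr::bind_rows(json.list)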
Don't bind rows, or otherwise grow a data object, inside a for loop; it is very slow. Try something like the following.
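A sketch of one way to do that with purrr, reading and row-binding in a single step (assuming the same files vector as in the question):

library(purrr)

# map over the file paths, read each JSON file, and row-bind the results
json.data_all <- map_dfr(files, ~ jsonlite::read_json(.x, simplifyVector = TRUE))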
The approach below should solve your problem, and it should also run fast because the files are read in parallel.
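A sketch using the base parallel package (note the assumption here: mclapply() only forks on Unix-like systems; on Windows you would need parLapply() or a package such as furrr instead):

library(parallel)

# read the files on several cores in parallel, then bind once at the end
json.list <- mclapply(
  files,
  function(filename) jsonlite::read_json(filename, simplifyVector = TRUE),
  mc.cores = max(1L, detectCores() - 1L)
)
json.data_all <- dplyr::bind_rows(json.list)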