
I need to merge several batches of a very large number of small files (tweets collected for each user in my project). In total it is approximately 50,000 files 🙁

I don't have a particular issue with the code itself, but R just gets "frozen" during the merge, even when I keep each batch to 10,000 files. Any ideas?

The code I use is:

files <- list.files(
  path = file.path(path),
  pattern = pattern,
  recursive = TRUE,
  include.dirs = TRUE,
  full.names = TRUE
)

json.data_all <- data.frame()


for (i in seq_along(files)) {
  filename <- files[[i]]
  json.data <- jsonlite::read_json(filename, simplifyVector = TRUE)
  json.data_all <- dplyr::bind_rows(json.data_all, json.data)
}

I have several folders like this. How can I solve this?

3 Answers


  1. It is generally easier and faster to collect all of your data frames in a list and then, after the loop, call bind_rows once to merge the entire list in one go.
    R copies a data frame each time it grows, so binding inside the for() loop keeps re-allocating an ever-growing structure and becomes very slow.

    Something like this should work:

    files <- list.files(path = file.path(path), pattern = pattern, recursive = TRUE, include.dirs = TRUE, full.names = TRUE)
    
    json.data <- lapply(files, function(file){
       jsonlite::read_json(file, simplifyVector = TRUE)
    })
    json.data_all <- dplyr::bind_rows(json.data)
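
    Since you mention having several such folders, here is a minimal sketch of the same idea applied folder by folder. The vector folder_paths is a hypothetical name invented for illustration; substitute your actual folder locations:

    # folder_paths is hypothetical: list your actual folder locations here
    folder_paths <- c("folder1", "folder2", "folder3")

    read_folder <- function(folder) {
      files <- list.files(path = folder, pattern = pattern,
                          recursive = TRUE, full.names = TRUE)
      # read every JSON file in this folder into a list of data frames
      lapply(files, jsonlite::read_json, simplifyVector = TRUE)
    }

    # flatten one level to get a single list of data frames, then bind once
    dfs <- unlist(lapply(folder_paths, read_folder), recursive = FALSE)
    json.data_all <- dplyr::bind_rows(dfs)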
    
  2. Don't bind rows (or otherwise grow data) inside a for loop; it is very slow.
    Try something like the following.

    files <- list.files(path = file.path(path), pattern = pattern, recursive = TRUE, include.dirs = TRUE, full.names = TRUE)
    
    json.data_all <- lapply(files, jsonlite::read_json, simplifyVector = TRUE)
    json.data_all <- data.table::rbindlist(json.data_all)
    dim(json.data_all)
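
    One caveat: if the per-file data frames do not all share the same columns (plausible with tweet JSON, where optional fields may appear in only some files — an assumption about your data, so check it), rbindlist() will typically error by default. Passing fill = TRUE pads the missing columns with NA:

    # pad columns missing from some files with NA instead of erroring
    json.data_all <- data.table::rbindlist(json.data_all, fill = TRUE)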
    
  3. The below should solve your problem, and it should also run faster because the files are read in parallel across CPU cores.

    library("parallel")
    library("foreach")
    library("doParallel")
    
    cluster1 = makeCluster(detectCores(logical=FALSE), type="PSOCK", outfile="")
    registerDoParallel(cluster1)
    
    files <- list.files(path = file.path(path), pattern = pattern, recursive = TRUE, include.dirs = TRUE, full.names = TRUE)
    dataList = foreach(file = files) %dopar% {
        # files already holds full paths (full.names = TRUE), so read each one directly
        jsonlite::read_json(file, simplifyVector = TRUE)
    }
    json.data_all = data.table::rbindlist(dataList)
    
    stopCluster(cluster1)
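
    One caveat: reading many tiny files is often limited by disk I/O rather than CPU, so the parallel version is not guaranteed to beat the plain lapply() approaches above. It is worth timing both on a sample folder before committing to the cluster setup.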
    