
I need to merge several batches of a very large number of small files (tweets collected for each user in my project). In total it is approximately 50,000 files 🙁

I don't have a particular issue with the code itself, but R just gets "frozen" during the merge, even when I keep each batch to 10,000 files. Any ideas?

The code I use is:

files <- list.files(
  path = file.path(path),
  pattern = pattern,
  recursive = TRUE,
  include.dirs = TRUE,
  full.names = TRUE
)

json.data_all <- data.frame()


for (i in seq_along(files)) {
  filename <- files[[i]]
  json.data <- jsonlite::read_json(filename, simplifyVector = TRUE)
  json.data_all <- dplyr::bind_rows(json.data_all, json.data)
}

I have several folders like this. How can I solve this?

3 Answers


  1. It is generally easier and faster to collect all of your data frames in a list and then, after the loop, call bind_rows once to merge the entire list in one go.
    R copies a data frame each time it grows, so binding inside the for() loop keeps re-allocating an ever-growing structure and becomes very slow.

    Something like this should work:

    files <- list.files(path = file.path(path), pattern = pattern, recursive = TRUE, include.dirs = TRUE, full.names = TRUE)
    
    json.data <- lapply(files, function(file){
       jsonlite::read_json(file, simplifyVector = TRUE)
    })
    json.data_all <- dplyr::bind_rows(json.data)
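
    Since you mention having several such folders, here is a minimal sketch of the same idea applied folder by folder. The vector folder_paths is a hypothetical name invented for illustration; substitute your actual folder locations:

    # folder_paths is hypothetical: list your actual folder locations here
    folder_paths <- c("folder1", "folder2", "folder3")

    read_folder <- function(folder) {
      files <- list.files(path = folder, pattern = pattern,
                          recursive = TRUE, full.names = TRUE)
      # read every JSON file in this folder into a list of data frames
      lapply(files, jsonlite::read_json, simplifyVector = TRUE)
    }

    # flatten one level to get a single list of data frames, then bind once
    dfs <- unlist(lapply(folder_paths, read_folder), recursive = FALSE)
    json.data_all <- dplyr::bind_rows(dfs)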
    
  2. Don't bind rows (or otherwise grow data) inside a for loop; it is very slow.
    Try something like the following.

    files <- list.files(path = file.path(path), pattern = pattern, recursive = TRUE, include.dirs = TRUE, full.names = TRUE)
    
    json.data_all <- lapply(files, jsonlite::read_json, simplifyVector = TRUE)
    json.data_all <- data.table::rbindlist(json.data_all)
    dim(json.data_all)
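
    One caveat: if the per-file data frames do not all share the same columns (plausible with tweet JSON, where optional fields may appear in only some files — an assumption about your data, so check it), rbindlist() will typically error by default. Passing fill = TRUE pads the missing columns with NA:

    # pad columns missing from some files with NA instead of erroring
    json.data_all <- data.table::rbindlist(json.data_all, fill = TRUE)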
    
  3. The below should solve your problem, and it should also run faster because the files are read in parallel across CPU cores.

    library("parallel")
    library("foreach")
    library("doParallel")
    
    cluster1 = makeCluster(detectCores(logical=FALSE), type="PSOCK", outfile="")
    registerDoParallel(cluster1)
    
    files <- list.files(path = file.path(path), pattern = pattern, recursive = TRUE, include.dirs = TRUE, full.names = TRUE)
    dataList = foreach(file = files) %dopar% {
        # files already holds full paths (full.names = TRUE), so read each one directly
        jsonlite::read_json(file, simplifyVector = TRUE)
    }
    json.data_all = data.table::rbindlist(dataList)
    
    stopCluster(cluster1)
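
    One caveat: reading many tiny files is often limited by disk I/O rather than CPU, so the parallel version is not guaranteed to beat the plain lapply() approaches above. It is worth timing both on a sample folder before committing to the cluster setup.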
    