I have the following function to read a CSV file from Azure:
library(AzureStor)
library(data.table)

read_csv_from_azure <- function(file_path, container) {
  # Try to download the file and handle potential errors
  tryCatch({
    # Download the file from the Azure container as a raw vector
    downloaded_file <- storage_download(container, file_path, NULL)
    # Convert the raw data to a character string
    file_content <- rawToChar(downloaded_file)
    # Read the CSV content using data.table's fread
    data <- fread(text = file_content, sep = ",")
    # Return the data
    return(data)
  }, error = function(e) {
    # Print an error message if an exception occurs
    message("An error occurred while downloading or reading the file: ", conditionMessage(e))
    return(NULL)
  })
}
However, the performance of this function is not sufficient for my requirements; it takes too long to read a CSV file. The CSV files are around 30MB each.
How can I make it more efficient?
Thanks
Answers
As far as I know, the {arrow} package is significantly faster for reading CSV files in R. Try saving the file into a temporary directory, then reading it with arrow:
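A minimal sketch of that approach, assuming the same container and file_path objects (an AzureStor blob container and blob path) as in the question; the wrapper name read_csv_from_azure_arrow is only illustrative:

library(AzureStor)
library(arrow)

read_csv_from_azure_arrow <- function(file_path, container) {
  # Download to a temporary file on disk instead of into memory
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  storage_download(container, src = file_path, dest = tmp, overwrite = TRUE)
  # arrow's multithreaded CSV reader is typically much faster than
  # reading the content through a character string
  read_csv_arrow(tmp)
}

read_csv_arrow() returns a tibble by default; pass as_data_frame = FALSE if you would rather keep the result as an Arrow Table.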
You can use the code below to read a large CSV file in R. I agree with Bastián Olea Herrera's answer that the arrow package is faster for reading CSV files. If you suspect storage_download is causing the problem, you can use the httr package to download the file as a temporary file, read it with arrow, and delete the temporary file afterwards, authenticating with an Azure SAS token.

Code:
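A sketch of that flow, with placeholder values for the storage account, container name, blob path, and SAS token (replace them with your own):

library(httr)
library(arrow)

account   <- "mystorageaccount"   # placeholder storage account name
container <- "mycontainer"        # placeholder container name
blob_path <- "data/myfile.csv"    # placeholder path to the blob
sas_token <- "sv=..."             # placeholder SAS token, without the leading "?"

url <- sprintf("https://%s.blob.core.windows.net/%s/%s?%s",
               account, container, blob_path, sas_token)

# Download the blob to a temporary file, authenticating with the SAS token
tmp <- tempfile(fileext = ".csv")
resp <- GET(url, write_disk(tmp, overwrite = TRUE))
stop_for_status(resp)

# Read the CSV with arrow, then delete the temporary file
data <- read_csv_arrow(tmp)
unlink(tmp)
head(data)

head(data) just previews the first rows so you can confirm the file was read correctly.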
If you need to use the data.table package, you can use fread to read the CSV file directly from the downloaded file. This avoids converting it to a character string first, which can save both time and memory.
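For example, a minimal sketch that downloads the blob to a temporary file and reads it with fread, again assuming the container and file_path objects from the question:

library(AzureStor)
library(data.table)

read_csv_from_azure_dt <- function(file_path, container) {
  # Download straight to a temporary file on disk
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  storage_download(container, src = file_path, dest = tmp, overwrite = TRUE)
  # fread reads the file directly, skipping the rawToChar()/text = step
  fread(tmp, sep = ",")
}

fread can also detect the separator on its own, so sep = "," is optional here.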