
I have the following function to read a CSV file from Azure:

library(AzureStor)
library(data.table)

read_csv_from_azure <- function(file_path, container) {
  
  # Try to download the file and handle potential errors
  tryCatch({
    # Download the file from the Azure container as a raw vector
    # (dest = NULL tells storage_download() to return the data in memory)
    downloaded_file <- storage_download(container, file_path, NULL)
    
    # Convert the raw data to a character string
    file_content <- rawToChar(downloaded_file)
    
    # Read the CSV content using data.table's fread
    data <- fread(text = file_content, sep = ",")
    
    # Return the data
    return(data)
    
  }, error = function(e) {
    # Print an error message if an exception occurs
    message("An error occurred while downloading or reading the file: ", conditionMessage(e))
    return(NULL)
  })
}
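
For reference, container is a standard AzureStor handle, created along these lines (the account name, key, and container name below are placeholders):

library(AzureStor)

# connect to the blob endpoint and get a handle to the container
endpoint <- blob_endpoint("https://<storage account name>.blob.core.windows.net",
                          key = "<access key>")
container <- storage_container(endpoint, "<container name>")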

However, the performance of this function is not sufficient for my requirements; it takes too long to read a CSV file. The CSV files are around 30MB each.

How can I make it more efficient?

Thanks

2 Answers


  1. As far as I know, the {arrow} package is significantly faster at reading CSV files in R.

    Try saving the file into a temporary directory, then reading it with Arrow:

    library(arrow)
    library(AzureStor)
    
    # create temporary file path
    destfile <- tempfile(fileext = '.csv')
    
    # download the blob into the temporary file path
    # (download.file() would need a full public/SAS URL, so use
    # storage_download() with the container handle from the question)
    storage_download(container, file_path, dest = destfile)
    
    # read with Apache Arrow
    data <- arrow::read_csv_arrow(destfile)
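
    Alternatively, read_csv_arrow() also accepts a raw vector, so you can skip the temp file entirely and feed it the download directly. A minimal sketch, assuming the same container handle as in the question:

    library(arrow)
    library(AzureStor)
    
    # dest = NULL makes storage_download() return the blob as a raw vector
    raw_data <- storage_download(container, file_path, dest = NULL)
    
    # read_csv_arrow() treats a raw vector as literal input,
    # so no rawToChar() conversion or temp file is needed
    data <- read_csv_arrow(raw_data)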
    
  2. I agree with Bastián Olea Herrera’s answer that the arrow package is faster at reading CSV files.

    The asker commented: "Thanks. But the main problem is not data.table; rather, it is storage_download."

    If storage_download is the bottleneck, you can instead use the httr package with Azure SAS token authentication to download the file to a temporary location, read it with arrow, and then delete the temp file.

    Code:

    library(httr)
    library(arrow)
    
    read_csv_from_azure <- function(file_path, container, sas_token) {
      # build the blob URL with the SAS token appended for authentication
      file_url <- paste0("https://<storage account name>.blob.core.windows.net/", container, "/", file_path, "?", sas_token)
      
      # stream the response straight to a temporary file on disk
      temp_file <- tempfile(fileext = ".csv")
      response <- GET(file_url, write_disk(temp_file), progress())
      
      if (status_code(response) != 200) {
        message("An error occurred while downloading the file: ", status_code(response))
        return(NULL)
      }
      
      # read the CSV with arrow, then remove the temp file
      data <- read_csv_arrow(temp_file)
      unlink(temp_file)
      return(data)
    }
    
    sas_token <- "sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupiytfx&se=2024-07-31T13:05:05Z&st=2024-07-31T05:05:05Z&spr=https&sig=N5zzzzzzzzzzFkCVeg%2Fzzzzz"
    data <- read_csv_from_azure("large_file.csv", "data", sas_token)
    
    print(head(data, 10))
    

    Output:

    |======================================================================| 100%
       ID  Name Age             Email
    1   0 Name0  56 [email protected]
    2   1 Name1  93 [email protected]
    3   2 Name2  25 [email protected]
    4   3 Name3  77 [email protected]
    5   4 Name4  33 [email protected]
    6   5 Name5  64 [email protected]
    7   6 Name6  95 [email protected]
    8   7 Name7  49 [email protected]
    9   8 Name8  18 [email protected]
    10  9 Name9  39 [email protected]
    


    If you need to use the data.table package, you can point fread at the downloaded file directly. This avoids converting the raw data to a character string first, which saves both time and memory, as in the sketch below.
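
    A minimal sketch of that approach, reusing the AzureStor container handle from the question (names are placeholders):

    library(AzureStor)
    library(data.table)
    
    # download straight to a temp file instead of into memory
    temp_file <- tempfile(fileext = ".csv")
    storage_download(container, file_path, dest = temp_file)
    
    # fread() reads the file from disk directly, skipping rawToChar() entirely
    data <- fread(temp_file, sep = ",")
    unlink(temp_file)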
