I have the following function to read a CSV file from Azure:

library(AzureStor)   # provides storage_download
library(data.table)  # provides fread

read_csv_from_azure <- function(file_path, container) {
  # Try to download the file and handle potential errors
  tryCatch({
    # Download the file from the Azure container as a raw vector
    downloaded_file <- storage_download(container, file_path, NULL)
    # Convert the raw data to a character string
    file_content <- rawToChar(downloaded_file)
    # Read the CSV content using data.table's fread
    data <- fread(text = file_content, sep = ",")
    # Return the data
    return(data)
  }, error = function(e) {
    # Print an error message if an exception occurs
    message("An error occurred while downloading or reading the file: ", conditionMessage(e))
    return(NULL)
  })
}

However, the performance of this function is not sufficient for my requirements; it takes too long to read a CSV file. The CSV files are around 30MB each.

How can I make it more efficient?




  1. As far as I know, the {arrow} package is significantly faster for reading CSV files in R.

    Try saving the file into a temporary directory, then reading it with Arrow:

    # create a temporary file path
    destfile <- tempfile(fileext = '.csv')
    # download the file into the temporary path (file_path must be a full URL here)
    download.file(file_path, destfile, mode = "wb")
    # read with Apache Arrow
    data <- arrow::read_csv_arrow(destfile)
  2. How can I efficiently read a large CSV file from Azure Blob Storage into R for analysis?

    You can use the code below to read a large CSV file in R.

    I agree with Bastián Olea Herrera’s answer: the arrow package is faster at reading CSV files.

    Thanks. But, the main problem is not data.table, rather, it is storage_download.
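
    Before switching libraries, it may be worth timing each step separately to confirm where the time goes. A minimal sketch, assuming a hypothetical helper `time_step` (it is not part of AzureStor or data.table); the generated bytes below only stand in for a downloaded blob:

    ```r
    # Hypothetical helper: evaluate an expression, report elapsed seconds,
    # and return the expression's result.
    time_step <- function(label, expr) {
      elapsed <- system.time(result <- expr)[["elapsed"]]
      message(sprintf("%s: %.3f s", label, elapsed))
      result
    }

    # Stand-in for the downloaded blob (assumption: real code would call
    # storage_download(container, file_path, NULL) inside time_step instead).
    raw_bytes <- time_step(
      "build raw vector",
      charToRaw(paste(rep("0,Name0,56", 1e5), collapse = "\n"))
    )

    # Time the conversion step on its own
    text <- time_step("rawToChar", rawToChar(raw_bytes))
    ```

    Wrapping the download, the conversion, and the parse in separate `time_step` calls should show which of them dominates for a 30 MB file.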

    If you think storage_download is the bottleneck, you can use the httr package with Azure SAS token authentication to download the file to a temporary file, read it with arrow, and then delete the temporary file.


    library(httr)   # GET, write_disk, progress, status_code
    library(arrow)  # read_csv_arrow

    read_csv_from_azure <- function(file_path, container, sas_token) {
      # Build the blob URL; replace the placeholder with your storage account name
      file_url <- paste0("https://<storage account name>.blob.core.windows.net/", container, "/", file_path, "?", sas_token)
      temp_file <- tempfile(fileext = ".csv")
      # Delete the temporary file when the function exits
      on.exit(unlink(temp_file), add = TRUE)
      response <- GET(file_url, write_disk(temp_file), progress())
      if (status_code(response) != 200) {
        message("An error occurred while downloading the file: ", status_code(response))
        return(NULL)
      }
      data <- read_csv_arrow(temp_file)
      return(data)
    }

    sas_token <- "sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupiytfx&se=2024-07-31T13:05:05Z&st=2024-07-31T05:05:05Z&spr=https&sig=N5zzzzzzzzzzFkCVeg%2Fzzzzz"
    data <- read_csv_from_azure("large_file.csv", "data", sas_token)
    print(head(data, 10))


    |======================================================================| 100%
       ID  Name Age             Email
    1   0 Name0  56 [email protected]
    2   1 Name1  93 [email protected]
    3   2 Name2  25 [email protected]
    4   3 Name3  77 [email protected]
    5   4 Name4  33 [email protected]
    6   5 Name5  64 [email protected]
    7   6 Name6  95 [email protected]
    8   7 Name7  49 [email protected]
    9   8 Name8  18 [email protected]
    10  9 Name9  39 [email protected]

    If you need to use the data.table package, you can use fread to read the CSV directly from the downloaded file on disk. This avoids converting the raw data to a character string first, which can save both time and memory.
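
    A minimal sketch of that approach. The helper name `read_downloaded_csv` is hypothetical, and the small file written here only stands in for a downloaded blob; with AzureStor, passing a file path instead of NULL as the third argument of storage_download should write the blob to disk rather than return a raw vector.

    ```r
    library(data.table)

    # Hypothetical helper: read a CSV that is already on disk with fread,
    # skipping the rawToChar() conversion entirely.
    read_downloaded_csv <- function(path) {
      fread(file = path, sep = ",")
    }

    # Stand-in for the download step (assumption: real code would call
    # storage_download(container, file_path, temp_file) here instead).
    temp_file <- tempfile(fileext = ".csv")
    writeLines(c("ID,Name,Age", "0,Name0,56", "1,Name1,93"), temp_file)

    data <- read_downloaded_csv(temp_file)

    # Remove the temporary file once the data is in memory
    unlink(temp_file)
    ```

    Because fread parses the file itself, the full file contents never have to be held in R as a single character string alongside the parsed table.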

