
I've integrated the Azure SDK for C++ into my application, and there is a significant slowdown compared to the old Azure SDK. After increasing the azure-sdk-for-cpp upload parallelism, upload works better, but download is still VERY SLOW.

It can be reproduced with a simple example, just by downloading a 1 GB file from Azure Storage to the local file system.

  • Old SDK: ~1 min
  • New SDK: ~5 min

The old SDK was built on C++ REST, which used concurrency::streams::istream m_stream; there is no equivalent in the new SDK, except for TransferOptions.Concurrency, which does almost nothing.
Is there any way DownloadTo can be sped up, or should parallelism be implemented on top of the library?
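
For reference, these are the only download tuning knobs I could find in the new SDK. A sketch of what I tried, using the blobClient from the repro below (the local path is a placeholder):

Azure::Storage::Blobs::DownloadBlobToOptions options;
options.TransferOptions.InitialChunkSize = 4 * 1024 * 1024; // size of the first range request
options.TransferOptions.ChunkSize = 4 * 1024 * 1024;        // size of subsequent range requests
options.TransferOptions.Concurrency = 16;                   // parallel range downloads
blobClient.DownloadTo("downloaded.bin", options);           // placeholder local path

Minimal repro: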

// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

#include <azure/storage/blobs.hpp>

#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

std::string GetConnectionString()
{
  const static std::string ConnectionString = "";

  if (!ConnectionString.empty())
  {
    return ConnectionString;
  }
  // std::getenv returns nullptr when the variable is unset; guard before constructing a string.
  const char* env = std::getenv("AZURE_STORAGE_CONNECTION_STRING");
  const static std::string envConnectionString = env ? env : "";
  if (!envConnectionString.empty())
  {
    return envConnectionString;
  }
  throw std::runtime_error("Cannot find connection string.");
}

int main()
{
  using namespace Azure::Storage::Blobs;

  const std::string containerName = "sample-container";
  const std::string blobName = "sample-blob";
  const std::string blobContent = "Hello Azure!";

  auto containerClient
      = BlobContainerClient::CreateFromConnectionString(GetConnectionString(), containerName);

  containerClient.CreateIfNotExists();

  BlockBlobClient blobClient = containerClient.GetBlockBlobClient(blobName);

  std::vector<uint8_t> buffer(blobContent.begin(), blobContent.end());
  blobClient.UploadFrom(buffer.data(), buffer.size());

  Azure::Storage::Metadata blobMetadata = {{"key1", "value1"}, {"key2", "value2"}};
  blobClient.SetMetadata(blobMetadata);

  auto properties = blobClient.GetProperties().Value;
  for (const auto& metadata : properties.Metadata)
  {
    std::cout << metadata.first << ":" << metadata.second << std::endl;
  }
  // We know blob size is small, so it's safe to cast here.
  buffer.resize(static_cast<size_t>(properties.BlobSize));

  blobClient.DownloadTo(buffer.data(), buffer.size());

  std::cout << std::string(buffer.begin(), buffer.end()) << std::endl;

  return 0;
}

2 Answers


  1. Chosen as BEST ANSWER

    Long story short, CACHING was the solution. Our system is designed in such a way that the read function always reads only 32 KB at a time, so you can imagine the number of HTTP requests a 1 GB file generates. At first I downloaded the whole 1 GB locally and served every read from that copy; I then reduced the cache block all the way down to 4 MB, which still showed great results. The speedup was insane.
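
    A minimal sketch of the idea, assuming C++17 and that reads never straddle a cache-block boundary (CachedBlobReader and its members are illustrative, not SDK types):

    #include <azure/storage/blobs.hpp>

    #include <algorithm>
    #include <cstring>
    #include <utility>
    #include <vector>

    // Serve small (e.g. 32 KB) reads from a cached 4 MB block instead of
    // issuing one HTTP request per read.
    class CachedBlobReader
    {
    public:
      explicit CachedBlobReader(Azure::Storage::Blobs::BlockBlobClient client)
          : m_client(std::move(client)),
            m_blobSize(m_client.GetProperties().Value.BlobSize)
      {
      }

      // Read up to `length` bytes at `offset`, refilling the cache block on a miss.
      size_t Read(uint8_t* dest, int64_t offset, size_t length)
      {
        if (offset >= m_blobSize)
        {
          return 0;
        }
        int64_t blockStart = (offset / CacheBlockSize) * CacheBlockSize;
        if (blockStart != m_cacheOffset)
        {
          // Cache miss: fetch one aligned block with a single ranged DownloadTo.
          int64_t blockLength = std::min(CacheBlockSize, m_blobSize - blockStart);
          m_cache.resize(static_cast<size_t>(blockLength));
          Azure::Storage::Blobs::DownloadBlobToOptions options;
          options.Range = Azure::Core::Http::HttpRange{blockStart, blockLength};
          m_client.DownloadTo(m_cache.data(), m_cache.size(), options);
          m_cacheOffset = blockStart;
        }
        // Copy the requested bytes out of the cached block.
        size_t within = static_cast<size_t>(offset - m_cacheOffset);
        size_t available = std::min(length, m_cache.size() - within);
        std::memcpy(dest, m_cache.data() + within, available);
        return available;
      }

    private:
      static constexpr int64_t CacheBlockSize = 4 * 1024 * 1024;
      Azure::Storage::Blobs::BlockBlobClient m_client;
      int64_t m_blobSize;
      int64_t m_cacheOffset = -1;
      std::vector<uint8_t> m_cache;
    };

    With a 32 KB read pattern this turns 128 requests per 4 MB into a single ranged request, which is where the speedup came from.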


  2. Is there any way DownloadTo can be sped up, or should parallelism be implemented on top of the library?

    I would recommend splitting the download into chunks and parallelizing them manually. This approach resembles the way some HTTP clients download files in parallel.

    You can use the code below to download much faster with the C++ SDK.

    Code:

    #include <azure/storage/blobs.hpp>
    #include <chrono>
    #include <fstream>
    #include <future>
    #include <iostream>
    #include <stdexcept>
    #include <vector>
    #include <algorithm> // std::min
    #include <mutex>
    
    int main()
    {
        using namespace Azure::Storage::Blobs;
    
        const std::string containerName = "result";
        const std::string blobName = "test.mp4";
        const std::string outputFileName = "demo1.mp4"; // Output file
    
        const std::string connectionString
            = "DefaultEndpointsProtocol=https;AccountName=venkat326123;AccountKey=redacted;EndpointSuffix=core.windows.net";
        auto containerClient = BlobContainerClient::CreateFromConnectionString(connectionString, containerName);
        containerClient.CreateIfNotExists();
    
        BlockBlobClient blobClient = containerClient.GetBlockBlobClient(blobName);
    
        auto properties = blobClient.GetProperties().Value;
        size_t blobSize = static_cast<size_t>(properties.BlobSize);
    
        const size_t chunkSize = 4 * 1024 * 1024;
        size_t totalChunks = (blobSize + chunkSize - 1) / chunkSize; 
    
        std::ofstream outputFile(outputFileName, std::ios::binary);
        if (!outputFile.is_open())
        {
            std::cerr << "Failed to open output file: " << outputFileName << std::endl;
            return 1;
        }
    
        std::mutex fileMutex;
    
        std::vector<std::future<void>> futures;
    
        auto start = std::chrono::high_resolution_clock::now();
    
        // Start downloading each chunk in parallel
        for (size_t i = 0; i < totalChunks; ++i)
        {
            futures.push_back(std::async(std::launch::async, [&, i]()
                {
                    try
                    {
                        // Compute the byte offset and length of this chunk
                        // (named `offset` so it doesn't shadow the timing variable `start`)
                        size_t offset = i * chunkSize;
                        size_t length = std::min(chunkSize, blobSize - offset);

                        // Restrict the download to this chunk's byte range
                        Azure::Storage::Blobs::DownloadBlobToOptions rangeOptions;
                        rangeOptions.Range = Azure::Core::Http::HttpRange{ static_cast<int64_t>(offset), static_cast<int64_t>(length) };

                        // Temporary buffer for this chunk
                        std::vector<uint8_t> buffer(length);

                        // Download the chunk into the temporary buffer
                        blobClient.DownloadTo(buffer.data(), length, rangeOptions);

                        // Lock and write the buffer to the file at the correct position
                        std::lock_guard<std::mutex> lock(fileMutex);
                        outputFile.seekp(static_cast<std::streamoff>(offset));
                        outputFile.write(reinterpret_cast<char*>(buffer.data()), static_cast<std::streamsize>(length));
                    }
                    catch (const std::exception& e)
                    {
                        std::cerr << "Error downloading chunk " << i << ": " << e.what() << std::endl;
                    }
                }));
        }
    
        // Wait for all chunks to finish downloading
        for (auto& f : futures)
        {
            f.get();
        }
    
        // Stop timing the download
        auto end = std::chrono::high_resolution_clock::now();
    
        // Close the file stream
        outputFile.close();
    
        // Calculate time taken in seconds
        std::chrono::duration<double> elapsedSeconds = end - start;
    
        // Calculate download speed in MBps
        double downloadSpeed = (blobSize / (1024.0 * 1024.0)) / elapsedSeconds.count();
    
        std::cout << "Downloaded blob '" << blobName << "' to file '" << outputFileName << "' of size " << blobSize << " bytes." << std::endl;
        std::cout << "Time taken: " << elapsedSeconds.count() << " seconds" << std::endl;
        std::cout << "Download speed: " << downloadSpeed << " MBps" << std::endl;
    
        return 0;
    }
    

    The code above divides the file into 4 MB chunks and downloads them concurrently with std::async, using a std::mutex to keep writes to the output file thread-safe.
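
    One caveat on the sketch above: std::launch::async typically spawns one thread per chunk, which for a 1 GB file means roughly 256 threads. A hypothetical batched variant caps that (downloadChunk stands in for the chunk lambda above):

    // Hypothetical refinement: at most maxParallel range downloads in flight.
    auto downloadChunk = [&](size_t i) { /* ranged DownloadTo + locked write, as above */ };
    const size_t maxParallel = 8;
    for (size_t batch = 0; batch < totalChunks; batch += maxParallel)
    {
        std::vector<std::future<void>> inFlight;
        for (size_t i = batch; i < std::min(totalChunks, batch + maxParallel); ++i)
        {
            inFlight.push_back(std::async(std::launch::async, downloadChunk, i));
        }
        for (auto& f : inFlight)
        {
            f.get(); // wait for the batch; propagates any exception
        }
    }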

    Output:

    Downloaded blob 'test.mp4' to file 'demo1.mp4' of size 69632912 bytes.
    Time taken: 15.3713 seconds
    Download speed: 4.32019 MBps
    


    Also follow up on the GitHub issue you created; the SDK maintainers there can also provide good suggestions to help with the C++ SDK.
