I am working on a project where the UI has to generate previews for many large still images. I am using a C++ library to apply various lookup tables and conversions to the images, which requires the image to be in an RGB float buffer. I use Accelerate to convert the source image into Float and back into an RGB image. I am managing conversion jobs through a serial DispatchQueue so that one conversion is done after another.
What I am running into is peak memory issues. Although I manage the jobs sequentially and process the images one by one, memory piles up into the GB range. If I have enough memory to complete all processing, the memory gets freed and nothing leaks, but during processing I am using several times more memory than I should.
In Instruments I can see vImage_Buffer allocations stacking up (every operation creates three, but I see many more), each multiple MB in size, despite calling free() on the buffers at the end of the function.
I am rather clueless about why the memory piles up. My working theory is that vImage_Buffer.free() does not free the memory immediately but needs something else to happen first (the memory seems to be freed once all the processing is done, but these are really independent jobs and nothing signals "I'm done"). I am quite puzzled. How can I limit the amount of memory I consume? I have thought about keeping the buffers around, but the images do not all have the same size.
It is the first time I use Accelerate and vImage – what did I miss?
Any insight would be much appreciated….
Here is the function (yes I do call free()):
    @available(iOS 16.0, *)
    static func processImage(from cgImage: CGImage, lookData: Data) throws -> CGImage {
        let width = UInt32(cgImage.width)
        let height = UInt32(cgImage.height)
        let Rec709 = CGColorSpace(name: CGColorSpace.itur_709)!

        // Three 32-bit float components per pixel, no alpha.
        let bitmapInfo_3ChanFloat: CGBitmapInfo = CGBitmapInfo(
            rawValue: CGBitmapInfo.floatComponents.rawValue | CGBitmapInfo.byteOrder32Little.rawValue | CGImageAlphaInfo.none.rawValue
        )

        // Format of the incoming image.
        var src_format = vImage_CGImageFormat(
            bitsPerComponent: cgImage.bitsPerComponent,
            bitsPerPixel: cgImage.bitsPerPixel,
            colorSpace: cgImage.colorSpace!,
            bitmapInfo: cgImage.bitmapInfo)!
        // RGB float in the source color space: the input of the C++ library.
        var convertible_format = vImage_CGImageFormat(
            bitsPerComponent: 32,
            bitsPerPixel: 3 * 32,
            colorSpace: cgImage.colorSpace!,
            bitmapInfo: bitmapInfo_3ChanFloat)!
        // Same float layout, tagged as Rec. 709 for the conversion back.
        var converted_format = vImage_CGImageFormat(
            bitsPerComponent: 32,
            bitsPerPixel: 3 * 32,
            colorSpace: Rec709,
            bitmapInfo: bitmapInfo_3ChanFloat)!
        // 8-bit RGBX output for the preview image.
        var end_format = vImage_CGImageFormat(
            bitsPerComponent: 8,
            bitsPerPixel: 4 * 8,
            colorSpace: Rec709,
            bitmapInfo: .init(rawValue: CGImageAlphaInfo.noneSkipLast.rawValue))!

        let to_converter = vImageConverter_CreateWithCGImageFormat(
            &src_format,
            &convertible_format,
            nil,
            vImage_Flags(kvImagePrintDiagnosticsToConsole),
            nil).takeRetainedValue()
        let back_converter = vImageConverter_CreateWithCGImageFormat(
            &converted_format,
            &end_format,
            nil,
            vImage_Flags(kvImagePrintDiagnosticsToConsole),
            nil).takeRetainedValue()

        // Each buffer is freed via defer so that the throwing paths below do not leak it.
        var src_buffer = try vImage_Buffer(cgImage: cgImage)
        defer { src_buffer.free() }

        var conversion_buffer = vImage_Buffer()
        vImageBuffer_Init(
            &conversion_buffer,
            UInt(height),
            UInt(width),
            convertible_format.bitsPerPixel,
            vImage_Flags(kvImagePrintDiagnosticsToConsole)
        )
        defer { conversion_buffer.free() }

        var end_buffer = vImage_Buffer()
        vImageBuffer_Init(
            &end_buffer,
            UInt(height),
            UInt(width),
            end_format.bitsPerPixel,
            vImage_Flags(kvImagePrintDiagnosticsToConsole)
        )
        defer { end_buffer.free() }

        // Source image -> RGB float.
        vImageConvert_AnyToAny(
            to_converter,
            &src_buffer,
            &conversion_buffer,
            nil,
            vImage_Flags(kvImagePrintDiagnosticsToConsole)
        )

        let imageBuffer = conversion_buffer.data.assumingMemoryBound(to: Float.self)
        // Byte count of the float data, computed in Int to avoid UInt32 overflow.
        // Note: this assumes tightly packed rows; vImageBuffer_Init may pad rowBytes.
        let size = Int(width) * Int(height) * 4 * 3
        try lookData.withUnsafeBytes { (rawLookBuffer: UnsafeRawBufferPointer) in
            let lookBuffer: UnsafePointer<UInt8> = rawLookBuffer.baseAddress!.assumingMemoryBound(to: UInt8.self)
            let result = applyLook_CPP(imageBuffer, size, width, height, lookBuffer, rawLookBuffer.count)
            if result < 0 {
                throw Errors.ImageSDKFailed(error: result)
            }
        }

        // RGB float (now treated as Rec. 709) -> 8-bit RGBX.
        vImageConvert_AnyToAny(
            back_converter,
            &conversion_buffer,
            &end_buffer,
            nil,
            vImage_Flags(kvImagePrintDiagnosticsToConsole)
        )

        return try end_buffer.createCGImage(format: end_format)
    }
2 Answers
So the solution to my peak memory problem is called autoreleasepool! I discovered it accidentally while digging through the Debug Memory Graph...
Wrapping the processing in an autoreleasepool { } block fixed the memory issues, and the memory is now freed more frequently.
The memory issue wasn't caused by the vImage buffers themselves, but by the CGImages referencing them (CGImage is some old code, it seems). So if you run into this type of problem, make sure that the autoreleasepool block covers the entire lifespan of the CGImages (e.g. if you save them to disk), otherwise it won't be effective.
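For illustration, here is a minimal sketch of how each job on the serial queue could get its own pool. The helper name `enqueuePreviewJob` is hypothetical and not from my project; it only assumes that everything touching the CGImage (processing and saving to disk) happens inside the job closure.

```swift
import Dispatch
#if canImport(ObjectiveC)
import ObjectiveC
#endif

/// Hypothetical helper: runs `job` on `queue` inside its own autorelease pool.
/// The job should do all the work that touches the CGImage, including saving it.
func enqueuePreviewJob(on queue: DispatchQueue, _ job: @escaping @Sendable () -> Void) {
    queue.async {
        #if canImport(ObjectiveC)
        // Objects autoreleased while `job` runs are released when this block
        // ends, instead of whenever the queue happens to drain its own pool.
        autoreleasepool { job() }
        #else
        // No Objective-C runtime here, so there is no autorelease pool to drain.
        job()
        #endif
    }
}
```

With something like this, a job would call processImage(from:lookData:) and write the result to disk inside the closure, so no CGImage outlives the pool.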
You never stop learning :-)
Speaking as someone who may have written vImageConvert_AnyToAny in a previous life, the key to saving memory on this stuff, and therefore running way, way, way faster, is to (when possible) do multi-step conversions like this on little chunks of the image at a time, maybe 32kB each, i.e. a "tile". In this example, that would mean:
This prevents the entire width * height * 4 * 3 byte buffer from ever existing, and you can reuse the same (width/N) * (height/M) * 4 * 3 bit of temporary memory over and over. This helps cache temporal locality; larger buffers would otherwise be self-evicting. Some problems are computationally expensive enough that a 256kB tile size might work better, so try a couple of sizes. That should sort out your bandwidth issues too, since the tile will be small enough to be in an adjacent cache when you call vImageConvert_AnyToAny again to convert back.
Then there is the issue of multithreading. You can process those tiles concurrently in multiple threads using something like dispatch_apply. You'll just want to pass kvImageDoNotTile to the vImage routines so that they are not multithreading on top of your multithreading, which would be unhelpful. (They have no idea what you are doing.) You'll obviously need a separate small conversion_buffer for each concurrent worker. The vImageConverters can be used reentrantly with vImageConvert_AnyToAny.
Finally, I should note that vImageConvert_AnyToAny also has lookup tables of its own internally for colorspace conversions. I'm not quite sure what applyLook_CPP does in this example, but if the entire operation is expressible as an ICC 4.3 colorspace conversion, then likely you can get vImage to do the whole thing by simply providing the right colorspaces for the src and dest vImage_CGImageFormat. It is handling most or all of the colorspace conversion within CoreGraphics, so if it can do that, it should be able to handle your stuff.
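As a rough sketch of the tile-size arithmetic, the snippet below splits an image into full-width bands of rows that each fit a byte budget. Full-width bands (N = 1) are a simplifying assumption, and `RowBand` and `rowBands` are hypothetical names, not vImage API. It also assumes the look can be applied to each band independently.

```swift
/// Hypothetical description of one tile: a run of consecutive full-width rows.
struct RowBand: Equatable {
    let firstRow: Int
    let rowCount: Int
}

/// Splits `height` rows into bands whose float data stays near `targetTileBytes`.
/// A band always holds at least one row, even if a single row exceeds the budget.
func rowBands(width: Int, height: Int, bytesPerPixel: Int, targetTileBytes: Int) -> [RowBand] {
    precondition(width > 0 && height > 0 && bytesPerPixel > 0 && targetTileBytes > 0)
    let rowBytes = width * bytesPerPixel
    let rowsPerBand = max(1, targetTileBytes / rowBytes)
    return stride(from: 0, to: height, by: rowsPerBand).map { firstRow in
        RowBand(firstRow: firstRow, rowCount: min(rowsPerBand, height - firstRow))
    }
}
```

Each band would then be converted to float into one small reusable buffer, run through the look, and converted back into the matching rows of the destination, so only one band of float data exists at a time. Trying targetTileBytes of 32 * 1024 and 256 * 1024 is a reasonable starting point.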
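In Swift, the counterpart of dispatch_apply is DispatchQueue.concurrentPerform. The sketch below shows only the band-per-work-item pattern on a plain Float array; `forEachBandConcurrently` is a hypothetical helper, and the `transform` closure stands in for the per-tile work (in the real pipeline, the vImageConvert_AnyToAny calls with kvImageDoNotTile plus the look, each worker using its own scratch buffer).

```swift
import Dispatch

/// Hypothetical helper: applies `transform` in place to each band of rows,
/// one band per concurrent work item. Bands never overlap, so the work items
/// do not touch the same memory.
func forEachBandConcurrently(
    _ pixels: inout [Float],
    floatsPerRow: Int,
    rowsPerBand: Int,
    transform: (UnsafeMutableBufferPointer<Float>) -> Void
) {
    precondition(floatsPerRow > 0 && rowsPerBand > 0)
    precondition(pixels.count % floatsPerRow == 0)
    let rows = pixels.count / floatsPerRow
    let bandCount = (rows + rowsPerBand - 1) / rowsPerBand
    pixels.withUnsafeMutableBufferPointer { whole in
        guard let base = whole.baseAddress else { return }
        // concurrentPerform returns only after every band has been processed.
        DispatchQueue.concurrentPerform(iterations: bandCount) { band in
            let firstRow = band * rowsPerBand
            let rowCount = min(rowsPerBand, rows - firstRow)
            let slice = UnsafeMutableBufferPointer(
                start: base + firstRow * floatsPerRow,
                count: rowCount * floatsPerRow
            )
            transform(slice)
        }
    }
}
```

Because the call blocks until all bands are done, it can run inside one job on a serial queue without changing the one-image-at-a-time ordering.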
It can actually do more. IIRC, the entire pipeline of what it can do is described loosely by the stages:
I may have omitted a few stages (colorspace conversion itself is often multiple stages and is egregiously oversimplified here) and I can never remember whether decode or premultiplication is supposed to come first. (The PDF spec defines this but presents it "backward" which just confuses things further, IMO.) …It’s been a while… The tiling engine will drop unneeded stages or concatenate stages into single passes intelligently when it can.
One of the things that would have been nice to do for this particular tiling engine would be to provide hooks so you can insert a callback into the vImage tiling engine, which is already doing this stuff behind the scenes, but that would have in effect frozen the design in time and prevented future improvements. Alas, as described in the other answer, sometimes software has a way of ossifying anyway (when key engineers move on/retire), so one can be overzealous in that direction. It probably would have been greatly complicated by the need to specify whether your filter needed premultiplied or non-premultiplied data, whether it needed to be in a linear color space, whether it is separable, etc. so maybe not realistic in any event, due to complexity on both sides.
Also, in another, another life, I wrote a similar conversion for MetalPerformanceShaders.framework. The GPU has the advantage of a runtime compiler, so can concatenate a lot of conversion passes into some straightforward arithmetic, usually in a single pass. Assuming your function is GPU friendly that might work better. Note there are actually two versions of this in MPS. One operates as an image filter. This would usually result in a 3 pass conversion, end to end. There is another one that operates as a kernel function library, with which you could probably do the whole thing end to end in a single pass for most things. This feature might be the only kernel function library in the OS (excluding the Metal shader language standard library) so might be a bit unexpected! Whether or not this turns out to be useful for you may depend quite a bit on whether your data takes too long to move to the GPU (and back?) which is in turn strongly affected by whether your GPU is a discrete memory device or Apple Silicon. Memory bandwidth, especially over PCI-E can trump computational bandwidth.
When it comes to performance work, the stars need to be aligned correctly to get stellar performance. One mistake and you are in the slow lane.
ATTN Evil Stackoverflow referee: The answer to the question in the title of this post is "Immediately". The buffer is freed immediately. That is however not the answer that the questioner needed.