I’m cleaning up some code that’s started throwing java.lang.OutOfMemoryError
in Production.
The problematic area has a couple of methods that process large collections, e.g.:
public void doSomething(Collection<HeavyObject> inputs) {
... do some stuff using INPUTS, deriving some different objects ...
... do some other stuff NOT using INPUTS, only derived objects ...
}
public void unsuspectingCaller() {
Collection<HeavyObject> largeCollection;
... some stuff to populate the collection ...
doSomething(largeCollection);
... other stuff ...
// this following code may be added in the future
kaboom(largeCollection); // walks into maintenance trap!
}
The code is blowing up and running out of memory in ... do some other stuff NOT using INPUTS ...
I can fix reduce the memory consumption (allowing early GC) by adding a inputs.clear()
in between the two blocks.
But, I do not want to set a trap for future maintainers who might not be aware that the input collection is cleared. In fact, the inputs
would ideally have been immutable, to more clearly communicate the intent of the code.
Is there an idiomatic way to declare doSomething()
to make it clear, or even compiler verifiable, that the caller of doSomething()
is not supposed to continue using the collection after doSomething()
has been called?
UPDATE
For additional clarity, I renamed the parameter to inputs
, instead of targets
. Just keep that in mind when reviewing the comments.
UPDATE2
Addressing the suggestion from @Stephen C, we can see clearly that the JVM does not release references held by the caller, even if they are just passed in as an unnamed parameter. Execute with -Xmx8g
(fail) and -Xmx9g
(pass):
package com.stackoverflow.sandbox;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.junit.jupiter.api.Test;
public class MemoryTest {
static class HeavyObject {
int[] oneGigabyte = IntStream.range(0, 256_000_000).toArray();
public int[] getGig() {
return oneGigabyte;
}
}
private int[] skynet(int[] in) {
// perform out-of-this-world artificial intelligence computation
return Arrays.stream(in).map(x -> x >> 1).toArray();
}
void doSomething(Collection<HeavyObject> input) {
Collection<int[]> doubleMemoryUsage = input.stream().map(HeavyObject::getGig).map(this::skynet).collect(Collectors.toList());
input = null;
Collection<int[]> tripleMemoryUsage = doubleMemoryUsage.stream().map(this::skynet).collect(Collectors.toList());
double sum = tripleMemoryUsage.stream().flatMapToDouble(array -> Arrays.stream(array).asDoubleStream()).sum();
System.out.println("sum = " + sum);
}
@Test
void caller1() {
doSomething(List.of(new HeavyObject(), new HeavyObject(), new HeavyObject()));
System.out.println("done1");
}
@Test
void caller2() {
Collection<HeavyObject> threeGigs = List.of(new HeavyObject(), new HeavyObject(), new HeavyObject());
doSomething(threeGigs);
System.out.println("done2");
}
}
Another way to state the challenge, is how to reduce the memory usage in doSomething() from triple to double in an idiomatic way that enforces safe usage at compile time?
3
Answers
I was hoping for something better but here's what I came up with:
Refactor
doSomething()
into a class.Now the caller can perform their own analysis to see if
targets.clear()
is ok to call.The problem with calling
targets.clear()
is (as you noted) that something else may be using the collection. So here’s how I would approach this:The
targets = null;
will prevent this call from retaining a reference to the collection longer than it needs to. The performance impact is almost zero, and the damage (if any!) of nulling prematurely is localized to thedoSomething()
method itself.The problem then falls on the caller:
In the first example, the JVM should be able to tell that the caller has no reachable reference once the call has started. In the second example, it is more difficult for the JVM to know. But in both cases, you are dependent on the JVM to detect that the reference is effectively unreachable in the caller.
UPDATE
I suspect that the reason that your
MemoryTest
example fails is that yourdoSomething
code is creating a temporary variable or using a register or something to hold a reference to theStream
object. The JVM may not realize that that variable / register is no longer live, and may therefore be treating theStream
object as reachable. But theStream
object will most likely have a reference to the original collection, and that will make the collection reachable as well.It could be argued that that would be a JVM bug, but I don’t think so. The JLS and JVMS make no strong statements about whether / when the JVM should detect that an local variable (or temporary variable / register) used a method call is no longer reachable.
But I really think that Bohemian has given you the best answer. (No. I don’t think he was being funny.)
If you are having to micro-optimize this to squeeze your problem into memory footprint of your current (small) heap, then the simple solution is to make the heap bigger.
As you noted the various clever things that you could do to optimize storage usage (e.g. by clearing things) could actually break the application or make it harder to maintain.
(And your
MemoryTest
example is a good illustration of how clever optimizations may fail. There is stuff happening behind the scenes that is difficult to predict.)I am only posting this since it’s too big for a comment:
And run this with
java -Xms6g -Xmx6g MemoryTest.java
. It will work.Now comment that
input = null;
and run it: it will fail.Running this on
jdk-15
btw.In theory even if you change the method to:
it should not fail either, but it does. Even with
ZGC
orShenandoah
. I don’t have the knowledge to explain why, though.