Guava, Graal and Partial Escape Analysis

Java 10 has been released, and although Graal was available before, it is now even easier to get. Just add the following flags and, congratulations, you're running #Graal!


-XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler


What exactly does this give us, where can we expect improvements, and which home-grown workarounds can we start removing?


The example I will consider is partly contrived but based on real events.



Guava


Most likely you have come across this Guava call:


checkArgument(value > 0, "Non-negative value is expected, was %s", value);


And everything would be fine if such a call did not sit on the critical path of the code: the problem is implicit garbage creation.


This is what the body of checkArgument looks like:


  public static void checkArgument(
      boolean expression,
      @Nullable String errorMessageTemplate,
      @Nullable Object... errorMessageArgs) {
    if (!expression) {
      throw new IllegalArgumentException(format(errorMessageTemplate, errorMessageArgs));
    }
  }

Let's make the implicit explicit:


boolean expression = value > 0;
Object[] errorMessageArgs = new Object[]{Integer.valueOf(value)};
if (!expression) {
  throw new IllegalArgumentException(format(errorMessageTemplate, errorMessageArgs));
}

This is where the checker's dilemma arises. As a rule, such checks in production code are a precaution: on the one hand I don't want to pay for extra garbage, but on the other hand I don't want to give up fail-fast behavior.


The problem is the objects created by autoboxing and varargs, which may never be used. Alas, once a branch is involved, Escape Analysis can no longer prove that an object is unnecessary.


How can I solve the problem?


For example, by overloading checkArgument (which is essentially what Guava >= 20 does):


  public static void checkArgument(boolean expression, @Nullable String errorMessageTemplate, int p1) {
    if (!expression) {
      throw new IllegalArgumentException(format(errorMessageTemplate, p1));
    }
  }

But what if we have more than the two arguments for which Guava provides overloads? Write your own workaround, or live with the garbage? In our code we hit a spot that passes three ints and one String, runs millions of times, and has a bounded response time.
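A hand-written workaround of this kind might look roughly like the sketch below. The class name Checks and the exact signature are hypothetical (this is not a Guava API), and String.format stands in for Guava's internal format helper. The primitives are passed as is, so the boxing and the varargs array appear only inside the failing branch.

```java
// Hypothetical helper, not part of Guava: an overload for three ints and one String.
final class Checks {
    private Checks() {}

    static void checkArgument(boolean expression, String errorMessageTemplate,
                              int p1, int p2, int p3, String p4) {
        if (!expression) {
            // Boxing of p1..p3 and the varargs Object[] are created only on this rare path.
            throw new IllegalArgumentException(
                    String.format(errorMessageTemplate, p1, p2, p3, p4));
        }
    }
}
```

The price is one more overload to maintain for every argument combination that turns up on a hot path.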


Graal


Java 10 and -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler


Graal brings many new optimizations, in particular Partial Escape Analysis. Its essence, among other things, is that it can determine that created objects are used in only one of the branches and move their creation into that branch.
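As a conceptual illustration, the effect can be written out by hand in Java source. This is only a sketch: the compiler performs the transformation on its intermediate representation, not on source code, and every name below (PartialEscapeSketch, Pair, lastRejected) is hypothetical.

```java
// Conceptual illustration only; all names are hypothetical.
final class PartialEscapeSketch {
    static final class Pair {
        final int a;
        final int b;
        Pair(int a, int b) { this.a = a; this.b = b; }
    }

    // Storing into a static field makes the object escape.
    static Pair lastRejected;

    // As written: a Pair is allocated on every call, although it escapes only when !ok.
    static int sumAsWritten(int a, int b, boolean ok) {
        Pair p = new Pair(a, b);
        if (!ok) {
            lastRejected = p;
        }
        return p.a + p.b;
    }

    // Roughly the shape after the optimization: the allocation sinks into the
    // escaping branch, and the common path works on the plain ints directly.
    static int sumAfterPartialEscapeAnalysis(int a, int b, boolean ok) {
        if (!ok) {
            lastRejected = new Pair(a, b);
        }
        return a + b;
    }
}
```

Both methods return the same result and leave the same object in lastRejected; the second one simply allocates nothing on the common path.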


The moment of truth - what is your evidence?


JMH


PartialEATest :


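package com.elastic;

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)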
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Warmup(iterations = 5, time = 5000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 5000, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class PartialEATest {

    @Param(value = {"-1", "1"})
    private int value;

    @Benchmark
    public void allocate(Blackhole bh) {
        checkArg(bh, value > 0, "expected non-negative value: %s, %s", value, 1000, "A", 700);
    }

    private static void checkArg(Blackhole bh, boolean cond, String msg, Object ... args){
        if (!cond){
            bh.consume(String.format(msg, args));
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(PartialEATest.class.getSimpleName())
                .addProfiler(GCProfiler.class)
                .build();

        new Runner(opt).run();
    }
}

Of all the numbers, we are interested in the allocations, which is why GCProfiler is turned on:


Options   Benchmark                                    (value)      Score      Error   Units
-Graal    PartialEATest.allocate:gc.alloc.rate.norm         -1   1008.000 ±   0.001    B/op
-Graal    PartialEATest.allocate:gc.alloc.rate.norm          1     32.000 ±   0.001    B/op
+Graal    PartialEATest.allocate:gc.alloc.rate.norm         -1   1024.220 ±   0.908    B/op
+Graal    PartialEATest.allocate:gc.alloc.rate.norm          1     ≈ 10⁻⁴              B/op

This quite clearly shows that Graal does not create objects unnecessarily when the check passes (value = 1), so it's time to start removing the hand-written optimization workarounds.


Update:


olegchir reasonably remarked: wouldn't it be nice to see what exactly the code compiles into?


Compiled method


Let's look at the assembly produced by good old C2 and by Graal. For this we need hsdis (download it or build it yourself) and a few extra launch parameters:


-XX:+UnlockDiagnosticVMOptions 
-XX:PrintAssemblyOptions=intel 
-XX:CompileCommand=print,"com/elastic/PartialEATest.*" 

Compiled method :: C2


There is a lot of code, so only the part up to the first autoboxing is shown:


ImmutableOopMap{rbx=Oop }pc offsets: 1684 1697
Compiled method (c2)     619  736       4       com.elastic.PartialEATest::allocate (55 bytes)
 total in heap  [0x00000001189a0c90,0x00000001189a1410] = 1920
 relocation     [0x00000001189a0e08,0x00000001189a0e38] = 48
 main code      [0x00000001189a0e40,0x00000001189a1060] = 544
 stub code      [0x00000001189a1060,0x00000001189a1078] = 24
 oops           [0x00000001189a1078,0x00000001189a10a0] = 40
 metadata       [0x00000001189a10a0,0x00000001189a10b0] = 16
 scopes data    [0x00000001189a10b0,0x00000001189a1210] = 352
 scopes pcs     [0x00000001189a1210,0x00000001189a13c0] = 432
 dependencies   [0x00000001189a13c0,0x00000001189a13c8] = 8
 handler table  [0x00000001189a13c8,0x00000001189a1410] = 72
----------------------------------------------------------------------
com/elastic/PartialEATest.allocate(Lorg/openjdk/jmh/infra/Blackhole;)V  [0x00000001189a0e40, 0x00000001189a1078]  568 bytes
[Entry Point]
[Constants]
  # {method} {0x000000022ea937b8} 'allocate' '(Lorg/openjdk/jmh/infra/Blackhole;)V' in 'com/elastic/PartialEATest'
  # this:     rsi:rsi   = 'com/elastic/PartialEATest'
  # parm0:    rdx:rdx   = 'org/openjdk/jmh/infra/Blackhole'
  #           [sp+0x30]  (sp of caller)
  0x00000001189a0e40: cmp    rax,QWORD PTR [rsi+0x8]
  0x00000001189a0e44: jne    0x0000000110eb7580  ;   {runtime_call ic_miss_stub}
  0x00000001189a0e4a: xchg   ax,ax
  0x00000001189a0e4c: nop    DWORD PTR [rax+0x0]
[Verified Entry Point]
  0x00000001189a0e50: mov    DWORD PTR [rsp-0x14000],eax
  0x00000001189a0e57: push   rbp
  0x00000001189a0e58: sub    rsp,0x20           ;*synchronization entry
                                                ; - com.elastic.PartialEATest::allocate@-1 (line 26)

  0x00000001189a0e5c: mov    r11d,DWORD PTR [rsi+0x10]
                                                ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@1 (line 26)

  0x00000001189a0e60: mov    DWORD PTR [rsp],r11d
  0x00000001189a0e64: test   r11d,r11d
  0x00000001189a0e67: jle    0x00000001189a0ffc  ;*ifle {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@4 (line 26)

  0x00000001189a0e6d: cmp    r11d,0xffffff80
  0x00000001189a0e71: jl     0x00000001189a100e  ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.Integer::valueOf@3 (line 1048)
                                                ; - com.elastic.PartialEATest::allocate@24 (line 26)

  0x00000001189a0e77: cmp    r11d,0x7f
  0x00000001189a0e7b: jg     0x00000001189a0ea9  ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.Integer::valueOf@10 (line 1048)
                                                ; - com.elastic.PartialEATest::allocate@24 (line 26)

  0x00000001189a0e7d: mov    ebp,r11d
  0x00000001189a0e80: add    ebp,0x80           ;*iadd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.Integer::valueOf@20 (line 1049)
                                                ; - com.elastic.PartialEATest::allocate@24 (line 26)

  0x00000001189a0e86: cmp    ebp,0x100
  0x00000001189a0e8c: jae    0x00000001189a101e
  0x00000001189a0e92: movsxd r10,r11d
  0x00000001189a0e95: movabs r11,0x12ed02000    ;   {oop(a 'java/lang/Integer'[256] {0x000000012ed02000})}
  0x00000001189a0e9f: mov    rbp,QWORD PTR [r11+r10*8+0x418]
                                                ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.Integer::valueOf@21 (line 1049)
                                                ; - com.elastic.PartialEATest::allocate@24 (line 26)
................                                                

(The rest of the compiled C2 code is omitted.)
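The comparisons against 0xffffff80 (-128) and 0x7f (127) in the listing are annotated with java.lang.Integer::valueOf: they are the inlined range check of the Integer cache that autoboxing goes through, followed by a load from the 256-element Integer array. A simplified sketch of that logic follows. The names BoxingSketch and box are hypothetical, and the cache range is assumed to be the default one.

```java
// Simplified, hypothetical re-creation of the boxing path visible in the listing.
final class BoxingSketch {
    private static final int LOW = -128;
    private static final int HIGH = 127; // default upper bound, assumed fixed in this sketch
    private static final Integer[] CACHE = new Integer[HIGH - LOW + 1]; // 256 entries

    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = Integer.valueOf(i + LOW);
        }
    }

    private BoxingSketch() {}

    static Integer box(int i) {
        if (i >= LOW && i <= HIGH) {
            // Cached instance: index is i + 128, which matches the `add ebp,0x80` in the listing.
            return CACHE[i - LOW];
        }
        // Outside the cache range the JDK hands out a boxed object for the value.
        return Integer.valueOf(i);
    }
}
```

C2 emits all of this before the branch on the check result is even taken into account, which is why it shows up on the passing path as well.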


Compiled method :: Graal


ImmutableOopMap{rbx=Oop }pc offsets: 251 264
Compiled method (JVMCI)    1850 3888       4       com.elastic.PartialEATest::allocate (55 bytes)
 total in heap  [0x0000000119292590,0x0000000119292830] = 672
 relocation     [0x0000000119292708,0x0000000119292718] = 16
 main code      [0x0000000119292720,0x0000000119292795] = 117
 stub code      [0x0000000119292795,0x0000000119292798] = 3
 oops           [0x0000000119292798,0x00000001192927a0] = 8
 metadata       [0x00000001192927a0,0x00000001192927a8] = 8
 scopes data    [0x00000001192927a8,0x00000001192927c8] = 32
 scopes pcs     [0x00000001192927c8,0x0000000119292828] = 96
 dependencies   [0x0000000119292828,0x0000000119292830] = 8
----------------------------------------------------------------------
com/elastic/PartialEATest.allocate(Lorg/openjdk/jmh/infra/Blackhole;)V (com.elastic.PartialEATest.allocate(Blackhole))  [0x0000000119292720, 0x0000000119292798]  120 bytes
[Entry Point]
[Constants]
  # {method} {0x0000000231e007b8} 'allocate' '(Lorg/openjdk/jmh/infra/Blackhole;)V' in 'com/elastic/PartialEATest'
  # this:     rsi:rsi   = 'com/elastic/PartialEATest'
  # parm0:    rdx:rdx   = 'org/openjdk/jmh/infra/Blackhole'
  #           [sp+0x20]  (sp of caller)
  0x0000000119292720: cmp    rax,QWORD PTR [rsi+0x8]
  0x0000000119292724: jne    0x000000010eadc300  ;   {runtime_call ic_miss_stub}
  0x000000011929272a: nop
  0x000000011929272b: data16 data16 nop WORD PTR [rax+rax*1+0x0]
  0x0000000119292736: data16 nop WORD PTR [rax+rax*1+0x0]
[Verified Entry Point]
  0x0000000119292740: mov    DWORD PTR [rsp-0x14000],eax
  0x0000000119292747: sub    rsp,0x18
  0x000000011929274b: mov    QWORD PTR [rsp+0x10],rbp
  0x0000000119292750: cmp    DWORD PTR [rsi+0x10],0x1
  0x0000000119292754: jl     0x000000011929276d  ;*ifle {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@4 (line 26)

  0x000000011929275a: mov    rbp,QWORD PTR [rsp+0x10]
  0x000000011929275f: add    rsp,0x18
  0x0000000119292763: mov    rcx,QWORD PTR [r15+0x70]
  0x0000000119292767: test   DWORD PTR [rcx],eax  ;   {poll_return}
  0x0000000119292769: vzeroupper 
  0x000000011929276c: ret                       ;*return {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@54 (line 27)

  0x000000011929276d: mov    DWORD PTR [r15+0x314],0xffffffed
                                                ;*ifle {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@4 (line 26)

  0x0000000119292778: mov    QWORD PTR [r15+0x320],0x0
  0x0000000119292783: call   0x000000010eadd2a4  ; ImmutableOopMap{rsi=Oop }
                                                ;*aload_0 {reexecute=1 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@0 (line 26)
                                                ;   {runtime_call DeoptimizationBlob}
  0x0000000119292788: nop

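You can see how much larger the code compiled by C2 is than the code compiled by Graal (544 bytes of main code against 117). The C2 version contains both the autoboxing and the varargs array, while the Graal version is essentially a single comparison and a return, with the failing branch handed over to deoptimization.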