MyNameIsTrez
January 11, 2025

I think I’ve found a bug, but I’m not sure whether to blame the JVM, JNI, NASM, GCC, or ld.

The bug causes my compiled modding language called grug (see this for videos and explanations) to sporadically crash Minecraft (Java edition) when mods cause a SIGSEGV with infinite recursion.

My modding language just consists of a 9k line grug.c, delivering a compiler and linker for my custom programming language. Its shared object (.so) output is based on ~300 tests I wrote that diff it against the expected NASM output. So my compiler is basically a NASM compiler that takes .grug files as input, rather than .s x86-64 Assembly files. I’m explaining this to clarify why I can’t just ditch NASM.

After days of carefully shaving my original program down, I’ve finally got a minimal reproducible example which shows that errors like FATAL ERROR in native method: Static field ID passed to JNI have a roughly 1 in 20 chance of being printed when these conditions are met:

JNI is used to call a C function that opens mage.so
mage.so was generated with ld from mage.o, where mage.o was generated with nasm from mage.s (where mage.s can just be an empty file)
After loading mage.so, JNI is used to call a C function that causes a SIGSEGV

I have been able to reproduce the errors on both my Ubuntu and Arch Linux computers.

What’s strange is that generating mage.o from an empty mage.c with gcc, instead of using NASM, does not cause the errors to be printed, which is why I think NASM may be to blame.

Minimal reproducible example

Main.java:

class Main {
    private native void init();
    private native void foo();

    public static void main(String[] args) {
        new Main().run();
    }

    public void run() {
        System.loadLibrary("foo");

        init();

        long iteration = 0;
        for (int i = 0; i < 2; i++) {
            System.out.println("Iteration: " + ++iteration);
            foo();
        }
    }
}

foo.c:

#include <dlfcn.h>
#include <jni.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

sigjmp_buf jmp_buffer; // sigsetjmp()/siglongjmp() take a sigjmp_buf

volatile pthread_t expected_thread;

static void segv_handler(int sig) {
    (void)sig;

    {
        char msg[] = "In segv_handler()\n";
        write(STDERR_FILENO, msg, sizeof(msg)-1);
    }

    if (!pthread_equal(pthread_self(), expected_thread)) {
        char msg[] = "Unexpected thread entered handler; exiting\n";
        write(STDERR_FILENO, msg, sizeof(msg)-1);
        _exit(EXIT_FAILURE);
    }

    siglongjmp(jmp_buffer, 1);
}

JNIEXPORT void JNICALL Java_Main_init(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;

    fprintf(stderr, "Initializing...\n");

    struct sigaction sigsegv_sa = {
        .sa_handler = segv_handler,
        .sa_flags = SA_ONSTACK, // SA_ONSTACK gives SIGSEGV its own stack
    };

    // Handle stack overflow
    // See https://stackoverflow.com/a/7342398/13279557
    static char stack[SIGSTKSZ];
    stack_t ss = {
        .ss_size = SIGSTKSZ,
        .ss_sp = stack,
    };

    if (sigaltstack(&ss, NULL) == -1) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }

    if (sigfillset(&sigsegv_sa.sa_mask) == -1) {
        perror("sigfillset");
        exit(EXIT_FAILURE);
    }

    if (sigaction(SIGSEGV, &sigsegv_sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    void *dll = dlopen("./mage.so", RTLD_NOW);
    if (!dll) {
        fprintf(stderr, "dlopen(): %s\n", dlerror());
    }
}

void recurse() {
    recurse();
}

JNIEXPORT void JNICALL Java_Main_foo(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;

    expected_thread = pthread_self();

    if (sigsetjmp(jmp_buffer, 1)) {
        fprintf(stderr, "Jumped\n");
        return;
    }

    fprintf(stderr, "Recursing...\n");

    recurse();
}

mage.s: an empty file

mage.c: an empty file

Compiling libfoo.so (you will need to replace the JDK include paths here with your own, which ls /usr/lib/jvm can help with):

gcc foo.c -o libfoo.so -shared -fPIC -g -Wall -Wextra -Wpedantic -Werror -Wfatal-errors -Wno-infinite-recursion -I/usr/lib/jvm/jdk-23.0.1-oracle-x64/include -I/usr/lib/jvm/jdk-23.0.1-oracle-x64/include/linux

Then assemble mage.s to mage.o:

nasm mage.s -felf64

And link mage.o to mage.so:

ld mage.o -o mage.so -shared

Finally we run Main.java in an infinite loop, which should eventually print FATAL ERROR in native method: Static field ID passed to JNI:

while true; do java -Xcheck:jni -XX:+AllowUserSignalHandlers -Djava.library.path=. Main.java; done

Hitting Ctrl+Z a few times will suspend the loop, where you can then use kill %% to kill it.

My questions

What I don’t get is why generating mage.o from compiling mage.c, instead of from assembling mage.s, never gets the program to print the error:

gcc mage.c -c

I can see that the output of readelf -a mage.o, objdump -D mage.o, and xxd mage.o are all significantly larger when mage.o is generated from mage.c. This is due to GCC dumping GNU-specific sections in the ELF file and such, so I guess the error only being printed when using NASM to assemble mage.s may have to do with the sections in the ELF file?
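
To narrow down which sections differ, a small checker can test for one section name at a time instead of diffing the full readelf output. The sketch below is only an illustration: elf64_has_section() is a hypothetical helper, and it assumes a 64-bit ELF file on Linux with glibc's <elf.h>. One section that GCC emits and that NASM does not emit unless asked to is .note.GNU-stack; whether its absence has anything to do with the error is an assumption here, not something established in this question.

```c
#include <elf.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns 1 if the ELF64 file at path has a section
// called name, 0 if it does not, and -1 if the file can't be read
// or does not look like an ELF64 file with section headers.
static int elf64_has_section(const char *path, const char *name) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    int result = -1;
    Elf64_Shdr *shdrs = NULL;
    char *names = NULL;
    Elf64_Shdr *strtab = NULL;
    size_t shnum = 0;
    Elf64_Ehdr eh;

    if (fread(&eh, sizeof(eh), 1, f) != 1) goto done;
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) goto done;
    if (eh.e_ident[EI_CLASS] != ELFCLASS64) goto done;
    if (eh.e_shnum == 0 || eh.e_shentsize != sizeof(Elf64_Shdr)) goto done;
    if (eh.e_shstrndx >= eh.e_shnum) goto done;
    shnum = eh.e_shnum;

    // Read the section header table
    shdrs = malloc(shnum * sizeof(*shdrs));
    if (!shdrs) goto done;
    if (fseek(f, (long)eh.e_shoff, SEEK_SET) != 0) goto done;
    if (fread(shdrs, sizeof(*shdrs), shnum, f) != shnum) goto done;

    // Read the string table that holds the section names
    strtab = &shdrs[eh.e_shstrndx];
    names = malloc(strtab->sh_size + 1);
    if (!names) goto done;
    if (fseek(f, (long)strtab->sh_offset, SEEK_SET) != 0) goto done;
    if (fread(names, 1, strtab->sh_size, f) != strtab->sh_size) goto done;
    names[strtab->sh_size] = '\0';

    result = 0;
    for (size_t i = 0; i < shnum; i++) {
        if (shdrs[i].sh_name < strtab->sh_size &&
            strcmp(names + shdrs[i].sh_name, name) == 0) {
            result = 1;
            break;
        }
    }

done:
    free(names);
    free(shdrs);
    fclose(f);
    return result;
}
```

Calling elf64_has_section("mage.o", ".note.GNU-stack") once on the NASM-built object and once on the GCC-built one would show whether that particular section is part of the difference.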

What I also don’t understand is why the pthread_equal() check I put in the signal handler that exits right away does not prevent the FATAL ERROR in native method: Static field ID passed to JNI from being printed. I figured that the error was caused by an internal JVM thread entering my handler, while it was meant to enter JVM’s own SIGSEGV handler, but I guess not?

I know the program prints warnings saying it wants to be run with jsig, but as I described in this answer, using jsig is not possible when you want to override the JVM’s SIGSEGV handler with your own handler in C (as far as I’ve been able to tell from a week of research).

It’s easy to throw my hands up by blaming the odd behavior of the NASM version on me not using jsig, but it doesn’t make any logical sense to me. I’m still not sure whether it’s actually a NASM or JVM issue.

I’m on Ubuntu 24.04.1, and here are the versions of the programs I am calling in the MRE:

$ java --version
java 23.0.1 2024-10-15
Java(TM) SE Runtime Environment (build 23.0.1+11-39)
Java HotSpot(TM) 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)

$ nasm --version
NASM version 2.16.01 # 2.16.03-1 also reproduces the error

$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

$ ld --version
GNU ld (GNU Binutils for Ubuntu) 2.42

Answers

Chosen as BEST ANSWER

I now have two solutions to my own question. They're compiled and run in exactly the same way as in my original question.

They are based on yyyy's answer and suggestions in the comments, so massive thanks to them!

As noted in the comments of the below solution, I think this GCC page is saying that how I'm setting the rsp and rbp registers here is UB? If so, I'd love to hear of alternative ways to set them:

Another restriction is that the clobber list should not contain the stack pointer register. This is because the compiler requires the value of the stack pointer to be the same after an asm statement as it was on entry to the statement. However, previous versions of GCC did not enforce this rule and allowed the stack pointer to appear in the list, with unclear semantics. This behavior is deprecated and listing the stack pointer may become an error in future versions of GCC.

Solution 1

Here are its steps:

mmap() to create a new stack (that has a guard page provided by MAP_GROWSDOWN)
__asm__ volatile() to save, set, and restore the rsp and rbp registers to this block of mmap()ed memory

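#include <assert.h>
#include <dlfcn.h>
#include <jni.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h> // int64_t
#include <stdio.h> // fprintf(), perror(), printf()
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

sigjmp_buf jmp_buffer; // sigsetjmp()/siglongjmp() take a sigjmp_buf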

static char *base;
static char *top;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(jmp_buffer, 1);
}

static void mod_fn() {
    if (sigsetjmp(jmp_buffer, 1)) {
        fprintf(stderr, "Jumped %p %p %ld\n", base, top, (base - top) / 1024);
        return;
    }

    char c;
    top = &c;
    while (1) {
        top--;
        *top = 1;
    }
}

JNIEXPORT void JNICALL Java_Main_foo(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;

    char b;
    base = &b;

    struct sigaction sigsegv_sa = {
        .sa_handler = segv_handler,
        .sa_flags = SA_ONSTACK, // SA_ONSTACK gives SIGSEGV its own stack
    };

    // Set up an emergency stack for SIGSEGV
    // See https://stackoverflow.com/a/7342398/13279557
    static char emergency_stack[SIGSTKSZ];
    stack_t ss = {
        .ss_size = SIGSTKSZ,
        .ss_sp = emergency_stack,
    };

    if (sigaltstack(&ss, NULL) == -1) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }

    if (sigfillset(&sigsegv_sa.sa_mask) == -1) {
        perror("sigfillset");
        exit(EXIT_FAILURE);
    }

    if (sigaction(SIGSEGV, &sigsegv_sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    void *dll = dlopen("./mage.so", RTLD_NOW);
    if (!dll) {
        fprintf(stderr, "dlopen(): %s\n", dlerror());
    }

    size_t page_count = 8192;
    size_t page_size = sysconf(_SC_PAGE_SIZE);
    size_t length = page_count * page_size;

    void *map = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }

    // Asserting 16-byte alignment here is not necessary,
    // since mmap() guarantees it with the args we pass it
    assert(((size_t)map & 0xf) == 0);

    void *stack = (char *)map + length;

    // Save rbp and rsp
    // Marking these static is necessary for restoring
    static int64_t rsp;
    static int64_t rbp;
    __asm__ volatile("mov %%rsp, %0\n\t" : "=r" (rsp));
    __asm__ volatile("mov %%rbp, %0\n\t" : "=r" (rbp));

    // Set rbp and rsp to the top (highest address) of the mmap-ed memory,
    // since the stack grows down
    //
    // TODO: I think setting rsp and rbp here is UB?:
    // "Another restriction is that the clobber list should not contain
    // the stack pointer register. This is because the compiler requires
    // the value of the stack pointer to be the same after an asm statement
    // as it was on entry to the statement. However, previous versions
    // of GCC did not enforce this rule and allowed the stack pointer
    // to appear in the list, with unclear semantics. This behavior
    // is deprecated and listing the stack pointer may become an error
    // in future versions of GCC."
    // From https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
    __asm__ volatile("mov %0, %%rsp\n\t" : : "r" (stack));
    __asm__ volatile("mov %0, %%rbp\n\t" : : "r" (stack));

    mod_fn();

    // Restore rbp and rsp
    __asm__ volatile("mov %0, %%rsp\n\t" : : "r" (rsp));
    __asm__ volatile("mov %0, %%rbp\n\t" : : "r" (rbp));

    if (munmap(map, length) == -1) {
        perror("munmap");
        exit(EXIT_FAILURE);
    }

    printf("Success!\n");
}

Solution 2

Here are its steps:

mmap() to create a new stack (that has a guard page provided by MAP_GROWSDOWN)
__asm__ volatile() to set the rsp and rbp registers to this new block of mmap()ed memory
setjmp() + longjmp() to restore the rsp and rbp registers, and whatever needs to be preserved across fn calls

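#include <assert.h>
#include <dlfcn.h>
#include <jni.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h> // fprintf(), perror(), printf()
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

sigjmp_buf jmp_buffer; // sigsetjmp()/siglongjmp() take a sigjmp_buf
jmp_buf mmap_jmp_buffer;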

static char *base;
static char *top;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(jmp_buffer, 1);
}

static void mod_fn() {
    if (sigsetjmp(jmp_buffer, 1)) {
        fprintf(stderr, "Jumped %p %p %ld\n", base, top, (base - top) / 1024);
        return;
    }

    char c;
    top = &c;
    while (1) {
        top--;
        *top = 1;
    }
}

JNIEXPORT void JNICALL Java_Main_foo(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;

    char b;
    base = &b;

    struct sigaction sigsegv_sa = {
        .sa_handler = segv_handler,
        .sa_flags = SA_ONSTACK, // SA_ONSTACK gives SIGSEGV its own stack
    };

    // Set up an emergency stack for SIGSEGV
    // See https://stackoverflow.com/a/7342398/13279557
    static char emergency_stack[SIGSTKSZ];
    stack_t ss = {
        .ss_size = SIGSTKSZ,
        .ss_sp = emergency_stack,
    };

    if (sigaltstack(&ss, NULL) == -1) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }

    if (sigfillset(&sigsegv_sa.sa_mask) == -1) {
        perror("sigfillset");
        exit(EXIT_FAILURE);
    }

    if (sigaction(SIGSEGV, &sigsegv_sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    void *dll = dlopen("./mage.so", RTLD_NOW);
    if (!dll) {
        fprintf(stderr, "dlopen(): %s\n", dlerror());
    }

    size_t page_count = 8192;
    size_t page_size = sysconf(_SC_PAGE_SIZE);
    size_t length = page_count * page_size;

    void *map = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }

    // Asserting 16-byte alignment here is not necessary,
    // since mmap() guarantees it with the args we pass it
    assert(((size_t)map & 0xf) == 0);

    void *stack = (char *)map + length;

    if (setjmp(mmap_jmp_buffer) == 0) {
        // Set rbp and rsp to the top (highest address) of the mmap-ed memory,
        // since the stack grows down
        //
        // TODO: I think setting rsp and rbp here is UB?:
        // "Another restriction is that the clobber list should not contain
        // the stack pointer register. This is because the compiler requires
        // the value of the stack pointer to be the same after an asm statement
        // as it was on entry to the statement. However, previous versions
        // of GCC did not enforce this rule and allowed the stack pointer
        // to appear in the list, with unclear semantics. This behavior
        // is deprecated and listing the stack pointer may become an error
        // in future versions of GCC."
        // From https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
        __asm__ volatile("mov %0, %%rsp\n\t" : : "r" (stack));
        __asm__ volatile("mov %0, %%rbp\n\t" : : "r" (stack));

        mod_fn();

        // Restore rbp and rsp, and whatever needs to be preserved
        // across fn calls: https://stackoverflow.com/a/25266891/13279557
        longjmp(mmap_jmp_buffer, 1);
    }

    if (munmap(map, length) == -1) {
        perror("munmap");
        exit(EXIT_FAILURE);
    }

    printf("Success!\n");
}

(Edit)

I guess the JVM doesn’t allocate a guard page for each thread.

I can reliably (100% chance) reproduce this error on my machine, even with a simplified version of the code.

This code may or may not compile on your machine because I didn’t use extra checks like -Wall and -Werror.

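#include <dlfcn.h>
#include <jni.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

sigjmp_buf jmp_buffer; // sigsetjmp()/siglongjmp() take a sigjmp_buf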

static char *base;
static char *top;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(jmp_buffer, 1);
}

JNIEXPORT void JNICALL Java_Main_foo(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;
    char b;
    base = &b;

    struct sigaction sigsegv_sa = {
        .sa_handler = segv_handler
    };
    if (sigfillset(&sigsegv_sa.sa_mask) == -1) {
        perror("sigfillset");
        exit(EXIT_FAILURE);
    }

    if (sigaction(SIGSEGV, &sigsegv_sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    void *dll = dlopen("./mage.so", RTLD_NOW);
    if (!dll) {
        fprintf(stderr, "dlopen(): %s\n", dlerror());
    }

    if (sigsetjmp(jmp_buffer, 1)) {
        fprintf(stderr, "Jumped %lx %lx %ld\n", (unsigned long)base, (unsigned long)top, (base - top) / 1024);
        return;
    }

    char c;
    top = &c;
    while (1) {
        top--;
        *top = 1;
    }
}

I ran it under gdb, so that it pauses when terminated, giving me a chance to inspect its memory layout.

The program prints Jumped 7ffff64f9447 7ffff64b0fff 289. Its memory layout, as shown in /proc/<pid>/maps, looks like this:

7ffff6491000-7ffff649d000 r--p 00000000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff649d000-7ffff64ab000 r-xp 0000c000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff64ab000-7ffff64b0000 r--p 0001a000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff64b0000-7ffff64b1000 r--p 0001f000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff64b1000-7ffff64b2000 rw-p 00020000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff64b2000-7ffff64b3000 rw-p 00000000 00:00 0
7ffff64b3000-7ffff64bb000 rw-s 00000000 08:10 6624                       /tmp/hsperfdata_root/4468 (deleted)
7ffff64bb000-7ffff64fb000 rwxp 00000000 00:00 0
7ffff64fb000-7ffff64fe000 r--p 00000000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff64fe000-7ffff6519000 r-xp 00003000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff6519000-7ffff651d000 r--p 0001e000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff651d000-7ffff651e000 r--p 00021000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff651e000-7ffff651f000 rw-p 00022000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1

Note that 7ffff64b0fff, the address where SIGSEGV is raised, already lies within the memory range of /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so. This means the earlier writes to the "stack" overwrote JVM internal state, causing unpredictable errors.
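
Looking up which mapping an address falls into can also be done from inside the process, rather than by reading the maps output by eye. The sketch below is a hypothetical helper, find_mapping(), that scans /proc/self/maps for the line whose range contains a given address. It assumes Linux, and it is only meant for debugging: it is not async-signal-safe, so it should not be called from the SIGSEGV handler itself.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: copies the /proc/self/maps line whose address range
// contains addr into line_out. Returns 1 if a mapping was found,
// 0 if addr is not mapped, and -1 if the maps file can't be opened.
static int find_mapping(const void *addr, char *line_out, size_t line_size) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[4096 + 128]; // room for a long path plus the leading fields
    int found = 0;

    while (fgets(line, sizeof(line), f)) {
        uintptr_t start;
        uintptr_t end;

        // Every line starts with "start-end", both in hexadecimal
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &start, &end) != 2) {
            continue;
        }

        // The end address is exclusive
        if (target >= start && target < end) {
            line[strcspn(line, "\n")] = '\0';
            snprintf(line_out, line_size, "%s", line);
            found = 1;
            break;
        }
    }

    fclose(f);
    return found;
}
```

Passing it the values of base and top after the jump would print the two mappings the writes started and ended in, which is the same comparison made by hand above.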

I’m still not sure why this only happens when dlopen("mage.so") is called, but I think the absence of a guard page is the root cause.

Ubuntu – Is this a bug in the JVM, or in NASM?