I think I’ve found a bug, but I’m not sure whether to blame the JVM, JNI, NASM, GCC, or ld.
The bug causes my compiled modding language called grug (see this for videos and explanations) to sporadically crash Minecraft (Java edition) when mods cause a SIGSEGV with infinite recursion.
My modding language just consists of a 9k line grug.c, delivering a compiler and linker for my custom programming language. Its shared object (.so) output is based on ~300 tests I wrote that diff it against the expected NASM output. So my compiler is basically a NASM compiler that takes .grug files as input, rather than .s x86-64 Assembly files. I’m explaining this to clarify why I can’t just ditch NASM.
After days of carefully shaving my original program down, I’ve finally got a minimal reproducible example which shows that errors like FATAL ERROR in native method: Static field ID passed to JNI
have a roughly 1 in 20 chance of being printed when these conditions are met:
- JNI is used to call a C function that opens mage.so
- mage.so was generated with ld from mage.o, where mage.o was generated with nasm from mage.s (where mage.s can just be an empty file)
- After loading mage.so, JNI is used to call a C function that causes a SIGSEGV
I have been able to reproduce the errors on both my Ubuntu and Arch Linux computers.
What’s strange is that generating mage.o from an empty mage.c with gcc, instead of using NASM, does not cause the errors to be printed, which is why I think NASM may be to blame.
Minimal reproducible example
Main.java:
class Main {
    private native void init();
    private native void foo();

    public static void main(String[] args) {
        new Main().run();
    }

    public void run() {
        System.loadLibrary("foo");
        init();
        long iteration = 0;
        for (int i = 0; i < 2; i++) {
            System.out.println("Iteration: " + ++iteration);
            foo();
        }
    }
}
foo.c:
#include <dlfcn.h>
#include <jni.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

jmp_buf jmp_buffer;
volatile pthread_t expected_thread;

static void segv_handler(int sig) {
    (void)sig;
    {
        char msg[] = "In segv_handler()\n";
        write(STDERR_FILENO, msg, sizeof(msg)-1);
    }
    if (!pthread_equal(pthread_self(), expected_thread)) {
        char msg[] = "Unexpected thread entered handler; exiting\n";
        write(STDERR_FILENO, msg, sizeof(msg)-1);
        _exit(EXIT_FAILURE);
    }
    siglongjmp(jmp_buffer, 1);
}

JNIEXPORT void JNICALL Java_Main_init(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;
    fprintf(stderr, "Initializing...\n");
    struct sigaction sigsegv_sa = {
        .sa_handler = segv_handler,
        .sa_flags = SA_ONSTACK, // SA_ONSTACK gives SIGSEGV its own stack
    };
    // Handle stack overflow
    // See https://stackoverflow.com/a/7342398/13279557
    static char stack[SIGSTKSZ];
    stack_t ss = {
        .ss_size = SIGSTKSZ,
        .ss_sp = stack,
    };
    if (sigaltstack(&ss, NULL) == -1) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }
    if (sigfillset(&sigsegv_sa.sa_mask) == -1) {
        perror("sigfillset");
        exit(EXIT_FAILURE);
    }
    if (sigaction(SIGSEGV, &sigsegv_sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }
    void *dll = dlopen("./mage.so", RTLD_NOW);
    if (!dll) {
        fprintf(stderr, "dlopen(): %s\n", dlerror());
    }
}

void recurse() {
    recurse();
}

JNIEXPORT void JNICALL Java_Main_foo(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;
    expected_thread = pthread_self();
    if (sigsetjmp(jmp_buffer, 1)) {
        fprintf(stderr, "Jumped\n");
        return;
    }
    fprintf(stderr, "Recursing...\n");
    recurse();
}
mage.s: an empty file
mage.c: an empty file
Compiling libfoo.so (you will need to replace the JDK include paths here with your own, which ls /usr/lib/jvm can help with):
gcc foo.c -o libfoo.so -shared -fPIC -g -Wall -Wextra -Wpedantic -Werror -Wfatal-errors -Wno-infinite-recursion -I/usr/lib/jvm/jdk-23.0.1-oracle-x64/include -I/usr/lib/jvm/jdk-23.0.1-oracle-x64/include/linux
Then assemble mage.s to mage.o:
nasm mage.s -felf64
And link mage.o to mage.so:
ld mage.o -o mage.so -shared
Finally we run Main.java in an infinite loop, which should eventually print FATAL ERROR in native method: Static field ID passed to JNI:
while true; do java -Xcheck:jni -XX:+AllowUserSignalHandlers -Djava.library.path=. Main.java; done
Hitting Ctrl+Z a few times will suspend the loop, where you can then use kill %% to kill it.
My questions
What I don’t get is why generating mage.o from compiling mage.c, instead of from assembling mage.s, never gets the program to print the error:
gcc mage.c -c
I can see that the output of readelf -a mage.o, objdump -D mage.o, and xxd mage.o is significantly larger in all three cases when mage.o is generated from mage.c. This is due to GCC dumping GNU-specific sections in the ELF file and such, so I guess the error only being printed when using NASM to assemble mage.s may have to do with the sections in the ELF file?
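One ELF difference that can be checked directly is the PT_GNU_STACK program header of mage.so. GCC emits a .note.GNU-stack section into its objects by default, NASM does not unless the source declares one, and GNU ld consults that section when it decides the flags of PT_GNU_STACK. Whether this is what triggers the error is an assumption that nothing in this post confirms. The sketch below is a minimal C helper, with the hypothetical name gnu_stack_flags(), that reports those flags for an ELF64 image already read into memory; readelf -lW mage.so shows the same header.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper, not part of the original MRE.
// Returns 1 and stores p_flags in *flags if a PT_GNU_STACK header exists,
// 0 if the image has no such header, and -1 if it is not a usable ELF64 image.
int gnu_stack_flags(const unsigned char *image, size_t size, uint32_t *flags) {
    Elf64_Ehdr eh;
    if (size < sizeof eh) return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) return -1;
    if (eh.e_ident[EI_CLASS] != ELFCLASS64) return -1;

    for (uint16_t i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        size_t off = (size_t)eh.e_phoff + (size_t)i * eh.e_phentsize;
        // Reject program header tables that run past the end of the buffer.
        if (eh.e_phentsize < sizeof ph || off > size || size - off < sizeof ph) {
            return -1;
        }
        memcpy(&ph, image + off, sizeof ph);
        if (ph.p_type == PT_GNU_STACK) {
            *flags = ph.p_flags;
            return 1;
        }
    }
    return 0;
}
```

Calling it on the bytes of both variants of mage.so and comparing the PF_X bit would show whether the NASM-built library requests an executable stack while the GCC-built one does not.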
What I also don’t understand is why the pthread_equal() check I put in the signal handler that exits right away does not prevent the FATAL ERROR in native method: Static field ID passed to JNI from being printed. I figured that the error was caused by an internal JVM thread entering my handler, while it was meant to enter JVM’s own SIGSEGV handler, but I guess not?
I know the program prints warnings about wanting to be run with jsig, but as I described in this answer, using jsig is not possible when wanting to overwrite JVM’s SIGSEGV handler with your own handler in C (as far as I’ve been able to tell from a week of research).
It’s easy to throw my hands up by blaming the odd behavior of the NASM version on me not using jsig, but it doesn’t make any logical sense to me. I’m still not sure whether it’s actually a NASM or JVM issue.
I’m on Ubuntu 24.04.1, and here are the versions of the programs I am calling in the MRE:
$ java --version
java 23.0.1 2024-10-15
Java(TM) SE Runtime Environment (build 23.0.1+11-39)
Java HotSpot(TM) 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)
$ nasm --version
NASM version 2.16.01 # 2.16.03-1 also reproduces the error
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ ld --version
GNU ld (GNU Binutils for Ubuntu) 2.42
2 Answers
I now have two solutions to my own question. They're compiled and run in exactly the same way as in my original question.
They are based on yyyy's answer and suggestions in the comments, so massive thanks to them!
As noted in the comments of the below solution, I think this GCC page is saying that how I'm setting the rsp and rbp registers here is UB? If so, I'd love to hear of alternative ways to set them:
Solution 1
Here are its steps:
- […] MAP_GROWSDOWN)
- […] rsp and rbp registers to this block of mmap()ed memory
Solution 2
Here are its steps:
- […] MAP_GROWSDOWN)
- […] rsp and rbp registers to this new block of mmap()ed memory
- […] rsp and rbp registers, and whatever needs to be preserved across fn calls
I guess JVM doesn’t allocate a guard page for each thread.
I can reliably (100% chance) reproduce this error on my machine, even with a simplified version of the code.
This code may or may not compile on your machine because I didn’t use extra checks like -Wall and -Werror.
I run it with gdb, so it pauses when terminated, giving me a chance to inspect its memory layout.
The program prints Jumped 7ffff64f9447 7ffff64b0fff 289, while its memory layout, as shown in /proc/<pid>/maps, looks like this:
Note that 7ffff64b0fff, which is the address where SIGSEGV is raised, is already in the memory range of /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so, which means previous writes to the "stack" overwrote JVM internal state, causing unpredictable errors.
I’m still not sure why this only happens when dlopen("mage.so") is called, but I think the absence of a guard page is the root cause.