For my custom-compiled native x64 JIT code, I have certain intrinsic functions. Many of them are only called from my JIT code, so I generate them with my own compiler. Some of them, however, are called directly from C++ code, so I want them compiled into a static lib, so that they can at least be linked statically, if not inlined.
I need to use inline-assembly for those functions, as they perform actions that cannot be expressed in regular C++, like setting a non-volatile register from a function-input. However, the function itself must behave like a regular x64-function – it needs a prolog/epilogue, and it must have the necessary unwind-information to support stack-traces and exception-handling. Thus I cannot use MSVC (which is my native compiler), so I decided to make a static lib with clang-cl in Visual Studio instead. The best I got so far is the following:
    void interruptEntry(void* pState, const char* pAddress)
    {
        __asm
        {
            // load state into RBX
            mov rbx,rcx
            // load callstack-top into RDI
            mov rax,[rbx]
            mov rdi,[rax]
            // call address
            call rdx
        };
    }
This generates the proper prolog, epilogue and all required unwind information. However, it critically lacks the 32 bytes of shadow space required by the x64 calling convention (which pAddress expects to be called with):
    Acclimate Engine.dll!interruptEntry(void *, const char *):
        push rdi
        push rbx
        mov rbx,rcx
        mov rax,qword ptr [rbx]
        mov rdi,qword ptr [rax]
        call rdx
        pop rbx
        pop rdi
        ret
Keep in mind that while this code is generated via clang-cl, the DLL is linked with MSVC. The static lib is compiled with O2 (set from the Visual Studio project page).
Things I’ve tried:
- Modifying RSP manually, with sub RSP,32. This results in a frame-pointer register being established, as the compiler will count this as a dynamic stack allocation. This adds too much overhead to make it worth using a statically compiled function in the first place
- Similarly, I could reference "pState" directly in asm (mov rbx,pState). This causes the shadow space to be added, but pState is then copied onto the stack and loaded into rbx from that stack location instead of from the register. This once again defeats the purpose of what I am doing here.
- Calling "pAddress" as a function-pointer directly, after the asm-block. This will still not result in any difference in code-gen
- Using normal asm(), or extended asm, in combination with "__attribute__((naked))". That will not generate the prolog/epilogue, which I can write myself, but then the unwind information is missing. clang-cl seems not to understand any of the unwind-data directives, like .allocstack or .pushreg, resulting in an "error : unknown directive", regardless of which type of asm block they are used in.
Is there any reason why the shadow space is missing, and any way to get it there without adding unnecessary overhead like a frame pointer (while still having unwind information)? I'm also open to other suggestions. For example, if there is some intrinsic that lets me set those registers (while still compiling down to the one move), I would not need to use assembly (manipulating specific registers with global effect is the main reason I cannot write plain C++).
2 Answers
So, I had a look at MASM64, as suggested by Margaret Bloom, and I have to say it is generally cool to still have the option to generate assembly at all in MSVC. However, for my own case, I evaluated all the functions that are on my list to potentially rewrite in assembly. Most of them are more complex functions (including calls to other functions, asserts, loops, jumps, ...), but only need a few specific instructions to modify registers, or to jmp to an address instead of calling it. So, while I can use MASM for the simpler cases, I do want to have the option to code certain things with inline assembly, added to a normal C++ function. Luckily, I did find a way:
This compiles to my expected result:
It does seem like using extended inline assembly is the way. Using the __asm block somehow confused the compiler about what type of calls are being made, which is a shame, because I much preferred the syntax of __asm. Extended asm does allow me to tell the compiler which registers are modified, which is neat, because for certain functions that I want to port, I explicitly do not want the modified registers to be restored. And I can also just use regular C++ intermixed, which is one of my requirements for certain operations.
I also had to make sure to include an empty volatile asm-block after the call, to prevent it from tail-calling (which would pop the registers before). There is this one nop, which I'm unsure about (it is not caused by the empty block directly; as if I write a volatile nop there, it will have two nops). But I kind of assume clang knows what it's doing here - otherwise, the minor cost of one nop is acceptable, as far as I'm concerned.
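The listing itself is not reproduced above. Purely as an illustration of the pattern described (extended asm with a clobber list, a normal C++ call, then an empty volatile asm statement to discourage a tail call), a minimal sketch could look like the following. The names `enterJitSketch`, `JitTarget` and `demoJitTarget` are hypothetical, the function-pointer type is an assumption, and this is not the author's code.

```cpp
#include <cassert>

// Hypothetical target type: stands in for the JIT-compiled address.
using JitTarget = void (*)();

int g_jitCalls = 0;                       // demo-only call counter
void demoJitTarget() { ++g_jitCalls; }    // demo-only stand-in for JIT code

void enterJitSketch(void* pState, JitTarget pAddress)
{
    // AT&T syntax (the GCC/Clang default). The "rbx" clobber tells the
    // compiler that RBX is modified, so it saves/restores it around the body.
    asm volatile("mov %0, %%rbx" : : "r"(pState) : "rbx");

    // Ordinary C++ call: the compiler knows the calling convention here.
    pAddress();

    // Empty volatile asm after the call, so the call is not the last
    // thing in the function and is less likely to become a tail call.
    asm volatile("");
}

// Demo-only driver: returns how many times the target ran.
int runEnterJitDemo()
{
    int dummyState = 0;
    const int before = g_jitCalls;
    enterJitSketch(&dummyState, demoJitTarget);
    assert(g_jitCalls == before + 1);
    return g_jitCalls - before;
}
```

Note that the sketch only builds for x86-64 with a compiler that accepts GNU-style asm statements.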
Making calls from inline asm is generally not well supported. Avoid whenever possible.
The compiler only scans the inline asm block to see what registers are potentially clobbered; it doesn't assume that call instructions in asm are to functions that follow the standard calling convention for this target (otherwise why would you be using inline asm in the first place?). So it's a huge pain to do it safely. The same goes for x86-64 System V (see "Calling printf in extended inline ASM"): using GNU C inline asm you also have to declare all the register clobbers yourself, as well as take care of the red zone, since there's no way to declare a clobber on that.

Your idea of using inline asm to leave values in registers and block tail-call optimization is a good one. But the implementation in your self-answer, with two separate asm() statements, doesn't do anything to stop the compiler from stepping on RBX with the instructions it emits for code outside the asm statements. A different compiler or version could easily break your code by picking RBX as a temporary instead of RAX when compiling the code between the asm statements. (And since you didn't use __attribute__((noinline)), code from parent functions could be scheduled here.)

You can write it in a way that discourages the compiler from stepping on your registers. Make those values needed in those registers after the call (as inputs to an empty asm statement), so the asm you want is the only efficient choice. That makes it a lot less likely that this will break in practice.

Using register T foo asm("regname") local register variables lets you ask for values in any of R8-R15, which don't have specific-register constraint letters, by forcing an "r"(var) constraint to pick a specific register. (And for many other ISAs, there aren't letters for any single registers.) It's not actually needed here because the "D" and "b" constraints require RDI and RBX respectively.

Godbolt shows it works in GCC and clang (-mabi=ms for GCC, and -target x86_64-w64-windows-gnu for Clang). As a bonus, this doesn't require a very recent Clang for -masm=intel to apply to asm statements, since the actual templates are empty. The action is in the constraints, requiring the compiler to have both values in the registers we want, but without any

Your code also violates the strict-aliasing rule by pointing a void ** at an object of a different type. Only [unsigned] char* and pointers to types declared with __attribute__((may_alias)) can be pointed at arbitrary things in GNU C. But for compat with MSVC, clang-cl probably enables -fno-strict-aliasing.

Both of these compile to the same asm as yours with current versions of GCC and clang, and the source is shorter and easier to read (if you know GNU C inline asm). The point is that they will more reliably do so with future versions and even if inlined into other surrounding code.
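The listings this answer refers to are not reproduced here. As a rough sketch of the empty-template approach described above, assuming a hypothetical `InterruptState` layout whose first member points at the callstack-top slot and a hypothetical function-pointer type for the target, it might look like this:

```cpp
#include <cassert>

// Hypothetical stand-in for the question's state object: the first
// member points at the slot holding the top of the callstack.
struct InterruptState {
    void** callstackTop;
};

// Hypothetical target type: stands in for the JIT-compiled address.
using InterruptTarget = void (*)();

int g_targetCalls = 0;                       // demo-only call counter
void demoInterruptTarget() { ++g_targetCalls; }

void interruptEntrySketch(InterruptState* pState, InterruptTarget pAddress)
{
    void* top = *pState->callstackTop;

    // Empty template: the "b" and "D" constraints require pState in RBX
    // and top in RDI at this point, without emitting any instructions
    // from the template itself.
    asm volatile("" : : "b"(pState), "D"(top));

    pAddress();

    // Ask for the same values in the same registers after the call, so
    // keeping them there across the call is the cheapest choice. This
    // statement also means the call is not in tail position.
    asm volatile("" : : "b"(pState), "D"(top));
}

// Demo-only driver: returns how many times the target ran.
int runInterruptDemo()
{
    void* topValue = nullptr;
    InterruptState state{&topValue};
    const int before = g_targetCalls;
    interruptEntrySketch(&state, demoInterruptTarget);
    assert(g_targetCalls == before + 1);
    return g_targetCalls - before;
}
```

Using a struct with a typed member, rather than casting to void **, also sidesteps the aliasing concern raised above. On a target where RDI is call-clobbered (x86-64 System V), the compiler has to reload RDI after the call, so the generated code differs from the Windows x64 case.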
I didn't use __attribute__((noinline)) on my versions, since even in a use-case where they do inline into a caller (e.g. -flto link-time optimization), the asm statements hopefully convince the compiler not to do something else with RBX or RDI in that window between the asm statement and the call, if it is moving code around to try to schedule it more efficiently.