Memory Safe Inline Assembly

NOTE: This is a pre-release feature. The Fil-C 0.679 release does not ship with this feature. To test this feature, you need to build from source.

GCC and clang both support an incredibly powerful inline assembly syntax. For example:

unsigned rotate(unsigned x, unsigned char c)
{
    asm("roll %1, %0" : "+r"(x) : "c"(c) : "cc");
    return x;
}

This instructs the compiler to emit assembly based on the roll %1, %0 template, where %1 is filled in with %cl, %0 is filled in with whichever register holds x, and c is moved into the %ecx register just before the roll instruction. Additionally, the compiler is told that the instruction will change the value of x and change the condition flags.

This seems like it cannot possibly be safe! What if the programmer did something wrong, like omitting the + in "+r", or forgetting the "cc" clobber? In Yolo-C, if you make such a mistake, the compiler happily miscompiles your code.

Yet Fil-C supports this inline assembly syntax and it's completely safe!

This document explains why Fil-C supports inline assembly at all and then goes into the details of how that support is achieved while maintaining both programmer intent (you still get the assembly template you asked for) and complete memory safety (if you do something wrong, you'll panic or get an illegal instruction trap, at worst).

Why Inline Assembly?

While reviewing folks' C and C++ code, I've found the following reasons for inline assembly, listed with the most common first:

Blank inline assembly to prevent compiler analysis. This includes things like asm volatile("" : : : "memory"), which is an old-school way of saying atomic_signal_fence(memory_order_seq_cst). It works because we're telling the compiler that the inline assembly clobbers all memory, which forces the compiler to serialize memory accesses, just like a signal fence would have. The contract with the compiler is clear: the compiler must emit exactly the assembly we're asking it to emit (which is blank here) without second-guessing our claims about the clobbers. That is, the compiler must not infer that there cannot be a memory clobber just because the assembly is blank. We said memory clobber, so that's what the compiler sees. Similarly, folks do stuff like asm("" : "+r"(x)). This means: the assembly may read and then write x. The assembly is blank, so this incurs no cost other than forcing the compiler to assume that it doesn't know anything about x's value after the assembly executes. This kind of data flow fence is useful for writing constant-time crypto. Fil-C has long supported blank inline assembly since it's trivially safe. Fil-C even supports "+r" constraints on pointers, in which case both the intval and lower are threaded through their own "+r"-like constraints at the LLVM IR level.
cpuid and xgetbv. The inline assembly snippets for these two instructions occur most often in code that then goes on to use SIMD intrinsics. I think this is because the __get_cpuid API in cpuid.h is confusing to use and, as far as I can tell, does not work right in either GCC or clang. Hence, packages like zstd, simdutf, simdjson, and other SIMD-using programs tend to identify CPU features by using inline assembly that invokes cpuid. They often also use inline assembly to invoke xgetbv. In Fil-C, __get_cpuid is fixed, so you could use that, and zxgetbv is offered as an intrinsic. However, it's better to support those inline assembly snippets without requiring folks to change their code! And there's nothing unsafe about invoking cpuid and xgetbv so long as the code specifies the right clobbers and constraints.
Arithmetic over secrets in crypto code. A great example is OpenSSH's sntrup761 implementation, which wraps key arithmetic in inline assembly to ensure that it gets exactly the right instruction and not some instruction that might have varying execution time depending on inputs. Note that this kind of code often has fallbacks to try to get the compiler to emit constant-time code even if inline assembly is not supported, but those fallbacks are unlikely to be as rigorously validated, and often rely on "optimization blocking" idioms that hurt performance and could be circumvented by a sufficiently clever compiler. Hence, it's safest to support inline assembly snippets that do this. Luckily, these snippets are also completely safe, provided that the constraints and clobbers are correct.
Atomics. Compilers have long supported intrinsics for atomic instructions. Compilers also have a long history of implementing these intrinsics incorrectly! Most recently, clang had bugs in how it lowered CAS to LL/SC on ARM64. Hence, serious lock-free programmers tend to write their atomic instructions using inline assembly at least some of the time, like in those cases where they had encountered a miscompile and so dropping to assembly was their only path to fixing the bug. Supporting atomics in inline assembly would require allowing inline assembly that accesses memory, which would mean somehow inferring what Fil-C bounds checks to do. Inline assembly that accesses memory is currently out of scope. However, this document will show how we support inline assembly for fences (lfence, sfence, mfence, and serialize).
System calls. These are currently out of scope for inline assembly in Fil-C, and that's fine, since using inline assembly for syscalls is only necessary in the guts of libc implementations. Fil-C already has ports of musl and glibc, and in both cases the inline assembly for syscalls is replaced with calls to the pizlonated_syscalls.h API that Fil-C provides. However, I can imagine adding support for inline assembly that does syscalls in the future, to make it easier to port new libc's to Fil-C.
x87 long double functions. If you're working with long double on x86, then you're using the x87 80-bit floating point math. If you want access to the x87 FPU's implementations of various math functions, then often the best way to do that is to drop to inline assembly. This is totally safe, provided that the inline assembly doesn't push or pop the x87 stack, and the constraints correctly spell out which x87 stack registers were clobbered.
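The blank-assembly idioms from the first item in this list can be sketched as follows. This is a minimal illustration that assumes a GCC- or clang-compatible compiler; the helper names (compiler_fence, hide_u32, ct_select_u32, check_ct_select) are made up for this example and are not part of Fil-C or any library.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-only fence: emits no instructions, but the "memory" clobber
   tells the compiler not to move memory accesses across this point. */
static inline void compiler_fence(void)
{
    __asm__ volatile("" : : : "memory");
}

/* Data flow fence: the empty template claims to read and then write x,
   so the compiler must forget whatever it knew about x's value. */
static inline uint32_t hide_u32(uint32_t x)
{
    __asm__("" : "+r"(x));
    return x;
}

/* Select built on the fence: returns a if bit is 1, b if bit is 0.
   Hiding the mask makes it harder for the compiler to recognize the
   pattern and rewrite it as a branch. */
static uint32_t ct_select_u32(uint32_t bit, uint32_t a, uint32_t b)
{
    uint32_t mask = hide_u32(0u - (bit & 1u)); /* all ones or all zeros */
    return (a & mask) | (b & ~mask);
}

/* Self-check against the obvious branching version. */
static void check_ct_select(void)
{
    for (uint32_t bit = 0; bit < 2; bit++)
        assert(ct_select_u32(bit, 7u, 9u) == (bit ? 7u : 9u));
}
```

Neither helper emits any machine instructions of its own, which is why this category is trivially safe: the only thing that changes is what the optimizer is allowed to assume.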

It's likely that folks use inline assembly for other purposes, but the above list is all that I've seen when surveying programs in the Linux userland.

To summarize:

There remain many legitimate uses of inline assembly.
Inline assembly use is widespread in C and C++ libraries. You're probably using multiple of those libraries right now as you're reading this post, and the inline assembly in those libraries is on the critical path.
Much of the inline assembly is trivially safe: it doesn't access memory, it does no control flow, and the instructions used have no other sneaky side effects.
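The cpuid idiom is a good illustration of what trivially safe means here. The sketch below assumes a GCC- or clang-compatible compiler and falls back to reporting "unavailable" on anything other than x86_64. The names cpuid_regs, query_cpuid, cpu_has_osxsave, and check_cpuid are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t eax, ebx, ecx, edx; } cpuid_regs;

/* Returns 1 and fills *out on x86_64; returns 0 on other targets. */
static int query_cpuid(uint32_t leaf, uint32_t subleaf, cpuid_regs *out)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint32_t a = leaf, b, c = subleaf, d;
    /* cpuid reads eax/ecx and writes eax/ebx/ecx/edx. Every one of those
       registers appears in the constraints, and the template has no
       memory operands and no control flow. */
    __asm__ volatile("cpuid" : "+a"(a), "=b"(b), "+c"(c), "=d"(d));
    out->eax = a; out->ebx = b; out->ecx = c; out->edx = d;
    return 1;
#else
    (void)leaf; (void)subleaf; (void)out;
    return 0;
#endif
}

/* Leaf 1, ECX bit 27 is OSXSAVE: set when the OS has enabled xgetbv. */
static int cpu_has_osxsave(void)
{
    cpuid_regs r;
    if (!query_cpuid(1, 0, &r))
        return 0;
    return (int)((r.ecx >> 27) & 1u);
}

/* Self-check: on x86_64, leaf 0 reports the highest supported leaf in
   eax, which is at least 1 on any x86_64 CPU. */
static void check_cpuid(void)
{
    cpuid_regs r;
    if (query_cpuid(0, 0, &r))
        assert(r.eax >= 1);
}
```

The stores into *out are ordinary C that happens after the assembly, so they go through the compiler's normal checks; the assembly itself only touches registers.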

Read on for details about the world's first memory safe inline assembly implementation!

Supporting Inline Assembly Safely

When the Fil-C compiler's safety instrumentation pass (called FilPizlonator) runs, inline assembly is present in LLVM IR as a pair of strings:

The assembly string, almost exactly like it appears in the C source code, just with some characters replaced. For example, the roll example turns into roll $1, $0.
The constraint string. This uses an LLVM-specific syntax to express the constraints and clobbers. For the roll example, this is =r,{cx},0,~{cc},~{dirflag},~{fpsr},~{flags}.
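To make the constraint string's structure concrete, here is a small sketch that splits it on commas and classifies each item. It is an illustration only, not FilPizlonator's parser: the names constraint_counts, count_constraints, and check_roll_constraints are hypothetical, and the classification rules (leading = for outputs, ~{ for clobbers, a leading digit for an input tied to that output) cover only the forms that appear in this document.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef struct {
    int outputs;      /* items starting with '='                    */
    int inputs;       /* all inputs, including tied ones            */
    int tied_inputs;  /* inputs like "0" that reuse an output slot  */
    int clobbers;     /* items of the form ~{name}                  */
} constraint_counts;

static constraint_counts count_constraints(const char *s)
{
    constraint_counts n = {0, 0, 0, 0};
    while (*s) {
        size_t len = strcspn(s, ",");  /* length of the current item */
        if (len >= 2 && s[0] == '~' && s[1] == '{') {
            n.clobbers++;
        } else if (len >= 1 && s[0] == '=') {
            n.outputs++;
        } else if (len >= 1 && isdigit((unsigned char)s[0])) {
            n.tied_inputs++;
            n.inputs++;
        } else if (len >= 1) {
            n.inputs++;
        }
        s += len;
        if (*s == ',')
            s++;
    }
    return n;
}

/* Self-check using the roll example's constraint string. */
static void check_roll_constraints(void)
{
    constraint_counts n =
        count_constraints("=r,{cx},0,~{cc},~{dirflag},~{fpsr},~{flags}");
    assert(n.outputs == 1 && n.inputs == 2);
    assert(n.tied_inputs == 1 && n.clobbers == 4);
}
```

For the roll example, the single "+r"(x) operand in C shows up as two items: the =r output and the tied input 0.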

Hence, we can validate if an inline assembly expression is safe by:

Parsing and analyzing the assembly. If it contains memory accesses, control flow, or anything we don't recognize, we reject it.
Parsing and analyzing the constraints. If those do anything we don't recognize or support, then reject.
Ensuring that the assembly's effects are fully captured by the constraints. For example, if an assembly instruction modifies a register, then the constraints must capture that register mutation. If any instruction sets some CPU flags, then those flags must be listed as clobbers.
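The three steps above can be sketched as a toy validator. This is a deliberately tiny illustration and not the code in FilPizlonator: the allowlist table, the safe_insn type, and the functions has_item, validate_asm, and check_validate are all hypothetical. It handles a single instruction with no surrounding whitespace, treats any parenthesis as a memory operand, and knows about only four mnemonics.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *mnemonic;
    const char *required[5];  /* constraint items that must be present */
} safe_insn;

/* Each entry lists the constraint items needed to capture the
   instruction's implicit register reads/writes and flag updates. */
static const safe_insn allowlist[] = {
    { "cpuid",  { "={ax}", "={bx}", "={cx}", "={dx}", NULL } },
    { "xgetbv", { "={ax}", "={dx}", "{cx}", NULL } },
    { "sarw",   { "~{cc}", NULL } },
    { "roll",   { "~{cc}", NULL } },
};

/* True if item appears as a whole comma-separated entry of list. */
static int has_item(const char *list, const char *item)
{
    size_t n = strlen(item);
    while (*list) {
        size_t len = strcspn(list, ",");
        if (len == n && memcmp(list, item, n) == 0)
            return 1;
        list += len;
        if (*list == ',')
            list++;
    }
    return 0;
}

/* Returns NULL when accepted, otherwise a reason for the rejection. */
static const char *validate_asm(const char *text, const char *constraints)
{
    if (strchr(text, '(') || strchr(text, '\n') || strchr(text, ';'))
        return "memory operand or more than one instruction";
    size_t mlen = strcspn(text, " \t");  /* mnemonic is the first token */
    for (size_t i = 0; i < sizeof allowlist / sizeof allowlist[0]; i++) {
        const safe_insn *insn = &allowlist[i];
        if (strlen(insn->mnemonic) != mlen ||
            memcmp(text, insn->mnemonic, mlen) != 0)
            continue;
        for (size_t j = 0; insn->required[j]; j++)
            if (!has_item(constraints, insn->required[j]))
                return "constraints do not capture an effect of the instruction";
        return NULL;
    }
    return "instruction is not on the allowlist";
}

/* Self-check: the roll example is accepted, a stray syscall is not. */
static void check_validate(void)
{
    assert(validate_asm("roll $1, $0",
        "=r,{cx},0,~{cc},~{dirflag},~{fpsr},~{flags}") == NULL);
    assert(validate_asm("syscall", "") != NULL);
}
```

Rejecting everything that is not explicitly recognized is the important design choice: the validator never has to understand unsafe assembly, only to notice that it is not on the list.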

Before the advent of AI, writing a parser for x86_64 assembly would have been such an annoying task that I might have never gotten around to implementing support for memory safe inline assembly other than the trivial kind (where the assembly is blank).

But now, implementing a feature like this is as simple as writing a good prompt! The next section has my original prompt that I used to start work on this feature. I fed it to my own private agent harness (called T800) running with Kimi K2.7-code.

Initial Agent Prompt

Let's add more support to Fil-C for safe, harmless inline assembly!

Please read T800.txt, README.md, and https://fil-c.org/how to understand the context of what we're doing.

Fil-C currently rejects all inline assembly except for trivially safe stuff like:

asm volatile ("" : : : "memory")

Or even:

asm ("" : "+r"(x))

Basically, Fil-C accepts inline assembly if the assembly string is blank, and goes to great lengths to handle the case where the inline assembly snippet has a variable threaded through it. This kind of thing is very common, since it allows programmers to conceal data flow from the compiler to inhibit optimizations, which can be important for things like constant-time crypto.

Let's take this further to support cases where the assembly snippet is not empty, but is still harmless!

Here are examples that should work:

__asm__ ("sarw $15,%0" : "+r"(crypto_int16_x) : : "cc");

Or:

asm volatile("cpuid\n\t" : "+a"(a), "=b"(b), "+c"(c), "=d"(d));

Or:

asm volatile("xgetbv\n\t" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));

These examples are safe because:

sarw, cpuid, and xgetbv have no meaningful side effects other than setting registers.
The operands to those instructions have no memory effects.
- There are no explicit memory operands in the inline assembly.
- At the LLVM IR level, constraints like "+r" involve threading the crypto_int16_x variable through the assembly invocation as data flow and this will not turn into a memory access unless the variable is spilled (which is fine - spills are totally legal in Fil-C, and the spills are in a part of the stack that Fil-C cannot get a pointer to).
The registers affected by those instructions are enumerated in the asm.
- In the case of the sarw example, we are letting the compiler pick the register.
- In the other two examples, the asm modifiers correctly list clobbers for all of the registers clobbered by the instruction.
There's no control flow out of the inline assembly. For example, there are no calls. Hence, we know completely what the inline assembly does.

Note that these three examples look like this in LLVM IR. The sarw one is:

  %0 = call i32 asm "sarw $$15,$0", "=r,0,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %x) #3

The cpuid one is:

  %0 = call { i32, i32, i32, i32 } asm sideeffect "cpuid\0A\09", "={ax},={bx},={cx},={dx},0,2,~{dirflag},~{fpsr},~{flags}"(i32 undef, i32 undef) #5

The xgetbv one is:

  %0 = call { i32, i32 } asm sideeffect "xgetbv\0A\09", "={ax},={dx},{cx},~{dirflag},~{fpsr},~{flags}"(i32 0) #5

It would be great to support any inline assembly that meets these criteria. To do that, we need to integrate the following into llvm/lib/Transforms/Instrumentation/FilPizlonator.cpp's handleInlineAsm function:

an x86_64 AT&T syntax parser, augmented for the assembly syntax visible at the LLVM IR level.
- Note that this involves handling things like sarb, sarw, sarl, and sarq, which are all the same instruction but with different word sizes. And sar, where the word size has to be inferred from operands.
improvements to the assembly constraints parser.
a database of instructions that are acceptable (that have no effects beyond registers)
- and for those instructions that clobber specific registers without those registers being named explicitly, the database needs to know what those registers are. For example, it's got to know that cpuid clobbers ax/bx/cx/dx.
- this should include tracking whether instructions clobber cc and if they do, make sure that the assembly constraints also lists cc as clobbered.
comprehensive error checking that rejects:
- instructions that aren't allowlisted as safe
- assembly constraints that don't account for clobbers (for example of ={ax} wasn't part of the constraint when using cpuid)
- assembly constraints that take pointers and cause memory accesses to happen.
  - For this case, we could handle it eventually, by emitting a Fil-C check! But we should not implement this yet.

Make sure that if you reject inline assembly, then handleInlineAsm returns a nice Reason that explains why.

You should reject InlineAsm that doesn't use the AT&T dialect.

Note that your assembly parser doesn't even have to know how to parse any assembly that isn't allowlisted. I think that means that you don't even have to implement parsing of memory operand syntax or any instruction mnemonic that's not in the allowlist!

For now, add support for:

sar, shr, and, shl, xor, mov, test, cmp, bsf (of any width or if implicit)
cmov (note this has many suffixes depending on what the condition is)
cpuid
xgetbv

For examples of assembly snippets that should work, take a look at projects/openssh-10.3p1/sntrup761.c. Note that this file currently has a #undef __GNUC__ to prevent the inline assembly from being used. Note also that this file has C implementations of all of the inline assembly. So, I recommend creating a filc/tests test that has all of those inline assembly snippets and they are tested against their C equivalents for a variety of inputs.

Also be sure to create lots of tests for each allowlisted instruction that check that we reject unsafe uses of inline assembly (memory operands etc). And create tests for instructions that are either obviously unsafe or not yet supported to make sure we reject those. Note that the rejection will be runtime so the manifest for the test should say that the result is failure with the output including a filc safety error. There might be some existing tests that assert such a failure for inline asm that you will make succeed, since those tests might be using one of the instructions I'm requesting that your allowlist. In that case, just fix those tests' result expectation in their manifests.

If you're unsure about any x86 instructions, remember that there's https://www.felixcloutier.com/x86/

I recommend breaking this task up into steps handled by separate subagents:

x86_64 parser. after implementing this, clang should still compile (use ./build_clang.sh) to test that but it might not pass filc/run-tests and ./build_base.sh might fail, since we might pass through assembly permissively or unsoundly. you should try to feed it some code manually via build/bin/clang -c testfile.c to see if the parser is at least not crashing, and you can add llvm::errs() print statements to print out what the parser saw and whether it worked (but disable those print statements, or put them behind if (verbose) after you're done)
improvements to constraint parser in handleInlineAsm (for example, I don't think that parser will currently handle {ax} or ={ax}). after this, ./build_clang.sh should still build, but it might not work right (tests might not pass, build_base.sh might not work). again, you can test this with print statements and trying to run the compiler in -c mode on a standalone simple file.
instruction allowlist and database of extra requirements for those instructions. again, after this, ./build_clang.sh should still build, but it might not work right. However, I would expect that by this point, a test of the instruction sequences from sntrup761.c should pass. Note that you SHOULD NOT ./build_base.sh to run this test; just ./build_clang.sh, since build_base.sh runs the compiler on a lot of stuff that might still not work at this stage. You can use filc/run-tests -t <testname> to run the test at this stage.
fill out fully comprehensive error checking. Use should test the checking as you add it using filc/run-tests -t.
write more tests and grind on test failures
Make sure that ./build_base.sh builds. Grind on any failures you find until it builds.
Make sure that filc/run-tests passes. Grind on any failures you find until it passes.

Initial Implementation

Based on the above prompt, T800 wrote a pretty good initial implementation, including a healthy amount of tests. The C++ code that it added to FilPizlonator is all in a new function called validateSafeInlineAsm, which contains an assembly parser and assembly static analysis.

I then validated that this works by writing some tests by hand and removing the #undef __GNUC__ from sntrup761.c. I also reverted cpuid changes to zstd and simdutf, since it's now OK for them to use their original inline assembly for CPU identification.
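A hand-written test of this kind might look like the sketch below, which checks the sarw snippet from the prompt against a plain C equivalent over every 16-bit input. The function names are hypothetical and are not taken from sntrup761.c; on targets other than x86_64 the assembly path falls back to the C version so the file still builds.

```c
#include <assert.h>
#include <stdint.h>

/* Reference in plain C: -1 if x is negative, 0 otherwise. */
static int16_t negative_mask_c(int16_t x)
{
    return (int16_t)-(int)((uint16_t)x >> 15);
}

/* Same result via an arithmetic shift right by 15, which smears the
   sign bit across the whole 16-bit register. */
static int16_t negative_mask_asm(int16_t x)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("sarw $15,%0" : "+r"(x) : : "cc");
    return x;
#else
    return negative_mask_c(x);
#endif
}

/* Exhaustive comparison over all 65536 possible inputs. */
static void check_negative_mask(void)
{
    for (int32_t v = INT16_MIN; v <= INT16_MAX; v++)
        assert(negative_mask_asm((int16_t)v) == negative_mask_c((int16_t)v));
}
```

Exhaustive testing is practical here because the input space is tiny; for wider operands a test would have to sample inputs instead.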

It's worth calling out the oddest part of Fil-C inline assembly: if you get it wrong, then there is no compile-time error. Instead the inline assembly snippet turns into a Fil-C panic or an illegal instruction trap at runtime.

If FilPizlonator determined that the inline assembly is not safe, then it'll replace it with a Fil-C panic. That panic will provide diagnostics about why the assembly was rejected.
If the instruction was safe, but your CPU doesn't support it, you'll get an illegal instruction trap. This is possible because there are lots of instructions recognized by FilPizlonator that are not supported by all x86_64 CPUs. Illegal instruction traps are safe because Fil-C provides no facility for catching them. For example, a sigaction call to register a handler for SIGILL will fail with ENOSYS. Hence, this is just a panic, but with fewer diagnostics.

Using runtime panics has the nice property that inline assembly in dead code doesn't get in the way of porting software to Fil-C. Also, it's consistent with how Fil-C usually reports errors.

The Loop

Finally I built a loop to implement every safe pre-AVX512 instruction.

It's worth dwelling on what a loop is, since lots of folks talk about looping without necessarily explaining what they mean. Most agent harnesses have the ability to spawn subagents. T800 is based on this architecture, but so are many of the publicly available agents. Hence, the key is to tell the agent that you want it to keep doing something by spawning subagents until it is done, with a crystal-clear criterion for what done looks like. Each subagent does a subtask and reports back. The toplevel agent decides what to do based on its understanding of what the subagents have done so far.

To this end, I had T800 create an instructions_list.txt file that contains all of the x86_64 instructions with either no annotation (if it hadn't been considered), a REJECT annotation if we rejected it, or ACCEPT if we accepted and implemented it. Then I told T800 to write a script to find the first not-yet-considered instruction in that file. These first two steps took very little time; they were just the groundwork. Finally, I told T800 to keep spawning subagents that use that script to find an instruction and then implement it until they could not find any more instructions.
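The lookup that script performs is simple enough to sketch. The version below is written in C to match the rest of this document's examples and is not the script T800 wrote. It assumes a hypothetical file layout of one mnemonic per line, optionally followed by ACCEPT or REJECT, and it works on text already loaded into memory. The names first_unconsidered and check_first_unconsidered are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies the first non-empty line of list that has no ACCEPT or REJECT
   annotation into out. Returns 1 if one was found, 0 if every
   instruction has already been considered. */
static int first_unconsidered(const char *list, char *out, size_t out_size)
{
    while (*list) {
        size_t len = strcspn(list, "\n");  /* length of the current line */
        char line[128];
        size_t n = len < sizeof line - 1 ? len : sizeof line - 1;
        memcpy(line, list, n);
        line[n] = '\0';
        if (n > 0 && !strstr(line, "ACCEPT") && !strstr(line, "REJECT")) {
            snprintf(out, out_size, "%s", line);
            return 1;
        }
        list += len;
        if (*list == '\n')
            list++;
    }
    return 0;
}

/* Self-check on a four-line list where bswap is the first open item. */
static void check_first_unconsidered(void)
{
    char buf[32];
    assert(first_unconsidered("adc ACCEPT\ncall REJECT\nbswap\npopcnt\n",
                              buf, sizeof buf) == 1);
    assert(strcmp(buf, "bswap") == 0);
}
```

When this returns 0 there is nothing left to pick up, which gives the loop the crisp "done" criterion it needs.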

Hence, the loop here is English prose that the agent takes as instruction, and those instructions lead the agent to spawn subagents. Those subagents are prompted to perform a task by the toplevel agent, not by me directly. The objective here is to get the human (me) out of the business of repeatedly telling the agent what to do, since that's exhausting. My loop instructions did include the following: if the agent detects a file called instructions_stop, then it should stop looping and instead move to the terminate phase of T800, where it performs a review/judge loop to check its work, and then stages everything for me to commit it. I created that file maybe twice a day, so that I could sanity check what was happening and run some tests myself.

For the first half of the looping, I used Kimi K2.7-code, but then I switched to GLM 5.2. Interestingly, I found that Kimi K2.7-code is more paranoid; it interpreted my instructions as requiring more tests. GLM 5.2 was faster and more brave. That said, most of the super hard groundwork (including supporting static analysis of x87 instructions and their constraints) was done by Kimi, so maybe the greater paranoia I observed was due to the fact that Kimi did the heaviest lift.

It didn't take long for all of the safe pre-AVX512 X86_64 instructions to be implemented along with a plethora of tests to cover both the good case of those instructions and the bad case (which causes a Fil-C panic). This happened while I was away from the computer doing other things (like replaying Witcher 3 and porting Fedora patches for quantum crypto support in OpenSSH, which I did by hand).

Conclusion

As far as I know, Fil-C has the first ever implementation of memory-safe x86_64 inline assembly. It supports hundreds of instructions, including useful x87, SIMD, bitmath, flags management, and fence instructions: basically anything that is safe within the Fil-C garbage in, memory safety out model.