formal

The same program, every language

The same program, on the left in formal and on the right in each language: switch between Hello World! and the fannkuch-redux benchmark with the program tabs, and between languages with the row above. The point is underneath the source. Counting the actual machine instructions and bytes, formal compiles to far less than the others, because it ships no runtime, no libc, and no garbage collector beneath it; the versions opposite carry a runtime, and that runtime still ships. fannkuch goes one further: its result is proven correct at compile time (the verifier checks the max flip count and checksum against the known answer), so the program need only compute and exit. Each column also reports what the program does when it runs: the exact number of guest instructions it executes, its peak memory working set, and its execution time. Every program is built for RISC-V and run under QEMU with a small instrumentation plugin, so the figures compare like for like (details below the panels).

formal

print("Hello World!\n")
exit(0)

formal new hello_world && cd hello_world
cargo run
grep -cE '^    [a-z]' build/main.s

46 instructions
1,680 bytes

measured at runtime (under QEMU) 108 instructions executed
8 KiB peak memory
16 ms execution time

Rust

fn main() {
    println!("Hello World!");
}

cargo new hw && cd hw
echo 'fn main() { println!("Hello World!"); }' > src/main.rs
CARGO_PROFILE_RELEASE_OPT_LEVEL=z CARGO_PROFILE_RELEASE_LTO=true CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 CARGO_PROFILE_RELEASE_PANIC=abort CARGO_PROFILE_RELEASE_STRIP=true RUSTFLAGS='-Zlocation-detail=none -Zfmt-debug=none -Zunstable-options -Cpanic=immediate-abort' cargo +nightly build --release --target x86_64-unknown-linux-musl -Zbuild-std=std,panic_abort -Zbuild-std-features=
objdump -d target/x86_64-unknown-linux-musl/release/hw | grep -cE '^[[:space:]]+[0-9a-f]+:'

4,562 instructions
26,720 bytes

measured at runtime (under QEMU) 1,365 instructions executed
24 KiB peak memory
20 ms execution time

How it's measured. Instructions and bytes are the static binary: machine instructions in the compiled program (formal counts its emitted assembly; the others objdump their stripped static binary) and the binary's size. The runtime figures are measured by running each program under qemu for RISC-V with a small TCG plugin that reports the exact guest instructions executed and the peak memory working set (distinct 4 KiB pages touched: instruction fetches and data). Every program is built for RISC-V Linux (musl, static) and run under the same user-mode qemu-riscv64 (formal as it compiles, the others from the source shown), so the numbers are directly comparable. Instructions and memory come from instrumented (plugin) runs; the heavy fannkuch is timed plugin-free, because the plugin's per-access cost would otherwise dwarf the compute (Hello World is tiny, so its plugin run's time stands in). The plugin records the working set over time; the harness reports p50/p90/p99 percentiles and the panel shows the peak. fannkuch-redux runs n = 12 everywhere (the value formal proves and computes); Hello World is the hosted tests/linux_hello and each language's equivalent. Ada is shown static-only (no RISC-V Ada toolchain was available to run it here). Note formal's fannkuch (tests/fannkuch_v2) runs its two harts as real OS threads in parallel (spawned with clone), while the language versions are single-threaded.

Why

Proves your assertions, doesn't trust you. A fail marker is a theorem: the compiler enumerates every hart interleaving and rejects the program unless the marker is unreachable on all of them. Provably-correct racy code, which Rust's borrow checker forbids and SPARK's tasking model excludes, verifies here.
Infallible type inference. Leave a type as _ and the compiler searches type and locality assignments, accepting one iff it makes the whole program verify. If any typing works, it is found: inference and verification are the same search, so there is no "couldn't infer, add an annotation" failure mode. (Scalar types are searched; list/union types are infinite, so they are written explicitly.)
The memory layout is an output, not an input. No .data/.bss in the source: the verifier discovers each variable's type and locality and emits the sections for you.
Every memory access must be provably safe: inside its variable's inferred type, or inside a raw region with the right permissions: one the target system describes (e.g. MMIO, like the UART below) or one the program declares with section. An allocator declares each allocation as it hands it out; an access nothing describes makes the program invalid.
It verifies the machine code, not your source: there is no compiler-trust gap between what was proven and what runs.
The proof shrinks the binary. Dead-data elimination: bytes the proof shows no runtime path can touch are deleted from the output, with the program's address arithmetic rewritten in lockstep, so information consumed only at compile time simply doesn't exist at runtime.

Supports RISC-V ✓ · Doesn't support aarch64 ? · Won't support x86-64 ✗

End to end: source to booted RISC-V

This is the project's real end-to-end test (tests/uart_hello/). The source races two harts on an inferred global, has hart 0 check the message's runtime type descriptor (the program inspects its own inferred types), then prints over the UART. Note what the source does not contain: an entry point (execution starts at the first line, like Python), a data section, or most of the types.

You write input.hl

# Do racy arithmetic with an undefined variable.
# Use global locality so the load/stores are racy.
value: global _
t0 = &value

# Set to 0
t1 = 0
t0[0:4] = t1

# Do non-atomic addition
t1 = t0[0:4]
t1 = t1 + 1
t0[0:4] = t1

# The proof obligation: value < 4 on every interleaving (`require` lowers to
# a branch over `#!`, the `fail` marker).
t1 = t0[0:4]
t2 = 4
require t1 < t2

# Use hart 0 to output a message; every other hart skips straight to the
# `wfi` at the end.
t0 = csr(mhartid)
if t0 == 0:
    welcome: _ [u8]*13
    # H  e  l  l  o     W  o  r  l  d  !  0
    # Type exploration doesn't explore list types since they are infinite, so
    # to define a list a user must define it manually.
    # In this case we leave the locality as unspecified which will default to `thread`.

    # Declare string
    # Get address of type structure
    t0 = type(welcome)

    # Check variable is list
    t2 = 8  # Load list type number
    t1 = t0[0:8]  # Load type type number
    require t1 == t2

    # Check list length
    t0 = t0 + 16  # Increment address to point at length
    t1 = t0[0:8]  # Load length
    t2 = 13  # Load desired length
    require t1 == t2

    # Check all values in list are u8
    t0 = t0 - 8  # Decrement address to point at list address
    t0 = t0[0:8]  # Load list address

    t5 = 0  # Use t5 as the counter
    while t5 != t2:
        t3 = t0[0:8]  # Load type of list item
        t4 = 0  # Load byte type number
        require t3 == t4
        t0 = t0 + 25  # Increment list item address (8+8+8+1)
        t5 = t5 + 1  # Increment the count

    # Set string
    t0 = &welcome
    t1 = 72  # H
    t0[0:1] = t1
    t1 = 101  # e
    t0[1:2] = t1
    t1 = 108  # l
    t0[2:3] = t1
    t1 = 108  # l
    t0[3:4] = t1
    t1 = 111  # o
    t0[4:5] = t1
    t1 = 32  # space
    t0[5:6] = t1
    t1 = 87  # W
    t0[6:7] = t1
    t1 = 111  # o
    t0[7:8] = t1
    t1 = 114  # r
    t0[8:9] = t1
    t1 = 108  # l
    t0[9:10] = t1
    t1 = 100  # d
    t0[10:11] = t1
    t1 = 33  # !
    t0[11:12] = t1
    t1 = 0  # NUL terminator
    t0[12:13] = t1

    # Output message
    a0 = &welcome  # Load message address
    t1 = 0x10000000  # Load character device address
    t2 = a0[0:1]  # Load the first character
    while t2 != 0:
        t1[0:1] = t2  # Store character into character device
        a0 = a0 + 1  # Add 1 message address
        t2 = a0[0:1]  # Load next character

asm:
    wfi
unreachable

It proves, optimizes & emits

.global _start
_start:
    #$ value global _
    la t0, value
    li t1, 0
    sw t1, 0(t0)
    lw t1, 0(t0)
    addi t1, t1, 1
    sw t1, 0(t0)
    lw t1, 0(t0)
    li t2, 4
    blt t1, t2, _l0
_l0:
    csrr t0, mhartid
    bnez t0, _l1
    #$ welcome _ [u8 u8 u8 u8 u8 u8 u8 u8 u8 u8 u8 u8 u8]
    la t0, __welcome_type  # #& t0, welcome
    li t2, 8
    ld t1, 0(t0)
    beq t1, t2, _l2
_l2:
    addi t0, t0, 16
    ld t1, 0(t0)
    li t2, 13
    beq t1, t2, _l3
_l3:
    addi t0, t0, -8
    ld t0, 0(t0)
    li t5, 0
_l4:
    beq t5, t2, _l5
    ld t3, 0(t0)
    li t4, 0
    beq t3, t4, _l6
_l6:
    addi t0, t0, 8
    addi t5, t5, 1
    j _l4
_l5:
    la t0, welcome
    li t1, 72
    sb t1, 0(t0)
    li t1, 101
    sb t1, 1(t0)
    li t1, 108
    sb t1, 2(t0)
    li t1, 108
    sb t1, 3(t0)
    li t1, 111
    sb t1, 4(t0)
    li t1, 32
    sb t1, 5(t0)
    li t1, 87
    sb t1, 6(t0)
    li t1, 111
    sb t1, 7(t0)
    li t1, 114
    sb t1, 8(t0)
    li t1, 108
    sb t1, 9(t0)
    li t1, 100
    sb t1, 10(t0)
    li t1, 33
    sb t1, 11(t0)
    li t1, 0
    sb t1, 12(t0)
    la a0, welcome
    li t1, 0x10000000
    lb t2, 0(a0)
_l7:
    beqz t2, _l8
    sb t2, 0(t1)
    addi a0, a0, 1
    lb t2, 0(a0)
    j _l7
_l8:
_l1:
    wfi
    j __halt  # unreachable (program end)
__halt:
    wfi
    j __halt

.section .data
__welcome_type:
    .dword 8                # List
    .dword __welcome_subtypes      # subtypes
    .dword 13                # length
__welcome_subtypes:
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0
    .dword 0

.section .bss
    .balign 8
value:
    .zero 4
    .balign 8
welcome:
    .zero 13

translate (one statement, one instruction) → parse → verify (2,026,160 machine states across 1- and 2-hart systems) → infer value: u32, welcome: thread [u8 × 13] → remove dead code & never-taken branches → compact the layout → emit

The emitted program assembles with the RISC-V GNU toolchain, boots under qemu-system-riscv64 -machine virt, and prints Hello World!. Two things to notice in the output:

The descriptor records shrank. In memory the verifier models each type record as 25 bytes ([u64 type, u64 subtypes-ptr, u64 length, u8 locality]), and the source walks them with t0 = t0 + 25. The program only ever reads each record's type number, so the emitted __welcome_subtypes is 13 .dwords (104 bytes, not 325), and the stride was rewritten to addi t0, t0, 8 to match. The locality byte, read only by the verifier, exists nowhere in the binary.
The race was proven, not forbidden. Both harts run the non-atomic increment of value; the proof covers every interleaving and shows the value < 4 assertion holds in all of them.

$ cargo nt uart_hello
PASS [25.001s] formal::uart_hello uart_hello

The language

Two layers, kept deliberately close. You write a Python-like surface language in which every simple statement lowers to exactly one RISC-V instruction or directive, and if/while/require each lower to a small fixed pattern of branches plus generated labels (no register allocation, no hidden control flow, the C-like cost model where the price of a line is visible at the call site), with inline assembly always one asm: block away:

value: global _              ->  #$ value global _        define (locality/type; _ = infer)
welcome: _ [u8]*13           ->  #$ welcome _ [u8 ... u8]  list types: [u8, u8] or [u8]*n
t0 = &value                  ->  la t0, value              address of a variable
t0 = type(welcome)           ->  #& t0, welcome            address of the runtime type descriptor
t0 = csr(mhartid)            ->  csrr t0, mhartid          which hart am I?
t1 = 0x10000000              ->  li t1, 0x10000000         load immediate
t1 = t1 + 1                  ->  addi t1, t1, 1            register +/- immediate
t0[0:4] = t1                 ->  sw t1, 0(t0)              store a byte slice (1 = sb, 4 = sw)
t1 = t0[8:16]                ->  ld t1, 8(t0)              load a byte slice (1 = lb, 4 = lw, 8 = ld)
section 0x100 0x200 rw       ->  #@ 0x100 0x200 rw         declare an accessible region
fail                         ->  #!                        must be proven unreachable
unreachable                  ->  #?                        end of execution
asm:                         ->  (each indented line       inline assembly
    wfi                          emitted verbatim)

require t1 < t2              # fail unless t1 < t2  (one-line `if not cond: fail`)
if t0 == 0:                 # run the block when the condition holds
    t1 = t0[0:8]
while t5 != t2:             # top-tested loop
    t5 = t5 + 1

Underneath sits the verification target: RISC-V assembly plus five directives; every directive starts with #, the RISC-V comment character, so any formal program is also a plain assembly file:

Directive	Surface form	Meaning
#!	fail	An assertion the compiler must prove unreachable on every interleaving.
#?	unreachable	End of execution: the hart that reaches this point halts.
#$ x global u32	x: global u32	Declare a variable's locality and/or type; _ means "infer it for me".
#& t0, x	t0 = type(x)	Load the address of a variable's runtime type descriptor. Programs can inspect their own inferred types.
#@ 100 200 rw	section 100 200 rw	Declare a memory region the program may access (bounds may be registers: an allocator declares each allocation as it makes it).

How it compares

The long version, with the hard cases argued honestly, is in comparison.md. The short version:

vs Zig's comptime: concrete execution is the special case of symbolic execution where every value is exactly known. comptime evaluates the known at compile time; formal evaluates the unknown: ranges, all interleavings, all type assignments.
vs Rust's borrow checker: the borrow checker is a fast, conservative approximation of the race-freedom question; formal answers the question itself, exactly. And you can still opt into borrow-checker economics by building a verified Rc whose invariant is a fail the proof discharges.
vs Lean: Lean proves anything, manually, about a model; formal proves a fixed property class, automatically, about the actual machine code.
vs Ada/SPARK (the closest peer): SPARK is deductive, modular, contract-driven and industrial; formal is operational, whole-program, contract-free and exhaustive: it verifies the racy code SPARK's model rejects, at exponential cost.

That cost is explicit: O(n · h^r · 2^b · 8^v) (instructions × racy-interleavings × indeterminate branches × unspecified variables). The design lever is reducing the exponents (keep shared mutation rare, annotate what you know); planned compiler modes trade them away explicitly.

Where this is going

A bare-metal OS running DOOM. A bare-metal Tor node. A bare-metal web server. A bare-metal hypervisor under a serverless platform. All proven, all the way down: because the verifier works on the machine code, there is nothing underneath it to trust. Nearer term: growing the Python-like surface (it doesn't have a name yet) while keeping its near-one-to-one translation, so the cost model never goes opaque.