Edd Barrett, Sun 14 September 2025.
tags: codegen jit profiling perf
Introduction
If you naively try to profile a program that generates code at runtime with
Linux Perf, the entries in the profile for
the JITted code probably won't look as you'd expect.
The reason for this is that Perf expects to be able to map virtual code
addresses to symbols from ahead-of-time-compiled on-disk executables. For
normal programs, that's not a tall order, since all of the executable code
originates from either the "main binary executable file" or from shared object
dependencies. For JITs that simply "magic up" executable code at runtime
however, Perf can't (by default) do a good job for those code regions.
In this post I'll demonstrate how to instrument a toy JIT so that you can
profile JITted code effectively using Perf.
A toy JIT
Let's demonstrate by profiling a program that generates and executes code at
runtime. I'll do this in Rust because it's what I know best (imports elided for
brevity):
fn jit_loop(reps: i32) -> ExecutableBuffer {
let mut dasm = Assembler::new().unwrap();
// Jitted function that counts from 0 up to `reps`, before returning the counter.
dynasm!(dasm
; mov rax, 0
; ->loop_start:
; cmp rax, reps
; je ->done
; add rax, 1
; jmp ->loop_start
; ->done:
; ret
);
dasm.finalize().unwrap()
}
fn main() {
if env::args().len() != 3 {
eprintln!("args: <iters1> <iters2>");
process::exit(1);
}
let mut args = env::args().skip(1);
let bound1 = args.next().unwrap().parse::<i32>().unwrap();
let bound2 = args.next().unwrap().parse::<i32>().unwrap();
let jitted_code1 = jit_loop(bound1);
let jitted_func1: extern "C" fn() -> i32 =
unsafe { mem::transmute(jitted_code1.ptr(AssemblyOffset(0))) };
let jitted_code2 = jit_loop(bound2);
let jitted_func2: extern "C" fn() -> i32 =
unsafe { mem::transmute(jitted_code2.ptr(AssemblyOffset(0))) };
eprintln!("Running jitted_func1():");
println!("returned {}", jitted_func1());
eprintln!("Running jitted_func2():");
println!("returned {}", jitted_func2());
}
I'm using the dynasm crate to generate some simple x86_64 code at runtime. Each call to jit_loop() generates a function containing a loop that counts from zero up to a dynamic upper bound read from the command line, before returning the final value of the counter. We generate two such functions, jitted_func1 and jitted_func2, before executing each of them and printing their results.
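To make the generated code's semantics concrete, here's a plain-Rust equivalent of the loop that jit_loop() assembles (loop_reference() is my own illustrative helper, not part of the JIT):

```rust
// A plain-Rust version of the JITted loop: count from 0 up to `reps`,
// then return the counter. The comments map each step to the assembly
// emitted by jit_loop().
fn loop_reference(reps: i32) -> i32 {
    let mut counter = 0; // mov rax, 0
    while counter != reps { // cmp rax, reps / je ->done
        counter += 1; // add rax, 1 / jmp ->loop_start
    }
    counter // ret
}

fn main() {
    println!("returned {}", loop_reference(1000)); // returned 1000
}
```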
If we build this code and profile it using Perf, i.e.:
$ cargo build
...
$ perf record -g --call-graph=dwarf ./target/debug/jit 1000000000 2000000000
Running jitted_func1():
returned 1000000000
Running jitted_func2():
returned 2000000000
[ perf record: Woken up 198 times to write data ]
[ perf record: Captured and wrote 49.473 MB perf.data (6143 samples) ]
$ perf report -g
Then this is what the TUI will look like:
Samples: 6K of event 'cycles:P', Event count (approx.): 3026589925
Children Self Command Shared Object Symbol
+ 66.57% 66.57% jit [JIT] tid 781072 [.] 0x00007fa2c3f9b017
+ 33.30% 33.30% jit [JIT] tid 781072 [.] 0x00007fa2c3f9c017
0.07% 0.07% jit [JIT] tid 781072 [.] 0x00007fa2c3f9c007
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start_final (inlined)
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_sysdep_start
0.02% 0.00% jit ld-linux-x86-64.so.2 [.] dl_main
...
The first two entries are our JITted code, and as we would expect from the loop bounds I provided (1 billion and 2 billion), Perf reports roughly twice as many samples in the JITted function that runs for twice as long.
But it's not all plain sailing. Notice that the JITted code entries don't have
symbol names, just virtual addresses? This can make understanding the profile
difficult, especially for more complicated profiles with lots of JITted code
entries. Further, if you hit enter on a JITted code entry, and then choose
"Annotate", then you will be greeted with an error, e.g.:
Couldn't annotate 00007fa2c3f9b017
Internal error: Invalid -1 error code
I'm also not sure what the third JITted code entry with 0.07% overhead is all about, but we can fix all of these problems by adding Perf support to the JIT.
Adding Perf support to our JIT
There are currently two ways to add Perf support to a JIT: with "JIT map" files, or with "JIT dump" files. The latter is the more modern and recommended way, so that's what we'll use.
To that end, our JIT runtime has to:
- Create a JIT dump file called jit-<pid>.dump.
- mmap(2) the dump file into the address space with PROT_EXEC to "mark" the JIT dump file.
- Write CODE_LOAD records into the JIT dump when JITted code is born.
Then once our JITted process has finished, there's an offline perf inject step required to get the JITted code into the profile.
The mmap(2) marker is the way that perf inject finds which dump files to look at. The profile is scanned for mmap(2) events that map files named like jit-<pid>.dump into the address space. When it sees one that matches the PID of the process being profiled, it writes the JITted code described within into on-disk shared objects and ties the Perf profile to them.
Implementation
With the high-level summary out of the way, let's implement this for our toy
JIT. I'm going to elide liberally here, as quite a lot of the code isn't very
interesting.
First, let's add a data structure that represents our JIT dump file. This will be responsible for creating a file in the format detailed in this schema:
struct JitDump {
/// The jitdump file we will write records to.
jitdump: std::fs::File,
/// The current process's PID.
pid: u32,
}
The new() method will create the file and write a JIT dump header:
impl JitDump {
/// Create a jitdump file for the current PID.
fn new() -> Result<Self, Box<dyn Error>> {
let pid = unsafe { getpid() } as u32;
let filename = format!("jit-{}.dump", pid);
let mut f = OpenOptions::new()
.create_new(true)
.read(true)
.write(true)
.open(&filename)?;
// Write the file header.
//
// magic
f.write_u32::<NativeEndian>(0x4A695444)?;
// version
f.write_u32::<NativeEndian>(1)?;
// total size of file header
let fh_size_pos = f.stream_position()?;
f.write_u32::<NativeEndian>(u32::MAX)?; // placeholder.
// ...elided...
// timestamp
f.write_u64::<NativeEndian>(now_timestamp())?;
// ...elided...
// Patch the file header size field now we know how big it is.
let end_fh_pos = f.stream_position()?;
f.seek(SeekFrom::Start(fh_size_pos))?;
f.write_u32::<NativeEndian>(u32::try_from(end_fh_pos).unwrap())?;
f.seek(SeekFrom::Start(end_fh_pos))?;
// For `perf inject` to work, we have to map the jitdump file into memory PROT_EXEC.
let opts = MmapOptions::new();
let _ = unsafe { opts.map_exec(&f).unwrap() };
Ok(Self { jitdump: f, pid })
}
}
A few notes of interest here:
- One of the fields is the total size of the file header, which we don't actually know until later, so we have to write a placeholder value and then patch it afterwards.
- One of the fields is a timestamp (in nanoseconds). For this you can choose whatever clock source you like, as long as you select the same clock in your perf record invocation. The docs recommend using CLOCK_MONOTONIC, so that's what I've used. Here's how I obtain a timestamp:
/// Create a timestamp of "now" in nanoseconds.
fn now_timestamp() -> u64 {
let mut ts = std::mem::MaybeUninit::<timespec>::uninit();
if unsafe { clock_gettime(CLOCK_MONOTONIC, ts.as_mut_ptr()) } != 0 {
panic!("failed to read clock");
}
let ts = unsafe { ts.assume_init() };
(ts.tv_sec as u64) * 1_000_000_000 + (ts.tv_nsec as u64)
}
Then the code to write the CODE_LOAD record for newly JITted code looks like this:
const JIT_CODE_LOAD: u32 = 0;
impl JitDump {
/// Add a CODE_LOAD record into the jitdump file.
fn emit_code_load_record(
&mut self,
code: &ExecutableBuffer,
code_idx: u64,
) -> Result<(), Box<dyn Error>> {
let pos_before = self.jitdump.stream_position()?;
// Write record header
//
// id (the kind of JIT event)
self.jitdump.write_u32::<NativeEndian>(JIT_CODE_LOAD)?;
// ...elided...
// vma (virtual address of jitted code start)
#[cfg(target_pointer_width = "64")]
let code_start = code.ptr(AssemblyOffset(0)) as u64;
self.jitdump.write_u64::<NativeEndian>(code_start)?;
// code_addr (same as vma for us)
self.jitdump.write_u64::<NativeEndian>(code_start)?;
// code_size
self.jitdump
.write_u64::<NativeEndian>(u64::try_from(code.len()).unwrap())?;
// code_index
self.jitdump.write_u64::<NativeEndian>(code_idx)?;
// char[n]: function name (including the null terminator)
self.jitdump
.write_all(format!("jitted_func{}", code_idx).as_bytes())?;
self.jitdump.write_u8(0)?;
// native code
self.jitdump.write_all(code)?;
// ...elided...
Ok(())
}
}
Essentially here we give each chunk of JITted code a "code index" to uniquely
identify it, a symbol name that we'd like to see in the Perf profile, the
code's start address and length, and then the JITted executable code itself.
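For context, here is my reading of the full CODE_LOAD record layout from the jitdump specification that ships with Linux perf (tools/perf/Documentation/jitdump-specification.txt); the fields elided from the code above come from this set. Treat it as a sketch of the spec, not a substitute for reading it:

```rust
// CODE_LOAD record layout per the jitdump specification. All integers are
// written in the host's native endianness.
//
// Record header:
//   id          u32   record type (0 == JIT_CODE_LOAD)
//   total_size  u32   size of the whole record, header included
//   timestamp   u64   creation time, using the same clock as the file header
//
// CODE_LOAD payload:
//   pid         u32   process id of the JIT
//   tid         u32   id of the thread that generated the code
//   vma         u64   virtual address of the code start
//   code_addr   u64   address where the code was generated (== vma here)
//   code_size   u64   size of the native code in bytes
//   code_index  u64   unique per-process index for this chunk of code
//   name        cstr  null-terminated function name
//   code        [u8]  the native code itself

// The fixed-size portion (everything before the variable-length name)
// adds up to 56 bytes.
const CODE_LOAD_FIXED_SIZE: usize = 4 + 4 + 8 + 4 + 4 + 8 + 8 + 8 + 8;

fn main() {
    println!("fixed CODE_LOAD prefix: {} bytes", CODE_LOAD_FIXED_SIZE);
}
```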
Next we have to think about instantiating our JitDump data structure somewhere. For better or worse, Perf requires the JITted code for the entire process (i.e. PID) to be written to the same JIT dump file. Were our JIT multi-threaded, access to this JitDump instance would therefore need to be shared and synchronised so that CODE_LOAD records wouldn't get interleaved when two threads emit a CODE_LOAD record simultaneously.
Our toy JIT isn't multi-threaded, but for demonstration purposes we'll use a thread-safe implementation anyway:
/// jit dump file for the current process.
static JIT_DUMP: LazyLock<Mutex<JitDump>> = LazyLock::new(|| Mutex::new(JitDump::new().unwrap()));
Then we can make our life easier by writing a little helper function to write freshly JITted code into the JIT dump:
/// Register JITted code with perf.
fn register_jitted_code(code: &ExecutableBuffer, code_idx: u64) {
JIT_DUMP
.lock()
.unwrap()
.emit_code_load_record(&code, code_idx)
.unwrap();
}
Then all that remains is to call register_jitted_code() in the appropriate places (in our case, in main()). You want to make sure you do this before the JITted code is executed.
...
let jitted_code1 = jit_loop(bound1);
register_jitted_code(&jitted_code1, 1); // <---
let jitted_func1: extern "C" fn() -> i32 =
unsafe { mem::transmute(jitted_code1.ptr(AssemblyOffset(0))) };
let jitted_code2 = jit_loop(bound2);
register_jitted_code(&jitted_code2, 2); // <---
let jitted_func2: extern "C" fn() -> i32 =
unsafe { mem::transmute(jitted_code2.ptr(AssemblyOffset(0))) };
...
That's it. Then we can build and profile our JITted code like so:
$ perf record -k CLOCK_MONOTONIC -g --call-graph=dwarf \
./target/debug/jit 1000000000 2000000000
...
Note that -k selects the clock source to use (in this case CLOCK_MONOTONIC). This must match the clock source used for the timestamps in your JIT dump files.
After this, we should see our jit-<pid>.dump file on the filesystem:
$ ls *.dump
jit-851825.dump
Next we do the perf inject step:
$ perf inject --jit --input perf.data --output perf.jit.data
Now we should have one on-disk shared object per chunk of JITted code:
$ ls *.so
jitted-851825-1.so jitted-851825-2.so
And finally, we can look at the profile report:
$ perf report -g -i perf.jit.data
Samples: 6K of event 'cycles:P', Event count (approx.): 3026404125
Children Self Command Shared Object Symbol
+ 66.61% 66.61% jit jitted-851825-2.so [.] jitted_func2
+ 33.34% 33.34% jit jitted-851825-1.so [.] jitted_func1
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start_final (inlined)
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_sysdep_start
0.02% 0.00% jit [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
0.02% 0.00% jit [kernel.kallsyms] [k] do_syscall_64
0.02% 0.00% jit ld-linux-x86-64.so.2 [.] dl_main
...
The JITted code entries now have symbol names and that mysterious third JITted
code entry is gone!
Further, the annotate functionality now works too:
Samples: 6K of event 'cycles:P', 4000 Hz, Event count (approx.): 3026404125
jitted_func2 /home/vext01/tmp/tmp/jit2/jitted-851825-2.so [Percent: local period]
Percent │
│
│
│ Disassembly of section .text:
│
│ 0000000000000080 <jitted_func2>:
│ mov $0x0,%rax
0.02 │ 7:┌─→cmp $0x77359400,%rax
│ │↓ je 1c
│ │ add $0x1,%rax
99.98 │ └──jmp 7
│1c: ← ret
Summary
In this post I've shown that, with a fairly small amount of instrumentation, we
can make JITted code work properly with Linux Perf.
I've pushed the full example code to GitHub here. I've also included a justfile that will build, record, inject and generate the report. Just run just in the top-level directory.
Further reading
(All Git links are permalinks to the newest commit at the time of writing.)