Edd Barrett, Sun 14 September 2025.
tags: codegen jit profiling perf
Introduction
If you naively try to profile a program that generates code at runtime with
Linux Perf, the entries in the profile for
the JITted code probably won't look as you'd expect.
The reason for this is that Perf expects to be able to map virtual code
addresses to symbols from ahead-of-time-compiled on-disk executables. For
normal programs, that's not a tall order, since all of the executable code
originates from either the "main binary executable file" or from shared object
dependencies. For JITs that simply "magic up" executable code at runtime
however, Perf can't (by default) do a good job for those code regions.
In this post I'll demonstrate how to instrument a toy JIT so that you can
profile JITted code effectively using Perf.
A toy JIT
Let's demonstrate by profiling a program that generates and executes code at
runtime. I'll do this in Rust because it's what I know best (imports elided for
brevity):
fn jit_loop(reps: i32) -> ExecutableBuffer {
let mut dasm = Assembler::new().unwrap();
// Jitted function that counts from 0 up to `reps`, before returning the counter.
dynasm!(dasm
; mov rax, 0
; ->loop_start:
; cmp rax, reps
; je ->done
; add rax, 1
; jmp ->loop_start
; ->done:
; ret
);
dasm.finalize().unwrap()
}
fn main() {
if env::args().len() != 3 {
eprintln!("args: <iters1> <iters2>");
process::exit(1);
}
let mut args = env::args().skip(1);
let bound1 = args.next().unwrap().parse::<i32>().unwrap();
let bound2 = args.next().unwrap().parse::<i32>().unwrap();
let jitted_code1 = jit_loop(bound1);
let jitted_func1: extern "C" fn() -> i32 =
unsafe { mem::transmute(jitted_code1.ptr(AssemblyOffset(0))) };
let jitted_code2 = jit_loop(bound2);
let jitted_func2: extern "C" fn() -> i32 =
unsafe { mem::transmute(jitted_code2.ptr(AssemblyOffset(0))) };
eprintln!("Running jitted_func1():");
println!("returned {}", jitted_func1());
eprintln!("Running jitted_func2():");
println!("returned {}", jitted_func2());
}
I'm using the dynasm crate to generate some simple x86_64 code at runtime. Each call to jit_loop() generates a function containing a loop that counts from zero up to a dynamic upper bound read from the command line, before returning the final value of the counter. We generate two such functions, jitted_func1 and jitted_func2, before executing each of them and printing their results.
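To make the generated code's semantics concrete, here's a plain-Rust equivalent of the loop that jit_loop() assembles (loop_reference() is my own illustrative helper, not part of the JIT):

```rust
// A plain-Rust version of the JITted loop: count from 0 up to `reps`,
// then return the counter. The comments map each step to the assembly
// emitted by jit_loop().
fn loop_reference(reps: i32) -> i32 {
    let mut counter = 0; // mov rax, 0
    while counter != reps { // cmp rax, reps / je ->done
        counter += 1; // add rax, 1 / jmp ->loop_start
    }
    counter // ret
}

fn main() {
    println!("returned {}", loop_reference(1000)); // returned 1000
}
```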
If we build this code and profile it using Perf, i.e.:
$ cargo build
...
$ perf record -g --call-graph=dwarf ./target/debug/jit 1000000000 2000000000
Running jitted_func1():
returned 1000000000
Running jitted_func2():
returned 2000000000
[ perf record: Woken up 198 times to write data ]
[ perf record: Captured and wrote 49.473 MB perf.data (6143 samples) ]
$ perf report -g
Then this is what the TUI will look like:
Samples: 6K of event 'cycles:P', Event count (approx.): 3026589925
Children Self Command Shared Object Symbol
+ 66.57% 66.57% jit [JIT] tid 781072 [.] 0x00007fa2c3f9b017
+ 33.30% 33.30% jit [JIT] tid 781072 [.] 0x00007fa2c3f9c017
0.07% 0.07% jit [JIT] tid 781072 [.] 0x00007fa2c3f9c007
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start_final (inlined)
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_sysdep_start
0.02% 0.00% jit ld-linux-x86-64.so.2 [.] dl_main
...
The first two entries are our JITted code, and as we would expect from the loop bounds I provided (1 billion and 2 billion), Perf reports roughly twice as many samples in the JITted function that runs for twice as long.
But it's not all plain sailing. Notice that the JITted code entries don't have
symbol names, just virtual addresses? This can make understanding the profile
difficult, especially for more complicated profiles with lots of JITted code
entries. Further, if you hit enter on a JITted code entry, and then choose
"Annotate", then you will be greeted with an error, e.g.:
Couldn't annotate 00007fa2c3f9b017
Internal error: Invalid -1 error code
I'm also not sure what the third JITted code entry with 0.07% overhead is all about, but we can fix all of these problems by adding Perf support to the JIT.
Adding Perf support to our JIT
There are currently two ways to add Perf support to a JIT: with "JIT map" files, or with "JIT dump" files. The latter is the more modern and recommended way, so that's what we'll use.
To that end, our JIT runtime has to:
- Create a JIT dump file called jit-<pid>.dump.
- mmap(2) the dump file into the address space with PROT_EXEC to "mark" the JIT dump file.
- Write CODE_LOAD records into the JIT dump when JITted code is born.
Then once our JITted process has finished, there's an offline perf inject step required to get the JITted code into the profile.
The mmap(2) marker is the way that perf inject finds which dump files to look at. The profile is scanned for mmap(2) events that map files named like jit-<pid>.dump into the address space. When it sees one that matches the PID of the process being profiled, it writes the JITted code described within into on-disk shared objects and ties the Perf profile to them.
Implementation
With the high-level summary out of the way, let's implement this for our toy
JIT. I'm going to elide liberally here, as quite a lot of the code isn't very
interesting.
First, let's add a data structure that represents our JIT dump file. This will be responsible for creating a file in the format detailed in this schema:
struct JitDump {
/// The jitdump file we will write records to.
jitdump: std::fs::File,
/// The current process's PID.
pid: u32,
}
The new() method will create the file and write a JIT dump header:
impl JitDump {
/// Create a jitdump file for the current PID.
fn new() -> Result<Self, Box<dyn Error>> {
let pid = unsafe { getpid() } as u32;
let filename = format!("jit-{}.dump", pid);
let mut f = OpenOptions::new()
.create_new(true)
.read(true)
.write(true)
.open(&filename)?;
// Write the file header.
//
// magic
f.write_u32::<NativeEndian>(0x4A695444)?;
// version
f.write_u32::<NativeEndian>(1)?;
// total size of file header
let fh_size_pos = f.stream_position()?;
f.write_u32::<NativeEndian>(u32::MAX)?; // placeholder.
// ...elided...
// timestamp
f.write_u64::<NativeEndian>(now_timestamp())?;
// ...elided...
// Patch the file header size field now we know how big it is.
let end_fh_pos = f.stream_position()?;
f.seek(SeekFrom::Start(fh_size_pos))?;
f.write_u32::<NativeEndian>(u32::try_from(end_fh_pos).unwrap())?;
f.seek(SeekFrom::Start(end_fh_pos))?;
// For `perf inject` to work, we have to map the jitdump file into memory PROT_EXEC.
let opts = MmapOptions::new();
let _ = unsafe { opts.map_exec(&f).unwrap() };
Ok(Self { jitdump: f, pid })
}
}
A few notes of interest here:
- One of the fields is the total size of the file header, which we don't actually know until later, so we have to write a placeholder value and then patch it afterwards.
- One of the fields is a timestamp (in nanoseconds). For this you can choose whatever clock source you like, as long as you select the same clock in your perf record invocation. The docs recommend using CLOCK_MONOTONIC, so that's what I've used. Here's how I obtain a timestamp:
/// Create a timestamp of "now" in nanoseconds.
fn now_timestamp() -> u64 {
let mut ts = std::mem::MaybeUninit::<timespec>::uninit();
if unsafe { clock_gettime(CLOCK_MONOTONIC, ts.as_mut_ptr()) } != 0 {
panic!("failed to read clock");
}
let ts = unsafe { ts.assume_init() };
(ts.tv_sec as u64) * 1_000_000_000 + (ts.tv_nsec as u64)
}
Then the code to write the CODE_LOAD record for newly JITted code looks like this:
const JIT_CODE_LOAD: u32 = 0;
impl JitDump {
/// Add a CODE_LOAD record into the jitdump file.
fn emit_code_load_record(
&mut self,
code: &ExecutableBuffer,
code_idx: u64,
) -> Result<(), Box<dyn Error>> {
let pos_before = self.jitdump.stream_position()?;
// Write record header
//
// id (the kind of JIT event)
self.jitdump.write_u32::<NativeEndian>(JIT_CODE_LOAD)?;
// ...elided...
// vma (virtual address of jitted code start)
#[cfg(target_pointer_width = "64")]
let code_start = code.ptr(AssemblyOffset(0)) as u64;
self.jitdump.write_u64::<NativeEndian>(code_start)?;
// code_addr (same as vma for us)
self.jitdump.write_u64::<NativeEndian>(code_start)?;
// code_size
self.jitdump
.write_u64::<NativeEndian>(u64::try_from(code.len()).unwrap())?;
// code_index
self.jitdump.write_u64::<NativeEndian>(code_idx)?;
// char[n]: function name (including the null terminator)
self.jitdump
.write_all(format!("jitted_func{}", code_idx).as_bytes())?;
self.jitdump.write_u8(0)?;
// native code
self.jitdump.write_all(code)?;
// ...elided...
Ok(())
}
}
Essentially here we give each chunk of JITted code a "code index" to uniquely
identify it, a symbol name that we'd like to see in the Perf profile, the
code's start address and length, and then the JITted executable code itself.
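For context, here is my reading of the full CODE_LOAD record layout from the jitdump specification that ships with Linux perf (tools/perf/Documentation/jitdump-specification.txt); the fields elided from the code above come from this set. Treat it as a sketch of the spec, not a substitute for reading it:

```rust
// CODE_LOAD record layout per the jitdump specification. All integers are
// written in the host's native endianness.
//
// Record header:
//   id          u32   record type (0 == JIT_CODE_LOAD)
//   total_size  u32   size of the whole record, header included
//   timestamp   u64   creation time, using the same clock as the file header
//
// CODE_LOAD payload:
//   pid         u32   process id of the JIT
//   tid         u32   id of the thread that generated the code
//   vma         u64   virtual address of the code start
//   code_addr   u64   address where the code was generated (== vma here)
//   code_size   u64   size of the native code in bytes
//   code_index  u64   unique per-process index for this chunk of code
//   name        cstr  null-terminated function name
//   code        [u8]  the native code itself

// The fixed-size portion (everything before the variable-length name)
// adds up to 56 bytes.
const CODE_LOAD_FIXED_SIZE: usize = 4 + 4 + 8 + 4 + 4 + 8 + 8 + 8 + 8;

fn main() {
    println!("fixed CODE_LOAD prefix: {} bytes", CODE_LOAD_FIXED_SIZE);
}
```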
Next we have to think about instantiating our JitDump data structure somewhere. For better or worse, Perf requires the JITted code for the entire process (i.e. PID) to be written to the same JIT dump file. Were our JIT multi-threaded, access to this JitDump instance would therefore need to be shared and synchronised so that CODE_LOAD records wouldn't get interleaved when two threads emit a CODE_LOAD record simultaneously.
Our toy JIT isn't multi-threaded, but for demonstration purposes we'll use a thread-safe implementation anyway:
/// jit dump file for the current process.
static JIT_DUMP: LazyLock<Mutex<JitDump>> = LazyLock::new(|| Mutex::new(JitDump::new().unwrap()));
Then we can make our life easier by writing a little helper function to write freshly JITted code into the JIT dump:
/// Register JITted code with perf.
fn register_jitted_code(code: &ExecutableBuffer, code_idx: u64) {
JIT_DUMP
.lock()
.unwrap()
.emit_code_load_record(&code, code_idx)
.unwrap();
}
Then all that remains is to call register_jitted_code() in the appropriate places (in our case, in main()). You want to make sure you do this before the JITted code is executed.
...
let jitted_code1 = jit_loop(bound1);
register_jitted_code(&jitted_code1, 1); // <---
let jitted_func1: extern "C" fn() -> i32 =
unsafe { mem::transmute(jitted_code1.ptr(AssemblyOffset(0))) };
let jitted_code2 = jit_loop(bound2);
register_jitted_code(&jitted_code2, 2); // <---
let jitted_func2: extern "C" fn() -> i32 =
unsafe { mem::transmute(jitted_code2.ptr(AssemblyOffset(0))) };
...
That's it. Then we can build and profile our JITted code like so:
$ perf record -k CLOCK_MONOTONIC -g --call-graph=dwarf \
./target/debug/jit 1000000000 2000000000
...
Note that -k selects the clock source to use (in this case CLOCK_MONOTONIC). This must match the clock source used for the timestamps in your JIT dump files.
After this, we should see our jit-<pid>.dump file on the filesystem:
$ ls *.dump
jit-851825.dump
Next we do the perf inject step:
$ perf inject --jit --input perf.data --output perf.jit.data
Now we should have one on-disk shared object per chunk of JITted code:
$ ls *.so
jitted-851825-1.so jitted-851825-2.so
And finally, we can look at the profile report:
$ perf report -g -i perf.jit.data
Samples: 6K of event 'cycles:P', Event count (approx.): 3026404125
Children Self Command Shared Object Symbol
+ 66.61% 66.61% jit jitted-851825-2.so [.] jitted_func2
+ 33.34% 33.34% jit jitted-851825-1.so [.] jitted_func1
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start_final (inlined)
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_sysdep_start
0.02% 0.00% jit [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
0.02% 0.00% jit [kernel.kallsyms] [k] do_syscall_64
0.02% 0.00% jit ld-linux-x86-64.so.2 [.] dl_main
...
The JITted code entries now have symbol names and that mysterious third JITted
code entry is gone!
Further, the annotate functionality now works too:
Samples: 6K of event 'cycles:P', 4000 Hz, Event count (approx.): 3026404125
jitted_func2 /home/vext01/tmp/tmp/jit2/jitted-851825-2.so [Percent: local period]
Percent │
│
│
│ Disassembly of section .text:
│
│ 0000000000000080 <jitted_func2>:
│ mov $0x0,%rax
0.02 │ 7:┌─→cmp $0x77359400,%rax
│ │↓ je 1c
│ │ add $0x1,%rax
99.98 │ └──jmp 7
│1c: ← ret
Summary
In this post I've shown that, with a fairly small amount of instrumentation, we
can make JITted code work properly with Linux Perf.
I've pushed the full example code to GitHub here. I've also included a justfile that will build, record, inject and generate the report. Just run just in the top-level directory.
Further reading
(All Git links are permalinks to the newest commit at the time of writing.)