Profiling
Quick profiling in your terminal
Note
This is only meant to be used for quick profiling or programmatically accessing the profiling results. For more detailed and GUI friendly profiling proceed to the next section.
Simply replace the use of Base.@time or Base.@timed with Reactant.Profiler.@time or Reactant.Profiler.@timed. We will automatically compile the function (with sync=true) if it is not already a Reactant-compiled function.
using Reactant
x = Reactant.to_rarray(randn(Float32, 100, 2))
W = Reactant.to_rarray(randn(Float32, 10, 100))
b = Reactant.to_rarray(randn(Float32, 10))
linear(x, W, b) = (W * x) .+ b
Reactant.@time linear(x, W, b)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1769147330.038449 4284 profiler_session.cc:103] Profiler session initializing.
I0000 00:00:1769147330.038484 4284 profiler_session.cc:118] Profiler session started.
I0000 00:00:1769147330.038752 4284 profiler_session.cc:68] Profiler session collecting data.
I0000 00:00:1769147330.039314 4284 save_profile.cc:150] Collecting XSpace to repository: /tmp/reactant_profile/plugins/profile/2026_01_23_05_48_50/runnervmymu0l.xplane.pb
I0000 00:00:1769147330.039469 4284 save_profile.cc:123] Creating directory: /tmp/reactant_profile/plugins/profile/2026_01_23_05_48_50
I0000 00:00:1769147330.039599 4284 save_profile.cc:129] Dumped gzipped tool data for trace.json.gz to /tmp/reactant_profile/plugins/profile/2026_01_23_05_48_50/runnervmymu0l.trace.json.gz
I0000 00:00:1769147330.039614 4284 profiler_session.cc:136] Profiler session tear down.
I0000 00:00:1769147330.053052 4284 stub_factory.cc:159] Created gRPC channel for address: 0.0.0.0:36283
I0000 00:00:1769147330.053321 4284 grpc_server.cc:93] Server listening on 0.0.0.0:36283
I0000 00:00:1769147330.053337 4284 xplane_to_tools_data.cc:598] serving tool: memory_profile with options: {} using ProfileProcessor
I0000 00:00:1769147330.053343 4284 xplane_to_tools_data.cc:618] Using local processing for tool: memory_profile
I0000 00:00:1769147330.053345 4284 memory_profile_processor.cc:47] Processing memory profile for host: runnervmymu0l
I0000 00:00:1769147330.251909 4284 xplane_to_tools_data.cc:598] serving tool: op_profile with options: {} using ProfileProcessor
I0000 00:00:1769147330.251938 4284 xplane_to_tools_data.cc:618] Using local processing for tool: op_profile
I0000 00:00:1769147330.252124 4284 xprof_thread_pool_executor.cc:22] Creating derived_timeline_trace_events XprofThreadPoolExecutor with 4 threads.
I0000 00:00:1769147330.253142 4284 xprof_thread_pool_executor.cc:22] Creating ProcessTensorCorePlanes XprofThreadPoolExecutor with 4 threads.
I0000 00:00:1769147330.256132 4284 xprof_thread_pool_executor.cc:22] Creating op_stats_threads XprofThreadPoolExecutor with 4 threads.
I0000 00:00:1769147330.317783 4284 xplane_to_tools_data.cc:598] serving tool: overview_page with options: {} using ProfileProcessor
I0000 00:00:1769147330.317804 4284 xplane_to_tools_data.cc:618] Using local processing for tool: overview_page
I0000 00:00:1769147330.317807 4284 overview_page_processor.cc:64] OverviewPageProcessor::ProcessSession
I0000 00:00:1769147330.318134 4284 xprof_thread_pool_executor.cc:22] Creating ConvertMultiXSpaceToInferenceStats XprofThreadPoolExecutor with 1 threads.
runtime: 0.00023045s
compile time: 4.36774145s
Reactant.@timed nrepeat=100 linear(x, W, b)
AggregateProfilingResult(
runtime = 0.00023045s,
compile_time = 0.10729345s,
)
Note that the information returned depends on the backend. Specifically, the CUDA and TPU backends provide more detailed information about memory usage and allocation (something like the following will be displayed on GPUs):
AggregateProfilingResult(
runtime = 0.00001235s,
compile_time = 0.20724930s, # time spent compiling by Reactant
GPU_0_bfc = MemoryProfileSummary(
peak_bytes_usage_lifetime = 32.015 MiB, # peak memory usage over the entire program (lifetime of memory allocator)
peak_stats = MemoryAggregationStats(
stack_reserved_bytes = 0 bytes, # memory usage by stack reservation
heap_allocated_bytes = 30.750 KiB, # memory usage by heap allocation
free_memory_bytes = 4.228 GiB, # free memory available for allocation or reservation
fragmentation = 0.0, # fragmentation of memory within [0, 1]
peak_bytes_in_use = 30.750 KiB # The peak memory usage over the entire program
)
peak_stats_time = 0.02420451s,
memory_capacity = 4.228 GiB # memory capacity of the allocator
)
flops = FlopsSummary(
Flops = 5.180502680725853e-7, # [flops / (peak flops * program time)], capped at 1.0
UncappedFlops = 5.180502680725853e-7,
RawFlops = 4060.0, # Total FLOPs performed
BF16Flops = 4060.0, # Total FLOPs Normalized to the bf16 (default) devices peak bandwidth
)
)
Additionally, for GPUs and TPUs, we can use the Reactant.@profile macro to profile the function and obtain information about each of the executed kernels.
Reactant.@profile linear(x, W, b)
I0000 00:00:1769147331.082428 4284 profiler_session.cc:103] Profiler session initializing.
I0000 00:00:1769147331.082466 4284 profiler_session.cc:118] Profiler session started.
I0000 00:00:1769147331.082559 4284 profiler_session.cc:68] Profiler session collecting data.
I0000 00:00:1769147331.082935 4284 save_profile.cc:150] Collecting XSpace to repository: /tmp/reactant_profile/plugins/profile/2026_01_23_05_48_51/runnervmymu0l.xplane.pb
I0000 00:00:1769147331.083077 4284 save_profile.cc:123] Creating directory: /tmp/reactant_profile/plugins/profile/2026_01_23_05_48_51
I0000 00:00:1769147331.083189 4284 save_profile.cc:129] Dumped gzipped tool data for trace.json.gz to /tmp/reactant_profile/plugins/profile/2026_01_23_05_48_51/runnervmymu0l.trace.json.gz
I0000 00:00:1769147331.083207 4284 profiler_session.cc:136] Profiler session tear down.
I0000 00:00:1769147331.083265 4284 xplane_to_tools_data.cc:598] serving tool: memory_profile with options: {} using ProfileProcessor
I0000 00:00:1769147331.083270 4284 xplane_to_tools_data.cc:618] Using local processing for tool: memory_profile
I0000 00:00:1769147331.083273 4284 memory_profile_processor.cc:47] Processing memory profile for host: runnervmymu0l
I0000 00:00:1769147331.083460 4284 xplane_to_tools_data.cc:598] serving tool: op_profile with options: {} using ProfileProcessor
I0000 00:00:1769147331.083471 4284 xplane_to_tools_data.cc:618] Using local processing for tool: op_profile
I0000 00:00:1769147331.083572 4284 xprof_thread_pool_executor.cc:22] Creating derived_timeline_trace_events XprofThreadPoolExecutor with 4 threads.
I0000 00:00:1769147331.084518 4284 xprof_thread_pool_executor.cc:22] Creating ProcessTensorCorePlanes XprofThreadPoolExecutor with 4 threads.
I0000 00:00:1769147331.085264 4284 xprof_thread_pool_executor.cc:22] Creating op_stats_threads XprofThreadPoolExecutor with 4 threads.
I0000 00:00:1769147331.085945 4284 xplane_to_tools_data.cc:598] serving tool: overview_page with options: {} using ProfileProcessor
I0000 00:00:1769147331.085961 4284 xplane_to_tools_data.cc:618] Using local processing for tool: overview_page
I0000 00:00:1769147331.085963 4284 overview_page_processor.cc:64] OverviewPageProcessor::ProcessSession
I0000 00:00:1769147331.086065 4284 xprof_thread_pool_executor.cc:22] Creating ConvertMultiXSpaceToInferenceStats XprofThreadPoolExecutor with 1 threads.
I0000 00:00:1769147331.204594 4284 xplane_to_tools_data.cc:598] serving tool: kernel_stats with options: {} using ProfileProcessor
I0000 00:00:1769147331.204624 4284 xplane_to_tools_data.cc:618] Using local processing for tool: kernel_stats
I0000 00:00:1769147331.379936 4284 xplane_to_tools_data.cc:598] serving tool: framework_op_stats with options: {} using ProfileProcessor
I0000 00:00:1769147331.379964 4284 xplane_to_tools_data.cc:618] Using local processing for tool: framework_op_stats
╔================================================================================╗
║ SUMMARY ║
╚================================================================================╝
AggregateProfilingResult(
runtime = 0.00006729s,
compile_time = 0.10620102s, # time spent compiling by Reactant
)
On GPUs this would look something like the following:
╔================================================================================╗
║ KERNEL STATISTICS ║
╚================================================================================╝
┌───────────────────┬─────────────┬────────────────┬──────────────┬──────────────┬──────────────┬──────────────┬───────────┬──────────┬────────────┬─────────────┐
│ Kernel Name │ Occurrences │ Total Duration │ Avg Duration │ Min Duration │ Max Duration │ Static Shmem │ Block Dim │ Grid Dim │ TensorCore │ Occupancy % │
├───────────────────┼─────────────┼────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼───────────┼──────────┼────────────┼─────────────┤
│ gemm_fusion_dot_1 │ 1 │ 0.00000266s │ 0.00000266s │ 0.00000266s │ 0.00000266s │ 8.000 KiB │ 64,1,1 │ 1,1,1 │ ✗ │ 50.0% │
│ loop_add_fusion │ 1 │ 0.00000157s │ 0.00000157s │ 0.00000157s │ 0.00000157s │ 0 bytes │ 20,1,1 │ 1,1,1 │ ✗ │ 31.2% │
└───────────────────┴─────────────┴────────────────┴──────────────┴──────────────┴──────────────┴──────────────┴───────────┴──────────┴────────────┴─────────────┘
╔================================================================================╗
║ FRAMEWORK OP STATISTICS ║
╚================================================================================╝
┌───────────────────┬─────────┬─────────────┬─────────────┬─────────────────┬───────────────┬──────────┬───────────┬──────────────┬──────────┐
│ Operation │ Type │ Host/Device │ Occurrences │ Total Self-Time │ Avg Self-Time │ Device % │ Memory BW │ FLOP Rate │ Bound By │
├───────────────────┼─────────┼─────────────┼─────────────┼─────────────────┼───────────────┼──────────┼───────────┼──────────────┼──────────┤
│ gemm_fusion_dot.1 │ Unknown │ Device │ 1 │ 0.00000266s │ 0.00000266s │ 62.88% │ 1.71 GB/s │ 1.51 GFLOP/s │ HBM │
│ +/add │ add │ Device │ 1 │ 0.00000157s │ 0.00000157s │ 37.12% │ 0.12 GB/s │ 0.04 GFLOP/s │ HBM │
└───────────────────┴─────────┴─────────────┴─────────────┴─────────────────┴───────────────┴──────────┴───────────┴──────────────┴──────────┘
╔================================================================================╗
║ SUMMARY ║
╚================================================================================╝
AggregateProfilingResult(
runtime = 0.00002246s,
compile_time = 0.16447328s, # time spent compiling by Reactant
GPU_0_bfc = MemoryProfileSummary(
peak_bytes_usage_lifetime = 32.015 MiB, # peak memory usage over the entire program (lifetime of memory allocator)
peak_stats = MemoryAggregationStats(
stack_reserved_bytes = 0 bytes, # memory usage by stack reservation
heap_allocated_bytes = 31.250 KiB, # memory usage by heap allocation
free_memory_bytes = 4.228 GiB, # free memory available for allocation or reservation
fragmentation = 0.0, # fragmentation of memory within [0, 1]
peak_bytes_in_use = 31.250 KiB # The peak memory usage over the entire program
)
peak_stats_time = 0.00812043s,
memory_capacity = 4.228 GiB # memory capacity of the allocator
)
flops = FlopsSummary(
Flops = 3.747296689092735e-6, # [flops / (peak flops * program time)], capped at 1.0
UncappedFlops = 3.747296689092735e-6,
RawFlops = 4060.0, # Total FLOPs performed
BF16Flops = 4060.0, # Total FLOPs Normalized to the bf16 (default) devices peak bandwidth
)
)
Capturing traces
When running Reactant, it is possible to capture traces using the XLA profiler. These traces show where the XLA-specific parts of the program spend time during compilation or execution. Note that tracing and compilation happen on the CPU even when the final execution is intended to run on another device such as a GPU or TPU; including tracing and compilation in a capture will therefore produce annotations on the CPU.
Let's set up a simple function which we can then profile:
using Reactant
x = Reactant.to_rarray(randn(Float32, 100, 2))
W = Reactant.to_rarray(randn(Float32, 10, 100))
b = Reactant.to_rarray(randn(Float32, 10))
linear(x, W, b) = (W * x) .+ b
linear (generic function with 1 method)
The profiler can be accessed using the Reactant.with_profiler function.
Reactant.with_profiler("./") do
mylinear = Reactant.@compile linear(x, W, b)
mylinear(x, W, b)
end
10×2 ConcretePJRTArray{Float32,2}:
6.15852 1.35862
9.16206 -4.2409
2.58971 0.318061
-0.246236 14.8722
4.56529 11.4766
-10.7673 -1.77
7.07233 10.7504
0.440019 -13.3573
-12.0986 17.664
-2.28216 4.33028
Running this function should create a folder called plugins inside the directory passed to Reactant.with_profiler, containing the trace files. The traces can then be visualized in different ways.
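Based on the paths shown in the profiler logs above, the layout under that directory looks roughly like the following (the timestamp and host name will differ on your machine):
./plugins/profile/<timestamp>/<hostname>.xplane.pb
./plugins/profile/<timestamp>/<hostname>.trace.json.gz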
Note
For more insight into the current state of Reactant, device allocation information can be fetched using the Reactant.XLA.allocatorstats function.
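As a minimal sketch, assuming that calling allocatorstats with no arguments reports the default device (the exact fields returned depend on the backend):
using Reactant

# Query allocation statistics for the default device (assumed no-argument form).
stats = Reactant.XLA.allocatorstats()
@show stats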
Perfetto UI

The first and easiest way to visualize a captured trace is to use the online perfetto.dev tool. Reactant.with_profiler has a keyword argument called create_perfetto_link which creates a usable Perfetto URL for the generated trace. The function blocks execution until the URL has been opened and the trace visualized. The URL only works once.
Reactant.with_profiler("./"; create_perfetto_link=true) do
mylinear = Reactant.@compile linear(x, W, b)
mylinear(x, W, b)
end
Note
It is recommended to use the Chrome browser to open the perfetto URL.
TensorBoard

Another option to visualize the generated trace files is to use the TensorBoard profiler plugin. The TensorBoard viewer can offer more detail than the timeline view, such as visualizations of compute graphs.
First install tensorboard and its profiler plugin:
pip install tensorboard tensorboard-plugin-profile
And then run the following in the folder where the plugins folder was generated:
tensorboard --logdir ./
Adding Custom Annotations
By default, the traces contain only information captured from within XLA. The Reactant.Profiler.annotate function can be used to annotate traces for Julia code evaluated during tracing.
Reactant.Profiler.annotate("my_annotation") do
# Do things...
end
The added annotations will be captured in the traces and can be seen in the different viewers alongside the default XLA annotations. When the profiler is not active, the custom annotations have no effect, so they can always be left enabled.
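For example, annotations can be combined with Reactant.with_profiler from the previous section. This is a minimal sketch reusing the linear example from above; the annotation label is chosen here purely for illustration:
mylinear = Reactant.@compile linear(x, W, b)

Reactant.with_profiler("./") do
    # This label will show up in the captured trace alongside the default XLA annotations.
    Reactant.Profiler.annotate("run_linear") do
        mylinear(x, W, b)
    end
end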