PyTorch Profiler
PyTorch Profiler provides detailed kernel execution time, call stack, and GPU utilization metrics.
Denoising Stage Profiling
Profile the denoising stage with sampled timesteps (default: 5 steps after 1 warmup step):
- --profile: Enable profiling for the denoising stage
- --num-profiled-timesteps N: Number of timesteps to profile after warmup (default: 5)
  - Smaller values reduce trace file size
  - Example: --num-profiled-timesteps 10 profiles 10 steps after 1 warmup step
Full Pipeline Profiling
Profile all pipeline stages (text encoding, denoising, VAE decoding, etc.):
- --profile-all-stages: Used with --profile; profiles all pipeline stages instead of just denoising
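As a rough illustration of how the profiling flags above combine, the following Python sketch assembles the argument list for a profiling run. The helper name build_profile_command is hypothetical, and the base command is a placeholder: the model and prompt arguments are not covered in this section and are deliberately left out, so the printed commands are incomplete as written.

```python
import shlex


def build_profile_command(base_args, num_profiled_timesteps=None, all_stages=False):
    """Append the profiling flags described above to a base command.

    base_args: the generation command as a list of strings (placeholder here).
    num_profiled_timesteps: value for --num-profiled-timesteps, or None to
        keep the default.
    all_stages: add --profile-all-stages, which is only meaningful together
        with --profile.
    """
    cmd = list(base_args) + ["--profile"]
    if num_profiled_timesteps is not None:
        cmd += ["--num-profiled-timesteps", str(num_profiled_timesteps)]
    if all_stages:
        cmd.append("--profile-all-stages")
    return cmd


# Placeholder base command: model and prompt arguments are omitted on purpose.
base = ["sglang", "generate"]

# Denoising stage only, profiling 10 steps after the warmup step.
print(shlex.join(build_profile_command(base, num_profiled_timesteps=10)))

# All pipeline stages, default number of profiled timesteps.
print(shlex.join(build_profile_command(base, all_stages=True)))
```

Because --profile is always added, the sketch cannot produce --profile-all-stages on its own, which matches the requirement that the two flags be used together.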
Output Location
By default, trace files are saved in the ./logs/ directory. The exact output file path is printed in the console output.
View Traces
Load and visualize trace files at:
- https://ui.perfetto.dev/ (recommended)
- chrome://tracing (Chrome only)
To keep trace files small enough to load, reduce --num-profiled-timesteps or avoid using --profile-all-stages.
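If the console output has scrolled away, one way to find the trace to upload is to pick the most recently modified file in the log directory. The sketch below assumes only the default ./logs/ location mentioned above. The helper name newest_trace is hypothetical, and since the trace file naming scheme is not described here, the function considers every file rather than filtering by extension.

```python
from pathlib import Path


def newest_trace(log_dir="./logs"):
    """Return the most recently modified file under log_dir, or None.

    No assumption is made about the trace file name or extension, so any
    file in the directory tree is a candidate.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return None
    files = [p for p in root.rglob("*") if p.is_file()]
    # default=None covers an existing but empty directory.
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    print(newest_trace())
```

If other tools also write into the same directory, the result may not be a profiler trace, so the path printed by the run itself remains the authoritative one.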
--perf-dump-path (Stage/Step Timing Dump)
Besides profiler traces, you can also dump a lightweight JSON report that contains:
- stage-level timing breakdown for the full pipeline
- step-level timing breakdown for the denoising stage (per diffusion step)
The step-level breakdown is stored in the denoise_steps_ms field, formatted as an array of objects, each with a step key (the step index) and a duration_ms key.
Example:
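A minimal sketch of how such a report might be post-processed is shown below. It relies only on the denoise_steps_ms layout described above (objects with step and duration_ms keys). The function name summarize_denoise_steps and the sample values are made up for illustration, and the stage-level fields are not touched because their names are not given here.

```python
import json


def summarize_denoise_steps(report):
    """Summarize per-step denoising timings from a perf dump dictionary.

    Expects report["denoise_steps_ms"] to be a list of objects with
    "step" and "duration_ms" keys. Returns None if no steps are present.
    """
    steps = report.get("denoise_steps_ms", [])
    if not steps:
        return None
    durations = [entry["duration_ms"] for entry in steps]
    slowest = max(steps, key=lambda entry: entry["duration_ms"])
    return {
        "num_steps": len(durations),
        "total_ms": sum(durations),
        "mean_ms": sum(durations) / len(durations),
        "slowest_step": slowest["step"],
    }


# Made-up sample in the documented shape; real reports contain more fields.
sample = json.loads(
    '{"denoise_steps_ms": ['
    '{"step": 0, "duration_ms": 120.0}, '
    '{"step": 1, "duration_ms": 80.0}, '
    '{"step": 2, "duration_ms": 100.0}]}'
)
print(summarize_denoise_steps(sample))
```

For a real run, load the file written to the --perf-dump-path location with json.load and pass the resulting dictionary to the function. A first step that is much slower than the rest usually points to warmup cost rather than steady-state performance.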
Nsight Systems
Nsight Systems provides low-level CUDA profiling with kernel details, register usage, and memory access patterns.
Installation
See the SGLang profiling guide for installation instructions.
Basic Profiling
Profile the entire pipeline execution by running the sglang generate command under nsys profile.
Targeted Stage Profiling
Use --delay and --duration to capture specific stages and reduce file size:
- --delay N: Wait N seconds before starting capture (skip initialization overhead)
- --duration N: Capture for N seconds (focus on specific stages)
- --force-overwrite: Overwrite existing output files
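The capture options above can be combined as in the following sketch, which builds an nsys profile command line around a target command. The helper name build_nsys_command, the report name, and the delay and duration values are illustrative assumptions. The target command is again a placeholder without model or prompt arguments, and suitable timings depend on how long initialization takes on a given machine.

```python
import shlex


def build_nsys_command(target_cmd, output, delay=None, duration=None, overwrite=False):
    """Wrap target_cmd in an `nsys profile` invocation.

    output: report name passed to --output.
    delay: seconds to wait before capture starts (--delay), or None.
    duration: seconds to capture (--duration), or None for the whole run.
    overwrite: pass --force-overwrite=true to replace an existing report.
    """
    cmd = ["nsys", "profile", f"--output={output}"]
    if delay is not None:
        cmd.append(f"--delay={delay}")
    if duration is not None:
        cmd.append(f"--duration={duration}")
    if overwrite:
        cmd.append("--force-overwrite=true")
    return cmd + list(target_cmd)


# Placeholder target: model and prompt arguments are omitted on purpose.
target = ["sglang", "generate"]

# Basic profiling of the whole run.
print(shlex.join(build_nsys_command(target, "full_run")))

# Targeted capture: skip 10 s of startup, then record 30 s (values are examples).
print(shlex.join(build_nsys_command(target, "denoise_only", delay=10, duration=30, overwrite=True)))
```

Passing the resulting list to subprocess.run would launch the capture. The sketch only prints the command so that it can be inspected first.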
Notes
- Reduce trace size: Use --num-profiled-timesteps with smaller values or --delay/--duration with Nsight Systems
- Stage-specific analysis: Use --profile alone for the denoising stage; add --profile-all-stages for the full pipeline
- Multiple runs: Profile with different prompts and resolutions to identify bottlenecks across workloads
FAQ
- If you are profiling sglang generate with Nsight Systems and the generated profiler file did not capture any CUDA kernels, increase the model's number of inference steps to extend the execution time.
