Tiny LLM Benchmark: Jetson Orin Nano Super 8GB — Four Power Modes × Eight Models

Eight non-thinking LLMs benchmarked across all four Jetson Orin Nano Super power modes (7W / 15W / 25W / MAXN) with llama.cpp CUDA. 25W is the Pareto-optimal sweet spot: 43 % more throughput than 15W while beating MAXN on output tok/J for every model tested.

Four Power Modes × Eight Models: llama.cpp / CUDA

Platform: NVIDIA Jetson Orin Nano Super 8GB
CPU: 6-core Arm Cortex-A78AE · GPU: NVIDIA Ampere (1024 CUDA cores, 32 Tensor cores)
Memory: 8 GB LPDDR5 shared CPU+GPU · JetPack: R36.4.7 (L4T 36.4)
Backend: llama.cpp CUDA, -ngl 99 (all layers on GPU), --no-cache-prompt
Runs: Four full sweeps: 7W, 15W, 25W, MAXN_SUPER
Sweep: prompt ∈ {128, 512, 1024, 2048} tok × gen ∈ {64, 128, 256} tok × 20 reqs/combo
Concurrency: 1 (single-user) · Key metric: output tok/J = OSL ÷ (avg_power_W × RL_p50_s)

Raw data on Hugging Face — complete per-cell JSON exports (all 33 metrics, 12 prompt×gen combos × 20 requests per cell, profile_export_aiperf.json + tegrastats.log + server logs):

Mode Dataset Models Cells
7W YuvrajSingh9886/jetson-non-reasoning-benchmark-7w 8 96
15W YuvrajSingh9886/jetson-non-reasoning-benchmark-15w 8 96
25W YuvrajSingh9886/jetson-non-reasoning-benchmark-25w 8 96
MAXN YuvrajSingh9886/jetson-non-reasoning-benchmark-maxn 8 96

Executive Summary

Eight tiny non-thinking LLMs were benchmarked across all four Jetson Orin Nano Super power modes: 7W, 15W, 25W, and MAXN_SUPER. Each model ran 12 combinations of prompt × generation length (20 requests per combo) at every power mode where it could load.

Key finding: 25W (nvpmodel -m 1) is the energy-efficiency sweet spot for every model tested. It delivers 36–47 % more output tok/s than 15W while pushing output tok/J 3–26 % higher than 15W and 8–35 % higher than MAXN_SUPER across every model (ctx=2048, gen=256).

Throughput winner at each mode (ctx=2048, gen=256, highest sweep point):

Table 1: Throughput and efficiency winner at each power mode

Mode Fastest model Output Tok/s Output Tok/J
7W smollm2-135m 53.9 21.7
15W smollm2-135m 114.5 21.7
25W smollm2-135m 165.1 22.6
MAXN smollm2-135m 159.4 20.0

gemma3-4b (Q4_K_M, 2.4 GB) fails at every power mode: too large for 8 GB unified memory when combined with KV cache and CUDA overhead.

1. Test Setup

1.1 Hardware

Table 2: Hardware configuration

Component Detail
Board Jetson Orin Nano Super 8GB (Developer Kit)
CPU 6× Arm Cortex-A78AE @ up to 1.728 GHz
GPU NVIDIA Ampere, 1024 CUDA cores, 32 Tensor cores
Memory 8 GB LPDDR5 204.8 GB/s (unified CPU + GPU)
CMA 256 MB (contiguous memory pool; depletes across sequential model loads)
Cooling Active fan; peak junction temperature ≤ 73 °C across all modes

1.2 Software Stack

Table 3: Software stack

Layer Version / Detail
OS / JetPack JetPack R36.4.7 (Ubuntu 22.04, L4T 36.4)
CUDA 12.6
llama.cpp CUDA backend, -ngl 99, --no-cache-prompt --cache-ram 0
Inference server llama-server: host 0.0.0.0:8080, --parallel 1, -c 2560
Load generator aiperf (NVIDIA AI Performance tool)
Power telemetry tegrastats at 500 ms, VDD_CPU_GPU_CV rail (mW)
Python 3.10 (aiperf-env), pandas, seaborn, matplotlib
Datasets Synthetic prompts at exact token counts (128, 512, 1024, 2048) generated synthetically via aiperf

1.3 Models Under Test

Table 4: Models under test

Model Quant GGUF size Tokenizer
SmolLM2-135M-Instruct Q4_K_M 101 MB HuggingFaceTB/SmolLM2-135M-Instruct
SmolLM2-360M-Instruct Q8_0 369 MB HuggingFaceTB/SmolLM2-360M-Instruct
Qwen2.5-0.5B-Instruct Q4_K_M 469 MB Qwen/Qwen2.5-0.5B-Instruct
LFM2.5-350M Q4_K_M 219 MB LiquidAI/LFM2.5-350M
LFM2.5-1.2B-Instruct Q4_K_M 698 MB LiquidAI/LFM2.5-1.2B-Instruct
Qwen3-0.6B Q8_0 610 MB Qwen/Qwen3-0.6B
Llama-3.2-1B-Instruct Q4_K_M 771 MB meta-llama/Llama-3.2-1B-Instruct
Gemma3-1B-IT Q4_K_M 769 MB google/gemma-3-1b-it
Gemma3-4B-IT Q4_K_M 2.4 GB N/A

Quantization note: SmolLM2-360M-Instruct and Qwen3-0.6B use Q8_0 (8-bit, near-lossless); all other models use Q4_K_M (4-bit K-quant medium). These match the quantizations that ollama ships for Qwen2.5-0.5B, Gemma3-1B, and Gemma3-4B, but differ for SmolLM2-135M (Ollama: F16), SmolLM2-360M (Ollama: F16), Qwen3-0.6B (Ollama: Q4_K_M), and Llama-3.2-1B (Ollama: Q8_0). Results should not be assumed comparable to ollama run defaults without accounting for these quantization differences.

1.4 Power Modes

Table 5: Power mode configurations

Mode nvpmodel GPU clock CPU clock VDD_CPU_GPU_CV (observed)
7W -m 3 ~408 MHz 960 MHz 0.5–2.5 W under load
15W -m 0 ~612 MHz 1 190 MHz 3–7 W under load
25W -m 1 ~820 MHz 1 420 MHz 4–10 W under load
MAXN -m 2 + jetson_clocks 1 020 MHz 1 728 MHz 6–12 W under load

1.5 Benchmark Methodology

  • For each model × prompt × gen combo, aiperf sends 20 single-concurrency requests with synthetic prompts at the exact target token count.
  • Power is computed from tegrastats VDD_CPU_GPU_CV (mW → W) averaged over each run’s start_time/end_time window. output_tok_J = OSL ÷ (avg_power_W × RL_p50_s).
  • Clocks were locked with jetson_clocks at all modes. CMA was compacted (/proc/sys/vm/compact_memory) between model loads.
  • Each run’s power and clock speed was capped at x W through nvpmodel and monitored for thermal stability (no sustained throttling; junction temp ≤ 73 °C).
  • Latency percentile used throughout: all TTFT, ITL, and request latency (RL) values reported in charts, tables, and energy calculations use the p50 (median) over the 20 requests per combo. The mean is not used for latency because occasional slow requests (GC pause, memory compaction, OS scheduling) inflate it without reflecting typical behaviour. p90 and p99 are available in the raw per-mode Hugging Face datasets (see raw data table at the top of this post) for tail-latency analysis.

2. Results: Charts

All charts use data from all four power modes.

2.1 Throughput vs Prompt Length

Output tok/s vs prompt length at gen=256 across all models and modes; 25W (orange) consistently leads:

Figure 1: Output tok/s vs prompt length (gen=256, all models and modes)

Tok/s vs Prompt gen=256

Canonical cell (ctx=2048, gen=256), side-by-side output tok/s and output tok/J bars for all 4 modes:

Figure 2: Canonical cell: output tok/s and tok/J side by side (ctx=2048, gen=256)

Canonical Cell Comparison


2.2 Energy Efficiency

  • Output Tok/J vs prompt length at gen=256; 25W leads for every model at every prompt length:

Figure 3: Output tok/J vs prompt length (gen=256, all models and modes)

Output Tok/J vs Prompt

  • Output Tok/J heatmap (gen x prompt) for small models (available at all 4 modes):

Figure 4: Output tok/J heatmap: small models at all 4 power modes (gen x prompt)

Output Tok/J Heatmap small models

  • Output Tok/J heatmap for larger models (all 4 modes):

Figure 5: Output tok/J heatmap: larger models at 15W / 25W / MAXN (gen x prompt)

Output Tok/J Heatmap large models

  • SmolLM2-135M spotlight: output tok/J at all 4 modes across gen sizes:

Figure 6: SmolLM2-135M output tok/J at all 4 power modes across gen lengths

SmolLM2-135M Output Tok/J Spotlight

  • Prefill tok/J (input tokens per joule of prefill energy) vs prompt length at gen=256, how efficiently each mode processes the prompt; higher is better:

Figure 7a: Prefill tok/J (input tok / J) vs prompt length (gen=256, all models and modes)

Prefill tok/J vs prompt gen=256

Prefill tok/J is approximate for 63 % of cells (TTFT < 500 ms → no tegrastats sample in prefill window). Decode tok/J and total tok/J are not affected.

  • Decode tok/J (output tokens per joule of decode energy) vs prompt length at gen=256, output generation efficiency; decreases with increase in prompt length since decode cost depends on output length not input; 25W leads:

Figure 7b: Decode tok/J (output tok / J) vs prompt length (gen=256, all models and modes)

Decode tok/J vs prompt gen=256

  • Total tok/J ((input + output) tokens per joule of total request energy) vs prompt length at gen=256, overall request efficiency; 25W wins at every model and prompt length:

Figure 7c: Total tok/J (input+output tok / J) vs prompt length (gen=256, all models and modes)

Total tok/J vs prompt gen=256

Full tok/J charts for all ctx/gen combinations: F.1 Prefill · F.2 Decode · F.3 Total.


2.3 Latency

TTFT p50 at ctx=2048, gen=256; 25W and MAXN reduce TTFT by 30–38 % vs 15W:

Figure 8: TTFT p50 by power mode (ctx=2048, gen=256)

TTFT vs Prompt

ITL (inter-token latency) p50 at ctx=2048, gen=256; lower is better:

Figure 9: ITL p50 by power mode (ctx=2048, gen=256)

ITL Comparison

Request latency (E2E) p50 at ctx=2048, gen=256; total time from request start to last token received:

Figure 10: Request latency (E2E) p50 by power mode (ctx=2048, gen=256)

Request Latency Comparison


2.4 Prefill Throughput

25W and MAXN provide ~35-40 % faster prefill than 15W:

Figure 11: Prefill throughput by power mode (gen=256, avg over all prompt lengths)

Prefill Comparison


2.5 Power Draw

Average VDD_CPU_GPU_CV per model at each mode:

Figure 12: Average VDD_CPU_GPU_CV power draw per model at each power mode

Avg Power Bar

Table 6: Average power draw per model at each power mode (W, VDD_CPU_GPU_CV)

Model 7W 15W 25W MAXN
SmolLM2-135M 1.99 4.27 5.74 6.51
SmolLM2-360M 2.27 4.98 6.76 7.42
Qwen2.5-0.5B 2.22 5.34 7.05 8.73
LFM2.5-350M 2.10 5.00 6.79 7.88
LFM2.5-1.2B 2.34 5.96 8.46 9.79
Qwen3-0.6B 1.98 5.02 6.89 8.19
Llama3.2-1B 2.26 6.04 8.56 10.54
Gemma3-1B 1.96 5.01 6.87 8.62

Formulae used - tok/s / tok/J. Bold = highest power draw per model.

3. Analysis

3.1 Higher tok/sec != efficient model (tok/J)

Tok/s (left half) and tok/J (right half) are intentionally both shown, a faster mode does not always mean a more efficient one.

  • MAXN beats 25W on raw tok/s for some models but loses on tok/J because its power increase outpaces the throughput gain for ctx = 2048, gen = 256.

output_tok_J = tok_s / VDD_CPU_GPU_CV (W), averaged over each aiperf run window.

Table 7: Canonical cell comparison (ctx=2048, gen=256)

Model 7W tok/s 15W tok/s 25W tok/s MAXN tok/s 7W tok/J 15W tok/J 25W tok/J MAXN tok/J Peak tok/J
SmolLM2-135M 53.9 114.5 165.1 159.4 21.7 21.7 22.6 20.0 25W
SmolLM2-360M 34.8 70.6 101.8 89.4 11.0 9.7 10.2 7.6 7W
Qwen2.5-0.5B 27.4 68.3 92.6 100.5 7.3 7.3 9.2 6.9 25W
LFM2.5-350M 31.5 79.2 115.1 112.9 11.8 11.8 13.7 11.7 25W
LFM2.5-1.2B 13.7 37.0 54.1 52.6 4.8 5.1 5.3 4.5 25W
Qwen3-0.6B 14.2 33.9 49.4 54.2 6.2 5.9 6.3 5.8 25W
Llama3.2-1B 12.1 32.3 47.0 51.9 4.5 4.5 4.7 4.2 25W
Gemma3-1B 11.2 28.1 40.8 44.2 4.9 4.9 5.1 4.5 25W

3.2 The 25W Sweet Spot

25W is unambiguously the best mode for output tok/J and tok/sec across every model. The reason is arithmetic:

  • Going from 15W → 25W: output tok/s rises 36–47 % (GPU clock 612 → 820 MHz), while power rises ~36 %. Net output tok/J gain: +3 to +26 % (range wider than tok/s because 25W also cuts TTFT, shrinking the RL denominator).
  • Going from 25W → MAXN: output tok/s changes −3 % to +8 % depending on model (decode is memory-bandwidth bound, not compute bound), while power rises ~17 %. Net output tok/J loss: −8 to −35 %.

The GPU clock ceiling at 15W (612 MHz) leaves significant decode throughput on the table. Raising it to 820 MHz at 25W captures most of the available throughput improvement with modest additional power. The final jump to 1020 MHz at MAXN costs disproportionate power for marginal gains.

Practical recommendation: Run at 25W for the best balance of speed and efficiency. Use MAXN only when minimising latency (TTFT) matters more than energy (e.g. interactive chat with long prompts).

3.3 Best use cases for each power mode

Table 8: Recommended power mode by use case

Use case Recommended mode
Always-on inference 25W: overall best low TTFT, output tok/J, tok/sec and latency, 45 % faster than 15W
Interactive chat, real-time response MAXN: among the highest prefill tok/sec, ~35 % faster prefill than 15W
Power-constrained / thermally limited 15W: 30-40 % less power draw than MAXN
Edge AI / Smartphone deployment 7W: all 8 models fit (reboot per run required); useful for efficiency research at minimum power

3.4 Throughput Speedup Summary

All figures are mean(p50) across the full prompt × gen sweep (12 combos per model); throughput uses mean(avg tok/s) since aiperf does not report a p50 for tok/s.

Table 9: Output throughput speedup ratios - all pairwise mode comparisons

Model 25W / 15W MAXN / 15W 15W / 7W 25W / 7W MAXN / 7W MAXN / 25W
SmolLM2-135M 1.43x 1.39x 2.19x 3.15x 3.06x 0.97x
SmolLM2-360M 1.44x 1.27x 2.06x 2.98x 2.62x 0.88x
Qwen2.5-0.5B 1.35x 1.47x 2.52x 3.41x 3.70x 1.08x
LFM2.5-350M 1.45x 1.43x 2.55x 3.69x 3.64x 0.99x
LFM2.5-1.2B 1.47x 1.42x 2.70x 3.96x 3.85x 0.97x
Qwen3-0.6B 1.46x 1.59x 2.46x 3.58x 3.90x 1.09x
Llama3.2-1B 1.46x 1.62x 2.70x 3.95x 4.37x 1.11x
Gemma3-1B 1.45x 1.58x 2.50x 3.63x 3.95x 1.09x
  • 25W delivers a consistent ~1.43-1.47x speedup vs 15W across all models.
  • 15W gives about 2.1-2.7x boost vs 7W and even about 1.2 times on top of it for 25W.
  • MAXN/25W < 1 for the smallest models (MAXN gains no throughput) but > 1 for larger models (compute-bound, benefit from clock ceiling). MAXN/7W reaches 4.37x for Llama3.2-1B - the largest speedup in the sweep.
  • Speedups involving MAXN are higher for models in the range of 0.5B - 1B parameters, where the GPU clock increase from 820 MHz to 1020 MHz has the most impact before memory bandwidth becomes the bottleneck.

3.5 Latency Characteristics

TTFT scales near-linearly with prompt across all modes. At ctx=128 a model like LFM2.5-350M prefills in ~80 ms (25W); at ctx=2048 that grows to ~820 ms. The 25W / MAXN modes reduce TTFT proportionally to their clock ratio vs 15W.

Inter-token latency (ITL) p50 is the median per-token decode cost. ITL heatmaps per power mode (all 8 models, all 12 prompt×gen combos) are in Appendix H.2 — see Figures H.2a–H.2d. At the canonical ctx=2048, gen=256:

7W
15W
25W
MAXN

Figure 10a: ITL p50 heatmaps — all 4 power modes (rows = gen length, cols = prompt length)

  • ITL depends on gen-length (64, 128, 256) and to some extent reflects the memory-bandwidth bound.
  • In our case the gen-lengths tested are not enough to cause differences across model × mode combinations beyond the general trend: models <1B have lower ITL than ~1B models, possibly because the KV-cache stays small enough to avoid refills.

Decode time (s) p50 is the time spent generating output tokens: decode_time = request_latency − TTFT. At ctx=2048, gen=256 (computed as RL_s − TTFT_s where RL_s = OSL / tok_s):

Table 10a: Decode time speedup ratios - all pairwise mode comparisons (ctx=2048, gen=256)

Model 25W vs 15W MAXN vs 15W 15W vs 7W 25W vs 7W MAXN vs 7W MAXN vs 25W
SmolLM2-135M 1.41x 1.36x 2.12x 2.97x 2.88x 0.97x
SmolLM2-360M 1.47x 1.29x 1.72x 2.52x 2.23x 0.88x
Qwen2.5-0.5B 1.33x 1.39x 2.85x 3.80x 3.98x 1.05x
LFM2.5-350M 1.45x 1.43x 2.55x 3.69x 3.65x 0.99x
LFM2.5-1.2B 1.47x 1.42x 2.70x 3.96x 3.85x 0.97x
Qwen3-0.6B 1.46x 1.59x 2.45x 3.57x 3.89x 1.09x
Llama3.2-1B 1.46x 1.62x 2.70x 3.95x 4.37x 1.11x
Gemma3-1B 1.45x 1.58x 2.50x 3.63x 3.95x 1.09x

Speedup = mean(decode_time_baseline) / mean(decode_time_mode) where decode_time = RL p50 − TTFT p50, averaged over all 12 prompt × gen combos. decode_J = avg_power_W × decode_time_s.

  • Decode speedups closely mirror throughput speedups (Table 9) since decode time ≈ 1 / tok_s once TTFT is subtracted.
  • MAXN vs 25W > 1.0 for 0.5B–1B models; < 1.0 for SmolLM2 (memory-bandwidth bound, extra clock gives no decode gain).

TTFT speedup - TTFT_baseline / TTFT_mode; values > 1.0 mean the mode has lower TTFT (faster prefill). Speedup = mean(TTFT p50 at baseline) / mean(TTFT p50 at mode), averaged over all 12 prompt × gen combos:

Table 11: TTFT speedup ratios - all pairwise mode comparisons

Model 25W vs 15W MAXN vs 15W 15W vs 7W 25W vs 7W MAXN vs 7W MAXN vs 25W
SmolLM2-135M 1.42x 1.37x 2.31x 3.28x 3.17x 0.97x
SmolLM2-360M 1.44x 1.37x 2.47x 3.55x 3.38x 0.95x
Qwen2.5-0.5B 1.37x 1.51x 2.63x 3.60x 3.96x 1.10x
LFM2.5-350M 1.44x 1.43x 2.62x 3.77x 3.75x 1.00x
LFM2.5-1.2B 1.46x 1.49x 2.79x 4.06x 4.17x 1.03x
Qwen3-0.6B 1.43x 1.57x 2.54x 3.64x 4.01x 1.10x
Llama3.2-1B 1.45x 1.61x 2.78x 4.03x 4.46x 1.11x
Gemma3-1B 1.44x 1.59x 2.63x 3.78x 4.19x 1.11x
  • MAXN has the highest speedup ratios across all modes, with the largest gains for the bigger models (Qwen3-0.6B, Llama3.2-1B, Gemma3-1B) where the GPU clock increase has more impact.
  • MAXN/25W ratios cluster near 1.0x (~0.95–1.11x). Prefill is compute-bound (parallel GEMMs over all input tokens), so a naive expectation would be that higher clocks help proportionally - but this was not the case. Why? maybe it becomes memory-bandwidth bound?(let me know in the comments!). For the two smallest models (SmolLM2) the prefill completes so quickly (<300 ms at 25W) that kernel-launch overhead dominates, making higher clocks irrelevant (0.95–0.97x).

Request latency (E2E) speedup - Speedup = mean(RL p50 at baseline) / mean(RL p50 at mode), averaged over all 12 prompt × gen combos:

Table 12: Request latency (E2E) speedup ratios - all pairwise mode comparisons

Model 25W vs 15W MAXN vs 15W 15W vs 7W 25W vs 7W MAXN vs 7W MAXN vs 25W
SmolLM2-135M 1.41x 1.36x 2.14x 3.02x 2.92x 0.97x
SmolLM2-360M 1.46x 1.31x 1.84x 2.69x 2.40x 0.89x
Qwen2.5-0.5B 1.34x 1.41x 2.81x 3.77x 3.97x 1.06x
LFM2.5-350M 1.45x 1.43x 2.56x 3.70x 3.66x 0.99x
LFM2.5-1.2B 1.46x 1.43x 2.72x 3.98x 3.90x 0.98x
Qwen3-0.6B 1.45x 1.59x 2.46x 3.58x 3.90x 1.09x
Llama3.2-1B 1.46x 1.62x 2.71x 3.96x 4.38x 1.11x
Gemma3-1B 1.45x 1.58x 2.52x 3.65x 3.98x 1.09x
  • Mirrors the TTFT speedup trends since prefill dominates the request latency at these gen lengths.

3.6 Model Size vs Efficiency

The relationship is clear: smaller quantized models always win on total tok/J, not just tok/s.

Table 13: Best total tok/J ranked by model size

Model Params GGUF Best total tok/J At mode / ctx / gen
SmolLM2-135M 135M 101 MB 487.3 25W / 2048 / 64
LFM2.5-350M 350M 219 MB 330.7 25W / 2048 / 64
SmolLM2-360M 360M 369 MB 262.3 25W / 2048 / 64
Qwen2.5-0.5B 500M 469 MB 237.7 25W / 2048 / 64
Qwen3-0.6B 600M 610 MB 149.0 25W / 2048 / 64
Gemma3-1B 1.0B 769 MB 118.5 25W / 2048 / 64
LFM2.5-1.2B 1.2B 698 MB 116.2 25W / 2048 / 64
Llama3.2-1B 1.0B 771 MB 108.9 25W / 2048 / 64

Total tok/J = (ISL + OSL) / (avg_power_W × RL_p50_s) — see Appendix J.6 for the full formula. Peaks at ctx=2048, gen=64 for every model because the long prompt dominates the numerator while 25W minimises energy per token. All 48 mode × ctx × gen combinations were searched.

SmolLM2-135M at 25W achieves 487 total tok/J, nearly 4.5× more efficient than Llama3.2-1B across the full request.


3.7 Energy Efficiency: Decode tok/J and Total tok/J

Two complementary tok/J lenses on energy efficiency — see J.6 for formulas:

  • Decode tok/J = OSL / decode_J — output tokens generated per joule of decode energy only (TTFT excluded). Measures how efficiently the GPU runs the autoregressive generation loop.
  • Total tok/J = (ISL + OSL) / total_J — all tokens processed per joule of the full request. Accounts for both prompt processing and generation; favours models that handle long prompts cheaply.

See Figure 7b (decode tok/J vs prompt length) and Figure 7c (total tok/J vs prompt length) in section 2.2 — 25W leads at every model and prompt length. Full combinations: F.2 Decode · F.3 Total.

Key findings:

  1. 25W wins on both metrics for almost every model. The exception is SmolLM2-360M, where 7W edges ahead on both decode and total tok/J — decode is memory-bandwidth bound for this model and the lower clock still delivers competitive throughput at much lower power.

  2. ~1B around models tops at ~5-8 tok/J (decode) whereas the <1B models can reach 15-35 tok/J. Thus these are more energy efficient (decode) than ~1B models we have tested.

  3. Charts in F.2 show that the ~1B models are roughly flat, that is, prompt length becomes independent of tok/J in decode tok/J as going from 64 to 256 gen length.

  4. Total tok/J grows with prompt length because ISL dominates (ISL+OSL) as ctx increases while total_J grows more slowly (decode time is constant), see F.3.

Figure 15: Total energy per request vs output length at 25W, ctx=2048

Total energy vs output length at 25W

Figure 16: Decode energy per output token in mJ (ctx=2048, gen=256)

mJ per output token by mode

5. Conclusion

What These Numbers Mean for Edge Inference

Tiny LLM inference on a $250 Jetson Orin Nano Super 8GB is genuinely practical. At SmolLM2-135M Q4_K_M:

  • 187 tok/s at 25W : real-time fluent generation
  • 101 MB on disk : trivially portable
  • 5.4 W under load : runs on a USB-C power bank
  • 22.6 output tok/J : the best energy efficiency in this suite

The LFM2.5 models (Liquid AI) are a notable new entrant: LFM2.5-350M achieves 120 tok/s at 25W (competitive with SmolLM2-360M) in 219 MB. LFM2.5-1.2B at 25W (ctx=2048, gen=256):

  • Throughput: 54.1 tok/s — 13 % faster than Llama3.2-1B (47.0) and 33 % faster than Gemma3-1B (40.8)
  • Output tok/J: 5.26 vs Llama 4.67 (+13 %) vs Gemma 5.14 (+2 %) — clear lead over Llama, narrow over Gemma
  • Total tok/J: Gemma edges ahead here (118.5 vs 116.2 vs Llama 108.9) — its lower power draw (6.87 W vs 8.46 W) compensates for slower decode
  • Disk footprint: 698 MB vs 771 MB (Llama) / 769 MB (Gemma) — smallest in the class, making it the best tok/s-per-byte overall

The 25W Mode:

25W is the Pareto-optimal power mode for edge LLM inference on the Jetson Orin Nano Super. It is the right answer for virtually every deployment:

  • 43 % more throughput than 15W; within −3 % to +8 % of MAXN (MAXN gains only marginally on larger models; 25W wins or ties on sub-500M models)
  • Only 36 % more power than 15W; ~17 % less power than MAXN
  • 3–26 % better output tok/J than 15W; 8–35 % better output tok/J than MAXN
  • Low enough peak power (≤ 10 W for sub-1B models) to stay comfortable for sustained operation

NOTE: CMA fragmentation caveat

  • After three sequential model loads in the same OS session, the CUDA IOVA address space accumulates fragmentation that blocks cudaMalloc calls requiring > 300 MB contiguous buffers. Qwen3-0.6B, Llama3.2-1B, Gemma3-1B, and Gemma3-4B all hit NvMapMemAllocInternalTagged: error 12 (ENOMEM) when loaded after other models without a reboot. A reboot + --resume run recovered all three smaller models (Gemma3-4B is OOM at every mode regardless). All 8 non-gemma3-4b models produced valid 7W data after this workaround; the full 96-cell 7W dataset is now complete.

Appendix

A. Full 4-Mode Comparison (ctx=2048, gen=256)

Raw numbers from the canonical benchmark cell. All latencies in milliseconds. Power = VDD_CPU_GPU_CV averaged over each run window.

Table 15: Full 4-mode comparison, ctx=2048, gen=256

Model Mode Output Tok/s TTFT p50 (ms) ITL p50 (ms) Power (W) Output Tok/J
SmolLM2-135M 7W 53.9 1044.7 18.55 1.99 21.72
SmolLM2-135M 15W 114.5 442.5 8.74 4.27 21.67
SmolLM2-135M 25W 165.1 308.7 6.06 5.74 22.57
SmolLM2-135M MAXN 159.4 319.5 6.28 6.51 19.95
SmolLM2-360M 7W 34.8 1820.6 28.74 2.27 10.97
SmolLM2-360M 15W 70.6 725.4 14.16 4.98 9.74
SmolLM2-360M 25W 101.8 502.5 9.82 6.76 10.21
SmolLM2-360M MAXN 89.4 528.3 11.18 7.42 7.56
Qwen2.5-0.5B 7W 27.4 1956.3 36.48 2.22 7.26
Qwen2.5-0.5B 15W 68.3 735.0 14.64 5.34 7.26
Qwen2.5-0.5B 25W 92.6 530.9 10.80 7.05 9.16
Qwen2.5-0.5B MAXN 100.5 484.8 9.95 8.73 6.94
LFM2.5-350M 7W 31.5 1509.2 31.79 2.10 11.83
LFM2.5-350M 15W 79.2 568.3 12.63 5.00 11.78
LFM2.5-350M 25W 115.1 393.7 8.69 6.79 13.74
LFM2.5-350M MAXN 112.9 396.0 8.86 7.88 11.72
LFM2.5-1.2B 7W 13.7 4227.6 72.98 2.34 4.79
LFM2.5-1.2B 15W 37.0 1510.0 27.06 5.96 5.10
LFM2.5-1.2B 25W 54.1 1033.7 18.49 8.46 5.26
LFM2.5-1.2B MAXN 52.6 1008.0 19.00 9.79 4.47
Qwen3-0.6B 7W 14.2 2875.1 70.62 1.98 6.19
Qwen3-0.6B 15W 33.9 1113.4 29.52 5.02 5.90
Qwen3-0.6B 25W 49.4 771.0 20.25 6.89 6.26
Qwen3-0.6B MAXN 54.2 700.3 18.45 8.19 5.78
Llama3.2-1B 7W 12.1 4000.2 82.88 2.26 4.51
Llama3.2-1B 15W 32.3 1432.1 31.00 6.04 4.54
Llama3.2-1B 25W 47.0 982.7 21.27 8.56 4.67
Llama3.2-1B MAXN 51.9 890.5 19.27 10.54 4.19
Gemma3-1B 7W 11.2 3817.6 89.08 1.96 4.92
Gemma3-1B 15W 28.1 1442.3 35.57 5.01 4.85
Gemma3-1B 25W 40.8 1000.1 24.51 6.87 5.14
Gemma3-1B MAXN 44.2 900.2 22.60 8.62 4.46
Gemma3-4B all OOM: too large for 8 GB unified memory at any power mode

B. Thermal Summary - All Power Modes

Power and temperature averaged over each model’s full benchmark window (all 12 prompt×gen combos). No model triggered thermal throttling at any power mode (threshold ≈ 95 °C).

Junction temperature (TJ) is the hottest internal die temperature on the Jetson SoC, reported by tegrastats as tj@. It is the primary metric for thermal safety: if TJ reaches ~95 °C, the hardware automatically throttles clocks to prevent damage. Peak TJ < 70 °C across all runs means thermal headroom is ample.

Table 16: Thermal summary - all power modes

Model Mode Avg Power (W) Avg CPU (°C) Avg GPU (°C) Peak TJ (°C) Throttled
SmolLM2-135M 7W 1.95 47.2 48.9 50.3 No
SmolLM2-135M 15W 4.17 56.0 57.8 60.2 No
SmolLM2-135M 25W 5.60 49.2 51.6 54.3 No
SmolLM2-135M MAXN 6.39 49.0 51.4 53.2 No
SmolLM2-360M 7W 2.23 49.2 50.9 52.1 No
SmolLM2-360M 15W 4.89 59.1 60.9 63.3 No
SmolLM2-360M 25W 6.67 52.7 55.2 58.6 No
SmolLM2-360M MAXN 7.28 50.9 53.5 56.8 No
Qwen2.5-0.5B 7W 2.19 49.2 51.0 52.2 No
Qwen2.5-0.5B 15W 5.24 55.7 57.7 59.6 No
Qwen2.5-0.5B 25W 6.95 53.1 55.7 59.1 No
Qwen2.5-0.5B MAXN 8.57 56.6 59.1 62.8 No
LFM2.5-350M 7W 2.09 50.1 51.7 53.0 No
LFM2.5-350M 15W 4.93 58.0 60.0 62.1 No
LFM2.5-350M 25W 6.72 52.9 55.5 58.1 No
LFM2.5-350M MAXN 7.78 50.5 53.4 56.8 No
LFM2.5-1.2B 7W 2.35 51.3 52.9 54.0 No
LFM2.5-1.2B 15W 6.01 60.7 62.9 65.5 No
LFM2.5-1.2B 25W 8.42 57.4 60.2 63.0 No
LFM2.5-1.2B MAXN 9.68 56.7 59.7 63.5 No
Qwen3-0.6B 7W 2.00 44.2 46.0 47.5 No
Qwen3-0.6B 15W 5.00 61.6 63.6 65.4 No
Qwen3-0.6B 25W 6.83 57.4 59.9 63.1 No
Qwen3-0.6B MAXN 8.32 57.2 59.9 64.0 No
Llama3.2-1B 7W 2.28 44.6 46.5 47.6 No
Llama3.2-1B 15W 6.04 61.9 64.1 65.7 No
Llama3.2-1B 25W 8.52 60.3 63.2 66.1 No
Llama3.2-1B MAXN 10.55 59.9 63.0 69.5 No
Gemma3-1B 7W 1.98 45.1 46.9 50.5 No
Gemma3-1B 15W 4.99 60.2 62.1 63.6 No
Gemma3-1B 25W 6.84 57.5 60.0 61.9 No
Gemma3-1B MAXN 8.51 61.2 63.8 67.0 No

E. Full 12-Combination Heatmaps (All Power Modes)

Each heatmap is a 2×4 grid (8 models) showing all 12 prompt×gen combinations for one power mode and one metric. Rows = gen length (64, 128, 256 tok), columns = prompt length (128, 512, 1024, 2048 tok). Brighter colour = higher value.

E.1 Output Tok/s heatmaps

Figure E.1a: All 12 combos at 7W

Tok/s heatmap 7W

Figure E.1b: All 12 combos at 15W

Tok/s heatmap 15W

Figure E.1c: All 12 combos at 25W

Tok/s heatmap 25W

Figure E.1d: All 12 combos at MAXN

Tok/s heatmap MAXN

E.2 Output Tok/J heatmaps

Figure E.2a: All 12 combos at 7W

Tok/J heatmap 7W

Figure E.2b: All 12 combos at 15W

Tok/J heatmap 15W

Figure E.2c: All 12 combos at 25W

Tok/J heatmap 25W

Figure E.2d: All 12 combos at MAXN

Tok/J heatmap MAXN

F. Prefill / Decode / Total tok/J: All Combinations

All charts are 2×4 faceted line plots with a fixed y-scale across all subplots. The canonical combination (ctx=2048, gen=256) is also shown in §2.2.

F.1 Prefill tok/J (input tok / J) vs prompt length

Figure F.1a: Prefill tok/J vs prompt length: gen=64

Prefill tok/J vs prompt gen=64

Prefill tok/J is approximate for 63 % of cells.

Figure F.1b: Prefill tok/J vs prompt length: gen=128

Prefill tok/J vs prompt gen=128

Prefill tok/J is approximate for 63 % of cells.

Figure F.1c: Prefill tok/J vs prompt length: gen=256 (canonical, also in § 2.2)

Prefill tok/J vs prompt gen=256

Prefill tok/J is approximate for 63 % of cells.

F.2 Decode tok/J (output tok / J) - independent of prompt length

Decode tok/J depends on the number of output tokens (gen length), not input prompt length, since decode happens after prefill completes. These charts show decode tok/J as a function of gen length for each prompt context length.

Figure F.2a: Decode tok/J vs gen length: ctx=128

Decode tok/J vs gen ctx=128

Figure F.2b: Decode tok/J vs gen length: ctx=512

Decode tok/J vs gen ctx=512

Figure F.2c: Decode tok/J vs gen length: ctx=1024

Decode tok/J vs gen ctx=1024

Figure F.2d: Decode tok/J vs gen length: ctx=2048

Decode tok/J vs gen ctx=2048

F.3 Total tok/J ((input+output) tok / J) vs prompt length

Figure F.3a: Total tok/J vs prompt length: gen=64

Total tok/J vs prompt gen=64

Figure F.3b: Total tok/J vs prompt length: gen=128

Total tok/J vs prompt gen=128

Figure F.3c: Total tok/J vs prompt length: gen=256 (canonical, also in § 2.2)

Total tok/J vs prompt gen=256

G. Request Latency (E2E): All Combinations

Request latency (E2E) p50 - total time from request start to last token received. Line charts show variation with prompt length (2×4 facet, fixed y-scale). Grouped bar charts show per-model × per-mode breakdown.

G.1 Request latency vs prompt length (by gen length)

Figure G.1a: Request latency vs prompt length: gen=64

Request latency vs prompt gen=64

Figure G.1b: Request latency vs prompt length: gen=128

Request latency vs prompt gen=128

Figure G.1c: Request latency vs prompt length: gen=256 (canonical, also in §2.3)

Request latency vs prompt gen=256

G. TTFT: All Prompt x Gen Combinations

TTFT p50 (median time to first token, ms) is driven almost entirely by prompt length, it is the prefill cost. These charts show how it varies across all 12 prompt x gen combinations and across all 4 power modes.

G.1 TTFT vs prompt length (by gen length)

Figure G.1a: TTFT vs prompt length: gen=64

TTFT vs prompt gen=64

Figure G.1b: TTFT vs prompt length: gen=256 (canonical, also in section 2.3)

TTFT vs prompt gen=256

TTFT is independent of gen length, so only gen=64 and gen=256 are shown.


G.2 TTFT heatmaps (gen x prompt) per power mode

Each cell is TTFT in ms. Rows = gen length, columns = prompt length. Independent of gen length hence the same across rows.

Figure G.2a: TTFT heatmap: 7W
Figure G.2b: TTFT heatmap: 15W
Figure G.2c: TTFT heatmap: 25W
Figure G.2d: TTFT heatmap: MAXN

H. ITL: All Combinations

Inter-token latency (ms) = time between consecutive output tokens. It measures decode cost and is driven by model size and GPU clock, not prompt length.

H.1 ITL vs prompt length (by gen length)

Figure H.1a: ITL vs prompt length: gen=64

ITL vs prompt gen=64

Figure H.1b: ITL vs prompt length: gen=128

ITL vs prompt gen=128

Figure H.1c: ITL vs prompt length: gen=256 (canonical, also in section 2.3)

ITL vs prompt gen=256


H.2 ITL heatmaps (gen x prompt) per power mode

Figure H.2a: ITL heatmap: 7W
Figure H.2b: ITL heatmap: 15W
Figure H.2c: ITL heatmap: 25W
Figure H.2d: ITL heatmap: MAXN

I. Prefill Throughput: All Combinations

Prefill throughput (tok/s) measures how fast the model processes input tokens. It scales with prompt length (longer prompts hit peak GPU utilisation) and GPU clock speed.

I.1 Prefill throughput vs prompt length (by gen length)

Figure I.1a: Prefill throughput vs prompt length: gen=64

Prefill tput vs prompt gen=64

Figure I.1b: Prefill throughput vs prompt length: gen=256 (canonical, also in section 2.4)

Prefill tput vs prompt gen=256

Prefill throughput is independent of gen length, so only gen=64 and gen=256 are shown.

I.2 Prefill throughput heatmaps (gen x prompt) per power mode

Figure I.2a: Prefill throughput heatmap: 7W
Figure I.2b: Prefill throughput heatmap: 15W
Figure I.2c: Prefill throughput heatmap: 25W
Figure I.2d: Prefill throughput heatmap: MAXN

J. All Metrics, Formulas, and Calculation Methods

This appendix documents every metric reported in this benchmark, its formula, its source, and any caveats.

J.1 Raw inputs from aiperf and tegrastats

Symbol Source Definition
ISL aiperf JSON input_sequence_length.avg Actual input tokens processed per request (may differ from target due to tokenizer rounding)
OSL aiperf JSON output_sequence_length.avg Actual output tokens generated per request
TTFT aiperf JSON time_to_first_token.p50 (ms) Median time from request sent to first output token received; proxy for prefill duration. p50 used (not avg) to avoid skew from occasional slow requests
ITL aiperf JSON inter_token_latency.p50 (ms) Median time between consecutive output tokens; per-token decode cost. p50 used for robustness against outliers
RL aiperf JSON request_latency.p50 (ms) Median total wall time per request: TTFT + all inter-token intervals. p50 used for energy calculations
tok_s aiperf JSON output_token_throughput_per_user.avg Output tokens per second, single-user (OSL / RL in steady state)
prefill_tput aiperf JSON prefill_throughput_per_user.avg Input tokens processed per second during prefill phase
t0, t1 aiperf JSON start_time, end_time (ISO 8601) Wall-clock start and end of the full 20-request profiling run
mW_i tegrastats VDD_CPU_GPU_CV field (mW) Instantaneous power on the CPU+GPU+CV rail at sample i

All aiperf metrics are averages over 20 requests per combo. Percentile variants (p50, p90, p99) are also available in the raw JSON but not reproduced here.


J.2 Power

avg_power_W = mean(mW_i for all tegrastats samples where t0 <= sample_time <= t1) / 1000
  • VDD_CPU_GPU_CV covers the CPU, GPU, and Computer Vision engine rail
  • Does NOT include board overhead (fan, storage, USB) which is on VDD_IN
  • VDD_IN is ~1.5-3 W higher than VDD_CPU_GPU_CV during inference
  • Tegrastats interval: 500 ms

J.3 Output tok/J (main efficiency metric)

output_tok_J = OSL / (avg_power_W * RL_p50_s)

Where RL_s = RL / 1000 (request latency in seconds).

Higher is better. This measures how many output tokens are generated per joule of compute energy. It is the primary metric of the benchmark.

Not affected by the prefill/decode split approximation (see section J.7).


J.4 Request latency energy

total_J = avg_power_W * (RL / 1000)

Energy consumed by one average request from first byte sent to last token received. Accurate for all cells regardless of TTFT.


J.5 Prefill and decode energy

prefill_J  = avg_power_W * (TTFT / 1000)
decode_J   = avg_power_W * ((RL - TTFT) / 1000)
           = total_J - prefill_J

prefill_%  = prefill_J / total_J * 100

CAUTION: See energy measurement caveat.


J.6 Phase tok/J metrics

prefill_tok_J = ISL / prefill_J
              = ISL / (avg_power_W * TTFT_s)

decode_tok_J  = OSL / decode_J
              = OSL / (avg_power_W * (RL_s - TTFT_s))

total_tok_J   = (ISL + OSL) / total_J
              = (ISL + OSL) / (avg_power_W * RL_s)

Where TTFT_s = TTFT / 1000, RL_s = RL / 1000.

  • prefill_tok_J: input tokens processed per joule of prefill energy. Affected by the approximation in J.5.
  • decode_tok_J: output tokens generated per joule of decode energy. Reasonably accurate.
  • total_tok_J: all tokens (in + out) per joule of total request energy. Accurate.

J.7 mJ per output token

mJ_per_output_tok = (decode_J / OSL) * 1000
                  = 1000 / decode_tok_J

Millijoules per generated output token (decode_J is in joules, ×1000 converts to mJ for readability). Carries the same caveat as J.5 for cells where TTFT < 500 ms.


J.8 Prefill throughput

prefill_tput (tok/s) = aiperf JSON prefill_throughput_per_user.avg

Directly from aiperf. Measures how fast input tokens are processed during the prefill phase. Scales with prompt length (longer prompts hit peak GPU utilisation) and GPU clock.


J.9 Throughput speedup ratios (Table 9)

speedup_25W_vs_15W  = mean(tok_s_25W  over all 12 combos) / mean(tok_s_15W  over all 12 combos)
speedup_MAXN_vs_15W = mean(tok_s_MAXN over all 12 combos) / mean(tok_s_15W  over all 12 combos)
speedup_15W_vs_7W   = mean(tok_s_15W  over all 12 combos) / mean(tok_s_7W   over all 12 combos)

Averages are over all 4 prompt lengths × 3 gen lengths = 12 combos. tok_s = output_token_throughput_per_user.avg (aiperf); no p50 is available for throughput. Latency speedup ratios (Tables 10a, 11, 12) use mean of p50 values instead.


J.10 Best total tok/J per model (Table 13)

best_total_tok_J(model) = max(total_tok_J(mode, model, gen, ctx))
                          over all modes in {7W, 15W, 25W, MAXN}
                          and all gen in {64, 128, 256}
                          and all ctx in {128, 512, 1024, 2048}

total_tok_J = (ISL + OSL) / (avg_power_W * RL_p50_s)

The single highest total tok/J value observed for that model across all 48 combinations. Peaks at ctx=2048, gen=64 for every model because the long prompt dominates the (ISL + OSL) numerator.


J.11 TTFT, ITL, RL percentiles

All percentile variants come directly from aiperf JSON without further computation:

TTFT       = time_to_first_token.p50   (canonical; p50 used everywhere)
TTFT_p90   = time_to_first_token.p90
TTFT_p99   = time_to_first_token.p99
ITL        = inter_token_latency.p50    (canonical; p50 used everywhere)
ITL_p99    = inter_token_latency.p99
RL         = request_latency.p50        (canonical; p50 used everywhere)
RL_p99     = request_latency.p99

J.12 Energy caveat: which metrics are accurate vs approximate

Metric Accurate? Condition
output_tok_J Always No phase split needed
total_J Always Full window power * RL
decode_J Mostly avg_power approx decode power since decode dominates window
decode_tok_J Mostly Same as above
total_tok_J Always Uses total_J which is accurate
prefill_J TTFT >= 500 ms only (37 % of cells) Needs tegrastats sample in prefill window
prefill_tok_J TTFT >= 500 ms only (37 % of cells) Derived from prefill_J
prefill_% TTFT >= 500 ms only (37 % of cells) Derived from prefill_J
mJ_per_output_tok Mostly Derived from decode_J

J.13 Power and temperature

avg_power_W = mean(tegrastats.VDD_CPU_GPU_CV[mW] / 1000
              for all samples where aiperf_t0 <= sample_time <= aiperf_t1)

Power is the mean VDD_CPU_GPU_CV (CPU+GPU+CV rail) from tegrastats sampled at 500 ms intervals, averaged over each model’s active inference windows only (idle/cool-down between models excluded).

Junction temperature (TJ) is the hottest internal die temperature on the Jetson SoC, reported by tegrastats as tj@. The hardware automatically throttles GPU/CPU clocks when TJ reaches ~95 °C to prevent damage. Peak TJ < 70 °C across all runs confirms ample thermal headroom at every power mode.

Symbol Source Definition
VDD_CPU_GPU_CV tegrastats Instantaneous power (mW) on the CPU+GPU+CV rail
cpu@ tegrastats CPU cluster temperature (°C)
gpu@ tegrastats GPU temperature (°C)
tj@ tegrastats Junction (hottest internal die) temperature (°C)
avg_power_W computed Mean VDD_CPU_GPU_CV over active inference window (W)
avg_cpu_C computed Mean CPU temp over active inference window
avg_gpu_C computed Mean GPU temp over active inference window
peak_tj_C computed Maximum TJ temperature observed

Throttling is flagged when peak_tj_C > 85 °C (leaving a 10 °C safety margin below the hardware limit).

Tags: , , , , , , ,