ML System Bottleneck Analyzer

Pick a model and hardware. See the decode rate and what's bottlenecking it.
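The decode rate the tool reports can be approximated by hand. A minimal sketch, assuming autoregressive decode is memory-bandwidth bound (each generated token streams all model weights, plus the KV cache, from memory once); the parameter count, precision, and bandwidth below are illustrative, not the tool's presets:

```python
def decode_tokens_per_s(param_count, bytes_per_param, mem_bw_gbs, kv_cache_gb=0.0):
    # Bytes read from memory per generated token: all weights once,
    # plus the current KV cache.
    bytes_per_token = param_count * bytes_per_param + kv_cache_gb * 1e9
    return mem_bw_gbs * 1e9 / bytes_per_token

# Example: a 7B-parameter model in 16-bit weights on ~1000 GB/s of
# memory bandwidth (hypothetical figures).
rate = decode_tokens_per_s(7e9, 2, 1000)
```

When the measured rate falls well below this ceiling, something other than memory bandwidth (compute, interconnect, batching) is the bottleneck.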

Model

Tuning & advanced options

Defaults work for most setups. Change these only if you need specific behavior.

Prompt tokens processed before generation.
Generated tokens decoded autoregressively.
Memory is sized from the combined context window (prompt plus generated tokens); timing is split into prompt processing and decode. This setting affects the KV cache only, so it matters most for long contexts or when memory overflows.
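The KV-cache portion of memory can be estimated directly from the model internals. A sketch under the usual formula (2 tensors, K and V, per layer); the Llama-2-7B-like shapes in the example are assumptions, not a preset read from the tool:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2 accounts for the separate K and V tensors cached per layer.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# Example: 32 layers, 32 KV heads, head dim 128, 4096-token combined
# context window, fp16 cache elements.
gb = kv_cache_bytes(32, 32, 128, 4096) / 1e9
```

Because the size scales linearly with context length, doubling the window doubles this figure while leaving weight memory untouched.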
Cost assumptions (power pricing)
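Under a flat power-price assumption, generation cost follows from device draw and throughput. A sketch with illustrative numbers (the wattage, rate, and price are not the tool's defaults):

```python
def cost_per_million_tokens(power_watts, tokens_per_s, price_per_kwh):
    seconds = 1e6 / tokens_per_s            # time to generate 1M tokens
    kwh = power_watts * seconds / 3600 / 1000  # watt-seconds -> kWh
    return kwh * price_per_kwh

# Example: a 350 W device decoding at 50 tok/s, electricity at $0.15/kWh.
usd = cost_per_million_tokens(350, 50, 0.15)
```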
Model internals (override preset)

Hardware

Select one device to edit. The full system appears in the topology below.

Topology: how the selected hardware connects
PCIe 5.0 (32-64 GB/s) · PCIe 4.0 (8-32 GB/s) · Slow (<8 GB/s). The selected device is highlighted.
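The legend's bandwidth tiers translate directly into transfer times, which is how a slow link shows up as a bottleneck. A sketch with illustrative payload sizes (the 2 GB figure and link speeds are assumptions):

```python
def link_transfer_ms(payload_gb, link_gbs):
    # Ideal transfer time over one link, ignoring protocol overhead.
    return payload_gb / link_gbs * 1000

# Example: moving 2 GB between devices over PCIe 4.0 x16 (~32 GB/s)
# versus a slow 4 GB/s link.
fast = link_transfer_ms(2, 32)
slow = link_transfer_ms(2, 4)
```

An order-of-magnitude gap like this is why the topology view flags sub-8 GB/s links separately.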

Results (approximate)

Resource utilization

Published benchmarks for reference

Real-world throughput plus official model-card task scores from vendor, community, and research sources. Throughput rows are used for configuration alignment; task-score rows are for comparison only.

Columns: Model | Quantization / Mode | Runtime / Benchmark | Hardware / Task | Batch / Eval | Seq / Setting | Batch Rate / Score | Single Rate / Score | Source