ML System Bottleneck Analyzer

Pick a model and hardware. See the decode rate and what's bottlenecking it.
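Under the hood, this kind of estimate is usually a roofline comparison: decode is bandwidth-bound when the bytes moved per token (weights plus KV cache) dominate, and compute-bound otherwise. A minimal sketch, assuming single-stream decode; every name and number below is illustrative, not the analyzer's actual internals.

// Roofline-style decode estimate (illustrative only).
interface Hardware {
  flops: number;  // peak FLOP/s at the chosen precision
  memBw: number;  // memory bandwidth in bytes/s
}

interface ModelCfg {
  paramBytes: number;      // total weight bytes at the chosen quantization
  flopsPerToken: number;   // roughly 2 * active parameters per decoded token
  kvBytesPerToken: number; // KV cache bytes read per decode step at current context
}

// Single-stream decode rate: the lower of the compute and bandwidth ceilings
// wins, and whichever ceiling binds is the reported bottleneck.
function decodeRate(hw: Hardware, m: ModelCfg): { tokPerSec: number; bottleneck: string } {
  const computeCeiling = hw.flops / m.flopsPerToken;
  const bandwidthCeiling = hw.memBw / (m.paramBytes + m.kvBytesPerToken);
  return computeCeiling < bandwidthCeiling
    ? { tokPerSec: computeCeiling, bottleneck: "Compute" }
    : { tokPerSec: bandwidthCeiling, bottleneck: "Local BW" };
}

// Hypothetical 7B model at 4-bit on a 100 TFLOP/s, 1 TB/s device:
console.log(decodeRate(
  { flops: 100e12, memBw: 1e12 },
  { paramBytes: 3.5e9, flopsPerToken: 14e9, kvBytesPerToken: 0.2e9 },
)); // ≈ 270 tok/s, bandwidth-bound, as batch-1 decode usually is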

Model

Tuning & advanced options

Defaults work for most setups. Change these only if you need specific behavior.

Workload
Prompt tokens: processed before generation.
Generated tokens: decoded autoregressively.
Memory uses the combined context window; timing is split into prompt processing and decode.
Distribution & optimization: applies to the KV cache only; most useful for long contexts or memory overflow. A sketch of the memory and timing math follows below.
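A minimal sketch of that memory and timing math, assuming a standard transformer KV cache; the function names, the fp16 element size, and the example dimensions are all illustrative, not the analyzer's internals.

// KV cache size for the combined context window (prompt + generated tokens).
// Standard transformer formula: 2 (K and V) * layers * kvHeads * headDim
// * bytes per element * tokens.
function kvCacheBytes(
  layers: number, kvHeads: number, headDim: number,
  bytesPerElem: number, contextTokens: number,
): number {
  return 2 * layers * kvHeads * headDim * bytesPerElem * contextTokens;
}

// Timing split: prompt processing handles all prompt tokens in parallel at a
// high prefill rate; decode then produces generated tokens one step at a time.
function totalSeconds(
  promptTokens: number, genTokens: number,
  prefillTokPerSec: number, decodeTokPerSec: number,
): number {
  return promptTokens / prefillTokPerSec + genTokens / decodeTokPerSec;
}

// A hypothetical 32-layer model with 8 KV heads of dim 128 at fp16 (2 bytes),
// holding an 8k-token combined context:
console.log(kvCacheBytes(32, 8, 128, 2, 8192) / 1e9); // ≈ 1.07 GB
console.log(totalSeconds(7000, 1192, 5000, 50));      // ≈ 25.2 s total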
Runtime
Scenarios
Cost assumptions (power pricing)
Model internals (override preset)

Hardware

Topology: how the selected hardware connects.
Status: OK (≤ 85%) · Strained (85–100%) · Overloaded (> 100% or overflow)
Bar segments: M = Memory · C = Compute · L = Local BW · N = Network BW. Click a node to edit it.
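The status bands map directly onto a utilization fraction; a trivial sketch with the thresholds copied from the legend above (the function name is hypothetical):

// Classify a resource from its utilization fraction (1.0 = 100%).
// `overflow` covers memory that does not fit at all.
function status(utilization: number, overflow = false): "OK" | "Strained" | "Overloaded" {
  if (overflow || utilization > 1.0) return "Overloaded";
  if (utilization > 0.85) return "Strained";
  return "OK";
}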

Results (approximate)

Resource utilization

Published benchmarks for reference

Real-world throughput plus official model-card task scores from vendor, community, and research sources. Throughput rows are used for configuration alignment; task-score rows are for comparison only (a calibration sketch follows the header row below).

Model | Quantization / Mode | Runtime / Benchmark | Hardware / Task | Batch / Eval | Seq / Setting | Batch Rate / Score | Single Rate / Score | Source
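The alignment method itself is not spelled out in this section; one plausible reading, as a hedged sketch, is a simple efficiency factor that scales the analytical estimate by the ratio of a published measured rate to the prediction for the same configuration. Everything below is an assumption, not the tool's documented logic.

// ASSUMPTION: a measured/predicted efficiency factor stands in for whatever
// the analyzer actually does with throughput rows.
function efficiencyFactor(publishedTokPerSec: number, predictedTokPerSec: number): number {
  return publishedTokPerSec / predictedTokPerSec;
}

// If the roofline predicts 300 tok/s and a published row measured 210 tok/s on
// the same configuration, estimates for nearby configurations get scaled by 0.7:
const eff = efficiencyFactor(210, 300); // 0.7
const calibrated = eff * 450;           // ≈ 315 tok/s for a related config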