Pick a model and hardware. See the decode rate and what's bottlenecking it.
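In the single-stream case, the decode rate is usually capped by memory bandwidth rather than compute: every generated token has to stream the full weight set (plus the current KV cache) out of memory. A minimal sketch of that ceiling, with all figures assumed for illustration (8B parameters at Q4 on a card with roughly 1 TB/s of memory bandwidth):

```ts
// Rough memory-bandwidth ceiling on single-stream decode (illustrative numbers, not measured).
const weightBytes = 8e9 * 0.5;   // ~8B params at Q4 ≈ 0.5 bytes per parameter (assumption)
const kvReadBytes = 0.54e9;      // assumed KV cache read per token at ~4k context
const memBandwidth = 1e12;       // ~1 TB/s-class GPU (assumption)

// Each decoded token streams the weights plus the KV cache once,
// so bandwidth divided by bytes moved per token bounds tokens per second.
const decodeCeiling = memBandwidth / (weightBytes + kvReadBytes);
console.log(`~${decodeCeiling.toFixed(0)} tok/s upper bound from memory bandwidth`);
```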
Model
Llama 3 8B @ Q4
Tuning & advanced options
Defaults work for most setups. Change these only if you need specific behavior.
Workload
Prompt tokens processed before generation.
Generated tokens decoded autoregressively.
Memory uses the combined context window; timing is split into prompt processing and decode.
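A simplified sketch of that split, with the rates assumed purely for illustration: prefill is treated as one batched pass over the prompt, and decode as a fixed per-token cost.

```ts
// Hypothetical workload: all rates below are assumptions, not measurements.
const promptTokens = 2048;      // prompt tokens processed before generation
const generatedTokens = 512;    // tokens decoded autoregressively

const prefillTokPerSec = 5000;  // assumed batched prompt-processing rate
const decodeTokPerSec = 40;     // assumed per-token decode rate

const prefillSec = promptTokens / prefillTokPerSec;
const decodeSec = generatedTokens / decodeTokPerSec;

// Memory sizing uses the combined context window (prompt + generated tokens).
const contextWindow = promptTokens + generatedTokens;

console.log({ prefillSec, decodeSec, totalSec: prefillSec + decodeSec, contextWindow });
```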
Distribution & optimization
Applies to KV cache only; most useful for long contexts or memory overflow.
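The KV cache is the component that grows linearly with context length, which is why this option targets it alone. A back-of-the-envelope estimate, assuming Llama-3-8B-like shapes (32 layers, 8 KV heads, head dimension 128) and an FP16 cache:

```ts
// KV cache size = 2 (K and V) * layers * kvHeads * headDim * contextLen * bytes per element.
// Shapes are an assumption modeled on Llama 3 8B with grouped-query attention.
function kvCacheBytes(contextLen: number, bytesPerElement = 2): number {
  const layers = 32, kvHeads = 8, headDim = 128;
  return 2 * layers * kvHeads * headDim * contextLen * bytesPerElement;
}

console.log(kvCacheBytes(8_192) / 1e9);   // ≈ 1.1 GB at an 8k context
console.log(kvCacheBytes(131_072) / 1e9); // ≈ 17 GB at a 128k context, which overflows many single cards once weights are added
```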
Runtime
Scenarios
Cost assumptions (power pricing)
Model internals (override preset)
Hardware
Topology
How the selected hardware connects.
Status
OK (≤ 85%) · Strained (85-100%) · Overloaded (> 100% or overflow)
Bar segments: M=Memory · C=Compute · L=Local BW · N=Network BW
Click a node to edit.
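The status buckets are a direct mapping from the worst utilization figure; a minimal sketch of that mapping, assuming utilization is expressed as a percentage and memory overflow is flagged separately:

```ts
type NodeStatus = "OK" | "Strained" | "Overloaded";

// Map a node's worst resource utilization onto the legend's three buckets.
function classify(utilizationPct: number, overflow = false): NodeStatus {
  if (overflow || utilizationPct > 100) return "Overloaded";
  if (utilizationPct > 85) return "Strained";
  return "OK";
}

console.log(classify(62));        // "OK"
console.log(classify(93));        // "Strained"
console.log(classify(80, true));  // "Overloaded" (memory overflow)
```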
Results - approximate
Resource utilization
Published benchmarks for reference
Real-world throughput plus official model-card task scores from vendor, community, and research sources. Throughput rows are used for configuration alignment; task-score rows are for comparison only.