Pick a model and hardware. See the decode rate and what's bottlenecking it.
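In the single-stream case, the decode rate is usually capped by memory bandwidth rather than compute: every generated token has to stream the full weight set (plus the current KV cache) out of memory. A minimal sketch of that ceiling, with all figures assumed for illustration (8B parameters at Q4 on a card with roughly 1 TB/s of memory bandwidth):

```ts
// Rough memory-bandwidth ceiling on single-stream decode (illustrative numbers, not measured).
const weightBytes = 8e9 * 0.5;   // ~8B params at Q4 ≈ 0.5 bytes per parameter (assumption)
const kvReadBytes = 0.54e9;      // assumed KV cache read per token at ~4k context
const memBandwidth = 1e12;       // ~1 TB/s-class GPU (assumption)

// Each decoded token streams the weights plus the KV cache once,
// so bandwidth divided by bytes moved per token bounds tokens per second.
const decodeCeiling = memBandwidth / (weightBytes + kvReadBytes);
console.log(`~${decodeCeiling.toFixed(0)} tok/s upper bound from memory bandwidth`);
```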
Model
Llama 3 8B @ Q4
Tuning & advanced options
Defaults work for most setups. Change these only if you need specific behavior.
Workload
Prompt tokens processed before generation.
Generated tokens decoded autoregressively.
Memory uses the combined context window; timing is split into prompt processing and decode.
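A simplified sketch of that split, with the rates assumed purely for illustration: prefill is treated as one batched pass over the prompt, and decode as a fixed per-token cost.

```ts
// Hypothetical workload: all rates below are assumptions, not measurements.
const promptTokens = 2048;      // prompt tokens processed before generation
const generatedTokens = 512;    // tokens decoded autoregressively

const prefillTokPerSec = 5000;  // assumed batched prompt-processing rate
const decodeTokPerSec = 40;     // assumed per-token decode rate

const prefillSec = promptTokens / prefillTokPerSec;
const decodeSec = generatedTokens / decodeTokPerSec;

// Memory sizing uses the combined context window (prompt + generated tokens).
const contextWindow = promptTokens + generatedTokens;

console.log({ prefillSec, decodeSec, totalSec: prefillSec + decodeSec, contextWindow });
```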
Distribution & optimization
Applies to KV cache only; most useful for long contexts or memory overflow.
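The KV cache is the component that grows linearly with context length, which is why this option targets it alone. A back-of-the-envelope estimate, assuming Llama-3-8B-like shapes (32 layers, 8 KV heads, head dimension 128) and an FP16 cache:

```ts
// KV cache size = 2 (K and V) * layers * kvHeads * headDim * contextLen * bytes per element.
// Shapes are an assumption modeled on Llama 3 8B with grouped-query attention.
function kvCacheBytes(contextLen: number, bytesPerElement = 2): number {
  const layers = 32, kvHeads = 8, headDim = 128;
  return 2 * layers * kvHeads * headDim * contextLen * bytesPerElement;
}

console.log(kvCacheBytes(8_192) / 1e9);   // ≈ 1.1 GB at an 8k context
console.log(kvCacheBytes(131_072) / 1e9); // ≈ 17 GB at a 128k context, which overflows many single cards once weights are added
```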
Runtime
Scenarios
Cost assumptions (power pricing)
Model internals (override preset)
Hardware
Topology
How the selected hardware connects.
Status
OK (≤ 85%) · Strained (85-100%) · Overloaded (> 100% or overflow)
Bar segments: M=Memory · C=Compute · L=Local BW · N=Network BW
Click a node to edit.
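The status buckets are a direct mapping from the worst utilization figure; a minimal sketch of that mapping, assuming utilization is expressed as a percentage and memory overflow is flagged separately:

```ts
type NodeStatus = "OK" | "Strained" | "Overloaded";

// Map a node's worst resource utilization onto the legend's three buckets.
function classify(utilizationPct: number, overflow = false): NodeStatus {
  if (overflow || utilizationPct > 100) return "Overloaded";
  if (utilizationPct > 85) return "Strained";
  return "OK";
}

console.log(classify(62));        // "OK"
console.log(classify(93));        // "Strained"
console.log(classify(80, true));  // "Overloaded" (memory overflow)
```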
Results - approximate
Resource utilization
Published benchmarks for reference
Real-world throughput plus official model-card task scores from vendor, community, and research sources. Throughput rows are used for configuration alignment; task-score rows are for comparison only.