ML System Bottleneck Analyzer
Model Configuration
Llama 3 8B @ Q4
Model Preset
Custom
Llama 3 8B
Llama 3 70B
Llama 3.3 70B
Llama 3.2 3B
Llama 3.2 1B
Llama 3.2 90B Vision
Llama 3.2 11B Vision
DeepSeek R1 (671B MoE)
DeepSeek V3 (671B)
DeepSeek R1 Distill 70B
DeepSeek R1 Distill 32B
DeepSeek R1 Distill 14B
DeepSeek R1 Distill 8B
DeepSeek R1 Distill 1.5B
Qwen 2.5 72B
Qwen 2.5 32B
Qwen 2.5 14B
Qwen 2.5 7B
Qwen 2.5 3B
Qwen 2.5 Coder 32B
Mistral 7B
Mistral Large 2 (123B)
Mistral Nemo 12B
Mixtral 8x7B
Mixtral 8x22B
GLM-4 9B
Command R+ 104B
Gemma 3 27B
Phi-3 14B
Phi-3 3.8B
SDXL Base (3.5B)
Flux.1 Dev (12B)
SD3 Medium (2B)
Large Model (400B+)
Very Large Model (1T+)
Quantization
Q4 (Recommended)
INT8
FP16
BF16
FP32
Total Parameters (B)
Batch Size
Sequence Length
Hidden Size
Number of Layers
Number of Heads
Data Type
float32
bfloat16
float16
int8
q4
Parallelism Strategy
AUTO (Find Optimal)
Pipeline Parallelism (PP)
Tensor Parallelism (TP)
Data Parallelism (DP)
Expert Parallelism (EP - MoE)
Sequence Parallelism (SP)
Context Parallelism (CP)
Hybrid TP+PP
Hybrid TP+DP
Optimization
Standard Inference
Speculative Decoding (2-3x speedup)
Continuous Batching
PagedAttention (vLLM)
Flash Attention
EXO Phase Split (Prefill/Decode)
Batch Size
1 (Interactive)
4 (Small batch)
8 (Medium batch)
16 (Large batch)
32 (High throughput)
64 (Max throughput)
Devices
Quick Scenario Preset
Custom Configuration
RTX 5090 + DDR5 Overflow (36GB model)
RTX 5090 x16 + RTX 5060 Ti x1 (slow PCIe)
RTX 5090 x8 + RTX 5060 Ti x8 (balanced)
RTX 5090 x16 + GPU via Oculink (x4)
2x RTX 5090 x16 (theoretical)
RTX 4090 + DDR5 Overflow
RTX 4090 + RTX 4080 (both x8)
Mac M4 Max (128GB)
Mac M3 Ultra (512GB)
Single H100
8x H100 (NVLink)
Resource Utilization
System Analysis
(Token rates are approximations)
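One common way to approximate a single-stream decode token rate is a memory-bandwidth bound: each generated token needs roughly one full read of the weights, so the rate is at most bandwidth divided by weight size. The sketch below illustrates that rule of thumb only; it is not necessarily how this analyzer computes its figures. The function names, the bytes-per-parameter table (q4 taken as 0.5 bytes, ignoring scale overhead), and the 1000 GB/s bandwidth in the example are all assumptions made for illustration.

```python
# Approximate bytes per parameter for each data type in the form above.
# Assumption: q4 is counted as 0.5 bytes and per-block scale overhead is ignored.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "q4": 0.5,
}


def weight_bytes(params_billion: float, dtype: str) -> float:
    """Estimated size of the model weights in bytes."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype]


def decode_tokens_per_second(params_billion: float, dtype: str,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode rate, assuming every weight is read once per token."""
    return bandwidth_gb_s * 1e9 / weight_bytes(params_billion, dtype)


if __name__ == "__main__":
    # Hypothetical inputs: an 8B model at q4 on a device with 1000 GB/s memory bandwidth.
    size_gb = weight_bytes(8, "q4") / 1e9
    rate = decode_tokens_per_second(8, "q4", 1000)
    print(f"weights: {size_gb:.1f} GB, decode bound: {rate:.0f} tokens/s")
```

This bound ignores KV-cache traffic, compute limits, and interconnect transfers between devices, so measured rates will usually be lower.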
System Topology
(Connection diagram)
NVLink (300+ GB/s)
PCIe 5.0 (32-64 GB/s)
PCIe 4.0 (8-32 GB/s)
PCIe 3.0/DDR5 (<16 GB/s)
Real-world results are listed below for reference.
Model:
Hardware:
Quantization:
Model | Quantization | Framework | Hardware | Batch Size | Sequence Length | Token Rate (Batch) | Token Rate (Single) | Source