Blackwell Ultra SM — Layer Architecture
192 SMs per B200 (live view)
Instruction Cache
L0/L1 I-cache — fetch, decode, predicate · utilization 60%
Host / L2 → I-Cache 128KB → Warp Schedulers

| Cache Size | Hit Rate | Fetch Bandwidth | Misses / 1K Instr |
|---|---|---|---|
| 128 KB | 96.2% | 256 B/clk | 3.8 |
Warp Schedulers × 4
4 schedulers, 4 dispatch units — 32 threads/warp · utilization 74%
I-Cache → Decode → Schedule → Dispatch ×4 → RegFile

| Active Warps | Occupancy | Stall Cycles | IPC (Instr/Clk) |
|---|---|---|---|
| 48 / 64 | 75% | 12.4% | 3.2 |
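The Active Warps and Occupancy tiles above come straight from how many warps a kernel can keep resident per SM. A minimal sketch of querying that with the CUDA occupancy API follows; the kernel `saxpy` and the 256-thread block size are illustrative assumptions, not part of the dashboard.

```cuda
// Query how many warps of a given kernel fit on one SM (occupancy).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;                        // 8 warps per block
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, saxpy, blockSize, /*dynamicSmemBytes=*/0);

    int warpsPerBlock = blockSize / prop.warpSize;    // warpSize is 32
    int activeWarps   = maxBlocksPerSM * warpsPerBlock;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Active warps/SM: %d of %d (occupancy %.0f%%)\n",
           activeWarps, maxWarpsPerSM,
           100.0 * activeWarps / maxWarpsPerSM);
    return 0;
}
```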
Register File
65,536 × 32-bit registers (256 KB) — operand source for all units · utilization 68%
Dispatch → RegFile 256KB → INT32 / FP32 / Tensor

| Total Registers | Per Thread | Utilization | Bank Conflicts |
|---|---|---|---|
| 65,536 | 255 max | 68% | 2.1% |
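Register file utilization is what the compiler trades against occupancy: fewer registers per thread means more resident warps drawing from the same 64K-entry file. A minimal sketch of steering that trade with `__launch_bounds__` follows; the reduction kernel is an illustrative placeholder, and `nvcc -Xptxas -v` reports the register count actually assigned.

```cuda
// Cap per-thread register usage so more warps fit in the 64K-entry register file.
#include <cuda_runtime.h>

// Ask the compiler to fit at least 4 blocks of 256 threads per SM, which
// bounds registers per thread to roughly 65536 / (4 * 256) = 64.
__global__ void __launch_bounds__(256, 4)
reduce_partial(const float* in, float* out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}
```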
Processing Cores
INT32 + FP32 + FP64 + 5th Gen Tensor Cores (FP4/FP6/FP8) · utilization 82%
Execution units: INT32 ×128 · FP32 ×128 · FP64 ×2 · TENSOR 5G ×4

| Tensor FP4 PFLOPS | Tensor FP8 PFLOPS | FP32 TFLOPS | INT32 TOPS |
|---|---|---|---|
| 20.0 | 10.0 | 90 | 90 |
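For context on how the Tensor Core numbers above are reached from CUDA, here is a minimal sketch of the warp-level programming model via `nvcuda::wmma`. The public wmma API covers FP16/BF16/TF32/INT8; the FP4/FP6/FP8 paths quoted in this card are normally reached through cuBLAS, CUTLASS, or inline PTX `mma` instructions and are not shown here.

```cuda
// One warp computes a 16x16x16 FP16 multiply-accumulate into FP32 on the Tensor Cores.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_gemm_16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);         // C := 0
    wmma::load_matrix_sync(a, A, 16);       // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);         // C += A * B on the Tensor Cores
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```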
Load/Store + SFU
32 LD/ST units + 8 Special Function Units (sin, cos, rcp, sqrt)
55%
RegFile
→
LD/ST ×32
⇄
L1 / Shared
|
SFU ×8
LD/ST Units
32
SFU Units
8
LD/ST Throughput
128 B/clk
Pending Requests
24
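The transcendentals listed above (sin, cos, rcp, sqrt) have hardware-approximate intrinsics that execute on the 8 SFUs, as opposed to the slower but more accurate software sequences behind `sinf()`/`cosf()`. A minimal sketch follows; the kernel and its data layout are illustrative assumptions.

```cuda
// Rotate 2D points using the SFU-backed __sincosf approximation.
#include <cuda_runtime.h>

__global__ void rotate_points(const float* angle, float2* pts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s, c;
    __sincosf(angle[i], &s, &c);              // single fast sin/cos, runs on the SFU
    float2 p = pts[i];
    pts[i] = make_float2(c * p.x - s * p.y,   // standard 2D rotation by angle[i]
                         s * p.x + c * p.y);
}
```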
L1 Data Cache / Shared Memory
256 KB configurable — L1 + Shared + Texture cache · utilization 72%
LD/ST ⇄ L1 256KB ⇄ TEX ×4 → L2 Cache

| L1 Size | Shared Mem | Hit Rate | TEX Throughput |
|---|---|---|---|
| 128 KB | 128 KB | 89.4% | 4 filtered/clk |
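The L1/shared split shown above is configurable per kernel. A minimal sketch of requesting a larger shared-memory carveout and opting into more than the default 48 KB of dynamic shared memory follows; the kernel name `stencil_kernel` and the 96 KB figure are illustrative assumptions.

```cuda
// Configure the L1/shared-memory carveout and dynamic shared-memory limit per kernel.
#include <cuda_runtime.h>

__global__ void stencil_kernel(const float* in, float* out, int n) {
    extern __shared__ float tile[];            // dynamic shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];
}

void configure_shared_memory() {
    // Prefer shared memory over L1 for this kernel (percentage of the carveout).
    cudaFuncSetAttribute(stencil_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 100);

    // Allow up to 96 KB of dynamic shared memory per block; anything above
    // 48 KB must be requested explicitly, and the ceiling is architecture-dependent.
    cudaFuncSetAttribute(stencil_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 96 * 1024);
}
```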
L2 Cache → HBM3e → NVLink 5.0
96 MB L2 — 192 GB HBM3e @ 8 TB/s — NVLink 5.0 @ 1.8 TB/s · utilization 65%
L1 → L2 96MB ⇄ HBM3e 192GB ⇄ NVLink 5.0 → uvspeed

| L2 Hit Rate | HBM3e Bandwidth | NVLink BW | Power Draw |
|---|---|---|---|
| 82.1% | 5.2 TB/s | 1.17 TB/s | 840W |
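The HBM3e Bandwidth tile above reports achieved, not peak, memory bandwidth. A minimal sketch of estimating it by timing a device-to-device copy with CUDA events follows; the 1 GiB buffer size and iteration count are illustrative assumptions.

```cuda
// Estimate achieved device memory bandwidth from a timed device-to-device copy.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ULL << 30;            // 1 GiB per buffer
    const int    iters = 20;
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy reads and writes `bytes`, so count 2x traffic per iteration.
    double tbps = (2.0 * bytes * iters) / (ms / 1e3) / 1e12;
    printf("Achieved device bandwidth: %.2f TB/s\n", tbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```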
Aggregate Data Flow — All Layers (streaming)

| SM Throughput | HBM3e BW | NVLink Traffic | Tensor Core FP4 | Power Draw |
|---|---|---|---|---|
| 72% | 1.17 TB/s | 1.04 TB/s | 81% | 840W |
SM Activity Heatmap — 192 Streaming Multiprocessors
Real-time per-SM activity, shaded from idle to saturated.
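Attributing work to individual SMs, as the heatmap does, requires knowing which SM each block executed on. A minimal sketch using the `%smid` special register follows; the histogram kernel and buffer name are illustrative assumptions about how such a heatmap could be fed, not the dashboard's actual instrumentation.

```cuda
// Record which SM each block ran on so the host can histogram per-SM activity.
#include <cuda_runtime.h>

__device__ __forceinline__ unsigned smid() {
    unsigned id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));   // read the SM id register
    return id;
}

__global__ void count_blocks_per_sm(unsigned* blocksPerSM) {
    if (threadIdx.x == 0)                             // one increment per block
        atomicAdd(&blocksPerSM[smid()], 1u);
}
```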
Blackwell Ultra Specifications
B200 / GB200
| Specification | Blackwell Ultra (B200) | Blackwell (B100) | Hopper (H100) |
|---|---|---|---|
| Transistors | 208B | 208B | 80B |
| Process | TSMC 4NP | TSMC 4NP | TSMC 4N |
| SMs | 192 | 180 | 132 |
| FP4 Tensor | 20 PFLOPS | 18 PFLOPS | — |
| FP8 Tensor | 10 PFLOPS | 9 PFLOPS | 3.96 PFLOPS |
| HBM | HBM3e 192GB | HBM3e 192GB | HBM3 80GB |
| Memory BW | 8 TB/s | 8 TB/s | 3.35 TB/s |
| NVLink BW | 1.8 TB/s (NVL5) | 1.8 TB/s | 900 GB/s |
| TDP | 1200W | 1000W | 700W |
| Interconnect | NVLink 5.0 | NVLink 5.0 | NVLink 4.0 |
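Several rows of the table above have direct per-device counterparts in the CUDA runtime. A minimal sketch of reading them with `cudaGetDeviceProperties` follows.

```cuda
// Print the SM count, memory size, bus width, and clock of the current device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("%s: %d SMs, %.0f GB memory, %d-bit bus, %.2f GHz SM clock\n",
           p.name, p.multiProcessorCount,
           p.totalGlobalMem / 1e9,
           p.memoryBusWidth,
           p.clockRate / 1e6);    // clockRate is reported in kHz
    return 0;
}
```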
Localized Deploy Targets — Server Stacks (Colab ready)

- DGX Spark (ready): GPUs 1× GB200 Grace · Memory 128 GB unified · Perf 1000 TOPS AI
- Supermicro GPU (ready): GPUs 8× B200 · Memory 8× 192GB HBM3e · Perf 160 PFLOPS FP4
- Lambda Supercluster (cloud): GPUs 3,584× B200 · Fabric NVLink + InfiniBand · Perf ~71 EFLOPS FP4
- Local Dev (uvspeed, local): Bridge :8085 HTTP + :8086 WS · Engine Quantum Prefix 3.0 · Deploy Electron / Web
- Google Colab (cloud): GPUs T4 / A100 / L4 · Runtime Python + Jupyter · Bridge ngrok tunnel
- DGX Cloud (cloud): GPUs 8× B200 per node · Network NVLink + IB NDR · Perf 160 PFLOPS/node