Blackwell Ultra SM — Layer Architecture
192 SMs per B200 (live view)
Instruction Cache
L0/L1 I-cache — fetch, decode, predicate · utilization 60%
Host / L2 → I-Cache 128KB → Warp Schedulers

| Cache Size | Hit Rate | Fetch Bandwidth | Misses / 1K Instr |
|---|---|---|---|
| 128 KB | 96.2% | 256 B/clk | 3.8 |
Warp Schedulers × 4
4 schedulers, 4 dispatch units — 32 threads/warp · utilization 74%
I-Cache → Decode → Schedule → Dispatch ×4 → RegFile

| Active Warps | Occupancy | Stall Cycles | IPC (Instr/Clk) |
|---|---|---|---|
| 48 / 64 | 75% | 12.4% | 3.2 |
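The Active Warps and Occupancy tiles above come straight from how many warps a kernel can keep resident per SM. A minimal sketch of querying that with the CUDA occupancy API follows; the kernel `saxpy` and the 256-thread block size are illustrative assumptions, not part of the dashboard.

```cuda
// Query how many warps of a given kernel fit on one SM (occupancy).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;                        // 8 warps per block
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, saxpy, blockSize, /*dynamicSmemBytes=*/0);

    int warpsPerBlock = blockSize / prop.warpSize;    // warpSize is 32
    int activeWarps   = maxBlocksPerSM * warpsPerBlock;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Active warps/SM: %d of %d (occupancy %.0f%%)\n",
           activeWarps, maxWarpsPerSM,
           100.0 * activeWarps / maxWarpsPerSM);
    return 0;
}
```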
Register File
65,536 × 32-bit registers (256 KB) — operand source for all units · utilization 68%
Dispatch → RegFile 256KB → INT32 / FP32 / Tensor

| Total Registers | Per Thread | Utilization | Bank Conflicts |
|---|---|---|---|
| 65,536 | 255 max | 68% | 2.1% |
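Register file utilization is what the compiler trades against occupancy: fewer registers per thread means more resident warps drawing from the same 64K-entry file. A minimal sketch of steering that trade with `__launch_bounds__` follows; the reduction kernel is an illustrative placeholder, and `nvcc -Xptxas -v` reports the register count actually assigned.

```cuda
// Cap per-thread register usage so more warps fit in the 64K-entry register file.
#include <cuda_runtime.h>

// Ask the compiler to fit at least 4 blocks of 256 threads per SM, which
// bounds registers per thread to roughly 65536 / (4 * 256) = 64.
__global__ void __launch_bounds__(256, 4)
reduce_partial(const float* in, float* out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}
```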
Processing Cores
INT32 + FP32 + FP64 + 5th Gen Tensor Cores (FP4/FP6/FP8) · utilization 82%
Execution units: INT32 ×128 · FP32 ×128 · FP64 ×2 · TENSOR 5G ×4

| Tensor FP4 PFLOPS | Tensor FP8 PFLOPS | FP32 TFLOPS | INT32 TOPS |
|---|---|---|---|
| 20.0 | 10.0 | 90 | 90 |
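For context on how the Tensor Core numbers above are reached from CUDA, here is a minimal sketch of the warp-level programming model via `nvcuda::wmma`. The public wmma API covers FP16/BF16/TF32/INT8; the FP4/FP6/FP8 paths quoted in this card are normally reached through cuBLAS, CUTLASS, or inline PTX `mma` instructions and are not shown here.

```cuda
// One warp computes a 16x16x16 FP16 multiply-accumulate into FP32 on the Tensor Cores.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_gemm_16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);         // C := 0
    wmma::load_matrix_sync(a, A, 16);       // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);         // C += A * B on the Tensor Cores
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```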
Load/Store + SFU
32 LD/ST units + 8 Special Function Units (sin, cos, rcp, sqrt)
55%
RegFile
→
LD/ST ×32
⇄
L1 / Shared
|
SFU ×8
LD/ST Units
32
SFU Units
8
LD/ST Throughput
128 B/clk
Pending Requests
24
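The transcendentals listed above (sin, cos, rcp, sqrt) have hardware-approximate intrinsics that execute on the 8 SFUs, as opposed to the slower but more accurate software sequences behind `sinf()`/`cosf()`. A minimal sketch follows; the kernel and its data layout are illustrative assumptions.

```cuda
// Rotate 2D points using the SFU-backed __sincosf approximation.
#include <cuda_runtime.h>

__global__ void rotate_points(const float* angle, float2* pts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s, c;
    __sincosf(angle[i], &s, &c);              // single fast sin/cos, runs on the SFU
    float2 p = pts[i];
    pts[i] = make_float2(c * p.x - s * p.y,   // standard 2D rotation by angle[i]
                         s * p.x + c * p.y);
}
```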
L1 Data Cache / Shared Memory
256 KB configurable — L1 + Shared + Texture cache · utilization 72%
LD/ST ⇄ L1 256KB ⇄ TEX ×4 → L2 Cache

| L1 Size | Shared Mem | Hit Rate | TEX Throughput |
|---|---|---|---|
| 128 KB | 128 KB | 89.4% | 4 filtered/clk |
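The L1/shared split shown above is configurable per kernel. A minimal sketch of requesting a larger shared-memory carveout and opting into more than the default 48 KB of dynamic shared memory follows; the kernel name `stencil_kernel` and the 96 KB figure are illustrative assumptions.

```cuda
// Configure the L1/shared-memory carveout and dynamic shared-memory limit per kernel.
#include <cuda_runtime.h>

__global__ void stencil_kernel(const float* in, float* out, int n) {
    extern __shared__ float tile[];            // dynamic shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];
}

void configure_shared_memory() {
    // Prefer shared memory over L1 for this kernel (percentage of the carveout).
    cudaFuncSetAttribute(stencil_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 100);

    // Allow up to 96 KB of dynamic shared memory per block; anything above
    // 48 KB must be requested explicitly, and the ceiling is architecture-dependent.
    cudaFuncSetAttribute(stencil_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 96 * 1024);
}
```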
L2 Cache → HBM3e → NVLink 5.0
96 MB L2 — 192 GB HBM3e @ 8 TB/s — NVLink 5.0 @ 1.8 TB/s · utilization 65%
L1 → L2 96MB ⇄ HBM3e 192GB ⇄ NVLink 5.0 → uvspeed

| L2 Hit Rate | HBM3e Bandwidth | NVLink BW | Power Draw |
|---|---|---|---|
| 82.1% | 5.2 TB/s | 1.17 TB/s | 840W |
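The HBM3e Bandwidth tile above reports achieved, not peak, memory bandwidth. A minimal sketch of estimating it by timing a device-to-device copy with CUDA events follows; the 1 GiB buffer size and iteration count are illustrative assumptions.

```cuda
// Estimate achieved device memory bandwidth from a timed device-to-device copy.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ULL << 30;            // 1 GiB per buffer
    const int    iters = 20;
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy reads and writes `bytes`, so count 2x traffic per iteration.
    double tbps = (2.0 * bytes * iters) / (ms / 1e3) / 1e12;
    printf("Achieved device bandwidth: %.2f TB/s\n", tbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```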
Aggregate Data Flow — All Layers (streaming)

| SM Throughput | HBM3e BW | NVLink Traffic | Tensor Core FP4 | Power Draw |
|---|---|---|---|---|
| 72% | 1.17 TB/s | 1.04 TB/s | 81% | 840W |
SM Activity Heatmap — 192 Streaming Multiprocessors
Real-time per-SM activity, shaded from idle to saturated.
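Attributing work to individual SMs, as the heatmap does, requires knowing which SM each block executed on. A minimal sketch using the `%smid` special register follows; the histogram kernel and buffer name are illustrative assumptions about how such a heatmap could be fed, not the dashboard's actual instrumentation.

```cuda
// Record which SM each block ran on so the host can histogram per-SM activity.
#include <cuda_runtime.h>

__device__ __forceinline__ unsigned smid() {
    unsigned id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));   // read the SM id register
    return id;
}

__global__ void count_blocks_per_sm(unsigned* blocksPerSM) {
    if (threadIdx.x == 0)                             // one increment per block
        atomicAdd(&blocksPerSM[smid()], 1u);
}
```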
Blackwell Ultra Specifications
B200 / GB200
| Specification | Blackwell Ultra (B200) | Blackwell (B100) | Hopper (H100) |
|---|---|---|---|
| Transistors | 208B | 208B | 80B |
| Process | TSMC 4NP | TSMC 4NP | TSMC 4N |
| SMs | 192 | 180 | 132 |
| FP4 Tensor | 20 PFLOPS | 18 PFLOPS | — |
| FP8 Tensor | 10 PFLOPS | 9 PFLOPS | 3.96 PFLOPS |
| HBM | HBM3e 192GB | HBM3e 192GB | HBM3 80GB |
| Memory BW | 8 TB/s | 8 TB/s | 3.35 TB/s |
| NVLink BW | 1.8 TB/s (NVL5) | 1.8 TB/s | 900 GB/s |
| TDP | 1200W | 1000W | 700W |
| Interconnect | NVLink 5.0 | NVLink 5.0 | NVLink 4.0 |
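Several rows of the table above have direct per-device counterparts in the CUDA runtime. A minimal sketch of reading them with `cudaGetDeviceProperties` follows.

```cuda
// Print the SM count, memory size, bus width, and clock of the current device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("%s: %d SMs, %.0f GB memory, %d-bit bus, %.2f GHz SM clock\n",
           p.name, p.multiProcessorCount,
           p.totalGlobalMem / 1e9,
           p.memoryBusWidth,
           p.clockRate / 1e6);    // clockRate is reported in kHz
    return 0;
}
```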
Localized Deploy Targets — Server Stacks (Colab ready)

- DGX Spark (ready): GPUs 1× GB200 Grace · Memory 128 GB unified · Perf 1000 TOPS AI
- Supermicro GPU (ready): GPUs 8× B200 · Memory 8× 192GB HBM3e · Perf 160 PFLOPS FP4
- Lambda Supercluster (cloud): GPUs 3,584× B200 · Fabric NVLink + InfiniBand · Perf ~71 EFLOPS FP4
- Local Dev (uvspeed, local): Bridge :8085 HTTP + :8086 WS · Engine Quantum Prefix 3.0 · Deploy Electron / Web
- Google Colab (cloud): GPUs T4 / A100 / L4 · Runtime Python + Jupyter · Bridge ngrok tunnel
- DGX Cloud (cloud): GPUs 8× B200 per node · Network NVLink + IB NDR · Perf 160 PFLOPS/node