uvspeed — NVIDIA Blackwell Live
Blackwell Ultra SM — Layer Architecture, 192 SMs per B200 (live)
📥 Instruction Cache
L0/L1 I-cache — fetch, decode, predicate
Utilization: 60%
Data path: Host / L2 → I-Cache (128 KB) → Warp Schedulers
Cache Size: 128 KB
Hit Rate: 96.2%
Fetch Bandwidth: 256 B/clk
Misses / 1K Instr: 3.8
🔀 Warp Schedulers × 4
4 schedulers, 4 dispatch units — 32 threads/warp (4 warps, 128 threads, dispatched per clock)
Utilization: 74%
Data path: I-Cache → Decode → Schedule → Dispatch ×4 → RegFile
Active Warps: 48 / 64
Occupancy: 75%
Stall Cycles: 12.4%
IPC (Instr/Clk): 3.2
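Occupancy here is just active warps over the per-SM maximum: 48 / 64 = 75%. The same theoretical figure can be queried from the CUDA occupancy API; a minimal sketch, assuming a hypothetical dummyKernel and a 256-thread block:

```cuda
// Minimal sketch: theoretical occupancy for a hypothetical kernel.
// Occupancy = active warps per SM / maximum warps per SM.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) {            // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 0.5f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;                              // threads per block
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, dummyKernel, blockSize, /*dynamicSmemBytes=*/0);

    int warpsPerBlock = blockSize / prop.warpSize;    // warpSize == 32
    int activeWarps   = maxBlocksPerSM * warpsPerBlock;
    int maxWarps      = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Active warps/SM: %d of %d (occupancy %.1f%%)\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}
```

Register and shared-memory usage of a real kernel pulls this number down, which is why the live value sits below 100%.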
🗄️ Register File
65,536 × 32-bit registers (256 KB) — operand source for all units
Utilization: 68%
Data path: Dispatch → RegFile (256 KB) → INT32 / FP32 / Tensor
Total Registers: 65,536
Per Thread: 255 max
Bank Conflicts: 2.1%
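With 65,536 registers per SM and a 255-register per-thread ceiling, register pressure is usually what limits how many warps stay resident. A hedged sketch of capping register use with __launch_bounds__ (or the equivalent -maxrregcount compiler flag); the kernel name and bounds are illustrative:

```cuda
// Sketch: bound per-thread register usage so more blocks fit per SM.
__global__ void __launch_bounds__(256, 4)             // <= 256 threads/block,
heavyKernel(const float *in, float *out, int n) {     // hint: >= 4 blocks/SM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // 65,536 registers / (256 threads x 4 blocks) = 64 registers per
        // thread before the compiler has to spill to local memory.
        float x = in[i];
        out[i] = x * x + 1.0f;
    }
}
// Equivalent coarse-grained control at compile time:
//   nvcc -maxrregcount=64 kernel.cu
```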
Processing Cores
INT32 + FP32 + FP64 + 5th Gen Tensor Cores (FP4/FP6/FP8)
Utilization: 82%
INT32 Units: 128
FP32 Units: 128
FP64 Units: 2
Tensor Cores (5th Gen): 4
Tensor FP4: 20.0 PFLOPS
Tensor FP8: 10.0 PFLOPS
FP32: 90 TFLOPS
INT32: 90 TOPS
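The FP4/FP8 tensor numbers above are normally reached through library kernels (cuBLASLt, CUTLASS) rather than hand-written device code. As a hedged illustration of the warp-level programming model only, here is one 16×16×16 tensor-core multiply via the long-standing WMMA API, FP16 in, FP32 accumulate; the kernel name is hypothetical and it must be launched with at least one full warp:

```cuda
// Sketch: a single 16x16x16 tensor-core MMA using nvcuda::wmma.
// Blackwell's FP4/FP6/FP8 paths are typically driven by cuBLASLt/CUTLASS.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);        // leading dimension = 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);          // executed cooperatively by one warp
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```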
📦 Load/Store + SFU
32 LD/ST units + 8 Special Function Units (sin, cos, rcp, sqrt)
Utilization: 55%
Data path: RegFile → LD/ST ×32 → L1 / Shared | SFU ×8
LD/ST Units: 32
SFU Units: 8
LD/ST Throughput: 128 B/clk
Pending Requests: 24
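The eight SFUs evaluate hardware-approximate transcendentals; CUDA reaches them through the fast-math intrinsics (or globally with --use_fast_math), trading a few ULPs of accuracy for throughput. A small sketch with a hypothetical kernel:

```cuda
// Sketch: routing transcendentals to the SFU path via fast-math intrinsics.
__global__ void sfuDemo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float s = __sinf(x);            // SFU sine approximation
        float r = rsqrtf(x * x + 2.0f); // reciprocal square root
        float e = __expf(x);            // exponential approximation
        out[i] = s * r + e;
    }
}
```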
💾 L1 Data Cache / Shared Memory
256 KB configurable — L1 + Shared + Texture cache
Utilization: 72%
Data path: LD/ST → L1 (256 KB) / TEX ×4 → L2 Cache
L1 Size: 128 KB
Shared Mem: 128 KB
Hit Rate: 89.4%
TEX Throughput: 4 filtered/clk
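The 256 KB array is split between L1 and shared memory on a per-kernel basis. A sketch, assuming a hypothetical stencil kernel, of hinting the carveout and opting in to more than the default 48 KB of dynamic shared memory per block:

```cuda
// Sketch: configuring the L1 / shared-memory split for one kernel.
#include <cuda_runtime.h>

__global__ void stencil(const float *in, float *out, int n) {
    extern __shared__ float tile[];                  // dynamic shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;
}

void configure() {
    // Hint: prefer ~50% of the array as shared memory (advisory, not binding).
    cudaFuncSetAttribute(stencil,
        cudaFuncAttributePreferredSharedMemoryCarveout, 50);
    // Opt in to > 48 KB of dynamic shared memory per block (here 100 KB);
    // the actual amount is still passed as the third launch parameter.
    cudaFuncSetAttribute(stencil,
        cudaFuncAttributeMaxDynamicSharedMemorySize, 100 * 1024);
}
```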
🌐 L2 Cache → HBM3e → NVLink 5.0
96 MB L2 — 192 GB HBM3e @ 8 TB/s — NVLink 5.0 @ 1.8 TB/s
Utilization: 65%
Data path: L1 → L2 (96 MB) → HBM3e (192 GB) → NVLink 5.0 → uvspeed
L2 Hit Rate: 82.1%
HBM3e Bandwidth: 5.2 TB/s
NVLink BW: 1.17 TB/s
Power Draw: 840W
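The HBM3e figure above is live telemetry; a rough way to approximate achieved device-memory bandwidth yourself is a timed device-to-device copy, counting each byte once read and once written. Buffer size and repeat count below are illustrative:

```cuda
// Sketch: measure achieved device-memory bandwidth with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;                 // 1 GiB per buffer
    const int    reps  = 20;
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up
    cudaEventRecord(t0);
    for (int r = 0; r < reps; ++r)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double tbps = (double)reps * 2.0 * bytes / (ms * 1e-3) / 1e12;
    printf("Achieved bandwidth: %.2f TB/s\n", tbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```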
Aggregate Data Flow — All Layers (streaming)
FPS: 60 | SMs: 192 | Bandwidth: 1.8 TB/s | Utilization: —%
SM Throughput: 72%
HBM3e BW: 1.17 TB/s
NVLink Traffic: 1.04 TB/s
Tensor Core FP4: 81%
Power Draw: 840W
SM Activity Heatmap — 192 Streaming Multiprocessors (real-time; color scale from Idle to Saturated)
Blackwell Ultra Specifications (B200 / GB200)

| Specification | Blackwell Ultra (B200) | Blackwell (B100) | Hopper (H100) |
|---------------|------------------------|------------------|---------------|
| Transistors   | 208B                   | 208B             | 80B           |
| Process       | TSMC 4NP               | TSMC 4NP         | TSMC 4N       |
| SMs           | 192                    | 180              | 132           |
| FP4 Tensor    | 20 PFLOPS              | 18 PFLOPS        | n/a           |
| FP8 Tensor    | 10 PFLOPS              | 9 PFLOPS         | 3.96 PFLOPS   |
| HBM           | HBM3e 192 GB           | HBM3e 192 GB     | HBM3 80 GB    |
| Memory BW     | 8 TB/s                 | 8 TB/s           | 3.35 TB/s     |
| NVLink BW     | 1.8 TB/s (NVL5)        | 1.8 TB/s         | 900 GB/s      |
| TDP           | 1200 W                 | 1000 W           | 700 W         |
| Interconnect  | NVLink 5.0             | NVLink 5.0       | NVLink 4.0    |
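A few of these entries (SM count, memory size, L2 size, registers per SM) can be cross-checked against whatever GPU is actually installed with a short runtime query; on non-Blackwell parts the values will naturally differ:

```cuda
// Sketch: print the device properties that correspond to table rows above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("Name:           %s\n",      p.name);
    printf("SMs:            %d\n",      p.multiProcessorCount);
    printf("Global memory:  %.1f GB\n", p.totalGlobalMem / 1e9);
    printf("L2 cache:       %.1f MB\n", p.l2CacheSize / 1e6);
    printf("Shared mem/SM:  %zu KB\n",  p.sharedMemPerMultiprocessor / 1024);
    printf("Registers/SM:   %d\n",      p.regsPerMultiprocessor);
    printf("Warp size:      %d\n",      p.warpSize);
    return 0;
}
```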
Localized Deploy Targets — Server Stacks (Colab ready)
DGX Spark
GPUs: 1× GB10 Grace Blackwell
Memory: 128 GB unified
Perf: 1000 TOPS AI
Status: ready
Supermicro GPU
GPUs: 8× B200
Memory: 8× 192 GB HBM3e
Perf: 160 PFLOPS FP4
Status: ready
Lambda Supercluster
GPUs: 3,584× B200
Fabric: NVLink + InfiniBand
Perf: ~71 EFLOPS FP4
Status: cloud
Local Dev (uvspeed)
Bridge: :8085 HTTP + :8086 WS
Engine: Quantum Prefix 3.0
Deploy: Electron / Web
Status: local
Google Colab
GPUs: T4 / A100 / L4
Runtime: Python + Jupyter
Bridge: ngrok tunnel
Status: cloud
DGX Cloud
GPUs: 8× B200 per node
Network: NVLink + IB NDR
Perf: 160 PFLOPS/node
Status: cloud