Memory Architecture: The 360× Performance Gap

Why the GPU demolished the FPGA in this implementation

🚀 GPU (Tesla T4)

Memory Hierarchy
L2 Cache 4 MB
GDDR6 Memory 16 GB
Bandwidth 320 GB/s
Compute Resources
CUDA Cores 2,560
Tensor Cores 320
Peak FP32 8.1 TFLOPS
Why It Won
✓ Massive parallel threads hide memory latency
✓ 320 GB/s bandwidth keeps compute fed
✓ 4 MB L2 cache holds most parameters
✓ Optimized CUDA libraries (cuDNN)
✓ Memory access pattern optimized for CNNs

⚡ FPGA (Zynq-7020)

Memory Hierarchy
On-chip BRAM 560 KB
DDR3 Memory 512 MB
Bandwidth ~2.1 GB/s
Compute Resources
DSP Slices 220
LUTs 53,200
Clock Speed 66.7 MHz
The Bottleneck
⚠ Only 560 KB on-chip BRAM
⚠ 2.8 MB model weights in DDR3
⚠ Single AXI master (sequential access)
⚠ 152× less memory bandwidth than the GPU
⚠ Tiled convolution = many DDR round-trips
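
The capacity mismatch behind the first two warnings can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted above. Decimal units (1 KB = 1,000 bytes) are an assumption, and the helper name fits_on_chip is made up for illustration.

```python
# Figures quoted in the comparison above. Decimal units are an assumption.
KB = 1_000

bram_bytes = 560 * KB        # on-chip BRAM on the Zynq-7020
weights_bytes = 2_800 * KB   # 2.8 MB of model weights kept in DDR3


def fits_on_chip(payload_bytes, capacity_bytes):
    """Return True if the payload fits entirely in the given capacity."""
    return payload_bytes <= capacity_bytes


# How many times larger the weights are than the available BRAM.
oversubscription = weights_bytes / bram_bytes

print(f"weights fit in BRAM: {fits_on_chip(weights_bytes, bram_bytes)}")
print(f"weights are {oversubscription:.1f}x the BRAM capacity")
```

The weights come out at five times the BRAM capacity, before any space is reserved for input and output tiles, so they have to be streamed from DDR3 in pieces.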

FPGA Inference Flow (Single 8×8 Tile)

1. Fetch 8×8 input tile from DDR3 to BRAM
   ⏱ Memory bound - waiting on DDR
2. Fetch corresponding weights from DDR3
   ⏱ Memory bound - waiting on DDR
3. Compute convolution in FPGA fabric
   ⚡ Fast! DSP slices at work
4. Write output tile back to DDR3
   ⏱ Memory bound - waiting on DDR
5. Repeat for all tiles (32×32 image = hundreds of tiles)
   🔁 Each iteration stalls on memory
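
The five steps above can be sketched as a small cost model. This is an illustration, not the project's HLS code: the TileCost class, its field names, and every microsecond value are hypothetical placeholders rather than measurements. The one figure taken from the text is the geometry: a single 32×32 feature map splits into 16 tiles of 8×8, so the "hundreds of tiles" presumably accumulate across channels and layers, which the text does not enumerate.

```python
from dataclasses import dataclass


@dataclass
class TileCost:
    """Per-tile step costs in microseconds. All values are placeholders."""
    fetch_input_us: float    # step 1: input tile, DDR3 -> BRAM
    fetch_weights_us: float  # step 2: weights, DDR3 -> BRAM
    compute_us: float        # step 3: convolution in the fabric
    write_output_us: float   # step 4: output tile, BRAM -> DDR3

    @property
    def memory_us(self):
        return self.fetch_input_us + self.fetch_weights_us + self.write_output_us

    @property
    def total_us(self):
        # Single AXI master: the steps run back to back with no overlap.
        return self.memory_us + self.compute_us


def tiles_per_feature_map(height, width, tile):
    """Number of tile x tile blocks needed to cover a height x width map."""
    rows = -(-height // tile)  # ceiling division
    cols = -(-width // tile)
    return rows * cols


def stall_fraction(cost):
    """Share of one tile iteration spent waiting on DDR transfers."""
    return cost.memory_us / cost.total_us


n_tiles = tiles_per_feature_map(32, 32, 8)
# Hypothetical costs, chosen only to show the shape of the calculation.
cost = TileCost(fetch_input_us=10.0, fetch_weights_us=40.0,
                compute_us=1.0, write_output_us=10.0)

print(f"tiles per 32x32 feature map: {n_tiles}")
print(f"stall fraction per tile: {stall_fraction(cost):.1%}")
```

Because nothing overlaps, the stall fraction per tile is also the stall fraction for the whole run, however many tiles there are. Overlapping the fetch for the next tile with the compute for the current one would change total_us to the larger of the two terms instead of their sum.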

The 58× Mystery Solved

HLS synthesis predicted: 7.98 ms minimum latency
Actual hardware measured: 466.5 ms
Gap: 58× slower than theoretical

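This gap is attributed entirely to DDR3 memory accesses. Taking the HLS figure as the compute time, the FPGA spent roughly 98% of execution time (1 − 7.98/466.5 ≈ 98.3%) waiting for data transfers, not computing. With weights held in BRAM, latency would approach the 7.98 ms estimate, though the 2.8 MB of weights would first have to shrink to fit the 560 KB available.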
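
The gap and the stall share follow directly from the two latencies above. This short sketch assumes the HLS estimate equals the time spent computing and that everything else is memory wait; the variable names are illustrative.

```python
hls_estimate_ms = 7.98  # HLS synthesis minimum latency
measured_ms = 466.5     # latency measured on hardware

# How many times slower the hardware ran than the HLS estimate.
slowdown = measured_ms / hls_estimate_ms

# Time not accounted for by compute, assuming the HLS estimate is compute time.
stall_ms = measured_ms - hls_estimate_ms
stall_share = stall_ms / measured_ms

print(f"slowdown: {slowdown:.1f}x")
print(f"unexplained by compute: {stall_ms:.2f} ms ({stall_share:.1%})")
```

Under that assumption the slowdown is about 58.5× and the stall share about 98.3%, close to the share quoted above.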

💡 Key Insight: Memory Bandwidth Dominates Everything

The GPU didn't win because it computed faster; it won because it starved less.

GPU advantage: 320 GB/s ÷ 2.1 GB/s ≈ 152× more memory bandwidth

For CNN inference, memory access patterns matter more than raw compute power. The GPU's massive bandwidth and sophisticated caching kept its 2,560 cores fed with data. The FPGA's 220 DSP slices spent most of their time idle, waiting for the next DDR transfer.
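
To put the bandwidth ratio in time units, the sketch below computes a lower bound on how long one pass over the 2.8 MB of weights takes at each device's quoted bandwidth. It assumes decimal units and peak bandwidth with no protocol overhead. The function name transfer_time_ms is illustrative.

```python
MB = 1_000_000
GB = 1_000_000_000


def transfer_time_ms(payload_bytes, bandwidth_bytes_per_s):
    """Lower bound on transfer time at peak bandwidth, in milliseconds."""
    return payload_bytes / bandwidth_bytes_per_s * 1_000


weights_bytes = 2_800_000   # 2.8 MB of model weights
gpu_bw = 320 * GB           # Tesla T4 GDDR6 bandwidth
fpga_bw = 2_100_000_000     # Zynq-7020 DDR3 bandwidth, ~2.1 GB/s

ratio = gpu_bw / fpga_bw
gpu_ms = transfer_time_ms(weights_bytes, gpu_bw)
fpga_ms = transfer_time_ms(weights_bytes, fpga_bw)

print(f"bandwidth ratio: {ratio:.0f}x")
print(f"one pass over the weights, GPU:  {gpu_ms * 1_000:.2f} us")
print(f"one pass over the weights, FPGA: {fpga_ms:.2f} ms")
```

A single pass at peak DDR3 bandwidth costs only about 1.33 ms on the FPGA. Set against the 466.5 ms measured, that suggests the same data is fetched many times over, or that achieved bandwidth falls well below the quoted peak, or both. Either reading fits the many DDR round-trips of tiled convolution.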