Memory Architecture: GPU vs FPGA

🚀 GPU (Tesla T4)

Memory Hierarchy

L2 Cache 4 MB

GDDR6 Memory 16 GB

Bandwidth 320 GB/s

Compute Resources

CUDA Cores 2,560

Tensor Cores 320

Peak FP32 8.1 TFLOPS

Why It Won

✓ Massive parallel threads hide memory latency
✓ 320 GB/s bandwidth keeps compute fed
✓ 4 MB L2 cache holds most parameters
✓ Optimized CUDA libraries (cuDNN)
✓ Memory access pattern optimized for CNNs

⚡ FPGA (Zynq-7020)

Memory Hierarchy

On-chip BRAM 560 KB

DDR3 Memory 512 MB

Bandwidth ~2.1 GB/s

Compute Resources

DSP Slices 220

LUTs 53,200

Clock Speed 66.7 MHz

The Bottleneck

⚠ Only 560 KB on-chip BRAM
⚠ 2.8 MB model weights in DDR3
⚠ Single AXI master (sequential access)
⚠ 152× lower memory bandwidth than the GPU
⚠ Tiled convolution = many DDR round-trips
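
The first two warnings come down to a capacity check, sketched here with the figures from this section. Decimal units (1 MB = 1000 KB) are an assumption, and the variable names are illustrative only.

```python
# Figures quoted in this section; decimal units (1 MB = 1000 KB) are an assumption.
BRAM_KB = 560.0      # on-chip BRAM capacity
WEIGHTS_MB = 2.8     # model weights stored in DDR3

weights_kb = WEIGHTS_MB * 1000.0
oversize = weights_kb / BRAM_KB           # how many times larger than BRAM
fits_on_chip = weights_kb <= BRAM_KB      # can all weights live on chip?

print(f"weights are {oversize:.1f}x the BRAM capacity; fit on chip: {fits_on_chip}")
```

Because the weights are several times larger than BRAM, they have to be streamed from DDR3 in pieces, which is what forces the tiled design.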

FPGA Inference Flow (Single 8×8 Tile)

Fetch 8×8 input tile from DDR3 to BRAM

⏱ Memory bound - waiting on DDR

Fetch corresponding weights from DDR3

⏱ Memory bound - waiting on DDR

Compute convolution in FPGA fabric

⚡ Fast! DSP slices at work

Write output tile back to DDR3

⏱ Memory bound - waiting on DDR

Repeat for all tiles (a 32×32 feature map is 16 non-overlapping 8×8 tiles; across channels and layers that becomes hundreds of tile iterations)

🔁 Each iteration stalls on memory
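
The flow above can be sketched as a simple timing model for one tile. The bandwidth and clock come from this section. The tile sizes, cycle count, and per-transfer latency in the example are hypothetical placeholders, not measurements, and the model assumes the phases never overlap.

```python
# Values taken from the section above.
DDR_BYTES_PER_S = 2.1e9   # ~2.1 GB/s DDR3 bandwidth
CLOCK_HZ = 66.7e6         # 66.7 MHz fabric clock


def tile_phases(input_bytes, weight_bytes, output_bytes,
                compute_cycles, transfer_latency_s=0.0):
    """Time in seconds for each phase of one tile, with no overlap between phases.

    transfer_latency_s is a hypothetical fixed cost per DDR transfer
    (setup, arbitration); the section does not give a value.
    """
    def ddr(n_bytes):
        return transfer_latency_s + n_bytes / DDR_BYTES_PER_S

    return {
        "fetch_input": ddr(input_bytes),
        "fetch_weights": ddr(weight_bytes),
        "compute": compute_cycles / CLOCK_HZ,
        "write_output": ddr(output_bytes),
    }


def memory_fraction(phases):
    """Share of the tile's time spent in DDR transfers rather than compute."""
    total = sum(phases.values())
    return (total - phases["compute"]) / total


# Hypothetical tile: 8x8 float32 input and output (256 bytes each),
# 1 KiB of weights, 64 compute cycles, 1 microsecond per-transfer latency.
phases = tile_phases(256, 1024, 256, 64, transfer_latency_s=1e-6)
for name, seconds in phases.items():
    print(f"{name:14s} {seconds * 1e6:8.3f} us")
print(f"memory fraction: {memory_fraction(phases):.1%}")
```

Changing `transfer_latency_s` or the byte counts shows how quickly small, frequent transfers come to dominate a tile's runtime.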

The 58× Mystery Solved

HLS synthesis predicted: 7.98 ms minimum latency
Actual hardware measured: 466.5 ms
Gap: 58× slower than theoretical

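This gap is attributed to DDR3 memory accesses. If everything beyond the HLS estimate is memory stall, the FPGA spent (466.5 − 7.98) ÷ 466.5 ≈ 98.3% of execution time waiting for data transfers, not computing. With weights held in BRAM, latency should approach the theoretical 7.98 ms, but the 2.8 MB of weights would first have to shrink to fit the 560 KB available.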
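
The arithmetic behind these figures is short enough to check directly. This sketch assumes that all time beyond the HLS estimate is memory stall, which is this section's interpretation rather than a separate measurement.

```python
predicted_ms = 7.98    # HLS synthesis estimate
measured_ms = 466.5    # measured on hardware

slowdown = measured_ms / predicted_ms      # how far measured is from predicted
stall_ms = measured_ms - predicted_ms      # time not explained by compute
stall_fraction = stall_ms / measured_ms    # share of runtime spent stalled

print(f"slowdown:       {slowdown:.1f}x")
print(f"stall time:     {stall_ms:.2f} ms")
print(f"stall fraction: {stall_fraction:.1%}")
```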

💡 Key Insight: Memory Bandwidth Dominates Everything

The GPU didn't win because it computed faster — it won because it starved less.

GPU advantage: 320 GB/s ÷ 2.1 GB/s ≈ 152× more memory bandwidth

For this CNN inference workload, memory access patterns mattered more than raw compute power. The GPU's high bandwidth and on-chip caching kept its 2,560 cores fed with data. The FPGA's 220 DSP slices spent most of their time idle, waiting for the next DDR transfer.
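
To put the bandwidth ratio in concrete terms, this sketch estimates how long one pass over the 2.8 MB of weights takes at each device's quoted bandwidth. It assumes decimal units and sustained peak transfer rates, which real transfers rarely reach.

```python
GPU_BYTES_PER_S = 320e9    # Tesla T4 GDDR6 bandwidth quoted above
FPGA_BYTES_PER_S = 2.1e9   # Zynq-7020 DDR3 bandwidth quoted above
WEIGHT_BYTES = 2.8e6       # 2.8 MB of weights, decimal units assumed


def transfer_ms(n_bytes, bytes_per_s):
    """Idealized time in milliseconds to move n_bytes at a sustained rate."""
    return n_bytes / bytes_per_s * 1e3


ratio = GPU_BYTES_PER_S / FPGA_BYTES_PER_S
fpga_ms = transfer_ms(WEIGHT_BYTES, FPGA_BYTES_PER_S)
gpu_ms = transfer_ms(WEIGHT_BYTES, GPU_BYTES_PER_S)

print(f"bandwidth ratio: {ratio:.0f}x")
print(f"one pass over the weights: FPGA {fpga_ms:.3f} ms, GPU {gpu_ms:.5f} ms")
```

Even at peak rate, a single pass over the weights costs the FPGA about 1.3 ms. That is far less than the measured stall time, which suggests the tiled design re-fetches data many times or runs well below peak bandwidth. The figures in this section do not say which.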

Memory Architecture: The 360× Performance Gap
