Why GPU demolished FPGA in this implementation
HLS synthesis predicted: 7.98 ms minimum latency
Actual hardware measured: 466.5 ms
Gap: 58× slower than theoretical
This gap is entirely due to DDR3 memory accesses. The FPGA spent 98.5% of execution time
waiting for data transfers, not computing. With weights embedded in on-chip BRAM, we would
expect latency close to the theoretical 7.98 ms.
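As a quick sanity check on those numbers, here is a minimal sketch that recomputes the slowdown and the stall fraction implied by the two latencies. It assumes the compute portion takes exactly the HLS-predicted time, which is our simplification, and the function names are illustrative only. Under that assumption the implied stall fraction lands close to, though not exactly on, the 98.5% quoted above, which may have been measured differently.

```python
# Figures reported in the text above.
HLS_PREDICTED_MS = 7.98   # HLS synthesis estimate (minimum latency)
MEASURED_MS = 466.5       # latency measured on hardware


def slowdown(measured_ms: float, predicted_ms: float) -> float:
    """How many times slower the measured run is than the prediction."""
    return measured_ms / predicted_ms


def implied_stall_fraction(measured_ms: float, predicted_ms: float) -> float:
    """Fraction of run time not explained by compute.

    Assumes the compute portion takes exactly the HLS-predicted time
    (a simplifying assumption, not a measurement).
    """
    return (measured_ms - predicted_ms) / measured_ms


gap = slowdown(MEASURED_MS, HLS_PREDICTED_MS)
stall = implied_stall_fraction(MEASURED_MS, HLS_PREDICTED_MS)

print(f"slowdown: {gap:.1f}x")
print(f"implied stall fraction: {stall:.1%}")
```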
The GPU didn't win because it computed faster; it won because it starved less.
GPU advantage: 320 GB/s ÷ 2.1 GB/s ≈ 152× more memory bandwidth
For CNN inference, memory access patterns matter more than raw compute power.
The GPU's massive bandwidth and sophisticated caching kept its 2,560 cores fed with data.
The FPGA's 220 DSP slices spent most of their time idle, waiting for the next DDR transfer.
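To make the "starved, not slow" argument concrete, here is a rough roofline-style sketch: latency is bounded below by the slower of compute time and the time needed to stream the data. The two bandwidth figures come from the text. The 100 MiB traffic volume is a made-up placeholder, not a measurement from this project, and the model ignores caching, burst efficiency and overlap of transfers with compute.

```python
# Bandwidth figures quoted in the text above.
GPU_BW_GBPS = 320.0        # GPU memory bandwidth
FPGA_DDR3_BW_GBPS = 2.1    # FPGA DDR3 bandwidth
FPGA_COMPUTE_MS = 7.98     # HLS-predicted compute-only latency

# Hypothetical traffic volume, for illustration only (not a measured value).
EXAMPLE_BYTES = 100 * 1024 * 1024


def transfer_time_ms(num_bytes: int, bandwidth_gbps: float) -> float:
    """Time to stream num_bytes at a sustained bandwidth (1 GB/s = 1e9 B/s)."""
    return num_bytes / (bandwidth_gbps * 1e9) * 1e3


def bound_latency_ms(compute_ms: float, num_bytes: int, bandwidth_gbps: float) -> float:
    """Roofline-style lower bound: the slower of compute and memory traffic."""
    return max(compute_ms, transfer_time_ms(num_bytes, bandwidth_gbps))


ratio = GPU_BW_GBPS / FPGA_DDR3_BW_GBPS
fpga_transfer = transfer_time_ms(EXAMPLE_BYTES, FPGA_DDR3_BW_GBPS)
gpu_transfer = transfer_time_ms(EXAMPLE_BYTES, GPU_BW_GBPS)
fpga_bound = bound_latency_ms(FPGA_COMPUTE_MS, EXAMPLE_BYTES, FPGA_DDR3_BW_GBPS)

print(f"bandwidth ratio: {ratio:.0f}x")
print(f"FPGA transfer: {fpga_transfer:.2f} ms, GPU transfer: {gpu_transfer:.2f} ms")
print(f"FPGA latency bound: {fpga_bound:.2f} ms "
      f"({'memory' if fpga_bound > FPGA_COMPUTE_MS else 'compute'}-bound)")
```

In this simple model the same traffic takes about 152× longer over DDR3 than over the GPU's memory, whatever the volume, so once the transfer time exceeds 7.98 ms the DSP slices are waiting on memory and extra compute does not help.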