FPGA Resource Utilization Journey

From baseline to optimized: the resource tradeoffs that enabled a 49.8× speedup

❌ Baseline (Unoptimized)
  Resource        Used / Available     Utilization
  BRAM 18K        65 / 280 blocks      23%
  DSP48 Slices    59 / 220             26%
  LUTs            44,469 / 53,200      83%
  Flip-Flops      25,211 / 106,400     23%

❌ Problems:

  • 83% LUT utilization → routing congestion
  • Low BRAM usage → insufficient buffering
  • Clock: 10ns → tight timing
  • Latency: 397ms → memory-bound
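
The memory-bound behavior is easy to see in code. Below is a minimal sketch (hypothetical, not the project's actual kernel) of a naive 3×3 convolution in which every multiply-accumulate fetches both operands straight from the DDR-mapped arrays:

```cpp
// Hypothetical baseline kernel (illustrative only): with no on-chip
// buffering, each input pixel is re-fetched from off-chip memory up
// to nine times, so the loop is memory-bound no matter how many
// DSP slices are available.
void conv3x3_naive(const float *in, const float *w, float *out,
                   int H, int W) {
    for (int r = 0; r < H - 2; ++r)
        for (int c = 0; c < W - 2; ++c) {
            float acc = 0.0f;
            for (int kr = 0; kr < 3; ++kr)
                for (int kc = 0; kc < 3; ++kc)
                    // Two off-chip reads per MAC: the bottleneck.
                    acc += in[(r + kr) * W + (c + kc)] * w[kr * 3 + kc];
            out[r * (W - 2) + c] = acc;
        }
}
```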
✅ Optimized
  Resource        Used / Available     Utilization
  BRAM 18K        196 / 280 blocks     70%
  DSP48 Slices    32 / 220             14%
  LUTs            33,437 / 53,200      62%
  Flip-Flops      23,210 / 106,400     21%

✅ Optimizations Applied:

  • Loop pipelining with II=1
  • Array partitioning for parallel access
  • Tiled convolution with on-chip buffers
  • Clock relaxed to 15ns → timing closure
  • Dataflow pragma for layer pipelining
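
Combined, these optimizations follow a standard HLS line-buffer pattern. The sketch below uses invented function names and a made-up `MAX_W` bound (the pragmas are ignored by an ordinary C++ compiler): a line buffer keeps the two previous rows in BRAM, the 3×3 window is fully partitioned into registers, and the body pipelines at II=1 with a single off-chip read per iteration.

```cpp
#define MAX_W 64  // assumed maximum image width (invented bound)

// Sketch of the optimized pattern (hypothetical names): one DDR read
// per cycle feeds a BRAM line buffer and a fully partitioned window,
// so all nine MACs of each output can issue in parallel.
void conv3x3_opt(const float *in, const float w[9], float *out,
                 int H, int W) {
    float linebuf[2][MAX_W] = {};   // on-chip row cache -> BRAM
#pragma HLS ARRAY_PARTITION variable=linebuf complete dim=1
    float win[3][3] = {};           // sliding window -> registers
#pragma HLS ARRAY_PARTITION variable=win complete dim=0

    for (int r = 0; r < H; ++r)
        for (int c = 0; c < W; ++c) {
#pragma HLS PIPELINE II=1
            float px = in[r * W + c];      // one DDR read per cycle
            for (int i = 0; i < 3; ++i) {  // shift window left
                win[i][0] = win[i][1];
                win[i][1] = win[i][2];
            }
            win[0][2] = linebuf[0][c];
            win[1][2] = linebuf[1][c];
            win[2][2] = px;
            linebuf[0][c] = linebuf[1][c]; // rotate row cache
            linebuf[1][c] = px;
            if (r >= 2 && c >= 2) {        // window valid: 9 parallel MACs
                float acc = 0.0f;
                for (int i = 0; i < 3; ++i)
                    for (int j = 0; j < 3; ++j)
                        acc += win[i][j] * w[i * 3 + j];
                out[(r - 2) * (W - 2) + (c - 2)] = acc;
            }
        }
}
```

The DATAFLOW pragma from the list above would sit one level up, letting successive layer functions stream into each other through FIFOs; it is omitted here for brevity.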

📊 Optimization Impact

  Metric        Change                      Before → After
  Min Latency   49.8× faster                397ms → 7.98ms
  LUT Usage     −21 pts (reduction)         83% → 62%
  BRAM Usage    +47 pts (increase)          23% → 70%
The Key Tradeoff: We traded BRAM (23% → 70%) for LUT reduction (83% → 62%) and massive speedup (49.8×). On-chip buffering eliminated memory bottlenecks at the cost of using more block RAM. This is the classic FPGA optimization pattern: use precious on-chip memory to avoid slow DDR accesses.
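
That pattern can be reduced to a few lines: burst a block from DDR into a BRAM-backed buffer once, then make every further access on-chip. A hedged sketch with invented names and tile size (in HLS, a `memcpy` from a top-level pointer typically infers an AXI burst):

```cpp
#include <cstring>

#define TILE 1024  // assumed tile size in elements (invented)

// Classic buffer-then-reuse: each DDR word is fetched once in a burst,
// and all subsequent reads hit the BRAM-backed `buf` -- trading block
// RAM capacity for off-chip bandwidth.
float sum_of_squares_tiled(const float *ddr, int n) {
    float buf[TILE];              // on-chip buffer -> BRAM
    float total = 0.0f;
    for (int base = 0; base < n; base += TILE) {
        int len = (n - base < TILE) ? (n - base) : TILE;
        std::memcpy(buf, ddr + base, len * sizeof(float));  // burst read
        for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
            total += buf[i] * buf[i];  // BRAM read, not DDR
        }
    }
    return total;
}
```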

Resource Utilization Comparison

Latency Journey

💡 Key Lessons