FPGA Design Workflow

Complete pipeline from PyTorch training to PYNQ deployment

Step 1 · 🧠 Model Training & Validation (PyTorch on Google Colab, Tesla T4)
Train the ReducedVGG model on the CIFAR-10 dataset using PyTorch: a standard deep learning workflow with data augmentation, the Adam optimizer, and early stopping. A minimal training sketch follows the metrics below.
Architecture: ReducedVGG (1.44M params)
Test Accuracy: 86.94%
Training Time: ~20 epochs, 15 minutes
Output: reduced_vgg_best.pth
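As a rough illustration only (the exact hyperparameters, the ReducedVGG class definition, and the evaluate() helper are assumptions, not taken from this project), the training loop looks roughly like this:

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# Data augmentation on the training set (standard CIFAR-10 recipe)
transform_train = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=transform_train)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ReducedVGG().to(device)                      # assumed model class (~1.44M params)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_acc, patience, bad_epochs = 0.0, 5, 0
for epoch in range(50):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    acc = evaluate(model)                            # assumed helper: test-set accuracy
    if acc > best_acc:                               # early stopping: save best, stop on plateau
        best_acc, bad_epochs = acc, 0
        torch.save(model.state_dict(), "reduced_vgg_best.pth")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break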
Step 2 · 🔢 Post-Training Quantization (PyTorch Quantization API)
Apply INT16 weight-only quantization to shrink the model and prepare it for FPGA fixed-point arithmetic. Twelve quantization configurations were tested to find the best accuracy/efficiency tradeoff.
# Quantize to INT32, then convert to INT16
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear, torch.nn.Conv2d}, dtype=torch.qint32
)

# Export weights as binary files
for name, tensor in model.state_dict().items():
    tensor.numpy().tofile(f"params_int32/{name}.bin")
Quantization: INT32 → INT16
Accuracy Loss: 1.35% (86.94% → 85.59%)
Memory Reduction: 2× smaller (FP32 → INT16)
Output: 60 .bin weight files (~2.8 MB)
Step 3 · 🔄 Fixed-Point Conversion (Python script: convert_weights.py)
Convert the INT32 quantized weights to the ap_fixed<16,12> (Q12.4) format used by the HLS code. This rescales each value from a scale factor of 65536 (2^16) down to 16 (2^4 fractional bits). A batch-conversion usage sketch follows the metrics below.
import numpy as np

def convert_int32_to_q12_4(int32_data):
    # Wrapper name is illustrative; the logic lives in convert_weights.py.
    # INT32 weights use a scale of 65536 (2^16);
    # ap_fixed<16,12> has 4 fractional bits, i.e. a scale of 16 (2^4).
    fp32_data = int32_data / 65536.0
    fixed_data = np.round(fp32_data * 16.0)
    fixed_data = np.clip(fixed_data, -32768, 32767)
    return fixed_data.astype(np.int16)
Format: ap_fixed<16,12> (Q12.4)
Range: -2048 to +2047.9375
Precision: 4 fractional bits (resolution 0.0625)
Output: 48 parameter .bin files
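As a usage sketch (the params_fixed output directory is an assumption; params_int32 matches the export path in Step 2), the converter above can be applied to every exported weight file:

import os
import numpy as np

os.makedirs("params_fixed", exist_ok=True)
for fname in os.listdir("params_int32"):                     # INT32 .bin files from Step 2
    if not fname.endswith(".bin"):
        continue
    int32_data = np.fromfile(os.path.join("params_int32", fname), dtype=np.int32)
    fixed = convert_int32_to_q12_4(int32_data)               # converter defined above
    fixed.tofile(os.path.join("params_fixed", fname))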
Step 4 · ⚙️ HLS C++ Implementation (Vitis HLS 2022.2)
Write the hardware accelerator in C++ with Vitis HLS: tiled convolution, batch normalization, pooling, and fully connected layers, all implemented in fixed-point arithmetic.
void reduced_vgg_inference(
    const fm_t input[32*32*3],
    fm_t output[10],
    const wt_t *W1,   // 96 weight pointers
    ...
) {
    #pragma HLS INTERFACE m_axi port=W1 bundle=gmem
    #pragma HLS INTERFACE s_axilite port=return

    // Tiled convolution with on-chip buffering
    tiled_conv_8x8(input, output, W1, B1);
}
C Simulation: ✓ Passed (MSE = 0)
Synthesis Time: ~15 minutes
Clock Target: 15 ns (66.7 MHz)
Output: IP catalog RTL
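The MSE = 0 C-simulation result implies a bit-exact match against a software reference. A minimal sketch of such a Q12.4 comparison in NumPy (the file names and dump format are assumptions, not this project's testbench):

import numpy as np

def to_q12_4(x):
    # Quantize a float array to ap_fixed<16,12>: 4 fractional bits, saturated to the 16-bit range
    return np.clip(np.round(x * 16.0), -32768, 32767) / 16.0

# Compare the HLS C-simulation output against the quantized software reference
hls_out = np.fromfile("csim_output.bin", dtype=np.int16) / 16.0   # assumed csim dump
ref_out = to_q12_4(np.load("reference_logits.npy"))               # assumed float reference
mse = np.mean((hls_out - ref_out) ** 2)
print(f"MSE = {mse}")   # 0.0 indicates a bit-exact implementation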
Step 5 · 🔧 Vivado Block Design (Vivado 2022.2)
Integrate the HLS IP with the Zynq Processing System (PS). The accelerator connects to the ARM cores over AXI: an AXI-Lite slave for control and an AXI master for DDR memory access.
Block Design: PS7 + HLS IP + AXI interconnect
Synthesis: ~20 minutes
Implementation: ~45 minutes
Timing: ✓ Closure (WNS: +0.183 ns)
Step 6 · 💾 Bitstream Generation (Vivado Implementation)
Generate the final FPGA configuration bitstream and the hardware handoff (.hwh) file needed for PYNQ deployment.
Bitstream Size: 45 MB
Power: 1.451 W (41.7°C junction)
Resources: 70% BRAM, 62% LUT, 14% DSP
Output: .bit + .hwh files
Step 7 · 🎯 PYNQ Deployment & Testing (Python on PYNQ-Z2 Board)
Load the bitstream on the PYNQ-Z2 board, configure the AXI registers, load the parameters into DDR, and run inference on test images. A sketch of the DDR parameter loading follows the metrics below.
from pynq import Overlay

overlay = Overlay("design_1_wrapper.bit")
accelerator = overlay.reduced_vgg_inference_0

# Configure 96 AXI address registers
for i, addr in enumerate(param_addresses):
    accelerator.write(0x10 + i*8, addr)

# Run inference
accelerator.write(0x00, 0x01)                # Start
while accelerator.read(0x00) & 0x04 == 0:
    pass                                     # Wait for done
Latency: 466.53 ms
Throughput: 2.14 img/s
Accuracy: 85.59%
Power: 1.451 W
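The register writes in the code above expect the parameters to already be resident in DDR. A minimal sketch of that step using pynq.allocate (the param_files list is an assumption, and the buffer attributes assume a recent PYNQ release; this is not the project's exact driver):

import numpy as np
from pynq import allocate

# Allocate contiguous DDR buffers and copy each Q12.4 .bin file into them
param_buffers = []
param_addresses = []
for fname in param_files:                        # assumed list of .bin parameter paths
    data = np.fromfile(fname, dtype=np.int16)
    buf = allocate(shape=(data.size,), dtype=np.int16)
    buf[:] = data
    buf.flush()                                  # push the data out to DDR
    param_buffers.append(buf)                    # keep references so buffers stay allocated
    param_addresses.append(buf.physical_address) # physical address written to the AXI registers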

⏱️ Development Timeline

Model Training: ~1 day (including architecture experiments)
HLS Development: ~1 week (C++ implementation + optimization)
Vivado Integration: ~3 days (block design + synthesis)
PYNQ Deployment: ~2 days (Python driver + testing)
Debugging: ~1 week (fixing timing/memory issues)
Total: ~3 weeks (for someone new to FPGAs)

🎓 Key Takeaways