FPGA Design Workflow

Complete pipeline from PyTorch training to PYNQ deployment

Step 1 · 🧠 Model Training & Validation (PyTorch on Google Colab, Tesla T4)
Train the ReducedVGG model on the CIFAR-10 dataset using PyTorch: a standard deep learning workflow with data augmentation, the Adam optimizer, and early stopping. A minimal training sketch follows the metrics below.
Architecture: ReducedVGG (1.44M params)
Test Accuracy: 86.94%
Training Time: ~20 epochs, 15 minutes
Output: reduced_vgg_best.pth
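As a rough illustration only (the exact hyperparameters, the ReducedVGG class definition, and the evaluate() helper are assumptions, not taken from this project), the training loop looks roughly like this:

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# Data augmentation on the training set (standard CIFAR-10 recipe)
transform_train = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=transform_train)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ReducedVGG().to(device)                      # assumed model class (~1.44M params)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_acc, patience, bad_epochs = 0.0, 5, 0
for epoch in range(50):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    acc = evaluate(model)                            # assumed helper: test-set accuracy
    if acc > best_acc:                               # early stopping: save best, stop on plateau
        best_acc, bad_epochs = acc, 0
        torch.save(model.state_dict(), "reduced_vgg_best.pth")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break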
Step 2 · 🔢 Post-Training Quantization (PyTorch Quantization API)
Apply INT16 weight-only quantization to shrink the model and prepare it for FPGA fixed-point arithmetic. Twelve quantization configurations were tested to find the best accuracy/efficiency tradeoff.
# Quantize to INT32, then convert to INT16
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear, torch.nn.Conv2d}, dtype=torch.qint32
)

# Export weights as binary files
for name, tensor in model.state_dict().items():
    tensor.numpy().tofile(f"params_int32/{name}.bin")
Quantization: INT32 → INT16
Accuracy Loss: 1.35% (86.94% → 85.59%)
Memory Reduction: 2× smaller (FP32 → INT16)
Output: 60 .bin weight files (~2.8 MB)
Step 3 · 🔄 Fixed-Point Conversion (Python script: convert_weights.py)
Convert the INT32 quantized weights to the ap_fixed<16,12> (Q12.4) format used by the HLS code. This rescales each value from a scale factor of 65536 (2^16) down to 16 (2^4 fractional bits). A batch-conversion usage sketch follows the metrics below.
import numpy as np

def convert_int32_to_q12_4(int32_data):
    # Wrapper name is illustrative; the logic lives in convert_weights.py.
    # INT32 weights use a scale of 65536 (2^16);
    # ap_fixed<16,12> has 4 fractional bits, i.e. a scale of 16 (2^4).
    fp32_data = int32_data / 65536.0
    fixed_data = np.round(fp32_data * 16.0)
    fixed_data = np.clip(fixed_data, -32768, 32767)
    return fixed_data.astype(np.int16)
Format: ap_fixed<16,12> (Q12.4)
Range: -2048 to +2047.9375
Precision: 4 fractional bits (resolution 0.0625)
Output: 48 parameter .bin files
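As a usage sketch (the params_fixed output directory is an assumption; params_int32 matches the export path in Step 2), the converter above can be applied to every exported weight file:

import os
import numpy as np

os.makedirs("params_fixed", exist_ok=True)
for fname in os.listdir("params_int32"):                     # INT32 .bin files from Step 2
    if not fname.endswith(".bin"):
        continue
    int32_data = np.fromfile(os.path.join("params_int32", fname), dtype=np.int32)
    fixed = convert_int32_to_q12_4(int32_data)               # converter defined above
    fixed.tofile(os.path.join("params_fixed", fname))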
Step 4 · ⚙️ HLS C++ Implementation (Vitis HLS 2022.2)
Write the hardware accelerator in C++ with Vitis HLS: tiled convolution, batch normalization, pooling, and fully connected layers, all implemented in fixed-point arithmetic.
void reduced_vgg_inference(
    const fm_t input[32*32*3],
    fm_t output[10],
    const wt_t *W1,   // 96 weight pointers
    ...
) {
    #pragma HLS INTERFACE m_axi port=W1 bundle=gmem
    #pragma HLS INTERFACE s_axilite port=return

    // Tiled convolution with on-chip buffering
    tiled_conv_8x8(input, output, W1, B1);
}
C Simulation: ✓ Passed (MSE = 0)
Synthesis Time: ~15 minutes
Clock Target: 15 ns (66.7 MHz)
Output: IP catalog RTL
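The MSE = 0 C-simulation result implies a bit-exact match against a software reference. A minimal sketch of such a Q12.4 comparison in NumPy (the file names and dump format are assumptions, not this project's testbench):

import numpy as np

def to_q12_4(x):
    # Quantize a float array to ap_fixed<16,12>: 4 fractional bits, saturated to the 16-bit range
    return np.clip(np.round(x * 16.0), -32768, 32767) / 16.0

# Compare the HLS C-simulation output against the quantized software reference
hls_out = np.fromfile("csim_output.bin", dtype=np.int16) / 16.0   # assumed csim dump
ref_out = to_q12_4(np.load("reference_logits.npy"))               # assumed float reference
mse = np.mean((hls_out - ref_out) ** 2)
print(f"MSE = {mse}")   # 0.0 indicates a bit-exact implementation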
Step 5 · 🔧 Vivado Block Design (Vivado 2022.2)
Integrate the HLS IP with the Zynq Processing System (PS). The accelerator connects to the ARM cores over AXI: an AXI-Lite slave for control and an AXI master for DDR memory access.
Block Design: PS7 + HLS IP + AXI interconnect
Synthesis: ~20 minutes
Implementation: ~45 minutes
Timing: ✓ Closure (WNS: +0.183 ns)
Step 6 · 💾 Bitstream Generation (Vivado Implementation)
Generate the final FPGA configuration bitstream and the hardware handoff (.hwh) file needed for PYNQ deployment.
Bitstream Size: 45 MB
Power: 1.451 W (41.7°C junction)
Resources: 70% BRAM, 62% LUT, 14% DSP
Output: .bit + .hwh files
Step 7 · 🎯 PYNQ Deployment & Testing (Python on PYNQ-Z2 Board)
Load the bitstream on the PYNQ-Z2 board, configure the AXI registers, load the parameters into DDR, and run inference on test images. A sketch of the DDR parameter loading follows the metrics below.
from pynq import Overlay

overlay = Overlay("design_1_wrapper.bit")
accelerator = overlay.reduced_vgg_inference_0

# Configure 96 AXI address registers
for i, addr in enumerate(param_addresses):
    accelerator.write(0x10 + i*8, addr)

# Run inference
accelerator.write(0x00, 0x01)                # Start
while accelerator.read(0x00) & 0x04 == 0:
    pass                                     # Wait for done
Latency: 466.53 ms
Throughput: 2.14 img/s
Accuracy: 85.59%
Power: 1.451 W
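The register writes in the code above expect the parameters to already be resident in DDR. A minimal sketch of that step using pynq.allocate (the param_files list is an assumption, and the buffer attributes assume a recent PYNQ release; this is not the project's exact driver):

import numpy as np
from pynq import allocate

# Allocate contiguous DDR buffers and copy each Q12.4 .bin file into them
param_buffers = []
param_addresses = []
for fname in param_files:                        # assumed list of .bin parameter paths
    data = np.fromfile(fname, dtype=np.int16)
    buf = allocate(shape=(data.size,), dtype=np.int16)
    buf[:] = data
    buf.flush()                                  # push the data out to DDR
    param_buffers.append(buf)                    # keep references so buffers stay allocated
    param_addresses.append(buf.physical_address) # physical address written to the AXI registers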

⏱️ Development Timeline

Model Training: ~1 day (including architecture experiments)
HLS Development: ~1 week (C++ implementation + optimization)
Vivado Integration: ~3 days (block design + synthesis)
PYNQ Deployment: ~2 days (Python driver + testing)
Debugging: ~1 week (fixing timing/memory issues)
Total: ~3 weeks (for someone new to FPGAs)

🎓 Key Takeaways