S10 NX Vector Mode Support

Marius Stan requested to merge marius_development into master

This commit contains changes to allow HPIPE to run on the Stratix 10 NX. All changes require the following compile flags to work: --arch architectures/s10_NX.json --device devices/s10_NX.json

The full list of changes is as follows:

  • The majority of convolutions are mapped to the tensor block in vector mode (six 8-bit multiplications plus an accumulate).
    • Tensor mode has different latencies from the regular DSP block; cascading in particular has an extra pipeline register
    • Accumulating and cascading cannot be used at the same time, so an external register is instantiated
  • Depthwise convolutions are packed into the tensor block in scalar mode (3 independent 8-bit multiplications)
    • This mode has similar latencies and accumulate/cascade restrictions to vector mode
  • Chains are broken using the tensor block in fp32 accumulate mode (which allows addition of 2 independent 32-bit FP numbers)
    • Note that we only use this mode to pass a high-precision value to the start of the next carry chain; the value we pass in is actually fixed point, as the addition function is not used
    • Care must be taken to ensure a valid fp32 number is used (i.e. 0 and NaN must be avoided). This is done by modifying the "exponent", relying on the fact that the top bits won't be used further down the chain anyway
  • Mean module, which requires a higher precision multiplier, is implemented using an Intel IP for the NX. It uses roughly 7 tensor blocks
  • Support for adding static Verilog files (from the digitallogic/verilog_files/ folder) to the HPIPE output folder and Quartus includes list.
    • These include hand-written files to model the tensor block behaviour in simulation without the need for the encrypted Intel IP
  • Planner was updated to account for the extra multipliers available inside each tensor block
  • Added support for balancing M20K and MLAB usage for larger designs on the NX, as they run out of MLABs (beyond a certain threshold, memories are converted to M20Ks instead)
  • Committed NX driver and some of the core project files
    • Driver was based off of the Gidel DMA example
    • Quartus project was generated using Gidel's Procwizard and then modified to match the HPIPE spec (Gidel uses a slightly different method for writing data)
    • Note that the programmable PLLs are not supported in the current project
  • Signal Tap tutorial was added to documentation
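
To make the vector-mode arithmetic described above concrete, here is a minimal behavioural sketch in Python. It is not the HPIPE implementation or the hand-written Verilog model: the function names, the signed 8-bit operand range, and the zero-padding of partial groups are assumptions made for illustration, and latencies and pipeline registers are not modelled. It only shows six 8-bit products being summed with a cascaded value from the previous block.

```python
LANES = 6  # vector mode: six 8-bit multiplications per tensor block


def _check_operands(values):
    """Validate one group of operands (signed 8-bit range is an assumption)."""
    if len(values) != LANES:
        raise ValueError(f"expected {LANES} operands, got {len(values)}")
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError(f"{v} does not fit in a signed 8-bit operand")


def vector_mode_dot(activations, weights, cascade_in=0):
    """Hypothetical model of one block: cascade_in plus six 8-bit products."""
    _check_operands(activations)
    _check_operands(weights)
    return cascade_in + sum(a * w for a, w in zip(activations, weights))


def chained_dot(activations, weights):
    """Dot product over a longer vector by cascading blocks, LANES per block."""
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    total = 0
    for start in range(0, len(activations), LANES):
        a = list(activations[start:start + LANES])
        w = list(weights[start:start + LANES])
        # Pad the final partial group with zeros (illustrative choice).
        pad = LANES - len(a)
        a += [0] * pad
        w += [0] * pad
        total = vector_mode_dot(a, w, cascade_in=total)
    return total


if __name__ == "__main__":
    print(chained_dot([1, 2, 3, 4, 5, 6, 7, 8], [2] * 8))
```

A bit-accurate model would also need the accumulator width and the cascade pipeline register mentioned above, which are left out here.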

Testing:

  • NX unit tests for Mobilenet-v1 were added to the tests that run on every commit
  • Mobilenet-v1 was tested in both simulation and on the FPGA for performance and accuracy
    • Top-1: 66.49% Top-5: 87.02% (GX 16-bit reference is Top-1: 71.69% Top-5: 90.16%)
    • Scaled throughput: 6207 images/s (1.79x speedup)
  • Mobilenet-v2 was tested in both simulation and on the FPGA for performance and accuracy
    • Top-1: 61.79% Top-5: 84.08% (Training ongoing, currently at Top-1: 66.75% and Top-5: 88.16%)
    • Unscaled throughput is capped at 5100 images/s. The bottleneck is due to a PCIe hardware configuration issue, as simulation achieves 10,300 images/s with similar parameters
  • Resnet-50 uses up most of the RAMs even at smaller tensor block counts and will therefore not be used on the NX. It will be replaced by Mobilenet-v3
  • Re-testing of networks on the GX is in progress
    • Resnet-50 achieves 3277 images/s with 4800 DSPs @ 450 MHz (this is a fair amount lower than the thesis results, but this is because it uses too many memories and achieves poor timing) and an accuracy of Top-1: 71.65% Top-5: 90.58%
    • Mobilenet-v1 achieves 5548 images/s with 5100 DSPs @ 450 MHz (higher than thesis results even once scaled down to 430 MHz) and an accuracy of Top-1: 71.68% Top-5: 90.15%
    • Mobilenet-v2 achieves 6375 images/s with 3500 DSPs and an accuracy of Top-1: 71.80% Top-5: 90.20% (matches Mohammed's results)
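
For readers comparing the GX numbers above against results measured at a different clock, the following sketch shows one way to scale a throughput figure between frequencies. It assumes throughput scales linearly with clock frequency, which is an assumption of this example rather than something stated above; the function name is hypothetical.

```python
def scale_throughput(images_per_s, measured_mhz, target_mhz):
    """Scale a measured throughput to another clock, assuming linear scaling."""
    return images_per_s * target_mhz / measured_mhz


if __name__ == "__main__":
    # Mobilenet-v1 on GX: 5548 images/s measured at 450 MHz, scaled to 430 MHz.
    scaled = scale_throughput(5548, 450, 430)
    print(f"{scaled:.0f} images/s")  # prints "5301 images/s"
```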
Edited by Marius Stan