Next Generation FPGA Arithmetic for AI

Martin Langhammer
Intel, GB

ABSTRACT

The most recent FPGA architectures have introduced new levels of embedded floating-point performance, with tens of TFLOPS now available across a wide range of device sizes. The last two generations of FPGAs have introduced IEEE 754 single-precision (FP32) arithmetic, delivering up to 10 TFLOPS. The emergence of AI/Machine Learning as the highest-profile FPGA application has shifted the focus from the signal processing and embedded calculations served by FP32 to smaller floating-point precisions, such as BFLOAT16 for training and FP16 for inference. In this talk, we will describe the architecture and development of the Intel Agilex DSP Block, which contains an FP32 multiplier-adder pair that can be decomposed into two smaller-precision pairs supporting FP16, BFLOAT16, and a third, proprietary format which can be used for both training and inference. At the edge, where even lower-precision arithmetic is required for inference, new FPGA EDA flows can implement 100+ TFLOPS of soft logic-based compute power. In the second half of our talk, we will describe new synthesis, clustering, and packing methodologies, collectively known as Fractal Synthesis, that allow an unprecedented near-100% logic utilization of the FPGA for arithmetic while maintaining the clock rates of a small example design. The soft logic and embedded arithmetic capabilities can be used simultaneously, making the FPGA the most flexible AI platform, and among the highest performing.
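
To make the precision trade-off concrete: FP32 uses a 1-bit sign, 8-bit exponent, and 23-bit mantissa; FP16 uses 1/5/10; BFLOAT16 uses 1/8/7, keeping FP32's exponent range at reduced mantissa precision. The following C sketch is illustrative only (it is not material from the talk) and shows the standard round-to-nearest-even conversion from FP32 to BFLOAT16; NaN special-casing is omitted for brevity.

    #include <stdint.h>
    #include <string.h>

    /* Convert FP32 to BFLOAT16 by rounding the 32-bit pattern to its
     * top 16 bits (round-to-nearest-even). NaN handling is omitted. */
    static uint16_t fp32_to_bfloat16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);        /* reinterpret FP32 bit pattern */
        bits += 0x7FFFu + ((bits >> 16) & 1u); /* round to nearest even        */
        return (uint16_t)(bits >> 16);         /* sign + 8-bit exp + 7-bit man */
    }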
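A behavioral view of the decomposable multiplier-adder pair may also help. The sketch below is a hypothetical software model under assumed semantics: one FP32 multiply-add in full-precision mode, or two independent BFLOAT16 multiply-adds accumulated at FP32 precision in decomposed mode. The function names, operand grouping, and internal rounding are our assumptions for illustration, not the Agilex DSP Block's actual interface.

    #include <stdint.h>
    #include <string.h>

    /* Widen a BFLOAT16 value to FP32 by restoring its bit positions. */
    static float bfloat16_to_fp32(uint16_t h)
    {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    /* Full-precision mode: one FP32 multiply-add per block. */
    static float dsp_fp32_madd(float a, float b, float c)
    {
        return a * b + c;
    }

    /* Decomposed mode: the same block modeled as two independent
     * BFLOAT16 multiply-adds, doubling throughput at lower precision.
     * Accumulating at FP32 precision is an assumption of this model. */
    static void dsp_dual_bf16_madd(const uint16_t a[2], const uint16_t b[2],
                                   float acc[2])
    {
        for (int i = 0; i < 2; i++)
            acc[i] += bfloat16_to_fp32(a[i]) * bfloat16_to_fp32(b[i]);
    }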