12.6 Reconfigurable Computing Platforms and Architectures

Time

Label

Presentation Title
Authors

16:00

12.6.1

SECURING THE CLOUD WITH RECONFIGURABLE COMPUTING: AN FPGA ACCELERATOR FOR HOMOMORPHIC ENCRYPTION
Speaker:
Alessandro Cilardo, University of Naples Federico II, IT
Authors:
Alessandro Cilardo and Domenico Argenziano, University of Naples Federico II, IT
Abstract
A hot topic in current cloud security research, homomorphic encryption is a recently introduced technique that allows computation to take place on encrypted data. This work presents the architecture and implementation of a dedicated FPGA-based accelerator addressing the prohibitive computing demand of homomorphic encryption. In particular, the accelerator targets the most time-consuming operation used by the encryption primitive, large integer multiplication. Based on an Altera Stratix V FPGA platform, the prototype implementation achieves significant improvements in execution time, at a comparable hardware cost, over alternative solutions previously presented in the technical literature.
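To illustrate why large integer multiplication dominates the cost, here is a minimal software sketch of a divide-and-conquer multiplier. The abstract does not say which multiplication algorithm the accelerator implements, so Karatsuba is used here purely as a stand-in; the function name and the `cutoff_bits` parameter are illustrative assumptions, not taken from the paper.

```python
def karatsuba(x: int, y: int, cutoff_bits: int = 64) -> int:
    """Multiply two non-negative integers by recursive splitting.

    Illustrative stand-in only: the accelerator's actual algorithm is
    not described in the abstract.
    """
    # Small operands: fall back to the native multiply.
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y

    # Split both operands at half the width of the larger one.
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask

    # Three recursive products instead of the four used by schoolbook.
    lo = karatsuba(x_lo, y_lo, cutoff_bits)
    hi = karatsuba(x_hi, y_hi, cutoff_bits)
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi, cutoff_bits) - lo - hi

    # Recombine: x*y = hi * 2^(2*half) + mid * 2^half + lo
    return (hi << (2 * half)) + (mid << half) + lo
```

Each level of recursion produces independent sub-products, which is the kind of structure a hardware design can evaluate in parallel; operands in homomorphic encryption schemes can be far wider than the few thousand bits a quick software test would use.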

16:30

12.6.2

THROUGHPUT ORIENTED FPGA OVERLAYS USING DSP BLOCKS
Speaker:
Douglas L. Maskell, Nanyang Technological University, SG
Authors:
Abhishek K. Jain¹, Douglas L. Maskell¹ and Suhaib A. Fahmy²
¹Nanyang Technological University, SG; ²University of Warwick, GB
Abstract
Design productivity is a major concern preventing the mainstream adoption of FPGAs. Overlay architectures have emerged as one possible solution to this challenge, offering fast compilation and software-like programmability. However, overlays typically suffer from area and performance overheads due to limited consideration for the underlying FPGA architecture. These overlays have often been of limited size, supporting only relatively small compute kernels. This paper examines the possibility of developing larger, more efficient overlays using multiple DSP blocks and then maximising utilisation by mapping multiple instances of kernels simultaneously onto the overlay to exploit kernel-level parallelism. We show a significant improvement in achievable overlay size and overlay utilisation, with a reduction of almost 70% in the overlay tile requirement compared to existing overlay architectures, an operating frequency in excess of 300 MHz, and kernel throughputs of almost 60 GOPS.

17:00

12.6.3

RUN-TIME PHASE PREDICTION FOR A RECONFIGURABLE VLIW PROCESSOR
Speaker:
Stephan Wong, TUDelft, NL
Authors:
Qi Guo¹, Anderson Sartor², Anthony Brandon³, Xuehai Zhou¹ and Stephan Wong³
¹University of Science and Technology of China, CN; ²Universidade Federal do Rio Grande do Sul (UFRGS), BR; ³TUDelft, NL
Abstract
It is well known that different applications exhibit varying amounts of ILP. Executing these applications on the same fixed-width VLIW processor will result either (1) in wasted energy due to underutilized resources, if the issue-width of the processor is larger than the inherent ILP, or (2) in lower performance, if the issue-width is smaller than the inherent ILP. Moreover, even within a single application, distinct phases can be observed with varying ILP and therefore changing resource requirements. With this in mind, we designed the rVEX processor, a VLIW processor that can change its issue-width at run-time. In this paper, we propose a novel scheme to dynamically (i.e., at run-time) optimize resource utilization by predicting and matching the number of active data-paths for each application phase. The purpose is to achieve low energy consumption for applications with low ILP, and high performance for applications with high ILP, on a single VLIW processor design. We prototyped the rVEX processor on an FPGA and obtained the dynamic traces of applications running on top of a Linux port. Our results show that it is possible in some cases to achieve the performance of an 8-issue core with 10% lower energy consumption, while in others we achieve the energy consumption of a 2-issue core with close to 20% lower execution time.
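The idea of matching the number of active data-paths to each phase can be sketched as follows. The abstract does not describe the prediction scheme itself, so this example substitutes a simple exponential-moving-average predictor. The names `ISSUE_WIDTHS`, `choose_width`, `predict_widths` and the `alpha` parameter are hypothetical, and the 4-issue configuration is an assumption (the abstract mentions only 2-issue and 8-issue cores).

```python
# Assumed configurations; the abstract names only the 2- and 8-issue cores.
ISSUE_WIDTHS = (2, 4, 8)


def choose_width(ilp: float) -> int:
    """Return the smallest assumed issue width that covers the given ILP."""
    for width in ISSUE_WIDTHS:
        if ilp <= width:
            return width
    return ISSUE_WIDTHS[-1]


def predict_widths(ilp_trace, alpha: float = 0.5):
    """Pick an issue width for each interval from past ILP measurements.

    Hypothetical predictor: an exponential moving average of the ILP
    observed in earlier intervals, not the scheme from the paper.
    """
    estimate = None
    widths = []
    for ilp in ilp_trace:
        # Decide before this interval's ILP is known, as a run-time
        # predictor must; start at the widest configuration.
        if estimate is None:
            widths.append(ISSUE_WIDTHS[-1])
            estimate = ilp
        else:
            widths.append(choose_width(estimate))
            estimate = alpha * ilp + (1 - alpha) * estimate
    return widths


# A low-ILP phase followed by a high-ILP phase.
print(predict_widths([1.5, 1.5, 1.5, 6.0, 6.0, 6.0]))  # [8, 2, 2, 2, 4, 8]
```

In this toy trace the predictor narrows the core during the low-ILP phase and needs two intervals to widen again once ILP rises, which shows the trade-off any such predictor faces between saving energy and reacting quickly to a phase change.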

17:30

End of session

Visit us at DATE 2016