4.2 Reconfigurable Architecture and Tools

Printer-friendly version PDF version

Date: Tuesday, March 26, 2019
Time: 17:00 - 18:30
Location / Room: Room 2

Smail Niar, Université Polytechnique Hauts-de-France, FR, Contact Smail Niar

Lars Bauer, Karlsruhe Institute of Technology, DE, Contact Lars Bauer

This session presents three papers that improved application mapping onto coarse-grained reconfigurable arrays, thermal aware application mapping for FPGAs, and hardware security coprocessor for reconfigurable CPUs, along with two interactive presentations that introduce novel hardware accelerator interfaces to multi-core CPUs and approximate arithmetic components.

TimeLabelPresentation Title
Satyajit Das, Univ. Bretagne-Sud, CNRS UMR 6285, Lab-STICC, FR
Satyajit Das1, Kevin Martin2 and Philippe Coussy3
1Université de Bretagne-Sud, FR; 2Univ. Bretagne Sud, FR; 3Universite de Bretagne-Sud / Lab-STICC, FR
Coarse Grained Reconfigurable Arrays (CGRAs) are emerging as low power computing alternative providing a high grade of acceleration. However, the area and energy efficiency of these devices are bottlenecked by the configuration/context memory when they are made autonomous and loosely coupled with CPUs. The size of these instruction memories is of prime importance due to their high area and impact on the power consumption. For instance, a 64-word instruction memory typically represents 40% of a processing element area. In this context, since traditional mapping approaches do not take the size of the context memory into account, CGRAs often become oversized which strongly degrade their performance and interest. In this paper, we propose a context memory aware mapping for CGRAs to achieve better area and energy efficiency. This paper motivates the need of constraining the size of the context memory inside the processing element (PE) for ultra low power acceleration. It also describes the mapping approach which tries to find at least one mapping solution for a given set of constraints defined by the context memories of the PEs. Experiments show that our proposed solution achieves an average of 2.3× energy gain (with a maximum of 3.1× and a minimum of 1.4×) compared to the mapping approach without the memory constraints, while using 2× less instruction memory. When compared to the CPU, the proposed mapping achieves an average of 14× (with a maximum of 23× and minimum of 5×) energy gain.
Tajana Rosing, University of California, San Diego, US
Behnam Khaleghi1 and Tajana Rosing2
1University of California, San Diego, US; 2UCSD, US
To ensure reliable operation of circuits under elevated temperatures, designers are obliged to put a pessimistic timing margin proportional to the worst-case temperature (T worst ), which incurs significant performance overhead. The problem is exacerbated in deep-CMOS technologies with increased leakage power, particularly in Field-Programmable Gate Arrays (FPGAs) that comprise an abundance of leaky resources. We propose a two-fold approach to tackle the problem in FPGAs. For this end, we first obtain the performance and power characteristics of FPGA resources in a temperature range. Having the temperature-performance correlation of resources together with the estimated thermal distribution of applications makes it feasible to apply minimal, yet sufficient, timing margin. Second, we show how optimizing an FPGA device for a specific thermal corner affects its performance in the operating temperature range. This emphasizes the need for optimizing the device according to the target (range of) temperature. Building upon this observation, we propose thermal-aware optimization of FPGA architecture for foreknown field conditions. We performed a comprehensive set of experiments to implement and examine the proposed techniques. The experimental results reveal that thermal-aware timing on FPGAs yields up to 36.5% performance improvement. Optimizing the architecture further boosts the performance by 6.7%.
Swaroop Ghosh, The Pennsylvania State University, US
Asmit De, Aditya Basu, Swaroop Ghosh and Trent Jaeger, Pennsylvania State University, US
With the recent proliferation of Internet of Things (IoT) and embedded devices, there is a growing need to develop a security framework to protect such devices. RISC-V is a promising open source architecture that targets low-power embedded devices and SoCs. However, there is a dearth of practical and low-overhead security solutions in the RISC-V architecture. Programs compiled using RISC-V toolchains are still vulnerable to code injection and code reuse attacks such as buffer overflow and return-oriented programming (ROP). In this paper, we propose FIXER, a hardware implemented security extension to RISC-V that provides a defense mechanism against such attacks. FIXER enforces fine-grained control-flow integrity (CFI) of running programs on backward edges (returns) and forward edges (calls) without requiring any architectural modifications to the RISC-V processor core. We implement FIXER on RocketChip, a RISC-V SoC platform, by leveraging the integrated Rocket Custom Coprocessor (RoCC) to detect and prevent attacks. Compared to existing software based solutions, FIXER reduces energy overhead by 60% at minimal execution time (1.5%) and area (2.9%) overheads.
Marcelo Brandalero, Universidade Federal do Rio Grande do Sul (UFRGS), BR
Marcelo Brandalero1, Muhammad Shafique2, Luigi Carro3 and Antonio Carlos Schneider Beck1
1Universidade Federal do Rio Grande do Sul, BR; 2Vienna University of Technology (TU Wien), AT; 3UFRGS, BR
Single-ISA heterogeneous systems, such as ARM's big.LITTLE, use microarchitecturally-different General-Purpose Processor cores to efficiently match the capabilities of the processing resources with applications' performance and energy requirements that change at run time. However, since only a fixed and non-configurable set of cores is available, reaching the best possible match between the available resources and applications' requirements remains a challenge, especially considering the varying and unpredictable workloads. In this work, we propose TransRec, a hardware architecture which improves over these traditional heterogeneous designs. TransRec integrates a shared, transparent (i.e., no need to change application binary) and adaptive accelerator in the form of a Coarse-Grained Reconfigurable Array that can be used by any of the General-Purpose Processor cores for on-demand acceleration. Through evaluations with cycle-accurate gem5 simulations, synthesis of real RISC-V processor designs for a 15nm technology, and considering the effects of Dynamic Voltage and Frequency Scaling, we demonstrate that TransRec provides better performance-energy tradeoffs that are otherwise unachievable with traditional big.LITTLE-like designs. In particular, for less than 40% area overhead, TransRec can improve performance in the low-energy mode (LITTLE) by 2.28x, and can improve both performance and energy efficiency by 1.32x and 1.59x, respectively, in high-performance mode (big).
Mohsen Imani, University of California, San Diego, US
Mohsen Imani, Ricardo Garcia, Andrew Huang and Tajana Rosing, University of California San Diego, US
Approximate computing is a promising solution to design faster and more energy efficient systems, which provides an adequate quality for a variety of functions. Division, in particular, floating point division, is one of the most important operations in multimedia applications, which has been implemented less in hardware due to its significant cost and complexity. In this paper, we proposed CADE, a Configurable Approximate Divider which performs floating point division operation with a runtime controllable accuracy. The approximation of the CADE is accomplished by removing the costly division operation and replacing it with a subtraction of the input operands mantissa. To increase the level of accuracy, CADE analyses the first N bits (called tuning bits) of both input operands mantissa to estimate the division error. If CADE determines that the first approximation is unacceptable, a pre-computed value is retrieved from memory and subtracted from the first approximation mantissa. At runtime, CADE can provide a higher accuracy by increasing the number of tuning bits. The proposed CADE was integrated on the AMD GPU architecture. Our evaluation shows that CADE is at least 4.1× more energy efficient, 1.5× faster, and 1.7× higher area efficient as compared to state-of-the-art approximate dividers while providing 25% lower error rate. In addition, CADE gives a new knob to GPU in order to configure the level of approximation at runtime depending on the application/user accuracy requirement.
18:30End of session
Exhibition Reception in Exhibition Area

The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks for all conference delegates and exhibition visitors will be offered. All exhibitors are welcome to also provide drinks and snacks for the attendees.