11.5 Design of Efficient Microarchitectures


Date: Thursday 17 March 2016
Time: 14:00 - 15:30
Location / Room: Konferenz 3

Chair:
Dionisios Pnevmatikatos, Technical University of Crete, GR

Co-Chair:
Todd Austin, University of Michigan, US

The microarchitecture session presents innovative ideas for the efficient design of computing components. The first paper presents a viable prediction technique to deactivate cache ways in order to save energy without compromising performance. The second paper proposes a micro-architectural extension for approximate computing that reduces the bit error rate while providing the power benefits of extreme voltage-scaling techniques. The third paper presents a faster and more accurate logarithmic number unit (LNU) design and implementation based on an improved co-transformation scheme.

Time | Label | Presentation Title / Authors
14:00 | 11.5.1 | PRACTICAL WAY HALTING BY SPECULATIVELY ACCESSING HALT TAGS
Speaker:
Daniel Moreau, Chalmers University of Technology, SE
Authors:
Daniel Moreau1, Alen Bardizbanyan1, Magnus Själander2, Dave Whalley3 and Per Larsson-Edefors1
1Chalmers University of Technology, SE; 2Uppsala University, SE; 3Florida State University, US
Abstract
Conventional set-associative data cache accesses waste energy since the tag and data arrays of several ways are accessed simultaneously to sustain pipeline speed. Different access techniques that avoid activating all cache ways have been proposed in an effort to reduce energy usage. However, a problem that many of these techniques have in common is that they need to access different cache memory portions sequentially, which is difficult to support with standard synchronous SRAM. We propose the speculative halt-tag access (SHA) approach, which accesses the low-order tag bits, i.e., the halt tag, in the address generation stage instead of the SRAM access stage to eliminate accesses to cache ways that cannot possibly contain the data. The key feature of our SHA approach is that it determines which tag and data arrays need to be accessed early enough for conventional SRAMs to be used. We evaluate the SHA approach using a 65-nm processor implementation running MiBench benchmarks and find that it reduces data access energy by 25.6% on average.
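
For intuition, the short Python sketch below mimics the way-halting idea: a few low-order tag bits are compared first, and only the ways whose halt tag matches are considered for the full tag and data access. The cache geometry (4 ways, 256 sets, 64-byte lines) and the 4-bit halt tag are illustrative assumptions, not the configuration evaluated in the paper.

    # Illustrative sketch of halt-tag filtering in a set-associative cache
    # (not the paper's RTL).  Parameters are arbitrary assumptions.
    NUM_WAYS   = 4
    NUM_SETS   = 256
    LINE_BYTES = 64
    HALT_BITS  = 4

    def split_address(addr):
        """Split a byte address into (tag, set index, line offset)."""
        offset = addr % LINE_BYTES
        index  = (addr // LINE_BYTES) % NUM_SETS
        tag    = addr // (LINE_BYTES * NUM_SETS)
        return tag, index, offset

    # cache[set][way] = full tag stored in that way (None if invalid)
    cache = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def ways_to_activate(addr):
        """Return the ways whose halt tag matches; only these need a full
        tag compare and data-array access in the SRAM access stage."""
        tag, index, _ = split_address(addr)
        halt = tag & ((1 << HALT_BITS) - 1)          # low-order tag bits
        live = []
        for way, stored_tag in enumerate(cache[index]):
            if stored_tag is not None and (stored_tag & ((1 << HALT_BITS) - 1)) == halt:
                live.append(way)
        return live

    # Example: install a line and probe the same address again.
    tag, idx, _ = split_address(0x1234)
    cache[idx][2] = tag
    print(ways_to_activate(0x1234))   # -> [2]: only one way is woken up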

14:30 | 11.5.2 | LAZY PIPELINES: ENHANCING QUALITY IN APPROXIMATE COMPUTING
Speaker:
Georgios Tziantzioulis, Northwestern University, US
Authors:
Georgios Tziantzioulis1, Ali Murat Gok1, S M Faisal2, Nikos Hardavellas1, Seda Ogrenci-Memik1 and Srinivasan Parthasarathy2
1Northwestern University, US; 2The Ohio State University, US
Abstract
Approximate computing techniques based on Voltage Over-Scaling (VOS) can provide quadratic improvements in power efficiency. However, voltage scaling is limited by the inherent fault tolerance of an application, preventing VOS schemes from realizing their full potential. To gain further power efficiency, the error rate experienced at a given voltage level must be reduced. We propose Lazy Pipelines, a micro-architectural technique that utilizes vacant cycles in a VOS functional unit to extend execution and reduce the error rate.
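
As a rough illustration of the idea (and only that), the toy model below assumes an over-scaled unit whose error probability drops when an operation is allowed to stretch into a cycle that would otherwise be vacant; the error probabilities are invented numbers, not measurements from the paper.

    import random

    P_ERR_ONE_CYCLE  = 0.10   # assumed error rate when an op gets 1 cycle
    P_ERR_TWO_CYCLES = 0.01   # assumed error rate when it borrows a vacant cycle

    def run(issue_pattern, seed=0):
        """issue_pattern: list of booleans, True = a new op is issued this cycle.
        Returns (errors without laziness, errors with laziness)."""
        rng = random.Random(seed)
        errors_base, errors_lazy = 0, 0
        cycle = 0
        while cycle < len(issue_pattern):
            if issue_pattern[cycle]:
                next_vacant = cycle + 1 < len(issue_pattern) and not issue_pattern[cycle + 1]
                draw = rng.random()
                errors_base += draw < P_ERR_ONE_CYCLE
                # Lazy pipeline: extend into the vacant slot when nobody needs the unit.
                p = P_ERR_TWO_CYCLES if next_vacant else P_ERR_ONE_CYCLE
                errors_lazy += draw < p
                if next_vacant:
                    cycle += 1        # the borrowed cycle is consumed
            cycle += 1
        return errors_base, errors_lazy

    # Example: a half-utilized unit leaves plenty of vacant cycles to exploit.
    print(run([True, False] * 50))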

15:00 | 11.5.3 | HIGH-EFFICIENCY LOGARITHMIC NUMBER UNIT DESIGN BASED ON AN IMPROVED CO-TRANSFORMATION SCHEME
Speaker:
Youri Popoff, ETH Zürich, CH
Authors:
Youri Popoff, Florian Scheidegger, Michael Schaffner, Michael Gautschi, Frank K. Gürkaynak and Luca Benini, ETH Zürich, CH
Abstract
The logarithmic number system (LNS) has always been an interesting alternative to floating-point arithmetic, since several operations such as division, exponentiation and square root, which are required for computationally intensive nonlinear functions, are greatly simplified in the logarithmic domain. However, additions and subtractions become nonlinear operations that have to be approximated with polynomials for area-efficient realizations. A particular challenge is the accuracy within the so-called critical region, encountered for subtractions where the difference between the operands is close to zero. Several arithmetic co-transformations that reduce the overhead of approximating these operations have been presented in the literature. Even so, the main problem with practical LNS realizations is the area overhead compared to standard FPUs of comparable accuracy. In this paper, we propose a novel, highly hardware-efficient co-transformation concept that not only reduces the area requirements by up to 35% compared to the state of the art, but also allows the LNU to calculate single-cycle logarithms and exponentiations within the same datapath. We present comprehensive results for a complete processing system that includes the LNU and an OpenRISC-based core in 65 nm and 28 nm technologies. We compare this implementation with a system using a standard IEEE-compliant FPU and show that the LNS-based system can outperform its FP counterpart by up to 4.35x in speed. The final, pipelined LNU, implemented in 65 nm, occupies an area of 54.3 kGE, achieves 89 MFLOPS and consumes 15.9-136.7 pJ per operation at 1.2 V under typical conditions at 25°C.
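
The appeal of the LNS and the origin of the critical region can be seen in a few lines of Python. The sketch below uses floating-point log/exp calls purely for intuition; an actual LNU, like the one in the paper, replaces them with fixed-point polynomial approximations plus a co-transformation for the critical region.

    import math

    # Numeric sketch of LNS arithmetic for intuition only.  Signs are
    # ignored for brevity.

    def to_lns(x):
        return math.log2(abs(x))

    def from_lns(e):
        return 2.0 ** e

    def lns_mul(a, b):          # multiplication becomes an addition of logs
        return a + b

    def lns_div(a, b):          # division becomes a subtraction
        return a - b

    def lns_sqrt(a):            # square root becomes a halving (a shift)
        return a / 2.0

    # Addition/subtraction need the nonlinear functions
    #   log2(1 + 2^-d) and log2(1 - 2^-d),  with d = |a - b|.
    # The subtraction term diverges as d -> 0: this is the "critical region"
    # that co-transformation schemes are designed to tame.
    def lns_add(a, b):
        d = abs(a - b)
        return max(a, b) + math.log2(1.0 + 2.0 ** (-d))

    def lns_sub(a, b):
        d = abs(a - b)
        return max(a, b) + math.log2(1.0 - 2.0 ** (-d))

    x, y = to_lns(6.0), to_lns(2.0)
    print(from_lns(lns_mul(x, y)),   # -> 12.0
          from_lns(lns_div(x, y)),   # -> 3.0
          from_lns(lns_add(x, y)))   # -> 8.0 (up to rounding)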

15:30 | IP5-14, 521 | SEERAD: A HIGH SPEED YET ENERGY-EFFICIENT ROUNDING-BASED APPROXIMATE DIVIDER
Speaker:
Ali Afzali-Kusha, University of Tehran, IR
Authors:
Reza Zendegani1, Mehdi Kamal1, Arash Fayyazi1, Ali Afzali-Kusha1, Saeed Safari1 and Massoud Pedram2
1University of Tehran, IR; 2University of Southern California, US
Abstract
In this paper, a high-speed yet energy-efficient approximate divider for error-resilient applications is proposed. For the division operation, the divisor is rounded to a value of a specific form, transforming the division into a multiplication. The proposed approximate divider offers the flexibility of increasing the accuracy at the price of higher delay and hardware usage. Its efficacy is evaluated in comparison to three different implementations of the SRT divider. The results show that the delay and energy consumption of the proposed approximate divider are, on average, 14 and 300 times smaller than those of the Radix-2 SRT divider with carry-save remainder computation. Additionally, the effectiveness of the proposed approximate divider is studied in an image division operation used in image processing applications. The results suggest that the proposed approximate divider is well suited to digital signal processing applications.
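
To illustrate the general idea of rounding-based division (not SEERAD's specific rounding form), the sketch below rounds the divisor to the nearest power of two so that the division collapses into a shift; the resulting error shows why a more refined rounding form, as in the paper, is needed to control accuracy.

    # Hedged sketch: divisor rounded to the nearest power of two, so the
    # quotient can be produced with a shift (written here as integer division).

    def round_to_pow2(d):
        """Return the power of two nearest to the positive integer d."""
        k = max(d.bit_length() - 1, 0)
        lower, upper = 1 << k, 1 << (k + 1)
        return lower if d - lower <= upper - d else upper

    def approx_div(a, d):
        return a // round_to_pow2(d)      # a shift in hardware

    for a, d in [(1000, 9), (5000, 31), (1000, 13)]:
        exact, approx = a // d, approx_div(a, d)
        err = abs(approx - exact) / exact * 100
        print(f"{a}/{d}: exact={exact} approx={approx} error={err:.1f}%")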

15:31 | IP5-15, 140 | IMPROVING PERFORMANCE GUARANTEES IN WORMHOLE MESH NOC DESIGNS
Speaker:
Milos Panic, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Authors:
Milos Panic1, Carles Hernandez2, Jaume Abella2, Antoni Roca Perez3, Eduardo Quinones2 and Francisco Cazorla4
1Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; 2Barcelona Supercomputing Center, ES; 3Universitat Politècnica de Catalunya, ES; 4Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract
Wormhole-based mesh Networks-on-Chip (wNoC) are deployed in high-performance many-core processors due to their physical scalability and low cost. However, the distributed nature of wNoCs makes it challenging to deliver the tight and time-composable Worst-Case Execution Time (WCET) estimates that safety-critical real-time embedded systems require. We propose a bandwidth control mechanism for wNoCs that enables the computation of tight, time-composable WCET estimates with low average performance degradation and high scalability. Our evaluation with the EEMBC automotive suite and an industrial real-time parallel avionics application confirms these benefits.
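
A heavily simplified example of why bandwidth regulation yields analyzable bounds: assuming each output link serves at most a fixed number of flows in round-robin at one flit per cycle (a much cruder model than the paper's analysis), a per-packet upper bound can be written directly.

    # Back-of-the-envelope latency bound under per-link round-robin
    # regulation.  Assumptions: one flit per cycle per link, no deeper
    # buffering effects; NOT the paper's timing analysis.

    def wcet_bound_cycles(hops, packet_flits, contenders):
        header_delay = hops * contenders          # header may wait at every hop
        body_delay   = packet_flits * contenders  # body flits interleave with rivals
        return header_delay + body_delay

    # Example: 5-flit packet, 6 hops, up to 4 flows sharing each link.
    print(wcet_bound_cycles(hops=6, packet_flits=5, contenders=4))   # -> 44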

15:32 | IP5-16, 906 | A DATA LAYOUT TRANSFORMATION (DLT) ACCELERATOR: ARCHITECTURAL SUPPORT FOR DATA MOVEMENT OPTIMIZATION IN ACCELERATED-CENTRIC HETEROGENEOUS SYSTEMS
Speaker:
Tung Hoang, University of Chicago, US
Authors:
Tung Hoang, Amirali Shambayati and Andrew A. Chien, University of Chicago, US
Abstract
Technology scaling and the growing use of accelerators make data-movement optimization increasingly important in all computing systems. Further, the growing diversity of memory structures makes embedding such optimization in software non-portable. We propose a novel architectural solution called Data Layout Transformation (DLT), together with a simple set of instructions that enables software to describe the required data movement compactly and frees the implementation to optimize the movement based on knowledge of the memory hierarchy and system structure. The DLT architecture is applicable to both general-purpose and accelerator-based heterogeneous systems. Experimental results show that the proposed DLT architecture can exploit the full bandwidth (>97%) of a wide range of memory systems (DDR3 and HMC) while its implementation cost, in 32 nm, is low (only 0.246 mm² and 75 mW at 1 GHz). Our evaluation of the DLT accelerator in an accelerator-based heterogeneous system with DDR3 and HMC memory shows that DLT can enhance system performance in the range of 4.6x-99x (DDR3) and 4.4x-115x (HMC), which translates into 2.8x-48x (DDR3) and 1.4x-39x (HMC) improvements in energy efficiency.
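
The flavor of describing data movement compactly can be sketched as follows; the descriptor fields and the gather semantics are assumptions made for illustration and do not reproduce the DLT instruction set.

    from dataclasses import dataclass

    @dataclass
    class MoveDescriptor:
        src_base: int      # byte offset of the first element in the source buffer
        elem_size: int     # bytes per element
        stride: int        # bytes between consecutive source elements
        count: int         # number of elements to gather

    def execute(desc, src: bytes) -> bytes:
        """Gather strided elements into a contiguous buffer - the kind of
        array-of-structs -> struct-of-arrays step a layout engine could
        reorder or burst-coalesce behind the scenes."""
        out = bytearray()
        for i in range(desc.count):
            start = desc.src_base + i * desc.stride
            out += src[start:start + desc.elem_size]
        return bytes(out)

    # Example: pull a 4-byte field out of an array of 12-byte records.
    records = bytes(range(48))                      # four 12-byte records
    print(execute(MoveDescriptor(0, 4, 12, 4), records).hex())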

15:33 | IP5-17, 203 | OUESSANT: FLEXIBLE INTEGRATION OF DEDICATED COPROCESSORS IN SYSTEMS ON CHIP
Speaker:
Pierre-Henri Horrein, Lab-STICC/Télécom Bretagne, FR
Authors:
Pierre-Henri Horrein, Philip-Dylan Gleonec, Erwan Libessart, André Lalevée and Matthieu Arzel, Lab-STICC/Télécom Bretagne, FR
Abstract
Integration of hardware accelerators in Systems on Chip is often complex. When dealing with reconfigurable hardware, this greatly limits the attainable flexibility. In this paper, we propose an alternative approach to the Molen paradigm [1]. This approach, named Ouessant, is based on a very simple general-purpose instruction set designed for close interaction with dedicated hardware accelerators. This instruction set is used to program a dedicated controller, which commands the accelerator's execution and data transfers with minimal CPU intervention. The resulting architecture is flexible, extensible, and can easily be integrated in Systems on Chip. Adding new accelerators is also made easier. Implementations of the architecture on different FPGAs show a very low footprint and a very small impact on attainable performance. Ouessant is freely available under an open-source license.
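
As a purely illustrative sketch of the concept of a small controller instruction set driving an accelerator (the opcodes and semantics below are invented for illustration and are not Ouessant's actual ISA):

    def run_controller(program, memory, accel):
        """Interpret a tiny controller program: program is a list of
        (opcode, *operands); memory is a dict addr -> value; accel exposes
        load(), start() and result()."""
        for instr in program:
            op, *args = instr
            if op == "LOAD":          # move a value from memory into the accelerator
                accel.load(args[0], memory[args[1]])
            elif op == "START":       # kick off the accelerator, CPU stays free
                accel.start()
            elif op == "STORE":       # write the result back without CPU help
                memory[args[0]] = accel.result()

    class ToyMultiplier:
        def load(self, port, value): setattr(self, port, value)
        def start(self): self.out = self.a * self.b
        def result(self): return self.out

    mem = {0x10: 6, 0x14: 7, 0x18: None}
    run_controller([("LOAD", "a", 0x10), ("LOAD", "b", 0x14),
                    ("START",), ("STORE", 0x18)], mem, ToyMultiplier())
    print(mem[0x18])                  # -> 42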

15:30 | End of session
Coffee Break in Exhibition Area