11.5 Design of Efficient Microarchitectures


Date: Thursday 17 March 2016
Time: 14:00 - 15:30
Location / Room: Konferenz 3

Chair:
Dionisios Pnevmatikatos, Technical University of Crete, GR

Co-Chair:
Todd Austin, University of Michigan, US

The microarchitecture session presents innovative ideas for the efficient design of computing components. The first paper presents a viable prediction technique to deactivate cache ways in order to save energy without compromising performance. The second paper proposes a micro-architectural extension for approximate computing that reduces the bit error rate while providing the power benefits of extreme voltage-scaling techniques. The third paper presents a faster and more accurate logarithmic number unit (LNU) design and implementation based on an improved co-transformation scheme.

Time | Label | Presentation Title / Authors
14:00 | 11.5.1 | PRACTICAL WAY HALTING BY SPECULATIVELY ACCESSING HALT TAGS
Speaker:
Daniel Moreau, Chalmers University of Technology, SE
Authors:
Daniel Moreau1, Alen Bardizbanyan1, Magnus Själander2, Dave Whalley3 and Per Larsson-Edefors1
1Chalmers University of Technology, SE; 2Uppsala University, SE; 3Florida State University, US
Abstract
Conventional set-associative data cache accesses waste energy since the tag and data arrays of several ways are accessed simultaneously to sustain pipeline speed. Different access techniques that avoid activating all cache ways have been proposed in an effort to reduce energy usage. However, a problem that many of these techniques have in common is that they need to access different cache memory portions sequentially, which is difficult to support with standard synchronous SRAM. We propose the speculative halt-tag access (SHA) approach, which accesses the low-order tag bits, i.e., the halt tag, in the address generation stage instead of the SRAM access stage to eliminate accesses to cache ways that cannot possibly contain the data. The key feature of our SHA approach is that it determines which tag and data arrays need to be accessed early enough for conventional SRAMs to be used. We evaluate the SHA approach using a 65-nm processor implementation running MiBench benchmarks and find that it reduces data access energy by 25.6% on average.
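
For intuition, the short Python sketch below mimics the way-halting idea: a few low-order tag bits are compared first, and only the ways whose halt tag matches are considered for the full tag and data access. The cache geometry (4 ways, 256 sets, 64-byte lines) and the 4-bit halt tag are illustrative assumptions, not the configuration evaluated in the paper.

    # Illustrative sketch of halt-tag filtering in a set-associative cache
    # (not the paper's RTL).  Parameters are arbitrary assumptions.
    NUM_WAYS   = 4
    NUM_SETS   = 256
    LINE_BYTES = 64
    HALT_BITS  = 4

    def split_address(addr):
        """Split a byte address into (tag, set index, line offset)."""
        offset = addr % LINE_BYTES
        index  = (addr // LINE_BYTES) % NUM_SETS
        tag    = addr // (LINE_BYTES * NUM_SETS)
        return tag, index, offset

    # cache[set][way] = full tag stored in that way (None if invalid)
    cache = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def ways_to_activate(addr):
        """Return the ways whose halt tag matches; only these need a full
        tag compare and data-array access in the SRAM access stage."""
        tag, index, _ = split_address(addr)
        halt = tag & ((1 << HALT_BITS) - 1)          # low-order tag bits
        live = []
        for way, stored_tag in enumerate(cache[index]):
            if stored_tag is not None and (stored_tag & ((1 << HALT_BITS) - 1)) == halt:
                live.append(way)
        return live

    # Example: install a line and probe the same address again.
    tag, idx, _ = split_address(0x1234)
    cache[idx][2] = tag
    print(ways_to_activate(0x1234))   # -> [2]: only one way is woken up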

14:30 | 11.5.2 | LAZY PIPELINES: ENHANCING QUALITY IN APPROXIMATE COMPUTING
Speaker:
Georgios Tziantzioulis, Northwestern University, US
Authors:
Georgios Tziantzioulis1, Ali Murat Gok1, S M Faisal2, Nikos Hardavellas1, Seda Ogrenci-Memik1 and Srinivasan Parthasarathy2
1Northwestern University, US; 2The Ohio State University, US
Abstract
Approximate computing techniques based on Voltage Over-Scaling (VOS) can provide quadratic improvements in power efficiency. However, voltage scaling is limited by the inherent fault tolerance of an application, preventing VOS schemes from realizing their full potential. To gain further power efficiency, the error rate experienced at a given voltage level must be reduced. We propose Lazy Pipelines, a micro-architectural technique that utilizes vacant cycles in a VOS functional unit to extend execution and reduce the error rate.
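
As a rough illustration of the idea (and only that), the toy model below assumes an over-scaled unit whose error probability drops when an operation is allowed to stretch into a cycle that would otherwise be vacant; the error probabilities are invented numbers, not measurements from the paper.

    import random

    P_ERR_ONE_CYCLE  = 0.10   # assumed error rate when an op gets 1 cycle
    P_ERR_TWO_CYCLES = 0.01   # assumed error rate when it borrows a vacant cycle

    def run(issue_pattern, seed=0):
        """issue_pattern: list of booleans, True = a new op is issued this cycle.
        Returns (errors without laziness, errors with laziness)."""
        rng = random.Random(seed)
        errors_base, errors_lazy = 0, 0
        cycle = 0
        while cycle < len(issue_pattern):
            if issue_pattern[cycle]:
                next_vacant = cycle + 1 < len(issue_pattern) and not issue_pattern[cycle + 1]
                draw = rng.random()
                errors_base += draw < P_ERR_ONE_CYCLE
                # Lazy pipeline: extend into the vacant slot when nobody needs the unit.
                p = P_ERR_TWO_CYCLES if next_vacant else P_ERR_ONE_CYCLE
                errors_lazy += draw < p
                if next_vacant:
                    cycle += 1        # the borrowed cycle is consumed
            cycle += 1
        return errors_base, errors_lazy

    # Example: a half-utilized unit leaves plenty of vacant cycles to exploit.
    print(run([True, False] * 50))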

15:00 | 11.5.3 | HIGH-EFFICIENCY LOGARITHMIC NUMBER UNIT DESIGN BASED ON AN IMPROVED CO-TRANSFORMATION SCHEME
Speaker:
Youri Popoff, ETH Zürich, CH
Authors:
Youri Popoff, Florian Scheidegger, Michael Schaffner, Michael Gautschi, Frank K. Gürkaynak and Luca Benini, ETH Zürich, CH
Abstract
The logarithmic number system (LNS) has always been an interesting alternative to floating-point arithmetic, since several operations such as division, exponentiation and square root, which are required for computationally intensive nonlinear functions, are greatly simplified in the logarithmic domain. However, additions and subtractions become nonlinear operations that have to be approximated with polynomials for area-efficient realizations. A particular challenge is the accuracy within the so-called critical region, encountered for subtractions where the difference between the operands is close to zero. Several arithmetic co-transformations that reduce the overhead of approximating these operations have been presented in the literature. Even so, the main problem with practical LNS realizations is the area overhead compared to standard FPUs of comparable accuracy. In this paper, we propose a novel, highly hardware-efficient co-transformation concept that not only reduces the area requirements by up to 35% compared to the state of the art, but also allows the LNU to calculate single-cycle logarithms and exponentiations within the same datapath. We present comprehensive results for a complete processing system that includes the LNU and an OpenRISC-based core in 65 nm and 28 nm technologies. We compare this implementation with a system using a standard IEEE-compliant FPU and show that the LNS-based system can outperform its FP counterpart by up to 4.35x in speed. The final, pipelined LNU, implemented in 65 nm, occupies an area of 54.3 kGE, achieves 89 MFLOPS and consumes 15.9-136.7 pJ per operation at 1.2 V under typical conditions at 25°C.
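
The appeal of the LNS and the origin of the critical region can be seen in a few lines of Python. The sketch below uses floating-point log/exp calls purely for intuition; an actual LNU, like the one in the paper, replaces them with fixed-point polynomial approximations plus a co-transformation for the critical region.

    import math

    # Numeric sketch of LNS arithmetic for intuition only.  Signs are
    # ignored for brevity.

    def to_lns(x):
        return math.log2(abs(x))

    def from_lns(e):
        return 2.0 ** e

    def lns_mul(a, b):          # multiplication becomes an addition of logs
        return a + b

    def lns_div(a, b):          # division becomes a subtraction
        return a - b

    def lns_sqrt(a):            # square root becomes a halving (a shift)
        return a / 2.0

    # Addition/subtraction need the nonlinear functions
    #   log2(1 + 2^-d) and log2(1 - 2^-d),  with d = |a - b|.
    # The subtraction term diverges as d -> 0: this is the "critical region"
    # that co-transformation schemes are designed to tame.
    def lns_add(a, b):
        d = abs(a - b)
        return max(a, b) + math.log2(1.0 + 2.0 ** (-d))

    def lns_sub(a, b):
        d = abs(a - b)
        return max(a, b) + math.log2(1.0 - 2.0 ** (-d))

    x, y = to_lns(6.0), to_lns(2.0)
    print(from_lns(lns_mul(x, y)),   # -> 12.0
          from_lns(lns_div(x, y)),   # -> 3.0
          from_lns(lns_add(x, y)))   # -> 8.0 (up to rounding)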

15:30 | IP5-14, 521 | SEERAD: A HIGH SPEED YET ENERGY-EFFICIENT ROUNDING-BASED APPROXIMATE DIVIDER
Speaker:
Ali Afzali-Kusha, University of Tehran, IR
Authors:
Reza Zendegani1, Mehdi Kamal1, Arash Fayyazi1, Ali Afzali-Kusha1, Saeed Safari1 and Massoud Pedram2
1University of Tehran, IR; 2University of Southern California, US
Abstract
In this paper, a high-speed yet energy-efficient approximate divider for error-resilient applications is proposed. For the division operation, the divisor is rounded to a value of a specific form, transforming the division into a multiplication. The proposed approximate divider offers the flexibility of increasing the accuracy at the price of higher delay and hardware usage. Its efficacy is evaluated in comparison to three different implementations of the SRT divider. The results show that the delay and energy consumption of the proposed approximate divider are, on average, 14 and 300 times smaller than those of the Radix-2 SRT divider with carry-save remainder computation. Additionally, the effectiveness of the proposed approximate divider is studied in an image division operation used in image processing applications. The results suggest that the proposed approximate divider is well suited to digital signal processing applications.
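
To illustrate the general idea of rounding-based division (not SEERAD's specific rounding form), the sketch below rounds the divisor to the nearest power of two so that the division collapses into a shift; the resulting error shows why a more refined rounding form, as in the paper, is needed to control accuracy.

    # Hedged sketch: divisor rounded to the nearest power of two, so the
    # quotient can be produced with a shift (written here as integer division).

    def round_to_pow2(d):
        """Return the power of two nearest to the positive integer d."""
        k = max(d.bit_length() - 1, 0)
        lower, upper = 1 << k, 1 << (k + 1)
        return lower if d - lower <= upper - d else upper

    def approx_div(a, d):
        return a // round_to_pow2(d)      # a shift in hardware

    for a, d in [(1000, 9), (5000, 31), (1000, 13)]:
        exact, approx = a // d, approx_div(a, d)
        err = abs(approx - exact) / exact * 100
        print(f"{a}/{d}: exact={exact} approx={approx} error={err:.1f}%")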

15:31 | IP5-15, 140 | IMPROVING PERFORMANCE GUARANTEES IN WORMHOLE MESH NOC DESIGNS
Speaker:
Milos Panic, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Authors:
Milos Panic1, Carles Hernandez2, Jaume Abella2, Antoni Roca Perez3, Eduardo Quinones2 and Francisco Cazorla4
1Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; 2Barcelona Supercomputing Center, ES; 3Universitat Politècnica de Catalunya, ES; 4Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract
Wormhole-based mesh Networks-on-Chip (wNoC) are deployed in high-performance many-core processors due to their physical scalability and low cost. However, the distributed nature of wNoCs makes it challenging to deliver the tight and time-composable Worst-Case Execution Time (WCET) estimates that safety-critical real-time embedded systems require. We propose a bandwidth control mechanism for wNoCs that enables the computation of tight, time-composable WCET estimates with low average performance degradation and high scalability. Our evaluation with the EEMBC automotive suite and an industrial real-time parallel avionics application confirms these benefits.
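
A heavily simplified example of why bandwidth regulation yields analyzable bounds: assuming each output link serves at most a fixed number of flows in round-robin at one flit per cycle (a much cruder model than the paper's analysis), a per-packet upper bound can be written directly.

    # Back-of-the-envelope latency bound under per-link round-robin
    # regulation.  Assumptions: one flit per cycle per link, no deeper
    # buffering effects; NOT the paper's timing analysis.

    def wcet_bound_cycles(hops, packet_flits, contenders):
        header_delay = hops * contenders          # header may wait at every hop
        body_delay   = packet_flits * contenders  # body flits interleave with rivals
        return header_delay + body_delay

    # Example: 5-flit packet, 6 hops, up to 4 flows sharing each link.
    print(wcet_bound_cycles(hops=6, packet_flits=5, contenders=4))   # -> 44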

15:32 | IP5-16, 906 | A DATA LAYOUT TRANSFORMATION (DLT) ACCELERATOR: ARCHITECTURAL SUPPORT FOR DATA MOVEMENT OPTIMIZATION IN ACCELERATED-CENTRIC HETEROGENEOUS SYSTEMS
Speaker:
Tung Hoang, University of Chicago, US
Authors:
Tung Hoang, Amirali Shambayati and Andrew A. Chien, University of Chicago, US
Abstract
Technology scaling and the growing use of accelerators make data-movement optimization increasingly important in all computing systems. Further, the growing diversity of memory structures makes embedding such optimization in software non-portable. We propose a novel architectural solution called Data Layout Transformation (DLT), together with a simple set of instructions that enables software to describe the required data movement compactly and frees the implementation to optimize the movement based on knowledge of the memory hierarchy and system structure. The DLT architecture is applicable to both general-purpose and accelerator-based heterogeneous systems. Experimental results show that the proposed DLT architecture can exploit the full bandwidth (>97%) of a wide range of memory systems (DDR3 and HMC) while its implementation cost, in 32 nm, is low (only 0.246 mm² and 75 mW at 1 GHz). Our evaluation of the DLT accelerator in an accelerator-based heterogeneous system with DDR3 and HMC memory shows that DLT can enhance system performance in the range of 4.6x-99x (DDR3) and 4.4x-115x (HMC), which translates into 2.8x-48x (DDR3) and 1.4x-39x (HMC) improvements in energy efficiency.
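
The flavor of describing data movement compactly can be sketched as follows; the descriptor fields and the gather semantics are assumptions made for illustration and do not reproduce the DLT instruction set.

    from dataclasses import dataclass

    @dataclass
    class MoveDescriptor:
        src_base: int      # byte offset of the first element in the source buffer
        elem_size: int     # bytes per element
        stride: int        # bytes between consecutive source elements
        count: int         # number of elements to gather

    def execute(desc, src: bytes) -> bytes:
        """Gather strided elements into a contiguous buffer - the kind of
        array-of-structs -> struct-of-arrays step a layout engine could
        reorder or burst-coalesce behind the scenes."""
        out = bytearray()
        for i in range(desc.count):
            start = desc.src_base + i * desc.stride
            out += src[start:start + desc.elem_size]
        return bytes(out)

    # Example: pull a 4-byte field out of an array of 12-byte records.
    records = bytes(range(48))                      # four 12-byte records
    print(execute(MoveDescriptor(0, 4, 12, 4), records).hex())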

15:33 | IP5-17, 203 | OUESSANT: FLEXIBLE INTEGRATION OF DEDICATED COPROCESSORS IN SYSTEMS ON CHIP
Speaker:
Pierre-Henri Horrein, Lab-STICC/Télécom Bretagne, FR
Authors:
Pierre-Henri Horrein, Philip-Dylan Gleonec, Erwan Libessart, André Lalevée and Matthieu Arzel, Lab-STICC/Télécom Bretagne, FR
Abstract
Integration of hardware accelerators in Systems on Chip is often complex. When dealing with reconfigurable hardware, this greatly limits the attainable flexibility. In this paper, we propose an alternative approach to the Molen paradigm [1]. This approach, named Ouessant, is based on a very simple general-purpose instruction set designed for close interaction with dedicated hardware accelerators. This instruction set is used to program a dedicated controller, which commands the accelerator's execution and data transfers with minimal CPU intervention. The resulting architecture is flexible, extensible, and can easily be integrated in Systems on Chip. Adding new accelerators is also made easier. Implementations of the architecture on different FPGAs show a very low footprint and a very small impact on attainable performance. Ouessant is freely available under an open-source license.
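
As a purely illustrative sketch of the concept of a small controller instruction set driving an accelerator (the opcodes and semantics below are invented for illustration and are not Ouessant's actual ISA):

    def run_controller(program, memory, accel):
        """Interpret a tiny controller program: program is a list of
        (opcode, *operands); memory is a dict addr -> value; accel exposes
        load(), start() and result()."""
        for instr in program:
            op, *args = instr
            if op == "LOAD":          # move a value from memory into the accelerator
                accel.load(args[0], memory[args[1]])
            elif op == "START":       # kick off the accelerator, CPU stays free
                accel.start()
            elif op == "STORE":       # write the result back without CPU help
                memory[args[0]] = accel.result()

    class ToyMultiplier:
        def load(self, port, value): setattr(self, port, value)
        def start(self): self.out = self.a * self.b
        def result(self): return self.out

    mem = {0x10: 6, 0x14: 7, 0x18: None}
    run_controller([("LOAD", "a", 0x10), ("LOAD", "b", 0x14),
                    ("START",), ("STORE", 0x18)], mem, ToyMultiplier())
    print(mem[0x18])                  # -> 42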

15:30 | End of session
Coffee Break in Exhibition Area