12.5 Accelerator Design and Heterogeneous Architectures


Date: Thursday 17 March 2016
Time: 16:00 - 17:30
Location / Room: Konferenz 3

Chair:
Cristina Silvano, Politecnico di Milano, IT

Co-Chair:
Todd Austin, University of Michigan, US

This session presents papers on heterogeneous systems with a focus on hardware acceleration. The first two papers propose acceleration for general-purpose and domain-specific computing, respectively. The third paper addresses the issue of system interconnect for many-accelerator systems. The last paper introduces a data-oriented accelerator design for sparse matrix operations.

Time  Label  Presentation Title / Authors

16:00  12.5.1  (Best Paper Award Candidate)
A RECONFIGURABLE HETEROGENEOUS MULTICORE WITH A HOMOGENEOUS ISA
Speaker:
Antonio Carlos Schneider Beck, Universidade Federal do Rio Grande do Sul (UFRGS), BR
Authors:
Jeckson Dellagostin Souza (1), Luigi Carro (1), Mateus Beck Rutzig (2) and Antonio Carlos Schneider Beck Filho (1)
(1) Universidade Federal do Rio Grande do Sul (UFRGS), BR; (2) Universidade Federal de Santa Maria, BR
Abstract
Given the large diversity of embedded applications found in current portable devices, both Thread- and Instruction-Level Parallelism must be exploited for energy and performance reasons. While MPSoCs are widely used for this purpose, they fall short on software productivity, since they comprise different ISAs that must be programmed separately. General-purpose multicores, on the other hand, implement the same ISA but are composed of a homogeneous set of power-hungry superscalar processors. In this paper we show how a regular fabric can be used to provide a number of different heterogeneous configurations while still sustaining the same ISA. This is done by leveraging the intrinsic regularity of a reconfigurable fabric, so that several different organizations can be built with little effort. To ensure ISA compatibility, we use a binary translation mechanism that transforms code to be executed on the fabric at run time. Using representative benchmarks, we show that one version of the heterogeneous system outperforms its homogeneous counterpart on average by 59% in performance and 10% in energy, with EDP improvements in almost every scenario.
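As a rough illustration of the binary-translation idea (not the paper's actual mechanism), the C sketch below dispatches basic blocks at run time: cold blocks execute on the base-ISA core, while blocks that turn hot are translated once and then re-executed on the reconfigurable fabric. The helpers fabric_translate, fabric_execute, cpu_execute_block and the HOT_THRESHOLD value are illustrative assumptions, not names from the paper.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define HOT_THRESHOLD 50        /* assumed value, not from the paper */

    typedef struct {
        uint32_t start_pc;          /* first instruction of the basic block */
        uint32_t exec_count;        /* how often the block has been seen */
        bool     translated;        /* already mapped onto the fabric? */
        int      fabric_cfg;        /* configuration slot on the fabric */
    } block_info;

    /* Stubs standing in for the real hardware mechanisms (assumptions). */
    static int  fabric_translate(uint32_t pc)  { printf("translate %u\n", pc); return 1; }
    static void fabric_execute(int cfg)        { (void)cfg; /* fire the datapath */ }
    static void cpu_execute_block(uint32_t pc) { (void)pc;  /* base-ISA fallback */ }

    /* Run one basic block: cold blocks stay on the base-ISA core; once a
       block turns hot it is translated once and reused on the fabric,
       so the same binary runs unchanged on every configuration. */
    void dispatch(block_info *b)
    {
        if (b->translated) {
            fabric_execute(b->fabric_cfg);
        } else if (++b->exec_count >= HOT_THRESHOLD) {
            b->fabric_cfg = fabric_translate(b->start_pc);
            b->translated = true;
            fabric_execute(b->fabric_cfg);
        } else {
            cpu_execute_block(b->start_pc);
        }
    }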

16:30  12.5.2  THE NEURO VECTOR ENGINE: FLEXIBILITY TO IMPROVE CONVOLUTIONAL NETWORK EFFICIENCY FOR WEARABLE VISION
Speaker:
Maurice Peemen, Eindhoven University of Technology, NL
Authors:
Maurice Peemen (1), Bart Mesman (1), Henk Corporaal (1), Runbin Shi (2), Sohan Lal (3) and Ben Juurlink (3)
(1) Eindhoven University of Technology, NL; (2) Soochow University, CN; (3) TU Berlin, DE
Abstract
Deep Convolutional Networks (ConvNets) currently deliver superior benchmark performance, but their demands on computation and data transfer prohibit straightforward mapping on energy-constrained wearable platforms. The computational burden can be overcome by dedicated hardware accelerators, but it is the sheer amount of data transfer and the level of utilization that determine the energy efficiency of these implementations. This paper presents the Neuro Vector Engine (NVE), a SIMD accelerator for ConvNet-based visual object classification, targeting portable and wearable devices. Our accelerator is very flexible due to its use of a VLIW ISA, at the cost of instruction-fetch overhead. We show that this overhead is insignificant when the extra flexibility enables advanced data-locality optimizations and improves hardware utilization across ConvNet vision applications. By co-optimizing the accelerator architecture and the algorithm's loop structure, 30 Gops is achieved within a power envelope of 54 mW and a silicon footprint of only 0.26 mm^2 in TSMC 40 nm technology, enabling high-end visual object recognition on portable and even wearable devices.
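To make the data-locality argument concrete, the C sketch below shows one output row of a plain scalar 2-D convolution; it is a generic textbook kernel, not the NVE's VLIW code, and the kernel size K and function names are assumptions. The point is the reuse: once the K input rows and the kernel sit in a small local buffer, each fetched input value is reused K*K times, which is the kind of locality that offsets the instruction-fetch overhead the abstract mentions.

    #include <stddef.h>

    #define K 3   /* kernel size; an assumption for this sketch */

    /* One output row of a 2-D convolution. The caller must supply at
       least K input rows of width in_w, with out_w + K - 1 <= in_w. */
    void conv_row(const float *in, size_t in_w,
                  const float *kernel, float *out, size_t out_w)
    {
        for (size_t x = 0; x < out_w; x++) {
            float acc = 0.0f;
            /* Every in[] element here is touched by K*K output positions,
               so buffering the rows locally amortizes each memory fetch. */
            for (size_t ky = 0; ky < K; ky++)
                for (size_t kx = 0; kx < K; kx++)
                    acc += in[ky * in_w + x + kx] * kernel[ky * K + kx];
            out[x] = acc;
        }
    }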

17:00  12.5.3  IMPROVING SCALABILITY OF CMPS WITH DENSE ACCS COVERAGE
Speaker:
Gunar Schirner, Northeastern University, US
Authors:
Nasibeh Teimouri, Hamed Tabkhi and Gunar Schirner, Northeastern University, US
Abstract
This article opens a path toward the efficient integration of many hardware Accelerators (ACCs) on a single chip. To this end, it first identifies four major semantic aspects of ACC communication: data access model, data granularity, marshalling, and synchronization. Based on these semantics, the article proposes the Transparent Self-Synchronizing (TSS) architecture as an extensible template for integrating many ACCs efficiently. In principle, TSS shifts from the current processor-centric view to a more equal, peer view between ACCs and the host processors. It offers a programmable MUX-based interconnect with fine-tuned local buffers per ACC, as well as autonomous control that reduces the synchronization load on the host processor. TSS is mainly suitable for the class of streaming applications. Our results using 8 streaming applications demonstrate significant benefits of TSS, including a 3x speedup over current ACC-based architectures.
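A minimal sketch of the peer-to-peer handoff TSS describes, in C for readability only; the real architecture is hardware, and the names (acc_t, mux_connect, try_forward) and the fixed buffer depth are hypothetical. It shows the two ideas the abstract names: a programmable route between peers, and a self-synchronizing transfer that never involves the host.

    #include <stddef.h>

    #define BUF_DEPTH 256           /* per-ACC local buffer depth; assumed */

    typedef struct acc {
        struct acc *next;           /* downstream peer, set via the MUX */
        float  buf[BUF_DEPTH];      /* fine-tuned local buffer */
        size_t fill;                /* tokens currently buffered */
    } acc_t;

    /* Program the MUX-based interconnect: route the producer's output
       directly to the consumer, with no host-managed copy in between. */
    void mux_connect(acc_t *producer, acc_t *consumer)
    {
        producer->next = consumer;
    }

    /* Autonomous handoff: a finished token moves into the peer's local
       buffer as soon as there is room; a full buffer applies back-pressure
       instead of raising a host interrupt. */
    int try_forward(acc_t *a, float token)
    {
        acc_t *dst = a->next;
        if (dst == NULL || dst->fill >= BUF_DEPTH)
            return 0;               /* back-pressure: caller retries later */
        dst->buf[dst->fill++] = token;
        return 1;
    }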

17:15  12.5.4  HARDWARE ACCELERATOR FOR ANALYTICS OF SPARSE DATA
Speaker:
Eriko Nurvitadhi, Intel Corporation, US
Authors:
Eriko Nurvitadhi, Asit Mishra, Yu Wang, Ganesh Venkatesh and Debbie Marr, Intel Corporation, US
Abstract
The rapid growth of the Internet has led to web applications that produce large unstructured sparse datasets (e.g., texts, ratings). Machine learning (ML) algorithms are the basis for many important analytics workloads that extract knowledge from these datasets. This paper characterizes such workloads on a high-end server for real-world datasets and shows that a set of sparse matrix operations dominates runtime. Further, these operations run inefficiently due to low compute-per-byte ratios and challenging thread-scaling behavior. We therefore propose a hardware accelerator that performs these operations with extreme efficiency. Simulations and RTL synthesis to a 14 nm ASIC demonstrate significant performance and performance/Watt improvements over conventional processors, with only a small area overhead.
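The dominant kernel in such workloads is typically sparse matrix-vector multiplication. The generic CSR (Compressed Sparse Row) version below is a textbook formulation, not the accelerator's design, but it makes the low compute-per-byte ratio visible: each nonzero costs one multiply-add against two streamed reads (value and column index) plus an irregular gather from x.

    #include <stddef.h>

    /* y = A*x with A in CSR form: row_ptr has n_rows+1 entries, and
       col_idx/val hold one entry per nonzero. The gather x[col_idx[k]]
       is the irregular access that limits thread scaling on CPUs. */
    void spmv_csr(size_t n_rows,
                  const size_t *row_ptr,
                  const size_t *col_idx,
                  const float  *val,
                  const float  *x, float *y)
    {
        for (size_t i = 0; i < n_rows; i++) {
            float acc = 0.0f;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                acc += val[k] * x[col_idx[k]];
            y[i] = acc;
        }
    }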

17:30  End of session