12.5 Accelerator Design and Heterogeneous Architectures


Date: Thursday 17 March 2016
Time: 16:00 - 17:30
Location / Room: Konferenz 3

Chair:
Cristina Silvano, Politecnico di Milano, IT

Co-Chair:
Todd Austin, University of Michigan, US

This session presents papers on heterogeneous systems with a focus on hardware acceleration. The first two papers propose acceleration for general-purpose and domain-specific computing, respectively. The third paper addresses the issue of system interconnect for many-accelerator systems. The last paper introduces a data-oriented accelerator design for sparse matrix operations.

Time  Label  Presentation Title / Authors

16:00  12.5.1  (Best Paper Award Candidate)
A RECONFIGURABLE HETEROGENEOUS MULTICORE WITH A HOMOGENEOUS ISA
Speaker:
Antonio Carlos Schneider Beck, Universidade Federal do Rio Grande do Sul (UFRGS), BR
Authors:
Jeckson Dellagostin Souza (1), Luigi Carro (1), Mateus Beck Rutzig (2) and Antonio Carlos Schneider Beck Filho (1)
(1) Universidade Federal do Rio Grande do Sul (UFRGS), BR; (2) Universidade Federal de Santa Maria, BR
Abstract
Given the large diversity of embedded applications found in current portable devices, both Thread- and Instruction-Level Parallelism must be exploited for energy and performance reasons. While MPSoCs are widely used for this purpose, they fall short on software productivity, since they comprise different ISAs that must be programmed separately. General-purpose multicores, on the other hand, implement the same ISA but are composed of a homogeneous set of power-hungry superscalar processors. In this paper we show how a regular fabric can be used to provide a number of different heterogeneous configurations while still sustaining the same ISA. This is done by leveraging the intrinsic regularity of a reconfigurable fabric, so that several different organizations can be built with little effort. To ensure ISA compatibility, we use a binary translation mechanism that transforms code to be executed on the fabric at run time. Using representative benchmarks, we show that one version of the heterogeneous system outperforms its homogeneous counterpart on average by 59% in performance and 10% in energy, with EDP improvements in almost every scenario.
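As a rough illustration of the binary-translation idea (not the paper's actual mechanism), the C sketch below dispatches basic blocks at run time: cold blocks execute on the base-ISA core, while blocks that turn hot are translated once and then re-executed on the reconfigurable fabric. The helpers fabric_translate, fabric_execute, cpu_execute_block and the HOT_THRESHOLD value are illustrative assumptions, not names from the paper.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define HOT_THRESHOLD 50        /* assumed value, not from the paper */

    typedef struct {
        uint32_t start_pc;          /* first instruction of the basic block */
        uint32_t exec_count;        /* how often the block has been seen */
        bool     translated;        /* already mapped onto the fabric? */
        int      fabric_cfg;        /* configuration slot on the fabric */
    } block_info;

    /* Stubs standing in for the real hardware mechanisms (assumptions). */
    static int  fabric_translate(uint32_t pc)  { printf("translate %u\n", pc); return 1; }
    static void fabric_execute(int cfg)        { (void)cfg; /* fire the datapath */ }
    static void cpu_execute_block(uint32_t pc) { (void)pc;  /* base-ISA fallback */ }

    /* Run one basic block: cold blocks stay on the base-ISA core; once a
       block turns hot it is translated once and reused on the fabric,
       so the same binary runs unchanged on every configuration. */
    void dispatch(block_info *b)
    {
        if (b->translated) {
            fabric_execute(b->fabric_cfg);
        } else if (++b->exec_count >= HOT_THRESHOLD) {
            b->fabric_cfg = fabric_translate(b->start_pc);
            b->translated = true;
            fabric_execute(b->fabric_cfg);
        } else {
            cpu_execute_block(b->start_pc);
        }
    }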

16:30  12.5.2  THE NEURO VECTOR ENGINE: FLEXIBILITY TO IMPROVE CONVOLUTIONAL NETWORK EFFICIENCY FOR WEARABLE VISION
Speaker:
Maurice Peemen, Eindhoven University of Technology, NL
Authors:
Maurice Peemen (1), Bart Mesman (1), Henk Corporaal (1), Runbin Shi (2), Sohan Lal (3) and Ben Juurlink (3)
(1) Eindhoven University of Technology, NL; (2) Soochow University, CN; (3) TU Berlin, DE
Abstract
Deep Convolutional Networks (ConvNets) currently deliver superior benchmark performance, but their demands on computation and data transfer prohibit straightforward mapping on energy-constrained wearable platforms. The computational burden can be overcome by dedicated hardware accelerators, but it is the sheer amount of data transfer and the level of utilization that determine the energy efficiency of these implementations. This paper presents the Neuro Vector Engine (NVE), a SIMD accelerator for ConvNet-based visual object classification, targeting portable and wearable devices. Our accelerator is very flexible due to its use of a VLIW ISA, at the cost of instruction-fetch overhead. We show that this overhead is insignificant when the extra flexibility enables advanced data-locality optimizations and improves hardware utilization across ConvNet vision applications. By co-optimizing the accelerator architecture and the algorithm's loop structure, 30 Gops is achieved within a power envelope of 54 mW and a silicon footprint of only 0.26 mm^2 in TSMC 40 nm technology, enabling high-end visual object recognition on portable and even wearable devices.
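To make the data-locality argument concrete, the C sketch below shows one output row of a plain scalar 2-D convolution; it is a generic textbook kernel, not the NVE's VLIW code, and the kernel size K and function names are assumptions. The point is the reuse: once the K input rows and the kernel sit in a small local buffer, each fetched input value is reused K*K times, which is the kind of locality that offsets the instruction-fetch overhead the abstract mentions.

    #include <stddef.h>

    #define K 3   /* kernel size; an assumption for this sketch */

    /* One output row of a 2-D convolution. The caller must supply at
       least K input rows of width in_w, with out_w + K - 1 <= in_w. */
    void conv_row(const float *in, size_t in_w,
                  const float *kernel, float *out, size_t out_w)
    {
        for (size_t x = 0; x < out_w; x++) {
            float acc = 0.0f;
            /* Every in[] element here is touched by K*K output positions,
               so buffering the rows locally amortizes each memory fetch. */
            for (size_t ky = 0; ky < K; ky++)
                for (size_t kx = 0; kx < K; kx++)
                    acc += in[ky * in_w + x + kx] * kernel[ky * K + kx];
            out[x] = acc;
        }
    }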

17:00  12.5.3  IMPROVING SCALABILITY OF CMPS WITH DENSE ACCS COVERAGE
Speaker:
Gunar Schirner, Northeastern University, US
Authors:
Nasibeh Teimouri, Hamed Tabkhi and Gunar Schirner, Northeastern University, US
Abstract
This article opens a path toward the efficient integration of many hardware Accelerators (ACCs) on a single chip. To this end, it first identifies four major semantic aspects of ACC communication: data access model, data granularity, marshalling, and synchronization. Based on these semantics, the article proposes the Transparent Self-Synchronizing (TSS) architecture as an extensible template for integrating many ACCs efficiently. In principle, TSS shifts from the current processor-centric view to a more equal, peer view between ACCs and the host processors. It offers a programmable MUX-based interconnect with fine-tuned local buffers per ACC, as well as autonomous control that reduces the synchronization load on the host processor. TSS is mainly suitable for the class of streaming applications. Our results using 8 streaming applications demonstrate significant benefits of TSS, including a 3x speedup over current ACC-based architectures.
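A minimal sketch of the peer-to-peer handoff TSS describes, in C for readability only; the real architecture is hardware, and the names (acc_t, mux_connect, try_forward) and the fixed buffer depth are hypothetical. It shows the two ideas the abstract names: a programmable route between peers, and a self-synchronizing transfer that never involves the host.

    #include <stddef.h>

    #define BUF_DEPTH 256           /* per-ACC local buffer depth; assumed */

    typedef struct acc {
        struct acc *next;           /* downstream peer, set via the MUX */
        float  buf[BUF_DEPTH];      /* fine-tuned local buffer */
        size_t fill;                /* tokens currently buffered */
    } acc_t;

    /* Program the MUX-based interconnect: route the producer's output
       directly to the consumer, with no host-managed copy in between. */
    void mux_connect(acc_t *producer, acc_t *consumer)
    {
        producer->next = consumer;
    }

    /* Autonomous handoff: a finished token moves into the peer's local
       buffer as soon as there is room; a full buffer applies back-pressure
       instead of raising a host interrupt. */
    int try_forward(acc_t *a, float token)
    {
        acc_t *dst = a->next;
        if (dst == NULL || dst->fill >= BUF_DEPTH)
            return 0;               /* back-pressure: caller retries later */
        dst->buf[dst->fill++] = token;
        return 1;
    }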

17:15  12.5.4  HARDWARE ACCELERATOR FOR ANALYTICS OF SPARSE DATA
Speaker:
Eriko Nurvitadhi, Intel Corporation, US
Authors:
Eriko Nurvitadhi, Asit Mishra, Yu Wang, Ganesh Venkatesh and Debbie Marr, Intel Corporation, US
Abstract
The rapid growth of the Internet has led to web applications that produce large unstructured sparse datasets (e.g., texts, ratings). Machine learning (ML) algorithms are the basis for many important analytics workloads that extract knowledge from these datasets. This paper characterizes such workloads on a high-end server for real-world datasets and shows that a set of sparse matrix operations dominates runtime. Further, these operations run inefficiently due to low compute-per-byte ratios and challenging thread-scaling behavior. We therefore propose a hardware accelerator that performs these operations with extreme efficiency. Simulations and RTL synthesis to a 14 nm ASIC demonstrate significant performance and performance/Watt improvements over conventional processors, with only a small area overhead.
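The dominant kernel in such workloads is typically sparse matrix-vector multiplication. The generic CSR (Compressed Sparse Row) version below is a textbook formulation, not the accelerator's design, but it makes the low compute-per-byte ratio visible: each nonzero costs one multiply-add against two streamed reads (value and column index) plus an irregular gather from x.

    #include <stddef.h>

    /* y = A*x with A in CSR form: row_ptr has n_rows+1 entries, and
       col_idx/val hold one entry per nonzero. The gather x[col_idx[k]]
       is the irregular access that limits thread scaling on CPUs. */
    void spmv_csr(size_t n_rows,
                  const size_t *row_ptr,
                  const size_t *col_idx,
                  const float  *val,
                  const float  *x, float *y)
    {
        for (size_t i = 0; i < n_rows; i++) {
            float acc = 0.0f;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                acc += val[k] * x[col_idx[k]];
            y[i] = acc;
        }
    }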

17:30  End of session