6.4 Hardware support for microarchitecture performance


Date: Wednesday, March 27, 2019
Time: 11:00 - 12:30
Location / Room: Room 4

Chair:
Cristina Silvano, Politecnico di Milano, IT, Contact Cristina Silvano

Co-Chair:
Sylvain Collange, INRIA/IRISA, FR, Contact Sylvain Collange

This session deals with hardware mechanisms for high-performance or embedded real-time processors that improve their efficiency or performance beyond what software alone can achieve. The first paper proposes low-overhead hardware support that tracks multicore contention and raises system interrupts when budgets are exceeded. The second paper reduces the cost of instruction scheduling in aggressive out-of-order processors. The third paper dynamically analyzes the instruction flow and generates vectorized code at runtime.

Time | Label | Presentation Title
Authors
11:00 | 6.4.1 | MAXIMUM-CONTENTION CONTROL UNIT (MCCU): RESOURCE ACCESS COUNT AND CONTENTION TIME ENFORCEMENT
Speaker:
Jordi Cardona, Universitat Politècnica de Catalunya and Barcelona Supercomputing Center, ES
Authors:
Jordi Cardona1, Carles Hernandez2, Jaume Abella3 and Francisco Cazorla4
1Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES; 2Barcelona Supercomputing Center, ES; 3Barcelona Supercomputing Center (BSC-CNS), ES; 4Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract
In real-time systems, techniques to derive bounds on the contention that tasks can suffer in multicores build on resource quota monitoring and enforcement. In particular, they track and bound the number of requests to hardware shared resources that each core (task) is allowed to perform. In this paper, we show that current software-only solutions work well when there is a single resource and type of request to track and bound, but do not scale to the more general case of several shared resources that accept different request types, each with a different associated latency. To handle this (more general) case, we propose low-overhead hardware support called the Maximum-Contention Control Unit (MCCU). The MCCU performs fine-grain tracking of different types of requests, preventing a core from causing more interference on its contenders than budgeted. In this process, the MCCU also helps verify that the duration of individual requests does not exceed their theoretical bounds, hence dealing with scenarios in which requests can have an arbitrarily large duration.
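The budget-enforcement idea the abstract describes can be sketched as a small software model. This is purely illustrative: the class name, the per-type latencies, and the boolean return are assumptions for exposition, not the paper's hardware design.

```python
class MCCU:
    """Toy model of per-core contention-budget enforcement (illustrative only)."""

    def __init__(self, budget_cycles, latency_per_type):
        self.budget = budget_cycles        # contention cycles a core may inflict
        self.latency = latency_per_type    # worst-case latency per request type
        self.charged = {}                  # contention cycles charged so far, per core

    def record(self, core, req_type):
        # Charge this request's worst-case interference against the core's
        # budget; a False result models the point where hardware would raise
        # an interrupt or stall the offending core.
        self.charged[core] = self.charged.get(core, 0) + self.latency[req_type]
        return self.charged[core] <= self.budget
```

The key point the sketch captures is fine-grain, per-request-type accounting: each request type is charged its own worst-case latency rather than a single uniform quota.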
11:30 | 6.4.2 | FIFORDER MICROARCHITECTURE: READY-AWARE INSTRUCTION SCHEDULING FOR OOO PROCESSORS
Speaker:
Mehdi Alipour, Uppsala University, SE
Authors:
Mehdi Alipour1, Rakesh Kumar2, Stefanos Kaxiras1 and David Black-Schaffer1
1Uppsala University, SE; 2Norwegian University of Science and Technology, NO
Abstract
The number of instructions a processor's instruction queue can examine (depth) and the number it can issue together (width) determine its ability to take advantage of the ILP in an application. Unfortunately, increasing either the width or depth of the instruction queue is very costly due to the content-addressable logic needed to wake up and select instructions out-of-order. This work makes the observation that a large number of instructions have both operands ready at dispatch, and therefore do not benefit from out-of-order scheduling. We leverage this to place such ready-at-dispatch instructions in separate, simpler, in-order FIFO queues for scheduling. With such additional queues, we can reduce the size and width of the expensive out-of-order instruction queue, without reducing the processor's overall issue width and depth. Our design, FIFOrder, is able to steer more than 60% of instructions to the cheaper FIFO queues, providing a 50% energy savings over a traditional out-of-order instruction queue design, while delivering 8% higher performance.
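The steering criterion the abstract describes can be sketched in a few lines. This is a simplified model, not FIFOrder's actual logic: the function name and the string labels are assumptions for illustration.

```python
def steer(sources, ready_regs):
    # FIFOrder-style steering (sketch): an instruction whose source
    # operands are all ready at dispatch gains nothing from out-of-order
    # wakeup/select, so it can be placed in a cheap in-order FIFO queue
    # instead of the expensive CAM-based out-of-order queue.
    if all(src in ready_regs for src in sources):
        return "fifo"
    return "ooo"
```

The design point is that this check happens once, at dispatch, so the steered instructions never occupy entries in the content-addressable scheduling logic.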
12:00 | 6.4.3 | BOOSTING SIMD BENEFITS THROUGH A RUN-TIME AND ENERGY EFFICIENT DLP DETECTION
Speaker:
Mateus Rutzig, UFSM, BR
Authors:
Michael Jordan, Tiago Knorst, Julio Vicenzi and Mateus Beck Rutzig, UFSM, BR
Abstract
Data-level parallelism (DLP) has been improving the performance-energy tradeoff of current processors through coupled SIMD engines such as Intel AVX and ARM NEON. Special libraries and compilers are used to support DLP execution on such engines. However, hand coding incurs unavoidable development-time overhead, since most software developers are not skilled at extracting DLP with unfamiliar libraries. In addition, DLP detection through the compiler, besides breaking software compatibility, is limited to static code analysis, which compromises performance gains. In this work, we propose a runtime DLP detection mechanism named Dynamic SIMD Assembler (DSA), which transparently identifies vectorizable code regions for execution on the ARM NEON engine. Due to its dynamic nature, DSA preserves software compatibility and avoids overhead in the software development process. Results show that DSA outperforms the ARM NEON auto-vectorizing compiler by 32%, since it covers wider vectorized regions, such as dynamic-range, sentinel, and conditional loops. In addition, DSA outperforms hand-vectorized code using the ARM library by 26% while reducing energy consumption by 45%, with no penalty on software development time.
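One way to picture runtime DLP detection is scanning a dynamic instruction trace for runs of identical operations over consecutive addresses, which can then be packed into SIMD bundles. The model below is hypothetical: the trace format (opcode, address, stride) and the grouping rule are assumptions for illustration, not DSA's mechanism.

```python
def find_simd_groups(trace, width=4):
    # Sketch of trace-based vectorizable-region detection: a run of
    # `width` identical scalar ops stepping through memory at a fixed
    # stride is a candidate for one SIMD instruction.
    groups, i = [], 0
    while i + width <= len(trace):
        op, addr, stride = trace[i]
        window = trace[i:i + width]
        if all(t[0] == op and t[1] == addr + j * stride and t[2] == stride
               for j, t in enumerate(window)):
            groups.append(window)   # candidate for one SIMD-width operation
            i += width
        else:
            i += 1
    return groups
```

A dynamic detector of this flavor sees the actual iteration count and addresses, which is why it can handle loops whose trip counts or bounds are unknown statically.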
12:30 | IP3-3, 336 | DEPENDENCY-RESOLVING INTRA-UNIT PIPELINE ARCHITECTURE FOR HIGH-THROUGHPUT MULTIPLIERS
Speaker:
Dae Hyun Kim, Washington State University, US
Authors:
Jihee Seo and Dae Hyun Kim, Washington State University, US
Abstract
In this paper, we propose two dependency-resolving intra-unit pipeline architectures to design high-throughput multipliers. Simulation results show that the proposed multipliers achieve approximately 2.3× to 3.1× execution time reduction at a cost of 4.4% area and 3.7% power overheads for highly-dependent multiplications.
12:31 | IP3-4, 832 | A HARDWARE-EFFICIENT LOGARITHMIC MULTIPLIER WITH IMPROVED ACCURACY
Authors:
Mohammad Saeed Ansari, Bruce Cockburn and Jie Han, University of Alberta, CA
Abstract
Logarithmic multipliers take the base-2 logarithm of the operands and perform multiplication by only using shift and addition operations. Since computing the logarithm is often an approximate process, some accuracy loss is inevitable in such designs. However, the area, latency, and power consumption can be significantly improved at the cost of accuracy loss. This paper presents a novel method to approximate log_2(N) that, unlike the existing approaches, rounds N to its nearest power of two instead of the highest power of two smaller than or equal to N. This approximation technique is then used to design two improved 16x16 logarithmic multipliers that use exact and approximate adders (ILM-EA and ILM-AA, respectively). These multipliers achieve up to 24.42% and 9.82% savings in area and power-delay product, respectively, compared to the state-of-the-art design in the literature with similar accuracy. The proposed designs are evaluated in the Joint Photographic Experts Group (JPEG) image compression algorithm and their advantages over other approximate logarithmic multipliers are shown.
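The rounding rule the abstract contrasts can be sketched as a software model of the integer part (characteristic) of log_2(N). This is an illustration of the rounding direction only, not the paper's hardware design, and the tie-breaking choice toward the lower power of two is an assumption.

```python
def log2_floor_approx(n: int) -> int:
    # Classic approach: characteristic from the highest power of two <= n,
    # i.e. the position of the leading one bit.
    return n.bit_length() - 1

def log2_nearest_approx(n: int) -> int:
    # Rounding-to-nearest variant (sketch): pick whichever neighboring
    # power of two is closer to n (ties go to the lower one here).
    k = n.bit_length() - 1
    lower, upper = 1 << k, 1 << (k + 1)
    return k if n - lower <= upper - n else k + 1
```

For example, 7 is closer to 8 than to 4, so the nearest-rounding variant returns 3 where the floor-based one returns 2, cutting the worst-case error in the characteristic roughly in half.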
12:32 | IP3-5, 440 | LIGHTWEIGHT HARDWARE SUPPORT FOR SELECTIVE COHERENCE IN HETEROGENEOUS MANYCORE ACCELERATORS
Authors:
Alessandro Cilardo, Mirko Gagliardi and Vincenzo Scotti, University of Naples Federico II, IT
Abstract
Shared memory coherence is a key feature in manycore accelerators, ensuring programmability and application portability. Most established solutions for coherence in homogeneous systems cannot be simply reused because of the special requirements of accelerator architectures. This Interactive Presentation paper introduces a low-overhead hardware coherence system for heterogeneous accelerators, with customizable granularity and noncoherent region support. The coherence system has been demonstrated in operation in a full manycore accelerator, exhibiting significant improvements in terms of network load, execution time, and power consumption.
12:30 End of session
Lunch Break in Lunch Area



Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Lunch Breaks (Lunch Area)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the "Lunch Area" to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 26, 2019

  • Coffee Break 10:30 - 11:30
  • Lunch Break 13:00 - 14:30
  • Awards Presentation and Keynote Lecture in "TBD" 13:50 - 14:20
  • Coffee Break 16:00 - 17:00

Wednesday, March 27, 2019

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:30
  • Awards Presentation and Keynote Lecture in "TBD" 13:30 - 14:20
  • Coffee Break 16:00 - 17:00

Thursday, March 28, 2019

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:00
  • Keynote Lecture in "TBD" 13:20 - 13:50
  • Coffee Break 15:30 - 16:00