2.4 Performance and Power Analysis


Time	Label	Presentation Title / Authors

11:30	2.4.1	GATSIM: ABSTRACT TIMING SIMULATION OF GPUS
Speaker: Andreas Gerstlauer, The University of Texas at Austin, US
Authors: Kishore Punniyamurthy, Behzad Boroujerdian and Andreas Gerstlauer, The University of Texas at Austin, US
Abstract
General-purpose graphics processing units (GPUs) have become an integral part of heterogeneous system architectures. Ever-increasing complexity has made rapid, early performance evaluation of GPU-based architectures and applications a primary design concern. Traditional cycle-accurate GPU simulators are too slow, while existing analytical or source-level estimation approaches are often inaccurate. This paper proposes a novel abstract GPU performance simulation approach based on a flexible separation of functional and timing models, combining fast functional execution, either on existing simulators or on native GPU hardware, with a lightweight, fast and accurate abstract timing model. The micro-architectural timing of individual GPU cores is abstracted through a static, one-time pre-characterization of the code, and only the dynamic scheduling effects are simulated. Using a native GPU for functional execution and excluding pre-characterization, our GPU simulation achieves a throughput of more than 80 MIPS. For standard GPU benchmarks, this is on average 400x faster than a cycle-accurate GPU simulator, with 4% error. Moreover, our simple timing model provides the flexibility to target different GPU configurations with little or no extra effort.
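
The split between statically pre-characterized code timing and dynamically simulated scheduling can be illustrated with a toy model. The sketch below is not GATSim itself: the function `simulate_core`, the per-block cost table, and the "earliest-ready warp issues next" policy are all assumptions made for illustration. Each basic block carries a fixed, pre-characterized cost (issue cycles plus a latency during which the warp cannot issue), and only the interleaving of warps on a single issue port is simulated.

```python
def simulate_core(warp_traces, block_cost):
    """Toy abstract timing model for one GPU core (illustrative assumption).

    warp_traces: list of per-warp basic-block id sequences, as they might be
                 recorded during a separate functional execution.
    block_cost:  block id -> (issue_cycles, latency_cycles), standing in for
                 a static, one-time pre-characterization of the code.
    Returns the cycle at which the last warp finishes.
    """
    now = 0                                   # time at which the issue port is free
    ready_at = [0] * len(warp_traces)         # earliest time each warp may issue again
    next_block = [0] * len(warp_traces)       # index of each warp's next block
    pending = {w for w, trace in enumerate(warp_traces) if trace}

    while pending:
        # Hypothetical policy: the warp that becomes ready first issues next,
        # with ties broken by warp index.
        w = min(pending, key=lambda i: (ready_at[i], i))
        issue, latency = block_cost[warp_traces[w][next_block[w]]]
        start = max(now, ready_at[w])         # wait for both the port and the warp
        now = start + issue                   # port is busy while the block issues
        ready_at[w] = now + latency           # warp is blocked for the block's latency
        next_block[w] += 1
        if next_block[w] == len(warp_traces[w]):
            pending.discard(w)

    return max(ready_at, default=0)


# Made-up costs: a "load" block issues for 2 cycles and then waits 10 cycles,
# an "alu" block issues for 4 cycles with no extra latency.
costs = {"load": (2, 10), "alu": (4, 0)}
one_warp = simulate_core([["load", "alu"]], costs)
two_warps = simulate_core([["load", "alu"], ["load", "alu"]], costs)
print(one_warp, two_warps)
```

In this toy setup the second warp hides part of the first warp's load latency, so two warps finish in fewer cycles than twice the single-warp time. Retargeting to a different configuration would amount to swapping the cost table or the scheduling policy, which is the kind of flexibility the abstract refers to.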

12:00	2.4.2	MESAP: A FAST ANALYTIC POWER MODEL FOR DRAM MEMORIES
Speaker: Sandeep Poddar, IBM Research, The Netherlands, NL
Authors: Sandeep Poddar¹, Rik Jongerius¹, Leandro Fiorin¹, Giovanni Mariani¹, Gero Dittmann², Andreea Anghel² and Henk Corporaal³
¹IBM Research, NL; ²IBM Research, CH; ³TU/e (Eindhoven University of Technology), NL
Abstract
The design of an energy-efficient memory subsystem is one of the key issues that system architects face today. To achieve this goal, architects usually rely on system simulators and trace-based DRAM power models. However, the long execution times of these tools make the approach infeasible for design-space exploration of next-generation exascale computing systems. Analytic models, in contrast, are orders of magnitude faster. In this paper, we propose a new analytic, memory-scheduler-agnostic power model (MeSAP) for DRAM. Our model achieves an average error of 20% for DDR3 and DDR4 memory systems, similar to that of a state-of-the-art trace-based approach, while being an order of magnitude faster. Furthermore, we integrate MeSAP into an analytic performance model of general-purpose processors and show its applicability to the design of a computing system targeting scientific image processing applications.

12:30	2.4.3	AFEC: AN ANALYTICAL FRAMEWORK FOR EVALUATING CACHE PERFORMANCE IN OUT-OF-ORDER PROCESSORS
Speaker: Kecheng Ji, Southeast University, CN
Authors: Kecheng Ji¹, Ming Ling¹, Qin Wang¹, Longxing Shi¹ and Jianping Pan²
¹Southeast University, CN; ²University of Victoria, CA
Abstract
Evaluating cache performance is becoming critically important for predicting the overall performance of out-of-order processors. Non-blocking caches, which are very common in out-of-order CPUs, can reduce the average cache miss penalty by overlapping multiple outstanding memory requests and by merging cache misses to the same cache line address into a single memory request. Memory-level parallelism (MLP) is normally used as a metric to describe the concurrency of memory accesses. Unfortunately, because the dependences among a program's memory references are extremely dynamic, it is very difficult to quantify MLP without time-consuming simulations. Moreover, the merging of multiple cache misses, which makes the average cache miss service time shorter than the physical DDR access latency, is seldom considered in existing research. In this paper, we propose a cache performance evaluation framework based on program trace analysis and analytical models that quickly estimates MLP and the effective cache miss service time without simulations. Compared with Gem5 simulation results for MobyBench 2.0, Mibench 1.0 and Mediabench II, the average accuracy of the modeled MLP and of the average cache miss service time is higher than 91% and 92%, respectively. Combined with cache misses calculated by stack distance theory, the average absolute error of the CPU stall time (due to cache misses) is lower than 10%, while evaluation is 35 times faster than full Gem5 simulation.
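
For readers unfamiliar with stack distance theory, the following minimal sketch shows how cache misses can be counted from a memory trace without simulating a cache. It is not the AFEC framework: it assumes a fully associative LRU cache, and the stall-time formula (misses times effective service time, divided by MLP) is a common first-order approximation used here only for illustration, not necessarily the model in the paper. The names `stack_distances`, `count_misses` and `stall_cycles` are hypothetical.

```python
from collections import OrderedDict


def stack_distances(trace):
    """LRU stack distance of every access in `trace`.

    The stack distance is the number of distinct cache lines touched since
    the previous access to the same line; None marks a first (cold) access.
    """
    stack = OrderedDict()                 # keys ordered from least to most recent
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack.keys())
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)       # line becomes the most recently used
        else:
            distances.append(None)
            stack[line] = True
    return distances


def count_misses(trace, cache_lines):
    """Misses of a fully associative LRU cache holding `cache_lines` lines.

    An access misses if it is cold or if its stack distance is at least
    the cache capacity.
    """
    return sum(1 for d in stack_distances(trace)
               if d is None or d >= cache_lines)


def stall_cycles(misses, service_time, mlp):
    """First-order stall estimate (assumed formula): overlapped misses share
    their latency, so the total miss latency is divided by the MLP."""
    return misses * service_time / mlp


trace = ["A", "B", "C", "A", "B", "D", "A"]
misses = count_misses(trace, cache_lines=3)
print(misses, stall_cycles(misses, service_time=100, mlp=2.0))
```

A single pass over the trace yields the miss count for any cache capacity, since the distances are independent of the cache size; that property is what makes trace analysis attractive compared with re-running a simulator for each configuration.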

13:00	IP1-5, 88	MODELING INSTRUCTION CACHE AND INSTRUCTION BUFFER FOR PERFORMANCE ESTIMATION OF VLIW ARCHITECTURES USING NATIVE SIMULATION
Speaker: Omayma Matoussi, Grenoble INP, TIMA laboratory, FR
Authors: Omayma Matoussi¹ and Frédéric Pétrot²
¹Tima Laboratory at Grenoble, FR; ²TIMA Laboratory, Grenoble Institute of Technology, FR
Abstract
In this work, we propose an instruction cache (icache) performance estimation approach that focuses on a component needed to handle instruction parallelism in a very long instruction word (VLIW) processor: the instruction buffer (IB). Our annotation approach builds on an intermediate-level native-simulation framework. Evaluated against a cycle-accurate instruction set simulator, it yields an average cycle count error of 9.3% and an average speedup of 10x.

13:00	End of session
Lunch Break in Garden Foyer

Keynote Lecture session 3.0 in "Garden Foyer", 13:50 - 14:20

Lunch Break in the Garden Foyer
On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Please note that lunch is restricted to conference delegates holding a lunch voucher. On entering the lunch area, delegates will be asked to present the lunch voucher for that day. Once you have left the lunch area, re-entry is not permitted for that lunch.
