11.3 Microarchitectures and Workload Allocation for Energy Efficiency


Date: Thursday 17 March 2016
Time: 14:00 - 15:30
Location / Room: Konferenz 1

Chair:
Daniele Bortolotti, Univ. of Bologna, IT

Co-Chair:
Andreas Burg, École Polytechnique Fédérale de Lausanne (EPFL), CH

The session discusses novel power modeling, workload allocation, and microarchitectural techniques for improving energy efficiency in data centers and processors.

Time  Label  Presentation Title
Authors
14:00  11.3.1  RESISTIVE CONFIGURABLE ASSOCIATIVE MEMORY FOR APPROXIMATE COMPUTING
Speaker:
Abbas Rahimi, University of California, Berkeley, US
Authors:
Mohsen Imani1, Abbas Rahimi2 and Tajana Rosing3
1UC San Diego, US; 2University of California, Berkeley, US; 3University of California, San Diego, US
Abstract
Modern computing machines are increasingly characterized by large-scale parallelism in hardware (such as GP-GPUs) and the advent of large-scale and innovative memory blocks. Parallelism enables expanded performance tradeoffs, whereas memories enable reuse of computational work. To be effective, however, one needs to ensure energy efficiency with minimal reuse overheads. In this paper, we describe a resistive configurable associative memory (ReCAM) that enables selective approximation and asymmetric voltage overscaling to manage delivered efficiency. The ReCAM structure matches an input pattern with pre-stored ones by applying an approximate search on selected bit indices (bitline-configurable) or selected pre-stored patterns (row-configurable). To further reduce energy, we explore proper ReCAM sizing, various configurable search operations with low-overhead voltage overscaling, and different ReCAM update policies. Experimental results on the AMD Southern Islands GPUs for eight applications show that bitline-configurable and row-configurable ReCAM achieve on average 43.6% and 44.5% energy savings, respectively, with an acceptable quality loss of 10%.
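The bitline-configurable search described in the abstract can be illustrated with a small software sketch. This is a hypothetical model, not the paper's hardware design: it matches an input pattern against pre-stored rows while comparing only a selected subset of bit indices, approximating the rest away.

```python
# Hypothetical sketch of a bitline-configurable approximate search:
# only the bit indices listed in `exact_bits` participate in the match,
# mimicking ReCAM's selective approximation of the remaining bitlines.

def approx_search(stored_rows, pattern, exact_bits):
    """Return the index of the first stored row that matches `pattern`
    on every position in `exact_bits`, or None if no row matches."""
    for i, row in enumerate(stored_rows):
        if all(row[b] == pattern[b] for b in exact_bits):
            return i
    return None

rows = ["10110011", "10010111", "11110000"]
# Compare only the four most-significant bit positions (indices 0-3):
# the search "hits" row 1 even though its low bits differ from the input.
hit = approx_search(rows, "10011111", exact_bits=range(4))
```

Restricting the comparison to fewer bitlines trades match precision for energy, which is the knob the paper tunes against the 10% quality-loss budget.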

Download Paper (PDF; Only available from the DATE venue WiFi)
14:30  11.3.2  EXPLOITING CPU-LOAD AND DATA CORRELATIONS IN MULTI-OBJECTIVE VM PLACEMENT FOR GEO-DISTRIBUTED DATA CENTERS
Speaker:
Ali Pahlevan, École Polytechnique Fédérale de Lausanne (EPFL), CH
Authors:
Ali Pahlevan, Pablo Garcia del Valle and David Atienza, École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
Cloud computing has been proposed as a new paradigm to deliver services over the internet. The proliferation of cloud services and increasing user demand for computing resources have led to the appearance of geo-distributed data centers (DCs). These DCs host heterogeneous applications with changing characteristics, such as CPU-load correlation, which offers significant potential for energy savings when the utilization peaks of two virtual machines (VMs) do not occur at the same time, or the amount of data exchanged between VMs, which directly impacts performance, i.e., response time. This paper presents a two-phase multi-objective VM placement, clustering, and allocation algorithm, along with a dynamic migration technique, for geo-distributed DCs coupled with renewable and battery energy sources. It exploits holistic knowledge of VM characteristics, namely CPU-load and data correlations, to tackle the challenges of operational cost optimization and the energy-performance trade-off. Experimental results demonstrate that, compared to state-of-the-art schemes, the proposed method provides up to 55% operational cost savings, 15% energy consumption reduction, and 12% performance (response time) improvement.
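The CPU-load-correlation idea can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm: two VMs whose load traces are anti-correlated are good co-location candidates, because their combined load stays well below the sum of their individual peaks.

```python
# Hypothetical sketch: Pearson correlation of two VM CPU-load traces.
# Anti-correlated VMs (peaks in different time slots) can share a server
# without the host ever seeing the sum of both peaks.

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length traces."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

vm_a = [80, 20, 80, 20]   # % CPU load, peaks in even time slots
vm_b = [20, 80, 20, 80]   # peaks in odd time slots
r = pearson(vm_a, vm_b)   # strongly negative: good co-location pair
peak_together = max(a + b for a, b in zip(vm_a, vm_b))  # 100%, not 160%
```

A placement algorithm exploiting this signal would prefer pairing VMs with the most negative correlation, which is the energy-saving opportunity the abstract refers to.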

15:00  11.3.3  ENERGY EFFICIENCY IN CLOUD-BASED MAPREDUCE APPLICATIONS THROUGH BETTER PERFORMANCE ESTIMATION
Speaker:
Seyed Morteza Nabavinejad, Sharif University of Technology, IR
Authors:
Seyed Morteza Nabavinejad and Maziar Goudarzi, Sharif University of Technology, IR
Abstract
An important issue for efficient execution of MapReduce jobs on a cloud platform is selecting the best-fitting virtual machine (VM) configuration(s) among the miscellany of choices that cloud providers offer. Wise selection of VM configurations can lead to better performance, cost, and energy consumption. Therefore, it is crucial to explore the available configurations and choose the best one for each given MapReduce application. Executing the given application on all the configurations for comparison is a costly, time- and energy-consuming process. An alternative is to run the application on a subset of configurations (sample configurations) and estimate its performance on other configurations based on the values obtained on those sample configurations. We show that the choice of these sample configurations highly affects the accuracy of later estimations. Our Smart Configuration Selection (SCS) scheme chooses better representatives from among all configurations by a once-off analysis of given performance figures of the benchmarks, so as to increase the accuracy of estimations of missing values, and consequently, to more accurately choose the configuration providing the highest performance. The results show that the SCS choice of sample configurations is very close to the best choice, and can reduce estimation error to 7.11% from the original 16.02% of random configuration selection. Furthermore, this more accurate performance estimation saves 24.3% energy on average.
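The estimate-from-samples step can be sketched as follows. This is a deliberately simplified stand-in for SCS, under the assumption that relative performance between two configurations is roughly stable across applications: a new application's runtime on an unmeasured configuration is predicted by scaling its measured runtime with the average target/sample ratio observed over a benchmark set.

```python
# Hypothetical sketch: estimate runtime on unseen VM configurations from a
# measurement on one sample configuration, using mean per-configuration
# scaling ratios learned from known benchmarks. Names and numbers are
# illustrative, not from the paper.

benchmarks = {              # runtime (s) of known benchmarks per config
    "sort": {"c1": 100, "c2": 50, "c3": 200},
    "grep": {"c1": 80,  "c2": 40, "c3": 160},
}

def estimate(measured, sample_cfg, target_cfg):
    """Scale the runtime measured on `sample_cfg` by the mean
    target/sample ratio seen across the benchmark set."""
    ratios = [b[target_cfg] / b[sample_cfg] for b in benchmarks.values()]
    return measured * sum(ratios) / len(ratios)

# New app measured only on c1 (120 s); estimate runtimes on c2 and c3.
est_c2 = estimate(120, "c1", "c2")
est_c3 = estimate(120, "c1", "c3")
```

The paper's point is that which configurations serve as samples matters: a sample whose ratios generalize poorly inflates the estimation error, which SCS avoids by analyzing the benchmark figures up front.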

15:15  11.3.4  UNSUPERVISED POWER MODELING OF CO-ALLOCATED WORKLOADS FOR ENERGY EFFICIENCY IN DATA CENTERS
Speaker:
Juan Carlos Salinas-Hilburg, Universidad Politécnica de Madrid, ES
Authors:
Juan Carlos Salinas-Hilburg1, Marina Zapater2, José L. Risco-Martín3, Jose Manuel Moya1 and Jose L. Ayala3
1Universidad Politécnica de Madrid, ES; 2CEI Campus Moncloa, UCM-UPM, ES; 3Universidad Complutense de Madrid, ES
Abstract
Data centers are huge power consumers and their energy consumption keeps on rising despite the efforts to increase energy efficiency. A great body of research is devoted to the reduction of the computational power of these facilities, applying techniques such as power budgeting and power capping in servers. Such techniques rely on models to predict the power consumption of servers. However, estimating overall server power for arbitrary applications running co-allocated in multithreaded servers is not a trivial task. In this paper, we use Grammatical Evolution techniques to predict the dynamic power of the CPU and memory subsystems of an enterprise server using the hardware counters of each application. On top of our dynamic power models, we use fan and temperature-dependent leakage power models to obtain the overall server power. To train and test our models, we use real traces from a presently shipping enterprise server under a wide set of sequential and parallel workloads running at various frequencies. We prove that our model is able to predict the power consumption of two different tasks co-allocated in the same server, keeping the error below 8 W. For the first time in the literature, we develop a methodology able to combine the hardware counters of two individual applications and estimate overall server power consumption without running the co-allocated application. Our results show a prediction error below 12 W, which represents 7.3% of the overall server power, outperforming previous approaches in the state of the art.
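The counter-combination idea can be sketched with a minimal linear model. This is an illustrative stand-in, not the paper's Grammatical Evolution model: a dynamic-power model trained on per-application hardware counters is applied to the element-wise sum of two applications' counters to predict the power of the co-allocated pair without ever running it together. All weights and counter values below are assumed.

```python
# Hypothetical sketch: linear dynamic-power model over hardware counters,
# applied to the combined counters of two applications. The weights and
# the constant idle/leakage term are illustrative, not measured values.

WEIGHTS = {"instructions": 4e-9, "llc_misses": 2e-7}  # W per event/s
IDLE_POWER = 60.0  # W, static + leakage (assumed constant here)

def server_power(counters):
    """Predict total server power from per-second counter rates."""
    dynamic = sum(WEIGHTS[name] * rate for name, rate in counters.items())
    return IDLE_POWER + dynamic

app_a = {"instructions": 5e9, "llc_misses": 1e7}  # measured alone
app_b = {"instructions": 3e9, "llc_misses": 4e7}  # measured alone
combined = {k: app_a[k] + app_b[k] for k in app_a}
predicted = server_power(combined)  # co-allocated pair, never executed
```

The paper's contribution is precisely that such a combination can be made accurate (below 12 W error) despite contention effects that a naive sum of counters would miss.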

15:30  IP5-10, 205  A POWER-EFFICIENT 3-D ON-CHIP INTERCONNECT FOR MULTI-CORE ACCELERATORS WITH STACKED L2 CACHE
Speaker:
Kyungsu Kang, Samsung, KR
Authors:
Kyungsu Kang1, Luca Benini2, Giovanni De Micheli3, Sangho Park1 and Jong-Bae Lee1
1Samsung, KR; 2Università di Bologna, IT; 3École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
The use of multi-core clusters is a promising option for data-intensive embedded applications such as multimodal sensor fusion, image understanding, and mobile augmented reality. In this paper, we propose a power-efficient 3-D on-chip interconnect for multi-core clusters with stacked L2 cache memory. A new switch design makes a circuit-switched Mesh-of-Tree (MoT) interconnect reconfigurable to support power-gating of processing cores, memory blocks, and unnecessary interconnect resources (routing switches, arbitration switches, and inverters placed along the on-chip wires). The proposed 3-D MoT improves power efficiency by up to 77% in terms of energy-delay product (EDP).

15:31  IP5-11, 898  POWER-EFFICIENT LOAD-BALANCING ON HETEROGENEOUS COMPUTING PLATFORMS
Speaker:
Muhammad Shafique, Karlsruhe Institute of Technology (KIT), DE
Authors:
Muhammad Usman Karim Khan1, Muhammad Shafique1, Apratim Gupta2, Thomas Schumann2 and Jörg Henkel1
1Karlsruhe Institute of Technology (KIT), DE; 2University of Applied Sciences, Darmstadt, DE
Abstract
To meet the throughput constraints of a system at minimal power consumption, the workload of its computing nodes should be balanced. This requires accounting for the underlying hardware characteristics (e.g., power vs. frequency profiles) and the throughput sustainable by these nodes. This work presents a methodology for distributing and balancing a divisible load across heterogeneous nodes under a throughput constraint. The power efficiency of each node is considered during load distribution. For load balancing, each node's frequency is set to the value that just fulfills its job requirements. We functionally verify our methodology by implementing it on an FPGA-based system with heterogeneous multi-cores and hardware accelerators, and report results for different image processing benchmarks. Compared to a state-of-the-art approach, our approach yields up to 64% performance improvement for the benchmarks evaluated in this paper.
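The two steps the abstract describes, efficiency-aware distribution and per-node frequency selection, can be sketched as follows. This is a toy model under assumed node figures, not the paper's methodology: the divisible load is split in proportion to each node's throughput per watt, and each node's DVFS setting is then lowered to the fraction of maximum frequency that just sustains its share.

```python
# Hypothetical sketch of power-aware load balancing on heterogeneous
# nodes. Throughput and power figures are illustrative assumptions.

nodes = {                      # frames/s at max frequency, power at max (W)
    "cpu":   {"tput": 30.0, "power": 10.0},
    "accel": {"tput": 60.0, "power": 5.0},
}

def distribute(total_tput):
    """Split the required throughput in proportion to each node's
    power efficiency (throughput per watt)."""
    eff = {n: d["tput"] / d["power"] for n, d in nodes.items()}
    total_eff = sum(eff.values())
    return {n: total_tput * e / total_eff for n, e in eff.items()}

def min_freq_scale(node, share):
    """Fraction of max frequency that just sustains `share`
    (a simplified linear DVFS model)."""
    return min(1.0, share / nodes[node]["tput"])

shares = distribute(45.0)   # the more efficient accelerator gets more work
scales = {n: min_freq_scale(n, s) for n, s in shares.items()}
```

Running each node at the lowest frequency that still meets its share is what keeps the throughput constraint satisfied at minimal power, the core trade-off the paper addresses.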

15:30  End of session
Coffee Break in Exhibition Area