Low Power Design of the X-GOLD® SDR 20 Baseband Processor

Wolfgang Raab, Jörg Berthold, Ulrich Hachmann, Dominik Langen, Michael Schreiner
Infineon Technologies AG, Germany

Holger Eisenreich, Jens-Uwe Schluessler, Georg Ellguth
TU Dresden, Germany

I. INTRODUCTION

The X-GOLD® SDR 2x family of programmable baseband processors is designed for hosting multiple standards of mobile communication, connectivity, and reception of broadcast services. Processors from this family obtain the necessary flexibility from a set of programmable SIMD (single-instruction, multiple-data) processor cores, which exchange data through shared on-chip memories. The SIMD cores are supported by a few dedicated, configurable hardware accelerators for those DSP tasks that require little or no flexibility, by an ARM® core for executing the upper layers of the protocol stack, and by standard I/O components.

II. ARCHITECTURAL MEASURES FOR LOW POWER OPTIMIZATION

The SIMD cores and accelerators access the shared memory through an on-chip system bus. Software profiling revealed that a SIMD core typically communicates with only a few other SIMD cores. Hence, we arranged the SIMD cores in several groups, called SIMD clusters. Likewise, we grouped the accelerators into an accelerator cluster. The shared memory for inter-core data exchange is distributed over the various clusters.

The resulting two-level hierarchical system bus, which consists of multi-layer local buses between cores and shared memory blocks inside the clusters and a global interconnect between the clusters, eases the scaling of the architecture. Intra-cluster memory accesses not only exhibit shorter latencies but also consume less power than inter-cluster data transfers. With proper allocation of data objects to the clusters, processing performance is higher and power consumption is lower than with a non-hierarchical bus.
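The effect of data allocation can be illustrated with a minimal cost model. The sketch below is not taken from the design flow; the cost constants, the core and cluster names, and the access counts are hypothetical placeholders chosen only to show how a data object might be assigned to the cluster whose cores access it most often.

```python
# Illustrative cost model; both constants are made-up relative figures.
INTRA_CLUSTER_COST = 1.0   # hypothetical energy per access over a local bus
INTER_CLUSTER_COST = 4.0   # hypothetical energy per access over the global interconnect


def access_cost(accesses, core_cluster, home_cluster):
    """Relative energy of all accesses to one object stored in home_cluster.

    accesses maps a core name to its number of accesses to the object.
    """
    total = 0.0
    for core, count in accesses.items():
        local = core_cluster[core] == home_cluster
        total += count * (INTRA_CLUSTER_COST if local else INTER_CLUSTER_COST)
    return total


def best_cluster(accesses, core_cluster):
    """Return the cluster that minimizes the relative access energy."""
    clusters = sorted(set(core_cluster.values()))
    return min(clusters, key=lambda c: access_cost(accesses, core_cluster, c))


# Hypothetical mapping of cores to clusters and a profiled access pattern.
core_cluster = {"simd0": "A", "simd1": "A", "simd2": "B"}
fft_buffer = {"simd0": 900, "simd1": 300, "simd2": 50}
placement = best_cluster(fft_buffer, core_cluster)
```

With these placeholder numbers the buffer is placed in cluster A, because only the 50 accesses from simd2 then cross the global interconnect.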

While the data-level parallelism typical of most baseband algorithms is exploited within a SIMD core to reduce area and power consumption, the multiplicity of SIMD cores exploits parallelism at the task level. To ensure optimum throughput and power efficiency at this level as well, additional hardware for synchronization has been implemented.

Traditional multi-processor systems usually achieve synchronization between tasks or threads by means of atomic read-modify-write or comparable operations on the shared memory. This not only reduces processor throughput by locking the bus but also increases power consumption when software busy-waits on synchronization variables in the shared memory.

To improve throughput and power consumption, we integrated a dedicated hardware infrastructure for synchronization. Firstly, it provides a separate memory for the allocation of synchronization variables, with narrow point-to-point connections to the cores. Secondly, we added hardware around this memory that ensures the atomicity of read-modify-write operations and takes over the monitoring of memory locations. As seen from the processor, waiting for a synchronization variable to be released is implemented by power-saving clock-gating of the core’s pipeline until the synchronization hardware signals a successful memory access.
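The behavior described above can be sketched as a small software model. The class and method names are hypothetical, and the first-come, first-served hand-over is an assumption made for illustration; the text does not specify the arbitration policy of the actual hardware.

```python
class SyncUnit:
    """Toy model of a hardware synchronization unit (all names hypothetical)."""

    def __init__(self, num_vars):
        self.owner = [None] * num_vars              # core holding each variable
        self.waiters = [[] for _ in range(num_vars)]
        self.gated = set()                          # cores with a clock-gated pipeline

    def acquire(self, core, var):
        """Atomic test-and-set; a losing core is clock-gated instead of polling."""
        if self.owner[var] is None:
            self.owner[var] = core
            return True
        self.waiters[var].append(core)
        self.gated.add(core)
        return False

    def release(self, core, var):
        """Free the variable, or hand it to the longest-waiting core (assumed FIFO)."""
        assert self.owner[var] == core, "only the owner may release"
        if self.waiters[var]:
            next_core = self.waiters[var].pop(0)
            self.owner[var] = next_core
            self.gated.discard(next_core)           # access succeeded, clock resumes
        else:
            self.owner[var] = None


unit = SyncUnit(num_vars=4)
first = unit.acquire("simd0", 0)     # granted immediately
second = unit.acquire("simd1", 0)    # refused: simd1 is clock-gated, no bus traffic
unit.release("simd0", 0)             # simd1 becomes the owner and runs again
```

The point of the model is that the waiting core issues no further memory accesses while it is gated, in contrast to a busy-wait loop on shared memory.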

Obviously, the number of SIMD cores on the chip must be chosen to satisfy the most demanding combination of communication standards to be supported by the platform. Most of the time, however, only some, or even just a few, of the SIMD cores are active. For the sake of the battery lifetime of the mobile device, the unused cores must be disconnected from the power supply.

We defined a two-level hierarchy of power-gated domains that allows for switching off single SIMD cores as well as entire SIMD clusters, including the local bus and the respective part of the shared memory. In contrast to the hardware-controlled clock-gating, power-gating is software-controlled. The real-time operating system provides API functions for triggering the power-down and power-up transitions as well as for configuring their timing parameters. Several parts of the power management are handled by the operating system itself, based on its knowledge of resource allocation.
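A two-level hierarchy of this kind can be modeled in a few lines. The sketch below does not reproduce the real-time operating system's API; the class, the method names, and the policy of switching off a cluster as soon as its last core is powered down are assumptions for illustration. A real system would also have to confirm that no other cluster still needs the shared memory of the cluster being switched off.

```python
class PowerManager:
    """Toy model of two-level power gating; not the actual RTOS API."""

    def __init__(self, clusters):
        # clusters maps a cluster name to the list of its core names
        self.cores_of = {c: list(cores) for c, cores in clusters.items()}
        self.cluster_of = {core: c for c, cores in clusters.items() for core in cores}
        self.cluster_on = {c: False for c in clusters}
        self.core_on = {core: False for core in self.cluster_of}

    def power_up_core(self, core):
        """Power up a core, powering its cluster (local bus, memory) first."""
        self.cluster_on[self.cluster_of[core]] = True
        self.core_on[core] = True

    def power_down_core(self, core):
        """Power down a core; gate the cluster once its last core is off."""
        self.core_on[core] = False
        cluster = self.cluster_of[core]
        if not any(self.core_on[c] for c in self.cores_of[cluster]):
            self.cluster_on[cluster] = False


pm = PowerManager({"A": ["simd0", "simd1"], "B": ["simd2"]})
pm.power_up_core("simd0")
pm.power_up_core("simd1")
pm.power_down_core("simd0")
a_still_on = pm.cluster_on["A"]      # cluster A stays on while simd1 is active
pm.power_down_core("simd1")          # last core off, so cluster A is gated too
```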

III. LOW POWER IMPLEMENTATION

A couple of technology options were available for the implementation of the SIMD clusters. A standard cell library with a reduced cell height promised smaller area and lower power consumption at the cost of increased cell delay. Transistors with reduced threshold voltage could speed up timing-critical paths at the cost of increased leakage power. The challenge was to derive the power-optimal selection of supply voltage, library, and transistor threshold voltage.

Three key factors enabled this investigation: (1) Based on a predecessor design, the relative contributions of the various sub-blocks of a SIMD core to the power consumption were assessed. (2) Software profiling on a virtual prototype could give estimates about the proportion of active and idle times of the various sub-blocks while executing representative tasks. (3) The gate-level netlist for a representative partition of the SIMD core was available very early and could be used for place & route runs and gate-level power simulations.
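One way to combine these three inputs is an activity-weighted power estimate per technology option. The following sketch is an assumption about how such a comparison could be set up, not the procedure actually used; all numbers are placeholders in arbitrary units and are not measurements from the design.

```python
def average_power(blocks):
    """Activity-weighted average power of one core.

    Each block is a tuple (active_fraction, active_power, idle_power).
    Idle power is dominated by leakage when the block is clock-gated.
    """
    return sum(a * p_act + (1.0 - a) * p_idle for a, p_act, p_idle in blocks)


# Placeholder figures in arbitrary units, NOT data from the design.
# Assumption: the low-threshold option permits a lower supply voltage and
# hence lower active power, at the price of higher leakage when idle.
option_regular_vt = [(0.6, 10.0, 0.5), (0.2, 4.0, 0.2)]
option_low_vt = [(0.6, 9.0, 1.5), (0.2, 3.6, 0.6)]

candidates = {"regular_vt": option_regular_vt, "low_vt": option_low_vt}
best = min(candidates, key=lambda name: average_power(candidates[name]))
```

With these placeholder values the two options differ only slightly, which illustrates why the proportion of active and idle time per sub-block matters for the decision.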

To handle the high complexity of the design and to take advantage of multiply instantiated blocks, we followed a hierarchical approach for the physical implementation. The physical hierarchy resembles the power domain structure. Each separately implemented block includes a shut-off domain and an always-on domain for buffering the power control signals. When a block is switched off, its outputs to other blocks must maintain defined values. Hence, we inserted isolation shells around the power-gated macros in the next higher hierarchy level. The isolation shells were generated by scripts and instantiated in the RTL design.

The power-gating function is realized by micro switches distributed over the standard cell area. We validated the density of these switches by IR-drop analysis based on post-layout netlist simulation results. The peak current during power-up must not exceed the peak current of normal operation, in order to prevent a critical IR-drop for the surrounding logic. Therefore, we implemented a control scheme that turns on the micro switches step by step, with the timing of the steps set by configurable parameters.
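The benefit of a stepwise turn-on can be illustrated with a first-order model in which the gated supply rail is a single capacitor charged through groups of switches. This is a simplified sketch under stated assumptions, not the analysis used for the design; the function name and all component values are hypothetical.

```python
def peak_inrush(vdd, r_group, cap, turn_on_steps, dt, num_steps):
    """Peak charging current when switch groups close at the given time steps.

    Each closed group adds a conductance 1/r_group between the supply and the
    gated rail; the rail is modeled as one capacitor that starts discharged.
    """
    v, peak = 0.0, 0.0
    for n in range(num_steps):
        closed = sum(1 for step in turn_on_steps if step <= n)
        current = (vdd - v) * closed / r_group
        peak = max(peak, current)
        v += current * dt / cap        # forward-Euler update of the rail voltage
    return peak


# Hypothetical values: 1 V supply, 10 ohm per switch group, 10 nF rail, 1 ns steps.
all_at_once = peak_inrush(1.0, 10.0, 10e-9, [0, 0, 0, 0], 1e-9, 2000)
staged = peak_inrush(1.0, 10.0, 10e-9, [0, 200, 400, 600], 1e-9, 2000)
```

In this model, closing all four groups at once draws four times the peak current of the staged sequence, because each later group closes only after the rail has already charged most of the way.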

IV. CONCLUSION

The design of the X-GOLD® SDR 20 baseband processor had to cope with the demands imposed by the applications as well as with tight power constraints. The application-related demands are reflected in the multiplicity of cores, while the power constraints led to measures at the architectural level as well as to the implementation of low-power techniques. A low-power design approach has been presented that handles the resulting complexity by efficiently exploiting the partitioning of the processor.