Near-Optimal Metastability-Containing Sorting Networks*

Johannes Bund
Saarland University
Saarland Informatics Campus, Germany
Email: s9jobund@stud.uni-saarland.de

Christoph Lenzen
Max Planck Institute for Informatics
Saarland Informatics Campus, Germany
Email: clenzen@mpi-inf.mpg.de

Moti Medina
Max Planck Institute for Informatics
Saarland Informatics Campus, Germany
Email: mmedina@mpi-inf.mpg.de

Abstract—Metastability in digital circuits is a spurious mode of operation induced by violation of setup/hold times of stateful components. It cannot be avoided deterministically when transitioning from continuously-valued to (discrete) binary signals. However, in prior work (Lenzen & Medina ASYNC 2016) it has been shown that it is possible to fully and deterministically contain the effect of metastability in sorting networks. More specifically, the sorting operation incurs no loss of precision, i.e., any inaccuracy of the output originates from mapping the continuous input range to a finite domain.

The downside of this prior result is inefficiency: for $B$-bit inputs, the circuit for a single comparison contains $\Theta(B^2)$ gates and has depth $\Theta(B)$. In this work, we present an improved solution with near-optimal $\Theta(B \log B)$ gates and asymptotically optimal $\Theta(\log B)$ depth. On the practical side, our sorting networks improves over prior work for all input lengths $B > 2$, e.g., for $16$-bit inputs we present an improvement of more than $70\%$ in depth of the sorting network and more than $60\%$ in cost of the sorting network.

I. INTRODUCTION

In the 80’s, Marino proved that any bistable circuit can be forced into metastability [1], an unstable equilibrium state that is neither binary 0 nor 1 (regardless of how the voltage thresholds for these states are chosen). Given that metastability fundamentally disrupts the binary abstraction and, accordingly, the operation of digital circuits, the traditional methods for dealing with metastability are (i) to make sure that timing violations potentially causing metastability are avoided deterministically or (ii) to wait until metastability resolves with sufficiently high probability.

Option (i) is achieved by synchronous design, but can also be guaranteed in asynchronous circuits in many cases. However, crossing boundaries of (unsynchronized) clock domains inherently bears the risk of metastability, requiring to resort to Option (ii). This takes the form of synchronizer chains, which are designed to resolve metastability as fast as possible. For instance, state-of-the-art synchronizers for 65 nm technology achieve roughly a factor $e^{-20}$ decay per nanosecond in the probability of sustained metastability (i.e., $\tau \approx 50 \, \text{ps}$ [2]); however, there is evidence that the resolution time will increase with further miniaturization [3].

The corresponding overhead in delay and buffer size can be avoided if the clock sources of different domains are synchronized. Unfortunately, this is not as straightforward as it seems: Naïve master-slave approaches essentially fuse all clock domains into one and introduce a single point of failure, negating crucial advantages of the multi-domain approach in terms of scalability and reliability.

Clock synchronization: A promising alternative is to run a fault-tolerant clock synchronization algorithm, like the one by Lynch and Welch [4], [5], preserving the benefits of multiple clock domains while removing the need for synchronizers. Inconveniently, it turns out that this does not eliminate the problem, but rather shifts it to the synchronization algorithm: any clock synchronization algorithm needs to measure time differences between the nodes executing it, which entails to map a continuous range to a discrete binary representation;\(^1\) Marino’s impossibility result now implies that this transition from analog to digital must incur the risk of metastability.

Naturally, one could now resort to Option (ii), synchronizers, to resolve this issue. However, this costs time! And losing time decreases the precision of the clock synchronization algorithm, which crucially relies on frequently updated information on clock skews to correct them.

Containing metastability: We follow a novel approach recently proposed by Friedrichs et al. [6]. They establish a theory of metastability-containing circuits, which allows for meaningful calculation under worst-case propagation of metastability of inputs.

Concretely, for the above problem, in an unpublished manuscript [7] it is shown how to perform the time measurements of the synchronization algorithm such that the measurements are encoded as Gray code values with (at most) one metastable bit, which captures the uncertainty of the measurement. For instance, when discretizing a time difference of 75 ps in steps of 10 ps, one may obtain the encodings of 7 or 8 (which for a Gray code differ in exactly one bit), or the bit string for which the bit that is different is metastable. At the heart of the Lynch-Welch algorithm lies the operation of sorting such input strings according to their encoded value, and afterwards applying a respective (analog!) phase shift to the local clock.

\(^1\)Unless, of course, the algorithm is implemented in a fully analog way, which is costly.

* The proofs were omitted and will appear in the full version of the paper. We will be happy to provide them on request.
This enables the following separation of concerns for the logic of the nodes executing the algorithm:

1) Sort the results in a metastability-containing way.
2) Apply a corresponding phase shift (metastability is “absorbed” by the analog correction).

Metastability-containing sorting: For this approach to succeed it is essential that the initial inaccuracy of the measurement is not “amplified” by the sorting logic, despite metastability, i.e., metastability is contained. In [6], it is shown that this is possible in principle, but the corresponding circuits are exponentially large in the number of bits $B$ of the Gray code inputs. Lenzen and Medina [8] proved that one can sort two Gray code inputs (i.e., output $\text{max}\{g, h\}$ and $\text{min}\{g, h\}$ for inputs $g, h$) in this manner with a circuit of depth $O(B)$ and $O(B^2)$ gates. Feeding $n$ inputs into an $O(\log n)$-depth [9] sorting network comprised of such metastability-containing 2-sort elements, yields a metastability-containing sorting network of depth $O(B \log n)$ and $O(B^2 \log n)$ gates.

However, the 2-sort implementation from [8] is still large compared to standard 2-sort circuits, which have $O(B)$ gates and depth $O(\log B)$. Worse, the linear delay of the circuit defeats the purpose of resorting to metastability-containing logic, as the large computational delay eats away at the time saved by discarding all synchronizers. This leads to the following open question posed by Lenzen and Medina [8].

Is there a small metastability-containing 2-sort circuit of logarithmic depth?

Our Contribution: We answer this question in the affirmative, by devising, proving correct, and validating in simulation a metastability-containing 2-sort circuit of depth $\Theta(\log B)$ and size $\Theta(B \log B)$. Given the trivial lower bounds of $\log B$ and $B - 1$ when using fan-in 2 gates, this is close to optimal. The involved constants are moderate, leading to circuits of sufficient performance for use in the Lynch-Welch algorithm, cf. Table V. Plugging our circuit into (optimal depth or size) sorting networks [10], [11], [12], we obtain efficient combinational metastability-containing sorting circuits, cf. Table VI.

II. ENCODINGS AND VALID INPUTS

Due to the potential presence of metastability at the input, we need to carefully choose and make use of suitable encodings. In this section, we formalize the respective notation and summarize basic properties of the encodings.

a) Reflected Binary Gray Code: For $N \in \mathbb{N}$, we abbreviate $[N] := \{0, \ldots, N - 1\}$. For convenience, bit indices are one-based, e.g., the leftmost bit of a binary string $g$ is $g[1]$, the second is $g[2]$, etc. For $1 \leq i \leq j \leq B$ (where $g$ has $B$ bits), let $g[i \cdot j] := g[i]g[i + 1] \ldots g[j]$; in case $i > j$, $g[i \cdot j]$ is the empty string. We assume that $B$ is a power of two. We denote the first and second halves of the string $g$ by $g_0$ and $g_1$, i.e., $g_0 = g[1 : B/2]$, and $g_1 = g[B/2 + 1 : B]$. Let $\text{par}(g)$ denote the parity of the binary string $g$, i.e., $\text{par}(g) = \sum_{i=1}^{B} g[i] \mod 2$.

We denote by $\langle \cdot \rangle$ the decoding function of a Gray code string, i.e., for $x \in [N]$, $\langle \text{rg}_B(x) \rangle = x$. As each $B$-bit string is a codeword, the code is a bijection and the decoding function also defines the encoding function $\text{rg}_B : [N] \rightarrow \{0, 1\}^B$.

Intuitively, $B$-bit binary reflected Gray code can be recursively defined as follows (where a 1-bit code is trivial). Taking two copies $\text{rg}^{(0)}$ and $\text{rg}^{(1)}$ of $\text{rg}_{B/2}$, we encode 0 as $\text{rg}^{(0)}(0) \circ \text{rg}^{(1)}(0)$. For every up-count, we increase $\text{rg}^{(1)}$ until we reach the maximum value $\sqrt{N} - 1$ for a $B/2$-bit code, i.e., $\text{rg}^{(1)}(\sqrt{N} - 1)$. The next up-count increases $\text{rg}^{(0)}$. Afterwards, each up-count decreases $\text{rg}^{(1)}$, until we reach $\text{rg}^{(1)}(0)$ again. This process is repeated until we reach $\text{rg}(N - 1) = \text{rg}^{(0)}(\sqrt{N} - 1) \circ \text{rg}^{(1)}(0)$. Because $\text{rg}^{(0)}$ and $\text{rg}^{(1)}$ are both Gray codes (i.e., each up-count changes at most one bit), the same is true for $\text{rg}$. Table I illustrates this structure at hand of a 4-bit code.

A crucial observation is that whether we are currently “counting up” or “counting down” the $\text{rg}^{(1)}$-part of the string is given by $\text{par}(\text{rg}^{(0)})$: the parity of $\text{rg}^{(0)}$ changes on each of its up-counts, as exactly one bit is changed, and on every such up-count we switch between counting $\text{rg}^{(1)}$ up or down, respectively. An immediate consequence of this property is that $\text{rg}^{(1)}$ always equals the maximum or minimum code word whenever $\text{rg}^{(0)}$ contains a metastable bit. Our approach leverages these insights into the structure of the code, which we formalize mathematically in the full version of the paper.

b) Metastability Characterization of Valid Strings: Following [8], we define the set of valid strings that serve as inputs to our combinational circuits. We represent metastable bits by the symbol $\ddagger$. We utilize an operator that compares two strings and returns a string matching the originals where
they are both stable and equal, and $M$ everywhere else.

**Definition 2.1 (The $\ast$ operator [8]):** Given $B \in \mathbb{N}$, define the operator $\ast : \{0, 1, M\}^B \times \{0, 1, M\}^B \to \{0, 1, M\}$ by

$$\forall i \in \{1, \ldots, B\} : (x \ast y)[i] := \begin{cases} x[i] & \text{if } x[i] = y[i] \\ M & \text{else.} \end{cases}$$

As discussed earlier, we may assume that Gray code inputs from time measurements contain at most one metastable bit [7], and our 2-sort elements will maintain this property. In addition, valid strings guarantee that if the single metastable bit resolves to either 0 or 1, the resulting string encodes either $x$ or $x + 1$ for some $x$, cf. Table II.

**Definition 2.2 (Valid Strings [8]):** Let $B \in \mathbb{N}$ and $N = 2^B$. Then, the set of valid strings of length $B$ is

$$S_{rg}^B := \{\text{rg}_B([N])\} \cup \bigcup_{x \in [N-1]} \{\text{rg}_B(x) \ast \text{rg}_B(x + 1)\}.$$ 

For simplicity, the decoding function $(\cdot)$ is extended to also “decode” valid strings that contain a metastable bit.

**Definition 2.3 (Extended Decoding [8]):** Let $N = 2^B$ for some $B \in \mathbb{N}$. Denote $[N]_M := \{z/2 \mid z \in [2N - 1]\}$, i.e., the half-integers from the range $[0, N - 1]$. For $x \in [N - 1]$, we extend $(\cdot)$ to $S_{rg}^B$ by

$$\langle \text{rg}_B(x) \ast \text{rg}_B(x + 1) \rangle := x + 1/2.$$ 

Accordingly, for $x \in [N - 1]$ we extend $\text{rg}_B$ to $[N]_M$ by setting $\text{rg}_B(x + 1/2) := \text{rg}_B(x) \ast \text{rg}_B(x + 1)$. The following definition captures the span of values that a valid string can attain after metastability resolves.

**Definition 2.4:** For $g \in S_{rg}^B$, we define that $\text{res}(g) := \{\text{rg}_B([\langle g \rangle]), \text{rg}_B([\langle g \rangle])\}$. 

### III. Model and Problem

As we seek to design combinational circuits that accept valid strings as inputs, i.e., we need to specify circuit behavior in face of metastability. In accordance with [6], [8], we assume that gates propagate metastability to their outputs in a worst-case manner, but taking into account logical masking: an AND gate always outputs 0 if one of its inputs is stable 0, and an OR gate always outputs 1 if one of its inputs is stable 1.

**A. Metastability-containing Multiplexers**

Our metastability-containing circuits rely on multiplexers with a special property: if the inputs between which metastable control bits select are identical, the output is stable. Such metastability-containing multiplexers (CMUX) have been introduced in [8]. Hence, we confine ourselves to giving the introduction of a (4 : 1) – CMUX (the only type of CMUX we use) in Table IV. A (4 : 1) – CMUX($B$) exhibits the same behavior, but selects between $B$-bit input strings. Note that it is possible to improve our circuits further by devising better (4 : 1) – CMUX($B$) implementations.

<table>
<thead>
<tr>
<th>a</th>
<th>b</th>
<th>0</th>
<th>1</th>
<th>M</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>M</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>M</td>
<td>1</td>
<td>M</td>
<td>M</td>
<td>M</td>
</tr>
</tbody>
</table>

**B. Problem Definition**

Our goal is to compute the maximum or minimum of two valid strings. To understand what this means, consider the following scenario. Suppose a valid string “encodes” $x + 1/2$ for some $x \in [N - 1]$, i.e., the string contains a metastable bit that makes it uncertain whether the represented value is $x$ or $x + 1$. In this case, a time-to-digital converter (TDC) generating this string has measured a time period corresponding to at least $x$ and at most $x + 1$ delay stages of the TDC. The string will stabilize to either $x$ or $x + 1$. The stabilized string is thus off by at most 1 w.r.t. the time period it represents.

In order to fully maintain the precision of the original measurement, we impose the same constraint on the extensions of $\max$ and $\min$ to valid inputs, which is formalized as follows.

**Definition 3.1 ([8]):** Let $N = 2^B$ for some $B \in \mathbb{N}$. For $g, h \in S_{rg}^B$, define

$$\max_{rg} \{g, h\} := \text{rg}_B(\max\{\langle g \rangle, \langle h \rangle\})$$

$$\min_{rg} \{g, h\} := \text{rg}_B(\min\{\langle g \rangle, \langle h \rangle\}).$$

For example,

- $\max_{rg} \{0111, 0101\} = \text{rg}_B(\max\{5, 6\}) = 0101$,
- $\max_{rg} \{1\text{#}10, 1111\} = \text{rg}_B(\max\{115, 10\}) = 1\text{#}10$,
- $\max_{rg} \{0100, 0100\} = \text{rg}_B(\max\{65, 7\}) = 0100$.

Note that this definition entails that the results are valid strings as well. Our goal is to find (efficient) circuits computing $\max_{rg}$ and $\min_{rg}$. 

### TABLE II

<table>
<thead>
<tr>
<th>g</th>
<th></th>
<th>g</th>
<th></th>
<th>g</th>
<th></th>
<th>g</th>
<th></th>
<th>g</th>
</tr>
</thead>
<tbody>
<tr>
<td>0000</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0.5</td>
</tr>
<tr>
<td>0001</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>0011</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>0100</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>0101</td>
<td>12</td>
<td>12</td>
<td>12</td>
<td>12</td>
<td>12</td>
<td>12</td>
<td>12</td>
<td>12</td>
</tr>
</tbody>
</table>

### TABLE III

**LOGICAL EXTENSIONS TO METASTABLE INPUTS OF AN AND GATE (LEFT), AN OR GATE (CENTER), AND AN INVERTER (RIGHT).**

<table>
<thead>
<tr>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>

**B. Problem Definition**

Our goal is to compute the maximum or minimum of two valid strings. To understand what this means, consider the following scenario. Suppose a valid string “encodes” $x + 1/2$ for some $x \in [N - 1]$, i.e., the string contains a metastable bit that makes it uncertain whether the represented value is $x$ or $x + 1$. In this case, a time-to-digital converter (TDC) generating this string has measured a time period corresponding to at least $x$ and at most $x + 1$ delay stages of the TDC. The string will stabilize to either $x$ or $x + 1$. The stabilized string is thus off by at most 1 w.r.t. the time period it represents.

In order to fully maintain the precision of the original measurement, we impose the same constraint on the extensions of $\max$ and $\min$ to valid inputs, which is formalized as follows.

**Definition 3.1 ([8]):** Let $N = 2^B$ for some $B \in \mathbb{N}$. For $g, h \in S_{rg}^B$, define

$$\max_{rg} \{g, h\} := \text{rg}_B(\max\{\langle g \rangle, \langle h \rangle\})$$

$$\min_{rg} \{g, h\} := \text{rg}_B(\min\{\langle g \rangle, \langle h \rangle\}).$$

For example,

- $\max_{rg} \{0111, 0101\} = \text{rg}_B(\max\{5, 6\}) = 0101$,
- $\max_{rg} \{1\text{#}10, 1111\} = \text{rg}_B(\max\{115, 10\}) = 1\text{#}10$,
- $\max_{rg} \{0100, 0100\} = \text{rg}_B(\max\{65, 7\}) = 0100$.

Note that this definition entails that the results are valid strings as well. Our goal is to find (efficient) circuits computing $\max_{rg}$ and $\min_{rg}$. 

### TABLE IV

**OUTPUT OF A (4 : 1) – CMUX. THE CONTROL BITS ARE $s$ AND $t$, THE SELECTABLE INPUTS ARE $a, b, c, d$.**

<table>
<thead>
<tr>
<th>t</th>
<th></th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td></td>
<td>a</td>
<td>b</td>
<td>a+b</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td></td>
<td>c</td>
<td>d</td>
<td>c+d</td>
<td></td>
</tr>
<tr>
<td>M</td>
<td>a+c</td>
<td>b+d</td>
<td>a<em>b+c</em>d</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Definition 3.2 ([8]): For $B \in \mathbb{N}$, a (combinational) 2-sort($B$) circuit is defined as follows.

- **Input**: $g, h \in S_{rg}^B$,
- **Output**: $g', h' \in S_{rg}^B$,
- **Functionality**: $g' = \max_{rg} \{ g, h \}$, $h' = \min_{rg} \{ g, h \}$.

We remark that the results of [6] imply that this is the most restrictive specification that coincides with max and min on inputs without metastability and can be implemented by a circuit in our model. That is, any further suppression of metastability has to make use of synchronizers (which ensure probabilistic guarantees only), use analog components, or employ of masking registers (cf. [6]).

IV. THE CIRCUIT

The high-level idea of the circuit is to first determine which of the strings to select in a metastability-containing manner, represented by two control bits. It may happen that one or both of the strings do not become metastable in the output, despite possible metastability of the control bits.

A. Control Signal Circuit

The subcircuit computing the control bits performs a 4-valued comparison: given stable inputs $g, h$, it returns 01 or 10 if $(g) < (h)$ or $(g) > (h)$, respectively. If $g = h$, it returns 00 or 11, depending on the parity of the equal strings. For inputs with metastable strings, the behavior is most conveniently specified by considering all possible resolutions of the inputs, determining the respective outputs, and obtaining the output for the original inputs as the “superposition” of all options under the $\ast$ operator. For example,

- $c(0111, 0101) = 01$,
- $c(1001, 1001) = 00$,
- $c(1M10, 11M1) = 10 \ast 10 = 10$,
- $c(101M, 1000) = 11 \ast 01 = M1$,
- $c(111M, 111M) = 11 \ast 10 \ast 01 \ast 00 = MM$.

a) Specification:

**Definition 4.1**: For $B \in \mathbb{N}$, a (combinational) cont($B$) circuit is defined as follows.

- **Input**: $g, h \in S_{rg}^B$,
- **Output**: $c \in \{ 0,1,M \}^2$,
- **Functionality**: $c := (g', h') \in \text{res}(g) \times \text{res}(h)$, $c(g', h')$, where for $g', h' \in \{ 0,1 \}^B$, $c(g', h')$ is defined as follows.

<table>
<thead>
<tr>
<th>$c(g', h')$</th>
<th>$c(g', h')$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0</td>
<td>$g' = h'$ and $\text{par}(g') = 0$</td>
</tr>
<tr>
<td>0 1</td>
<td>$(g') &lt; (h')$</td>
</tr>
<tr>
<td>1 0</td>
<td>$(g') &gt; (h')$</td>
</tr>
<tr>
<td>1 1</td>
<td>$g' = h'$ and $\text{par}(g') = 1$</td>
</tr>
</tbody>
</table>

$b$) Implementation: The base case is trivial.

**Fact 4.2**: For $B = 1$, the specification given in Definition 4.1 is met by the identity circuit returning output $(g, h)$ for inputs $g, h \in \{0, M, 1\}$.

For $B$ that is a power of 2, we implement the specification recursively. The idea is to recursively use a $B/2$-bit circuit on inputs $g_0, h_0$ and $g_1, h_1$, respectively, and use the result of the second call to resolve a tie from the first call. For keeping track of the parity in case of a tie it is essential to use the result of the second call correctly: if the parity of $g_0 = h_0$ is odd, we need to negate the control bits returned by the second call.

**Lemma 4.3**: Suppose $B = 2^i$ for some $i \in \mathbb{N}$. The circuit depicted in Figure 2 implements the specification of cont($B$) given in Definition 4.1.

**Theorem 4.4**: For $B = 2^i$, $i \in \mathbb{N}$, cont($B$) can be implemented by a circuit of delay and cost:

$$\text{delay}(\text{cont}(B)) = \log(B) \cdot \text{delay}((4:1) - \text{CMUX}),$$

$$\text{cost}(\text{cont}(B)) = 2 \cdot (B - 1) \cdot \text{cost}((4:1) - \text{CMUX}).$$

B. 2-sort($B$) Implementation

Again, we observe that the base case is trivial.

**Fact 4.5**: For $B = 1$, the specification given in Definition 3.2 is met by the circuit returning AND($g, h$) and OR($g, h$) for inputs $g, h \in \{0, M, 1\}$.

We recursively implement the sorting circuit as follows. We determine $(\max_{rg} \{ g, h \})_0 = \max_{rg} \{ g_0, h_0 \}$ by a recursive call (the same for $\min_{rg}$). A second recursive call determines $\max_{rg} \{ g_1, h_1 \}$ and $\min_{rg} \{ g_1, h_1 \}$. Finally, we use a call to cont($B/2$) to compute the control bits selecting which of the strings $g_0, h_0$, $\max_{rg} \{ g_1, h_1 \}$, and $\min_{rg} \{ g_1, h_1 \}$ to select as $(\max_{rg} \{ g, h \})_1$. The correctness of this implementation is proven in the full version of the paper.

**Lemma 4.6**: The circuit depicted in Figure 3 implements the specification of 2-sort($B$) given in Definition 3.2.

Note that the recursive construction requires computing control bits in each level of the recursion. However, the
respective circuit recurs on the same substrings as the sorting circuit, and has slightly smaller delay. This enables reuse of the outputs of the recursive calls of the control circuit as control bits for the recursive calls of the sorting circuit. Exploiting this insight, we arrive at a highly efficient 2-sort implementation.

**Theorem 4.7:** For \( B = 2^i \), \( i \in \mathbb{N} \), 2-sort\((B)\) can be implemented by a circuit of delay and cost:

\[
delay(2\text{-sort}(B)) = \text{delay}(\text{AND}) + \log(B) \cdot \text{delay}((4:1) - \text{CMUX})
\]

\[
cost(2\text{-sort}(B)) = 2B \cdot \text{cost}(\text{AND}) + B \log B \cdot \text{cost}((4:1) - \text{CMUX}) + 2(B - \log(2B)) \cdot \text{cost}((4:1) - \text{CMUX})
\]

### C. Sorting Networks

Given a metastability-containing 2-sort implementation, it is now straightforward to sort multiple inputs using standard techniques. Taking any sorting network, we can plug the outputs of the recursive calls of the control circuit as control bits for the recursive calls of the sorting circuit. Exploiting this insight, we arrive at a highly efficient sorting network scale.

### V. Simulation Results

#### A. Design Flow

We list the tools we used in the design flow: (i) Design entry: Quartus, (ii) Behavioral simulation: ModelSim, (iii) Synthesis: Encounter RTL Compiler (part of Cadence tool set) with the NanGate 45 nm Open Cell Library, (iv) Place & route: Encounter (part of Cadence tool set) with the NanGate 45 nm Open Cell Library.

#### B. Adaptations to the Design Flow due to Metastability

In order to preserve the metastability-containment properties of our circuit, the optimizer in the synthesis had to be disabled. The simplest example is our CMUX. The optimizer realizes that a CMUX is a multiplexer and disregards additional properties introduced by the “redundant” gates in the CMUX. Hence, the optimizer simply transforms a CMUX to a regular multiplexer. A standard multiplexer may output \( M \) even if the two select inputs are 1 in case the control signal is \( M \).

In Figure 4, we show how badly a regular sorting network behaves on inputs which contain metastable bits; shown is the output of ModelSim. Interestingly, Quartus internally mapped parts of the circuit to lookup-tables (LUTs) featuring metastability-containment properties as a side effect of glitch avoidance techniques. This resulted in metastability being masked for many inputs. The given results were obtained by simulating the VHDL code directly in the ModelSim tool.

#### C. The benchmarks

We consider four designs for comparators, two of which meet the specification of 2-sort. (i) the 2-sort circuit by Lenzen and Medina [8], (ii) The 2-sort\((B)\) circuit presented in this paper, (iii) Naive-comp\((B)\): A Gray code comparator which is not metastability-containing, (iv) Binary-comp\((B)\): A regular comparator of binary inputs. In what follows, we elaborate on these benchmarks. The two 2-sort circuits are not optimized and they are as they appear in the schematics given here and in [8], respectively. The Binary-comp\((B)\) design is simply the VHDL comparator of two \( B \)-bit binary strings, allowing for an estimation of the overheads introduced by metastability-containment. Note that the Binary-comp design is optimized for both number of gates and delay in the RTL compiler; as discussed in the conclusion, we expect optimized CMUX designs to dramatically reduce the respective gaps. The Naive-comp\((B)\) design uses the control subcircuit \( \text{cont}(B) \), which is implemented with regular multiplexers. The outputs of this variant of the \( \text{cont}(B) \) circuit were fed to

### Table V

<table>
<thead>
<tr>
<th>B</th>
<th>Circuit</th>
<th># Gates</th>
<th>Area [( \mu m^2 )]</th>
<th>Delay [ps]</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>This paper</td>
<td>34</td>
<td>49.42</td>
<td>268</td>
</tr>
<tr>
<td></td>
<td>Binary-comp</td>
<td>34</td>
<td>49.42</td>
<td>268</td>
</tr>
<tr>
<td></td>
<td>[8]</td>
<td></td>
<td>15.582</td>
<td>145</td>
</tr>
<tr>
<td>4</td>
<td>This paper</td>
<td>160</td>
<td>230.3</td>
<td>498</td>
</tr>
<tr>
<td></td>
<td>Binary-comp</td>
<td>188</td>
<td>272.118</td>
<td>728</td>
</tr>
<tr>
<td></td>
<td>[8]</td>
<td></td>
<td>34.58</td>
<td>288</td>
</tr>
<tr>
<td>8</td>
<td>This paper</td>
<td>504</td>
<td>723.52</td>
<td>827</td>
</tr>
<tr>
<td></td>
<td>Binary-comp</td>
<td>856</td>
<td>1237.425</td>
<td>1684</td>
</tr>
<tr>
<td></td>
<td>[8]</td>
<td>41</td>
<td>73.752</td>
<td>477</td>
</tr>
<tr>
<td>16</td>
<td>This paper</td>
<td>1344</td>
<td>1925.262</td>
<td>1233</td>
</tr>
<tr>
<td></td>
<td>Binary-comp</td>
<td>3632</td>
<td>5247.186</td>
<td>3488</td>
</tr>
<tr>
<td></td>
<td>[8]</td>
<td>81</td>
<td>151.648</td>
<td>422</td>
</tr>
</tbody>
</table>
TABLE VI
SIMULATION RESULTS FOR METASTABILITY-CONTAINING SORTING NETWORKS WITH n ∈ {4, 7, 10} FOR B-BIT INPUTS. 10-sort_
OPTIMIZES GATE COUNT, 10-sort_d DEPTH; FOR n ∈ {4, 7}, THE SORTING NETWORKS ARE OPTIMAL W.R.T. BOTH MEASURES.

<table>
<thead>
<tr>
<th>B</th>
<th>Circuit</th>
<th>4-sort</th>
<th>7-sort</th>
<th>10-sort</th>
<th>10-sort_d</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>G #</td>
<td>Area (um²)</td>
<td>Delay (ps)</td>
<td>G #</td>
<td>Area (um²)</td>
</tr>
<tr>
<td>2</td>
<td>This paper</td>
<td>170</td>
<td>2647.92</td>
<td>364</td>
<td>170</td>
</tr>
<tr>
<td></td>
<td>Binary-comp</td>
<td>40</td>
<td>775</td>
<td>478</td>
<td>128</td>
</tr>
<tr>
<td>4</td>
<td>This paper</td>
<td>50</td>
<td>1500</td>
<td>455</td>
<td>2206</td>
</tr>
<tr>
<td></td>
<td>Binary-comp</td>
<td>01</td>
<td>172</td>
<td>966</td>
<td>364</td>
</tr>
<tr>
<td>8</td>
<td>This paper</td>
<td>250</td>
<td>2817</td>
<td>737</td>
<td>2394</td>
</tr>
<tr>
<td></td>
<td>Binary-comp</td>
<td>208</td>
<td>506</td>
<td>614</td>
<td>1473</td>
</tr>
<tr>
<td>16</td>
<td>This paper</td>
<td>612</td>
<td>3686</td>
<td>775</td>
<td>3396</td>
</tr>
<tr>
<td></td>
<td>Binary-comp</td>
<td>048</td>
<td>530</td>
<td>67</td>
<td>1298</td>
</tr>
</tbody>
</table>

Fig. 4. The top four rows give sets of valid inputs to 4-channel sorting networks using Naive-comp(4) (left) and our 2-sort implementation (right) for comparisons, respectively. Metastable bits are indicated as X. Using Naive-comp(4), all but one output bit is metastable (∞∞∞∞) depicted as a line). Our circuit outputs the input strings, sorted according to Table II.

regular multiplexers, which output the sorted (two) inputs. The purpose of the Naive-comp design is to demonstrate the highly desirable behavior of a non-containing sorting network under valid inputs.

These four designs are used as comparators in optimal sorting networks with 4, 7, and 10 input channels. For 10 channels we considered two sorting networks: 10-sort_g optimizes the number of gates [12] and 10-sort_d optimizes the depth of the sorting network [11].

D. Summary of simulation results

Figure 4 illustrates that it is essential to utilize 2-sort circuits to guarantee correct behavior in the presence of metastability in the input. Table V and Table VI show the chip area, number of gates,3 and the delay4 of our circuits in 45 nm technology.

The improvements are substantial. In particular, for a 10-channel sorting network with B = 16, we obtain (i) a 62.9% improvement in gate count (comparing the network that optimizes the number of gates), (ii) a 63.2% improvement in chip area (comparing the network that optimizes the number of gates), and (iii) a 70.7% improvement in delay (comparing the network that optimizes the depth of the sorting network).

VI. DISCUSSION

In this work, we improve prior results by Lenzen and Medina on metastability-containing sorting networks. We obtain a purely combinational circuit that is asymptotically optimal w.r.t. delay and has an exponentially smaller (logarithmic) gap to optimality w.r.t. the number of transistors. We formally show correctness alongside precise delay and cost bounds as function of B and the properties of basic components. We confirm these assertions by detailed simulations.

The main open questions this paper raises are:

- Are there better CMUX implementations, which would readily improve the constants in our bounds?
- Are there other powerful basic primitives that can be efficiently realized in a metastability-containing way?
- What is the optimum cost of the 2-sort(B) primitive?

The first question is subject to our current work. Preliminary results indicate that transistor-level CMUX implementations can cut down cost, delay, and area of our metastability-containing sorting networks by factors between 2 and 3 compared to the presented numbers. Comparing to the results for (optimized!) non-containing variants, this would result in very small overheads in terms of delay, and small overheads w.r.t. gate count, and area corresponding to the Θ(log B) overhead of the presented 2-sort implementation in comparison to an optimal non-containing solution.

REFERENCES