switch. However, because the flit is buffered at the crosspoint, it does not have to re-arbitrate at the
input if it loses arbitration at the output.
The intermediate buffers are associated with the input VCs. In effect, the crosspoint buffers
are per-output extensions of the input buffers. Thus, no VC allocation has to be performed to reach
the crosspoint — the flit already holds the input VC. Output VC allocation is performed in two
stages: a v-to-1 arbiter that selects a VC at each crosspoint, followed by a k-to-1 arbiter that selects
a crosspoint to communicate with the output.
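The two-stage allocation can be illustrated with a small behavioral sketch. This is not an implementation from any real router; the names (round_robin_arbiter, allocate_output) and the use of simple round-robin arbiters at both stages are assumptions for illustration.

```python
# Hypothetical sketch of two-stage output VC allocation: each crosspoint
# runs a v-to-1 arbiter over its input VCs, then a k-to-1 arbiter picks
# one crosspoint for the output. Round-robin arbiters are assumed here.

def round_robin_arbiter(requests, last_grant):
    """Grant the first asserted request after last_grant, wrapping around."""
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if requests[idx]:
            return idx
    return None

def allocate_output(crosspoint_requests, vc_pointers, xp_pointer):
    """Allocate one output for one cycle.

    crosspoint_requests: k entries, each a list of v booleans
        (True if that input VC has a flit waiting at the crosspoint).
    vc_pointers: per-crosspoint round-robin state for stage 1.
    xp_pointer: round-robin state for stage 2.
    Returns (winning crosspoint, winning VC) or None if no requests.
    """
    # Stage 1: each crosspoint selects one of its v VCs.
    local_winners = [round_robin_arbiter(reqs, vc_pointers[xp])
                     for xp, reqs in enumerate(crosspoint_requests)]
    # Stage 2: pick one crosspoint among those with a stage-1 winner.
    stage2_requests = [vc is not None for vc in local_winners]
    xp = round_robin_arbiter(stage2_requests, xp_pointer)
    if xp is None:
        return None
    return xp, local_winners[xp]
```

Note that losing either stage costs the flit nothing beyond a retry from the crosspoint buffer; it never returns to the input.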
To ensure that the crosspoint buffers never overflow, credit-based flow control is needed. Each
input keeps a separate free buffer counter for each of the kv crosspoint buffers in its row. For each flit
sent to one of these buffers, the corresponding free count is decremented. When a count is zero, no
flit can be sent to the corresponding buffer. Likewise, when a flit departs a crosspoint buffer, a credit
is returned to increment the input's free buffer count. The required size of the crosspoint buffers is
determined by the credit latency: the latency between when the buffer count is decremented at the
input and when the credit is returned in an unloaded switch.
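The credit bookkeeping at each input can be sketched as follows. The class name and flat dictionary of counters are illustrative assumptions; the invariant being modeled is exactly the one above: one free-buffer counter per crosspoint buffer in the input's row, decremented on send and incremented on credit return.

```python
# Minimal sketch of per-input credit bookkeeping for the k*v crosspoint
# buffers in one input row. Names and structure are assumptions for
# illustration, not taken from a real router design.

class InputCredits:
    def __init__(self, k, v, depth):
        # one free-buffer counter per (output, input VC) crosspoint buffer
        self.free = {(out, vc): depth for out in range(k) for vc in range(v)}

    def can_send(self, out, vc):
        # a flit may be sent only if the target crosspoint buffer has space
        return self.free[(out, vc)] > 0

    def send_flit(self, out, vc):
        assert self.can_send(out, vc), "would overflow crosspoint buffer"
        self.free[(out, vc)] -= 1

    def credit_return(self, out, vc):
        # a flit departed the crosspoint buffer; one slot is freed
        self.free[(out, vc)] += 1
```

In this model, a buffer depth equal to the round-trip credit latency (in flit times) keeps the input from stalling in an unloaded switch, mirroring the sizing rule stated above.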
It is possible for multiple crosspoints on the same input row to issue flits on the same cycle (to
different outputs) and thus produce multiple credits in a single cycle. Communicating these credits
back to the input efficiently presents a challenge. Dedicated credit wires from each crosspoint to
the input would be prohibitively expensive. To avoid this cost, all crosspoints on a single input row
share a single credit return bus. To return a credit, a crosspoint must arbitrate for access to this bus.
The credit return bus arbiter is distributed, using the same local-global arbitration approach as the
output switch arbiter.
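A simple model of the shared credit return bus is sketched below. For brevity it uses a single round-robin arbiter in place of the local-global scheme mentioned above; the function name and the pending-credit counters are illustrative assumptions.

```python
# Sketch of one cycle of the shared credit-return bus for an input row:
# at most one of the k crosspoints returns a credit per cycle, and
# crosspoints that lose arbitration keep their credits pending.
# A plain round-robin arbiter stands in for local-global arbitration.

def credit_bus_cycle(pending, pointer):
    """pending: list of k pending-credit counts, one per crosspoint.
    Returns (winning crosspoint or None, updated round-robin pointer).
    Mutates `pending` by removing the credit placed on the bus."""
    k = len(pending)
    for offset in range(1, k + 1):
        idx = (pointer + offset) % k
        if pending[idx] > 0:
            pending[idx] -= 1  # this crosspoint's credit is sent this cycle
            return idx, idx
    return None, pointer
```

Because credits can be produced faster than the bus drains them, each crosspoint must hold its pending credits until it wins the bus, which this model captures with the per-crosspoint counts.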
With sufficient crosspoint buffers, this design achieves a saturation throughput of 100% of
capacity because head-of-line blocking [36] is completely removed. As the amount of buffering at
the crosspoints increases, the fully buffered architecture begins to resemble a virtual-output queued
(VOQ) switch where each input maintains a separate buffer for each output. The advantage of the
fully buffered crossbar compared to a VOQ switch is that there is no need for a complex allocator:
the simple distributed allocation scheme discussed in Section 6.2 is able to achieve 100% throughput.
However, the performance benefits of a fully-buffered switch come at the cost of a much
larger router area. The crosspoint buffering is proportional to vk² and dominates chip area as the
radix increases. Figure 6.5 shows how storage and wire area grow with k in a 0.10 μm technology
for v = 4. The storage area includes crosspoint and input buffers. The wire area includes area for the
crossbar itself as well as all control signals for arbitration and credit return. As radix is increased, the
bandwidth of the crossbar (and hence its area) is held constant. The increase in wire area with radix
is due to increased control complexity. For a radix greater than 50, storage area exceeds wire area.
HIERARCHICAL CROSSBAR ARCHITECTURE
To overcome the high area cost of the fully buffered crossbar, a hierarchical switch architecture
significantly reduces the amount of intermediate buffering required [42]. A block
diagram of the hierarchical crossbar is shown in Figure 6.6. The hierarchical crossbar divides the