Networking Reference
Similar to the fully-buffered architecture, the intermediate buffers on the subswitch boundaries
are allocated on a per-VC basis. The subswitch input buffers are allocated according to a packet's
input VC while the subswitch output buffers are allocated according to a packet's output VC. This
decoupled allocation reduces HoL blocking when VC allocation fails and also eliminates the need to
NACK flits in the intermediate buffers. Placing buffers at the subswitch boundaries also divides
VC allocation into a local VC allocation within each subswitch and a global VC allocation
among the subswitches.
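The decoupled per-VC buffering described above can be pictured with a small toy model. This is only a sketch under stated assumptions: the `Subswitch` class, its method names, and the fixed buffer `depth` are hypothetical and are not taken from any router implementation. It shows a flit waiting in its input-VC buffer when the requested output-VC buffer is full, so nothing needs to be NACKed.

```python
from collections import deque


class Subswitch:
    """Toy p x p subswitch with per-VC boundary buffers (hypothetical model)."""

    def __init__(self, p, num_vcs, depth):
        self.depth = depth  # assumed capacity of every buffer, in flits
        # Input buffers are indexed by (local input, input VC).
        self.in_buf = {(i, v): deque() for i in range(p) for v in range(num_vcs)}
        # Output buffers are indexed by (local output, output VC).
        self.out_buf = {(o, v): deque() for o in range(p) for v in range(num_vcs)}

    def accept(self, local_in, in_vc, flit):
        """Store an arriving flit under its input VC; return False if full."""
        queue = self.in_buf[(local_in, in_vc)]
        if len(queue) >= self.depth:
            return False
        queue.append(flit)
        return True

    def local_transfer(self, local_in, in_vc, local_out, out_vc):
        """Move the head flit to the output-VC buffer if there is room.

        On failure the flit stays at the head of its input-VC buffer,
        so no NACK or retransmission is required in this model.
        """
        src = self.in_buf[(local_in, in_vc)]
        dst = self.out_buf[(local_out, out_vc)]
        if not src or len(dst) >= self.depth:
            return False
        dst.append(src.popleft())
        return True
```

Because the input side is keyed by input VC, a flit blocked on one output VC does not sit in front of flits held in a different input-VC buffer.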
With the hierarchical design, an important design parameter is the subswitch size p,
which can range from 1 to k. With small p, the switch resembles a fully-buffered crossbar, giving
high performance but also high cost. As p approaches the radix k, the switch resembles the baseline
crossbar architecture, giving low cost but also lower performance. In the next section, we describe the
Cray YARC router [56], which implements this hierarchical organization with k = 64 and p = 8.
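To make the cost trade-off in p concrete, the sketch below counts subswitches and intermediate buffers under a simple assumed model: a radix-k switch is built from (k/p)^2 subswitches of size p x p, each with p input and p output buffers per VC. The function name and this counting model are illustrative assumptions, not figures from the text.

```python
def hierarchical_crossbar_buffers(k, p, num_vcs=1):
    """Return (subswitch count, intermediate buffer count) for radix k.

    Assumed model: (k/p)^2 subswitches, each with p input buffers and
    p output buffers per virtual channel.
    """
    if p < 1 or p > k or k % p != 0:
        raise ValueError("p must divide k and lie in 1..k")
    subswitches = (k // p) ** 2
    buffers_per_subswitch = 2 * p * num_vcs
    return subswitches, subswitches * buffers_per_subswitch


if __name__ == "__main__":
    for p in (1, 8, 64):
        print(p, hierarchical_crossbar_buffers(64, p))
```

Under this model the buffer count scales as 2k^2/p per VC, so it falls steadily as p grows from 1 toward k, which matches the direction of the cost trend described above.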
With increasing pin bandwidth, we are seeing a paradigm shift to many-ported routers, along with
many-core processors. As core count increases, the number of network ingress ports must also increase to avoid
congestion and lock contention for shared resources at the sending host. This section describes two
high-radix (k > 32) routers, the Cray YARC and the Mellanox InfiniScale IV. We focus on these because
they provide raw bandwidth of 2.4Tb/s and 2.88Tb/s, respectively, yet have a fundamentally different
The Cray BlackWidow vector multiprocessor system [2], described in detail in Chapter 8, is one of
the first systems to implement a high-radix network. Its network uses YARC, a high-radix (radix-64)
router based on the hierarchical organization described earlier in this chapter.
The details of the YARC router can be found in [56]; in this section, we highlight some of
the key differences between the YARC implementation and the hierarchical crossbar organization
described in Section 6.4.
A block diagram of the YARC router and a die photo are shown in Figure 6.7. The YARC router
is a radix-64 router whose implementation is partitioned into 64 tiles, each containing an
8×8 subswitch, an input port, an output port, and associated buffers (input buffers,
row buffers, and column buffers). The tiles communicate with one another through the row bus and
the column channels. The tiled organization of the high-radix router led to a complexity-effective
design, as only a single tile design is required and it is replicated across the router. The die photo
in Figure 6.7(b) shows the regular structure of the microarchitecture, with a tile-based layout
and the SerDes (serializer/deserializer) I/Os on the perimeter of the layout.
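The tile organization can be sketched as a mapping from port numbers to positions in an 8 x 8 grid of tiles. Several things here are assumptions for illustration only: the row-major port numbering, the function names, and the rule that a packet is switched in the tile sharing the input's row and the output's column (reached over the row bus, then forwarded over a column channel).

```python
ROWS = COLS = 8  # 64 tiles, one input port and one output port per tile


def port_to_tile(port):
    """Map a port number 0..63 to a (row, col) tile (assumed row-major)."""
    if not 0 <= port < ROWS * COLS:
        raise ValueError("port must be in 0..63")
    return divmod(port, COLS)


def tile_route(in_port, out_port):
    """Return the tiles a packet visits under the assumed routing rule.

    1. tile holding the input port,
    2. tile in the same row whose subswitch serves the output's column,
    3. tile holding the output port.
    """
    in_row, in_col = port_to_tile(in_port)
    out_row, out_col = port_to_tile(out_port)
    return [(in_row, in_col), (in_row, out_col), (out_row, out_col)]


if __name__ == "__main__":
    print(tile_route(3, 60))
```

Under these assumptions each input/output pair maps to exactly one intermediate tile.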
The YARC implementation can be viewed as a two-stage network, as shown in Figure 6.8:
the first stage consisting of the input speedup to the subswitches and the second stage consisting of
output speedup to the output ports. Similar to a crossbar, there is only a single path between an input