Figure 8.1: BlackWidow node organization.
8.1.2 HIGH-RADIX FOLDED-CLOS NETWORK
To reduce the cost and the latency of the network, BlackWidow uses a folded-Clos [14] network
that is modified by adding sidelinks that connect peer subtrees and statically partition the global
network bandwidth. Deterministic routing is performed using a hash function to obliviously balance
network traffic while maintaining ordering on a cache line basis. Machines of up to 1024 processors
can be constructed by connecting up to 32 rank 1 (R1) subtrees, each with 32 processors, to rank
2 (R2) routers. Machines of up to 4608 processors can be constructed by connecting up to nine
512-processor R2 subtrees via sidelinks. Up to 16K processors may be connected by a rank 3 (R3)
network where up to 32 512-processor R2 subtrees are connected by R3 routers. Multiple R3 subtrees
can be interconnected using sidelinks to scale up to 32K processors.
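The configuration sizes and the hashed routing described above can be sketched in a few lines of Python. The subtree counts and sizes are taken from the text. The function names, the 64-byte cache-line size, and the XOR-fold hash are hypothetical choices made for illustration and are not the actual BlackWidow hash. The point of the sketch is only that a hash computed from the cache-line index sends every reference to a given line over the same path, which preserves per-line ordering while spreading different lines across the uplinks.

```python
# Sizes quoted in the text; everything else here is an illustrative assumption.
PROCESSORS_PER_R1_SUBTREE = 32
PROCESSORS_PER_R2_SUBTREE = 512  # subtree size used in the sidelinked and R3 cases


def rank2_machine(r1_subtrees: int) -> int:
    """Processors in a machine of R1 subtrees joined by R2 routers (up to 32)."""
    if not 1 <= r1_subtrees <= 32:
        raise ValueError("the text allows up to 32 R1 subtrees")
    return r1_subtrees * PROCESSORS_PER_R1_SUBTREE


def sidelinked_r2_machine(r2_subtrees: int) -> int:
    """Processors in a machine of R2 subtrees joined by sidelinks (up to nine)."""
    if not 1 <= r2_subtrees <= 9:
        raise ValueError("the text allows up to nine sidelinked R2 subtrees")
    return r2_subtrees * PROCESSORS_PER_R2_SUBTREE


def rank3_machine(r2_subtrees: int) -> int:
    """Processors in a machine of R2 subtrees joined by R3 routers (up to 32)."""
    if not 1 <= r2_subtrees <= 32:
        raise ValueError("the text allows up to 32 R2 subtrees under R3")
    return r2_subtrees * PROCESSORS_PER_R2_SUBTREE


def select_uplink(address: int, num_uplinks: int, line_bytes: int = 64) -> int:
    """Hypothetical hash: choose an uplink from the cache-line index only.

    line_bytes=64 is an assumed line size, not a value given in the text.
    """
    line = address // line_bytes                # drop the offset within the line
    folded = line ^ (line >> 8) ^ (line >> 16)  # mix higher bits into the low bits
    return folded % num_uplinks                 # deterministic for a given line
```

With these definitions, `rank2_machine(32)`, `sidelinked_r2_machine(9)`, and `rank3_machine(32)` give 1024, 4608, and 16384 processors, matching the figures in the text. Reaching 32K processors then takes two sidelinked 16K R3 subtrees, which is an inference from the arithmetic rather than a count stated in the text.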
The BlackWidow system topology and packaging scheme enable very flexible provisioning
of network bandwidth. For instance, by using only a single rank 1 router module, instead of two as
shown in Figure 8.1.2 a, the port bandwidth of each processor is cut in half, halving both the
cost of the network and its global bandwidth. An additional bandwidth taper can be achieved by
connecting only a subset of the rank 1 to rank 2 network cables, reducing cabling cost and R2 router
cost at the expense of the bandwidth taper, as shown by the 1/4 taper in Figure 8.1.2 b.
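The two provisioning options can be combined in a simple back-of-the-envelope model. The sketch below assumes that the two reductions multiply, which the text does not state explicitly. The function name and the cable fractions used in the example are hypothetical.

```python
from fractions import Fraction

# The full configuration described in the text uses two rank 1 router modules.
FULL_R1_MODULES = 2


def relative_global_bandwidth(r1_modules: int, cable_fraction: Fraction) -> Fraction:
    """Global bandwidth relative to the fully provisioned network.

    r1_modules:     rank 1 router modules installed (1 or 2).
    cable_fraction: fraction of rank 1 to rank 2 cables actually connected.
    Assumes the two reductions are independent and multiply.
    """
    if not 1 <= r1_modules <= FULL_R1_MODULES:
        raise ValueError("expected one or two rank 1 router modules")
    if not 0 < cable_fraction <= 1:
        raise ValueError("cable_fraction must be in (0, 1]")
    return Fraction(r1_modules, FULL_R1_MODULES) * cable_fraction
```

Under this model a single router module with all cables connected gives `relative_global_bandwidth(1, Fraction(1))`, or half of the full global bandwidth. Connecting only a quarter of the cables with both modules installed gives one quarter, and applying both reductions together gives one eighth.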
The network is built using the high-radix YARC router, which provides 64 ports of 3 lanes each,
with each lane operating at up to 6.25 Gb/s.
Each YARC router has an aggregate bandwidth of 2.4 Tb/s. BlackWidow
scales up to 32K processors with a worst-case diameter of seven hops. YARC uses a hierarchical