This chapter is written to stand alone; as such, it may repeat some of the concepts
(e.g., flits, phits, and routing) that have already been covered. That is intentional.
We want the reader to see how everything fits together and to be able to look back at previous
chapters if questions arise.
8.1 CRAY BLACKWIDOW MULTIPROCESSOR
The Cray BlackWidow (BW) vector multiprocessor is designed to run demanding applications with
high communication and memory bandwidth requirements. It uses a distributed shared memory
(DSM) architecture to provide the programmer with the appearance of a large globally shared memory
with direct load/store access. Unlike conventional microprocessors, each BW processor supports
abundant memory level parallelism (MLP), with up to 4K outstanding global memory references
per processor. Latency hiding and efficient synchronization are central to the BW design, and the
network must therefore provide high global bandwidth while also providing low latency for efficient
synchronization. The high-radix folded-Clos network [56] allows the system to scale up to 32K
processors with a worst-case diameter of seven hops.
8.1.1 BLACKWIDOW NODE ORGANIZATION
Figure 8.1 shows a block diagram of a BlackWidow compute node consisting of four BW processors,
and 16 Weaver chips with their associated DDR2 memory parts co-located on a memory daughter
card (MDC). The processor-to-memory channels between each BW chip and Weaver chip use a 4-bit-wide
6.25 Gbaud serializer/deserializer (SerDes), or 3.125 Gbytes/s per channel, for an aggregate
channel bandwidth of 16 × 3.125 Gbytes/s = 50 Gbytes/s per direction for each processor, and
200 Gbytes/s per direction for each node.
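The channel arithmetic above can be checked with a short calculation. The sketch below is illustrative only: it assumes one bit per symbol on each SerDes lane and ignores any line-coding overhead, which is what the quoted figures imply, and it assumes each processor has one channel to each of the 16 Weaver chips. All function and constant names are hypothetical.

```python
# Parameters taken from the text; names are illustrative, not from the source.
LANES_PER_CHANNEL = 4        # 4-bit-wide SerDes
GBAUD_PER_LANE = 6.25        # signalling rate of each lane
WEAVERS_PER_NODE = 16        # one channel per Weaver chip (assumption)
PROCESSORS_PER_NODE = 4


def channel_gbytes_per_s(lanes=LANES_PER_CHANNEL, gbaud=GBAUD_PER_LANE):
    """One-direction bandwidth of a single BW-to-Weaver channel in Gbytes/s.

    Assumes 1 bit per symbol and no encoding overhead.
    """
    return lanes * gbaud / 8  # 8 bits per byte


def processor_gbytes_per_s(channels=WEAVERS_PER_NODE):
    """Aggregate one-direction memory bandwidth of one BW processor."""
    return channels * channel_gbytes_per_s()


def node_gbytes_per_s(processors=PROCESSORS_PER_NODE):
    """Aggregate one-direction memory bandwidth of a four-processor node."""
    return processors * processor_gbytes_per_s()


if __name__ == "__main__":
    print(channel_gbytes_per_s(), processor_gbytes_per_s(), node_gbytes_per_s())
```

Under these assumptions the three values are 3.125, 50, and 200 Gbytes/s, matching the per-channel, per-processor, and per-node figures quoted in the text.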
The Weaver chips serve as pin expanders, converting a small number of high-speed differential
signals from the BW processors into a large number of single-ended signals that interface
to commodity DDR2 memory parts. Each Weaver chip manages four DDR2 memory channels,
each with 32 bits of data, a 7-bit error-correcting code (ECC), and one “spare bit”. The 32-bit data
path, coupled with the four-deep memory access bursts of DDR2, provides a minimum transfer
granularity of only 16 bytes. Thus, the BlackWidow memory daughter card has twice the peak data
bandwidth and four times the single-word bandwidth of a standard 72-bit-wide DIMM. Each of
the eight MDCs contains 20 or 40 memory parts, providing up to 128 Gbytes of memory capacity
per node using 1-Gbit memory parts.
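The transfer-granularity claim follows from the data width multiplied by the burst length. The sketch below works through that arithmetic for a Weaver memory channel and, for comparison, a standard 72-bit DIMM. It assumes the DIMM carries 64 data bits plus 8 ECC bits and uses the same four-deep burst; neither detail is stated in the text, and the names are hypothetical.

```python
# Channel parameters from the text; names are illustrative.
WEAVER_DATA_BITS = 32
WEAVER_ECC_BITS = 7
WEAVER_SPARE_BITS = 1
DDR2_BURST_LENGTH = 4        # four-deep memory access bursts

# Assumption: a 72-bit DIMM is 64 data bits + 8 ECC bits.
DIMM_DATA_BITS = 64


def channel_width_bits():
    """Total width of one Weaver memory channel (data + ECC + spare)."""
    return WEAVER_DATA_BITS + WEAVER_ECC_BITS + WEAVER_SPARE_BITS


def min_transfer_bytes(data_bits, burst_length=DDR2_BURST_LENGTH):
    """Smallest data transfer in bytes: one full burst across the data path."""
    return data_bits // 8 * burst_length


if __name__ == "__main__":
    print(channel_width_bits())                  # total channel width in bits
    print(min_transfer_bytes(WEAVER_DATA_BITS))  # Weaver channel granularity
    print(min_transfer_bytes(DIMM_DATA_BITS))    # 72-bit DIMM granularity
```

With these numbers a Weaver channel is 40 bits wide and moves 16 bytes per burst, whereas the assumed 64-bit DIMM data path moves 32 bytes per burst. The narrower channel therefore wastes less bandwidth when only a single word is needed.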