*** Please note, this page (and web site) are in early development.
Items are certainly not complete, and may be inaccurate.
Your information, comments, corrections, etc. are eagerly requested.
Send e-mail to Ed Thelen. Please include the URL under discussion. Thank you ***

Cray Research

Manufacturer     ID               Date   Location
Cray Research    Cray-1 A **      1976   Lawrence Livermore Laboratory
Cray Research    Cray 1M/4400 **  1978   Cray Research
Cray Research    Cray-2 **        1985   Lawrence Berkeley Lab


A new (2021) Cray History website :-)

Architecture

Special features - Cray 1
  • A vector machine, with 8 vector registers, each holding 64 words
  • You can load a group (vector) of words with one instruction
  • You can multiply or add (subtract) one group (vector) by another vector, giving a third vector.
  • There is one product and one sum produced each clock - 12.5 nanoseconds
  • The machine was also very fast in scalar operations - badly beating many competing machines in this also-important function. It was about twice as fast as the 7600 in scalar (normal, not vector) operations. Eight scalar registers, backed by 64 scalar save registers, were available for easier code sequencing and optimization.
  • Pipelines could be linked ("chained") - the multiply pipeline could feed the add pipeline for the relatively common
    (vector A times vector B) plus vector C giving vector D
    (there is a sketch of this pattern after this list)
  • All the above runs concurrently, producing a multiply result and an add result every 12.5 nanosecond clock
  • The above gives the sometimes-quoted peak speed of 160 million floating point operations per second (160 Mflops)
  • And of course, while the above computing is going on, memory transfers can be sending a previous 64-word "vector" to memory, or filling a 64-word vector register with new data from memory.
  • 1 million words of main memory was an available option
  • Used ECL (Emitter Coupled Logic) - very fast, very power hungry
  • There was no divide instruction! If you needed to divide, you performed a "reciprocal approximation" (which forms 1/value), then multiplied by the reciprocal. (Same result - faster in hardware - a fast divide is the bane of computer designers.)
  • Gordon Bell says the Cray 1 "is remarkably similar to the 6600... and extended for vectors"
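To make the vector and chaining ideas concrete, here is a minimal C sketch (nothing like actual Cray code, and not cycle-accurate) of the chained multiply-add pattern over 64-element vectors, and of doing a divide as reciprocal-then-multiply. The Newton-Raphson refinement step is an assumption about how a rough reciprocal estimate would be sharpened in software, not a statement about the Cray-1 hardware.

    #include <stdio.h>

    #define VLEN 64   /* one Cray-1 vector register holds 64 words */

    /* Chained multiply-add: D = A*B + C.
       On the Cray-1 the multiply pipeline could feed the add pipeline
       directly ("chaining"), so one multiply result and one add result
       emerged per clock once the pipelines were full. */
    static void vec_mul_add(const double *a, const double *b,
                            const double *c, double *d, int n)
    {
        for (int i = 0; i < n; i++)
            d[i] = a[i] * b[i] + c[i];
    }

    /* Divide without a divide unit: take a reciprocal approximation of b,
       refine it, then multiply.  The Newton-Raphson step below is an
       illustrative assumption, not the exact hardware/software flow. */
    static double div_via_reciprocal(double a, double b)
    {
        double r = 1.0 / b;      /* stands in for the reciprocal-approximation unit */
        r = r * (2.0 - b * r);   /* one Newton-Raphson step sharpens the estimate   */
        return a * r;            /* a / b  ==  a * (1/b)                            */
    }

    int main(void)
    {
        double a[VLEN], b[VLEN], c[VLEN], d[VLEN];
        for (int i = 0; i < VLEN; i++) { a[i] = i; b[i] = 2.0; c[i] = 1.0; }

        vec_mul_add(a, b, c, d, VLEN);
        printf("d[10] = %g (expect 21)\n", d[10]);
        printf("7/4 = %g\n", div_via_reciprocal(7.0, 4.0));
        return 0;
    }

At one chained multiply plus add per 12.5 ns clock (80 MHz), the loop above is exactly the pattern behind the quoted 2 x 80 = 160 Mflops peak.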

Special features - Cray 2
  • Production model had 8 processors
  • Ran in a tank of cooling liquid called Fluorinert, an inert fluid of the kind also used medically as a blood substitute. This liquid was pumped through the Cray 2; the heat of the modules boiled the liquid, which was then cooled by refrigeration and recirculated.
  • This helped keep the system at a uniform, stable temperature and within the speed range of the semiconductors (their speed is temperature dependent)
  • To troubleshoot this system, you needed to drain the coolant. How do you run it at a stable temperature for troubleshooting? Turn the power on and run the machine for 2 milliseconds, then turn off the power for at least 1 second. Examine the results.

Cycle times from http://netlib2.cs.utk.edu/utk/lsi/pcwLSI/text/node9.html#SECTION00410000000000000000
Year of Introduction   Model Name       Cycle Time in Nanoseconds
1976                   CRAY 1           12.5
1982                   CRAY X-MP         9.5
1985                   CRAY 2            4.1*
1988                   CRAY Y-MP         6.5
1992                   CRAY Y-MP C-90    4.0
* Instructions could only be issued every other cycle,
  so the effective cycle time is 8.2 nanoseconds

From Tera, which acquired the Cray Research assets from SGI in April 2000 (SGI had acquired Cray Research in 1996):
The first Cray-1® system was installed at Los Alamos National Laboratory in 1976 for $8.8 million. It boasted a world-record speed of 160 million floating-point operations per second (160 megaflops) and an 8 megabyte (1 million word) main memory. The Cray-1's architecture reflected its designer's penchant for bridging technical hurdles with revolutionary ideas. In order to increase the speed of this system, the Cray-1 had a unique "C" shape which enabled integrated circuits to be closer together. No wire in the system was more than four feet long. To handle the intense heat generated by the computer, Cray developed an innovative refrigeration system using Freon.

In order to concentrate his efforts on design, Cray left the CEO position in 1980 and became an independent contractor. As he worked on the follow-on to the Cray-1, another group within the company developed the first multiprocessor supercomputer, the Cray X-MP™, which was introduced in 1982. The Cray-2™ system appeared in 1985, providing a tenfold increase in performance over the Cray-1.

In 1988, Cray Research introduced the Cray Y-MP®, the world's first supercomputer to sustain over 1 gigaflop on many applications. Multiple 333 MFLOPS processors powered the system to a record sustained speed of 2.3 gigaflops.

Always a visionary, Seymour Cray had been exploring the use of gallium arsenide to create semiconductors faster than silicon. However, the costs and complexities of this material made it difficult for the company to support both the Cray 3 and the Cray C90™ development efforts. In 1989, Cray Research spun off the Cray 3 project into a separate company, Cray Computer Corporation, headed by Seymour Cray and based in Colorado Springs, Colorado. (Tragically, Seymour Cray died of injuries suffered in an auto accident in September 1996, at the age of 71.)

The 1990s brought a number of transforming events to Cray Research. The company continued its leadership in providing the most powerful supercomputers for production applications. The Cray C90™ featured a new central processor with industry-leading sustained performance of 1 gigaflop. Using 16 of these powerful processors and 256 million words of central memory, the system boasted unrivaled total performance. The company also produced its first "minisupercomputer," the Cray XMS system, followed by the Cray Y-MP EL series and the subsequent Cray J90™.

In 1993, Cray Research offered its first massively parallel processing (MPP) system, the Cray T3D™ supercomputer, and quickly captured MPP market leadership from early MPP companies such as Thinking Machines and MasPar. The Cray T3D proved to be exceptionally robust, reliable, sharable and easy-to-administer, compared with competing MPP systems.

Since its debut in 1995, the successor Cray T3E™ supercomputer has been the world's best-selling MPP system. The Cray T3E-1200E system has the distinction of being the only supercomputer to ever sustain one teraflop (1 trillion calculations per second) on a real-world application. In November 1998, a joint scientific team from Oak Ridge National Laboratory, the National Energy Research Scientific Computing Center (NERSC), Pittsburgh Supercomputing Center and the University of Bristol (UK) ran a magnetism application at a sustained speed of 1.02 teraflops.

In another technological landmark, the Cray T90™ became the world's first wireless supercomputer when it was unveiled in 1994. Also introduced that year, the Cray J90 series has since become the world's most popular supercomputer, with over 400 systems sold.

Cray Research merged with SGI (Silicon Graphics, Inc.) in February 1996. In August 1999, SGI created a separate Cray Research business unit to focus exclusively on the unique requirements of high-end supercomputing customers. Assets of this business unit were sold to Tera Computer Company in March 2000.


From Yahoo, news wire info
{SGI paid $760 million for Cray Research in 1996}
{In April 2000, Cray (formerly Tera) paid SGI $58 million for the remnants of Cray Research; SGI lost over 92 percent on its "investment" in 3.5 years. The SGI Cray T3E is based on the DEC Alpha chip. The Cray C90 and T90 were the last of the Cray-style vector processors from Cray Research/SGI.}

from http://www.cs.uiuc.edu/whatsnew/newsletter/fall98/chen.html
After earning his MS in 1972, Chen came to Illinois to work with Professor Dave Kuck and graduate student Duncan Lawrie, who were championing the new concept of parallelism in the ILLIAC IV project.
After a year at Floating Point Systems, Chen joined Cray Research as its chief designer, where he led the development of the world's most commercially successful parallel vector supercomputers, the Cray X-MP and its successor, the Cray Y-MP. Chen began by making some architectural changes to the Cray-1, which was introduced in 1976. In the Cray X-MP (Chen said that the "X" stood for "extraordinary"), Chen introduced shared-memory multiprocessing to vector supercomputing. The machine contained two pipelined processors, compatible with the Cray-1, sharing a common memory. The X-MP series was later expanded to include 1- and 4-processor machines. The X-MP4 was the first supercomputer installed at the National Center for Supercomputing Applications (NCSA) at Illinois (summer 1985).
The first of the Y-MP series, Cray’s new multiprocessor vector supercomputer introduced in 1988, contained 1 processor, followed by 8, and then 16. All these machines shared essentially the same architecture, and the majority were designed by Chen and his team. Cray Research enjoyed tremendous growth from 1982–86 as its customer base expanded beyond government laboratories to commercial applications. This was the "heroic age" of the supercomputing industry.
http://wotug.ukc.ac.uk/parallel/documents/misc/timeline/timeline.txt

========1972========

Seymour Cray leaves Control Data Corporation, founds Cray Research
Inc.  (GVW: CDC, CRI)

Details From http://www.cs.umass.edu/~weems/CmpSci635/635lecture16.html
A Case Study: The Cray 1 and Family

The Cray 1 was first delivered in 1976. This was around the same time that 8-bit microprocessors were beginning to gain popularity; typical memory components were 1 Kbit SRAM and 4 Kbit DRAM. Most machines operated at about a 1 MHz clock rate with 32-bit words, and large mainframes had 1 MB to 8 MB of RAM.

The Cray 1 had (Baron and Higbie CS manual)

  • 64-bit words
  • 8 MB of RAM
  • 16-way interleaving on low-order bits
  • 50 ns memory cycle
  • 12.5 ns clock cycle (80 MHz)
  • 12 pipelined functional units

The Cray 1 has 3 basic data types: addresses (24-bit integer), integers (64-bit), and floating point (64-bit, with a 48-bit mantissa).

The 12 functional units are divided into four groups.

Group 1 -- Vector units

Vector (integer) Add: 3 stages
Vector Logical: 2 stages
Vector Shift: 4 stages

Group 2 -- Vector and scalar units

Floating Add: 6 stages
Floating Multiply: 7 stages
Floating Reciprocal Approximation: 14 stages

Group 3 -- Scalar units

Integer Add: 3 stages
Logical: 1 stage
Shift: 2 stages
Scalar population count and leading zero count: 3 stages

Group 4 -- Address units

Add: 2 stages
Multiply: 6 stages

The machine itself is divided into six major subsystems

  • Memory
  • Instruction component
  • Address component
  • Scalar component
  • Vector component
  • I/O component

Instruction Component

Cray 1 instructions are 32 or 16 bits, so from 2 to 4 instructions can be packed into a word. Instructions are thus addressed on 16-bit boundaries while data is addressed on 64-bit boundaries.

The instruction unit has four 16-word instruction buffers, three instruction registers, and one instruction counter. Each 16-bit field in a word is called an instruction parcel.
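Since instructions are addressed in 16-bit parcels while memory is organized as 64-bit words, a parcel address splits into a word address and a parcel index within that word. A small C sketch of that bookkeeping (the struct and field names here are made up for illustration):

    #include <stdio.h>

    /* A 64-bit instruction word holds 4 parcels of 16 bits each, so a
       parcel address is just (word address * 4) + parcel-within-word. */
    struct parcel_addr {
        unsigned word;    /* 64-bit word address in memory            */
        unsigned parcel;  /* which 16-bit parcel inside the word, 0..3 */
    };

    static struct parcel_addr split_parcel_address(unsigned p)
    {
        struct parcel_addr a = { p >> 2, p & 3u };
        return a;
    }

    int main(void)
    {
        /* parcel address 27 -> word 6, parcel 3 */
        struct parcel_addr a = split_parcel_address(27);
        printf("word %u, parcel %u\n", a.word, a.parcel);
        return 0;
    }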

The three instruction registers are

  • Next Instruction Parcel -- holds first parcel of the next instruction, prefetched from buffer
  • Current Instruction Parcel -- holds the high-order portion of the instruction to be issued
  • Lower Instruction Parcel -- holds low-order portion of instruction to be issued

For a 32-bit instruction, the low-order portion is fetched to the NIP and then moved to the LIP. There is no mechanism for discarding instructions in the pipe -- once in the CIP/LIP, they will be issued. At most they will be delayed for some time.

The instruction buffers are tied to the memory via the 16-way interleaving, so it is possible to fill a buffer in 4 clock cycles (recall that the clock is 12.5 ns and memory is 50 ns). Buffers are filled on a demand basis in a round-robin pattern. They thus act as an instruction cache of 256 instructions, organized into four lines of 64 instructions. Each buffer has its own address comparator, so we would call this a fully associative cache (easy to implement when there are only 4 lines). The buffers cannot be written to -- a write bypasses the instruction cache and only goes to main memory.
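A rough C model of that behavior is sketched below: four buffers, each covering a 16-word block, one address comparator per buffer, round-robin refill on a miss, and a stand-in memory whose bank is selected by the low-order address bits. The data layout and names are invented for illustration; the real hardware does this with comparators and registers, not a loop.

    #include <stdio.h>

    #define NBUF       4    /* four instruction buffers                   */
    #define BUFWORDS  16    /* each holds 16 consecutive 64-bit words     */
    #define NBANKS    16    /* 16-way interleave on the low-order bits    */

    /* Stand-in for main memory: word address -> 64-bit "instruction word".
       With low-order interleaving the bank is simply addr % 16, so the 16
       words of one buffer all come from different banks, and a refill takes
       about one 50 ns memory cycle (4 clocks) rather than 16 serial reads. */
    static unsigned long long memory_word(unsigned addr)
    {
        unsigned bank = addr % NBANKS;                    /* servicing bank */
        return ((unsigned long long)addr << 8) | bank;    /* fake contents  */
    }

    struct ibuf   { unsigned base; int valid; unsigned long long word[BUFWORDS]; };
    struct icache { struct ibuf buf[NBUF]; int next_fill; };

    /* Fully associative lookup over the 4 buffers; round-robin refill on miss. */
    static unsigned long long ifetch(struct icache *ic, unsigned addr)
    {
        unsigned base = addr & ~(unsigned)(BUFWORDS - 1);

        for (int i = 0; i < NBUF; i++)                 /* one comparator per buffer */
            if (ic->buf[i].valid && ic->buf[i].base == base)
                return ic->buf[i].word[addr - base];   /* hit */

        int v = ic->next_fill;                         /* miss: pick next buffer */
        ic->next_fill = (v + 1) % NBUF;
        ic->buf[v].base  = base;
        ic->buf[v].valid = 1;
        for (unsigned w = 0; w < BUFWORDS; w++)
            ic->buf[v].word[w] = memory_word(base + w);
        return ic->buf[v].word[addr - base];
    }

    int main(void)
    {
        struct icache ic = {0};
        printf("%llx\n", ifetch(&ic, 100));   /* miss: fills buffer for words 96..111 */
        printf("%llx\n", ifetch(&ic, 101));   /* hit in the same buffer               */
        return 0;
    }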

Scalar instruction issue requires that all of the instruction's required resources be free -- otherwise the instruction waits. Vector instruction issue in the Cray involves reserving functional units, including memory, operand registers and result registers, and then releasing an instruction once all of its resources are available. In addition, some data paths are shared between the vector and scalar components, and these must be available.

The control unit is able to detect when a result register for one vector operation is an operand for another vector operation and, if the two vector instructions do not conflict in any other resource requirements, it sets up a vector chaining operation between the two instructions.
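A toy version of that chaining test might look like the C fragment below. It only checks the register and functional-unit conditions described above and ignores the memory-port and operand-register reservations; all of the names are invented:

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy description of a vector instruction's resource usage. */
    struct vinstr {
        int dest;        /* destination vector register           */
        int src1, src2;  /* source vector registers, -1 if unused */
        int funit;       /* functional unit required              */
    };

    /* The first instruction's result register feeds the second, and the
       second does not collide with the first on the unit or destination,
       so the control unit could chain the second onto the first. */
    static bool can_chain(const struct vinstr *first, const struct vinstr *second)
    {
        bool feeds     = (second->src1 == first->dest) || (second->src2 == first->dest);
        bool same_unit = (second->funit == first->funit);
        bool same_dest = (second->dest == first->dest);
        return feeds && !same_unit && !same_dest;
    }

    int main(void)
    {
        struct vinstr vmul = { 3, 1, 2, 0 };   /* V3 = V1 * V2 on the multiply unit */
        struct vinstr vadd = { 5, 3, 4, 1 };   /* V5 = V3 + V4 on the add unit      */
        printf("chainable: %s\n", can_chain(&vmul, &vadd) ? "yes" : "no");
        return 0;
    }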

Address Component

There are 8 24-bit address registers, 64 24-bit spill registers, an adder, and a multiplier in this component. Its purpose is to perform index arithmetic and send the results to the scalar and vector components so that they can fetch the appropriate operands.

Arithmetic is performed on the address registers directly. The spill registers are used to hold address values that do not fit into the address registers. A set of 8 addresses can be transferred between the address registers and their spill registers in a single cycle. Thus, they bear a certain similarity to the register windows of the SPARC (or vice versa). The spill registers can be thought of as an explicitly managed data cache with 8 lines. Their value is that they reduce the traffic to main memory, freeing that resource for vector operations.
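The sketch below models that idea in C: eight working address registers whose contents can be dumped into, or reloaded from, a chunk of the 64 spill registers as a block, so a loop nest can switch index sets without touching main memory. The layout and function names are ours, not Cray's.

    #include <stdio.h>
    #include <string.h>

    #define NA  8    /* A (address) registers, 24-bit values       */
    #define NB 64    /* spill/save registers backing them          */

    struct addr_unit {
        unsigned a[NA];   /* working address registers: arithmetic happens here */
        unsigned b[NB];   /* spill registers: explicitly managed backing store  */
    };

    /* Move all 8 A registers into spill registers [start..start+7].  The text
       above notes that a set of 8 addresses moves between the address and
       spill registers in a single cycle, avoiding a trip to main memory. */
    static void spill_a_to_b(struct addr_unit *u, int start)
    {
        memcpy(&u->b[start], u->a, sizeof u->a);
    }

    static void restore_b_to_a(struct addr_unit *u, int start)
    {
        memcpy(u->a, &u->b[start], sizeof u->a);
    }

    int main(void)
    {
        struct addr_unit u = {0};
        for (int i = 0; i < NA; i++) u.a[i] = 100 + i;
        spill_a_to_b(&u, 8);          /* save the current index set           */
        for (int i = 0; i < NA; i++) u.a[i] = 0;
        restore_b_to_a(&u, 8);        /* bring it back for the next loop nest */
        printf("a[3] = %u\n", u.a[3]);
        return 0;
    }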

Scalar Component

Similar to the address component, the scalar component has 8 64-bit registers and 64 64-bit spill registers. It has sole access to four functional units: Integer Add, Logical, Shift, and Population Count. The Scalar Component also has access to three functional units that are shared with the Vector Component: Floating Add, Multiply, and Reciprocal Approximation.

Because the scalar component has its own integer units, it can always execute integer operations in parallel with a vector operation. However, for floating point, the vector unit takes priority.

Vector Component

There are 8 64-word vector registers in the vector component. It takes four memory loads to fill a vector register. Normally, this would require 16 instruction cycles; however, careful pipelining in the memory unit reduces the time to just 11 cycles.

A vector mask register contains a bit-map of the elements in a register operand that will participate in an instruction. A vector length register determines whether fewer than 64 operands are contained in a set of vector operands. Manipulating these values is the primary reason for the population and leading zeros counter.

Vector loads and stores specify the first location, the length, and the stride.
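So the semantics of a vector memory reference and a masked vector operation can be sketched in C roughly as follows (this is only a behavioral sketch; the 64-bit mask convention and the function signatures are assumptions made for illustration):

    #include <stdio.h>
    #include <stdint.h>

    #define VLEN 64   /* a vector register holds up to 64 elements */

    /* Vector load: base address, stride, and count (the vector length register). */
    static void vload(double *vreg, const double *base, long stride, int vl)
    {
        for (int i = 0; i < vl && i < VLEN; i++)
            vreg[i] = base[(long)i * stride];
    }

    /* Masked add: only elements whose mask bit is set participate;
       the rest of the destination is left untouched. */
    static void vadd_masked(double *vd, const double *va, const double *vb,
                            uint64_t mask, int vl)
    {
        for (int i = 0; i < vl && i < VLEN; i++)
            if (mask & (1ull << i))
                vd[i] = va[i] + vb[i];
    }

    int main(void)
    {
        double mem[256], va[VLEN], vb[VLEN], vd[VLEN] = {0};
        for (int i = 0; i < 256; i++) mem[i] = i;

        vload(va, mem, 4, 10);                 /* every 4th element: a matrix column, say */
        vload(vb, mem + 1, 4, 10);
        vadd_masked(vd, va, vb, 0x3FFull, 10); /* vector length 10, all 10 mask bits set  */
        printf("vd[2] = %g\n", vd[2]);         /* mem[8] + mem[9] = 17 */
        return 0;
    }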

I/O Component

The I/O component has 24 programmable I/O channel units. I/O has the lowest priority for memory access.


Cray X-MP

  • Extended the Cray-1 architecture to 4-way multiprocessing.
  • Cycle reduced to 8.5 ns (117 MHz)
  • Increased instruction buffers to 32 words
  • Added a multiport memory system.
  • Redesigned the vector unit to support arbitrary chaining.
  • Added Gather/Scatter to support sparse arrays (see the sketch after this list).
  • Increased memory to 16 M words, 32-way interleave
  • Provides a set of shared registers to support fine-grained (loop-level) multiprocessing. There are N+1 sets of these registers for an N-processor system. They include eight address registers, 8 scalar registers, and 32 binary semaphores.
  • The I/O system was improved and a solid state disk cache was added.
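Gather/scatter is what lets the vector hardware work on sparse data: an index vector names the scattered memory locations to pull into, or push out of, a dense vector register. A minimal behavioral sketch in C (names invented):

    #include <stdio.h>

    #define VLEN 64

    /* Gather: vreg[i] = mem[index[i]] -- pull scattered elements into a dense register. */
    static void vgather(double *vreg, const double *mem, const long *index, int vl)
    {
        for (int i = 0; i < vl && i < VLEN; i++)
            vreg[i] = mem[index[i]];
    }

    /* Scatter: mem[index[i]] = vreg[i] -- push register elements back to scattered slots. */
    static void vscatter(double *mem, const double *vreg, const long *index, int vl)
    {
        for (int i = 0; i < vl && i < VLEN; i++)
            mem[index[i]] = vreg[i];
    }

    int main(void)
    {
        double mem[100] = {0}, v[VLEN];
        long   idx[4]   = {5, 17, 42, 99};   /* nonzero positions of a sparse vector, say */

        for (int i = 0; i < 4; i++) mem[idx[i]] = 1.0 + i;
        vgather(v, mem, idx, 4);             /* v = {1, 2, 3, 4}  */
        for (int i = 0; i < 4; i++) v[i] *= 10.0;
        vscatter(mem, v, idx, 4);            /* mem[42] is now 30 */
        printf("mem[42] = %g\n", mem[42]);
        return 0;
    }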


Cray Y-MP

  • Extends the X-MP architecture to 8 processors.
  • Cycle reduced to 6 ns (166 MHz)
  • Extends memory to 128 M words


Cray 2

  • One foreground and four background processors.
  • 4.1 ns cycle (244 MHz)
  • Up to 256 M words of memory
  • 64 or 128 way interleave depending on configuration
  • Eliminates the spill registers in favor of a 16K word cache
  • Cache feeds all three computational components with 4-cycle access time
  • Has 8 16-word instruction buffers
  • Foreground processor controls the I/O subsystem, which has up to 4 high speed communication channels (4 Gb/s).


Practical Considerations in Supercomputer Design

To achieve such high speeds, high-power (i.e. hot) drivers are employed, signals are detected with specialized analog circuits, conductors are all shielded and precisely tuned in both impedance and length, and data is encoded with error-correcting codes so that losses can be recovered.

In addition, the circuits are usually designed to operate in balanced mode so that there is no change in power drawn as drivers switch. As one driver switches from low to high, another switches from high to low, so that the power supply sees a DC load and there is no coupling of switching noise back into the logic via the power supply. In addition, using balanced signal lines can increase the signal to noise ratio by 6dB, although these are not often used. In a design such as the Cray-1, roughly 40% of the transistors supposedly do nothing but balance the power loading.

Even so, these machines dissipate large amounts of heat. The IBM 3090 uses special thermal conduction modules in which a multichip substrate is mounted in a carrier with built-in plumbing for a chilled-water jacket. CDC used a similar system in its designs, and in one instance a maintenance crew pumped live steam through the building air conditioning system, which crossed over to the processor, with predictable results. This raises the issue that these machines usually need thermal shutdown systems, and possibly even fire suppression gear.

The Cray-1 series uses piped freon, and each board has a copper sheet to conduct heat to the edges of the cage, where freon lines draw it away. The first Cray-1 was in fact delayed six months due to problems in the cooling system: lubricant that is normally mixed with the freon to keep the compressor running would leak through the seals as a mist and eventually coat the boards with oil until they shorted out.

The Cray-2 is unique in that it uses a liquid bath to cool the processor boards. A special nonconductive liquid (Fluorinert) is pumped through the system, and the chips are immersed in it.

Special fountains aerate the liquid, and reservoirs are provided for storing the liquid when it is pumped out for service. This is somewhat reminiscent of the oil cooling bath that was sometimes used in magnetic core memory units.

The ETA-10 was originally going to use a liquid nitrogen bath, but I believe this turned out to be too difficult to implement (on a side note, I have known scientific labs where the researchers deal with cooling problems in air-cooled machines by opening a tank of liquid nitrogen at the inlet, but that's not quite the same).

As a final note, Lawrence Livermore National Labs has announced that it will henceforth buy no more vector supercomputers. The handwriting is clearly on the wall for this breed of system, and all of the major manufacturers are moving, finally, to parallel processing.

Number manufactured

Cray-1 - 85 - http://www.dg.com/about/html/cray-1.html

User Experience
A History of Supercomputing at Florida State University by Jeff Bauer


If you have comments or suggestions, Send e-mail to Ed Thelen

Go to Antique Computer home page
Go to Visual Storage page
Go to top

Updated April 12, 2000