A History of Supercomputing at Florida State University
(Written in early 1991)
As the result of an unsolicited proposal by the Florida State University (FSU) to the U. S. Department of Energy (DOE), a collaborative agreement was initiated in late 1984 between the State of Florida, FSU, DOE and Control Data Corporation (CDC). This first brought about the creation of the FSU Supercomputer Computations Research Institute (SCRI) followed by the delivery in March 1985 of interim computer hardware to the FSU Computing Center (FSUCC). By April, the CDC Cyber 205 supercomputer became available to SCRI associates, other DOE researchers around the U. S. and the State University System of Florida.
The intention was to get both users and support personnel up-to-speed on the Cyber 205, which was running the VSOS operating system, in advance of the installation of ETA-10 Serial # 1. In its final configuration, the ETA-10 would provide twelve times the processing power and a compatible VSOS environment. The plan assumed that after a successful period of acceptance testing on the ETA-10, the Cyber 205 and its file server (a Cyber 835) would be returned to CDC.
The first ETA-10 processor was shipped from St. Paul, Minnesota on December 31, 1986. The next year, 1987, was principally a year of installations and monitor mode testing before an operating system was available on the machine. The ETA Operating System (EOS) VSOS environment had become sufficiently mature by January 1988 to allow early user access. However, late in that year, EOS was overtaken in functionality by a native UNIX system, ETA System V, based on the UNIX System V (release 3.0) operating system licensed by AT&T. In April of 1989 the four processor ETA-10G, with the shortest clock cycle time yet (7 ns), was installed. Ironically, the week after the completion of the installation of the G, the parent company of ETA Systems, Control Data, shut down the ETA operations.
The demise of ETA Systems, Inc. forced a re-evaluation of the continued use of ETA hardware. An agreement was reached between Control Data, FSU, and Cray Research, Inc. to exchange the ETA-10G with a four processor Cray Y-MP. The ETA-10G remained in production until March of 1990. A two processor ETA-10Q provided an interim platform until the installation of the Cray Y-MP was completed in early April of the same year. The ETA-10Q was available until November of 1990 as an additional computing resource alongside the Y-MP.
In February of 1990, a Connection Machine-2 was installed at SCRI, providing a different style of supercomputing: massively parallel processing. Researchers and scientists from a variety of disciplines are using the CM-2 to investigate parallel algorithms in high-energy physics, lattice gauge theory, and materials science.
All of these supercomputers provided unique features and abilities that have enhanced the ``high-end'' computer capabilities at FSU for the past six years. In addition, the FSU supercomputer experience gave rise to a number of ``lessons learned'' that are mentioned later, especially with respect to the ETA experiment.
The Control Data Cyber 205
1985 : Our First Supercomputer
In March, 1985, the Computing Center took delivery of its first supercomputer, a CDC Cyber 205, along with a front-end file server CDC Cyber 835. The Cyber 205 system included: CPU with 20-nanosecond clock cycle, 2 vector pipelines, 32 megabytes of central memory and 7.2 gigabytes of on-line disk storage. The Cyber 205 had a theoretical peak performance of 200 MFLOPS, and a LINPACK rating of 17 MFLOPS. The Cyber 835 system added another 20 gigabytes of on-line disk. In addition, four 6250 bpi on-line tape drives were shared between the Cyber 205 and the Cyber 835. Communications between the Cyber 205, its peripherals, and various front-end mainframes were handled by a Loosely-Coupled Network (LCN) consisting of four separate coaxial trunks. In its final configuration, the LCN connected the Cyber 205 to two CDC Cybers, two DEC VAX computers, the ETA-10 and an IBM mainframe.
By April, 1985, the Cyber 205 was running production code from local researchers and DOE researchers around the country. The operating system software was soon upgraded to VSOS 2.2, providing more features and increased stability. Languages supported were: FORTRAN (with vectorizing pre-processor), C with vector extensions, and the Cyber 205 assembly language. Numerous mathematical packages such as CERN, IMSL and MAGEV were installed, as well as the DI-3000 and NCAR graphics packages.
Local System Enhancements
During the first few months of production use, the operating system was gradually tailored to better meet the needs of our user base. Job categories with various system resource limits were created to maximize throughput for our particular job mix. Modifications to the operating system were made in job scheduling and accounting. Utilities were written so that researchers could better manage allocated CPU time and batch job execution.
In the area of job scheduling, two system modifications were made to prevent any single user from monopolizing a job category. The first mod changed the way in which jobs in the input queue were processed. In standard VSOS, jobs were processed on a first-in-first-out basis, thus allowing a user who submitted a large number of jobs at once to block others from running. This mod caused the input queue to be processed on a round-robin basis by user number. The second mod, obtained from Purdue University, limited the number of jobs a user could execute simultaneously in a job category.
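The two scheduling changes described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the VSOS modification itself: the function names, the job tuples, and the choice to cycle through users in order of first submission are all assumptions made for the example.

```python
from collections import OrderedDict, deque


def round_robin_order(jobs):
    """Return jobs in round-robin order by user.

    `jobs` is a list of (user, job_name) tuples in submission order.
    Users are visited in order of first submission (an assumption;
    the text only says "round-robin basis by user number").
    """
    queues = OrderedDict()
    for user, job in jobs:
        queues.setdefault(user, deque()).append(job)

    order = []
    while queues:
        # Take one job from each user in turn, dropping users with no jobs left.
        for user in list(queues):
            order.append((user, queues[user].popleft()))
            if not queues[user]:
                del queues[user]
    return order


def may_start(user, running_counts, limit):
    """Second mod: allow a job to start only if the user is below the
    per-category limit on simultaneously executing jobs."""
    return running_counts.get(user, 0) < limit
```

With strict first-in-first-out processing, a user who submits three jobs at once holds the first three places in the queue. With the round-robin ordering, the other users' first jobs are interleaved ahead of that user's second and third jobs.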
In November, 1985, variable rate accounting was implemented. Jobs of different classes were charged different rates and were allocated resources accordingly. Three job classes were defined: standby, normal and high priority. Standby class jobs received one fifth the normal time slice and were charged one fifth the normal rate of one System Billing Unit (SBU) per CPU second. High priority jobs were automatically expedited by the system, given a time slice five times normal, and charged five times the normal rate. To help researchers keep track of their CPU allocation, a message was written to the dayfile at job termination giving the number of SBUs remaining.
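The charging rule can be written out as a short Python sketch. Only the rates come from the text (one SBU per CPU second for normal jobs, one fifth of that for standby, five times that for high priority). The names `RATE_MULTIPLIER`, `sbu_charge` and `sbus_remaining` are hypothetical, and exact fractions are used only to keep the arithmetic free of rounding.

```python
from fractions import Fraction

# SBUs charged per CPU second, by job class (rates taken from the text).
RATE_MULTIPLIER = {
    "standby": Fraction(1, 5),
    "normal": Fraction(1),
    "high": Fraction(5),
}


def sbu_charge(cpu_seconds, job_class):
    """Return the SBUs charged for a job of the given class."""
    return Fraction(cpu_seconds) * RATE_MULTIPLIER[job_class]


def sbus_remaining(allocation, cpu_seconds, job_class):
    """Return the allocation left after the job, the figure that the
    dayfile message reported at job termination."""
    return Fraction(allocation) - sbu_charge(cpu_seconds, job_class)
```

For example, a job using 100 CPU seconds costs 20 SBUs in the standby class, 100 SBUs in the normal class, and 500 SBUs at high priority.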
Some stand-alone utilities were also written to aid in CPU allocation management and batch job tracking. Companion programs MOVETIME and LISTTIME allowed the ``master user'' of an account to transfer time between sub-ordinals of that account, and list the time remaining for the sub-ordinals. SUPSTAT periodically sent snapshots of various console displays to the front-end. The information included queue statuses, disk and CPU utilization, and system uptime. SUPDROP allowed a user to drop one of his supercomputer jobs from the front-end after supplying the appropriate validation data (e.g. user number and password).
Availability and Usage
The Cyber 205 remained in service until October 24, 1989. During its 4 1/2 years at FSU, the Cyber 205 was available for use over 38,000 hours (95% of wall-clock time, and 97.6% of scheduled uptime). The addition of an Uninterruptible Power Source (UPS) in December, 1986, significantly decreased downtime due to power outages, from 148 hours in fiscal 1986 to only 10 hours in fiscal 1987. The mean-time-between-failure rate went from one failure every 35 hours before the UPS was installed to one every 127 hours afterward.
From the day it came up for production until the day it was shut down, the Cyber 205 CPU was in use 96% of the time it was available. After the implementation of the standby job category in November, 1985, it was in use over 98% of the time available. Though not cutting-edge technology (ours was the next to last one built), the Cyber 205 was a consistent performer, and provided a reasonably stable supercomputing environment for researchers at FSU and around the country.
The ETA Systems ETA-10
1987 : BEFORE AN OPERATING SYSTEM
Installation of the first prototype ETA-10 processor began at the FSU Computing Center on January 5, 1987. The clock cycle time for this CPU was 12.5 nanoseconds and within two weeks it was running a FORTRAN job transferred from the Cyber 205, in monitor mode. This process required the compilation of source code and subsequent loading of object code on the Cyber 205 to create a controllee, or executable file, which was then transferred to the ETA via an Apollo workstation. The binary would then be loaded into memory on the ETA-10 and run directly on hardware, not under the control of an operating system. Limited output could be obtained from the equivalent of core dumps.
A second CPU arrived in the Spring, and by Summer, a four processor (12.5 ns clock) configuration was in place. No user access was available at this stage, but the FSU installation team was able to perform some benchmarking and special purpose testing.
It was in the fall of 1987 that the machine was upgraded to full ETA-10E specifications with a 10.5 nanosecond clock cycle time, 4 million words of local memory per CPU, 128 million words of shared memory and 14.4 billion bytes of online disk. In October, an Ising model was running in multiprocessing monitor mode, achieving a new world record for performance (6 giga-flips per spin). Table 1 from the LINPACK report showed the ETA-10 leading the list for performance on the full precision, all-FORTRAN benchmark. The top three entries were ETA-10E: 52 MFLOPS, NEC SX-2: 43 MFLOPS, and Cray X-MP/4: 39 MFLOPS.
Work had begun in St. Paul on the `W series' of the developmental EOS operating system and, by the end of the year, prototypes were being evaluated at FSU. In the meantime, supercomputer support personnel from FSUCC and SCRI were still concentrating on the production Cyber 205 service. The user base had built up while expertise had been gained and passed on. FSUCC consolidated operations in December with the relocation to Sliger Building, Innovation Park, of the Cyber 205 and support staff. It is worth noting that the new home for FSUCC had essentially been built around the ETA-10 which was already in place.
1988 : ETA OPERATING SYSTEM - VSOS
By January 1988, the W15 pre-release of EOS was considered mature enough for official release and an ``early user access'' program began. This system was billed as providing a fairly full and relatively stable VSOS environment along with local batch/interactive access, RHF file transfer and a state-of-the-art vector preprocessor (ETA VAST-2). It was released on February 12th as EOS 1.0. X and Y series versions of the operating system were under development by ETA to become EOS 1.1 and 1.2, respectively, over the next two quarters.
At the end of April, EOS 1.1 was installed at FSU. Features included a remote batch facility, large page support and improved handling of multiple users per CPU. Interim EOS releases were received over the next few months with EOS 1.1A in May, 1.1B in July and culminating in EOS 1.1C on September 1. The main emphasis here was on operating system stability and some benefit was observed. However, with more than two users on a processor simultaneously, software crashes were still expected frequently under EOS 1.1C. Furthermore, EOS 1.2 was now looking to be a distant prospect, this system being the first expected to provide multiprocessing support.
It was around this time that plans were made to evaluate ETA System V, the active port by ETA of the UNIX operating system licensed by AT&T. A task force was assembled by FSU and dispatched to St. Paul on September 9, under the observance of DOE. After two solid days of testing on a single liquid-cooled ETA-10 processor, the team returned to Tallahassee to report its findings. The conclusion was that the improved stability, response and usability of the pre-released UNIX system outweighed the production performance losses, notably in the area of I/O, against EOS/VSOS. FSU was assured that certain ``buggy features'' under UNIX would be eliminated in the released version.
FSU support personnel were galvanized into action after the official release on October 3 of UNIX 1.0. The EOS service was discontinued at FSU on October 23 and UNIX installed. By then, UNIX 1.0a was available and extensive testing began within the week. Users were given access on November 7 and brief FSUCC Technical Information Packets (TIPs) were distributed. Areas covered included an introduction to UNIX on the ETA-10, the ftn77 compilation system, ed and vi editors, telnet and FTP. Ftn77 provided an integrated system with access to a pre-processor, the ETA VAST-2 vectorizer, the FTN77 compiler (like FORTRAN 200 on the Cyber 205) and a library link editor.
A single processor, air cooled ETA-10Q (19 ns clock) was deployed on December 12, initially to help those ETA users with short term migration problems. The ``piper'', as the air cooled ETA-10 was known, ran the latest version of EOS, namely 1.1C, and allowed two or three local researchers the chance to complete calculations that would not otherwise have been possible on a busy Cyber 205. The system proved to be a success for a user community of this size with production VSOS jobs.
1989 : ETA SYSTEM V - UNIX
Before moving completely into 1989, we mention that UNIX 1.0b became available at FSU on December 29. Some improvement in stability was observed, although few ``UNIX bugs'' were fixed. It is worth recalling that FSUCC maintained an ETA UNIX bugs list from the outset which, by the end of the year, had grown to some fifty local entries.
Despite its problems, UNIX was really working out much better for the FSU supercomputer community at large. The system was more versatile and supported more users at any one time than EOS was capable of. Often, a single processor would be running a dozen interactive sessions, three NQS local batch jobs and the occasional background process. As a result, system stability has remained much the same as with EOS which could only support a couple of jobs concurrently at best. The main difference has proven to be that, under UNIX, the ETA-10 was an interactive supercomputer while, under EOS, the ETA-10 could only be used as a remote batch machine.
Progress was being made in St. Paul on the ETA-10G, the four processor (7 ns clock) replacement for the ETA-10E, so plans were made for interim access to UNIX. EOS was discontinued on the ETA-10Q on February 21, and the machine removed, while installation of a dual processor ETA-10Q ``piper'' began February 27. A pre-release of UNIX 1.1 was mounted for testing on this new piper that provided improved I/O performance, a factor roughly 5-10 times better. This was mostly realized for shared memory to/from disk transfer, although paging from local to/from shared memory had also become more efficient. Moving users from a four processor E series machine to a two processor Q was obviously going to be difficult due to a four-fold reduction in processing power. It was planned that user disks would be moved temporarily onto the piper in advance of the G series machine becoming available.
The ETA-10E was removed from service on March 16, and the next day users had access to pre 1.1 on the piper. The system coped quite well but it was fortuitous that several users took a break from supercomputing for a month. Installation of the ETA-10G began on March 28, by which time its predecessor was out of the way. During the first quarter of 1989, FSUCC supercomputer support staff had been devising a new user guide known as the ``ETA-10 Quick Book''. This was completed at the end of March and distributed to all principal investigators.
As we were planning the announcement of the migration of users to the ETA-10G, word was received on April 17 that Control Data Corporation had closed ETA Systems and terminated its employees. This news came as a shock to FSU who had entered formal negotiations with CDC regarding a potential hardware upgrade to the ETA equipment. In the meantime, it was to be ``business as usual'' at FSUCC and users were duly given access to UNIX on pre 1.1 on the ETA-10G on April 21.
The ETA-10G continued to provide supercomputing UNIX cycles until deinstallation and replacement with a Cray Y-MP in March of 1990, detailed in the next section. The ETA-10Q was used as an interim machine between the deinstall of the G and the install of the Y-MP. The Q continued to provide supercomputing UNIX cycles with no hardware or software maintenance until hardware problems forced a shutdown in November of 1990. It proved to be a useful platform for researchers still requiring ETA-10 cycles while porting their applications to the Y-MP and ran for quite a few months under ``local support''. Some of the peripheral devices taken from the Q have proven useful on other computers.
The Cray Y-MP
1990 : Installation
On November 15th, 1989 it was announced that an agreement had been reached between FSU, Control Data, and Cray Research that the existing ETA-10G would be exchanged for a comparably-equipped Cray Y-MP, to be manufactured and delivered by Cray in late February and early March of 1990. Over the next few months, FSUCC, SCRI and Cray personnel hammered out the fine details of the exact configuration, and a time line of events was generated. The first item on the time line was training on UNICOS installation for Systems group members, which took place between February 19th and February 21st at Cray's training facility, located in Eagan, Minnesota. The Systems group trainees then went to Cray's manufacturing checkout facility in Chippewa Falls, Wisconsin and, along with Cray analysts, installed UNICOS 5.1 on the yet-to-be shipped Cray Y-MP, serial number 1513.
Events started to really pick up at the end of that week. On March 9th, the disks and tape drives connected to the ETA-10G were moved onto the ETA-10Q2 and Control Data began the removal of the ETA-10G equipment. By March 10th, plumbers and electricians were busy at work running power lines and pipes for the Cray. This work required some machine room downtime due to the interruptions of basic services, like chilled water and electric power, but all were completed and the machine room back in shape by March 12th.
The next two weeks saw the orderly installation of the Cray support equipment, including the motor generator and condensing unit. The raised floor was rebuilt to support the Cray as the ETA-10G's footprint was larger and did not use the raised floor for support. Since the Cray is totally supported on the raised floor, additional pedestals and tiles were installed.
The Cray and all of its peripherals arrived on Monday, March 26th. By March 28th, the mainframe and peripherals were installed and powered up. Engineers then spent the next few days going through the exhaustive hardware checkout and testing process. Cray analysts then flocked to the machine and, using the file systems and UNICOS kernel built earlier in Chippewa Falls, quickly brought up the operating system for software checkout and testing.
The final piece put into place was the installation and checkout of the Network Systems Hyperchannel gear. This would provide ethernet access to the Cray, as well as a high speed link to the SCRI VAX 8700. FSUCC, Cray, SCRI, and Network Systems personnel worked diligently over the weekend of March 31st to achieve this milestone.
On April 5th Cray was satisfied with the installation and turned the machine over to FSUCC analysts who began the customization process. The next day files were copied from the ETA-10Q2 via magnetic tape and user names created using the password files from piper0 and piper1. The machine was almost ready for users.
On April 9th, as originally scheduled, the Cray became officially available for production with a pre-installed user base of files and user names, although some researchers had been running production programs since April 5th.
1990-1991 : A PRODUCTIVE FIRST YEAR
In August of 1990, an additional 10 GB of on-line mass storage was added to the Cray.
In late March or early April of 1991, the Cray will be upgraded to UNICOS 6.0.
The first year of supercomputing on the Cray Y-MP ends on a high note. The machine has been up for better than 99% of its scheduled uptime, and usage has been high, with the CPU busy for more than 90% of the time available.
The dramatic difference between the maturity of UNICOS on the Cray versus ETA System V has contributed to a much more stable software platform and enhanced use of the supercomputing environment. All aspects of FSU supercomputing have been aided by the Y-MP presence, including networking, user training and documentation, vendor support, systems monitoring and tuning, and overall hardware reliability.
The Connection Machine
In addition to the more traditional vector supercomputers mentioned earlier, FSU installed in February of 1990 a massively parallel SIMD (Single Instruction Multiple Data) Connection Machine-2 from Thinking Machines Corporation. The CM-2 was installed on the fourth floor of the Dirac Science Library, within the domain of SCRI.
The CM-2 is a 16 dimensional hypercube interconnect of 65,536 single-bit processors, with an additional 2,048 64-bit floating point processors available. It is connected to a front end machine, a VAX 6420 running Ultrix. A 10 GB parallel disk array, the Data Vault, is available for mass storage requirements and a high speed video frame buffer provides real time graphic images.
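The figures above fit together as follows: a 16-dimensional hypercube has 2^16 = 65,536 nodes, and each node is linked to the 16 nodes whose addresses differ from its own in exactly one bit. The Python sketch below shows this textbook addressing scheme only. It is not a description of the CM-2's actual routing hardware, and the function names are invented for the example.

```python
DIMENSIONS = 16  # from the text: a 16-dimensional hypercube


def node_count(dimensions=DIMENSIONS):
    """Number of nodes in a hypercube of the given dimension."""
    return 1 << dimensions


def neighbors(node, dimensions=DIMENSIONS):
    """Addresses that differ from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hop_distance(a, b):
    """Minimum number of links between two nodes: the number of
    address bits in which they differ."""
    return bin(a ^ b).count("1")
```

Under this scheme no two nodes are more than 16 links apart, which is the appeal of the topology for a machine with this many processors.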
The CM-2 is being used to solve problems in high-energy physics, lattice gauge theory, and materials science.
The FSU Supercomputing Experience
Florida State University has gained much knowledge regarding the installation, operations, administration, and applications of supercomputers. Along the way, a variety of ``lessons learned'' are worth note.
A user of a computer perceives the success of the equipment in many ways, from how much application software is available to how often the computer is accessible. A top level indicator of the success of providing supercomputer accessibility over the past six years can be seen in the Mean Time Between Failure (MTBF) rates:
Supercomputer              MTBF (in hours)
Cyber 205 (with no UPS)    34.7
Cyber 205                  127.2
Cray Y-MP                  2,064.2
ETA-10                     25.4
The dramatic differences between the failure rates reflect not only the different vendors' abilities to create and maintain a particular hardware solution, but they also indicate the increasing aptitude of FSU and the FSU Computing Center in particular for running large-scale computing facilities.
The ETA-10 experience was certainly unique and is worthy of additional comments:
- Although not reflected in the MTBF rate, since it includes hardware and software failures, in actuality the ETA-10 hardware was quite reliable. Even the liquid-nitrogen cooled systems, using cryogenic techniques not traditionally associated with computers (and apparently not since), enjoyed a high amount of availability.
- The largest stumbling block with the ETA-10 was the apparent late start with serious operating system development. The ETA-10 could have been more fully utilized from the beginning if a stable, robust operating system had been available.
- The ETA-10 demonstrated excellent use of state-of-the-art and emerging technologies, with the use of custom CMOS VLSI, B.E.S.T. built-in self test logic, a 40+ multilayer board, fiber optics connectivity to I/O devices, cryogenic cooling, and the broad range of available configurations. In retrospect, however, the continued use of the Cyber 205 abstract architecture, with memory-to-memory long vector pipelines supplemented by a somewhat underpowered scalar processor, did not seem justified with respect to the lack of wide acceptance of the earlier 205. It is ironic that even with such careful attention to almost identically matching the instruction set between the 205 and the ETA-10, the approach to operating system development appeared to be an effort almost from scratch, with the subsequent delays and unreliability that any major software effort of that magnitude would experience.
- It certainly did not help ETA that major components of their computer system, such as the custom VLSI logic chips and the high speed memory chips, suffered scheduling problems. This reliance upon other domestic and foreign firms that were unable to produce either sufficient quantity or chips that were fast enough made schedules slide.
- The ETA-10 software experience occurred in the midst of the ``open standards'' hue and cry that arose in the mid to late '80s. ETA was late to jump on the UNIX bandwagon, with software talent distracted in the early years doing EOS development. It is pretty much accepted that if UNIX had been the original operating system effort then it would have been more timely and widely accepted, perhaps to the point of ensuring ETA's success. Witness the ease of migration between the relatively immature ETA System V UNIX and Cray's UNICOS during the supercomputer switchout -- UNIX allowed the user's files and shell scripts to port over to the Cray with little to no changes.
- Environmental support for a cryogenic supercomputer is not without cost. The original cryogenerator system, which recycled the nitrogen, experienced a higher frequency of maintenance periods and proved to be more expensive than just buying the liquid nitrogen in bulk and allowing the excess to vent off. Even so, over 7,000 gallons a week were required to keep the two cryostats containing the CPU boards at sufficient levels for daily operation.
- Sufficient resources and expertise were not available at FSU to take over software development on the ETA-10. Had Serial # 1 been placed at a large government laboratory, resources may have been brought to bear at an earlier stage to overcome the software problem, perhaps keeping ETA Systems in business. Support for the U. S. supercomputer industry was considered an important element of the FSU/DOE strategy, but it would appear that there is now only one domestic vendor in the market place.
If you have comments or suggestions, Send e-mail to Ed Thelen
Updated April 12, 2000