The Vector Gravy Train

 

John Gustafson

 

Amdahl's law is back! According to Peter Gregory [1], we can all rest easy that massively parallel processing will always be merely a niche market... so keep buying all those Cray Y-MPs, please! Jack Worlton [2] warns us against MPP because the infrastructure isn't there to support it. Worlton, who apparently hasn't visited a record store in the last few years, says MPP will show the same technological failure (?) as optical disks.

Gregory and Worlton maintain a long-established Cray tradition. In the mid-1970s, Seymour Cray scoffed at vector arithmetic in computers. "I wouldn't know what to do with it," he said. Soon after, he introduced the CRAY-1 vector processor. With no parity on its memory. "Parity is for farmers!" he quipped. Suddenly, the CRAY-1 got a full error-correction code. In the early 1980s, he told Business Week, "No one, and I mean no one, knows how to program these big parallel computers." Then his CRAY-2 turned up with 4 or 8 processors and provisions for programming them to work in concert. Now Gregory and Worlton have lashed out at MPP just as their employer, Cray Research, introduces a massively parallel computer based on DEC's Alpha processor. Perhaps to preserve sales in older product lines, folks at Cray have long ferociously derided the concepts that they end up embracing.

 

Spurious and Valid Reasons for Rejecting MPP

There are valid reasons for continuing to use vector computers, but Amdahl's law, "general purpose" design, and memory are surely not among them! Here are some valid reasons:

 

Notice that I didn't say they were good reasons, just that they were valid reasons. Much work has gone into deflecting any burden from the end user of big vector systems, mostly by shifting it to hardware designers, compiler writers, and taxpayers. Unfortunately, people like Gregory and Worlton have mistaken that shifted burden for evidence that the traditional approach is inherently less burdensome overall, or that MPP is inherently more specialized. The Amdahl's law and memory arguments particularly deserve a closer look.

 

Amdahl's Law: The "Maginot Line" of Computing

Amdahl's law says parallel computers can't work. While we're at it, airplanes can't work, either. The law of gravity says that what goes up must come down. Very short flights might be possible, but general-purpose transportation with heavier-than-air craft is clearly absurd. Scientists as distinguished as Lord Kelvin fell into this trap throughout the 19th century, deriving rigorous arguments from shaky assumptions. To cite Amdahl's 1967 arguments is to assume that the problem stays the same size no matter how many processors are applied to it, that cost is proportional to the number of processors, and that a processor not working on your job is a processor wasted. None of these assumptions holds up:

 

Larger processor ensembles are used on larger problems. When problems are scaled, it is usually the parallel operations that scale; I/O, load imbalance, and other overheads grow slowly by comparison [3]. You don't use 1000 painters to paint a kitchen, but people who graph "MFLOPS" versus number of processors forget that common sense when discussing MPP. MPP users have long avoided the "serial bottleneck" problem the same way the Germans overcame the heavily fortified Maginot Line in WWII: go around it. Attack a different problem.
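To make the scaling argument concrete, here is a minimal sketch in C; the 1% serial fraction and 1024 processors are hypothetical figures chosen purely for illustration, not measurements from [3]:

    #include <stdio.h>

    int main(void)
    {
        const double s = 0.01;   /* hypothetical serial fraction, for illustration only */
        const int    n = 1024;   /* number of processors */

        /* Amdahl's law: the problem size is held fixed as processors are added. */
        double fixed_size_speedup = 1.0 / (s + (1.0 - s) / n);

        /* Scaled speedup (in the spirit of [3]): the parallel work grows with n,
           while the serial work does not. */
        double scaled_speedup = s + (1.0 - s) * n;

        printf("Fixed-size (Amdahl) speedup on %d processors: %7.1f\n", n, fixed_size_speedup);
        printf("Scaled speedup on %d processors:              %7.1f\n", n, scaled_speedup);
        return 0;
    }

With those assumptions, the fixed-size model caps the speedup near 91 no matter how many processors are added, while the scaled model credits the ensemble with a speedup of roughly 1014.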

Cost isn't proportional to the number of processors. Suppose we double the number of processors and get only a 60% performance improvement. Does that mean parallel processing isn't cost-effective? What if doubling the processors increases the system price by only 50%? Most users will accept the higher performance and reduced cost per calculation, and ignore the hand-wringing over "efficiency." These days, processors are hardly the dominant cost of most computer systems once you figure in software, storage, networking, and people.
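The arithmetic is worth spelling out. Using the hypothetical figures above (60% more performance for 50% more money), the cost of each calculation actually drops:

\[
\frac{\text{new cost per result}}{\text{old cost per result}}
= \frac{1.5 \times \text{price}}{1.6 \times \text{throughput}} \cdot \frac{\text{throughput}}{\text{price}}
= \frac{1.5}{1.6} \approx 0.94,
\]

a savings of roughly 6% per result, the "inefficiency" notwithstanding.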

Processors need not go idle in an MPP system while there are jobs waiting in the queue. The technology for time-slicing users has been around at least 25 years. On an IBM mainframe, few would be foolish enough to argue that one shouldn't buy the memory required by large jobs because some applications wouldn't need it. Why apply the corresponding argument to processors? A pool of processors is simply another resource to manage.

 

Just How Parallel is a Cray?

Worlton and Gregory both contrast the C90 with MPP approaches. Just how much parallelism is there in a C90? Each of the 16 processors has two multiplier pipes, two adder pipes, a reciprocal approximation pipe, integer and logical pipes, and eight memory pipes. Each pipeline has an average of about ten stages operating in parallel. Call it 400 things happening at once, give or take 100. Is 400 now the number for "moderate" parallelism as defined by the vector computing advocates, while we radicals use "massive" parallelism with over 1000 processors? If a "scalar operation" is a single memory-to-memory operation, then I suspect the C90 and a collection of 1024 microprocessors would look very similar in terms of Amdahl's law.

The issue isn't parallelism in the hardware; it's the software environment. In the Good Old Days, 100% of the burden of providing speed fell on the hardware designer. Then the hardware designer started to hit the physical limits of information transmission and heat removal, and the burden shifted to the compiler designer. During the 1980s, clever compiler writers allowed us to continue to use old-fashioned programs by extracting vector operations, planning delayed branches, and using tiered, shared memory automatically. The hardware designer and the compiler designer are now both pretty near the limit of what they can do without help from a third party: the user. When people use phrases like "parallel processing," they don't mean hardware or things that can be done automatically. They mean extra management the application programmer must do explicitly. Like it or not, that's where the biggest gains lie. A competent programmer willing to change paradigms can get a bigger speed increase in a week's effort than one can get in a decade of hardware improvements acting on stagnant code.

Is MPP specialized? A serial architecture only fits serial problems, wasting time when there is parallelism. A parallel computer uses one processor on the serial problem and lets the other processors do other jobs, wasting little. Worlton brings up "... the fact that there are inherent limitations to the use of parallelism in problem solving" but does not elaborate. Are we supposed to retreat to the harsher limits of serialism in problem solving? When people say MPP is specialized, I ask, "Is a group of 1000 people more specialized than a single person?"

 

Memory Speed and Cost

Now let's look at memory bandwidth, and the cost of disguising the von Neumann bottleneck in traditional computers. Cray memory is massively interleaved, pipelined (20 to 60 clock cycles), and goes through a switching network to associate multiple banks with multiple processors. Peak bandwidth is impressive, but only if there are no bank conflicts, hot spots, or references to recently used banks. In other words, the bandwidth numbers assume an idealized parallelism: that all 1024 words a C90 can move in 8 clock cycles are completely independent. With those assumptions, the C90 has a memory bandwidth of 256 GB/sec. In contrast, Intel Paragon memory provides an aggregate bandwidth of 800 GB/sec for a 2048-processor system. As with the Cray, we must assume independence of memory accesses in figuring that aggregate. Now let's look at the price of that fast memory.
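First, though, a quick check on the bandwidth arithmetic itself, taking the C90 clock period to be roughly 4 ns (a round-number assumption on my part):

\[
\frac{1024 \text{ words} \times 8 \text{ bytes/word}}{8 \text{ clocks} \times 4 \text{ ns/clock}}
= \frac{8192 \text{ bytes}}{32 \text{ ns}}
= 256 \text{ GB/sec},
\qquad
\frac{800 \text{ GB/sec}}{2048 \text{ nodes}} \approx 0.4 \text{ GB/sec per node}.
\]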

 

A Pricing "Benchmark" Experiment

Gregory used the dubious assumptions underlying Amdahl's law to show that MPP is "Not a Cheaper Solution." Worlton cites the opinions of others to support anti-MPP arguments, and warns that MPP isn't economical. I thought I'd get some experimental data. The result of dozens of phone calls to salespeople and marketing departments appears in Table 1.

TABLE 1
Price of a Megabyte of Main Memory from Various Vendors
Computer "Street Price" Source List Price Source
Small Systems
Apple Macintosh Iisi $35 April MacUser $124 Local Apple Dealer
386-based PC $38 April Byte $112 Local PC Dealer
Sun SPARCstation 2 $49 J. Hake, Sun $81 SunExpress Catalog
SGI Indigo $66 M. Stewart, SGI $172 M. Stewart, SGI
DECStation 5000 $100 M. Supple, DEC $220 M. Supple, DEC
IBM RS/6000 $140 B. Butler, IBM $200 IBM Price Sheet
HP 750 $175 G. Kalis, HP $250 HP (800) Number
Large (or Scalable) Systems
nCUBE 2 $192 R. Buck, nCUBE $350 R. Buck, nCUBE
Intel Paragon $258 L. Drevitch, Intel $344 L. Drevitch, Intel
MasPar MP-1 $289 D. Obershaw, MasPar $385 D. Obershaw, MasPar
Alliant $400 F. Powers, Alliant
Convex C-3200,3400 $300 B. Baker, Convex $400 B. Baker, Convex
TMI CM-2 $500 G. Rancourt, TMI
Intel iPSC/860 $375 D. Redman, Intel $500 D. Redman, Intel
Convex C-3800 $578 B. Baker, Convex $770 B. Baker, Convex
Kendall Square** Not yet available as a separate option; 32 MB/node only B. Bickford, KSR
CM-5** Not yet available as a separate option; 32 MB/node only M. Maraglia, TMI
NEC (HSNX) SX-3 $2740 P. Crumm, HSNX
Hitachi EX Series $1500 D. Wunch, HDS $4000 D. Wunch, HDS
CRAY Y-MP $4700 D. Mroz, Cray Research $7800 D. Mroz, Cray Research


**While Kendall Square and the CM-5 don't offer memory upgrades yet, even buying an entire processor from them with 32 megabytes of RAM is less expensive than buying just that much memory for a vector computer. For now, I've placed them pessimistically in the table rather than leave them out entirely.

Except for the catalog-type information shown, my procedure was to telephone each vendor and speak to a sales representative or a marketing executive. I asked, "What is the list price of a memory upgrade, and how many megabytes is that upgrade? Please use the most economical official number. Also, what is the lowest price you would be comfortable with seeing in print for education, government or other discounts that might apply to large purchases from special customers?" The latter I call "street price." In talking to actual customers I occasionally found prices lower than the listed street price, but they were usually tied to software development. So, please don't show this table to your local sales rep and insist that you get the same break.

It was interesting to learn how many computer makers regard their lowest memory price as a deep, dark secret. In what other marketplace are low prices cause for embarrassment? A few others were embarrassed at being too high, and at least one vendor on the list hurriedly lowered its official prices in response to the preparation of this article.

It's hard to escape the conclusion that vector computers have memory prices that are rather, um, high. Except for Convex, those machines use as main memory the parts that other vendors use for cache: costly static RAM. They have to in order to preserve the illusion of a monolithic memory serving a von Neumann bottleneck. Does it get proportionately higher bandwidth? Not judging from the C90 versus nCUBE and Intel comparison made earlier... the C90 has less bandwidth but uses memory that is 40 times as expensive. Does the Cray have better memory latency? Well, that depends. For an arbitrary memory reference, it looks slightly better. But with any pattern to the memory references, the caches and page mode fetches of the nCUBE and Intel win easily over that long vector conveyor belt.

It's also hard to escape the conclusion that memory for vector processors gets a heftier markup. Although I won't supply you with a picture, you can envision your own metaphor for what I call the "vector processor gravy train." To quote Webster's Third New International Dictionary:

 

"Gravy train"
A situation providing abnormal or excessive profits, advantages, or benefits to those occupying it, usually at the expense of some larger group.

The current suppliers of vector computers are doing exactly what every industry with obsolete products has done for decades: Cite lack of support for new-fangled ideas, breed fear about the impact of change and the huge potential loss of investment in existing methods, and hope the situation providing abnormal profits holds up. As people switch to MPP, the gravy train that brings a million dollars a megaword grinds to a halt.

 

System Balance: Memory Size to Match Speed

Amdahl recommended a balance of a megaword of RAM per megaflop [4], and while I don't always concur with Gene Amdahl, I have to admit to agreeing on this point. In 1992, a typical workstation might have 8 megawords of memory (8-byte words) and a speed of 8 megaflops. If you give users more than about a megaword per megaflop, they cry out for more speed; if you give them less, they bump up against the top of memory.

So what happens when you want to do some production supercomputing, and you try to balance your system? If you buy a C90, you are promised 16 gigaflops of peak performance. Put a proper amount of memory on that system and, at the memory prices in Table 1, you will pay on the order of a billion dollars. Of course, you can't do that. They don't offer 16 gigawords of memory on a Cray. You're expected to rewrite your program to use scratch I/O. It suddenly becomes clear how vector processors get high bandwidth: with too little memory! Any hardware designer can make memory fast if you don't have to have much of it. If an 8-megaflop workstation had the same system balance as a C90, it would have only one or two megabytes of memory. The difficulty of putting a large enough memory on a vector supercomputer goes back to the CRAY-1, which for all its power came with only 2, 4, or 8 megabytes of memory.
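To spell out that estimate, using the CRAY Y-MP list price from Table 1 as a stand-in for C90 memory pricing (an assumption, since I have no separate C90 figure):

\[
16 \text{ gigawords} \times 8 \text{ bytes/word} = 128 \text{ GB} \approx 131{,}000 \text{ MB};
\qquad
131{,}000 \text{ MB} \times \$7800/\text{MB} \approx \$1.0 \text{ billion}
\]

(or roughly $620 million at the $4700 "street" price).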

Contrast this with an 8192-processor nCUBE, whose 64-bit speed rating is almost exactly the same as the C90's: 16 gigaflops. Standard memory for that configuration is 4 gigawords or 16 gigawords. With 16 gigawords, the system has near-perfect balance for a price of under $30 million. Is it any wonder that people are "jumping on the bandwagon" of massive parallelism? That kind of cost savings will buy a lot of "flat tire" repairs, Jack.
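The same memory arithmetic, at the nCUBE "street" price from Table 1, shows where most of that figure comes from (the rest of the system price is not broken out here):

\[
131{,}000 \text{ MB} \times \$192/\text{MB} \approx \$25 \text{ million}.
\]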

Who will get to a teraflop first? A teraflop should have a teraword, with 64-bit words. To get that much memory for, say, $50 million would imply a cost of about $6 per megabyte, a factor of a thousand down from Cray's current prices. We'll get there, but the vendors that get to a sustained teraflop first will be those that can get honest high performance out of inexpensive RAM. Expect to see "dense matrix-matrix multiply" speed examples from early teraflop claimants with insufficient memory. For real performance on real applications, I'm betting on the distributed-memory computer companies that ask me to rewrite my programs.
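For the record, the arithmetic behind that $6 figure, taking a teraword as 10^12 64-bit words and a megabyte as 10^6 bytes (round-number assumptions):

\[
10^{12} \text{ words} \times 8 \text{ bytes/word} = 8 \times 10^{6} \text{ MB};
\qquad
\frac{\$50 \text{ million}}{8 \times 10^{6} \text{ MB}} \approx \$6/\text{MB};
\qquad
\frac{\$7800/\text{MB}}{\$6/\text{MB}} \approx 1300.
\]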

 

Why the Panic?

There's a note of panic in defenses of old-fashioned vector computing. Why? When Sandia National Laboratories announced two years ago that it would buy no more vector supercomputers, few noticed. When billions of dollars for the federal High Performance Computing and Communications Program were announced with a strong slant toward MPP, things started to look serious. Then, when Lawrence Livermore National Laboratory cancelled its order for a CRAY-3 from Cray Computer Corporation, leaving that company with no customers, it seemed that two decades of loyal LLNL support for the designs of Mr. Cray had ended. At the recent supercomputing conference in Paris, Cray Research chairman John Rollwagen admitted that the company had failed to sell a single vector computer to the U.S. national laboratories in 1991.

Is this a bandwagon? Or is it more like what has happened to the automobile market in the U.S.? Vendors in Detroit sneered at the possibility of imports making a major dent in their sales, convinced that economic inertia would keep their customers coming. As Japanese and German companies took away market share with superior cost-performance, Detroit resorted to emotional tactics, claiming the foreign cars were unsafe, took away jobs, or were impossible to repair, instead of improving its products to meet demand. It's sad to see a similar denial of market forces in large U.S. computer vendors.

Historians tell us the Industrial Revolution was really two revolutions: the discovery of engines that could turn chemical combustion into mechanical power, and the discovery of how to build fractional-horsepower motors. Before the notion of distributing smaller motors where needed, fantastic (and expensive) gear trains and pulleys were used to send motive power long distances. There were doubtless pundits who analyzed the situation and pronounced the centralized engine inherently cheaper, across the whole spectrum of applications, than many smaller engines that might occasionally go idle. Perhaps we are not far from the day when analyses such as Gregory's and Worlton's appear every bit as quaint.

 

References

 

[1] P. Gregory, "Will MPP Always Be Specialized?" Supercomputing Review, March 1992.
[2] J. Worlton, "The Massively Parallel Bandwagon," Supercomputing Review, July 1992.
[3] J. Gustafson, G. Montry, and R. Benner, "Development of Parallel Methods for a 1024-Processor Hypercube," SIAM Journal on Scientific and Statistical Computing, Vol. 9, No. 4, July 1988.
[4] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc., San Mateo, California, 1990.

*This work is supported by the Applied Mathematical Sciences Program of the Ames Laboratory-U.S. Department of Energy under contract number W-7405-ENG-82.

