Many of us in the field of high-speed scientific computing recognize that it is often quite difficult to defend our continued use of traditional vector supercomputers in light of the new highly parallel systems available. This article outlines twelve ways commonly used by the "old guard" to artificially boost apparent serial performance and present it in the "best possible light" compared to these annoying new contenders.
Many of us in the field of high-speed scientific computing recognize that it is often quite difficult to match the run time performance of highly parallel computer systems. It is increasingly a challenge to defend our use of enormous, expensive, archaic computers when these newer systems often run circles around them. But since lay persons and funding agencies do not appreciate the intense pain associated with changing one's thinking and forsaking years of accumulated vector programming tricks, it is often necessary to adopt some tried-and-tested techniques that can be used to convince your audience that "parallel processing is the wave of the future, and always will be." Here are some of the most effective methods, as observed from recent reports from our most generously-funded federal laboratories and their computer vendors.
1. Always use 64-bit arithmetic, even when completely unnecessary.
We all know that it is hard to obtain impressive performance results using 64-bit floating point without a full 64-bit vector architecture, so make sure you use 64-bit arithmetic everywhere. . . and that the parallel computer has to also. Use 64-bit graphic pixels, 64-bit input data from A/D devices, and 64-bit variables even when the numerical method is only good to four decimals. When you need a logical "TRUE" or "FALSE," use 64-bit arithmetic. Having 64 bits all "on" or all "off" will help ensure that you get that Boolean right. That way, when you run the same program on a collection of measly microprocessors, you can rest assured that they are wasting just as much time and storage as you are. After all, you have no choice but to use 64-bit words for everything, so why should anyone else?
2. Present performance figures based on the computer's own timer, and then represent these figures as the performance seen by the user.
It is quite difficult to obtain high performance on a complete large-scale scientific application, if the timing includes Internet connections to mainframe intermediaries, Byzantine data center management, security measures, and data files everywhere but where you need them. Do not count that time. Most importantly, do not count the 8 hours waiting in line for the vector CPU, when the CPU part of the run only takes 5 seconds. Put a call to the system timer near the front of your program, but after the files have been read in. Put another one near the end, but before you close files (since closing them might expose the fakery of buffers held in memory as "output"). When you time the parallel machine, you can include the time to load the processors and express horror at how long it takes to get in and out of these big parallel things.
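For anyone who wants to see the size of the gap, here is a minimal Python sketch comparing the time the program's own timer reports with the time the user actually waits. The function name and the simple sum of queue, I/O, and CPU time are illustrative assumptions; only the 8-hour wait and the 5-second CPU run come from the paragraph above.

```python
def reported_vs_experienced(queue_wait_s, cpu_s, io_s=0.0):
    """Return (reported, experienced, ratio) in seconds.

    reported    -- what the in-program timer sees (CPU part only)
    experienced -- what the user waits through (queue + I/O + CPU)
    ratio       -- how many times longer the user waits than the timer admits
    """
    experienced = queue_wait_s + io_s + cpu_s
    return cpu_s, experienced, experienced / cpu_s


# Figures from the text: 8 hours in the queue, 5 seconds of CPU time.
# File I/O time is left at zero because the text gives no number for it.
reported, experienced, ratio = reported_vs_experienced(8 * 3600, 5)
print(f"timer reports {reported} s, user waits {experienced} s ({ratio:.0f}x)")
```

Under these assumptions the user waits 28,805 seconds for a run the timer reports as 5 seconds, a factor of 5,761.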
3. Quietly write your program as a series of compiler-helping loops.
After many years, the beloved vector computers have learned to recognize simple loops and to drop in highly-tuned assembly language for them. So do not hesitate to decimate your program into easily-recognized vector algebra constructs. That way, the microprocessors in the parallel systems will go through loop overhead for every little multiply or add. It is also a good bet that no one will notice if your problem sizes are always multiples of your vector register length. No reason to do a loop of length 65, when 128 will do even better. The parallel processors are often oblivious to such hazards, so it is best not to mention all the tricks you use to help the vector compiler out. Remember, it is the parallel computer that is hard to program efficiently, not your system with massively-interleaved multibank pipelined memory. It might alarm the audience to learn that you used two pages of Fortran just to write the loop for X(I) = A * X(I) + Y(I) so as to obtain respectable performance.
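The remark about loop lengths of 65 versus 128 can be checked with a little arithmetic. The sketch below assumes a vector register length of 64 elements (the text does not state one) and a simple model in which a loop is processed in full-register strips; both the function name and the model are hypothetical.

```python
import math


def strip_utilization(n, vector_length=64):
    """Return (strips, utilization) for a loop of n iterations.

    strips      -- number of vector-register loads needed to cover the loop
    utilization -- fraction of register slots doing useful work
    vector_length=64 is an assumption, not a figure from the article.
    """
    strips = math.ceil(n / vector_length)
    return strips, n / (strips * vector_length)


for n in (64, 65, 128):
    strips, util = strip_utilization(n)
    print(f"n={n:4d}: {strips} strip(s), {util:.1%} of register slots used")
```

In this model a loop of length 65 needs the same two strips as a loop of length 128 but keeps only about half the register slots busy, which is why problem sizes that happen to be multiples of the register length flatter the vector machine.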
4. Use a fixed problem size on the parallel system, no matter how many processors are used.
Graphs of performance rates versus the number of processors have a wonderful habit of trailing off for a fixed problem size. Even when you go to hundreds of processors, solve the exact same problem that you do on a single processor of the ensemble, no matter how absurd that seems. That way, you can show that the communication costs and serial bottlenecks are just overwhelming. Shake your head sadly and mention Amdahl's law. Besides, if you make the mistake of scaling the problem to match the power of the parallel system, you will be solving problems that will not fit on your vector supercomputer. That can make your vector supercomputer look bad, so always use your vector computer to define what a "reasonable" problem size is.
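A short Python sketch shows why the fixed-size graph trails off and why scaling the problem changes the picture. It compares Amdahl's fixed-size speedup with the scaled-speedup formula usually attributed to Gustafson. The 1% serial fraction and the processor counts are assumptions chosen for illustration, and the two formulas define the serial fraction slightly differently (Amdahl's is measured on the serial run, the scaled version on the parallel run), so treat the comparison as qualitative.

```python
def amdahl_speedup(serial_fraction, processors):
    """Fixed problem size: the serial part never shrinks."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def scaled_speedup(serial_fraction, processors):
    """Problem grows with the machine: the parallel part scales with processors."""
    return serial_fraction + (1.0 - serial_fraction) * processors


SERIAL_FRACTION = 0.01  # assumed value, not taken from the article

for n in (1, 16, 128, 1024):
    fixed = amdahl_speedup(SERIAL_FRACTION, n)
    scaled = scaled_speedup(SERIAL_FRACTION, n)
    print(f"{n:5d} processors: fixed-size {fixed:8.2f}x, scaled {scaled:8.2f}x")
```

With these assumed numbers, the fixed-size speedup on 1024 processors stays below 100, while the scaled speedup exceeds 1000. Insisting on the fixed-size curve is what keeps the head-shaking credible.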
5. Quote performance results for much less expensive parallel systems.
One can always justify purchase of a $25 million vector computer, along with a huge staff of experts and all the mass storage and networking one needs, but spending more than about $2 million in real money on a parallel computer is frightening and rare. Fortunately, such large parallel systems usually only have a few times the speed and storage of each CPU of the vector computer. Explain that there are eight CPUs in your vector computer, allowing independent jobs at eight times the throughput. If anyone suggests similarly scaling the parallel computer by a factor of eight, remind them that performance does not scale linearly. It would be prudent to quickly throw in a humorous anecdote, like the one where Seymour Cray says "parity is for farmers," so that your audience doesn't notice that you have contradicted yourself.
6. Compare your results against heavily optimized code on Crays.
When you measure performance on the Cray, pull out all the stops. Unroll the loops until your program listing looks like a Persian rug. Carefully adjust every memory stride to be a power of two, plus one. Add vectorization directives, multitasking directives, microtasking directives, and autotasking directives. Since these things all hide in "comment lines," you can wink and say everything is pure Fortran, no extensions. Cray-specific library calls and compiler switches that "in-line" all the subroutines are good, too. Turn every IF statement into a conditional vector merge that will have those wimp parallel programmers racing for the manual set. For the parallel computer, declare message-passing calls and parallel variables to be "nonstandard Fortran." If anyone dares to use an optimized library routine for the parallel system, protest that they're using assembly language. You have spent a major part of your lifetime learning all the tricks of vector supercomputing, and there is simply no reason to let that knowledge become obsolete now.
In benchmarking, use the optimized version on the vector computer but publicly issue a very inefficient method as the definition of the program. For example, you can privately do random number generation with three 46-bit integer instructions, but publish the random number generator as a time-waster with 19 floating point operations. For finite difference operations, keep all the partial sums in memory in the vector version. In the official version of the benchmark, you can recompute every sum and double the work "to save memory."
7. When direct problem size comparisons are required, switch to execution time comparisons instead.
Direct problem size comparisons can be quite embarrassing, especially if your vector supercomputer has significantly less memory than the parallel computers. If someone wants to know how big a 3-D FFT you can compute in one minute, instead cite the number of microseconds for a 1024-point 1-D FFT. Then ask how long it takes the massively-parallel system to do a 1024-point 1-D FFT. You can use 100 by 100 LINPACK the same way. Make sure CPU time reduction is the goal, not the solution of larger problems. Otherwise, trusty old Amdahl's law does not work very well.
8. Quote MFLOPS rates, and base these on the Hardware Performance Monitor.
We know MFLOPS are hard to define, so use this to your advantage. For the parallel system, carefully assess the operations by using an analytical count. For the Cray, use the Hardware Performance Monitor instead. This will let you count absolute values, negations, comparisons, additions of zero, and multiplications by one as 64-bit floating point operations. It also lets you do things like count a square root as 12 operations, or a logarithm as 25. These will greatly increase the apparent MFLOPS and make your vector code look like a real winner. One of the most irritating things that can happen is when a parallel computer manages to do a job using (gasp) integer operations. Such runs can be immediately dismissed out of hand as not being Real Supercomputing. Only Computer Scientists use integers. If you must compare integer performance, at least use 46-bit integers, or whatever just barely fits your hardware but not theirs.
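The effect of the two counting conventions is easy to tabulate. In the Python sketch below, the weights of 12 for a square root and 25 for a logarithm come from the paragraph above; the operation mix, the one-second run time, and the convention that an analytical count charges one operation per call are all assumptions made up for illustration.

```python
def mflops(op_counts, weights, seconds):
    """Millions of 'floating point operations' per second under a given weighting."""
    total = sum(count * weights[op] for op, count in op_counts.items())
    return total / seconds / 1e6


# Hypothetical kernel: the operation mix and run time are invented.
op_counts = {"add": 1_000_000, "mul": 1_000_000, "sqrt": 500_000, "log": 500_000}
seconds = 1.0

analytic_weights = {"add": 1, "mul": 1, "sqrt": 1, "log": 1}    # assumed convention
monitor_weights = {"add": 1, "mul": 1, "sqrt": 12, "log": 25}   # weights from the text

analytic = mflops(op_counts, analytic_weights, seconds)
monitor = mflops(op_counts, monitor_weights, seconds)
print(f"analytical count: {analytic} MFLOPS, monitor-style count: {monitor} MFLOPS")
```

The same run on the same hardware comes out at 3.0 MFLOPS by the analytical count and 20.5 MFLOPS by the monitor-style count, without a single instruction executing any faster.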
9. Make MFLOPS rather than the application the ultimate goal, and ignore cost-performance.
As mentioned above, the problem size and applications possible using parallel systems are often not favorable to conventional supercomputers. After all, your run on the $25 million vector computer might have been hampered by someone using it to "vi" the Message Of The Day. Thus whenever possible, use other performance measures. The best is MFLOPS, since lay people think of it like Miles Per Hour, where more is better. Thus, a race car that transports pails of sand at 200 MPH is clearly better than a dump truck that moves a ton at 40 MPH and only takes one trip. Also, watch out for measures like MFLOPS per dollar. Although it is hard to believe, there are those who feel there should be a limit to the amount of money spent for computer systems. They simply do not understand the importance of what you do. They think you should take the time to rewrite your program for a paltry savings of tenfold or a hundredfold. As scientists, you can hardly sully yourselves with these "bean-counting" exercises. You should simply be supplied all the tools you need to do your work. If the $25 million vector supercomputer is too busy, buy another one. And another. And another.
10. Require the parallel system to run the serial algorithm, without changes.
It's better to specify an algorithm than a physical problem to be solved. That way, you can impose all kinds of communication costs and data dependencies that cripple the parallel systems. For instance, forbid iterative solvers and require direct LU factorization with partial pivoting. . . even when the iterative solver runs faster on both types of computer. The LU factorization might be 1,000 times slower than an iterative method on your vector supercomputer, but do not worry. On the parallel computer, it will be 10,000 times slower. Be sure to claim the high moral ground here, noting that computers must run all algorithms at high MFLOPS rates to be considered "general purpose."
11. Measure the vector supercomputers in a standalone environment, or just use CPU accounting charges.
As we all know, the U.S. Government makes sure no one actually pays for Cray usage. That would inhibit research. It all comes from taxpayers, and as a result every Cray in the world is jam-packed with applications running for "free." Even the program conversion is free, since someone else did it. Since parallel systems usually require you to do the conversion, the expense is outrageous. Unfortunately, parallel computers often have the advantage of being run in a dedicated environment while yours compete with every yahoo with a network connection and a federal grant. So measure the vector supercomputer in a standalone environment, or just use the CPU time reported by the accounting system. It is essential that you make the congestion of your traditional supercomputer a positive point, not a negative one. Do not admit that the turnaround time on your supercomputer is no different from that of a workstation. Instead say, "Because the supercomputer is general purpose, it has a community of hundreds of users." If anyone in the audience asks how many users would pay the dollar a second out of their own pocket to use the vector supercomputer, change the subject.
12. If all else fails, tell Paul Bunyan stories about Cray.
It sometimes happens that the audience starts to ask all sorts of embarrassing questions. These people simply have no respect for the authorities of our field ... authorities who still remember how to backspace a card punch or toggle in an operating system from front panel switches. If you are so unfortunate as to be the object of such disrespect, there is always a way out. Simply conclude your technical presentation with some great stories about Seymour and his computers. Talk about the modest, unassuming, gentle recluse who burns his sailboat every year. Show pictures of garish-colored loveseats or bubbling coolant. Audiences love the idea of a five-ton refrigerated wire nest that uses as much electrical power as a hundred American homes, especially if your audience consists mostly of physicists. Remind them that Seymour Cray is the only hope we have of holding the line against the Japanese threat of technological superiority. This material often helps deflect attention from the substantive technical issues, and assures continued generous state and federal funding.
The author wishes to thank D. Bailey of "the other Ames" for unintentionally inspiring this tirade.