Slalom Update: The Race Continues

By John Gustafson, Diane Rover, Stephen Elbert, and Michael Carter

Last November, we introduced in these pages a new kind of computer benchmark: a complete scientific problem that scales to the amount of computing power available and always runs in the same amount of time--one minute. SLALOM assigns no penalty for novelty in language or architecture, and it runs on computers as different as an Alliant, a MasPar, an nCUBE, and a Toshiba notebook PC.

Since that time, there have been several developments:

Most-Wanted List

We're still waiting to hear results for a few major players in supercomputing: Thinking Machines, Convex, Meiko, and Stardent haven't sent anything to us, nor have any of their customers. We'd also very much like numbers for the WaveTracer and Active Memory Technology computers. Our Single-Instruction, Multiple-Data (SIMD) version has been improved since the last Supercomputing Review article, so the groups working on those machines might want to check it out as a better starting point (see inset). The only IBM mainframe measurements are nonparallel and nonvector, so we expect big improvements in that machine's performance.

We're awaiting word from the Japanese computer makers: NEC, Hitachi Data Systems and Fujitsu. We also welcome entries from small systems. Anything over 137 FLOPS should be able to finish a small run in less than 60 seconds.

 

The Superlinear Speedup Effect

A marvelous thing happens with fixed-time benchmarking, not mentioned in our previous report. Many parallel machines more than double their MFLOPS rate when they double the number of processors used. The reason for this is illustrated by the figure below:

Figure: Profile vs. problem size vs. MFLOPS rate (courtesy Gretchen Vogel, Ames Lab)

Most scientific computers achieve their highest MFLOPS rate doing matrix operations, and a lower rate doing setup and miscellaneous tasks. Input/output might not score any MFLOPS at all! As the problem size grows, the matrix operations become a larger and larger fraction of the one-minute run, so the average speed per processor increases. One reason the MFLOPS entries in the table climb so nicely as the number of processors increases is that this effect compensates for some of the usual losses of parallel efficiency.
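A back-of-the-envelope sketch makes the effect concrete. The C program below uses made-up per-phase rates and operation counts (they are not measured SLALOM figures): setup work that grows roughly as n^2 at a low rate, and a matrix solve that grows as n^3 at a high rate. As n grows, the fast phase dominates the run time and the average rate climbs.

#include <stdio.h>

/* Illustrative only: hypothetical per-phase rates, not measured SLALOM data.
   Setup work grows roughly as n^2 and the matrix solve as n^3, so the
   high-MFLOPS solve dominates larger runs and the average rate climbs. */
int main(void)
{
    const double setup_rate = 2.0;    /* MFLOPS during setup (assumed) */
    const double solve_rate = 100.0;  /* MFLOPS during matrix ops (assumed) */

    for (int n = 256; n <= 4096; n *= 2) {
        double setup_ops = 50.0 * n * n;            /* ~O(n^2), assumed coefficient */
        double solve_ops = (2.0 / 3.0) * n * n * n; /* ~(2/3)n^3 for an LU-style solve */
        double seconds   = setup_ops / (setup_rate * 1e6)
                         + solve_ops / (solve_rate * 1e6);
        double avg_mflops = (setup_ops + solve_ops) / seconds / 1e6;
        printf("n = %4d  average rate = %6.1f MFLOPS\n", n, avg_mflops);
    }
    return 0;
}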

This is one reason one shouldn't use MFLOPS (or MIPS) to measure performance, of course. Right now we provide MFLOPS in the table to help people translate the performance to something for which they have a feel, but ultimately the only number that matters is the size of the problem solved, here measured by the number of patches.

 

Preserving Integrity

To answer one question we've been asked repeatedly, we will not accept money in return for optimizing SLALOM performance. But we will provide, for free, unbiased advice and suggestions to anyone trying to get the best possible performance out of their system.

We also do not allow use of these results in advertising without our prior consent. Too often, benchmark data has been "excerpted" in lists that conveniently eliminate certain competing machines or crucial footnotes that tell the full story. If you suspect that SLALOM information is being misused, please contact us immediately. SLALOM is designed to resist "benchmark rot," and we'll fight to preserve its integrity.

 

Improving the SLALOM Algorithm

A bit of folklore in this business goes something like this: "Half the improvements in computing speed come from hardware advances, and the other half come from algorithm advances." That is, if an application runs 10^8 times faster than it did 35 years ago, it's probably the result of 10^4 times faster hardware and 10^4 times faster algorithms. Starting in 1990, SLALOM began what may be a decades-long experiment to test this aphorism.

At Cray Research, Inc. and elsewhere, a technique known as Strassen multiplication was used in the matrix solver. The idea is that a 2 by 2 matrix multiplication can be done with 7 multiplications rather than the usual 2^3 = 8, and this can be applied recursively to n by n matrices to allow general matrix multiplication in order n^(log2 7) (about n^2.8) floating-point operations. More recent methods take the exponent down as low as 2.4, but they are only advantageous for very large n. So we now have "blocked" versions of SLALOM that use matrix-matrix multiplication as the kernel (similar to LAPACK), and you can adjust the block size and the method used to whatever works best. That was the first major improvement to the SLALOM algorithm.
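For readers who haven't seen it, here is the 2 by 2 case of Strassen's identity written out as a small, self-contained C program. It is only an illustration of the seven-multiplication trick, not the blocked SLALOM solver kernel; a real solver applies the same identity recursively to matrix blocks and switches back to ordinary multiplication below some crossover size.

#include <stdio.h>

/* One level of Strassen's identity for a 2x2 matrix product C = A*B.
   Seven multiplications (m1..m7) replace the usual eight; applied
   recursively to n-by-n blocks this gives O(n^(log2 7)) ~ O(n^2.8) work.
   Illustrative sketch only -- not the SLALOM solver kernel itself. */
static void strassen2x2(const double a[2][2], const double b[2][2],
                        double c[2][2])
{
    double m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double m2 = (a[1][0] + a[1][1]) *  b[0][0];
    double m3 =  a[0][0]            * (b[0][1] - b[1][1]);
    double m4 =  a[1][1]            * (b[1][0] - b[0][0]);
    double m5 = (a[0][0] + a[0][1]) *  b[1][1];
    double m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}

int main(void)
{
    double a[2][2] = {{1, 2}, {3, 4}};
    double b[2][2] = {{5, 6}, {7, 8}};
    double c[2][2];

    strassen2x2(a, b, c);   /* expected result: 19 22 / 43 50 */
    printf("%g %g\n%g %g\n", c[0][0], c[0][1], c[1][0], c[1][1]);
    return 0;
}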

James B. Shearer, of the IBM T.J. Watson Research Center, has provided a number of ideas for SLALOM, especially in the "SetUp" routines. He discovered a way to reduce the number of calls to the logarithm function, and found ways to reuse some quantities without loss of generality. For very large numbers of patches, he points out, rounding error in our coupling function grows to the point that mathematically approximate formulas become computationally more accurate than the exact one. Fortunately, that will offer only a slight improvement, because setup is only a small part of the one-minute run for such large numbers of patches.

Shearer also suggests that iterative methods for solving the matrix will beat our direct method, at least for the suggested input data. He may be right, but we reserve the right to evaluate systems based on their worst-case-input behavior. If all the walls in the radiosity problem are highly reflective, iterative methods converge very slowly. But we invite the experiment.
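To see why the worst case worries us, consider the following toy C program. It runs a plain Jacobi-style iteration on a radiosity-like system (I - rho*F)x = e with a made-up, uniform form-factor matrix; none of this is the actual SLALOM coefficient matrix or Shearer's proposed solver. It simply shows that the iteration count blows up as the reflectivity rho approaches 1.

#include <math.h>
#include <stdio.h>

#define N 4

/* Toy Jacobi iteration for (I - rho*F)x = e, where F is a uniform
   form-factor matrix (zero diagonal, rows summing to 1).  Hypothetical
   example only: it illustrates that convergence slows sharply as the
   reflectivity rho approaches 1, i.e. when all walls are highly reflective. */
static int jacobi(double rho, double x[N])
{
    double e[N], xnew[N];
    int iters = 0;

    for (int i = 0; i < N; i++) { e[i] = 1.0; x[i] = 0.0; }

    for (;;) {
        double diff = 0.0;
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                if (j != i) sum += x[j] / (N - 1);  /* uniform form factors */
            xnew[i] = e[i] + rho * sum;
            diff = fmax(diff, fabs(xnew[i] - x[i]));
        }
        for (int i = 0; i < N; i++) x[i] = xnew[i];
        iters++;
        if (diff < 1e-10 || iters > 100000) break;
    }
    return iters;
}

int main(void)
{
    double x[N];
    printf("rho = 0.50: %d iterations\n", jacobi(0.50, x));
    printf("rho = 0.99: %d iterations\n", jacobi(0.99, x));
    return 0;
}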

These algorithm improvements are another reason not to use MFLOPS to compare performance. They can allow larger problems to be solved without any increase in MFLOPS, and counting operations after the fact is very difficult once such clever optimizations are in place. For now, we will continue to estimate the minimal operation count for the best serial version of the program, which should be accurate to within a few percent in most cases.

SLALOM will absorb these changes and others that arise. An upper bound on advances occurs if finer problem decomposition yields no improvement in answer accuracy for some problem size that can be solved in less than a minute on some machine. That would mean the end of the race for this first SLALOM experiment, and we would seek different scalable problems on which to base a fixed-time performance comparison. But for now, the SLALOM benchmark looks like it may last for a while.

The SLALOM Benchmark Report
The following list ranks computers that are actively marketed.
Computer, environment | Processors | Patches | Measurer | Date
Cray Y-MP/8, 167 MHz, Fortran+tuned LAPACK solver (Strassen) | 8 | 5120 | J. Brooks (v), Cray Research | 9/21/90
Cray Y-MP/4, 167 MHz, Fortran+tuned LAPACK solver (Strassen) | 4 | 4096 | J. Brooks (v), Cray Research | 9/21/90
nCUBE 2, 20 MHz, Fortran+assembler | 1024 | 3720 | J. Gustafson, Ames Lab | 1/11/91
Cray Y-MP/2, 167 MHz, Fortran+tuned LAPACK solver (Strassen) | 2 | 3200 | J. Brooks (v), Cray Research | 9/21/90
Cray Y-MP/1, 167 MHz, Fortran+tuned LAPACK solver (Strassen) | 1 | 2560 | J. Brooks (v), Cray Research | 9/21/90
nCUBE 2, 20 MHz, Fortran+assembler | 256 | 2493 | J. Gustafson, Ames Lab | 1/11/91
Cray-2S/8, 244 MHz, Fortran+directives, FPP 3.00Z25 | 8 | 2443 | S. Elbert, Ames Lab | 9/8/90
Intel iPSC/860, 40 MHz, Fortran+assembler BLAS (pgf77 -O3 NOIEEE) | 64 | 2167 | T. Dunigan, ORNL | 1/7/91
MasPar MP-1, 12.5 MHz, C with plural variables (mpl) | 16384 | 2047 | J. Brown (v), MasPar | 11/20/90
Intel iPSC/860, 40 MHz, Fortran (-O3 -Knoieee) | 32 | 1920 | E. Kushner (v), Intel | 1/25/91
MasPar MP-1, 12.5 MHz, C with plural variables (mpl) | 8192 | 1791 | M. Carter, Ames Lab | 1/15/91
Alliant FX/2800, Fortran (-Ogc, KAI Lib's) | 14 | 1736 | J. Perry (v), Alliant | 1/24/91
Intel iPSC/860, 40 MHz, Fortran (-O3 -Knoieee) | 16 | 1671 | E. Kushner (v), Intel | 1/25/91
nCUBE 2, 20 MHz, Fortran+assembler | 64 | 1623 | J. Gustafson, Ames Lab | 4/8/91
nCUBE 2, 20 MHz, Fortran+assembler | 64 | 1598 | J. Gustafson, Ames Lab | 1/11/91
Alliant FX/2800, Fortran (-Ogc, KAI Lib's) | 8 | 1502 | J. Perry (v), Alliant | 1/24/91
MasPar MP-1, 12.5 MHz, C with plural variables (mpl) | 4096 | 1470 | M. Carter, Ames Lab | 1/14/91
Intel iPSC/860, 40 MHz, Fortran+assembler BLAS (pgf77 -O3 NOIEEE) | 16 | 1404 | T. Dunigan, ORNL | 1/7/91
Intel iPSC/860, 40 MHz, Fortran (-O3 -Knoieee) | 8 | 1392 | E. Kushner (v), Intel | 1/25/91
Silicon Graphics 4D/380S, 33 MHz, Fortran+block solver (-O2 -mp) | 8 | 1308 | O. Schreiber (v), Silicon Graphics | 1/28/91
IBM RS/6000 540, 30 MHz, Fortran+ESSL calls (XLF V2 prerelease, -O) | 1 | 1304 | J. Shearer (v), IBM | 1/8/91
FPS M511EA, 33 MHz, Fortran+LAPACK calls (f77 -Oc vec+ -Oc inl+) | 1 | 1197 | B. Whitney (v), FPS Computing | 1/24/91
Alliant FX/2800, Fortran (-Ogc DAS, KAI Lib's) | 4 | 1139 | J. Chmura (v), Alliant | 12/7/90
MasPar MP-1, 12.5 MHz, C with plural variables (mpl) | 2048 | 1119 | M. Carter, Ames Lab | 1/15/91
Intel iPSC/860, 40 MHz, Fortran (-O3 -Knoieee) | 4 | 1103 | E. Kushner (v), Intel | 1/25/91
IBM RS/6000 520, 20 MHz, Fortran+ESSL calls (XLF V2 prerelease, -O) | 1 | 1091 | J. Shearer (v), IBM | 1/9/91
Silicon Graphics 4D/380S, 33 MHz, Fortran+block solver (-O2 -mp) | 4 | 1065 | O. Schreiber (v), Silicon Graphics | 1/28/91
nCUBE 2, 20 MHz, Fortran+assembler | 16 | 994 | J. Gustafson, Ames Lab | 1/11/91
MasPar MP-1, 12.5 MHz, C with plural variables (mpl) | 1024 | 927 | J. Brown (v), MasPar | 10/5/90
Intel iPSC/860, 40 MHz, Fortran+assembler BLAS (pgf77 -O3 NOIEEE) | 4 | 905 | T. Dunigan, ORNL | 1/7/91
IBM RS/6000 320, 20 MHz, Fortran+block solver (-O -lblas, some -qopt=3) | 1 | 895 | S. Elbert, Ames Lab | 1/30/91
Silicon Graphics 4D/380S, 33 MHz, Fortran+block solver (-O2 -mp) | 2 | 834 | S. Elbert, Ames Lab | 1/30/91
SKYbolt, 40 MHz i860/i960, C+assembler dot product (-O sched vec UNROLL) | 1 | 831 | C. Boozer (v), SKY Computers | 1/9/91
SKYstation, 40 MHz i860/i960, C (-O sched vec UNROLL) | 1 | 793 | C. Boozer (v), SKY Computers | 1/29/91
Silicon Graphics 4D/35, 37 MHz, Fortran (-O2 -mp) | 1 | 717 | O. Schreiber (v), Silicon Graphics | 1/29/91
Alliant FX/2800, Fortran (-Ogu, KAI Lib's) | 1 | 693 | J. Chmura (v), Alliant | 12/7/90
Silicon Graphics 4D/380S, 33 MHz, Fortran+block solver (-O2) | 1 | 676 | S. Elbert, Ames Lab | 1/30/91
IBM 3090-200VF, Fortran (Fortvs2n), unvectorized | 1 | 657 | R. Hollebeek, U. Penn | 11/29/90
Intel iPSC/860, 40 MHz, Fortran (-O3 -Knoieee) | 1 | 647 | E. Kushner (v), Intel | 1/25/91
FPS-500 (33 MHz MIPS + vector unit), Fortran (FPS F77 4.3, -Oc vec) | 1 | 619 | P. Hinker, LANL | 11/12/90
nCUBE 2, 20 MHz, Fortran+assembler | 4 | 596 | J. Gustafson, Ames Lab | 1/11/91
DECstation 5000, 25 MHz, Fortran+block solver (-O2) | 1 | 534 | S. Elbert, Ames Lab | 1/30/91
Silicon Graphics 4D/25, 20 MHz, Fortran+block solver (f77 -O2) | 1 | 507 | S. Elbert, Ames Lab | 1/30/91
DECstation 5000, 25 MHz, Pascal (-O2) | 1 | 432 | D. Rover, Ames Lab | 1/31/91
Sun 4/370, 25 MHz, C (ucc -O4 -dalign etc.) | 1 | 419 | M. Carter, Ames Lab | 10/8/90
DECstation 3100, 16.7 MHz, Fortran+block solver (-O2) | 1 | 418 | S. Elbert, Ames Lab | 1/30/91
Silicon Graphics 4D/20, 12.5 MHz, Fortran+block solver (f77 -O2) | 1 | 401 | S. Elbert, Ames Lab | 1/30/91
DECstation 2100, 12.5 MHz, Fortran+block solver (-O2) | 1 | 377 | S. Elbert, Ames Lab | 1/30/91
nCUBE 2, 20 MHz, Fortran+assembler subroutines (-O2) | 1 | 354 | J. Gustafson, Ames Lab | 8/13/90
DECstation 2100, 12.5 MHz, C (cc -O3) | 1 | 340 | M. Carter, Ames Lab | 1/24/91
Motorola MVME181, 20 MHz, Fortran (OASYS F77 1.8.5) | 1 | 289 | R. Blech, NASA | 10/17/90
Sequent Symmetry, 33 MHz, C (cc -O -fpa) | 1 | 253 | M. Carter, Ames Lab | 1/3/91
VAXstation 3520, C (cc -O) | 1 | 181 | M. Carter, Ames Lab | 1/24/91
Cogent XTM (T800 Transputer), Fortran 77 (-O -u) | 1 | 149 | C. Vollum (v), Cogent Research | 6/11/90
Toshiba 1000, 6 MHz 8088, C (Turbo C, with reg/jump option) | 1 | 12 | P. Hinker, LANL | 11/14/90

A "(v)" after the name of the person who made the measurement indicates a vendor. Vendors frequently have access to compilers, libraries, and other tools that make their performance higher than that achievable by a customer.

The CRAY Y-MP runs failed the old SetUp3 tolerance of 5.E-10, but passed with a tolerance of 5.E-8; special LOG and ATAN functions were used with only 11-decimal precision for higher speed. We will enforce the current tolerance uniformly in our next report.

The CRAY 2 figures are low because blocking methods were not used; future runs will use matrix-matrix multiply as the kernel, as was done for the Y-MP.

Contact: John Gustafson john.gustafson@sun.com