Compute-Intensive Processors

**Chapter 5**

John L. Gustafson

Sandia National Laboratories

**Contents**

5.1 Introduction
5.2 Applications of Array Processors
5.2.1 Seismic Data Processing
5.2.2 Medical Imaging
5.3 Applications of Attached Scientific Processors
5.3.1 Structural Analysis
5.3.2 Analog Circuit Simulation
5.3.3 Computational Chemistry
5.3.4 Electromagnetic Modeling
5.3.5 Seismic Simulation
5.3.6 AI Applications
5.4 Design Philosophy
5.5 Architectural Techniques
5.5.1 Functional Units
5.5.2 Processor Control
5.5.3 Parallelism and Fortran
5.5.4 Main Memory
5.5.5 Registers
5.5.6 Vector/Scalar Balance
5.5.7 Precision
5.5.8 Architecture Summary
5.6 Appropriate Algorithms
5.6.1 Dense Linear Algebra
5.6.2 An Extreme Example: The Matrix Algebra Accelerator
5.6.3 Fourier Transforms
5.6.4 Finite Difference Methods
5.7 Metrics of Compute-Intensive Performance
5.8 Addendum: Hypercube Computers
5.9 Concluding Comments

The algorithms appropriate for such processors require special attention, since communication cost is high relative to computation cost, the opposite of most general-purpose computers. The sections that follow discuss the primary applications, design philosophy, architectural techniques, and appropriate algorithms of compute-intensive processors. There is also some discussion of the performance metrics associated with such processors. As an addendum, we apply the issues of compute/communicate ratios to hypercube-based multicomputers.

A large part of one stage of the problem involves two-dimensional FFT’s where one dimension is time and the other is the spatial dimension along which the detectors lie. Since it is representative, consider the problem from a system perspective: suppose there are 128 detectors and 1024 discrete time measurements in a given time window. If the detectors produce 16-bit data, then the input data size is 256 KBytes for that time window, typically stored on magnetic tape. The operation of a forward FFT in the time dimension, then in the space dimension, applying a filtering operator, then doing inverse FFT’s in both dimensions, requires approximately 23 million floating-point operations. This represents 44 operations for every word of input data, a rather high ratio of computation to communication.
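Under the common 5*N* log2 *N* flop model for a radix-2 complex FFT (an approximation, not a measurement of any particular system), the 23-million-operation figure can be reproduced directly:

```python
import math

def fft_flops(n):
    """Approximate flop count for a length-n complex radix-2 FFT: 5 n log2(n)."""
    return 5 * n * math.log2(n)

detectors, samples = 128, 1024

# Forward 2-D FFT: one FFT per detector trace (time axis),
# then one FFT per time slice (space axis).
forward = detectors * fft_flops(samples) + samples * fft_flops(detectors)
inverse = forward                        # the inverse transform costs the same
filtering = 6 * detectors * samples      # one complex multiply (6 flops) per point

total = forward + inverse + filtering
print(round(total / 1e6, 1))             # 23.1 (millions of operations)
```

The same model scales the estimate to other detector counts or window lengths.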

The difference between using a compute-intensive processor and a conventional processor can mean the difference between waiting ten seconds and waiting half an hour for the brain scan of a person who has experienced a severe blow to the head. CT scanners soared in popularity once the time for computation was economically brought close to the time required for the scan itself, using compute-intensive processors.

At the time of writing, the seismic and medical imaging applications are perhaps the most important applications of single-precision compute-intensive processors. Both applications are dominated by the Floating Point Systems AP-120B and its architectural descendants, and by the Star Technologies ST-100.

Kenneth Wilson, a physicist at Cornell University, saw the AP-120B architecture as nearly ideal for his work on the Renormalization Group except for the lack of a Fortran compiler to ease the burden of explicitly programming each functional unit. A group at Cornell, led by D. Bergmark, produced a Fortran 66 subset compiler for the AP-120B that proved that high-level compilers for “horizontally microcoded” computers were both possible and practical.

At about the same time, R. C. Young, a mechanical engineer at General Atomic, saw the AP-120B architecture as potentially useful for the

Both Kenneth Wilson’s and R. C. Young’s efforts influenced the next generation of array processors, typified by the FPS-164. The FPS-164 had full 64-bit precision words, more registers, a somewhat more complete crossbar between functional units to assist Fortran compilation, and greatly expanded memories throughout the system. Since it ran entire Fortran programs instead of subroutines, its environment resembled that of the CRAY-1 more than that of the AP-120B, and the term “attached processor” was used to denote this distinction from “array processors.” The applications that follow are appropriate for *attached 64-bit processors*.

**5.3.1 Structural Analysis**

Structural analysis is used to predict the behavior of objects ranging from tiny machine parts to bridges and large buildings. Its numerical methods are mature, and are not confined to stress analysis but can also calculate heat flow or electromagnetic fields to very high accuracy. It relies on the “Finite Element Method,” whereby the structure is broken into geometrically simple domains called “elements” for which an exact analytical solution exists. The global solution is found by mating element boundaries together in a consistent way, which amounts to the construction and solution of a large system of equations called the “global stiffness matrix.” For sufficiently large models, factoring the global stiffness matrix dominates run time.

With the exception of the matrix setup and factoring, structural analysis is very much a problem for

To illustrate the magnitude of the matrix solution step, suppose one wishes to model a cubical elastic body divided into cubical elements, 100 on a side. Each cube corner has three translational and three rotational degrees of freedom, so there are roughly 6 x 10^{6} degrees of freedom in the model.
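A rough sketch of the magnitude implied by this example, assuming roughly 6 x 10^{6} degrees of freedom and a banded global stiffness matrix (the bandwidth figure below, one mesh plane of unknowns, is an illustrative assumption):

```python
n_side = 100                           # elements along each edge of the cube
corners = (n_side + 1) ** 3            # grid points (element corners)
dof = 6 * corners                      # 3 translational + 3 rotational DOF each

# Assumed half-bandwidth: a node couples only to nodes in adjacent mesh
# planes, so roughly one plane's worth of unknowns.
bandwidth = 6 * (n_side + 1) ** 2

dense_flops = (2 / 3) * dof ** 3       # cost if the band structure is ignored
banded_flops = 2 * dof * bandwidth ** 2  # standard banded-factorization estimate

print(f"{dof:.2e} equations")          # ~6.18e+06
```

Even the banded estimate runs to the order of 10^{16} operations, which is why the factoring step dominates.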

SPICE does its own memory management from within Fortran, dimensioning all of available memory as a single array, and using integers as pointers to allocate and de-allocate regions of memory at run time. The dynamic sizing of arrays that this permits is essential for circuit simulation, since it is difficult to estimate memory requirements before the run begins. This memory management from Fortran severely burdens the memory bus of a computer, and speed at SPICE is all too frequently a function of the MBytes/second rating of the memory and the

The elements of the circuit have current-voltage properties that require substantial calculation; a transistor element evaluation might involve several pages of source code containing exponentials, tests, branches, polynomial evaluations, and square roots. Machines with data caches such as the VAX perform relatively well on SPICE, since modeling each element means bringing in a small amount of memory and using just that memory intensively. The original version of SPICE makes no attempt to group devices by type so as to permit vectorized evaluation; vectorized versions for compute-intensive processors such as CRAY and the FPS-264 group the devices, use conditional merges to replace branching, and work very well on large-scale circuits where there is a great deal of repetition of device type.
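The conditional-merge idea can be shown in miniature. The sketch below uses a made-up two-branch device equation; production codes do this with hardware vector merge (select) instructions rather than Python lists:

```python
def branchy(voltages):
    """Per-element branching, in the style of the original scalar SPICE."""
    out = []
    for v in voltages:
        if v > 0.0:
            out.append(v * 2.0)   # hypothetical "forward" device equation
        else:
            out.append(-v)        # hypothetical "reverse" device equation
    return out

def merged(voltages):
    """Branch-free: evaluate both arms for every element, merge under a mask."""
    forward = [v * 2.0 for v in voltages]
    reverse = [-v for v in voltages]
    mask = [v > 0.0 for v in voltages]
    return [f if m else r for m, f, r in zip(mask, forward, reverse)]
```

The merged form does extra arithmetic, but every loop becomes a fixed-length vector operation with no data-dependent branches.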

Analog circuit simulation spends most of its time alternating between the setup of a large, very sparse asymmetrical matrix, and solving that matrix. An element of the matrix

On a general-purpose processor, the division of time between matrix factoring and matrix setup (device property evaluation) is roughly half and half. On a compute-intensive processor, the division is more like 20% for the factoring and 80% for the setup! It is worth noting that a vectorized version of SPICE for the CRAY-1 only achieves about 100 times the speed of the VAX-11/780, despite the much higher ratio (about 1000:1) in peak 64-bit arithmetic speed. Analog circuit simulation, even when substantially reorganized from the Berkeley version, is memory-intensive and best handled by compute-intensive processors that have a very short path to main memory.

The process is well-suited to compute-intensive processors, with one exception to the paradigm: the primary feature required, other than 64-bit floating point operations, is the ability to move data quickly to and from a large secondary storage. Some compute-intensive processors can

The processing of the integrals usually involves repeated matrix multiplications, the canonical operation of compute-intensive processors. If the architecture can efficiently exploit sparsity, then there are great performance benefits, because a typical matrix in computational chemistry only has 10% nonzero elements. A 10% sparse-10% sparse matrix multiply requires, on the average, only 1% of the multiply-adds of a dense-dense matrix multiply, so the potential speedup is on the order of a hundredfold. One approach, known as “Direct Configuration Interaction” (CI), consists almost entirely of matrix multiplication, which was a motivation for the FPS-164/MAX special-purpose architecture (see Section 5.6.2).
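The 1% figure is just the product of the two densities; a quick Monte Carlo check with random sparsity patterns (dimensions and seed chosen arbitrarily) confirms it:

```python
import random

random.seed(1)
n, density = 80, 0.10

# Random n x n sparsity patterns, ~10% nonzero entries each.
a_nz = [[random.random() < density for _ in range(n)] for _ in range(n)]
b_nz = [[random.random() < density for _ in range(n)] for _ in range(n)]

# A sparse-sparse product performs one multiply-add for each pair
# (A[i][k] nonzero, B[k][j] nonzero): count column-of-A against row-of-B.
nnz_a_col = [sum(a_nz[i][k] for i in range(n)) for k in range(n)]
nnz_b_row = [sum(b_nz[k][j] for j in range(n)) for k in range(n)]
sparse_ops = sum(nnz_a_col[k] * nnz_b_row[k] for k in range(n))

dense_ops = n ** 3
print(sparse_ops / dense_ops)   # close to density**2 = 0.01
```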

Even the most highly-specialized compute-intensive processor can be used efficiently on the radar cross-section application. The vectors are long, hundreds of operations are performed for every communication of data to or from disk, and the inner loop of the factoring is pure multiply-add arithmetic: a complex dot product. Setting up the matrix is a more scalar-intensive operation, but is usually less than 1% of the total work. The problem

In the “acoustic model,” there is only one degree of freedom at each discrete space point, representing pressure. A far more accurate method recognizes the earth as an elastic medium, with independent stresses and torques in every dimension. The elastic model uses principles from Finite Element Methods, but avoids factoring a global stiffness matrix at every timestep; each degree of freedom can be computed

As an example,

Another example is satellite image processing. A large amount of arithmetic computation goes into refining data from satellites prior to being able to use AI methods to distinguish structures (

Finally, an example of a compute-intensive AI application is that of the “Neocognitron” neural network model for visual pattern recognition developed by Nippon Telephone and Telegraph researchers. It refines handwritten Arabic numerals described on a 19 by 19 array of pixels through twelve stages to a final 1 by 10 array representing digits 0 to 9. By “teaching” the network using a small number of distorted examples, the Neocognitron can accurately recognize a wide range of written numerals. Each “neuron” performs essentially a floating point inner product followed by a binary decision, with a very high ratio of computation to communication. Since neurons are known to integrate information that is digital in amplitude but analog (best represented by floating point numbers) in time, it seems likely that compute-intensive processors will play a role in AI applications based on neural networks.
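Each neuron in such a model reduces to an inner product followed by a threshold test. A minimal sketch, with made-up weights rather than the Neocognitron's trained ones:

```python
def neuron(weights, inputs, threshold=0.0):
    """Floating-point inner product followed by a binary decision."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > threshold else 0

# Hypothetical 3-input feature detector.
w = [0.5, -0.25, 1.0]
print(neuron(w, [1.0, 1.0, 0.0]))   # activation 0.25 -> fires: 1
print(neuron(w, [0.0, 1.0, 0.0]))   # activation -0.25 -> silent: 0
```

The inner product is the compute-intensive part; only the single-bit decision needs to be communicated onward.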

**5.4 Design Philosophy**

Table 5.1 illustrates this fact using a list of the most-used instructions on the venerable IBM/370 family [IBM 86]:

**Table 5.1 IBM/370 Instruction Frequencies**

Instruction             % of Total Instructions
Branch on Condition     20.2
Load from Memory        15.5
Test under Mask          6.1
Store to Memory          5.9
Load Register            4.7
Load Address             4.0
Test Register            3.8
Branch on Register       2.9
Move Characters          2.1
Load Halfword            1.8

This “top ten” list accounts for 66% of the instruction usage on the IBM/370, but contains not a single mathematical operator. A similar study has been done for Digital Equipment’s VAX-11/780:

**Table 5.2 VAX-11/780 Instruction Frequencies**

Instruction Type         % of Total Instructions
Simple (Moves, etc.)     83.6
Bit Manipulation          6.92
Floating Point Math       3.62
Call/Return               3.22
System Management         2.11
Character Manipulation    0.43
Decimal Instructions      0.03

Scientific and engineering applications contrast so sharply with this distribution of instruction usage that they demand an entirely different approach. These applications are typically heavily laden with arithmetic operations. When run on a general-purpose machine without floating-point hardware, most of the time is spent in the software or firmware routines that combine basic integer operations to accomplish a floating point operation. Time spent fetching data becomes negligible by comparison, leading to algorithms that minimize total work simply by minimizing the number of floating point operations.

Extensive hardware is necessary to reduce floating point calculation time to a few processor clock cycles. Such hardware can roughly double the number of gates in a processor! Since doubling the gate count might nearly double hardware cost, manufacturers of general-purpose computers prefer to leave such hardware out: most of their customers would never use it, and would find such machines less cost-efficient for business tasks.

This dilemma has brought about the separation of computers into two classes: those optimized for business processing, typified by the mainframes built by International Business Machines, Unisys, and Digital Equipment, and those optimized for scientific processing, typified by mainframes built by Control Data and Cray Research. An alternative approach is a compute-intensive processor that attaches to a general-purpose mainframe to provide floating point speed via an optional peripheral device. These attached processors, typified by those made by Floating Point Systems and Star Technologies, are able to leave out many hardware functions that are taken care of by the host processor, recovering cost-efficiency that would otherwise be slightly worse than a solution based on a single scientific processor. Figure 5.2 illustrates the tradeoff between generality and compute-intensive efficiency:

The CRAY-1 is very nearly in the attached processor class, requiring a host processor to manage it but differing from attached processors in that it runs its own compiler and linker. At installations where CRAY usage is heavily subsidized, complete time-sharing and even text editing are done directly on the compute-intensive processor despite the very poor cost-efficiency of that approach.

A glance at the instruction set for a general-purpose computer reveals operations for converting binary number representations to character representations, rotating and shifting the bits of words, and prioritized interrupt handling to assist in sharing machine resources among multiple users. Compute-intensive processors can have greatly simplified designs through the elimination of such instructions. Where necessary, those functions can be built somewhat inelegantly from other operations, just as some general-purpose processors build floating-point operations. A compute-intensive processor will typically have instructions such as floating-point multiply, add, subtract, reciprocal (not necessarily a divide!), square root, and conversions between floating-point and integer representations.

Besides the presence or absence of instructions, compute-intensive processors have very unusual ratios of communication speed to computation speed. On general-purpose machines, it typically takes one or two orders of magnitude longer to perform a floating-point operation than to fetch or store the operands/result; on compute-intensive processors, the ratio is closer to unity. We will later discuss an example in which 248 floating-point operations occur (usefully) in the time required for a single memory reference! In the case of a polynomial evaluation, computing the norm of a vector, or multiplying a matrix times a vector, the ratio of speed for

Memory Reference : Floating Point Add : Floating Point Multiply

can in fact be 1:1:1. Most compute-intensive processors approximate this ratio. Examples include the FPS-264, the CRAY-1 (but not the CRAY X-MP), the Convex C-1, and the Alliant FX series. The CRAY X-MP actually has multiple ports to main memory, making its ratio 3:1:1 and tending toward more general-purpose computing.

**5.5 Architectural Techniques**

Although the last 50 years of computer technology show steady and dramatic improvements in cost, size, speed, and packaging, history repeats itself with respect to computer architecture:

• How many different functional units should there be?

• How can they be controlled by an instruction stream?

• How much parallelism can be used by Fortran-type environments?

• How much main memory is needed to keep the processor busy?

• Can registers or cache successfully substitute for main memory?

• What is the optimal ratio of vector speed to scalar speed (pipeline length)?

• What is the ideal precision (32-bit, 64-bit, etc.)?

The following sections offer some arguments that have been used in successful designs, illustrating tradeoffs that must be made.

**5.5.1 Functional Units**

A Fortran statement such as

A(I,J) = 0.5*(SQRT(FLOAT(I/2)))*B(J)+3.0/C(I,J)

could be evaluated efficiently by having a processor that contains a machine for the integer divide, a machine for the FLOAT, a machine for the SQRT, two floating-point multipliers, a floating-point divider, a floating-point adder, and address calculation units for the (I,J) indexing, all capable of cooperating on this task! Hardware cost grows with the number of functional units, faster than linearly because of the need to heavily interconnect the functional units (the classic crossbar switch problem). Studies of compute-intensive kernels [Charlesworth 81] reveal that a typical mix of operations is weighted as follows:

Operation                      Relative Weight
Operand Reference              7
Integer Add/Subtract           1 to 2
Integer Multiply               < 0.1
Floating Point Add/Subtract    1
Floating Point Multiply        1
Divides                        < 0.1
Square Roots                   < 0.1

It is an engineering heuristic that a processor is most cost-efficient when no single part is a conspicuous bottleneck. Therefore, the preceding list implies that a well-balanced compute-intensive processor should have an integer adder, a floating point adder/subtracter, a floating point multiplier, and should handle the other operations less directly. For example, divides can be accomplished (with a slight increase in rounding error) by fetching a reciprocal approximation of the denominator from a table in Read-Only Memory, refining the precision with Newton-Raphson iterations (using the floating point adder and multiplier), and finally multiplying by the numerator. This is perhaps an order of magnitude slower than an add or multiply, but is acceptable for the typical mix of operations shown above.
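A sketch of the reciprocal technique in Python: the 16-entry seed table and the three Newton-Raphson iterations are illustrative choices, not any particular machine's ROM contents. Each iteration roughly doubles the number of correct bits, using only multiplies and adds:

```python
import math

# Seed reciprocals for mantissas in [1, 2), one per 1/16-wide bucket,
# playing the role of the ROM lookup table.
TABLE = [1.0 / (1.0 + (i + 0.5) / 16.0) for i in range(16)]

def reciprocal(d):
    """Approximate 1/d from a table seed refined by Newton-Raphson."""
    if d == 0.0:
        raise ZeroDivisionError
    sign = -1.0 if d < 0 else 1.0
    m, e = math.frexp(abs(d))        # abs(d) = m * 2**e, m in [0.5, 1)
    m *= 2.0
    e -= 1                           # renormalize mantissa into [1, 2)
    x = TABLE[int((m - 1.0) * 16)]   # ROM-style seed, ~4 correct bits
    for _ in range(3):               # each pass squares the relative error
        x = x * (2.0 - m * x)        # Newton-Raphson step: multiply/add only
    return sign * math.ldexp(x, -e)

def divide(num, den):
    """Division built from the reciprocal, as on adder/multiplier hardware."""
    return num * reciprocal(den)
```

After three iterations the relative error is below 10^-12, ample for a divide that occurs once per several hundred other operations.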

The list also indicates that there should be the ability to fetch or store about seven operands for every time the floating point multiplier and adder units complete an operation. But doing this from main memory is exactly the communication cost that compute-intensive processors seek to avoid! Since the adder and multiplier each take two inputs and produce one output, it is evident that at least six operand references are necessary to sustain anything close to the peak speed of the arithmetic. The solution usually makes use of registers and the interconnect between functional units rather than memory bandwidth, the most limited resource in a von Neumann design. For example, a processor could provide two register reads, two register writes, a main memory reference, a local “table memory” reference, and connect the output of one floating-point functional unit to the input of another, providing a total of seven operand references, of which only one actually burdens the main memory bus.

**5.5.2 Processor Control**

When is such a processor supposed to obtain

There is a price to pay for the multiple instruction units in the instruction word. The choice is to either “hard-wire” a set of basic operations such as dot product, vector add, etc., and use the instruction to select one of those basic operations, or else provide a

DO 10 I=1,N

10 SUM = SUM + A(I) * B(I)

and the round, “Row, Row, Row Your Boat”:

Fetch A(I)                  *Row,*
Fetch B(I)                  *Row,*
(Memory pipeline wait)
(Memory pipeline wait)
(Multiplier pipeline wait)
Test I = loop end?

Both the program and the music have been arranged in such a fashion that a conflict-free

Note that the role of the singers in the music above is

On a “vector processor,” the type of loop shown above would be in microcode that the user can invoke but never alter. This greatly reduces the burden on the instruction bandwidth, but restricts the set of operations. For example, to perform a

DO 10 I=1,N

10 SUM = SUM + A(J(I)) * B(I)

is only slightly more work for a software-pipelined machine, but at least twice as hard for a vector machine that must construct the operation from a Vector Gather followed by a Vector Multiply Reduction Add. The compiler burden shifts to one of *recognizing* vector forms expressed as DO loops by template matching; the compiler must then attempt an efficient construct from the basic forms available on the machine. The hardware to manage vector forms is expensive, but so is programming assembly language on a software-pipelined machine. The argument has yet to be firmly settled in the community of computer users.
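The two organizations can be sketched as follows; this is schematic Python (0-indexed, unlike the Fortran above) standing in for microcode on the one hand and a Gather/Multiply-Reduce instruction pair on the other:

```python
def indexed_dot_fused(a, b, j):
    """One-pass form: the gather A(J(I)) is folded into the multiply-add loop,
    as a software-pipelined machine would schedule it."""
    total = 0.0
    for jj, bb in zip(j, b):
        total += a[jj] * bb
    return total

def indexed_dot_vector(a, b, j):
    """Two-pass vector-machine form: Vector Gather, then
    Vector Multiply Reduction Add."""
    gathered = [a[jj] for jj in j]                  # Vector Gather
    return sum(x * y for x, y in zip(gathered, b))  # multiply-reduce
```

Both compute the same sum; the vector form makes an extra full pass over the gathered data, which is the "twice as hard" cost mentioned above.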

**5.5.3 Parallelism and Fortran**

Fortran, despite its historical development as a language for serial architectures, shows a fair amount of

DO 10 I=1,N

10 R(I) = A(I) + B(I)

involves N completely independent additions. Optimization for this goes hand-in-hand with the multiple functional unit idea, since the units can be scheduled at

The philosophy in compute-intensive processors has traditionally been to put the burden on the compiler or other software tools such as libraries of vector operations, thereby lessening the amount of hardware needed to detect and make use of concurrency at run time. This affects the Edit-Compile-Link-Run-Debug cycle in which the user sits, because on small problems the time saved in the Run will be consumed by the additional time during the Compile. For this reason, attached processors and array processors are inherently less interactive than general-purpose computers when traditional programming languages are used.

**5.5.4 Main Memory**

How big should the main memory be? This is highly application-dependent, but the answer stems from the need to amortize the time to load and unload the data for a particular problem. If it takes five seconds to load a program and its data set, then the computation should take at least as long. Although initialization grows linearly with the amount of memory to move, the amount of computation usually grows superlinearly. For example, an

Load Time = Compute Time / 100

which implies that

(2*N*^{2} words)/(0.1 MWords/second) = (2*N*^{3} operations)/(10M operations/second) x 1%

This is satisfied when *N* is 10000 words. This is quite a bit smaller than the typical memories of general-purpose processors, suggesting that for some applications, simple array processors can improve cost efficiency by reducing main memory size. Attached scientific computers must have larger memories since they contain the entire application program (or large overlays of it), not just compute-intensive kernels, and must use physical memory rather than virtual memory for both program and data.
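Under one reading of the example, a data set of 2N^{2} words processed with 2N^{3} operations at the illustrative rates of 0.1 MWords/second and 10 Mops/second, the 1% condition can be checked directly:

```python
LOAD_RATE = 0.1e6      # words per second (illustrative)
COMPUTE_RATE = 10e6    # floating-point operations per second (illustrative)

def load_time(n):
    """Seconds to load a 2*N^2-word data set."""
    return (2 * n**2) / LOAD_RATE

def compute_time(n):
    """Seconds to perform 2*N^3 floating-point operations."""
    return (2 * n**3) / COMPUTE_RATE

n = 10000
print(load_time(n), compute_time(n) / 100)   # both are 2000 seconds
```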

General-purpose computing is characterized by problems for which the number of computations grows proportionately to the amount of data. If desired run time is on the order of seconds, then a general-purpose processor capable of *X* operations per second should have about *X* words of memory. The above argument, however, shows that compute-intensive processors can often succeed with less than this. The CRAY-1, for example, is capable of over 150 million floating-point operations per second, yet has only 1 million words of memory. The reason behind this is that scientific computing is characterized by problems (such as matrix multiplication) for which the number of computations grows much faster than the number of data points.

**5.5.5 Registers**

Since data caches appear difficult to make efficient for the whole spectrum of memory reference patterns used for compute-intensive processing, one must have a large set of registers close to the functional units. For example, the FPS-164 has 64 integer registers and 64 floating point registers, with the capacity for several simultaneous reads and writes per clock cycle. The CRAY-1 uses multiple vector registers, where each vector contains 64 floating point elements, to keep operands within a few nanoseconds of the functional units. Besides the obvious design difficulties and parts cost of having a large number of fast registers, state-save becomes much slower, making context-switching a clumsy operation. Compute-intensive processors are thus

**5.5.6 Vector/Scalar Balance**

If a pipelined functional unit has

Suppose, statistically, that 75% of an application is vector-oriented and the rest is scalar, as timed on a completely scalar processor. Then a machine that is three times faster at vector processing than scalar processing will spend equal time in both types of code, and neither will appear as a bottleneck. If the same functional unit performs both scalar and vector operations, then the use of 3-stage pipelines is suggested. If 90% of the application is vector-oriented, one would wish for 9-stage pipelines, although there is a limit as to how finely one can pipeline tasks like multiplication and addition! One manufacturer of array processors, CSP Inc., discarded the idea of pipelining entirely, with the argument that scalar units are easier to compile to. FPS opted for pipeline lengths of 2 or 3 stages in the functional units of the AP-120B and FPS-164 processors, and the pipelines in the CRAY-1 are up to 14 stages long. The memory pipeline on the original CRAY-2 is roughly 60 stages long, which is perhaps a record. Applications are generally scalar-bound on such machines. The CRAY machines generally fit the definition of a compute-intensive processor, since a scalar floating-point operation is faster than a memory reference, and vector memory bandwidth (full pipelines) is just fast enough to keep up with the peak speed of the vector floating-point functional units.
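The balance rule in the paragraph above can be written as a one-line formula: if a fraction *f* of the scalar-timed work is vectorizable, vector and scalar code take equal time when the vector speedup is *f*/(1 - *f*):

```python
def balanced_speedup(vector_fraction):
    """Vector speedup at which vector and scalar code take equal time.

    With total scalar-timed work 1, the vectorizable part f runs in f/s
    at speedup s and the scalar part in (1 - f); they are equal when
    s = f / (1 - f).
    """
    f = vector_fraction
    return f / (1.0 - f)

print(balanced_speedup(0.75))   # 3.0  (75% vector -> 3-stage pipelines)
print(balanced_speedup(0.90))   # ~9   (90% vector -> 9-stage pipelines)
```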

The current generation of VLSI floating-point parts uses mostly four-stage pipelines for the multiplier and adder. Note that it is convenient to have the number of stages be a power of two, since if there are 2

**5.5.7 Precision**

The term “array processor” stems from the fact that these small, simple compute-intensive processors are frequently used to perform highly repetitive operations on arrays of numbers,

Attached processors have the additional problem that users are generally unwilling to tolerate either loss of precision or loss of dynamic range in passing data from the host to the attached processor. Therefore, an attached processor should have at least as many mantissa bits

Required precision is a function of application. Some generalizations are listed below:

**32-bit precision:**

Most signal processing
Most image processing
Lattice-gauge calculations
Missile simulation (simple ODE’s)
Most graphics (rendering)
Seismic migration
Coarse-grid, elliptic finite difference schemes

**64-bit precision:**

Structural analysis (Finite Element Methods)
Circuit simulation (stiff ODE’s)
Electromagnetic wave modeling (inverse problems)
Seismic simulation (forward modeling)
High-order or hyperbolic finite difference schemes
Computational fluid dynamics
Oil reservoir simulation

Data acquired from physical phenomena (seismic data reduction, satellite image processing) rarely have more than four decimals of precision or four orders of magnitude dynamic range, and hence signal and image processing usually need no more than 32-bit precision.

Lattice-gauge simulations are unusual among compute-intensive simulations in that errors accumulated using 32-bit arithmetic on small unitary matrices can be periodically removed by renormalizing to keep determinants close to unity.

Missile simulation and other models involving Ordinary Differential Equations (ODE’s) with few degrees of freedom frequently have very stable numerical methods, and errors caused by use of single precision damp out over repeated timesteps.

Rendering methods for generating photographic-quality simulated images (such as ray-tracing) need only compute with the precision that will be relevant to a raster-scan display (about one part in 1000). An exception occurs when one must compute the intersection of nearly parallel plane sections, which is an ill-posed problem mitigated by judicious use of 64-bit precision.

Finite difference schemes for solving Partial Differential Equations (PDE’s) often have inherent numerical error proportional to some low power of the grid spacing. For instance, the usual five-point operator applied to an *n* by *n* grid only has accuracy proportional to 1/*n*^{2}, so one usually runs out of memory and patience when *n* is increased before one runs out of precision! This applies to static problems such as elliptic PDE’s where errors do not propagate with time.
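The 1/*n*^{2} behavior is easy to observe with a one-dimensional central-difference second derivative standing in for the five-point operator (sin is an arbitrary smooth test function):

```python
import math

def second_difference(f, x, h):
    """Central-difference approximation to f''(x), accurate to O(h^2)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

x = 1.0
exact = -math.sin(x)                   # f'' for f = sin
err_h = abs(second_difference(math.sin, x, 0.10) - exact)
err_h2 = abs(second_difference(math.sin, x, 0.05) - exact)
print(err_h / err_h2)                  # ~4: halving the spacing quarters the error
```

At these grid spacings the discretization error is around 10^-4, still far above 32-bit roundoff, which is the point of the paragraph above.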

Structural analysis methods use finite element methods, which are in principle far more accurate than typical finite difference methods. A large problem might lose seven or eight digits in the process of solving the global stiffness equations, however, so 64-bit precision or better is essential. The need for high precision is easily seen by considering the bending of a beam: although the overall bending might be quite significant, the strain on an element in the beam is perhaps one part in a million. Since the overall bending is computed by accumulating element strains, small errors caused by precision translate into major macroscopic errors.

Analog circuit simulation can easily run into ill-posed sets of nonlinear ODE’s for which high precision helps convergence. Iterations can get “stuck” on the jagged steps of 32-bit precision, even though far from the best solution of the nonlinear system.

Computational chemistry involves the computation of eigenvalues (which represent energy levels of molecular orbitals) where the eigenvalues span many orders of magnitude, and operations on matrices where the sum of many small quantities is significant relative to a few large quantities; some computational chemists are even forced to use 128-bit precision, for which hardware support is rare.

Electromagnetic wave modeling, used to simulate radar reflections from metal objects, is currently done using boundary integral techniques that give rise to very large (10000 by 10000) dense systems of linear equations; since the bound on roundoff error is proportional to the number of equations times the relative error of the floating point representation, 64-bit precision is preferred.

Seismic simulation is a large hyperbolic PDE model, not to be confused with seismic migration. One is the inverse of the other. Fine-grid finite difference schemes involving multipoint operators can be accurate enough that single-precision floating point error exceeds the error of the numerical method; hence double precision is needed.

Oil reservoir simulation at first glance appears feasible in 32-bit precision, since coarse grids are used; however, the methods repeatedly calculate pressure differences at adjacent grid points, where subtraction of large, nearly equal, uncertain numbers removes many significant figures.
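The loss is visible even in a two-number example (the pressures below are hypothetical; the `f32` helper simulates 32-bit storage):

```python
import struct

def f32(x):
    """Round to the nearest IEEE 32-bit value."""
    return struct.unpack('f', struct.pack('f', x))[0]

p1, p2 = 1000000.1, 1000000.0   # hypothetical pressures at adjacent grid points
dp32 = f32(p1) - f32(p2)        # 32-bit spacing near 1e6 is 0.0625, so this is 0.125
dp64 = p1 - p2                  # 64-bit result is correct to about ten digits
```

A 25% error appears before any arithmetic has been done; no algorithm downstream of the subtraction can recover the lost digits.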

In summary, there are a variety of reasons why high-precision arithmetic might be needed by an application. As models become larger or more detailed, some low-precision applications are moving into the high-precision camp.

**5.5.8 Architecture Summary
**Table 5.3 summarizes tradeoffs for various compute-intensive architectural techniques:

**Table 5.3 Architectural Tradeoffs**

| METHOD | ADVANTAGES | DISADVANTAGES |
| --- | --- | --- |
| Multiple operations | Higher peak speed | Extra hardware cost; reduced scalar performance |

A1. For column *k* = 1 to *n*,

A2. Find the element of largest magnitude in column *k*, on or below the diagonal (the pivot).

A3. Exchange the row containing the pivot with row *k*.

A4. For row *i* = *k* + 1 to *n*,

A5. Compute the multiplier *m* = *A*(*i*,*k*) / *A*(*k*,*k*).

A6. For row element *j* = *k* + 1 to *n*,

A7. Compute *A*(*i*,*j*) = *A*(*i*,*j*) − *m* × *A*(*k*,*j*).

A8. Next *j*; next *i*.

A9. Next *k*.

The usual work estimate for this algorithm is (2/3)*n*³ + 2*n*² + *O*(*n*) floating point operations. Line A7 is executed (1/3)*n*³ times and constitutes the kernel of the algorithm. Now consider the *memory burden* imposed by that kernel. The computer must reference *A*(*i*,*j*) twice (a fetch and a store) and *A*(*k*,*j*) once on every execution: three memory references for every two floating point operations.

The simple act of reordering the operations by changing loop nesting in the preceding algorithm can remove this problem. By arranging the loop over *k* to be the innermost loop, the kernel can be changed to an inner product, as shown below for the inner two loops:

B1. Save *A*(*i*,*j*) in a table memory.

B2. For each element *A*(*i*,*j*) of the matrix,

B3. Set *t* = *A*(*i*,*j*).

B4. For *k* = 1 to min(*i*, *j*) − 1,

B5. Set *t* = *t* − *A*(*i*,*k*) × *A*(*k*,*j*).

B6. Next *k*.

B7. Scale by diagonal pivots.

B8. Next element.

The kernel, line B5, can keep the running sum *t* in a register, so each multiply-add requires only two memory references, rather than the three required by line A7.

If the floating point adder is pipelined with several stages, the recurrence on *t* can be split into as many independent partial sums as there are stages, which are then combined when the loop ends.
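The two orderings can be compared directly. The sketch below (illustrative Python; pivoting is omitted for brevity, so it is not the full algorithm A) factors the same matrix once with the outer-product kernel of line A7 and once with an inner-product kernel in the spirit of line B5; both leave the same L and U factors in place.

```python
def lu_outer(a):
    """Right-looking elimination: kernel a[i][j] -= m * a[k][j], as in line A7."""
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            a[i][k] = m                       # keep the multiplier in place
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]        # fetch, multiply-add, store
    return a

def lu_inner(a):
    """Inner-product (Doolittle) form: the running sum t stays in a register."""
    n = len(a)
    for i in range(n):
        for j in range(n):
            t = a[i][j]
            for k in range(min(i, j)):        # kernel: two fetches per multiply-add
                t -= a[i][k] * a[k][j]
            if j < i:
                t /= a[j][j]                  # scale by the diagonal pivot
            a[i][j] = t
    return a

A = [[4.0, 3.0, 2.0, 1.0],
     [3.0, 4.0, 3.0, 2.0],
     [2.0, 3.0, 4.0, 3.0],
     [1.0, 2.0, 3.0, 4.0]]
F1 = lu_outer([row[:] for row in A])
F2 = lu_inner([row[:] for row in A])
```

The arithmetic is identical; only the memory traffic per multiply-add differs, which is exactly the property the reordering is meant to exploit.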

Both matrix multiplication and matrix factoring involve on the order of *n* floating point operations for every matrix element, which makes possible two techniques for reducing the demand on memory bandwidth:

1. Re-use of data (temporal bandwidth reduction)

2. Broadcast of data (spatial bandwidth reduction)

In multiplying a matrix by a matrix, each fetched column can be re-used against every row of the other operand, and can be broadcast to many pipelines at once, exploiting both techniques.

The MAX allows up to 15 extra processor boards, each with two multiply-add pipelines, or 30 pipelines total. The scalar unit of the main processor participates in the column operations using its table memory as a column store, so the actual number of parallel pipelines is 31. Thus, the 11 Mflops peak speed of the base processor grows to a 341 Mflops peak speed for a fully configured machine.

directly requires

This illustrates the flaw of conventional algorithm analysis, which ignores memory references in the work assessment. An FFT must eventually reference every storage location on each of its log₂ *n* passes, so its ratio of floating point work to memory references is a small constant, independent of problem size.
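Standard radix-2 operation counts make the point concrete (a model, using the common figure of 10 real operations per butterfly; exact counts vary by implementation):

```python
import math

def flops_per_reference(n):
    """Model of floating point work per real-word memory reference for a radix-2 FFT."""
    passes = int(math.log2(n))
    flops = 10 * (n // 2) * passes   # n/2 butterflies per pass, ~10 real ops each
    refs = 4 * n * passes            # each pass reads and writes n complex points
    return flops / refs

# The ratio is 5/4 for every n: unlike dense factoring, it never improves with size.
```

No reordering of the FFT can make the arithmetic-to-reference ratio grow with *n*, so on a memory-bound machine the FFT remains memory-bound at any problem size.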

```fortran
      DO 1 I=2,N-1
      DO 1 J=2,N-1
    1 F(I,J)=U*F(I,J)+V*(F(I-1,J)+F(I+1,J)+F(I,J-1)+F(I,J+1))
```

where U and V are predefined constants and F(I,J) is a two-dimensional array representing the physical state of the system. For hyperbolic PDE’s, this type of equation might compute the next timestep of the spatial state of the system; for elliptic PDE’s, it might be a relaxation method that produces an F(I,J) that eventually becomes sufficiently close to the solution. In any case, the kernel involves six main memory references and six floating point operations. The memory references are “nearest neighbor” in two dimensions, so the technique of *loop unrolling* can improve the ratio of arithmetic to memory references:

```fortran
      DO 1 I=2,N-1
      DO 1 J=2,N-1,2
      F(I,J)=U*F(I,J)+V*(F(I-1,J)+F(I+1,J)+F(I,J-1)+F(I,J+1))
    1 F(I,J+1)=U*F(I,J+1)+V*(F(I-1,J+1)+F(I+1,J+1)+F(I,J)+F(I,J+2))
```

Using registers for expressions that occur more than once, there are 10 memory references and 12 floating point operations. More unrolling allows the ratio to approach 2:3 at the cost of register usage (and possibly exceeding the size of the program cache), which is one argument in favor of a large set of scalar registers. Computers with a 1:1:1 ratio of fetches to multiplies to adds can approach 50% of peak speed with no unrolling, or 66% with extensive unrolling.
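Unrolling changes only the bookkeeping, not the arithmetic: the operations are performed in the same order on the same operands, so the results agree exactly. A small Python check on an illustrative grid (the constants and grid values are invented for the test):

```python
N = 10          # grid dimension; interior points are indices 1..N-2
U, V = 0.6, 0.1

def sweep(f):
    """Straightforward in-place sweep: six references, six flops per point."""
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            f[i][j] = U * f[i][j] + V * (f[i-1][j] + f[i+1][j] + f[i][j-1] + f[i][j+1])
    return f

def sweep_unrolled(f):
    """Same sweep unrolled by two in j: ten references, twelve flops per pair."""
    for i in range(1, N - 1):
        for j in range(1, N - 1, 2):
            f[i][j]   = U * f[i][j]   + V * (f[i-1][j]   + f[i+1][j]   + f[i][j-1] + f[i][j+1])
            f[i][j+1] = U * f[i][j+1] + V * (f[i-1][j+1] + f[i+1][j+1] + f[i][j]   + f[i][j+2])
    return f

grid = [[(3 * i + j) % 7 + 1.0 for j in range(N)] for i in range(N)]
a = sweep([row[:] for row in grid])
b = sweep_unrolled([row[:] for row in grid])
```

Since the unrolled loop updates the points in the same sequence, the two versions are bitwise identical; only the ratio of arithmetic to memory references changes.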

An interesting contrast with either of the above methods is the one favored by highly vector-oriented computers. A vector computer might run the problem more efficiently in the following form, where TEMP is a vector register:

```fortran
      DO 10 I=2,N-1
      DO 1 J=2,N-1
    1 TEMP(J)=F(I-1,J)+F(I+1,J)
      DO 2 J=2,N-1
    2 TEMP(J)=TEMP(J)+F(I,J-1)
      DO 3 J=2,N-1
    3 TEMP(J)=(TEMP(J)+F(I,J+1))*V
      DO 4 J=2,N-1
    4 F(I,J)=U*F(I,J)+TEMP(J)
   10 CONTINUE
```

If there is enough bandwidth to either memory or vector registers to keep the floating point functional units busy, and if the vector operations shown are in the repertoire of the hardware that manages the arithmetic, then the preceding method can approach 75% of the peak *Mflops* rate of machines with one adder and one multiplier functional unit. On the other hand, if TEMP must reside in main memory, then this is a disastrous approach for memory-bound processors. There are twice as many memory references as in the “unvectorized” approaches, causing a multiply-add processor to achieve at most 25% of peak speed. This is why one occasionally observes the phenomenon that the CRAY or CYBER type of processor does relatively better on the vectorized version of an application program, whereas the FPS type of processor does relatively better on the unvectorized version. This is a source of difficulty for the developers of application software, who naturally seek a simple-to-maintain single method that works well on a wide range of compute-intensive machines.

**5.7 Metrics for Compute-Intensive Performance
**Ultimately, the only performance questions of interest to a computer user are:

1. How long will the application take to run?

2. How much will the run cost?

This assumes, of course, that answers are within required accuracy tolerance. On compute-intensive processors, there is usually a significant hurdle to answering the above questions by direct measurement. Architectures vary greatly, and only considerable conversion effort shows the performance that is ultimately possible. Since compute-intensive processors trade ease-of-use for cost-performance, predicting overall speed and cost (including reprogramming cost amortized over the useful life of the application) is a difficult proposition.

As a result, the industry has come to use simpler metrics that claim correlation with application performance. At the simplest (and least accurate) level are machine specifications such as clock speed, peak Mflops rating, and memory size; at the other extreme lies converting and timing the complete application itself.

A compromise is to use standard problems that are much simpler than a complete application but exercise a variety of dimensions of machine performance. Compute-intensive processors have their own set of industry-standard benchmarks, of which the following are examples (arranged in rough order of increasing complexity and usefulness at gauging performance):

• Matrix multiplication (no particular size)

• The 1024-point complex FFT

• Whetstones (a synthetic mixture of operations designed to test scalar speed only)

• Solution of 100 equations in 100 unknowns, with partial pivoting

• The Livermore Kernels (24 Fortran excerpts from application programs)

• The “four-bit adder” simulation in the SPICE test suite

• Solving structural analysis problem “SP3” in Swanson’s ANSYS

To compare actual costs, it is similarly possible to estimate total computing cost by considering all hardware costs, software costs, facilities requirements, and conversion effort. The ultimate measure of cost is the price of the application run at a computer service bureau, which must set charges accurately enough to remain in business and yet be competitive with alternatives. It is rare for compute-intensive processors to be compared using applications that exercise memory bandwidth or disk speed, for all the reasons stated in the Introduction. This is perhaps unfortunate, since those factors figure prominently in the performance of applications such as analog circuit simulation and large-scale structural analysis.

In practice, at least one channel is used for communication to a host or other I/O device, and the commercial hypercubes have all adopted this variation on the original prototype built at Caltech in the early 1980’s.

Processors cooperate on a particular task by passing messages explicitly to one another; there is no “shared memory” in the system and hence no need to deal with the problem of multiple writes to a memory location. This latter problem is a major source of indeterminism and programming difficulty in shared memory designs. Both shared memory and hypercube designs pay a cost for global memory accesses that grows logarithmically in the number of processor-memory pairs; the hypercube, however, rewards locality of memory accesses.

Replicated Very Large-Scale Integration (VLSI) is ideal for the nodes of hypercube computers, now that VLSI density has reached the point where a complete processor fits on a single chip. The early hypercubes used a mixture of small-, medium-, and large-scale integration to put a node on one or two boards. The implementation by NCUBE, Inc. integrates a complete VAX-style processor, floating point arithmetic, 11 communications channel pairs, and an error-correcting memory interface on a single chip. With no “glue logic,” this chip plus six memory chips form a complete node with 512 KBytes of memory.

Available hypercubes tend to be either highly compute-intensive or general-purpose. They are largely characterized by three times: communication time, memory-move time, and arithmetic time. The ratio of the last time to the first two determines how compute-intensive the design is.

Communication time is the time to send a message from one node to a neighboring node. It is well described by a fixed startup time plus a per-byte transfer time, so short messages cost little more than the startup time alone.
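A common first-order model of message time is a fixed startup latency plus a per-byte cost (the figures below are invented for illustration, not measurements of any machine):

```python
# Hypothetical parameters; real values vary widely by machine.
T_STARTUP = 100e-6   # fixed per-message latency, seconds
T_BYTE = 0.5e-6      # per-byte transfer time, seconds

def message_time(nbytes):
    """Modeled time to deliver a message of nbytes to a neighboring node."""
    return T_STARTUP + nbytes * T_BYTE

def effective_per_byte(nbytes):
    """Delivered cost per byte, including the amortized startup latency."""
    return message_time(nbytes) / nbytes

short_msg = effective_per_byte(8)          # latency-dominated: far above T_BYTE
long_msg = effective_per_byte(1_000_000)   # approaches the raw per-byte rate
```

The model explains why programmers of message-passing machines batch many small exchanges into one long message whenever the algorithm allows it.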

Memory-move, or “gather-scatter” time, is the time to move data within the memory of a single node. Usually, the communications channels use only a fraction of the memory bus, so moves within a single node are much faster by the reciprocal of that fraction. Currently, time to move a contiguous array from one location to another ranges from 0.1 to 2 microseconds per byte.

Arithmetic time is the time to do a floating point operation (two inputs, one output) by whatever means is provided on the node. For 64-bit operands, the current range is rather wide: from 0.1 to 20 microseconds. The ratio of Arithmetic Time : Communication Time varies from about 1:100 or 1:200 for the FPS T Series or the Intel iPSC/VX, to about 1:1 for the NCUBE. The T Series and iPSC/VX are extremely compute-intensive and use vector arithmetic managed by a microprocessor, whereas the NCUBE uses integrated scalar arithmetic.

The arguments of earlier sections regarding system balance apply here, with the added consideration of “surface area.” Ideally, each node communicates and computes at the same time, and sees no bottleneck from either activity. On the T Series or the iPSC/VX, every word brought over a channel must participate in over 100 floating point operations if the arithmetic is to be kept busy. Suppose that the application is the Synthetic Seismic problem, modeling the wave equation in three dimensions. Each grid point requires eight floating-point operations per timestep in the simplest acoustic model. When the three-dimensional domain of the problem is partitioned into a subdomain for each node, only the grid points on the faces of a subdomain need be communicated, so the ratio of arithmetic to communication grows in proportion to the edge length of the subdomain.
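The surface-to-volume argument can be sketched numerically. Assuming the eight flops per grid point quoted above, one word exchanged per face point, and a cubical subdomain of edge *s* (the 1:100 balance figure comes from the T Series/iPSC-class discussion above; the rest is illustrative):

```python
FLOPS_PER_POINT = 8     # simplest acoustic model, per timestep
WORDS_PER_FACE_POINT = 1  # assumed: one word exchanged per boundary point

def ratio(s):
    """Flops per word communicated for an s*s*s subdomain, per timestep."""
    flops = FLOPS_PER_POINT * s ** 3
    words = 6 * WORDS_PER_FACE_POINT * s ** 2   # six faces exchanged
    return flops / words                        # = (4/3) * s

# Smallest subdomain edge that keeps a 1:100 machine balanced.
s = 1
while ratio(s) < 100:
    s += 1
```

Under these assumptions, a node needs a subdomain of edge 75 (about 422,000 grid points) before the arithmetic covers the communication, which is why node memory size, not pipeline count, often limits such machines.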

This illustrates a growing problem in computer design: memory technology lags floating point arithmetic technology by two years or more, so users must await fourfold or sixteenfold increases in commodity memory chip density before the announced compute-intensive ensembles reach the minimum main storage needed for reasonable system balance. The advent of floating point chip sets such as those made by AMD and Weitek has made it easy to add several Mflops of peak arithmetic to a node; supplying the memory capacity and bandwidth to balance that arithmetic remains the harder problem.

The feature that will remain distinctive of compute-intensive designs is that they trade ease-of-use, interactivity, and generality for cost-performance. They are neither “inferior” nor “superior” to general-purpose computers; they simply have a different feature emphasis. The availability of computers that make this tradeoff will continue to bring large-scale scientific applications within the reach of users who would otherwise not be able to afford to do them.
