This document is intended to provide an informal introduction to the basic guiding principles behind most of H. Dietz's research. It also tries to explain why those fish are swimming around on Hank's Home Page.
Since 1987, we have used the term "compiler-oriented architecture" to identify our approach to designing high-performance computer systems. As the name suggests, our approach includes what is now know as "hardware/software codesign," but is based on a much more fundamental principle. Our goal is to take advantage of the fact that static and dynamic mechanisms within a computer system have inherently different abilities:
| Static | Dynamic | |
|---|---|---|
| Mechanisms | Compilers, Assemblers, Linkers, Loaders |
Architecture (hardware), Operating Systems, Runtime support, Firmware |
| How much can be transformed? |
Whole program | Window around the current instruction |
| How accurate is information? |
Probabilities | Exact |
| How much time for analysis? |
Complex, slow, analysis is acceptable |
Analysis is time-critical |
| Is information time-sensitive? |
Only time-invariant properties |
Information can vary over time |
Given these differences, very few aspects of system performance can be managed equally well using either static or dynamic mechanisms. Thus, the trick is simply to attack each problem in the proper domain, and to decompose complex performance problems into static and dynamic components from which a better complete solution can be synthesized.
Funded in part by ONR N00014-91-J-4013, we have developed a wide range of research products by first identifying the information needed to improve system performance and then integrating new compiler techniques and architectural features to obtain and use that information. We find opportunities for improving performance by examining the basic assumptions behind existing practice; we base our solutions on direct measures and techniques, discarding them for approximate methods only after they have been shown to be (currently) infeasible.
The rather odd looking group of "fish" that you saw at the top of
my home page
is actually borrowed from our research group's 1987 tee shirt design.
Because the group was called CARP, Compiler-oriented Architecture
Resarch at Purdue, the fishy logo seemed appropriate; however, you
may have noticed that each fish is actually a set of related words.
The following sections briefly describe what each fish represents.
In scalable parallel computers, one of the most fundamental issues is the basic model of parallel execution. Flynn's characterizations of SIMD and MIMD form the two best known models, with the more recent VLIW and Barrier execution models bridging the gap between SIMD and MIMD. In fact, we view SIMD, VLIW, Barrier, & MIMD as a smooth progression from the most static to the most dynamic parallel execution structure.
Because these parallel execution models differ so dramatically in what
aspects are controlled statically vs. dynamically, compiling a
parallel program to efficiently use a particular parallel execution
model can be very difficult. Likewise, the effectiveness of these
execution models depends critically on the compiler's ability to
generate code that makes good use of the available parallelism.
SIMD, Single Instruction stream Multiple Data stream, parallelism is characterized by the concept of all processing elements (PEs) performing the same operation at the same time, each on its own local data. A very simple hardware implementation can easily scale to thousands of PEs, and coding is simplified in that a SIMD parallel program can be viewed as an essentially sequential program with parallelism "burried" within each operation. However, the SIMD model is also the most static, most contrained, parallel execution model; it is very difficult to find or create the correct type of parallelism in many applications. While it is difficult for a compiler to generate the appropriate parallelism, at least the compiler knows exactly how the parallel execution with proceed, and can accurately predict the effectiveness of a particular encoding.
Although very similar to SIMD, VLIW, Very Long Instruction Word, machines differ in that each PE has the autonomy to perform a different operation independent of the other PEs. However, the grouping of the sets of operations that PEs will execute together is entirely static; the choice of which operations will execute simultaneously is made at compile time. One also hears VLIW PEs commonly refered to as "function units," because some PEs within a VLIW machine may support only certain types of operations.
In any case, the close coupling of compiler code scheduling technology and VLIW architecture is well known.
Although the word "superscalar" seems to have originally been a way to avoid infringing on the Multiflow trademark on "VLIW," the term has grown to be distinguished from VLIW by the fact that instructions are packaged on the fly from the superscalar's sequential instruction stream, while true VLIW instructions are encoded by the compiler and are delivered intact by the memory system. One hears a lot of arguments along the lines that superscalar is better because codes do not need to be changed to run on a machine with a different set of function units, and this is true, but performance can suffer markedly compared to what a recompile would yield. Further, the runtime packaging of operations requires a lot of hardware, and superscalar instruction memory cannot as easily be compressed to reduce the required VLIW instruction memory size and bandwidth requirements.
Barrier synchronization is a mechanism that allows a group of independent, asynchronous, processors to coordinate their actions by delaying completion of each processor's barrier wait operation until all processors have joined in the barrier. Barrier synchronization is thus a symmetric N-way synchronization, originally concieved to ensure that one "phase" of an asynchronous parallel program is completed before any processor begins the next "phase." What is not quite so obvious is that efficient hardware implementation of barrier synchronization allows an asynchronous (MIMD) architecture to duplicate the purely synchronous static timing properties of SIMD/VLIW execution.
The first machine to take advantage of this fact was Purdue's PASM (PArtitionable Simd Mimd) prototype. PASM's asynchronous processors simulate SIMD execution by imposing a crude type of hardware barrier synchronization on each memory access. Since 1987, our research group at Purdue has greatly extended the functionality of the barrier mechanism (e.g., dynamic vs. static barriers, aggregate communications as side-effects), and also have significantly simplified and improved the architecture of the barrier mechanism.
Perhaps more importantly, we have developed sophisticated compiler timing analysis and static scheduling techniques to take advantage of barrier hardware and its ability to freely mix SIMD, VLIW, and MIMD style execution. Sure, you can simply treat a barrier MIMD machine as a SIMD, VLIW, or MIMD; however, the chamelion-like abilities of barrier MIMD machines to change parallel execution model to match the needs of an algorithm can yield dramatic improvements in performance. Put another way, barrier MIMD machines allow the compiler to determine how much asynchrony is appropriate for each program construct and to cause the machine to act accordingly. More asynchrony can help balance load; less asynchrony can reduce overhead of interactions between processors.
MIMD, Multiple Instruction stream, Multiple Data stream, parallelism is characterized by the concept that each processing element is really a processor operating asynchronously and independently. This meshes well with the goal of building parallel supercomputers using conventional processors, or even workstations or PCs, as processing elements.
Traditionally, MIMD has been viewed as more flexible than SIMD or VLIW because it allows a wider range of parallel control flow constructs to be directly implemented; however, MIMD asynchrony yields a multitude of problems that neither SIMD nor VLIW machines evidence. One problem lies in the fact that it can be very expensive for processors within a MIMD machine to communicate with each other, which often results in use of MIMD parallelism unexpectedly slowing down the program because communication overhead exceeded speed-up achieved by parallel execution. The static timing properties of SIMD and VLIW make it much simpler to statically know what parallelism will actually speed-up a program. Another key problem is that the asynchrony of MIMD processors can yield timing faults that depend on the precise timing of a particular program run, thus creating a debugging nightmare.
Thus, static management associated with MIMD machines focusses on predicting and minimizing communication overhead. The debugging nightmare has thus far been managed primarily by either providing languages that are inherently free of timing faults or simply telling the programmer to "be careful." A lot more work is needed on static analysis to aid in debugging MIMD programs.
One of the best ways to achieve modest performance increases is to
decompose operations into smaller units which can be independently
optimized or rescheduled. Both RISC and Pipeline hardware structures
are based on this principle, and require carefully integrated compiler
technology to achieve the full benefit.
The RISC, Reduced Instruction Set Computer, concept simplifies the processor instruction set so that each operation that the hardware implements as a single instruction corresponds closely to the smallest units of computation that the compiler's intermediate form would typically represent. This nearly one-to-one mapping allows the compiler to be far more effective in applying optimizations based on analysis of the intermediate form. At the same time, the simplicity of the instructions tends to yield faster, simpler, hardware implementations.
Pipelining refers to the concept of dividing time-consuming operations into multiple "stages" so that execution of these operations can partially overlap the execution of other instructions. For example, if a memory reference requires 4 clock cycles to complete (i.e., has a latency of 4), pipelining such an instruction with single-cycle stages (i.e., am enqueue time of 1) may allow as many as 3 other instructions to be initiated before the processor needs to wait for the memory access to complete. This also can achieve higher hardware utilization, because the hardware not being used in the current stage of of one instruction can be used by another instruction. However, these benefits occur only if there are instructions available that do not interfere with or need the result of any pending instruction, which leads to a variety of interesting interactions between compiler code scheduling and hardware pipelining.
In some informal sense, we all know that one of the best ways to make
sure that something is available when you need it is to keep it
nearby. In computers, cache and registers are the primary structures
used to take advantage of potential "locality" of both instructions
and data. In truth, registers are not often used to hold instructions,
but there is no reason that they cannot be. Likewise, modern machines
tend to have more than one level of cache, all backed by a
virtual-memory paging mechanism that acts as a cache for the disk.
The goals, and management, of all these mechanisms are closely related.
Cache is a relatively small, fast, memory that is used to hold associatively-addressed copies of recently accessed instructions and/or data. The associativity of a cache is in memory address space; when a processor generates a memory address reference, the cache is first checked to see if it has a more rapidly accessible copy of the object associated with that memory address. If the object is present, access is fast; otherwise, the block containing the object is typically obtained from main memory, copied into cache, and the object is extracted from the cache.
One often hears this described as "transparent" operation -- but it is not transparent when one considers how the cache must be managed. Clearly, caching copies of data read from memory locations that can be accessed by other devices (e.g., I/O ports, DMA channel buffers, shared memory communication with other processors) can yield incorrect results unless some access protocol is imposed. However, even without such complications, the basic reality is that accessing an uncached object through the cache usually takes longer than accessing it directly from main memory. Further, loading a block into cache has only negative effects if that block will not be referenced again before it is flushed from cache. In such a case it is better to bypass the cache. Because only static analysis can predict future cache references, cache management is really a task for the compiler. In fact, development of compiler-driven cache management in 1986-1987 was one of the first results of compiler-oriented architecture.
Unlike cache, processor registers are not associatively addressed, but have their own independent namespace. Thus, it has long been accepted that the compiler should manage the allocation of registers and there are a variety of well known techniques that can be used effectively.
Unfortunately, unlike cache, if two objects are loaded into registers and happen to be aliased with each other (e.g., two registers are independently loaded with values pointed to by two pointers that just happen to reference the same location), changes to one register will not automatically be reflected in the other register. This inability to associatively maintain coherence is really the fundamental limit on the utility of registers.
However, for any objects that are not subject to statically ambiguous aliasing, registers are the prime mechanism. This fact can be used by ensuring that register loads bypass the cache and that register spills dump their data into cache, thus integrating register and cache management. Further, there is no reason why statically allocated "instruction registers" could not be used to more efficiently manage instruction fetch than an instruction cache; after all, instructions are only subject to ambiguous aliasing in self-modifying code, which is relatively rare these days.
This page was last modified July 20, 1995. [an error occurred while processing this directive]