A Thesis
Submitted to the Faculty
Purdue University
by
Michael J. Phillip
In Partial Fulfillment of the
Requirements for the Degree
Master of Science in Electrical Engineering
May 1989
This thesis is dedicated to Mitzi.
I would like to thank my thesis adviser, Dr. Henry Dietz, for his encouragement, expertise, and always open door.
I would also like to thank Dr. Tom Casavant and Dr. Dave Meyer for serving on my committee and providing invaluable feedback.
I thank Dr. H. J. Siegel and Wayne Nation for their input regarding the past, present, and future status of PASM.
A special thanks to Dr. James Kuehn for the many hours of guidance and constructive conversations regarding compilers and Parallel-C.
Finally, I thank Dr. Paul Schneck, Dr. John Riganati, and the research staff at the Supercomputing Research Center for providing the opportunity and environment to develop a suitable thesis topic.
Phillip, Michael J., M.S.E.E.,
Purdue University. May 1989. Unification Of Synchronous And
Asynchronous Models For Parallel Programming Languages. Major
Professor: Dr. Henry G. Dietz.
Most parallel languages can be associated with either a synchronous or asynchronous programming model. Synchronous languages tend to infer semantics from a particular execution model, while asynchronous languages are often difficult to implement efficiently on existing hardware. Each model encompasses a different form of parallelism, but neither is capable of expressing both synchronous and asynchronous parallel constructs.
In this thesis, a number of existing parallel languages are examined with respect to the precise semantics of the parallel constructs and their impact on the type of parallelism that can be expressed in the programming model. From this examination, a unified programming model is derived that allows parallel programs to be mapped efficiently onto both SIMD and MIMD architectures. Fundamental to this model is the separation of synchronization from control flow and communication issues, as well as the capability to allow nondeterminism in parallel algorithms.
As a concrete illustration of the unified model, an explicitly parallel C language, XPC, is proposed. A BNF grammar for the language is given, and the semantics of parallel constructs are informally discussed. XPC extends all communication and control constructs of the C programming language to allow concise specification and management of both synchronous and asynchronous parallelism.
The HTML version of this thesis differs from the official copy in several ways, including a variety of minor formatting changes. The official copy is available through the usual channels for a Purdue MS Thesis. However, this HTML version is complete and very convenient for browsing. Most WWW browsers will even allow you to print this version by automatically converting it into postscript.
This HTML
document was generated using a script that replaces figures
with in-line gif images that are hypertext
linked to full-resolution postscript versions; the
(nK .ps.Z) at the upper
right of each in-line image provides both the size of the
compressed postscript file and the hypertext link. There
are also a variety of hypertext links inserted for
navigation within the document, including a Hypertext Index
at the end of this document (click here to jump to the Hypertext
Index).
As parallel processing systems become more widely available over the next several years, the need for adequate software development environments will become increasingly visible. Programming languages proposed or implemented thus far tend to support a particular type of parallelism and are often closely associated with a particular architecture. Issues such as scalability and portability -- fundamental in sequential languages -- are often ignored in parallel programming models. There is a need to provide the programmer with a model that encompasses many forms of parallelism, while still retaining the "feel" of a more familiar sequential language.
To this end, this thesis proposes a parallel language, based on the C programming language, that will allow programmers to explicitly specify and manage parallelism on a broad class of architectures. A foundation for such a language is built by first examining a number of explicitly parallel languages with respect to the semantics of parallel constructs and the interaction of such constructs during program execution. From this examination, aspects of a unified programming model are derived, and inconsistencies in existing language semantics are exposed. This unified model provides a unique perspective of parallel programming, and identifies language features that are desirable in a general-purpose parallel programming language.
The sequential model of programming has spawned a great number of languages and algorithms, all tied intricately to von Neumann style architectures. With the evolution of parallel hardware has come the introduction of languages that support the specification and management of parallelism. These languages have found moderate acceptance, but remain difficult to use due to a number of shortcomings including a lack of machine independence, a lack of scalability, and a lack of programming support, such as source-level debugging. Hence, the primary challenge of hardware and software designers alike is to provide an environment where parallel programs can be designed, implemented, and debugged by programmers who have both natural and learned tendencies to decompose problems in a sequential manner. Thus far, three approaches have been pursued in an attempt to meet this challenge:
It is interesting to note that all three approaches map the problem of programming for parallel execution into a sequential domain. Synchronous models allow the problem to be decomposed into a series of small steps that may be executed in parallel, while asynchronous models treat the problem as a set of nearly independent sequential tasks that can proceed concurrently. Although the third technique can potentially be applied in conjunction with the first two approaches, the synchronous and asynchronous models have typically been considered to be mutually exclusive.
Clearly, this exclusion is not inherent in the problem itself, but is instead a result of applying a particular set of constraints to the formation of algorithms that map the problem onto a particular class of machines. This view has brought about an important dilemma: if a solution contains both synchronous (SIMD) and asynchronous (MIMD) parallelism, and the underlying hardware can support both models, how can the programmer specify both types in a single programming language?
Implicit in the statement of this dilemma is an assertion that many current parallel architectures can, in fact, support both types of programming models. It is ironic, then, that the emphasis in discussions of parallel programming models is most often placed on whether a synchronous or asynchronous model should be chosen. If parallelism is the only way to achieve higher effective speed, it is no longer acceptable to discard one flavor of parallelism simply because the programming language uses a different model of computation.
In order to better understand how the synchronous and asynchronous constructs differ, a brief description of the machine models most closely identified with each language type is presented below. A number of other hardware configurations exist in parallel architectures, but the SIMD and MIMD models provide the most natural division for the discussion of programming models to be presented in this document.
As originally defined [Fly66] SIMD refers to a Single Instruction stream, Multiple Data stream model of execution. In such a model, only a single sequence of instructions is present, but each instruction may be simultaneously executed in an arbitrary set of processors, with each processor operating upon its own data. A typical block diagram for a SIMD machine is shown in Figure 1.1.
Figure 1.1. SIMD block diagram
Typically, program instructions are fetched and decoded by the control processor (CP). Operations to be performed in parallel are then broadcast to those processing elements (PEs) that have been enabled for the current instruction. Processors that are disabled must remain idle while the instruction is executed in enabled PEs. The parallelism in a SIMD model is at the lowest level of the program hierarchy, or the instruction level. Thus, SIMD machines can efficiently implement fine-grain parallelism. This efficiency is due to the "perfect" scheduling of operations that is possible in a SIMD model; no interference occurs between processing elements as the program is executed.
A Multiple Instruction stream, Multiple Data stream (MIMD) machine is characterized by several sequences of instructions, or processes, that are distributed in some manner on a set of processors. These instruction streams may differ in both content and structure, and can effectively be viewed as being executed by a set of uniprocessors that are able to communicate with each other through use of special software and hardware support. A generic block diagram for a MIMD model is illustrated in Figure 1.2.
Figure 1.2. MIMD block diagram
The means of communication between processors is generally used to conceptually divide MIMD architectures into a number of subclasses. For example, some memory models have a shared address space to facilitate communication, while others must send messages through a network. These communication techniques are also used in SIMD execution models, but are typically "hidden" from the programmer. Thus, since processes need not be executing the same instruction at the same time, and since communication uses up valuable execution time, large-grain parallelism is generally preferred in the programming model. The delays due to communication are also present in SIMD execution models, but the relative timing between the participating processes can be determined statically.
In this document, the emphasis is not placed on a competitive evaluation of synchronous and asynchronous programming models; clearly, both encompass desirable forms of parallelism. Instead, the goal is to examine the impact of several synchronous and asynchronous language features on the ability to combine desirable aspects of SIMD and MIMD machine models. The divergence of SIMD (synchronous) and MIMD (asynchronous) programming models has by no means been accidental, but rather the result of differing design goals and implementation constraints.
A SIMD model of execution leads to hardware that is relatively straightforward to build and reasonably easy to understand. Most SIMD machines simply extend von Neumann architectures to allow simultaneous operation of multiple functional units. As a result, most synchronous languages have been built around a particular machine, providing a near optimal match between language constructs and the underlying hardware. Examples of such language/machine pairs include GLYPNIR [LaL75] / Illiac IV [BaB68] , Vector-C [Li84] / Cyber 205 [Lin82] , Parallel-C [KuS85b] / PASM [SiS86] , and C* [RoS87] / Connection Machine [Hil85] . In each of these cases, at least some aspect of the language semantics have been inferred from the execution model.
A MIMD model, by contrast, is theoretically much more powerful, but is often more difficult to implement in hardware. This is especially true if communication is supported with shared memory hardware. Asynchronous languages tend to be designed as tools for examining abstract theories of computation; hence, they often do not map well onto parallel hardware. In many respects, the SIMD model is actually a proper subset of the more general MIMD model.
As indicated above, explicitly parallel languages tend to fall into one of two classes based on the execution model of the target architecture. These classes have generally been considered incompatible, and tend to favor certain types of applications and specific modes of parallelism. This document seeks to identify common features of both language classes, with the goal of specifying a unified programming model that can be applied to a broad range of parallel algorithms and machines. The approach taken is to first examine existing languages from the perspective of the programming model, rather than a particular implementation. Although descriptions of these languages have appeared elsewhere, discussion typically revolves around the language/machine pair, thus obscuring issues related to the underlying programming model. When taken out of the context of a particular machine environment, inconsistencies in the language semantics often appear, making it difficult for the language to be implemented and used on other systems. The discussion of existing languages is not intended to be comprehensive, but does attempt to describe in some detail those constructs that affect the expression and use of parallelism. The manner in which synchronous and asynchronous language classes are compared is believed to be unique, as is the attempt to unify the two programming models.
Chapter 2 surveys four synchronous programming languages with respect to the semantics of parallel constructs that support computation on SIMD architectures. The survey not only describes the salient features of each language, but also points to inconsistencies in the model of parallelism used within individual constructs. An attempt has been made to divide constructs into several categories, based on the impact each category has on parallel execution. Thus each language discussion contains subsections covering the expression of parallelism, the use of conditional and iterative constructs, and communication between processors within the system, as well as a brief summary of the strengths and weaknesses of the model.
Chapter 3 addresses asynchronous languages, ranging from message passing models to shared memory paradigms that suppress explicit naming of processes. As in Chapter 2, the survey attempts to focus on semantics of constructs that are desirable in a unified programming model. Specifically, subsections cover the specification and management of processes and the means by which processes communicate with one another.
The synchronous and asynchronous models examined in the previous chapters are merged into a unified parallel programming model in Chapter 4. This unification takes place by extracting language features from previously described languages and providing a possible set of more general semantics for parallel computation. The key to the specification of a unified programming model is the separation of interdependent characteristics that are often inherently linked in synchronous or asynchronous models. Thus, issues such as synchronization and communication are treated separately, leading to a discussion of determinism, a fundamental aspect of parallel programming models.
Chapter 5 presents a concrete illustration of the concepts proposed in Chapter 4 by describing a new parallel language, XPC. The language is similar in style to many of the languages previously discussed, but the semantics of constructs are independent of the execution model of the target architecture. The language is also unique in that it is intended to be compatible with current compiler technology, including automatic parallelization.
Finally, Chapter 6 summarizes the work presented in this thesis, and outlines future research directions in the area of parallel language development.
As described in the previous chapter, most explicitly parallel languages are associated exclusively with either a synchronous (SIMD) or asynchronous (MIMD) model of execution. Constructs added to synchronous languages to support parallel execution can be divided into two classes: constructs that deal with the expression of parallelism, and those that handle assignments and communication between processors. This chapter will examine both classes of constructs for a series of four synchronous languages.
Synchronous languages allow parallelism to be expressed in the following ways:
1. variable declarations (data structures and data layout) 2. enabling/disabling of PEs (algorithm topology)
Variables are usually declared as either being distributed among all processors or local to the control processor; the keywords differ from language to language, but the meaning is the the same. For example, Glypnir uses the descriptor CU to define data and variables that reside in the control unit, and PE to designate parallel variables for the processing elements. In Parallel-C, the keywords for such declarations are scalar and parallel, respectively.
Algorithm topology refers to the manner in which a program maps onto a parallel execution model, including the control flow and data flow of an algorithm across many processors. Ideally, algorithm topology is independent of machine topology, at least within a given class of parallel architectures.
Communication between processors is deceptively straightforward in synchronous languages. Given that data in SIMD machines is classified as being either scalar or parallel, most synchronous models specify semantics governing each combination of data classifications. Some languages only allow certain types of assignments; others attempt to support arbitrary communication between all processing elements, including the control processor. Inherent in all such semantics, however, is the provision for synchronization within communication constructs. It is this linkage of potentially independent issues that prevents most synchronous languages from being implemented on other classes of architectures, and often introduces inconsistencies into the programming model.
Glypnir [LaL75] was developed as a programming language for the Illiac IV parallel processing system [BaB68] . At the time of its development, compiler technology could not provide the optimization necessary to allow both efficient execution and topology independence. Thus, a Glypnir program specification is closely associated with the Illiac IV hardware. The Illiac IV consists of 64 processing elements (PEs) and a single control unit (CU). Each processing element may directly access 2048 words of memory, and may indirectly access any address within the system. Interprocessor communication can take place through an interconnection network that allows PEs to directly communicate with other processors that have logical distances of +-1 or +-8.
Glypnir allows parallelism to be expressed through variable declarations and the use of conditional and iterative constructs. Most standard variable types are supported in Glypnir, including real (floating point), integer (fixed point), and alpha (character). Aggregate variables of these basic types can be defined with the vector keyword, and the pointer directive allows memory indirection to be expressed in conjunction with any of the above types.
In order to support parallelism at the level of variable
declarations, Glypnir uses the descriptor cu to define data
and variables that reside in the control unit, and pe to
designate parallel variables for the processing elements.
The format of these declarators is as follows:
/* scalar variable in control unit */
cu integer i
/* parallel array in PEs */
pe real vector v[10]
Parallel variables in Glypnir are of a fixed size and extent, resulting in parallel data layout that closely resembles the Illiac IV topology. These data structures are referred to as swords (super words) and slices, representing one word of data from each of the 64 processing elements of the Illiac IV. A group of 64 words, each from a different PE, but with the same address within that PE, is called a sword. A slice is similar to a sword, but the addresses within a PE need not be identical. Thus, the variable v above actually declares a vector of 10 swords, or, equivalently, a two-dimensional "array" (64 x 10) of integers. If the vector is indexed with a cu variable, the resulting value is a sword. If the index is instead a pe variable, the result is a slice; each PE uses its own value of the index to access an element of the vector.
An additional data type, boolean, is provided for use in conditional expressions, which are described in the following section.
In Glypnir, conditional constructs have a style similar to
those found in Algol, with the semantics extended to support
control of identical parallel instruction streams. The
format of these statements is as follows:
if <BE> then <S1> else <S2>
for all <BE> then <S1>
go to <LABEL>
The notation <BE> represents a Boolean expression consisting of one Boolean value for each of the 64 PEs in the machine model of the Illiac IV. Thus, all conditional constructs in Glypnir are treated as parallel statements, with each PE evaluating its own element of the Boolean expression. The semantics of these constructs specify that statement <S1> is to be executed in those PEs where <BE> is evaluated as being true, followed by execution of statement <S2> (when present) in PEs where <BE> is determined to be false. The semantics of the for all statement are identical to those of if-then; the statement simply provides an alternative set of mnemonics. Nested conditional statements are supported by means of a run-time execution stack that contains a properly nested sequence of the Boolean expressions encountered at each level of the conditional hierarchy. The Boolean expressions can therefore be viewed as enable patterns, and each level of nesting results in a logical AND operation of the current Boolean expression with the inherited enable pattern. While this semantic interpretation is not completely general for parallel computation, it is analogous to the semantics of conditional constructs in a sequential programming model. Glypnir also addresses the issue of go to statements inside of conditional constructs, presenting a solution that is consistent, but incomplete. Whenever a go to statement appears in the body of <S1> or <S2> of a conditional statement such as those above, a potential problem arises.
Consider the following code sequence (where x, y and max are
sword variables):
if x > y then begin
max <- x
go to LABEL1
end else begin
max <- y
go to LABEL2
end
This would appear to imply that some PEs branch to LABEL1 while others branch to LABEL2, thus violating the imposed SIMD model of execution. To resolve this apparent conflict, the Glypnir semantics require that all PEs execute the go to statement in <S1> if any PE does. Therefore, if the expression x > y is evaluated to be true in any of the 64 PEs, then all PEs will branch to LABEL1 after assigning the variable max equal to the value of x. It is not clear whether this assignment takes place in all PEs or only in those where x > y, but it is clear that statement <S2> in the else clause will not be reached in any PE. As a result, these semantics can lead to confusing (but predictable) code segments that have no analog in a sequential language. As will be shown later, these semantics are overly restrictive, and can be relaxed to allow a more general set of conditional constructs to be specified.
Another problem arises when the target of a branch is located inside of a conditional or iterative construct. For instance, in the above example LABEL1 may actually refer to code in the body of another if statement. It is difficult to infer the semantics of such an occurrence from the Glypnir specification.
Glypnir provides a number of iterative control constructs,
each extending the semantics of conditional statements to
support repeated execution of a block of code. The
following are valid iterative statements in Glypnir:
loop i <- i1, i2, i3 do <S>
thru i do <S>
for i <- i1 step i2 until i3 do <S>
do <S> until <BE>
while <BE> do <S>
Like the Fortran do loop, the Glypnir loop statement performs a comparison of the control variable (i) with a conditional limit (i3), with the statement <S> being executed repetitively until the condition i > i3 is met. Unlike Fortran, however, the conditional test is performed before the loop statement is executed. The control variable (i) must be a cu integer variable, while the initial value, increment, and limit (i1, i2, and i3, respectively) can be arithmetic expressions that yield a single value - swords and slices are disallowed. The Glypnir for statement is similar to the loop statement, except that the control variable, initial value, increment and limit may be swords. As a result, the statement <S> may be executed a different number of times in different PEs, with each PE using its own values of the control variable, initial value, increment, and limit.
The thru statement repeats execution of the statement <S> for the number of iterations specified by the cu variable (or constant) i.
The do and while statements in Glypnir allow explicit parallelism in loops by requiring that a Boolean expression be specified as the conditional test. A do loop in Glypnir will repetitively execute the statement <S> until the Boolean expression <BE> is false, and the while loop will execute <S> as long the Boolean expression is true. Thus, the statement <S> in a do loop will be executed one or more times, whereas the body of a while loop is executed zero or more times.
Assignment statements in Glypnir are relatively straightforward, since all parallel variables must be of the same extent. Arbitrary communication between PEs is supported through the use of pointers, which may either be local to a specific memory module or global to the system memory. Pointer variables may be either a single cu pointer or a sword of pe pointers. A number of allocation and deallocation statements are provided in the language to support the use of pointers.
Glypnir was an early attempt to provide high-level programming support for a particular parallel architecture. Clearly, an engineering approach was used in the design of the language, since most tradeoffs in the language favor a practical implementation over conceptual elegance in the programming model. Perhaps foremost of these tradeoffs was the choice to generate the most efficient code that compiler technology was capable of at the time, at the cost of losing topology independence. No attempt is made to differentiate between the programming model of Glypnir and the Illiac IV machine model. The lack of parallel data abstraction is a well-documented problem with the language. There are also problems with the control flow model, as described above, but the language does attempt to provide a complete, abstract control structure.
Communication between processors is highly machine- dependent, but the inclusion of pointers as a means of interprocessor communication provides a bit of foreshadowing for subsequent languages based on the C programming model. Of course, Glypnir pointer variables must be identified at the point of declaration, thus avoiding some of the ambiguous aliasing issues present in C, where the address of any object may used in expressions and assignment statements.
Glypnir provides many insights into the fundamental issues of language specification for parallel architectures. Although there are flaws in the language model, it remains a valuable starting point in the study of synchronous programming languages.
A few years after the introduction of Glypnir, Actus [Per79] was proposed as a general purpose synchronous programming language for a number of SIMD machines. The primary goal in the development of Actus was to provide an explicitly parallel programming language that was independent of both the underlying machine architecture and existing parallel algorithms. Instead, Actus allows the programmer to concisely specify the parallelism of a problem, provided that the problem can be specified within the bounds of a synchronous programming model. The language is based on Pascal, with strong typing and structured programming constructs extended to support parallelism.
In Actus, only array data structures can be declared to have
any extent of parallelism, although the parconst keyword is
included to declare parallel constants. An Actus array may
contain both serial and parallel dimensions, as illustrated
below:
const
P = 100;
S = 10;
var
(* parallel *)
pvar : array [1:P] of integer;
(* scalar *)
svar : array [1..S] of integer;
(* mixed *)
grid : array [1..S, 1:P] of real;
In this example, P denotes the maximum extent of parallelism that may be assumed by the variables pvar and grid.
Parallel arrays may be defined for any data type in the language, including structured types such as Pascal records and sets. The use of the ":" character signifies that the associated index is to be interpreted as the parallel dimension of the array. Actus allows only a single dimension of parallelism to be expressed for an array, although no such limitation is placed on the number of scalar dimensions permitted. Thus, Actus provides a means of defining all Pascal variables and types for a one- dimensional array of processing elements. The conceptual view of parallel data structures is quite similar to that presented in the Glypnir programming model. Unlike Glypnir, however, the extent of parallelism in Actus is not tied to any particular machine.
Actus introduces the notion of
index sets as a mechanism to specify the extent of
parallelism in statements and expressions. Through the use
of index sets, the extent of parallelism can be changed each
time a data structure (array) is referenced. Index sets are
declared much like ordinary variables, but must specify
constants as lower and upper bounds. Processors may be
selectively omitted from an index set by taking the union
(+), intersection (*), or difference (-) of two or more
index sets. Regular patterns may be expressed through the
use of square brackets and a constant increment. The
following example illustrates the declaration of index sets
in Actus:
index
(* odd PEs from 1 to 99 *)
odd = 1:[2]99;
(* PEs 6 to 8 and 25 to 32 *)
subset = 6:8 + 25:32;
In many respects, Actus is similar to Glypnir in its
handling of conditional constructs; both languages derive
their semantics from sequential counterparts of similar
appearance. The following are valid conditional constructs
in Actus:
if <E> then <S1>
if <E> then <S1> else <S2>
case <CS> of
<L1>: <S1>;
<L2>: <S2>;
...
<LN>: <SN>
end;
The key difference between the Actus semantics and those of
Glypnir involves the specification of the conditional
expression. In Actus it is not necessary that the
expression <E> be evaluated in all PEs. In fact, the
expression need not exhibit any parallelism at all. If the
conditional expression is scalar, however, the extent of
parallelism for any variables in the body of statements
<S1> and <S2> must be explicitly specified or
inherited from an enclosing construct of known extent. This
inheritance can take place either through a nesting
mechanism similar to that of Glypnir, or by use of a within
construct. The within construct has the form
within <EOP> do <S>
where <EOP> represents the extent of parallelism for the statement <S>. When the conditional expression is parallel, it is evaluated for each element present in the current extent.
Actus does not directly address the issue of goto statements appearing inside the body of conditional constructs, opting instead to disallow their use. This solution is certainly self-consistent, but results in some loss of generality in the programming model.
Actus provides three iterative constructs, corresponding to
the Pascal while, repeat, and for statements.
while <E> do <S>
repeat <S> until <E>
for <INDEX> := <LOWER> by <INC> to <UPPER> do <S>
As with the conditional constructs, the extent of parallelism appearing within the body of iterative statements must be explicitly specified or inherited whenever the conditional expression <E> is scalar. When the conditional expression contains parallel variables, the body of the iterative statement takes on the extent of the conditional expression. Thus, the while loop becomes more flexible than the repeat-until loop in Actus due to the fact that the conditional expression in the while loop is evaluated before execution of the statement body, allowing the extent of parallelism in the loop to vary over iterations. As in Glypnir, the extent of parallelism may never increase - once a processing element drops out of a loop it must remain idle until all other active processors have completed execution of the loop.
The semantics of the for statement differ somewhat from those of the previously described iterative constructs. In an Actus for loop, each of the parameters (index, lower bound, increment, and upper bound) may be parallel, provided that the extent of parallelism is specified or inherited. By allowing the index to be parallel, the for loop may be transformed into a number of potentially independent loops executing in parallel, each performing a different number of iterations. This interpretation is similar to the Fortran doall construct, which is included in a number of parallel implementations of Fortran.
Assignment statements in Actus must involve a single extent of parallelism that is less than or equal to the maximum extent of the variables used within the statement. The set of selected processing elements may vary on the left and right sides of the assignment, but the extent of the variables and expressions must be the same.
Thus, data alignment may take place in assignment statements
through use of the shift and rotate operators. The usage of
these operators is as follows:
<EOP> shift <INT EXPR>
<EOP> rotate <INT EXPR>
where EOP> is the extent of parallelism of the variable
or expression, and <INT EXPR> is an integer
expression. The following example illustrates the use of
data alignment in Actus assignment statements.
var
x: array [1:10] of integer;
index
part = 3:5 + 7:8;
...
x[part] := x[part shift -2] + x[part rotate 1]
In this example, element 3 receives the sum of elements 1 and element 4; element 8 receives the sum of elements 6 and 3, and so forth.
Of all the languages considered in this chapter, Actus provides the cleanest and most self-consistent specification. Unfortunately, it does so by imposing constraints that limit the effectiveness of the language for the broad class of problems that cannot be described in terms of parallel data structures with a high degree of regularity. The control flow model is fairly restrictive in Actus, since goto statements are disallowed and the semantics of conditional constructs impose a particular order of evaluation for expressions.
One of the important contributions of Actus is the use of index sets as a means of aligning data. In Actus, the extent of parallelism is always specified as part of the construct. By providing a concise method of specifying interprocessor communication without the arbitrary use of pointers, Actus is able to retain much of its expressiveness without compromising its goal of strong data typing. The notion of index sets is extremely useful in a general parallel processing model, as will be seen in later chapters.
Parallel-C [KuS85b] , as originally proposed, was a superset of the C programming language that included extensions to support both synchronous and asynchronous language constructs. In particular, Parallel-C was intended to provide an explicitly parallel high-level language for the PASM parallel processing system. The original language specification includes syntax for synchronous constructs, and a discussion of suggested approaches for asynchronous constructs.
PASM [SiS86] is a Partitionable SIMD/MIMD computer consisting of a number of control processors and processing elements. The system may be dynamically reconfigured to allow either synchronous or asynchronous execution of parallel programs.
The style of earlier languages such as Glypnir and Actus became the basis for a number of explicitly parallel synchronous languages. Languages such as Lucas [Fer82] and Parallel-C take the notion of index sets one step further, allowing arbitrary selection of parallel shape and extent through the use of selectors. In Parallel-C, the keyword selector acts as both a type specifier and a storage class by defining a "vector of bits" that corresponds to the set of processors that may participate in parallel statements and expressions. A selector variable can be defined for any shape or extent of parallelism, and becomes the primary method of selection when used in conjunction with parallel variables. As with index sets in Actus, selectors may be combined with set operators such as union, intersection, and difference to form more complex shape descriptors. Unlike Actus, Parallel-C allows any variable to be parallel, requiring only that the shape and extent of the variable be specified at declaration.
Parallel-C extends all conditional constructs of C to a
SIMD programming model. As in Actus, conditional
expressions may be either scalar or parallel, with run-time
evaluation determining the extent of parallelism. The
format of the Parallel-C conditional constructs is as
follows:
if (<E>)
<S1>;
else
<S2>;
switch (<E> {
case <C1> : <S1>;
case <C2> : <S2>;
...
default: <SN>;
}
goto <LABEL>;
In a strict SIMD environment, the semantics of the Parallel- C if-then-else construct closely resemble those of Actus. If the expression <E> is parallel, the statement <S1> is executed in those PEs where <E> is evaluated to be non-zero. All other PEs are disabled, and must wait for <S> to be completed in all enabled PEs before proceeding with execution. ( In some cases it may be preferable to execute the body of an if statement and the corresponding else statement concurrently in an asynchronous manner, provided that there are no unacceptable side effects of one statement body on the other. Parallel-C does not prohibit such an occurrence, but nor does it specify the exact semantics of asynchronous (MIMD) operation. )
Nested if-then-else statements are handled by maintaining a run-time stack of PE address masks generated through evaluation of the conditional expression. As in Actus, the set of enabled processors for nested conditional constructs must always be a subset of those enabled in the enclosing conditional statement. Unlike Actus, however, Parallel-C does not require that the extent of parallelism within the body of a conditional statement match that of the conditional expression.
The switch statement in Parallel-C is significantly more complicated than its counterpart in Actus, due to the possible (but not necessary) inclusion of break statements within a case statement. In essence, break statements act as a special instance of a goto statement, where the destination of the branch is the textual point immediately following the closing brace of the switch statement. In a strictly synchronous model of execution, each statement block must be executed separately; PE masks must be generated for each case statement before any of the case statements are actually executed in any of the processing elements. Since some case statements may not contain break statements, a logical OR operation of two or more PE masks may need to be performed to attain the proper set of enabled processors for a particular case statement. Likewise, the PE mask for the default statement, if present, is the complement of the result obtained by performing an "or" operation of all prior case statement masks.
At first glance, these semantics appear to be consistent with a synchronous model of execution. A closer look, however, reveals some subtle inconsistencies between the semantics of if-then-else constructs and those of a switch statement. The fact that processor masks are generated before execution of any case substatements means that each substatement is executed no more than once. Execution can conceptually pass through multiple case statements, each consisting of a different shape of parallelism. The flow graph associated with a particular switch statement could also be generated with a series of if-then and goto statements. However, the semantics of the latter code sequence would prevent the execution of multiple case statements, since no provision for changing the shape of parallelism within the if-then statement is supported.
The following three iterative constructs have been extended
to support explicit parallelism in Parallel-C:
while (<E>)
<S>;
do
<S>
while (<E>);
for (<E1>; <E2>; <E3>)
<S>;
The semantics of these constructs are basically the same as those of the conditional constructs, with the statement <S> being executed in those PEs where the conditional expression is evaluated to be true. As in previous languages, the set of active processors may not increase over the course of execution. The body of the do-while loop is always executed at least once in all PEs that encounter the loop, whereas the statements in the while and for loops are executed zero or more times, depending on the initial evaluation of the conditional expression.
Synchronous languages have taken quite different approaches to the specification of semantics for assignment statements. Languages such as Glypnir and Actus require that a single extent be associated with any assignment statement. Parallel-C relaxes this constraint somewhat, defining semantics for the following cases:
These semantics have several implications on the model of execution. First, the architecture must have some means of promoting a scalar value to a parallel value of arbitrary shape. This process usually involves some form of global broadcast from the control unit to the selected processing elements. Second, the machine must have a path from each PE to the control unit in order to facilitate parallel to scalar assignments. The Parallel-C semantics for this case are fairly restrictive, yet they do not adequately define this type of assignment. Consider the case where a parallel value returned from a function call is used as the selector in an assignment expression that has a scalar variable on the left side. In such a case it may not be possible to determine at compile time how many elements will be selected in the parallel value, thus the scalar result must be assumed to be undefined. Of course, an explicit check can be made by the programmer, or a warning message can be generated during the flow analysis phase of compilation, but neither of these options appears to be a desirable solution. Finally, the semantics governing assignments between parallel variables disallow implicit communication between PEs by requiring that the assignment take place only in those PEs selected on both sides of the expression. An exchange of data between PEs must take place by means of system function calls that can perform the desired permutations. This is actually quite analogous to the sequential C programming model, where no standard I/O functions are included in the language definition. The semantics also imply that a pointer may not be assigned to values outside of the memory space associated with the processing element in which it is defined. This is a subtle, but important, point.
Casting the type and shape of parallel variables is supported in the language model of Parallel-C, but the semantics associated with such an operation appear to be inconsistent with the rules governing assignments. Without specifying exactly how the reshaping takes place, however, it is impossible to determine whether data is actually moved between processors.
Parallel-C attempts to provide explicitly parallel constructs that can be used in both a synchronous and asynchronous execution environment. Most of the actual discussion of the language, however, involves only the synchronous constructs. The use of selectors in parallel expressions is an elegant means of specifying parallel processes, and can likely be incorporated into a unified programming model. Unfortunately, the control flow semantics in the language are inconsistent at times, and do not appear to be extendible to an asynchronous execution model.
The communication model used in Parallel-C requires that assignments between parallel variables of differing shapes take place only in elements selected on both sides of the statement. Other assignments can still take place, but must use library or system calls to perform the permutation. This conservative approach may not yield the most general programming model, but it does simplify the implementation of the language considerably.
C* [RoS87] was proposed and implemented as a parallel extension of C for the Connection Machine [Hil85] . Its programming model is intended to allow easy manipulation of "data-level parallelism" through use of special parallel data structures and operators. The Connection Machine consists of a large number of simple processing elements that are effectively connected in a "hypercube" configuration. Access to the machine is provided through an attached front-end host that acts as the control processor for the system. Functionally, the Connection Machine behaves as a SIMD model, although C* is intended to support both large and fine-grained parallelism.
As in other synchronous languages, C* distinguishes between
variables and data that reside in the front end, or control
unit, and those which are allocated in the processing
elements. C* uses the terms mono and poly, respectively,
but also introduces the concept of domains. A domain is a
structure type that describes the memory layout of a data
processor; its syntax closely resembles that of a class in
the C++ language [Str86] . By definition, all code that
belongs to a domain is parallel, while all other code is
serial. Poly variables are only parallel if accessed in
parallel code. In C*, only code that is activated within a
domain may be executed in parallel. This activation is
performed with a selection statement, as defined below:
[domain tag] . statement
The selection statement activates each instance of the specified domain, allowing parallel execution of the subsequent statement. When the statement is completed, all instances of the domain are deactivated.
The control flow model in C* is perhaps the most complex (and general) set of semantics yet proposed for control flow in an explicitly parallel language. All standard C conditional constructs are supported in C*, although the goto statement has not yet been implemented.
The C* programming model provides a single locus of control, referred to as the "master PC." The "master PC" is analogous to the program counter in a sequential architecture, and is consistent with the synchronous (SIMD) model of execution presented by the Connection Machine. In parallel code (e.g. code invoked within a domain), each data processor selected by the current domain is also assigned a "latent PC" that is designated as being either "active" or "waiting." An active latent PC is always at the same point in the code as the master PC, while a waiting latent PC is elsewhere in the program. The desired effect of this model is that parallel code should be executed as if each processor is executing a separate copy of the code, but with all processors remaining synchronized. To this end, C* proposes a number of rules to govern the semantics of conditional and iterative constructs, including the Rule of Merging Control and the Block Synchronization Rule.
The Rule of Merging Control asserts that whenever the master PC passes a point during execution where one or more latent PCs are waiting, all such latent PCs become active and continue execution with the master PC.
The Block Synchronization Rule is the heart of the C* control flow model. This rule states that whenever the master PC reaches the end of a statement or substatement, every active latent PC becomes inactive and must "wait" at the point immediately following the statement. At this point, execution is halted and the master PC is transferred to a new point in the code. The new point is selected by examining the just completed substatement and then searching outward through the lexical hierarchy of nested statements until a waiting latent PC is found. Control is then transferred to this point via the master PC and execution continues.
In C*, iterative constructs are handled exactly as conditional constructs, by invoking the Block Synchronization Rule. As in Parallel C, all standard C constructs supporting repetitive evaluation of a conditional construct are supported. The set of active processors participating in a loop may only decrease monotonically.
C* attempts to provide a more general set of assignment semantics, but still tends to infer too much from the execution model of the target architecture. Assignment semantics in C* can essentially be summed up in two rules: the Replication Rule and the As-If-Serial Rule.
The Replication Rule states that a scalar value is "automatically" replicated where necessary to form a parallel value. The word "automatically" again implies the presence of a broadcast mechanism in the hardware, a rather reasonable assumption shared by Parallel-C (and many other parallel languages).
The As-If-Serial rule asserts that some sequential ordering will be applied on an operator-by-operator basis for every active processor in a statement. In other words, operations on poly (parallel) variables take place as if the active processors performed the operations in some unspecified (and unpredictable) order. These semantics are critical, for example, when performing reduction operations where a number of processors are updating a single mono (scalar) variable. The As-If-Serial rule is said to facilitate parallelism by imposing sequential semantics. While this may seem intuitively true for cases where the left-hand side of an assignment is a mono variable, it less clear in the case of parallel assignments. Application of the As-If-Serial rule also brings up an interesting case of broken symmetry, where statements that would be equivalent in a sequential language lead to different results when the operands are parallel values.
The C* semantics for assignment statements differ from those presented in Parallel-C in two important ways. First, assignments from parallel to scalar variables are supported through invocation of the As-If-Serial Rule. Thus, C* allows multiple reduction and multiple broadcast operations to take place among values of arbitrary shape. The second difference involves the use of pointer indirection as a means of interprocessor communication. In C*, the statement
*p = *q;
sends a "message" containing the value stored at location q to location p. Although this appears to be a desirable feature of the language, it infers a great deal from the underlying machine model.
C* makes an ambitious attempt to specify a completely general synchronous model of computation. The language semantics are fairly self-consistent, with the possible exception of some of the control flow constructs, as described above. The primary difficulty with C* involves the inherent SIMD synchronization present in the semantics of the communication model. By specifying that operands to assignments are evaluated on each side of the statement for all instances before any assignments are made, C* prevents asynchronous execution from ever taking place. If an ordering of evaluations is desired, the programmer could simply specify appropriate code to accomplish what the As- If-Serial Rule requires.
For instance, the assignment:
x = y;
where x and y are defined as poly variables, will first evaluate x in each active processor, then evaluate y in a similar manner, and finally make the appropriate assignments.
Instead of requiring such an ordering, the same effect could
be attained by replacing the above assignment with:
tmp = y;
x = tmp;
The less restrictive semantics needed for this second case are preferable in a general parallel programming model.
Based on the above discussion of synchronous languages, a couple of observations can be made concerning the expression of parallelism through variable declarations. First, variable declarations should provide a means of "hiding" the machine topology by directly expressing the data relationships of the algorithm to be executed. This characteristic is especially important if the program is to be scalable and portable. The second observation involves the use of domains, as in C*. Although domains are a reasonably elegant way to describe the memory layout of processors, they are not (as currently specified) easily extended to an asynchronous programming model.
One of the problems inherent in all of the control flow models discussed thus far is the failure to differentiate between three basic attributes of parallel computation. In a sequential language, semantics need only specify how a construct is to be interpreted in the context of a given expression or statement. In a parallel language (explicit or otherwise), the semantics must specify not only how, but also when and where a construct can (or cannot) be applied. It is hardly a coincidence that a compiler needs exactly this information in order to perform automatic parallelization.
In most synchronous programming languages, the semantics for conditional and iterative constructs provide a reasonably consistent definition of how and when a particular construct is to be executed, but they do not always address the issue of where the construct may be applied. Consider, for example, the goto statement. Most languages either forbid the use of such a statement in parallel code (e.g. Actus), or restrict the control flow model in some asymmetric way (e.g. Glypnir).
Languages such as Parallel-C and C* attempt to define a more
general control flow model, but fail to specify self-
consistent semantics for arbitrary control flow patterns.
For instance, the C* language specification does not
restrict the use of a goto statement within parallel code,
yet it is unclear what would result of a branch from the
body of one conditional statement to that of another,
especially if the set of active processors differs in the
two code sequences. The C* code sequence of Listing 2.1
illustrates this point.
extern int random();
domain grad_student {
short salary;
double hours;
int knowledge;
int passed_qualifier;
mono char *poly adviser;
} Jim, Mike, Scott, Carla, Todd, Greg;
[domain grad_student].{
TEST: passed_qualifier = (random() & 1);
if (!passed_qualifier) {
knowledge = 0;
goto STUDY;
} else {
knowledge += 100;
}
while (knowledge < 1000) {
++knowledge;
if (knowledge > 500 && !passed_qualifier) {
goto TEST;
}
STUDY: hours += 2.0;
}
}
Listing 2.1 Arbitrary control flow in C*
Admittedly, this is a rather contrived example, but it could be executed on a sequential machine. In a parallel model of execution, however, the set of processors that execute the if substatement will likely differ from those enabled in the body of the while statement. In this case, the C* control flow model appears to support contradictory semantics. The semantics for iterative and conditional constructs state that the set of active processors for each iteration can only decrease monotonically, yet the Rule of Merging Control asserts that the latent PCs "waiting" at the STUDY label will become active as the master PC passes that point in the code. Clearly, an ambiguity results. A similar situation could result when a goto statement branches from one domain to another, although the latent PCs that exited the domain via the goto statement would presumably be "destroyed" upon completion of the code block containing the goto statement (as a result of the Block Synchronization Rule).
Although virtually all synchronous languages share the problem illustrated above, the means of specifying parallel data in such models can be abstracted to a more general model where no distinction is made between a process and the processor on which it executes. This concept will be explored in greater depth in Chapter 4.
While synchronous languages tend to express parallelism through data layout and algorithm topology, asynchronous languages focus on processes and the means of interprocess communication.
Two basic methodologies have been developed to deal with asynchronous parallelism, one based on the model presented in CSP [Hoa78] , and another that is often referred to as a single program, multiple data (SPMD) model. Examples of each language type will be considered in this chapter, including Occam, Concurrent C, Linda, and the Force.
Although the two models differ somewhat in style and implementation, both share a common means of expressing parallelism through asynchronous constructs:
Unlike synchronous constructs, most of the problems with asynchronous constructs lie in the difficulty of efficiently implementing the language model on existing parallel machines and in the fact that static analysis of program behavior is both difficult and of limited use.
Occam [Inm86] was developed as a language for programming systems constructed using multiple Inmos Transputer chips. Although the transputer chip only provides for local memory (memory directly accessible to just one processor), the transputer chip also incorporates four high-speed serial links that can be used to send information to other transputers.
Occam is based on the Communicating Sequential Processes [Hoa78] model for parallel execution, in which input and output primitives are treated as basic building blocks of structured communication between otherwise sequential processes. Nondeterminism in the CSP model is handled through the use of guarded commands that permit input and output specifications to be integrated with control flow constructs.
The primary means of specifying executable code sequences in Occam is a process. Occam contains three primitive processes, each of which can be combined with other primitives to form more complex sequential or parallel processes. The primitive processes defined in Occam are assignment, input and output.
Assignment simply involves changing the value of a variable,
much like assignment statements in sequential languages.
Occam variables are declared without type, and are
restricted in scope to the process in which they are
defined. For example, the Occam statement
a := b + c
simply sums the values of variables b and c and assigns the result to the variable a.
The input and output primitives, signified by the symbols ?
and !, respectively, are used to pass values between
processes. The Occam statements
in ? x;y
out ! x+y
will input values for the variables x and y from the channel named in, add the two values, and output the sum to the channel named out. In general, any valid Occam expression may be specified as the value to be passed to another process via an output statement. The use of the input and output primitives are discussed further in the following section.
Primitive processes can be combined through the use of constructors. The three basic Occam constructors are seq, par, and alt, corresponding to sequential, parallel, and alternative processes. In a sequential process, primitive component processes are executed in a textually specified sequence; the process terminates upon completion of the final component process. Parallel processes allow component processes to execute concurrently, with the parallel process terminating only when execution has completed in all component processes. Parallel processes are differentiated on the basis of the indentation of the code. The textual order of the processes is insignificant, since the processes proceed concurrently.
A slightly different process specifier is the alt constructor. The alt constructor specifies that one of the component processes is to be arbitrarily selected and executed. Each component of an alternative process has a "guard" associated with it as a means of selecting one of the component processes for execution. The guard consists of an input primitive along with an optional Boolean expression. An alternative process remains idle until at least one guard condition has been met. If the condition is satisfied in more than one process, one of the processes is selected at random. Execution of the selected process then takes place.
A simple Occam program is illustrated in the Occam code
segment of Listing 3.1.
chan ina, inb:
chan outa, outb:
chan intesta, intestb:
seq
prod1 := 0
prod2 := 0
par
seq
var testa:
intesta ? testa
while testa
var a:
seq
ina ? a
prod1 := prod1 * a
outa ! prod1
seq
var testb:
intestb ? testb
while testb
var b:
seq
inb ? b
prod2 := prod2 * b
outb ! prod2
Listing 3.1. Occam process specification
The two while loops in Listing 3.1 are executed in parallel, but the code within each loop is executed sequentially. The Boolean expression true is evaluated as part of the repetitive while process, and, in this case, indicates that the loop processes will never terminate. In additional to the repetitive while process, Occam provides a conditional process, signified by the if keyword. Since code within an Occam process is local to that process, the semantics of conditional and repetitive processes are quite similar to those found in sequential languages; these constructs are not discussed further here.
Occam processes can be named in a manner analogous to
subroutines in a sequential language, although the
implementation more closely resembles that of "macro"
substitutions. For example, the two parallel processes in
Listing 3.1 could be specified as follows:
where in, out, intest, and prod are formal parameters repre-
senting the channels and variable unique to each process.
The parallel invocation of the multipliers could then be
specified as:
proc multiplier (chan in,out,intest var prod) =
seq
var test:
intest ? test
while test
var x:
seq
in ? x
prod := prod * x
out ! prod
par
multiplier (ina,outa,intesta,prod1)
multiplier (inb,outb,intestb,prod2)
It is also possible to specify a group of parallel processes using a replicator to create "an array of concurrent processes" all executing the same code:
chan in[4], out[4], intest[4], prod[4]:
par i = [0 for 4]
multiplier (in[i],out[i],intest[i],prod[i])
would create four processes, one each for values of i equal to 0, 1, 2, and 3. This notation is concise, but the Occam scheme for specifying processes can lead to a great deal of inefficiency.
Consider a parallel algorithm that is to perform an operation on each element of a 64 element array, on a computer that has only 16 processors to devote to the problem. Given the Occam model, 64 processes must be created, despite the fact that only 16 will be executing at any given time. No provision is made in Occam for generating code that would divide the workload evenly among the 16 processors as just 16 processes. This lack of dynamic load balancing capability becomes extremely inefficient when the time needed to perform operations on individual elements varies widely. In fact, the effective execution rate in such cases becomes bounded in a manner that is similar to execution on a SIMD architecture.
In Occam, a process is generally associated with a processor, and connections between processes are referred to as channels. A channel connects two parallel processes such that one process can "send" a value onto a channel and a second process can "receive" that value from the channel. A process can communicate with any number of other processes, but communication is only one-way; a second channel must be established between two processes if two-way communication is to take place. Consider, for instance, the following Occam code sequence:
chan in, out:
var x:
seq
in ? x
out ! x
This sequence defines two I/O channels, IN and OUT. The program first reads a value from the channel called IN (using the input operator, "?") and stores this value into the variable called X. The value is then sent out on the channel called OUT (using the output operator, "!").
By treating communication primitives as programming objects that can be manipulated as data and variables, Occam has provided a clean, if not overly elegant, method of expressing communication between processes. The problem with communication in Occam, however, is that transactions between processes can only be expressed through a channel. The destination of a message between processes can not be directly specified, but instead must be explicitly routed through intermediate processes. This scheme not only lowers the efficiency of processes that must pass messages along from neighboring senders and receivers, but also requires that an algorithm be carefully mapped onto the machine topology. Thus, Occam programs are typically tied to a particular interconnection pattern and cannot be easily moved between machines.
Although Occam provides a concise representation of the CSP programming model, it is very closely linked with the Inmos transputer for which it was implemented. The necessity of explicitly specifying communication channels leads to code that becomes difficult to move between systems or even machine configurations. The synchronous message passing model also presents difficulties in that synchronization and communication issues are once again inseparable.
Occam process control is flawed in that the programmer must either bear the inconvenience of writing scheduler code or suffer inefficient execution due to ineffective process management.
Concurrent C [GeR86] extends the CSP model used in Occam to provide a more general (and complex) concurrent programming model. Like Occam, Concurrent C chooses a synchronous message passing scheme to facilitate parallelism, with processes proceeding independently except when communication takes place. Concurrent C rejects a shared memory model of asynchronous computation on the basis of intended applications, such as a collection of workstations connected by a local area network.
The first distributed implementation of Concurrent C targets several identical computers connected by an Ethernet. In this implementation, however, only processes on the same computer can share global data.
Concurrent C separates the definition of a process from its instantiation. A process definition consists of a type (specification) and a body (implementation). Each instantiation of a process definition is referred to as a process. Concurrent C processes are sequential in nature, and follow directly the semantics of a sequential C program. Each process effectively has its own set of resources, such as a stack, registers, and program counter. The scheduling of processes on the underlying architecture is not specified by Concurrent C, as this is considered to be an implementation issue. In the definition of a process, only the type is visible to other processes; the body of a definition cannot be externally accessed.
A process type is defined as:
process spec process-type-name (parameter declarations)
{transaction declarations};
For instance, the declaration
process spec semaphore (int count)
{
trans int inc();
trans int dec();
}
defines a process that has one parameter and two associated transactions.
A process body specifies the statements that are to be
executed whenever the process is invoked. In many respects,
a process body is similar to the body of a function in C,
except that a process cannot return a value. The format of
a process body is:
process body process-type-name (process-parameter-names)
statement
A process is invoked through use of the create operator as
follows:
create process-type-name (initial-values) [priority(p)]
The integer expression p is an optional parameter that gives
the process a priority relative to some implementation-
dependent standard. A number of built-in functions are also
included in Concurrent C to assist in the handling
priorities, but these fall beyond the scope of the current
discussion. Each created process returns a unique id that
can be stored in a process variable of the same type.
Arrays of process variables and pointers to processes are
also valid in Concurrent C. For instance, given the process
definition of semaphore above, the following variables could
be defined:
process semaphore s, sa[8], *sp;
A Concurrent C program begins with a single active process that calls the function main(); execution of the program proceeds from the body of the main function.
A process can exist in one of three states at any given time. A process is active as soon as it is created, and remains so throughout the period during which the statements of the process body are being executed. A completed process is one that has executed a return statement in the process body or has reached the end of the body statements. Finally, a process is said to be terminated when the process and all processes created by it have completed. A process can also become terminated by executing a terminate alternative as part of a select statement. (The select statement will be discussed in the following section).
As in Occam and CSP, communication in Concurrent C is
performed by a synchronous message passing protocol. The
message passing primitives collectively form a
rendezvous between two processes, where one
process acts as the client and the other acts as the server
of the message. Whereas communication in Occam is
unidirectional (a simple rendezvous), Concurrent C supports
bidirectional communication referred to as an
extended rendezvous, or a transaction. A
transaction in Concurrent C has the format:
process-value.transaction-name (actual-parameters)
where process-value specifies a specific process, and transaction-name designates a transaction that is contained in the process type of the chosen process. Arguments are passed by value as in C function calls.
A transaction is initiated, or called, by the client process, which is then "blocked" from further execution until the requested server process accepts the transaction call. When a rendezvous is established, the message is passed from the client to the server. The client remains idle while the server performs the requested task. Upon completion of the task, the server returns the results, if any, to the client process, which may then proceed with execution.
For each transaction within a process, the process type
specifies a name for the transaction, the types of the
arguments passed to the server, and type of value returned
to the client. Thus, in the semaphore process defined
above, both the inc and dec transactions return an integer
value but have no arguments. The following example
illustrates the creation of a process and a call to the
member transactions of the server process s.
int i;
process semaphore s;
s = create semaphore(8);
s.inc();
...
s.dec();
This example demonstrates a pair of transactions from the
perspective of the client. Since Concurrent C transactions
are bidirectional, it is also necessary to examine the
language mechanisms provided to service a transaction
request. The accept statement is a transaction as seen from
the side of the server, or called process. The accept
statement has the form:
accept transaction-name (parameter-names) [suchthat (e)][by (ae)]
where suchthat specifies an optional boolean expression that involves one or more of the accept statement parameters, and by specifies an arithmetic expression that denotes an ordering of the transactions to be accepted (the default is FIFO order). In a sense, the suchthat clause serves the same purpose as processor address masks in synchronous languages, since only those transaction calls that have an expression that evaluates to be true are considered for acceptance.
Concurrent C provides a select statement to handle
alternative, nondeterministic events. The format of the
select statement is:
select {
[(guard1):] alternative1
or
[(guard2):] alternative2
...
or
[(guardn):] alternativen
}
where a guard is a boolean expression and an alternative is block of Concurrent C statements. The order of evaluation of guard expressions is unspecified, and all guards are not guaranteed to be executed. Exactly one alternative is chosen and executed by a select statement. If guard expressions for multiple alternatives are evaluated as true, a series of precedence conditions are followed based on the type of the alternatives. The type of an alternative is determined by the textually first statement appearing within each block. Highest priority is given to an alternative that begins with an accept statement for which a call is pending. Next in line is an "immediate" alternative (an alternative that contains no accept, delay or terminate statements).
Thus, communication between processes is intricately interleaved with control flow in the Concurrent C model. Nondeterminism is inherent in the control flow structure, and the programmer must take care to ensure that desired program behavior results when using Concurrent C.
Concurrent C introduces a fairly complex set of new operators and program structures to support asynchronous parallel programming. As in Occam and CSP, the semantics of process specification are fairly consistent, and code within a process body usually executes as if it were on a uniprocessor. A synchronous message passing model is used for interprocess communication, but unlike Occam, this communication can be bidirectional. Concurrent C manages to avoid some of the pitfalls of Occam in regards to process connection patterns, but still falls short of the ideal of independence from machine topology. Through the use of transaction pointers and process variables, Concurrent C processes are able to communicate directly with other processes, regardless of the physical position of the processes in the system. Transactions need not traverse neighboring processes that otherwise would not be involved in the exchange of information in order to reach a destination. Where Occam tends to map only to a particular machine (transputer), Concurrent C maps to a class of machines - namely, those that do not use a shared memory model of execution. ( Actually, Concurrent C can be implemented on a shared memory system, but constructs for managing shared memory were deliberately left out of the language to ensure portability. )
The primary problem with the Concurrent C model is the inability to express synchronization and communication independently. As currently specified, the synchronous message passing model requires synchronization in all cases, including those in which synchronization may not be necessary.
A dramatic departure from the synchronous message passing model of Occam and Concurrent C is presented in Linda [GeC85] .
Linda was developed as a programming tool to simplify the specification of explicitly parallel programs. The primary goal of Linda is to uncouple the spatial and temporal aspects of process management in a parallel processing environment. The Linda system consists of a preprocessor and compiler that translate language constructs into host language code, and a host language-independent runtime kernel that handles interprocess communication. Linda constructs have been added to both C and Fortran host languages, and the sytem kernel has been implemented on AT&T Bell Labs' S/Net multicomputer, a MicroVAX network, and the Intel iPSC hypercube.
Linda, despite its sophisticated management of communication, does not stress the specification of parallel processes. The most unique feature of the Linda system is the use of a shared tuple space (TS) as the memory model. Rather than accessing conventional memory by addresses, Linda uses a logical tuple name to refer to data objects. In some respects, Linda deliberately avoids the specification of processes, opting instead to use distributed data structures as the means of expressing parallelism. Unlike many asynchronous languages, Linda allows shared data objects to be accessed by any processor. There is no notion of a "manager process" or "monitor" that accesses shared data on behalf of a requesting process. A Linda program is not partitioned into processes to be assigned to processors. Instead, Linda employs a "replicated worker" model that replicates a program sequence several times to form processes that execute simultaneously on distributed data structures. Processes in Linda are therefore uncoupled, with no aspects of a machine or algorithm topology present in the model.
Linda's process management scheme relies heavily on the ability of the underlying system to perform dynamic task scheduling, especially when the time required to complete independent tasks varies significantly. The Linda model attempts to support dynamic scheduling by providing a distributed data structure referred to as a "task bag." All processes receive work assignments from the task bag, and add to the bag any additional tasks that are created during the course of executing the current assignment. This scheme also leads to scalable programs, since increasing the number of available processors will likely increase, and never decrease, system performance.
The model of communication in Linda is far more abstract than that of Occam or Concurrent C. Through use of the shared tuple space, all processes may directly communicate with one another. Hence, rather than sending data to a destination (Concurrent C) or through a channel (Occam), a Linda program places tuples of data into the tuple space. Once a tuple has been placed in the TS, any process may examine or remove the tuple from the TS. A tuple in Linda consists of character string that acts as a logical name, as well any other specified variable fields. Access to a particular tuple is granted by matching the name tag and all other tuple fields. Hence, the tuple space is essentially an arbitrarily large, fully associative, tuple memory shared by all processes.
Parameters specified without type are assumed to be actual parameters; the keyword formal may also be used to treat a previously declared parameter as an actual value during an operation on the tuple space.
The tuple space may be accessed through the use of three primitive operations: out(), in(), and read(). The out(t) and in(t) operations cause a tuple t to be added to or withdrawn from the tuple space. The read() operation makes a copy of the requested tuple parameters, if found, without actually removing the tuple from TS. Thus, successful completion of either the read() or in() operations is dependent upon an exact match of the requested tuple parameters and those of the named tuple currently residing in the tuple space. A fourth, non-primitive operation, eval(), will allow unevaluated tuples to be added to the tuple space.
An example of communication via the tuple space is illustrated below:
/* place ("abc", 5, 6) into TS */
out("abc", 5, 6);
/* get tuple called "abc" from TS */
/* here, i is set to 6 */
in("abc", 5, int i);
There are three basic difficulties in using this model. The first is that sequential segments of a Linda program are difficult to describe statically. For instance, it is difficult to determine which process will in the tuple that another process put out to TS. Although the awkward constraint of Occam that communication take place exclusively between neighboring processes is removed, the fact that Linda provides only a single shared name space for tags makes it difficult to ensure that multiple processes will not unexpectedly generate the same tag.
The second problem is that the fully associative nature of the processing of TS tuples makes it difficult to construct an efficient implementation. While shared memory models do not necessarily imply the presence of physically shared memory in hardware, an associative tuple space requires special hardware that is not available in most systems. This limits the portability of the Linda system.
Finally, the use of a single shared tuple space requires all values to be globally visible. As a result, optimizations based on locality of reference and lexical scoping cannot be used for programs written in Linda.
Linda represents a fundamental change of direction in asynchronous programming models. Rather than support a synchronous message passing model, Linda allows data to be globally shared throughout the system. Instead of sharing memory, however, Linda systems share a tuple space that corresponds to names of data objects.
Although Linda may present some difficulties as a general parallel programming model, the language semantics are very straightforward. As a result, it may be possible to use Linda as the target of a higher level language that presents the user with a more general shared memory model. Transformations between shared variables and entries in the tuple space are made possible through use of the in() and out() operations described above. The efficiency of such translations is likely to remain extremely machine dependent, however.
Although the programming models used by "languages" such as the Force [Jor87] , BBN Uniform System [Bab88] and IBM EPEX [Bab88] are quite different from those used in CSP, Occam, Concurrent C or Linda, they still tend to support an asynchronous style of parallel programming. In particular, the programming model used in the Force, Uniform System and EPEX is often described as a SPMD model for multiprocessors that utilize a shared memory space. The Uniform System was developed primarily for the BBN Butterfly, although it has essentially been replaced with the Mach programming environment of Carnegie Mellon University. EPEX was initially designed for use on the IBM RP3 system, but has also been implemented on the IBM 3090 series. The Force has been implemented on a number of machines, including the Denelcor HEP, the Flex/32, and the Sequent Balance Series. These machines all share the characteristics that main memory is directly addressable by each processor and that memory access time is essentially independent of the memory location being referenced. As a result, communication between processes is accomplished primarily through the use of shared variables, rather than the message passing scheme used by languages discussed previously. For the purpose of this discussion, a shared variable is any variable that may be directly accessed by any process in a system, regardless of the physical attributes of the underlying hardware. Thus, shared variables could conceivably be present even in systems that do not support shared memory in the system hardware.
It is interesting to note that none of the SPMD systems described above is actually a language; each consists of a set of language constructs that are used as macros within conventional sequential programs. The macros are resolved by a preprocessor, and code is generated by a standard sequential compiler. Rather than describe the exact syntax of all three systems, this discussion will focus on one - the Force. The discussion is relevant to EPEX and the Uniform System as well, since all three systems share a similar style and structure.
The most fundamental characteristic of the Force computational model is the lack of explicit programmer control over process management. Processes are never created or destroyed over the course of execution in a Force program; processes are allocated and created at program initialization by low-level system dependent code. Unless restricted by an explicit parallel primitive, all statements in a Force program are executed by all processes in parallel. The number of processes assigned to a program is always independent of the number of actual processes available on the system. The management of both actual and virtual processes is therefore hidden from the Force programming model. Despite the lack of process control, The Force model does support the notion of separate instruction streams, thus supporting asynchronous parallelism.
A Force program begins with the declaration of a force of
processes as follows:
Force <name> of <no. of procs> ident <proc id>
<variable declarations>
[Externf <module name>]
End declarations
<Force program>
Join
where parallel environment variables are visible to the macros within the program, and runtime information regarding the number of processes and the current process identifier is available to the programmer. The Join macro signifies the end of the main program in the Force. A parallel subroutine can be invoked in a similar manner, with the macro Forcesub used in place of the Force macro above. Parameters are passed to a Force subroutine just as they would to a conventional Fortran subroutine. A Fortran RETURN statement is used to terminate a Force subroutine; all processes must execute the statement at some time, however.
Macros corresponding to control flow operations in a
sequential language actually deal with distributing the work
load in a Force program. Loop structures are based on the
DOALL construct of parallel Fortran implementations. In a
DOALL loop, each process executes the body of the loop
statement for a different value of the index variable,
provided that their are no restrictive or ambiguous data
dependencies between iterations of the loop. The loop body
is therefore treated as a sequential code block by each
process, with no particular ordering specified for execution
of the different iterations. The Force provides two
variations of the DOALL construct, as illustrated below:
Presched Do <n> <var> = <i1>,<i2>[,<i3>]
<loop body>
<n> End Presched Do
Selfsched Do <n> <var> = <i1>,<i2>[,<i3>]
<loop body>
<n> End Selfsched Do
The Presched Do macro statically assigns index values to processes on the basis of the number of index values to be iterated and the number of processes available in the force. The Selfsched Do macro, by contrast, allows processes to dynamically schedule themselves over the iterations of the loop by acquiring the next available iteration from a shared index variable. Doubly nested loops may also be parallelized in certain situations through use of the Pre2do and Self2do macros that closely resemble the above prescheduling and self-scheduling macros in both format and usage.
A parallel case statement is also provided to distribute
independent single stream code blocks across the force of
processes. The Pcase macro is declared as:
Pcase on <variable>
[Pcond(<Fortran IF condition)]
<code block>
[Usect]
[Pcond(<Fortran IF condition)]
<code block>
...
End Pcase
The conditions associated with each block of code are evaluated in an arbitrary order, each by a single process. Since each "case" is associated with only a single (arbitrarily chosen) process, conditions are generally most useful when involving shared variables.
Unlike languages based on the CSP message passing model, the
Force uses shared variables to implement interprocess
communication. In the Force environment, a variable must
either be private to a single process, or globally
accessible by all processes. The distinction between
private and shared variables is deliberately kept
independent of textual scoping constructs, such as the
Fortran common directive, to avoid confusion of parallel
attributes with scoping issues. The format of variable
declarations in the Force is:
Private <type> <variable list>
Shared <type> <variable list>
Async <type> <variable list>
where Async variables are a special class of shared variables that are to be used in certain synchronization constructs to be described later in this section. In the Fortran implementation of the Force, the type of a variable list in the above declaration formats may be preceded by the directive Common /<label>/ as in conventional Fortran programs. Shared variables may be directly referenced by name, as can Private and Async variables. Thus, addition of parallel attributes in the Force do not affect the naming conventions of the underlying language.
Since explicit process control is not visible to the programmer in the Force, there is never a need to "send" values to, or "receive" values from, other processes. Values are instead passed between processes through variables declared as being shared. One side effect of this communication model is the need for explicit synchronization primitives. Languages such as Occam and Concurrent C provide implicit synchronization in the message passing model, since processes requesting service must wait for the request to be accepted by another process. The Force computational model separates the issues of communication and synchronization to some degree, and must therefore provide a set of appropriate constructs to manage synchronization. The synchronization macros in the Force can be divided into two classes, depending on whether they deal with control flow or data synchronization.
The primary control-oriented synchronization macro is used
to implement barriers across processes. The format of this
macro is:
Barrier
<code block>
End Barrier
The semantics of a barrier in the Force require that all processes execute the Barrier macro before one (arbitrarily chosen) process proceeds to execute the block of code within the barrier definition. When this execution is complete, all processes resume execution with the statement that immediately follows the End Barrier macro. These semantics unfortunately limit the ability of a programmer to break processes into subtasks, since only a single process may execute the code within a barrier block. This requirement, of course, was necessary in order to avoid explicit specification of processes - a clear objective of the Force methodology.
A second control flow synchronization object allows critical
sections to be specified for any shared variable. The
format of a Force critical section is:
Critical <lock-variable>
<code block>
End Critical
where the specified code block may only be executed by a single process at a time in all critical sections that specify the same lock-variable.
Data synchronization is based on the notion of
producer-consumer pairs, where shared
variables contain both a value and a binary state that
indicates whether the variable is
full or empty. Data
synchronization objects must be declared as Async variables
to convey the fact that they may be accessed in an
asynchronous manner by various processes. The data
synchronization mechanisms in the Force are described as
follows:
Produce <async variable> = <expr>
Consume <async variable> into <private variable>
Copy <async variable> into <private variable>
Void <async variable>
Isfull(<async variable>)
where an invocation of any of the above processes is considered to be an atomic (indivisible) operation with respect to other processes. The Produce macro waits for the asynchronous shared variable to be set empty, sets its value to the result of the specified expression, and marks the variable as being full. The consume macro waits for the asynchronous variable to be set full, at which time its value is assigned to the specified private variable and the asynchronous variable is once again marked as empty. Copy is similar to Consume, except that the asynchronous variable is not set the empty state after the assignment is made. The Void macro sets the state of an asynchronous variable to empty, regardless of the original state of the variable. Finally, the Isfull() macro acts as a logical function call that returns the state of an asynchronous variable without blocking on either full or empty states. Care must be taken to insure that deadlock ( For the purpose of this discussion, deadlock refers to any situation in which a producer and consumer of a shared object reach a state where neither can proceed without the other process changing state. ) does not occur in code that uses producer/consumer synchronization, since the Force does specify any deadlock prevention mechanisms.
The problem with the SPMD model is primarily that shared variables impose very little structure on communication and relatively small penalties for having shared structures (especially on the machines for which these systems were developed), hence there is a tendency to write SPMD programs in which very little concern is invested in minimizing visibility of data. To communicate between two processes, one must create a shared variable that is visible to all processes -- essentially the same problem encountered with Linda's mechanism, except that references are made using a name given at compile-time rather than a runtime associative lookup. The problem is not unlike that of developing large sequential programs as independent modules using a programming language that only provides for global variables.
Of course, not all communication is for the purpose of transmitting values; there is also the problem of communicating synchronization objects. There are a wide variety of synchronization mechanisms supported, but the key issue is really the determinism of the program's execution, not the primitives used to achieve that goal.
As opposed to synchronous languages, where parallelism is expressed through data layout and algorithm topology, asynchronous languages focus on communication between processes as the means of managing parallel execution. The model of communication varies from the message passing of CSP to the shared memory SPMD model of the Force. Regardless of the model, however, it is apparent that synchronization has traditionally been closely associated with communication in asynchronous languages. Thus, the key aspect of asynchronous languages, as opposed to synchronous languages, appears to be the inclusion of explicit synchronization.
Synchronization in a programming model is not really the issue, however; it is simply a means to an end, where the "end" is actually determinism. Although the set of synchronization mechanisms in most asynchronous parallel languages is quite rich, the provision for specification of determinism is absent. Nondeterminism is inherent in constructs such as the select statement of Concurrent C and the Pcase macro of the Force, but the programmer has no general means of expressing desired or acceptable nondeterminism in an algorithm. The nondeterminism present in all asynchronous models examined thus far is based only on the selection of one of several events that occur in an unknown order. Determinism is enforced through synchronization constructs in all other cases.
For parallel execution of a program, determinism refers to the ability of a code segment to produce the same output for a given set of inputs over a large number of runs. Deterministic execution will occur if and only if the code is free of both races and deadlocks.
A race occurs when two or more processes, either executing concurrently or randomly interleaved, attempt to access the same variable and at least one of the processes is storing into (writing) that variable. A binary race can therefore be either a "read- write" race or a "write-write" race.
A deadlock occurs when two or more processes are waiting for events (e.g., changing variable values) such that none of the events will occur because each event depends on the prior completion of another such event.
Although races cannot always be detected by static flow analysis, the potential for a race can be detected statically and a warning generated.
As for detection of deadlock, consider a binary deadlock between processes Pi and Pj that occurs when Pi waits for event Ej before causing event Ei and Pj waits for event Ei before causing event Ej. Let the place where process Pi waits be called Wi, and the place where process Pj waits be called Wj. In terms of deadlock detection, there are two ways in which this deadlock can occur:
Hence, although it is not possible for a compiler to detect all deadlocks, detecting the first type combined with detecting potential races will insure that each deadlock is at least flagged with a warning message. The problem then becomes one of limiting the warnings to those potential races that actually are races and that were not intended.
An important realization concerning parallel algorithms is that some programs do not require deterministic execution. For example, a weather prediction algorithm might sacrifice deterministic execution to gain speed, since the increased speed would allow use of more up-to-date input data. Nondeterministic execution would be preferred over a deterministic model in such a case, since more accurate forecasts could be generated.
Since none of the asynchronous languages provide a construct for specifying whether or not non-deterministic execution is acceptable, it is not possible to use the above analysis to full advantage. This makes programs much more difficult to debug and maintain -- a problem notoriously associated with the asynchronous languages.
Programming models such as those provided by the Force and EPEX, however, make specification of parallel structure very easy. Using a model based on the idea of multiple control- flow threads running through the same code, these models invoke a type of structured parallel execution that is remarkably similar to that found in synchronous execution models.
Thus, the most important aspect of the SPMD models is that they allow relatively complex asynchronous parallel structure to be embodied within language constructs, exactly as is done in synchronous languages. This makes it possible to merge the fundamental characteristics of both synchronous and asynchronous execution in a single programming model that can be efficiently implemented on a number of architectures.
The wide variety of synchronous and asynchronous languages developed over the past several years has led to the evolution of two very different classes of parallel architectures. These classes -- SIMD and MIMD -- can be broken into several subclasses based on specific machine characteristics, but the fundamental difference in the two classes involves the number of possible control flow paths during execution of a program. A SIMD execution model contains only a single conceptual "program counter," regardless of the number of physical processors present in the system. A MIMD model, conversely, may have any number of "program counters" available during execution. As discussed in previous chapters, language constructs have been developed for each class of architecture, generally resulting in programming environments that are not portable, and often not scalable. The apparent incompatibility of synchronous and asynchronous constructs is manifested in two issues related to the number of instruction streams possible in the execution model of a particular architecture -- control structure and data management.
Proper specification of control structure and data management semantics leads to a model where determinism can be specified as an independent characteristic of parallel program structure.
Control flow in synchronous languages is usually expressed within a construct, much as in a sequential language. Asynchronous parallel control, however, is typically expressed through the interactions of multiple sequential "threads" of execution. Since SIMD architectures conceptually use only a single program counter, synchronous parallel programs can be represented as a single code image. Processors are independently enabled or disabled as the program counter passes through the code image, allowing SIMD processing elements to follow different control flow paths through use of time multiplexing. Asynchronous code being executed on a MIMD architecture may conceptually have any number of program counters, each corresponding to a distinct code image. The appropriateness of this model for MIMD execution was further reinforced by the fact that most early MIMD architectures did not support shared memory, requiring that a separate code image literally be sent to each processor. The advent of shared memory MIMD architectures and the associated programming model has made it possible for processes to share a single code image, since access time to memory is approximately the same for all processors. Of course, each processor could still maintain a separate code image, but system memory requirements and program load times are significantly reduced when only a single code image is maintained, provided that the code is executed by more than one processor.
Algorithms written for MIMD architectures often execute the same sequence of statements, but individual processes may take different paths through the control structure of the program. Through proper use of control flow constructs it should then be feasible to represent an arbitrarily complex set of independent asynchronous processes within a single code image.
Thus, the first step towards unification of synchronous and
asynchronous programming models is to maintain a single code
image for a parallel program, regardless of whether the
program is to be executed on a SIMD or MIMD machine.
Consider a program consisting of two processes, A and B,
that is to be run on a MIMD architecture with four
processors. Let process A be assigned to processor 0 and
process B be executed by processors 1, 2, and 3. A single
code image could be created as follows:
if (processor_is_0) {
A
} else {
B
}
Of course, this example assumes that each processor "knows" its own identity -- a reasonable assumption for most parallel architectures. This code sequence is quite similar to the if-then-else statement of most synchronous languages. In fact, the only difference between the synchronous and asynchronous versions involves the order of execution of processes A and B imposed by the different programming models.
The semantics of a synchronous language would typically specify that processors 2, 3 and 4 be disabled while process A is executed in processor 0. Upon completion of A, processor 0 would be disabled, and processors 2, 3 and 4 would execute process B. Asynchronous semantics would allow processes A and B to proceed concurrently, with the order of execution of any part of task A unspecified with respect to the execution of task B. The synchronous semantics are more restrictive than the asynchronous counterparts, due to the added constraints involving the order of execution. It is interesting to note that the less restrictive asynchronous semantics do not prohibit a synchronous implementation of the above code, since synchronization constructs could be used to effectively emulate synchronous execution.
Synchronous languages include implicit synchronization in the specification of control flow; by separating synchronization and control flow directives, the semantics of synchronous and asynchronous control flow constructs become equivalent. In other words, the above if-then-else statement could be coded by a programmer, regardless of whether the underlying hardware operates in SIMD or MIMD mode. The choice of execution model is made by the compiler, rather than the programmer.
Thus far, the discussion of a unified programming model has two important implications regarding the roles of the programmer and compiler in a parallel processing environment. First, a unified programming model relies heavily on the ability of a compiler to generate appropriate code for both synchronous and asynchronous modes of execution. The code generation phases of compilation will, by necessity, be tied very closely to the underlying hardware. The second implication involves the fact that the programmer may no longer be able to determine the exact mode of execution that will be used for code segments written on machines where both SIMD and MIMD parallelism is supported. This ambiguity may be avoided by permitting the programmer to embed "hints" in the code. These compiler directives would suggest, but not require, either a SIMD or MIMD form of implementation. In a language like C, one might loosely disguise such hints as casts. Such a feature is not necessary, however, if the language semantics allow either mode of parallelism to be adopted.
As seen in the earlier discussion of synchronous and asynchronous models, one of the fundamental means by which parallelism is expressed is the management of data and variables in a program. Vital to the unified programming model is the ability to independently support three aspects of data management: data layout, assignment statements, and synchronization. Collectively, these aspects provide the foundation for an environment where determinism can be treated as a separate characteristic of the programming model. Languages proposed thus far link synchronization and data management at a level that cannot be controlled by the programmer, effectively eliminating separate specification of determinism in a program.
As discussed previously, there is little semantic difference between the layout of variables and data in parallel programming models. Synchronous models tend to map data to the processors of a SIMD architecture, while asynchronous models associate data with processes that may or may not have direct mapping onto the actual processors of a system. In synchronous languages, variables are defined as being stored in either a control processor (scalar variables) or a set of processing elements (parallel variables). In asynchronous languages, variables are either local to a process, or globally shared among all processes. Although the implementation of data layout varies widely among various hardware configurations, there is no fundamental difference between data layout in the programming models of synchronous and asynchronous languages.
Variables declared as being parallel in a synchronous model are assigned a shape and extent of parallelism. In more robust languages, such as Parallel-C and C*, the extent of parallelism need not be related to the actual number of processors available to a program. This mapping of variables and data to virtual processors is quite analogous to the definition of local, or private, variables in an asynchronous model such as that presented by the Force. Such variables are defined for each process regardless of the number processes created. Semantics governing the definition of parallel variables in synchronous languages and private variables in asynchronous models are actually very similar; the major difference is the way in which the variables are assigned to processes or processors.
In synchronous languages, virtual processors are allocated
on the basis of data and variable declarations. For
instance, the Parallel C declaration:
parallel [16] int x;
allocates an "array" of sixteen virtual processors, each of
which contains a definition of the integer x. Compare this
declaration with the following Concurrent C code segment:
main()
{
process myproc p[16]
int i;
for (i=0; i<16; i++)
p[i] = create myproc();
}
process body myproc()
{
int x;
/* body of myproc */
}
process spec myproc()
{
/* declaration of myproc transactions */
}
In this program segment, a set of sixteen processes are created, each containing a private integer variable named x. Processes are explicitly specified in Concurrent C; local variables are then defined within each process. The Force takes this mechanism one step further by eliminating explicit specification of processes. Instead, the variable x would simply be defined as a private integer across a "force" of processes that are implicitly defined at run- time.
Hence, the two code sequences above effectively result in the same set of distributed variables. The distinction between parallel variables in a synchronous model and private variables in an asynchronous models is therefore one of syntax; the sema