UNIFICATION OF SYNCHRONOUS AND ASYNCHRONOUS MODELS FOR PARALLEL PROGRAMMING LANGUAGES

A Thesis
Submitted to the Faculty
Purdue University
by
Michael J. Phillip
In Partial Fulfillment of the
Requirements for the Degree
Master of Science in Electrical Engineering
May 1989

This thesis is dedicated to Mitzi.

ACKNOWLEDGMENTS

I would like to thank my thesis adviser, Dr. Henry Dietz, for his encouragement, expertise, and always open door.

I would also like to thank Dr. Tom Casavant and Dr. Dave Meyer for serving on my committee and providing invaluable feedback.

I thank Dr. H. J. Siegel and Wayne Nation for their input regarding the past, present, and future status of PASM.

A special thanks to Dr. James Kuehn for the many hours of guidance and constructive conversations regarding compilers and Parallel-C.

Finally, I thank Dr. Paul Schneck, Dr. John Riganati, and the research staff at the Supercomputing Research Center for providing the opportunity and environment to develop a suitable thesis topic.

ABSTRACT

Phillip, Michael J., M.S.E.E.,
Purdue University. May 1989. Unification Of Synchronous And
Asynchronous Models For Parallel Programming Languages. Major
Professor: Dr. Henry G. Dietz.

Most parallel languages can be associated with either a synchronous or asynchronous programming model. Synchronous languages tend to infer semantics from a particular execution model, while asynchronous languages are often difficult to implement efficiently on existing hardware. Each model encompasses a different form of parallelism, but neither is capable of expressing both synchronous and asynchronous parallel constructs.

In this thesis, a number of existing parallel languages are examined with respect to the precise semantics of the parallel constructs and their impact on the type of parallelism that can be expressed in the programming model. From this examination, a unified programming model is derived that allows parallel programs to be mapped efficiently onto both SIMD and MIMD architectures. Fundamental to this model is the separation of synchronization from control flow and communication issues, as well as the capability to allow nondeterminism in parallel algorithms.

As a concrete illustration of the unified model, an explicitly parallel C language, XPC, is proposed. A BNF grammar for the language is given, and the semantics of parallel constructs are informally discussed. XPC extends all communication and control constructs of the C programming language to allow concise specification and management of both synchronous and asynchronous parallelism.

Notice for HTML Users

The HTML version of this thesis differs from the official copy in several ways, including a variety of minor formatting changes. The official copy is available through the usual channels for a Purdue MS Thesis. However, this HTML version is complete and very convenient for browsing. Most WWW browsers will even allow you to print this version by automatically converting it into postscript.

This HTML document was generated using a script that replaces figures with in-line gif images that are hypertext linked to full-resolution postscript versions; the (nK .ps.Z) at the upper right of each in-line image provides both the size of the compressed postscript file and the hypertext link. There are also a variety of hypertext links inserted for navigation within the document, including a Hypertext Index at the end of this document (click here to jump to the Hypertext Index).

1. Introduction

As parallel processing systems become more widely available over the next several years, the need for adequate software development environments will become increasingly visible. Programming languages proposed or implemented thus far tend to support a particular type of parallelism and are often closely associated with a particular architecture. Issues such as scalability and portability -- fundamental in sequential languages -- are often ignored in parallel programming models. There is a need to provide the programmer with a model that encompasses many forms of parallelism, while still retaining the "feel" of a more familiar sequential language.

To this end, this thesis proposes a parallel language, based on the C programming language, that will allow programmers to explicitly specify and manage parallelism on a broad class of architectures. A foundation for such a language is built by first examining a number of explicitly parallel languages with respect to the semantics of parallel constructs and the interaction of such constructs during program execution. From this examination, aspects of a unified programming model are derived, and inconsistencies in existing language semantics are exposed. This unified model provides a unique perspective of parallel programming, and identifies language features that are desirable in a general-purpose parallel programming language.

1.1. Parallel Programming Models

The sequential model of programming has spawned a great number of languages and algorithms, all tied intricately to von Neumann style architectures. With the evolution of parallel hardware has come the introduction of languages that support the specification and management of parallelism. These languages have found moderate acceptance, but remain difficult to use due to a number of shortcomings including a lack of machine independence, a lack of scalability, and a lack of programming support, such as source-level debugging. Hence, the primary challenge of hardware and software designers alike is to provide an environment where parallel programs can be designed, implemented, and debugged by programmers who have both natural and learned tendencies to decompose problems in a sequential manner. Thus far, three approaches have been pursued in an attempt to meet this challenge:

Synchronous parallel model
Allow parallelism only within a given "step" of a solution. A program specifies a sequence of such steps, with parallelism facilitated by simultaneous execution of each step in many processors. This approach is often associated with a SIMD execution model.
Asynchronous parallel model
Represent a program as a set of sequential processes that occasionally interact, but otherwise proceed independently. A program's processes are written based on the assumption that all processes execute asynchronously. If one process needs information from another, it may be forced to wait for that information to become available, but no other processes are affected by this interaction. This model is generally used in conjunction with MIMD execution models.
Sequential model
Write an "ordinary" sequential program and have an automated system convert it into parallel code. A program is first analyzed to determine a partial ordering of its operations on data that will preserve the correctness (meaning) of the sequential program. This analysis can be performed on either "dusty decks" of existing code [KuS84] [Ell85] or on a sequential language that has been designed or modified to facilitate this analysis, such as Sisal [McS85] , Blaze [MeV85] , or Refined C [DiK84] . The operations from the partial ordering are then "packaged" into parallel- executable processes appropriate for the target machine. Although this approach may provide the eventual solution, the complex and abstract nature of the required analysis is beyond the scope of this discussion.

It is interesting to note that all three approaches map the problem of programming for parallel execution into a sequential domain. Synchronous models allow the problem to be decomposed into a series of small steps that may be executed in parallel, while asynchronous models treat the problem as a set of nearly independent sequential tasks that can proceed concurrently. Although the third technique can potentially be applied in conjunction with the first two approaches, the synchronous and asynchronous models have typically been considered to be mutually exclusive.

Clearly, this exclusion is not inherent in the problem itself, but is instead a result of applying a particular set of constraints to the formation of algorithms that map the problem onto a particular class of machines. This view has brought about an important dilemma: if a solution contains both synchronous (SIMD) and asynchronous (MIMD) parallelism, and the underlying hardware can support both models, how can the programmer specify both types in a single programming language?

Implicit in the statement of this dilemma is an assertion that many current parallel architectures can, in fact, support both types of programming models. It is ironic, then, that the emphasis in discussions of parallel programming models is most often placed on whether a synchronous or asynchronous model should be chosen. If parallelism is the only way to achieve higher effective speed, it is no longer acceptable to discard one flavor of parallelism simply because the programming language uses a different model of computation.

1.2. Parallel Execution Models

In order to better understand how the synchronous and asynchronous constructs differ, a brief description of the machine models most closely identified with each language type is presented below. A number of other hardware configurations exist in parallel architectures, but the SIMD and MIMD models provide the most natural division for the discussion of programming models to be presented in this document.

1.2.1. SIMD

As originally defined [Fly66] SIMD refers to a Single Instruction stream, Multiple Data stream model of execution. In such a model, only a single sequence of instructions is present, but each instruction may be simultaneously executed in an arbitrary set of processors, with each processor operating upon its own data. A typical block diagram for a SIMD machine is shown in Figure 1.1.

Image1 not displayable as text. ( 7K .ps.Z )

Figure 1.1. SIMD block diagram

Typically, program instructions are fetched and decoded by the control processor (CP). Operations to be performed in parallel are then broadcast to those processing elements (PEs) that have been enabled for the current instruction. Processors that are disabled must remain idle while the instruction is executed in enabled PEs. The parallelism in a SIMD model is at the lowest level of the program hierarchy, or the instruction level. Thus, SIMD machines can efficiently implement fine-grain parallelism. This efficiency is due to the "perfect" scheduling of operations that is possible in a SIMD model; no interference occurs between processing elements as the program is executed.

1.2.2. MIMD

A Multiple Instruction stream, Multiple Data stream (MIMD) machine is characterized by several sequences of instructions, or processes, that are distributed in some manner on a set of processors. These instruction streams may differ in both content and structure, and can effectively be viewed as being executed by a set of uniprocessors that are able to communicate with each other through use of special software and hardware support. A generic block diagram for a MIMD model is illustrated in Figure 1.2.

Image2 not displayable as  text. ( 9K .ps.Z )

Figure 1.2. MIMD block diagram

The means of communication between processors is generally used to conceptually divide MIMD architectures into a number of subclasses. For example, some memory models have a shared address space to facilitate communication, while others must send messages through a network. These communication techniques are also used in SIMD execution models, but are typically "hidden" from the programmer. Thus, since processes need not be executing the same instruction at the same time, and since communication uses up valuable execution time, large-grain parallelism is generally preferred in the programming model. The delays due to communication are also present in SIMD execution models, but the relative timing between the participating processes can be determined statically.

1.3. A Unified Model of Parallelism

In this document, the emphasis is not placed on a competitive evaluation of synchronous and asynchronous programming models; clearly, both encompass desirable forms of parallelism. Instead, the goal is to examine the impact of several synchronous and asynchronous language features on the ability to combine desirable aspects of SIMD and MIMD machine models. The divergence of SIMD (synchronous) and MIMD (asynchronous) programming models has by no means been accidental, but rather the result of differing design goals and implementation constraints.

A SIMD model of execution leads to hardware that is relatively straightforward to build and reasonably easy to understand. Most SIMD machines simply extend von Neumann architectures to allow simultaneous operation of multiple functional units. As a result, most synchronous languages have been built around a particular machine, providing a near optimal match between language constructs and the underlying hardware. Examples of such language/machine pairs include GLYPNIR [LaL75] / Illiac IV [BaB68] , Vector-C [Li84] / Cyber 205 [Lin82] , Parallel-C [KuS85b] / PASM [SiS86] , and C* [RoS87] / Connection Machine [Hil85] . In each of these cases, at least some aspect of the language semantics have been inferred from the execution model.

A MIMD model, by contrast, is theoretically much more powerful, but is often more difficult to implement in hardware. This is especially true if communication is supported with shared memory hardware. Asynchronous languages tend to be designed as tools for examining abstract theories of computation; hence, they often do not map well onto parallel hardware. In many respects, the SIMD model is actually a proper subset of the more general MIMD model.

1.4. Overview of This Document

As indicated above, explicitly parallel languages tend to fall into one of two classes based on the execution model of the target architecture. These classes have generally been considered incompatible, and tend to favor certain types of applications and specific modes of parallelism. This document seeks to identify common features of both language classes, with the goal of specifying a unified programming model that can be applied to a broad range of parallel algorithms and machines. The approach taken is to first examine existing languages from the perspective of the programming model, rather than a particular implementation. Although descriptions of these languages have appeared elsewhere, discussion typically revolves around the language/machine pair, thus obscuring issues related to the underlying programming model. When taken out of the context of a particular machine environment, inconsistencies in the language semantics often appear, making it difficult for the language to be implemented and used on other systems. The discussion of existing languages is not intended to be comprehensive, but does attempt to describe in some detail those constructs that affect the expression and use of parallelism. The manner in which synchronous and asynchronous language classes are compared is believed to be unique, as is the attempt to unify the two programming models.

Chapter 2 surveys four synchronous programming languages with respect to the semantics of parallel constructs that support computation on SIMD architectures. The survey not only describes the salient features of each language, but also points to inconsistencies in the model of parallelism used within individual constructs. An attempt has been made to divide constructs into several categories, based on the impact each category has on parallel execution. Thus each language discussion contains subsections covering the expression of parallelism, the use of conditional and iterative constructs, and communication between processors within the system, as well as a brief summary of the strengths and weaknesses of the model.

Chapter 3 addresses asynchronous languages, ranging from message passing models to shared memory paradigms that suppress explicit naming of processes. As in Chapter 2, the survey attempts to focus on semantics of constructs that are desirable in a unified programming model. Specifically, subsections cover the specification and management of processes and the means by which processes communicate with one another.

The synchronous and asynchronous models examined in the previous chapters are merged into a unified parallel programming model in Chapter 4. This unification takes place by extracting language features from previously described languages and providing a possible set of more general semantics for parallel computation. The key to the specification of a unified programming model is the separation of interdependent characteristics that are often inherently linked in synchronous or asynchronous models. Thus, issues such as synchronization and communication are treated separately, leading to a discussion of determinism, a fundamental aspect of parallel programming models.

Chapter 5 presents a concrete illustration of the concepts proposed in Chapter 4 by describing a new parallel language, XPC. The language is similar in style to many of the languages previously discussed, but the semantics of constructs are independent of the execution model of the target architecture. The language is also unique in that it is intended to be compatible with current compiler technology, including automatic parallelization.

Finally, Chapter 6 summarizes the work presented in this thesis, and outlines future research directions in the area of parallel language development.

2. SYNCHRONOUS LANGUAGES

2.1. Synchronous Constructs

As described in the previous chapter, most explicitly parallel languages are associated exclusively with either a synchronous (SIMD) or asynchronous (MIMD) model of execution. Constructs added to synchronous languages to support parallel execution can be divided into two classes: constructs that deal with the expression of parallelism, and those that handle assignments and communication between processors. This chapter will examine both classes of constructs for a series of four synchronous languages.

Synchronous languages allow parallelism to be expressed in the following ways:

1. variable declarations (data structures and data layout) 2. enabling/disabling of PEs (algorithm topology)

Variables are usually declared as either being distributed among all processors or local to the control processor; the keywords differ from language to language, but the meaning is the the same. For example, Glypnir uses the descriptor CU to define data and variables that reside in the control unit, and PE to designate parallel variables for the processing elements. In Parallel-C, the keywords for such declarations are scalar and parallel, respectively.

Algorithm topology refers to the manner in which a program maps onto a parallel execution model, including the control flow and data flow of an algorithm across many processors. Ideally, algorithm topology is independent of machine topology, at least within a given class of parallel architectures.

Communication between processors is deceptively straightforward in synchronous languages. Given that data in SIMD machines is classified as being either scalar or parallel, most synchronous models specify semantics governing each combination of data classifications. Some languages only allow certain types of assignments; others attempt to support arbitrary communication between all processing elements, including the control processor. Inherent in all such semantics, however, is the provision for synchronization within communication constructs. It is this linkage of potentially independent issues that prevents most synchronous languages from being implemented on other classes of architectures, and often introduces inconsistencies into the programming model.

2.2. Glypnir

Glypnir [LaL75] was developed as a programming language for the Illiac IV parallel processing system [BaB68] . At the time of its development, compiler technology could not provide the optimization necessary to allow both efficient execution and topology independence. Thus, a Glypnir program specification is closely associated with the Illiac IV hardware. The Illiac IV consists of 64 processing elements (PEs) and a single control unit (CU). Each processing element may directly access 2048 words of memory, and may indirectly access any address within the system. Interprocessor communication can take place through an interconnection network that allows PEs to directly communicate with other processors that have logical distances of +-1 or +-8.

2.2.1. Expression of Parallelism

Glypnir allows parallelism to be expressed through variable declarations and the use of conditional and iterative constructs. Most standard variable types are supported in Glypnir, including real (floating point), integer (fixed point), and alpha (character). Aggregate variables of these basic types can be defined with the vector keyword, and the pointer directive allows memory indirection to be expressed in conjunction with any of the above types.

In order to support parallelism at the level of variable declarations, Glypnir uses the descriptor cu to define data and variables that reside in the control unit, and pe to designate parallel variables for the processing elements. The format of these declarators is as follows:

/* scalar variable in control unit */
cu integer i
/* parallel array in PEs */
pe real vector v[10]

Parallel variables in Glypnir are of a fixed size and extent, resulting in parallel data layout that closely resembles the Illiac IV topology. These data structures are referred to as swords (super words) and slices, representing one word of data from each of the 64 processing elements of the Illiac IV. A group of 64 words, each from a different PE, but with the same address within that PE, is called a sword. A slice is similar to a sword, but the addresses within a PE need not be identical. Thus, the variable v above actually declares a vector of 10 swords, or, equivalently, a two-dimensional "array" (64 x 10) of integers. If the vector is indexed with a cu variable, the resulting value is a sword. If the index is instead a pe variable, the result is a slice; each PE uses its own value of the index to access an element of the vector.

An additional data type, boolean, is provided for use in conditional expressions, which are described in the following section.

2.2.2. Conditional Constructs

In Glypnir, conditional constructs have a style similar to those found in Algol, with the semantics extended to support control of identical parallel instruction streams. The format of these statements is as follows:

if <BE> then <S1> else <S2>
for all <BE> then <S1>
go to <LABEL>

The notation <BE> represents a Boolean expression consisting of one Boolean value for each of the 64 PEs in the machine model of the Illiac IV. Thus, all conditional constructs in Glypnir are treated as parallel statements, with each PE evaluating its own element of the Boolean expression. The semantics of these constructs specify that statement <S1> is to be executed in those PEs where <BE> is evaluated as being true, followed by execution of statement <S2> (when present) in PEs where <BE> is determined to be false. The semantics of the for all statement are identical to those of if-then; the statement simply provides an alternative set of mnemonics. Nested conditional statements are supported by means of a run-time execution stack that contains a properly nested sequence of the Boolean expressions encountered at each level of the conditional hierarchy. The Boolean expressions can therefore be viewed as enable patterns, and each level of nesting results in a logical AND operation of the current Boolean expression with the inherited enable pattern. While this semantic interpretation is not completely general for parallel computation, it is analogous to the semantics of conditional constructs in a sequential programming model. Glypnir also addresses the issue of go to statements inside of conditional constructs, presenting a solution that is consistent, but incomplete. Whenever a go to statement appears in the body of <S1> or <S2> of a conditional statement such as those above, a potential problem arises.

Consider the following code sequence (where x, y and max are sword variables):

if x > y then begin
    max <- x
    go to LABEL1
end else begin
    max <- y
    go to LABEL2
end

This would appear to imply that some PEs branch to LABEL1 while others branch to LABEL2, thus violating the imposed SIMD model of execution. To resolve this apparent conflict, the Glypnir semantics require that all PEs execute the go to statement in <S1> if any PE does. Therefore, if the expression x > y is evaluated to be true in any of the 64 PEs, then all PEs will branch to LABEL1 after assigning the variable max equal to the value of x. It is not clear whether this assignment takes place in all PEs or only in those where x > y, but it is clear that statement <S2> in the else clause will not be reached in any PE. As a result, these semantics can lead to confusing (but predictable) code segments that have no analog in a sequential language. As will be shown later, these semantics are overly restrictive, and can be relaxed to allow a more general set of conditional constructs to be specified.

Another problem arises when the target of a branch is located inside of a conditional or iterative construct. For instance, in the above example LABEL1 may actually refer to code in the body of another if statement. It is difficult to infer the semantics of such an occurrence from the Glypnir specification.

2.2.3. 3 Iterative Constructs

Glypnir provides a number of iterative control constructs, each extending the semantics of conditional statements to support repeated execution of a block of code. The following are valid iterative statements in Glypnir:

loop i <- i1, i2, i3 do <S>
thru i do <S>
for i <- i1 step i2 until i3 do <S>
do <S> until <BE>
while <BE> do <S>

Like the Fortran do loop, the Glypnir loop statement performs a comparison of the control variable (i) with a conditional limit (i3), with the statement <S> being executed repetitively until the condition i > i3 is met. Unlike Fortran, however, the conditional test is performed before the loop statement is executed. The control variable (i) must be a cu integer variable, while the initial value, increment, and limit (i1, i2, and i3, respectively) can be arithmetic expressions that yield a single value - swords and slices are disallowed. The Glypnir for statement is similar to the loop statement, except that the control variable, initial value, increment and limit may be swords. As a result, the statement <S> may be executed a different number of times in different PEs, with each PE using its own values of the control variable, initial value, increment, and limit.

The thru statement repeats execution of the statement <S> for the number of iterations specified by the cu variable (or constant) i.

The do and while statements in Glypnir allow explicit parallelism in loops by requiring that a Boolean expression be specified as the conditional test. A do loop in Glypnir will repetitively execute the statement <S> until the Boolean expression <BE> is false, and the while loop will execute <S> as long the Boolean expression is true. Thus, the statement <S> in a do loop will be executed one or more times, whereas the body of a while loop is executed zero or more times.

2.2.4. Assignments and Communication

Assignment statements in Glypnir are relatively straightforward, since all parallel variables must be of the same extent. Arbitrary communication between PEs is supported through the use of pointers, which may either be local to a specific memory module or global to the system memory. Pointer variables may be either a single cu pointer or a sword of pe pointers. A number of allocation and deallocation statements are provided in the language to support the use of pointers.

2.2.5. Summary

Glypnir was an early attempt to provide high-level programming support for a particular parallel architecture. Clearly, an engineering approach was used in the design of the language, since most tradeoffs in the language favor a practical implementation over conceptual elegance in the programming model. Perhaps foremost of these tradeoffs was the choice to generate the most efficient code that compiler technology was capable of at the time, at the cost of losing topology independence. No attempt is made to differentiate between the programming model of Glypnir and the Illiac IV machine model. The lack of parallel data abstraction is a well-documented problem with the language. There are also problems with the control flow model, as described above, but the language does attempt to provide a complete, abstract control structure.

Communication between processors is highly machine- dependent, but the inclusion of pointers as a means of interprocessor communication provides a bit of foreshadowing for subsequent languages based on the C programming model. Of course, Glypnir pointer variables must be identified at the point of declaration, thus avoiding some of the ambiguous aliasing issues present in C, where the address of any object may used in expressions and assignment statements.

Glypnir provides many insights into the fundamental issues of language specification for parallel architectures. Although there are flaws in the language model, it remains a valuable starting point in the study of synchronous programming languages.

2.3. Actus

A few years after the introduction of Glypnir, Actus [Per79] was proposed as a general purpose synchronous programming language for a number of SIMD machines. The primary goal in the development of Actus was to provide an explicitly parallel programming language that was independent of both the underlying machine architecture and existing parallel algorithms. Instead, Actus allows the programmer to concisely specify the parallelism of a problem, provided that the problem can be specified within the bounds of a synchronous programming model. The language is based on Pascal, with strong typing and structured programming constructs extended to support parallelism.

2.3.1. Expression of Parallelism

In Actus, only array data structures can be declared to have any extent of parallelism, although the parconst keyword is included to declare parallel constants. An Actus array may contain both serial and parallel dimensions, as illustrated below:

const
    P = 100;
    S = 10;
var
    (* parallel *)
    pvar : array [1:P] of integer;
    (* scalar *)
    svar : array [1..S] of integer;
    (* mixed *)
    grid : array [1..S, 1:P] of real;

In this example, P denotes the maximum extent of parallelism that may be assumed by the variables pvar and grid.

Parallel arrays may be defined for any data type in the language, including structured types such as Pascal records and sets. The use of the ":" character signifies that the associated index is to be interpreted as the parallel dimension of the array. Actus allows only a single dimension of parallelism to be expressed for an array, although no such limitation is placed on the number of scalar dimensions permitted. Thus, Actus provides a means of defining all Pascal variables and types for a one- dimensional array of processing elements. The conceptual view of parallel data structures is quite similar to that presented in the Glypnir programming model. Unlike Glypnir, however, the extent of parallelism in Actus is not tied to any particular machine.

Actus introduces the notion of index sets as a mechanism to specify the extent of parallelism in statements and expressions. Through the use of index sets, the extent of parallelism can be changed each time a data structure (array) is referenced. Index sets are declared much like ordinary variables, but must specify constants as lower and upper bounds. Processors may be selectively omitted from an index set by taking the union (+), intersection (*), or difference (-) of two or more index sets. Regular patterns may be expressed through the use of square brackets and a constant increment. The following example illustrates the declaration of index sets in Actus:

index
    (* odd PEs from 1 to 99 *)
    odd = 1:[2]99;
    (* PEs 6 to 8 and 25 to 32 *)
    subset = 6:8 + 25:32;

2.3.2. Conditional Constructs

In many respects, Actus is similar to Glypnir in its handling of conditional constructs; both languages derive their semantics from sequential counterparts of similar appearance. The following are valid conditional constructs in Actus:

if <E> then <S1>
if <E> then <S1> else <S2>
case <CS> of
    <L1>: <S1>;
    <L2>: <S2>;
    ...
    <LN>: <SN>
end;

The key difference between the Actus semantics and those of Glypnir involves the specification of the conditional expression. In Actus it is not necessary that the expression <E> be evaluated in all PEs. In fact, the expression need not exhibit any parallelism at all. If the conditional expression is scalar, however, the extent of parallelism for any variables in the body of statements <S1> and <S2> must be explicitly specified or inherited from an enclosing construct of known extent. This inheritance can take place either through a nesting mechanism similar to that of Glypnir, or by use of a within construct. The within construct has the form

within <EOP> do <S>

where <EOP> represents the extent of parallelism for the statement <S>. When the conditional expression is parallel, it is evaluated for each element present in the current extent.

Actus does not directly address the issue of goto statements appearing inside the body of conditional constructs, opting instead to disallow their use. This solution is certainly self-consistent, but results in some loss of generality in the programming model.

2.3.3. Iterative Constructs

Actus provides three iterative constructs, corresponding to the Pascal while, repeat, and for statements.

while <E> do <S>
repeat <S> until <E>
for <INDEX> := <LOWER> by <INC> to <UPPER> do <S>

As with the conditional constructs, the extent of parallelism appearing within the body of iterative statements must be explicitly specified or inherited whenever the conditional expression <E> is scalar. When the conditional expression contains parallel variables, the body of the iterative statement takes on the extent of the conditional expression. Thus, the while loop becomes more flexible than the repeat-until loop in Actus due to the fact that the conditional expression in the while loop is evaluated before execution of the statement body, allowing the extent of parallelism in the loop to vary over iterations. As in Glypnir, the extent of parallelism may never increase - once a processing element drops out of a loop it must remain idle until all other active processors have completed execution of the loop.

The semantics of the for statement differ somewhat from those of the previously described iterative constructs. In an Actus for loop, each of the parameters (index, lower bound, increment, and upper bound) may be parallel, provided that the extent of parallelism is specified or inherited. By allowing the index to be parallel, the for loop may be transformed into a number of potentially independent loops executing in parallel, each performing a different number of iterations. This interpretation is similar to the Fortran doall construct, which is included in a number of parallel implementations of Fortran.

2.3.4. Assignments and Communication

Assignment statements in Actus must involve a single extent of parallelism that is less than or equal to the maximum extent of the variables used within the statement. The set of selected processing elements may vary on the left and right sides of the assignment, but the extent of the variables and expressions must be the same.

Thus, data alignment may take place in assignment statements through use of the shift and rotate operators. The usage of these operators is as follows:

<EOP> shift <INT EXPR>
<EOP> rotate <INT EXPR>

where EOP> is the extent of parallelism of the variable or expression, and <INT EXPR> is an integer expression. The following example illustrates the use of data alignment in Actus assignment statements.

var
   x: array [1:10] of integer;
index
   part = 3:5 + 7:8;
   ...
x[part] := x[part shift -2] + x[part rotate 1]

In this example, element 3 receives the sum of elements 1 and element 4; element 8 receives the sum of elements 6 and 3, and so forth.

2.3.5. Summary

Of all the languages considered in this chapter, Actus provides the cleanest and most self-consistent specification. Unfortunately, it does so by imposing constraints that limit the effectiveness of the language for the broad class of problems that cannot be described in terms of parallel data structures with a high degree of regularity. The control flow model is fairly restrictive in Actus, since goto statements are disallowed and the semantics of conditional constructs impose a particular order of evaluation for expressions.

One of the important contributions of Actus is the use of index sets as a means of aligning data. In Actus, the extent of parallelism is always specified as part of the construct. By providing a concise method of specifying interprocessor communication without the arbitrary use of pointers, Actus is able to retain much of its expressiveness without compromising its goal of strong data typing. The notion of index sets is extremely useful in a general parallel processing model, as will be seen in later chapters.

2.4. Parallel-C

Parallel-C [KuS85b] , as originally proposed, was a superset of the C programming language that included extensions to support both synchronous and asynchronous language constructs. In particular, Parallel-C was intended to provide an explicitly parallel high-level language for the PASM parallel processing system. The original language specification includes syntax for synchronous constructs, and a discussion of suggested approaches for asynchronous constructs.

PASM [SiS86] is a Partitionable SIMD/MIMD computer consisting of a number of control processors and processing elements. The system may be dynamically reconfigured to allow either synchronous or asynchronous execution of parallel programs.

2.4.1. Expression of Parallelism

The style of earlier languages such as Glypnir and Actus became the basis for a number of explicitly parallel synchronous languages. Languages such as Lucas [Fer82] and Parallel-C take the notion of index sets one step further, allowing arbitrary selection of parallel shape and extent through the use of selectors. In Parallel-C, the keyword selector acts as both a type specifier and a storage class by defining a "vector of bits" that corresponds to the set of processors that may participate in parallel statements and expressions. A selector variable can be defined for any shape or extent of parallelism, and becomes the primary method of selection when used in conjunction with parallel variables. As with index sets in Actus, selectors may be combined with set operators such as union, intersection, and difference to form more complex shape descriptors. Unlike Actus, Parallel-C allows any variable to be parallel, requiring only that the shape and extent of the variable be specified at declaration.

2.4.2. Conditional Constructs

Parallel-C extends all conditional constructs of C to a SIMD programming model. As in Actus, conditional expressions may be either scalar or parallel, with run-time evaluation determining the extent of parallelism. The format of the Parallel-C conditional constructs is as follows:

if (<E>)
   <S1>;
else
   <S2>;
switch (<E> {
   case <C1> : <S1>;
   case <C2> : <S2>;
          ...
   default: <SN>;
}
goto <LABEL>;

In a strict SIMD environment, the semantics of the Parallel- C if-then-else construct closely resemble those of Actus. If the expression <E> is parallel, the statement <S1> is executed in those PEs where <E> is evaluated to be non-zero. All other PEs are disabled, and must wait for <S> to be completed in all enabled PEs before proceeding with execution. ( In some cases it may be preferable to execute the body of an if statement and the corresponding else statement concurrently in an asynchronous manner, provided that there are no unacceptable side effects of one statement body on the other. Parallel-C does not prohibit such an occurrence, but nor does it specify the exact semantics of asynchronous (MIMD) operation. )

Nested if-then-else statements are handled by maintaining a run-time stack of PE address masks generated through evaluation of the conditional expression. As in Actus, the set of enabled processors for nested conditional constructs must always be a subset of those enabled in the enclosing conditional statement. Unlike Actus, however, Parallel-C does not require that the extent of parallelism within the body of a conditional statement match that of the conditional expression.

The switch statement in Parallel-C is significantly more complicated than its counterpart in Actus, due to the possible (but not necessary) inclusion of break statements within a case statement. In essence, break statements act as a special instance of a goto statement, where the destination of the branch is the textual point immediately following the closing brace of the switch statement. In a strictly synchronous model of execution, each statement block must be executed separately; PE masks must be generated for each case statement before any of the case statements are actually executed in any of the processing elements. Since some case statements may not contain break statements, a logical OR operation of two or more PE masks may need to be performed to attain the proper set of enabled processors for a particular case statement. Likewise, the PE mask for the default statement, if present, is the complement of the result obtained by performing an "or" operation of all prior case statement masks.

At first glance, these semantics appear to be consistent with a synchronous model of execution. A closer look, however, reveals some subtle inconsistencies between the semantics of if-then-else constructs and those of a switch statement. The fact that processor masks are generated before execution of any case substatements means that each substatement is executed no more than once. Execution can conceptually pass through multiple case statements, each consisting of a different shape of parallelism. The flow graph associated with a particular switch statement could also be generated with a series of if-then and goto statements. However, the semantics of the latter code sequence would prevent the execution of multiple case statements, since no provision for changing the shape of parallelism within the if-then statement is supported.

2.4.3. Iterative Constructs

The following three iterative constructs have been extended to support explicit parallelism in Parallel-C:

while (<E>)
   <S>;
do
   <S>
while (<E>);
for (<E1>; <E2>; <E3>)
   <S>;

The semantics of these constructs are basically the same as those of the conditional constructs, with the statement <S> being executed in those PEs where the conditional expression is evaluated to be true. As in previous languages, the set of active processors may not increase over the course of execution. The body of the do-while loop is always executed at least once in all PEs that encounter the loop, whereas the statements in the while and for loops are executed zero or more times, depending on the initial evaluation of the conditional expression.

2.4.4. Assignments and Communication

Synchronous languages have taken quite different approaches to the specification of semantics for assignment statements. Languages such as Glypnir and Actus require that a single extent be associated with any assignment statement. Parallel-C relaxes this constraint somewhat, defining semantics for the following cases:

  1. Left side scalar, right side scalar. (S=S) This is equivalent to a sequential C assignment statement.
  2. Left side scalar, right side parallel. (S=P) The right side of the assignment should select only one element to be assigned to the scalar variable on the left side. If more than one element is selected, the result of the assignment is undefined.
  3. Left side parallel, right side scalar. (P=S) The scalar value on the right side of the assignment is promoted to a parallel value of the shape and extent of the variable on the left side of the statement before the assignment is made.
  4. Left side parallel, right side parallel. (P=P) The assignment takes place in those PEs selected on both sides of the expression, provided that the range of selected elements on the left side is a subset of those selected on the right side. If this condition is not met, the results are undefined.

These semantics have several implications on the model of execution. First, the architecture must have some means of promoting a scalar value to a parallel value of arbitrary shape. This process usually involves some form of global broadcast from the control unit to the selected processing elements. Second, the machine must have a path from each PE to the control unit in order to facilitate parallel to scalar assignments. The Parallel-C semantics for this case are fairly restrictive, yet they do not adequately define this type of assignment. Consider the case where a parallel value returned from a function call is used as the selector in an assignment expression that has a scalar variable on the left side. In such a case it may not be possible to determine at compile time how many elements will be selected in the parallel value, thus the scalar result must be assumed to be undefined. Of course, an explicit check can be made by the programmer, or a warning message can be generated during the flow analysis phase of compilation, but neither of these options appears to be a desirable solution. Finally, the semantics governing assignments between parallel variables disallow implicit communication between PEs by requiring that the assignment take place only in those PEs selected on both sides of the expression. An exchange of data between PEs must take place by means of system function calls that can perform the desired permutations. This is actually quite analogous to the sequential C programming model, where no standard I/O functions are included in the language definition. The semantics also imply that a pointer may not be assigned to values outside of the memory space associated with the processing element in which it is defined. This is a subtle, but important, point.

Casting the type and shape of parallel variables is supported in the language model of Parallel-C, but the semantics associated with such an operation appear to be inconsistent with the rules governing assignments. Without specifying exactly how the reshaping takes place, however, it is impossible to determine whether data is actually moved between processors.

2.4.5. Summary

Parallel-C attempts to provide explicitly parallel constructs that can be used in both a synchronous and asynchronous execution environment. Most of the actual discussion of the language, however, involves only the synchronous constructs. The use of selectors in parallel expressions is an elegant means of specifying parallel processes, and can likely be incorporated into a unified programming model. Unfortunately, the control flow semantics in the language are inconsistent at times, and do not appear to be extendible to an asynchronous execution model.

The communication model used in Parallel-C requires that assignments between parallel variables of differing shapes take place only in elements selected on both sides of the statement. Other assignments can still take place, but must use library or system calls to perform the permutation. This conservative approach may not yield the most general programming model, but it does simplify the implementation of the language considerably.

2.5. C*

C* [RoS87] was proposed and implemented as a parallel extension of C for the Connection Machine [Hil85] . Its programming model is intended to allow easy manipulation of "data-level parallelism" through use of special parallel data structures and operators. The Connection Machine consists of a large number of simple processing elements that are effectively connected in a "hypercube" configuration. Access to the machine is provided through an attached front-end host that acts as the control processor for the system. Functionally, the Connection Machine behaves as a SIMD model, although C* is intended to support both large and fine-grained parallelism.

2.5.1. Expression of Parallelism

As in other synchronous languages, C* distinguishes between variables and data that reside in the front end, or control unit, and those which are allocated in the processing elements. C* uses the terms mono and poly, respectively, but also introduces the concept of domains. A domain is a structure type that describes the memory layout of a data processor; its syntax closely resembles that of a class in the C++ language [Str86] . By definition, all code that belongs to a domain is parallel, while all other code is serial. Poly variables are only parallel if accessed in parallel code. In C*, only code that is activated within a domain may be executed in parallel. This activation is performed with a selection statement, as defined below:

[domain tag] . statement

The selection statement activates each instance of the specified domain, allowing parallel execution of the subsequent statement. When the statement is completed, all instances of the domain are deactivated.

2.5.2. Conditional Constructs

The control flow model in C* is perhaps the most complex (and general) set of semantics yet proposed for control flow in an explicitly parallel language. All standard C conditional constructs are supported in C*, although the goto statement has not yet been implemented.

The C* programming model provides a single locus of control, referred to as the "master PC." The "master PC" is analogous to the program counter in a sequential architecture, and is consistent with the synchronous (SIMD) model of execution presented by the Connection Machine. In parallel code (e.g. code invoked within a domain), each data processor selected by the current domain is also assigned a "latent PC" that is designated as being either "active" or "waiting." An active latent PC is always at the same point in the code as the master PC, while a waiting latent PC is elsewhere in the program. The desired effect of this model is that parallel code should be executed as if each processor is executing a separate copy of the code, but with all processors remaining synchronized. To this end, C* proposes a number of rules to govern the semantics of conditional and iterative constructs, including the Rule of Merging Control and the Block Synchronization Rule.

The Rule of Merging Control asserts that whenever the master PC passes a point during execution where one or more latent PCs are waiting, all such latent PCs become active and continue execution with the master PC.

The Block Synchronization Rule is the heart of the C* control flow model. This rule states that whenever the master PC reaches the end of a statement or substatement, every active latent PC becomes inactive and must "wait" at the point immediately following the statement. At this point, execution is halted and the master PC is transferred to a new point in the code. The new point is selected by examining the just completed substatement and then searching outward through the lexical hierarchy of nested statements until a waiting latent PC is found. Control is then transferred to this point via the master PC and execution continues.

2.5.3. Iterative Constructs

In C*, iterative constructs are handled exactly as conditional constructs, by invoking the Block Synchronization Rule. As in Parallel C, all standard C constructs supporting repetitive evaluation of a conditional construct are supported. The set of active processors participating in a loop may only decrease monotonically.

2.5.4. Assignments and Communication

C* attempts to provide a more general set of assignment semantics, but still tends to infer too much from the execution model of the target architecture. Assignment semantics in C* can essentially be summed up in two rules: the Replication Rule and the As-If-Serial Rule.

The Replication Rule states that a scalar value is "automatically" replicated where necessary to form a parallel value. The word "automatically" again implies the presence of a broadcast mechanism in the hardware, a rather reasonable assumption shared by Parallel-C (and many other parallel languages).

The As-If-Serial rule asserts that some sequential ordering will be applied on an operator-by-operator basis for every active processor in a statement. In other words, operations on poly (parallel) variables take place as if the active processors performed the operations in some unspecified (and unpredictable) order. These semantics are critical, for example, when performing reduction operations where a number of processors are updating a single mono (scalar) variable. The As-If-Serial rule is said to facilitate parallelism by imposing sequential semantics. While this may seem intuitively true for cases where the left-hand side of an assignment is a mono variable, it less clear in the case of parallel assignments. Application of the As-If-Serial rule also brings up an interesting case of broken symmetry, where statements that would be equivalent in a sequential language lead to different results when the operands are parallel values.

The C* semantics for assignment statements differ from those presented in Parallel-C in two important ways. First, assignments from parallel to scalar variables are supported through invocation of the As-If-Serial Rule. Thus, C* allows multiple reduction and multiple broadcast operations to take place among values of arbitrary shape. The second difference involves the use of pointer indirection as a means of interprocessor communication. In C*, the statement

*p = *q;

sends a "message" containing the value stored at location q to location p. Although this appears to be a desirable feature of the language, it infers a great deal from the underlying machine model.

2.5.5. Summary

C* makes an ambitious attempt to specify a completely general synchronous model of computation. The language semantics are fairly self-consistent, with the possible exception of some of the control flow constructs, as described above. The primary difficulty with C* involves the inherent SIMD synchronization present in the semantics of the communication model. By specifying that operands to assignments are evaluated on each side of the statement for all instances before any assignments are made, C* prevents asynchronous execution from ever taking place. If an ordering of evaluations is desired, the programmer could simply specify appropriate code to accomplish what the As- If-Serial Rule requires.

For instance, the assignment:

x = y;

where x and y are defined as poly variables, will first evaluate x in each active processor, then evaluate y in a similar manner, and finally make the appropriate assignments.

Instead of requiring such an ordering, the same effect could be attained by replacing the above assignment with:

tmp = y;
x = tmp;

The less restrictive semantics needed for this second case are preferable in a general parallel programming model.

2.6. Conclusions

Based on the above discussion of synchronous languages, a couple of observations can be made concerning the expression of parallelism through variable declarations. First, variable declarations should provide a means of "hiding" the machine topology by directly expressing the data relationships of the algorithm to be executed. This characteristic is especially important if the program is to be scalable and portable. The second observation involves the use of domains, as in C*. Although domains are a reasonably elegant way to describe the memory layout of processors, they are not (as currently specified) easily extended to an asynchronous programming model.

One of the problems inherent in all of the control flow models discussed thus far is the failure to differentiate between three basic attributes of parallel computation. In a sequential language, semantics need only specify how a construct is to be interpreted in the context of a given expression or statement. In a parallel language (explicit or otherwise), the semantics must specify not only how, but also when and where a construct can (or cannot) be applied. It is hardly a coincidence that a compiler needs exactly this information in order to perform automatic parallelization.

In most synchronous programming languages, the semantics for conditional and iterative constructs provide a reasonably consistent definition of how and when a particular construct is to be executed, but they do not always address the issue of where the construct may be applied. Consider, for example, the goto statement. Most languages either forbid the use of such a statement in parallel code (e.g. Actus), or restrict the control flow model in some asymmetric way (e.g. Glypnir).

Languages such as Parallel-C and C* attempt to define a more general control flow model, but fail to specify self- consistent semantics for arbitrary control flow patterns. For instance, the C* language specification does not restrict the use of a goto statement within parallel code, yet it is unclear what would result of a branch from the body of one conditional statement to that of another, especially if the set of active processors differs in the two code sequences. The C* code sequence of Listing 2.1 illustrates this point.

extern int random();
domain grad_student {
    short salary;
    double hours;
    int knowledge;
    int passed_qualifier;
    mono char *poly adviser;
} Jim, Mike, Scott, Carla, Todd, Greg;
    [domain grad_student].{
TEST:   passed_qualifier = (random() & 1);
        if (!passed_qualifier) {
            knowledge = 0;
            goto STUDY;
        } else {
            knowledge += 100;
        }
        while (knowledge < 1000) {
            ++knowledge;
            if (knowledge > 500 && !passed_qualifier) {
                goto TEST;
            }
STUDY:      hours += 2.0;
        }
    }
         Listing 2.1  Arbitrary control flow in C*

Admittedly, this is a rather contrived example, but it could be executed on a sequential machine. In a parallel model of execution, however, the set of processors that execute the if substatement will likely differ from those enabled in the body of the while statement. In this case, the C* control flow model appears to support contradictory semantics. The semantics for iterative and conditional constructs state that the set of active processors for each iteration can only decrease monotonically, yet the Rule of Merging Control asserts that the latent PCs "waiting" at the STUDY label will become active as the master PC passes that point in the code. Clearly, an ambiguity results. A similar situation could result when a goto statement branches from one domain to another, although the latent PCs that exited the domain via the goto statement would presumably be "destroyed" upon completion of the code block containing the goto statement (as a result of the Block Synchronization Rule).

Although virtually all synchronous languages share the problem illustrated above, the means of specifying parallel data in such models can be abstracted to a more general model where no distinction is made between a process and the processor on which it executes. This concept will be explored in greater depth in Chapter 4.

3. ASYNCHRONOUS LANGUAGES

3.1. Asynchronous Constructs

While synchronous languages tend to express parallelism through data layout and algorithm topology, asynchronous languages focus on processes and the means of interprocess communication.

Two basic methodologies have been developed to deal with asynchronous parallelism, one based on the model presented in CSP [Hoa78] , and another that is often referred to as a single program, multiple data (SPMD) model. Examples of each language type will be considered in this chapter, including Occam, Concurrent C, Linda, and the Force.

Although the two models differ somewhat in style and implementation, both share a common means of expressing parallelism through asynchronous constructs:

  1. specification, creation, and termination of sequential processes
  2. connections to, and communication with, other sequential processes

Unlike synchronous constructs, most of the problems with asynchronous constructs lie in the difficulty of efficiently implementing the language model on existing parallel machines and in the fact that static analysis of program behavior is both difficult and of limited use.

3.2. Occam

Occam [Inm86] was developed as a language for programming systems constructed using multiple Inmos Transputer chips. Although the transputer chip only provides for local memory (memory directly accessible to just one processor), the transputer chip also incorporates four high-speed serial links that can be used to send information to other transputers.

Occam is based on the Communicating Sequential Processes [Hoa78] model for parallel execution, in which input and output primitives are treated as basic building blocks of structured communication between otherwise sequential processes. Nondeterminism in the CSP model is handled through the use of guarded commands that permit input and output specifications to be integrated with control flow constructs.

3.2.1. Specification of Processes

The primary means of specifying executable code sequences in Occam is a process. Occam contains three primitive processes, each of which can be combined with other primitives to form more complex sequential or parallel processes. The primitive processes defined in Occam are assignment, input and output.

Assignment simply involves changing the value of a variable, much like assignment statements in sequential languages. Occam variables are declared without type, and are restricted in scope to the process in which they are defined. For example, the Occam statement

a := b + c

simply sums the values of variables b and c and assigns the result to the variable a.

The input and output primitives, signified by the symbols ? and !, respectively, are used to pass values between processes. The Occam statements

in ? x;y
out ! x+y

will input values for the variables x and y from the channel named in, add the two values, and output the sum to the channel named out. In general, any valid Occam expression may be specified as the value to be passed to another process via an output statement. The use of the input and output primitives are discussed further in the following section.

Primitive processes can be combined through the use of constructors. The three basic Occam constructors are seq, par, and alt, corresponding to sequential, parallel, and alternative processes. In a sequential process, primitive component processes are executed in a textually specified sequence; the process terminates upon completion of the final component process. Parallel processes allow component processes to execute concurrently, with the parallel process terminating only when execution has completed in all component processes. Parallel processes are differentiated on the basis of the indentation of the code. The textual order of the processes is insignificant, since the processes proceed concurrently.

A slightly different process specifier is the alt constructor. The alt constructor specifies that one of the component processes is to be arbitrarily selected and executed. Each component of an alternative process has a "guard" associated with it as a means of selecting one of the component processes for execution. The guard consists of an input primitive along with an optional Boolean expression. An alternative process remains idle until at least one guard condition has been met. If the condition is satisfied in more than one process, one of the processes is selected at random. Execution of the selected process then takes place.

A simple Occam program is illustrated in the Occam code segment of Listing 3.1.

chan ina, inb:
chan outa, outb:
chan intesta, intestb:
seq
  prod1 := 0
  prod2 := 0
  par
    seq
      var testa:
      intesta ? testa
      while testa
        var a:
        seq
          ina ? a
          prod1 := prod1 * a
          outa ! prod1
    seq
      var testb:
      intestb ? testb
      while testb
        var b:
        seq
          inb ? b
          prod2 := prod2 * b
          outb ! prod2
         Listing 3.1.  Occam process specification

The two while loops in Listing 3.1 are executed in parallel, but the code within each loop is executed sequentially. The Boolean expression true is evaluated as part of the repetitive while process, and, in this case, indicates that the loop processes will never terminate. In additional to the repetitive while process, Occam provides a conditional process, signified by the if keyword. Since code within an Occam process is local to that process, the semantics of conditional and repetitive processes are quite similar to those found in sequential languages; these constructs are not discussed further here.

Occam processes can be named in a manner analogous to subroutines in a sequential language, although the implementation more closely resembles that of "macro" substitutions. For example, the two parallel processes in Listing 3.1 could be specified as follows:

proc multiplier (chan in,out,intest var prod) =
  seq
    var test:
    intest ? test
    while test
      var x:
      seq
        in ? x
        prod := prod * x
        out ! prod
where in, out, intest, and prod are formal parameters repre- senting the channels and variable unique to each process. The parallel invocation of the multipliers could then be specified as:
par
  multiplier (ina,outa,intesta,prod1)
  multiplier (inb,outb,intestb,prod2)

It is also possible to specify a group of parallel processes using a replicator to create "an array of concurrent processes" all executing the same code:

chan in[4], out[4], intest[4], prod[4]:
par i = [0 for 4]
   multiplier (in[i],out[i],intest[i],prod[i])

would create four processes, one each for values of i equal to 0, 1, 2, and 3. This notation is concise, but the Occam scheme for specifying processes can lead to a great deal of inefficiency.

Consider a parallel algorithm that is to perform an operation on each element of a 64 element array, on a computer that has only 16 processors to devote to the problem. Given the Occam model, 64 processes must be created, despite the fact that only 16 will be executing at any given time. No provision is made in Occam for generating code that would divide the workload evenly among the 16 processors as just 16 processes. This lack of dynamic load balancing capability becomes extremely inefficient when the time needed to perform operations on individual elements varies widely. In fact, the effective execution rate in such cases becomes bounded in a manner that is similar to execution on a SIMD architecture.

3.2.2. Communication Between Processes

In Occam, a process is generally associated with a processor, and connections between processes are referred to as channels. A channel connects two parallel processes such that one process can "send" a value onto a channel and a second process can "receive" that value from the channel. A process can communicate with any number of other processes, but communication is only one-way; a second channel must be established between two processes if two-way communication is to take place. Consider, for instance, the following Occam code sequence:

chan in, out:
var x:
seq
    in ? x
    out ! x

This sequence defines two I/O channels, IN and OUT. The program first reads a value from the channel called IN (using the input operator, "?") and stores this value into the variable called X. The value is then sent out on the channel called OUT (using the output operator, "!").

By treating communication primitives as programming objects that can be manipulated as data and variables, Occam has provided a clean, if not overly elegant, method of expressing communication between processes. The problem with communication in Occam, however, is that transactions between processes can only be expressed through a channel. The destination of a message between processes can not be directly specified, but instead must be explicitly routed through intermediate processes. This scheme not only lowers the efficiency of processes that must pass messages along from neighboring senders and receivers, but also requires that an algorithm be carefully mapped onto the machine topology. Thus, Occam programs are typically tied to a particular interconnection pattern and cannot be easily moved between machines.

3.2.3. Summary

Although Occam provides a concise representation of the CSP programming model, it is very closely linked with the Inmos transputer for which it was implemented. The necessity of explicitly specifying communication channels leads to code that becomes difficult to move between systems or even machine configurations. The synchronous message passing model also presents difficulties in that synchronization and communication issues are once again inseparable.

Occam process control is flawed in that the programmer must either bear the inconvenience of writing scheduler code or suffer inefficient execution due to ineffective process management.

3.3. Concurrent C

Concurrent C [GeR86] extends the CSP model used in Occam to provide a more general (and complex) concurrent programming model. Like Occam, Concurrent C chooses a synchronous message passing scheme to facilitate parallelism, with processes proceeding independently except when communication takes place. Concurrent C rejects a shared memory model of asynchronous computation on the basis of intended applications, such as a collection of workstations connected by a local area network.

The first distributed implementation of Concurrent C targets several identical computers connected by an Ethernet. In this implementation, however, only processes on the same computer can share global data.

3.3.1. Specification of Processes

Concurrent C separates the definition of a process from its instantiation. A process definition consists of a type (specification) and a body (implementation). Each instantiation of a process definition is referred to as a process. Concurrent C processes are sequential in nature, and follow directly the semantics of a sequential C program. Each process effectively has its own set of resources, such as a stack, registers, and program counter. The scheduling of processes on the underlying architecture is not specified by Concurrent C, as this is considered to be an implementation issue. In the definition of a process, only the type is visible to other processes; the body of a definition cannot be externally accessed.

A process type is defined as:

process spec process-type-name (parameter declarations)
   {transaction declarations};

For instance, the declaration

process spec semaphore (int count)
{
  trans int inc();
  trans int dec();
}

defines a process that has one parameter and two associated transactions.

A process body specifies the statements that are to be executed whenever the process is invoked. In many respects, a process body is similar to the body of a function in C, except that a process cannot return a value. The format of a process body is:

process body process-type-name (process-parameter-names)
  statement

A process is invoked through use of the create operator as follows:

create process-type-name (initial-values) [priority(p)]

The integer expression p is an optional parameter that gives the process a priority relative to some implementation- dependent standard. A number of built-in functions are also included in Concurrent C to assist in the handling priorities, but these fall beyond the scope of the current discussion. Each created process returns a unique id that can be stored in a process variable of the same type. Arrays of process variables and pointers to processes are also valid in Concurrent C. For instance, given the process definition of semaphore above, the following variables could be defined:

process semaphore s, sa[8], *sp;

A Concurrent C program begins with a single active process that calls the function main(); execution of the program proceeds from the body of the main function.

A process can exist in one of three states at any given time. A process is active as soon as it is created, and remains so throughout the period during which the statements of the process body are being executed. A completed process is one that has executed a return statement in the process body or has reached the end of the body statements. Finally, a process is said to be terminated when the process and all processes created by it have completed. A process can also become terminated by executing a terminate alternative as part of a select statement. (The select statement will be discussed in the following section).

3.3.2. Communication Between Processes

As in Occam and CSP, communication in Concurrent C is performed by a synchronous message passing protocol. The message passing primitives collectively form a rendezvous between two processes, where one process acts as the client and the other acts as the server of the message. Whereas communication in Occam is unidirectional (a simple rendezvous), Concurrent C supports bidirectional communication referred to as an extended rendezvous, or a transaction. A transaction in Concurrent C has the format:

process-value.transaction-name (actual-parameters)

where process-value specifies a specific process, and transaction-name designates a transaction that is contained in the process type of the chosen process. Arguments are passed by value as in C function calls.

A transaction is initiated, or called, by the client process, which is then "blocked" from further execution until the requested server process accepts the transaction call. When a rendezvous is established, the message is passed from the client to the server. The client remains idle while the server performs the requested task. Upon completion of the task, the server returns the results, if any, to the client process, which may then proceed with execution.

For each transaction within a process, the process type specifies a name for the transaction, the types of the arguments passed to the server, and type of value returned to the client. Thus, in the semaphore process defined above, both the inc and dec transactions return an integer value but have no arguments. The following example illustrates the creation of a process and a call to the member transactions of the server process s.

int i;
process semaphore s;
s = create semaphore(8);
s.inc();
...
s.dec();

This example demonstrates a pair of transactions from the perspective of the client. Since Concurrent C transactions are bidirectional, it is also necessary to examine the language mechanisms provided to service a transaction request. The accept statement is a transaction as seen from the side of the server, or called process. The accept statement has the form:

accept transaction-name (parameter-names) [suchthat (e)][by (ae)]

where suchthat specifies an optional boolean expression that involves one or more of the accept statement parameters, and by specifies an arithmetic expression that denotes an ordering of the transactions to be accepted (the default is FIFO order). In a sense, the suchthat clause serves the same purpose as processor address masks in synchronous languages, since only those transaction calls that have an expression that evaluates to be true are considered for acceptance.

Concurrent C provides a select statement to handle alternative, nondeterministic events. The format of the select statement is:

select {
   [(guard1):] alternative1
or
   [(guard2):] alternative2
...
or
   [(guardn):] alternativen
}

where a guard is a boolean expression and an alternative is block of Concurrent C statements. The order of evaluation of guard expressions is unspecified, and all guards are not guaranteed to be executed. Exactly one alternative is chosen and executed by a select statement. If guard expressions for multiple alternatives are evaluated as true, a series of precedence conditions are followed based on the type of the alternatives. The type of an alternative is determined by the textually first statement appearing within each block. Highest priority is given to an alternative that begins with an accept statement for which a call is pending. Next in line is an "immediate" alternative (an alternative that contains no accept, delay or terminate statements).

Thus, communication between processes is intricately interleaved with control flow in the Concurrent C model. Nondeterminism is inherent in the control flow structure, and the programmer must take care to ensure that desired program behavior results when using Concurrent C.

3.3.3. Summary

Concurrent C introduces a fairly complex set of new operators and program structures to support asynchronous parallel programming. As in Occam and CSP, the semantics of process specification are fairly consistent, and code within a process body usually executes as if it were on a uniprocessor. A synchronous message passing model is used for interprocess communication, but unlike Occam, this communication can be bidirectional. Concurrent C manages to avoid some of the pitfalls of Occam in regards to process connection patterns, but still falls short of the ideal of independence from machine topology. Through the use of transaction pointers and process variables, Concurrent C processes are able to communicate directly with other processes, regardless of the physical position of the processes in the system. Transactions need not traverse neighboring processes that otherwise would not be involved in the exchange of information in order to reach a destination. Where Occam tends to map only to a particular machine (transputer), Concurrent C maps to a class of machines - namely, those that do not use a shared memory model of execution. ( Actually, Concurrent C can be implemented on a shared memory system, but constructs for managing shared memory were deliberately left out of the language to ensure portability. )

The primary problem with the Concurrent C model is the inability to express synchronization and communication independently. As currently specified, the synchronous message passing model requires synchronization in all cases, including those in which synchronization may not be necessary.

3.4. Linda

A dramatic departure from the synchronous message passing model of Occam and Concurrent C is presented in Linda [GeC85] .

Linda was developed as a programming tool to simplify the specification of explicitly parallel programs. The primary goal of Linda is to uncouple the spatial and temporal aspects of process management in a parallel processing environment. The Linda system consists of a preprocessor and compiler that translate language constructs into host language code, and a host language-independent runtime kernel that handles interprocess communication. Linda constructs have been added to both C and Fortran host languages, and the sytem kernel has been implemented on AT&T Bell Labs' S/Net multicomputer, a MicroVAX network, and the Intel iPSC hypercube.

3.4.1. Specification of Processes

Linda, despite its sophisticated management of communication, does not stress the specification of parallel processes. The most unique feature of the Linda system is the use of a shared tuple space (TS) as the memory model. Rather than accessing conventional memory by addresses, Linda uses a logical tuple name to refer to data objects. In some respects, Linda deliberately avoids the specification of processes, opting instead to use distributed data structures as the means of expressing parallelism. Unlike many asynchronous languages, Linda allows shared data objects to be accessed by any processor. There is no notion of a "manager process" or "monitor" that accesses shared data on behalf of a requesting process. A Linda program is not partitioned into processes to be assigned to processors. Instead, Linda employs a "replicated worker" model that replicates a program sequence several times to form processes that execute simultaneously on distributed data structures. Processes in Linda are therefore uncoupled, with no aspects of a machine or algorithm topology present in the model.

Linda's process management scheme relies heavily on the ability of the underlying system to perform dynamic task scheduling, especially when the time required to complete independent tasks varies significantly. The Linda model attempts to support dynamic scheduling by providing a distributed data structure referred to as a "task bag." All processes receive work assignments from the task bag, and add to the bag any additional tasks that are created during the course of executing the current assignment. This scheme also leads to scalable programs, since increasing the number of available processors will likely increase, and never decrease, system performance.

3.4.2. Communication Between Processes

The model of communication in Linda is far more abstract than that of Occam or Concurrent C. Through use of the shared tuple space, all processes may directly communicate with one another. Hence, rather than sending data to a destination (Concurrent C) or through a channel (Occam), a Linda program places tuples of data into the tuple space. Once a tuple has been placed in the TS, any process may examine or remove the tuple from the TS. A tuple in Linda consists of character string that acts as a logical name, as well any other specified variable fields. Access to a particular tuple is granted by matching the name tag and all other tuple fields. Hence, the tuple space is essentially an arbitrarily large, fully associative, tuple memory shared by all processes.

Parameters specified without type are assumed to be actual parameters; the keyword formal may also be used to treat a previously declared parameter as an actual value during an operation on the tuple space.

The tuple space may be accessed through the use of three primitive operations: out(), in(), and read(). The out(t) and in(t) operations cause a tuple t to be added to or withdrawn from the tuple space. The read() operation makes a copy of the requested tuple parameters, if found, without actually removing the tuple from TS. Thus, successful completion of either the read() or in() operations is dependent upon an exact match of the requested tuple parameters and those of the named tuple currently residing in the tuple space. A fourth, non-primitive operation, eval(), will allow unevaluated tuples to be added to the tuple space.

An example of communication via the tuple space is illustrated below:

/* place ("abc", 5, 6) into TS */
out("abc", 5, 6);
/* get tuple called "abc" from TS */
/* here, i is set to 6 */
in("abc", 5, int i);

There are three basic difficulties in using this model. The first is that sequential segments of a Linda program are difficult to describe statically. For instance, it is difficult to determine which process will in the tuple that another process put out to TS. Although the awkward constraint of Occam that communication take place exclusively between neighboring processes is removed, the fact that Linda provides only a single shared name space for tags makes it difficult to ensure that multiple processes will not unexpectedly generate the same tag.

The second problem is that the fully associative nature of the processing of TS tuples makes it difficult to construct an efficient implementation. While shared memory models do not necessarily imply the presence of physically shared memory in hardware, an associative tuple space requires special hardware that is not available in most systems. This limits the portability of the Linda system.

Finally, the use of a single shared tuple space requires all values to be globally visible. As a result, optimizations based on locality of reference and lexical scoping cannot be used for programs written in Linda.

3.4.3. Summary

Linda represents a fundamental change of direction in asynchronous programming models. Rather than support a synchronous message passing model, Linda allows data to be globally shared throughout the system. Instead of sharing memory, however, Linda systems share a tuple space that corresponds to names of data objects.

Although Linda may present some difficulties as a general parallel programming model, the language semantics are very straightforward. As a result, it may be possible to use Linda as the target of a higher level language that presents the user with a more general shared memory model. Transformations between shared variables and entries in the tuple space are made possible through use of the in() and out() operations described above. The efficiency of such translations is likely to remain extremely machine dependent, however.

3.5. The Force

Although the programming models used by "languages" such as the Force [Jor87] , BBN Uniform System [Bab88] and IBM EPEX [Bab88] are quite different from those used in CSP, Occam, Concurrent C or Linda, they still tend to support an asynchronous style of parallel programming. In particular, the programming model used in the Force, Uniform System and EPEX is often described as a SPMD model for multiprocessors that utilize a shared memory space. The Uniform System was developed primarily for the BBN Butterfly, although it has essentially been replaced with the Mach programming environment of Carnegie Mellon University. EPEX was initially designed for use on the IBM RP3 system, but has also been implemented on the IBM 3090 series. The Force has been implemented on a number of machines, including the Denelcor HEP, the Flex/32, and the Sequent Balance Series. These machines all share the characteristics that main memory is directly addressable by each processor and that memory access time is essentially independent of the memory location being referenced. As a result, communication between processes is accomplished primarily through the use of shared variables, rather than the message passing scheme used by languages discussed previously. For the purpose of this discussion, a shared variable is any variable that may be directly accessed by any process in a system, regardless of the physical attributes of the underlying hardware. Thus, shared variables could conceivably be present even in systems that do not support shared memory in the system hardware.

It is interesting to note that none of the SPMD systems described above is actually a language; each consists of a set of language constructs that are used as macros within conventional sequential programs. The macros are resolved by a preprocessor, and code is generated by a standard sequential compiler. Rather than describe the exact syntax of all three systems, this discussion will focus on one - the Force. The discussion is relevant to EPEX and the Uniform System as well, since all three systems share a similar style and structure.

3.5.1. Specification of Processes

The most fundamental characteristic of the Force computational model is the lack of explicit programmer control over process management. Processes are never created or destroyed over the course of execution in a Force program; processes are allocated and created at program initialization by low-level system dependent code. Unless restricted by an explicit parallel primitive, all statements in a Force program are executed by all processes in parallel. The number of processes assigned to a program is always independent of the number of actual processes available on the system. The management of both actual and virtual processes is therefore hidden from the Force programming model. Despite the lack of process control, The Force model does support the notion of separate instruction streams, thus supporting asynchronous parallelism.

A Force program begins with the declaration of a force of processes as follows:

Force <name> of <no. of procs> ident <proc id>
   <variable declarations>
   [Externf <module name>]
End declarations
   <Force program>
Join

where parallel environment variables are visible to the macros within the program, and runtime information regarding the number of processes and the current process identifier is available to the programmer. The Join macro signifies the end of the main program in the Force. A parallel subroutine can be invoked in a similar manner, with the macro Forcesub used in place of the Force macro above. Parameters are passed to a Force subroutine just as they would to a conventional Fortran subroutine. A Fortran RETURN statement is used to terminate a Force subroutine; all processes must execute the statement at some time, however.

Macros corresponding to control flow operations in a sequential language actually deal with distributing the work load in a Force program. Loop structures are based on the DOALL construct of parallel Fortran implementations. In a DOALL loop, each process executes the body of the loop statement for a different value of the index variable, provided that their are no restrictive or ambiguous data dependencies between iterations of the loop. The loop body is therefore treated as a sequential code block by each process, with no particular ordering specified for execution of the different iterations. The Force provides two variations of the DOALL construct, as illustrated below:

Presched Do <n> <var> = <i1>,<i2>[,<i3>]
   <loop body>
<n> End Presched Do
Selfsched Do <n> <var> = <i1>,<i2>[,<i3>]
   <loop body>
<n> End Selfsched Do

The Presched Do macro statically assigns index values to processes on the basis of the number of index values to be iterated and the number of processes available in the force. The Selfsched Do macro, by contrast, allows processes to dynamically schedule themselves over the iterations of the loop by acquiring the next available iteration from a shared index variable. Doubly nested loops may also be parallelized in certain situations through use of the Pre2do and Self2do macros that closely resemble the above prescheduling and self-scheduling macros in both format and usage.

A parallel case statement is also provided to distribute independent single stream code blocks across the force of processes. The Pcase macro is declared as:

Pcase on <variable>
[Pcond(<Fortran IF condition)]
   <code block>
[Usect]
[Pcond(<Fortran IF condition)]
   <code block>
   ...
End Pcase

The conditions associated with each block of code are evaluated in an arbitrary order, each by a single process. Since each "case" is associated with only a single (arbitrarily chosen) process, conditions are generally most useful when involving shared variables.

3.5.2. Communication Between Processes

Unlike languages based on the CSP message passing model, the Force uses shared variables to implement interprocess communication. In the Force environment, a variable must either be private to a single process, or globally accessible by all processes. The distinction between private and shared variables is deliberately kept independent of textual scoping constructs, such as the Fortran common directive, to avoid confusion of parallel attributes with scoping issues. The format of variable declarations in the Force is:

Private <type> <variable list>
Shared <type> <variable list>
Async <type> <variable list>

where Async variables are a special class of shared variables that are to be used in certain synchronization constructs to be described later in this section. In the Fortran implementation of the Force, the type of a variable list in the above declaration formats may be preceded by the directive Common /<label>/ as in conventional Fortran programs. Shared variables may be directly referenced by name, as can Private and Async variables. Thus, addition of parallel attributes in the Force do not affect the naming conventions of the underlying language.

Since explicit process control is not visible to the programmer in the Force, there is never a need to "send" values to, or "receive" values from, other processes. Values are instead passed between processes through variables declared as being shared. One side effect of this communication model is the need for explicit synchronization primitives. Languages such as Occam and Concurrent C provide implicit synchronization in the message passing model, since processes requesting service must wait for the request to be accepted by another process. The Force computational model separates the issues of communication and synchronization to some degree, and must therefore provide a set of appropriate constructs to manage synchronization. The synchronization macros in the Force can be divided into two classes, depending on whether they deal with control flow or data synchronization.

The primary control-oriented synchronization macro is used to implement barriers across processes. The format of this macro is:

Barrier
  <code block>
End Barrier

The semantics of a barrier in the Force require that all processes execute the Barrier macro before one (arbitrarily chosen) process proceeds to execute the block of code within the barrier definition. When this execution is complete, all processes resume execution with the statement that immediately follows the End Barrier macro. These semantics unfortunately limit the ability of a programmer to break processes into subtasks, since only a single process may execute the code within a barrier block. This requirement, of course, was necessary in order to avoid explicit specification of processes - a clear objective of the Force methodology.

A second control flow synchronization object allows critical sections to be specified for any shared variable. The format of a Force critical section is:

Critical <lock-variable>
  <code block>
End Critical

where the specified code block may only be executed by a single process at a time in all critical sections that specify the same lock-variable.

Data synchronization is based on the notion of producer-consumer pairs, where shared variables contain both a value and a binary state that indicates whether the variable is full or empty. Data synchronization objects must be declared as Async variables to convey the fact that they may be accessed in an asynchronous manner by various processes. The data synchronization mechanisms in the Force are described as follows:

Produce <async variable> = <expr>
Consume <async variable> into <private variable>
Copy <async variable> into <private variable>
Void <async variable>
Isfull(<async variable>)

where an invocation of any of the above processes is considered to be an atomic (indivisible) operation with respect to other processes. The Produce macro waits for the asynchronous shared variable to be set empty, sets its value to the result of the specified expression, and marks the variable as being full. The consume macro waits for the asynchronous variable to be set full, at which time its value is assigned to the specified private variable and the asynchronous variable is once again marked as empty. Copy is similar to Consume, except that the asynchronous variable is not set the empty state after the assignment is made. The Void macro sets the state of an asynchronous variable to empty, regardless of the original state of the variable. Finally, the Isfull() macro acts as a logical function call that returns the state of an asynchronous variable without blocking on either full or empty states. Care must be taken to insure that deadlock ( For the purpose of this discussion, deadlock refers to any situation in which a producer and consumer of a shared object reach a state where neither can proceed without the other process changing state. ) does not occur in code that uses producer/consumer synchronization, since the Force does specify any deadlock prevention mechanisms.

3.5.3. Summary

The problem with the SPMD model is primarily that shared variables impose very little structure on communication and relatively small penalties for having shared structures (especially on the machines for which these systems were developed), hence there is a tendency to write SPMD programs in which very little concern is invested in minimizing visibility of data. To communicate between two processes, one must create a shared variable that is visible to all processes -- essentially the same problem encountered with Linda's mechanism, except that references are made using a name given at compile-time rather than a runtime associative lookup. The problem is not unlike that of developing large sequential programs as independent modules using a programming language that only provides for global variables.

Of course, not all communication is for the purpose of transmitting values; there is also the problem of communicating synchronization objects. There are a wide variety of synchronization mechanisms supported, but the key issue is really the determinism of the program's execution, not the primitives used to achieve that goal.

3.6. 2 Conclusions

As opposed to synchronous languages, where parallelism is expressed through data layout and algorithm topology, asynchronous languages focus on communication between processes as the means of managing parallel execution. The model of communication varies from the message passing of CSP to the shared memory SPMD model of the Force. Regardless of the model, however, it is apparent that synchronization has traditionally been closely associated with communication in asynchronous languages. Thus, the key aspect of asynchronous languages, as opposed to synchronous languages, appears to be the inclusion of explicit synchronization.

Synchronization in a programming model is not really the issue, however; it is simply a means to an end, where the "end" is actually determinism. Although the set of synchronization mechanisms in most asynchronous parallel languages is quite rich, the provision for specification of determinism is absent. Nondeterminism is inherent in constructs such as the select statement of Concurrent C and the Pcase macro of the Force, but the programmer has no general means of expressing desired or acceptable nondeterminism in an algorithm. The nondeterminism present in all asynchronous models examined thus far is based only on the selection of one of several events that occur in an unknown order. Determinism is enforced through synchronization constructs in all other cases.

For parallel execution of a program, determinism refers to the ability of a code segment to produce the same output for a given set of inputs over a large number of runs. Deterministic execution will occur if and only if the code is free of both races and deadlocks.

A race occurs when two or more processes, either executing concurrently or randomly interleaved, attempt to access the same variable and at least one of the processes is storing into (writing) that variable. A binary race can therefore be either a "read- write" race or a "write-write" race.

A deadlock occurs when two or more processes are waiting for events (e.g., changing variable values) such that none of the events will occur because each event depends on the prior completion of another such event.

Although races cannot always be detected by static flow analysis, the potential for a race can be detected statically and a warning generated.

As for detection of deadlock, consider a binary deadlock between processes Pi and Pj that occurs when Pi waits for event Ej before causing event Ei and Pj waits for event Ei before causing event Ej. Let the place where process Pi waits be called Wi, and the place where process Pj waits be called Wj. In terms of deadlock detection, there are two ways in which this deadlock can occur:

  1. The code is written such that for every possible "parallel flow path" the execution of both Wi and Wj occurs before either Ei or Ej has executed. In general, this type of race can be detected using static (compile-time) flow analysis.
  2. The code is written such that some, but not every, "parallel flow path" places the execution of both Wi and Wj before either Ei or Ej has executed. Although this is not always statically detectable, this by definition is a problem synthesized from two read-write races -- one for Ej-Wi and one for Ei-Wj.

Hence, although it is not possible for a compiler to detect all deadlocks, detecting the first type combined with detecting potential races will insure that each deadlock is at least flagged with a warning message. The problem then becomes one of limiting the warnings to those potential races that actually are races and that were not intended.

An important realization concerning parallel algorithms is that some programs do not require deterministic execution. For example, a weather prediction algorithm might sacrifice deterministic execution to gain speed, since the increased speed would allow use of more up-to-date input data. Nondeterministic execution would be preferred over a deterministic model in such a case, since more accurate forecasts could be generated.

Since none of the asynchronous languages provide a construct for specifying whether or not non-deterministic execution is acceptable, it is not possible to use the above analysis to full advantage. This makes programs much more difficult to debug and maintain -- a problem notoriously associated with the asynchronous languages.

Programming models such as those provided by the Force and EPEX, however, make specification of parallel structure very easy. Using a model based on the idea of multiple control- flow threads running through the same code, these models invoke a type of structured parallel execution that is remarkably similar to that found in synchronous execution models.

Thus, the most important aspect of the SPMD models is that they allow relatively complex asynchronous parallel structure to be embodied within language constructs, exactly as is done in synchronous languages. This makes it possible to merge the fundamental characteristics of both synchronous and asynchronous execution in a single programming model that can be efficiently implemented on a number of architectures.

4. UNIFICATION OF SYNCHRONOUS AND ASYNCHRONOUS MODELS

4.1. Characteristics of a Unified Model

The wide variety of synchronous and asynchronous languages developed over the past several years has led to the evolution of two very different classes of parallel architectures. These classes -- SIMD and MIMD -- can be broken into several subclasses based on specific machine characteristics, but the fundamental difference in the two classes involves the number of possible control flow paths during execution of a program. A SIMD execution model contains only a single conceptual "program counter," regardless of the number of physical processors present in the system. A MIMD model, conversely, may have any number of "program counters" available during execution. As discussed in previous chapters, language constructs have been developed for each class of architecture, generally resulting in programming environments that are not portable, and often not scalable. The apparent incompatibility of synchronous and asynchronous constructs is manifested in two issues related to the number of instruction streams possible in the execution model of a particular architecture -- control structure and data management.

Proper specification of control structure and data management semantics leads to a model where determinism can be specified as an independent characteristic of parallel program structure.

4.2. Control Structure

Control flow in synchronous languages is usually expressed within a construct, much as in a sequential language. Asynchronous parallel control, however, is typically expressed through the interactions of multiple sequential "threads" of execution. Since SIMD architectures conceptually use only a single program counter, synchronous parallel programs can be represented as a single code image. Processors are independently enabled or disabled as the program counter passes through the code image, allowing SIMD processing elements to follow different control flow paths through use of time multiplexing. Asynchronous code being executed on a MIMD architecture may conceptually have any number of program counters, each corresponding to a distinct code image. The appropriateness of this model for MIMD execution was further reinforced by the fact that most early MIMD architectures did not support shared memory, requiring that a separate code image literally be sent to each processor. The advent of shared memory MIMD architectures and the associated programming model has made it possible for processes to share a single code image, since access time to memory is approximately the same for all processors. Of course, each processor could still maintain a separate code image, but system memory requirements and program load times are significantly reduced when only a single code image is maintained, provided that the code is executed by more than one processor.

Algorithms written for MIMD architectures often execute the same sequence of statements, but individual processes may take different paths through the control structure of the program. Through proper use of control flow constructs it should then be feasible to represent an arbitrarily complex set of independent asynchronous processes within a single code image.

Thus, the first step towards unification of synchronous and asynchronous programming models is to maintain a single code image for a parallel program, regardless of whether the program is to be executed on a SIMD or MIMD machine. Consider a program consisting of two processes, A and B, that is to be run on a MIMD architecture with four processors. Let process A be assigned to processor 0 and process B be executed by processors 1, 2, and 3. A single code image could be created as follows:

if (processor_is_0) {
    A
} else {
    B
}

Of course, this example assumes that each processor "knows" its own identity -- a reasonable assumption for most parallel architectures. This code sequence is quite similar to the if-then-else statement of most synchronous languages. In fact, the only difference between the synchronous and asynchronous versions involves the order of execution of processes A and B imposed by the different programming models.

The semantics of a synchronous language would typically specify that processors 2, 3 and 4 be disabled while process A is executed in processor 0. Upon completion of A, processor 0 would be disabled, and processors 2, 3 and 4 would execute process B. Asynchronous semantics would allow processes A and B to proceed concurrently, with the order of execution of any part of task A unspecified with respect to the execution of task B. The synchronous semantics are more restrictive than the asynchronous counterparts, due to the added constraints involving the order of execution. It is interesting to note that the less restrictive asynchronous semantics do not prohibit a synchronous implementation of the above code, since synchronization constructs could be used to effectively emulate synchronous execution.

Synchronous languages include implicit synchronization in the specification of control flow; by separating synchronization and control flow directives, the semantics of synchronous and asynchronous control flow constructs become equivalent. In other words, the above if-then-else statement could be coded by a programmer, regardless of whether the underlying hardware operates in SIMD or MIMD mode. The choice of execution model is made by the compiler, rather than the programmer.

Thus far, the discussion of a unified programming model has two important implications regarding the roles of the programmer and compiler in a parallel processing environment. First, a unified programming model relies heavily on the ability of a compiler to generate appropriate code for both synchronous and asynchronous modes of execution. The code generation phases of compilation will, by necessity, be tied very closely to the underlying hardware. The second implication involves the fact that the programmer may no longer be able to determine the exact mode of execution that will be used for code segments written on machines where both SIMD and MIMD parallelism is supported. This ambiguity may be avoided by permitting the programmer to embed "hints" in the code. These compiler directives would suggest, but not require, either a SIMD or MIMD form of implementation. In a language like C, one might loosely disguise such hints as casts. Such a feature is not necessary, however, if the language semantics allow either mode of parallelism to be adopted.

4.3. Data Management

As seen in the earlier discussion of synchronous and asynchronous models, one of the fundamental means by which parallelism is expressed is the management of data and variables in a program. Vital to the unified programming model is the ability to independently support three aspects of data management: data layout, assignment statements, and synchronization. Collectively, these aspects provide the foundation for an environment where determinism can be treated as a separate characteristic of the programming model. Languages proposed thus far link synchronization and data management at a level that cannot be controlled by the programmer, effectively eliminating separate specification of determinism in a program.

4.3.1. Data Layout

As discussed previously, there is little semantic difference between the layout of variables and data in parallel programming models. Synchronous models tend to map data to the processors of a SIMD architecture, while asynchronous models associate data with processes that may or may not have direct mapping onto the actual processors of a system. In synchronous languages, variables are defined as being stored in either a control processor (scalar variables) or a set of processing elements (parallel variables). In asynchronous languages, variables are either local to a process, or globally shared among all processes. Although the implementation of data layout varies widely among various hardware configurations, there is no fundamental difference between data layout in the programming models of synchronous and asynchronous languages.

Variables declared as being parallel in a synchronous model are assigned a shape and extent of parallelism. In more robust languages, such as Parallel-C and C*, the extent of parallelism need not be related to the actual number of processors available to a program. This mapping of variables and data to virtual processors is quite analogous to the definition of local, or private, variables in an asynchronous model such as that presented by the Force. Such variables are defined for each process regardless of the number processes created. Semantics governing the definition of parallel variables in synchronous languages and private variables in asynchronous models are actually very similar; the major difference is the way in which the variables are assigned to processes or processors.

In synchronous languages, virtual processors are allocated on the basis of data and variable declarations. For instance, the Parallel C declaration:

parallel [16] int x;

allocates an "array" of sixteen virtual processors, each of which contains a definition of the integer x. Compare this declaration with the following Concurrent C code segment:

main()
{
   process myproc p[16]
   int i;
   for (i=0; i<16; i++)
      p[i] = create myproc();
}
process body myproc()
{
   int x;
   /* body of myproc */
}
process spec myproc()
{
   /* declaration of myproc transactions */
}

In this program segment, a set of sixteen processes are created, each containing a private integer variable named x. Processes are explicitly specified in Concurrent C; local variables are then defined within each process. The Force takes this mechanism one step further by eliminating explicit specification of processes. Instead, the variable x would simply be defined as a private integer across a "force" of processes that are implicitly defined at run- time.

Hence, the two code sequences above effectively result in the same set of distributed variables. The distinction between parallel variables in a synchronous model and private variables in an asynchronous models is therefore one of syntax; the semantics are actually the same.

Scalar variables in a synchronous model are generally associated with the control processor of a SIMD architecture. Such a processor is visible to all other processors in the system, although the means by which it is accessed is typically machine dependent. This description is equally applicable to a shared variable in an asynchronous model. Shared variables may be located in physically shared memory, or may be transparently maintained across local memories of a distributed system. Language semantics for shared and scalar variables differ only in the means by which such variables may be accessed; the semantics of declarations and definitions are nearly identical. In fact, ignoring syntax, the only difference between the declaration of scalar and shared variables is the notion that scalar variables can only reside in a single designated processor; shared variables may conceptually be associated with any declared process.

4.3.2. Assignment Statements

Although the layout of parallel variables and data is similar in synchronous and asynchronous programming models, the manner in which these objects are accessed varies considerably. As described above, there are typically two types of parallel attributes associated with variables and data in explicitly parallel languages. Synchronous languages contain scalar and parallel types, whereas asynchronous languages generally make a distinction between shared and private data objects. The four cases of assignment statements in synchronous languages such as Parallel C and C* are equally applicable to asynchronous assignments statements. For notational convenience, in the following discussion let S represent scalar values in synchronous languages and shared values in asynchronous languages. Parallel (synchronous) and private (asynchronous) data and variables will be denoted by the letter P. The four classes of assignments can then be summarized for both synchronous and asynchronous models as follows:

  1. S = S; In a synchronous language, this is equivalent to a sequential C assignment statement, since all scalar data reside in the memory space of a single control processor. The situation is similar with shared data in an asynchronous model, except that shared data need not conceptually or physically reside in the same process. In order to present a unified model, the semantics should simply state that the assignment takes place conceptually within a single process or processor. On SIMD machines, these semantics are inherent in the machine model. MIMD implementations must simply ensure that the process "selected" to access the variable is the same on both sides of the assignment. Synchronization, such as producer/consumer pairs in asynchronous models, should not be specified as part of the assignment semantics.
  2. S = P; The semantics for this type of assignment vary even with in a given programming model. Recall that in Parallel C, a single element of the parallel value should be assigned to the scalar variable. If more than one element is selected on the right side of the assignment, the result of the statement is undefined. The C* semantics are dependent on the type of assignment operation involved. For the standard C assignment operator (=), an arbitrarily selected element of the parallel value is assigned to the scalar variable. This class of C* assignment statements can also be used to express a reduction operation, however, by specifying that each element of the parallel variable assign its result to the scalar variable through an operation that is applied "as if" in some (unspecified) serial order. An example of such a reduction would be the += operator. Asynchronous languages generally use synchronization constructs to control access to a shared variable when updates from one or more private variables are desired. The semantics of the actual assignment are quite straightforward; private variables may asynchronously store their results in the specified shared variable. The relaxed asynchronous semantics are once again more appropriate, since synchronization information is specified separately from the assignment construct.
  3. P = S; The semantics for this type of assignment are surprisingly similar in existing synchronous and asynchronous languages. In SIMD implementations, the scalar value on the right side of the assignment is "promoted" to a value of shape and extent equal to the parallel variable on the left side. The assignment is then made in all processing elements selected by the variable P. This generally involves some form of hardware broadcast, where a single value is replicated across portions of the machine topology. Asynchronous languages allow the shared variable to be accessed by any number of private variables, provided that the value is only being read or copied. Linda actually goes so far as to distinguish between copy and read operations from the shared tuple space. Synchronization primitives are often used in conjunction with this type of assignment to allow critical regions and producer/consumer interactions to be implemented.
  4. P = P; This type of assignment is perhaps the most difficult to specify in a unified model, since synchronization, assignment, and communication information is all buried within the construct. Once again, synchronous languages differ in the the treatment of such a statement, with C* allowing multiple permutations of data and Parallel C requiring that the assignment take place only in those elements selected on both sides of the statement. Asynchronous languages either require that private values on the right side of an assignment be explicitly passed as messages to a known destination, as in Occam and Concurrent C, or that the statement be broken down into two components involving a store into, and subsequent load from, a shared variable. Thus, P=P becomes S=P followed by P=S. Such a transformation is not practical on a SIMD architecture, since data would needlessly be sent to and from the control unit of the system. The semantics of the unified model should therefore specify only that the assignment takes place within a fixed set of processes. For synchronous languages this implies that parallel variables in assignments must be of the same declared extent, but need not have the same set of selected elements on both sides of the statement. The assignments on SIMD machines may actually take place through use of system calls that are transparent to the user.

4.3.3. Synchronization

As alluded to earlier, one of the fundamental aspects of a unified model is the separation of synchronization from communication and control flow issues. Synchronous languages tend to encapsulate synchronization within language constructs. Although this approach prevents synchronous languages from being implemented on more general MIMD architectures, there is clearly a need for synchronization in a synchronous model. As described above, semantics for control flow, communication, and data management can be specified for the less restrictive asynchronous model, yet still be implemented on a SIMD architecture if appropriate synchronization primitives are provided. In the unified model, code executing on SIMD processing elements is treated as a process, as is the code executing in the control processor. The control processor is not tied to a particular location in the machine topology, as it is in all synchronous languages.

The synchronization directives provided in the Force offer a good starting point for the specification of necessary synchronization constructs for the unified programming model. Such constructs must provide a means of designating those processes that are to be included in the synchronization. Unlike the Force, where all processes are affected by synchronization macros, the unified model must allow selective synchronization to be specified for both SIMD and MIMD processes.

The most general of the commonly used synchronization primitives appears to be the barrier. For the purpose of this discussion, a barrier is simply a synchronization primitive that requires participation by a designated set of processes. Note that this definition differs significantly from that offered in the Force, where barriers are used to denote regions of code that are to be executed by a single process once all processes have reached a specified point in the program. Barriers in the Force therefore act more as "monitors," such as those found in the CSP model.

By allowing barriers to be set across any specified set of processes, timing bounds can be placed wherever needed in a program. The set of processes can be as small as one, or as large as the entire "force." Critical regions of code can be specified by selecting a single process to access a shared (or scalar) variable in a statement immediately following a barrier across all processes. Barriers placed within iterative constructs can enforce a "do across," rather than "doall" interpretation of a loop. In other words, all statements within a given iteration of the loop may be required to be completed before another iteration begins.

Producer/consumer synchronization can also be handled with barriers by placing a barrier across participating processes between the producer and consumer of a shared object. Thus, counting semaphores are effectively just barriers placed across a known number of processes.

4.4. Determinism

The desire for deterministic execution in parallel programs is, in large part, due to the unwanted effects of nondeterminism on issues such as debugging and portability between machines. Despite the shortcomings, it is desirable to allow nondeterminism in a parallel programming model, provided that the nondeterminism can be detected statically. Many problems exhibit properties that are nondeterministic, and asynchronous machine models are certainly capable of providing nondeterministic results.

Another key aspect of the unified model, therefore, is to provide a means of controlling nondeterminism without banishing it from the language. Compiler technology is capable of aggressively seeking out potential nondeterminism in programs. With this capability, programmers can be warned of potential deadlocks and race conditions in parallel code. For example, given the code sequence:

if (parallel_expression) {
    *p = a;
} else {
    *q = b;
}

a compiler may conclude that p and q might be pointers to the same object, thus causing a warning of a potential "write-write" race to be generated. The programmer can suppress such warnings by specifying testable, disambiguating assertions. For example, the statement:

assert(p != q);

is currently supported by a number of C compilers and is specified as a library routine in the current ANSI C draft proposal. More friendly syntax and more flexible semantics for such an assertion appears in Refined C [DiK84] .

It is also possible that the race does exist (i.e. p and q hold the same memory address), but the program will produce acceptable results with either value. In such a case, the programmer may lexically specify a scope within which a list of variables may produce results that are not deterministic -- a "nondeterministic" region for the specified variables.

For example, if p and q were pointing at the same location but nondeterministic execution was acceptable, the following syntax could be used:

nondet(*p, *q) {
    if (parallel_expression) {
        *p = a;
    } else {
        *q = b;
    }
}

This sequence would inform both the compiler and anyone reading or debugging the program that nondeterministic accesses may be made to these memory locations. Items listed in a nondet directive are treated as objects, rather than variable names, allowing functions and pointer indirection to be specified as well. Hence, an int variable i would be listed as nondet(i) and a function f would be listed as nondet(f). Listing a function implicitly lists all variables that the function might access.

Synchronization primitives are not tied to the specification of determinism, since doing so would introduce potentially undesired constraints on model of execution. Determinism is treated as a lexical issue, whereas synchronization deals wit the relative timing of interdependent processes.

5. XPC

5.1. Introduction

XPC is an extension of the C programming language that provides explicit specification and management of both synchronous and asynchronous styles of parallelism. The language is primarily intended as a means of supporting the unified model of parallelism discussed in the previous chapter, but will also attempt to meet the following objectives:

  1. Provide a viable target language for parallel programming environments or paradigms such as PARSE [CaD87] .
  2. Specify semantics that do not prevent the compiler from performing analysis deemed necessary for possible "compiler- assisted" parallelization.
  3. Maintain a programming style that is consistent with the conventional C language wherever possible.

It should be noted that many of the underlying characteristics of XPC could be incorporated into an existing parallel language, provided that the semantics of such a language changed accordingly. Thus, the specification of XPC is not the key issue; the model from which the language semantics are derived is the figure of merit.

In many respects, XPC resembles the style of synchronous languages such as Parallel-C and C*, but specifies semantics that are closer to those found in SPMD models such as the Force. The programming model does not restrict the types of architectures for which the language may be targeted, although implementation issues will vary considerably for different execution models. No attempt is made to distinguish between processes and processors; the term "process" is generally used here when discussing language semantics, since "processors" are often associated exclusively with SIMD machines. The language does not specify the mapping of processes to processors, allowing as many virtual processes as needed to be used in an algorithm.

It is important to note that the language is presented here in preliminary form; modifications will almost certainly occur to the grammar over the course of implementation.

5.2. Language Definition

Discussion of XPC is divided into functional sections corresponding to program specification, data layout, control structure, assignments and communication, and determinism and synchronization. A BNF grammar for the language is distributed throughout these sections, with constructs added specifically for XPC italicized for emphasis. The grammar is derived from a "YACC-like" specification in the GNU C compiler (GCC) [Sta89] , ( The parser generator in GCC is referred to as Bison, rather than YACC. ) with terminal symbols appearing in a courier font to distinguish them from non-terminal tokens generated by the parser. Embedded actions have been omitted from the XPC grammar for clarity. Semantics will instead be discussed on an informal basis and will include references to previously discussed language models whenever comparisons are appropriate.

XPC will be implemented with a source code compiler, rather than as a set of preprocessor macro definitions. As a result, the language definition should also specify a lexical analysis routine; for brevity, it may be assumed that a tool such as lex() may be used to generate an appropriate lexical analyzer.

5.2.1. Program Specification

XPC programs are specified in a manner almost identical to conventional C programs. Data and function definitions may include barrier declarations, as well as all standard C data types. The designation of parallel data structures, including integer arrays that may be used to access such structures, is discussed in the following section. The description of barriers is deferred until section 5.2.5, which deals with determinism and synchronization in XPC. The grammar for the top level of program specification in XPC is provided in Listing 5.1.

Preprocessor directives may also be used in XPC programs, with syntax identical to that of standard C specification.

<program> ::=
          /* empty */
        | <extdefs>
<extdefs> ::=
          <extdef>
        | <extdefs> <extdef>
<extdef> ::=
          <fndef>
        | <datadef>
        | ASM '(' <string> ')' ';'
<datadef> ::=
          <notype_initdecls> ';'
        | <declmods> <notype_initdecls> ';'
        | <typed_declspecs> <initdecls> ';'
        | <declmods> ';'
        | <typed_declspecs> ';'
        | <barrier_decls> ';'
        | ';'
<fndef> ::=
          <typed_declspecs> <declarator> <xdecls> <compstmt>
        | <declmods> <notype_declarator> <xdecls> <compstmt>
        | <notype_declarator> <xdecls> <compstmt>
Listing 5.1.  XPC program specification

XPC does not provide language constructs for the creation or termination of processes at the highest level of the program topology. In this respect, XPC follows the lead of the SPMD models described earlier.

5.2.2. Data Layout

As in other parallel languages, data and variables in XPC may be defined as being either local to a single process or distributed across many processes. The distinction between these classifications is left intentionally vague to prevent association with any particular execution model. Thus, on a shared-memory MIMD configuration the classes would be treated as "shared" and "private;" on a SIMD machine the distinction would be between "scalar" data in the control processor and "parallel" data in the processing elements.

A new delimiter has been included in XPC to allow simple and flexible declaration of both classes of variables. Consider the following XPC declarations:

int m[8];
mono int m[|8];
int p[||4][||4][8];
poly int p[||4][||4][8];
The first two declarations are equivalent; both define a mono array of eight integers. The second pair of declara- tions are interchangeable as well, with each defining an array of eight integers in a two-dimensional configuration of sixteen processes. Thus, the delimiter "[||" signifies that the expression immediately following is to be inter- preted as a shape declarator for a poly variable. The "[|" delimiter is included in the language in order to maintain a sense of symmetry. It may be used (but is not required) to declare conventional arrays in XPC. Upon close inspection, it should be apparent that the mono and poly keywords are redundant. The terms are included in the initial version of XPC only as a mnemonic aid to programmers.

The keywords mono and poly are borrowed from C* notation to specify storage allocation classes as described above. These particular terms were chosen because they are not overly restrictive in their meaning, although the terms parallel and scalar from Parallel-C could have just as easily been used. Listings 5.2 and 5.3 provides the BNF grammar specification for XPC declarations.

<xdecls> ::=
          /* empty */
        | <decls>
<decls> ::=
          <decl>
        | <decls> <decl>
<decl> ::=
          <typed_declspecs> <initdecls> ';'
        | <declmods> <notype_initdecls> ';'
        | <typed_declspecs> ';'
        | <declmods> ';'
<typed_declspecs> ::=
          <typespec> <reserved_declspecs>
        | <declmods> <typespec> <reserved_declspecs>
<reserved_declspecs> ::=
          /* empty */
        | <reserved_declspecs> <typespecqual_reserved>
        | <reserved_declspecs> SCSPEC
        | <reserved_declspecs> SACSPEC
<declmods> ::=
          TYPE_QUAL
        | SCSPEC
        | SACSPEC
        | <declmods> TYPE_QUAL
        | <declmods> SCSPEC
        | <declmods> SACSPEC
<typed_typespecs> ::=
          <typespec> <reserved_typespecquals>
        | <nonempty_type_quals> <typespec> <reserved_typespecquals>
<reserved_typespecquals> ::=
          /* empty */
        | <reserved_typespecquals> <typespecqual_reserved>
Listing 5.2.  Declaration format in XPC
<typespec> ::=
          TYPESPEC
        | <structsp>
        | TYPENAME
        | TYPEOF '(' <expr> ')'
        | TYPEOF '(' <typename> ')'
<typespecqual_reserved> ::=
          TYPESPEC
        | TYPE_QUAL
        | <structsp>
<typename> ::=
          <typed_typespecs> <absdcl>
        | <nonempty_type_quals> <absdcl>
Listing 5.2,  continued

Data and variable declarations replicated across processes may be accessed by integer arrays that resemble index sets in Actus or selectors in Parallel-C. Rather than create a new variable type, however, XPC allows arrays to be treated as first-class objects. Thus, any integer array may be used as a "selector" or "index set" for accessing poly variables. The use of such arrays is nearly identical to the use of selectors in Parallel-C.

Consider the following XPC code sequence:

int mask1[4][4], mask2[4][4];
float x[||4][||4];
   ...
x[||mask1 && mask2] = 0;

This code defines two 4 x 4 arrays of integers that may be used to access elements of a poly variable of the same shape. The poly variable x is set equal to zero in the processes selected by the intersection of non-zero elements in mask1 and mask2.

Since arrays are first-class objects in XPC, operations performed on a named array variable are actually carried out on each element of the array. Thus, accessing a poly variable with an integer array is equivalent to selecting the set of processes that correspond to the non-zero elements of the array. In situations where the current extent of parallelism is implied or specified by an encompassing expression or statement, the full extent may be selected as:

x[||#]

where x is defined as a poly variable. In some cases, it may be desirable to specify that any single process may be selected from the current extent. In such cases, the notation:

x[||?]

may be used to designate that a single arbitrary process be selected for evaluation in the subsequent expression or statement. Such a notation is particularly useful when dealing with barriers, to be discussed later.

<declarator> ::=
          <after_type_declarator>
        | <notype_declarator>
<after_type_declarator> ::=
          '(' <after_type_declarator> ')'
        | <after_type_declarator> '(' <parmlist_or_identifiers>  %prec '.'
        | <after_type_declarator> <sdim> <expr> ']'  %prec '.'
        | <after_type_declarator> <sdim> ']'  %prec '.'
        | <after_type_declarator> PDIM <expr> ']'  %prec '.'
        | <after_type_declarator> PDIM ']'  %prec '.'
        | <after_type_declarator> PDIM '?' ']'  %prec '.'
        | <after_type_declarator> PDIM '#' ']'  %prec '.'
        | '*' <type_quals> <after_type_declarator>  %prec UNARY
        | TYPENAME
<parm_declarator> ::=
          <parm_declarator> '(' <parmlist_or_identifiers>  %prec '.'
        | <parm_declarator> <sdim> <expr> ']'  %prec '.'
        | <parm_declarator> <sdim> ']'  %prec '.'
        | <parm_declarator> PDIM <expr> ']'  %prec '.'
        | <parm_declarator> PDIM ']'  %prec '.'
        | <parm_declarator> PDIM '?' ']'  %prec '.'
        | <parm_declarator> PDIM '#' ']'  %prec '.'
        | '*' <type_quals> <parm_declarator>  %prec UNARY
        | TYPENAME
Listing 5.3. Declarators in XPC
<notype_declarator> ::=
          <notype_declarator> '(' <parmlist_or_identifiers>  %prec '.'
        | '(' <notype_declarator> ')'
        | '*' <type_quals> <notype_declarator>  %prec UNARY
        | <notype_declarator> <sdim> <expr> ']'  %prec '.'
        | <notype_declarator> <sdim> ']'  %prec '.'
        | <notype_declarator> PDIM <expr> ']'  %prec '.'
        | <notype_declarator> PDIM ']'  %prec '.'
        | <notype_declarator> PDIM '?' ']'  %prec '.'
        | <notype_declarator> PDIM '#' ']'  %prec '.'
        | IDENTIFIER
<bar_declarator> ::=
          BARRIER IDENTIFIER
        | <bar_declarator> PDIM <expr> ']'  %prec '.'
        | <bar_declarator> PDIM ']'  %prec '.'
        | <bar_declarator> PDIM '?' ']'  %prec '.'
        | <bar_declarator> PDIM '#' ']'  %prec '.'
<structsp> ::=
          STRUCT <identifier> '{' <component_decl_list> '}'
        | STRUCT '{' <component_decl_list> '}'
        | STRUCT <identifier>
        | UNION <identifier> '{' <component_decl_list> '}'
        | UNION '{' <component_decl_list> '}'
        | UNION <identifier>
        | ENUM <identifier> '{' <enumlist> <maybe_comma> '}'
        | ENUM '{' <enumlist> <maybe_comma> '}'
        | ENUM <identifier>
<sdim> ::=
          '['
        | SDIM
<maybe_comma> ::=
          /* empty */
        | ','
<parmlist_or_identifiers> ::=
          <parmlist> ')'
        | <identifiers> ')'
Listing 5.3,  continued
<parmlist_1> ::=
          <parmlist> ')'
<parmlist> ::=
          /* empty */
        | <parms>
        | <parms> ',' ELLIPSIS
<parms> ::=
          <parm>
        | <parms> ',' <parm>
<parm> ::=
          <typed_declspecs> <parm_declarator>
        | <typed_declspecs> <notype_declarator>
        | <typed_declspecs> <absdcl>
        | <declmods> <notype_declarator>
        | <declmods> <absdcl>
Listing 5.3,  continued

Initialization of poly variables follows the same semantics as initialization of aggregate variables in C. In order to simplify the often tedious task of specifying process "masks," a notation for replicating values has been added to the language. This notation may be used for poly variables and mono arrays alike.

For example, the sequence

int sel[||2][||4] = {#4{1}, 0, #3{2}};
sets the first four elements of sel to one, the next element to zero, and the last three elements to the value two. Sim- ilarly, the sequence
int odds[||M] [N]  [N]  = {#M {#(N*N)/2 {0, 1}, #N%2 {0}}};
sets all odd elements of a two-dimensional array equal to one in each of M processes, where M and N are constants specified elsewhere in the program. Note that any sequence can be replicated using this notation, and that the replica- tor may be any expression involving constants or statically initialized variables. Scalar array dimensions always take precedence over parallel dimensions when initializing vari- ables in XPC. All other precedence relations follow conven- tional C semantics for aggregate initialization. The gram- mar for XPC initialization is shown in Listing 5.4.
<initdecls> ::=
          <initdcl>
        | <initdecls> ',' <initdcl>
<notype_initdecls> ::=
          <notype_initdcl>
        | <notype_initdecls> ',' <initdcl>
<barrier_decls> ::=
          <barrier_dcl>
        | <barrier_dcl> ',' <barrier_dcl>
<maybe_asm> ::=
          /* empty */
        | ASM '(' <string> ')'
<initdcl> ::=
          <declarator> <maybe_asm> '=' <init>
        | <declarator> <maybe_asm>
<notype_initdcl> ::=
          <notype_declarator> <maybe_asm> '=' <init>
        | <notype_declarator> <maybe_asm>
<barrier_dcl> ::=
          <bar_declarator> <maybe_asm> '=' <init>
        | <bar_declarator> <maybe_asm>
<init> ::=
          <expr_no_commas>
        | <rep> '{' '}'
        | <rep> '{' <initlist> '}'
        | <rep> '{' <initlist> ',' '}'
<initlist> ::=
          <init>
        | <initlist> ',' <init>
<rep> ::=
          /* empty */
        | <rep> '#' <expr_no_commas>
Listing 5.4.  Initialization in XPC

5.2.3. Control Structure

XPC supports the complete set of C control flow constructs, as illustrated in Listing 5.6. The control flow model used is similar to that found in C*, except that synchronization is not implied within the semantics of individual constructs.

Conditional constructs such as if-then and switch may have either mono or poly expressions specified as the condition. As in other languages, poly (parallel) expressions are evaluated to determine the set of processes that will execute the ensuing statement. Nested conditional statements are handled by means of a run-time stack that keeps track of the active set of processes. The XPC control flow model does not require that nested conditional statements be executed by a monotonically decreasing set of processes, for reasons to be explained later. The semantics of the if-then statement do not impose an order of execution on the if body and the else body. Thus, on a SIMD architecture, conditional constructs are executed as they would in synchronous languages, with execution of the if (or else) statement by a set of processes being followed by execution of the else (or if) statement by the complement of that set. A MIMD computer could execute the if and else statements concurrently, provided that no undesirable side-effects are present between the two statement bodies.

The semantics of the if-then construct are extended to switch statements. Unlike asynchronous models, where only a single case may be executed by a single process, XPC allows any or all cases to be executed concurrently whenever possible. If an ordering of case substatements is required by the model of execution, the ordering is arbitrary with respect to the textual representation in the program. Of course, the use of break statements within a switch statement does impose an ordering of the case substatements, but this ordering need not restrict parallel execution of code sections that are independent.

The key aspect of the XPC control flow model is the separation of process synchronization from the evaluation of control flow. The model is quite similar to the Block Synchronization Rule of C*, with one important difference. Rather than associate an extent of parallelism with a conditional statement, the extent is associated with either a statement or a "basic block" in the flow diagram of the program -- whichever is smaller. In most cases, the two interpretations yield the same results, but the XPC control flow model is more general than that of most synchronous languages. The definition of an XPC statement is illustrated in Listing 5.5.

Many of the problems discussed in Chapter 2 involve the semantics for arbitrary control flow in a parallel environment. The inconsistencies encountered in the handling of switch statements and looping mechanisms could usually be traced to the more generic problem of supporting arbitrary use of goto statements. Most languages specify that the set of active processes may never increase over the course of execution of an iterative or conditional statement. These semantics made the use of goto statements quite difficult when the target of the branch appeared inside the body of a different control structure.

<stmts> ::=
          <stmt>
        | <stmts> <stmt>
<xstmts> ::=
          /* empty */
        | <stmts>
<compstmt> ::=
          <maybe_nondet> '{' '}'
        | <maybe_nondet> '{' <decls> <xstmts> '}'
        | <maybe_nondet> '{' <stmts> '}'
<maybe_nondet> ::=
          /* empty */
        | NONDET '(' <primary>
Listing 5.5.  XPC statements

At least three approaches may be used to adequately address the issue of arbitrary control flow in explicitly parallel languages:

  1. Disallow the use of goto statements. This is the approach used in Actus. This solution would also restrict the use of switch statements to semantics that resemble those of the Pascal case statement.
  2. Prevent execution of any goto statements that branch out of a lexical statement with a given extent of parallelism. This approach is similar to that used in C* and Parallel-C, but would likely result in programs where most goto statements are ignored. Parallel-C and C* do not actually prevent such an occurrence, but the semantics are specified in a way that makes implementation of goto statements difficult. It is interesting to note that Glypnir uses what may best be described as the complement of this approach by specifying that all processors execute the goto statement and resume execution at the destination of the branch.
  3. Support arbitrary use of goto statements by applying extent setting mechanisms to nodes of the control flow graph, rather than to statements or expressions. This approach takes a control flow model such as that described by the Block Synchronization Rule of C* and applies it to the nodes of a flow graph generated by the compiler. In effect, the extent of parallelism may be changed on any arc entering the flow graph of a statement. Thus, a branch label located within a while loop would actually cause the loop statement to be broken up into multiple flow graph nodes. The semantics governing the extent of parallelism within a node would be identical to those specified for statements in C*. For instance, the set of active processes may only monotonically decrease within a "basic block" of the flow graph for a statement. Loop structures that have only a single point of entry are therefore treated in XPC just as they would in Parallel-C or C*. Those that contain a branch label are treated differently in XPC, however, since the active set of processes may change in an arbitrary manner at all points of entry into the loop. Statements that are "smaller" than a basic block (i.e. several statements appear within a basic block) can each be assigned a different extent of parallelism. This is consistent with the semantics of other parallel languages.

While the initial implementation of XPC may opt to support one of the first two approaches, it is hoped that the more general semantics will ultimately be adopted. It is also worth noting that the third approach would enable logical operators such as "&&" and "||" to be applied to expressions containing poly variables of differing extents. This is due to the fact that such operators are actually treated as a series of nested conditional expressions in C.

<simple_if> ::=
          IF '(' <expr> ')' <stmt>
<stmt> ::=
          <compstmt>
        | <expr> ';'
        | <simple_if> ELSE <stmt>
        | <simple_if>
        | WHILE '(' <expr> ')' <stmt>
        | DO <stmt> WHILE '(' <expr> ')' ';'
        | FOR '(' <xexpr> ';' <xexpr> ';' <xexpr> ')' <stmt>
        | SWITCH '(' <expr> ')' <stmt>
        | CASE <expr> ':' <stmt>
        | DEFAULT ':' <stmt>
        | BREAK ';'
        | CONTINUE ';'
        | RETURN ';'
        | RETURN <expr> ';'
        | ASM <maybe_type_qual> '(' <string> ')' ';'
        | ASM <maybe_type_qual> '(' <string> ':' <asm_operands> ')' ';'
        | ASM <maybe_type_qual> '(' <string> ':' <asm_operands> ':'
          <asm_operands> ')' ';'
        | ASM <maybe_type_qual> '(' <string> ':' <asm_operands> ':'
          <asm_operands> ':' <asm_clobbers> ')' ';'
        | GOTO <identifier> ';'
        | <identifier> ':' <stmt>
        | ';'
<maybe_type_qual> ::=
          /* empty */
        | TYPE_QUAL
<xexpr> ::=
          /* empty */
        | <expr>
Listing 5.6.  XPC control constructs

Perhaps the greatest divergence between the suggested XPC control flow model and those described in other languages appears in the semantics of loops and nested conditional statements. Whereas synchronous languages require that the set of active processes monotonically decrease as execution of a loop or nested if-then-else statement proceeds, XPC allows the extent to vary in accordance with the control flow graph of the statement. This viewpoint of control flow is clearly compatible with an asynchronous model of execution where loops may be dynamically scheduled across iterations, but is seemingly incompatible with the synchronous models proposed for SIMD machines. Closer examination, however, reveals that the only significant difference between the XPC model and that of synchronous control flow is the type of operation performed on nested process "masks." Instead of performing an AND operation with the extent of the enclosing statement, the conditional expression is independently evaluated to determine the active set of processes for the subsequent statement body. A runtime stack is still maintained so that the extent of the enclosing statement can be restored upon completion of the nested statement. The situation is analogous to the following conventional C code sequence:

if (a)  {
   x=1;
   if (b) {
     x=2;
     y=1;
   }
   y=2;
}

There is no implied interdependence between conditions a and b in the above code, just as there need not be any implied interdependence between the extent of parallelism in nested blocks in XPC. By associating an extent of parallelism with the actual flow graph nodes, control flow in the language model is not restricted by constraints of a particular architecture. The most visible illustration of this concept is probably the C switch statement, where the thread of control flow may "fall through" several case substatements, each with a unique set of active processes.

5.2.4. Assignments and Communication

Assignment expressions in XPC extend the semantics of conventional C to encompass the shape and extent of parallel variables and data without imposing synchronization constraints. Simply stated, assignments in XPC must take place between objects of the same extent, whether the objects are mono or poly in nature. The semantics of these expressions are described in the previous chapter for each of the four possible combinations of objects on the left and right side of the assignment. Assignments through memory indirection (pointers) are allowed in XPC, with the stipulation that any indirection on an object is assigned a shape and extent identical to that of the object being referenced.

Assignments between poly variables of different shapes are permitted, provided that the variables are of the same extent. In other words, a different subset of processes may be specified on each side of an assignment expression, but the variables must be declared for the same set of processes in order for the assignment to be valid. The BNF for expressions in XPC is shown in Listing 5.7.

<identifier> ::=
          IDENTIFIER
        | TYPENAME
<identifiers> ::=
        IDENTIFIER
        | <identifiers> ',' IDENTIFIER
<unop> ::=
          '&'
        | '-'
        | '+'
        | PLUSPLUS
        | MINUSMINUS
        | '~'
        | '!'
<expr> ::=
        <nonnull_exprlist>
<exprlist> ::=
          /* empty */
        | <nonnull_exprlist>
<nonnull_exprlist> ::=
          <expr_no_commas>
        | <nonnull_exprlist> ',' <expr_no_commas>
Listing 5.7.  XPC expressions
<unary_expr> ::=
          <primary>
        | '*' <cast_expr>   %prec UNARY
        | <unop> <cast_expr>  %prec UNARY
        | SIZEOF <unary_expr>  %prec UNARY
        | SIZEOF '(' <typename> ')'  %prec HYPERUNARY
        | ALIGNOF <unary_expr>  %prec UNARY
        | ALIGNOF '(' <typename> ')'  %prec HYPERUNARY
<cast_expr> ::=
          <unary_expr>
        | '(' <typename> ')' <cast_expr>  %prec UNARY
        | '(' <typename> ')' '{' <initlist> <maybe_comma> '}'  %prec UNARY
<expr_no_commas> ::=
          <cast_expr>
        | <expr_no_commas> '+' <expr_no_commas>
        | <expr_no_commas> '-' <expr_no_commas>
        | <expr_no_commas> '*' <expr_no_commas>
        | <expr_no_commas> '/' <expr_no_commas>
        | <expr_no_commas> '%' <expr_no_commas>
        | <expr_no_commas> LSHIFT <expr_no_commas>
        | <expr_no_commas> RSHIFT <expr_no_commas>
        | <expr_no_commas> ARITHCOMPARE <expr_no_commas>
        | <expr_no_commas> EQCOMPARE <expr_no_commas>
        | <expr_no_commas> '&' <expr_no_commas>
        | <expr_no_commas> '|' <expr_no_commas>
        | <expr_no_commas> '^' <expr_no_commas>
        | <expr_no_commas> ANDAND <expr_no_commas>
        | <expr_no_commas> OROR <expr_no_commas>
        | <expr_no_commas> '?' <xexpr> ':' <expr_no_commas>
        | <expr_no_commas> '=' <expr_no_commas>
        | <expr_no_commas> ASSIGN <expr_no_commas>
Listing 5.7,  continued
<primary> ::=
          IDENTIFIER
        | CONSTANT
        | <string>
        | '(' <expr> ')'
        | '(' <compstmt> ')'
        | <primary> '(' <exprlist> ')'   %prec '.'
        | <primary> <sdim> <expr> ']'   %prec '.'
        | <primary> PDIM <expr> ']'   %prec '.'
        | <primary> PDIM '?' ']'   %prec '.'
        | <primary> PDIM '#' ']'   %prec '.'
        | <primary> '.' <identifier>
        | <primary> POINTSAT <identifier>
        | <primary> PLUSPLUS
        | <primary> MINUSMINUS
Listing 5.7,  continued

The communication requirements underlying assignments are not specified directly by the semantics of the assignment statements, since different systems use different protocols.

5.2.5. Determinism and Synchronization

As described earlier, synchronization is included in parallel programming models only as a means of ensuring deterministic execution. The model used in XPC uncouples synchronization from other aspects of the program structure, resulting in potentially nondeterministic code. This nondeterminism can be detected by the compiler, however, so the language model need only provide a mechanism by which the programmer can "help" the compiler make the right choices when attempting to generate code.

XPC allows nondeterministic execution to be specified through use of the nondet assertion, which may precede any statement body. For example, the code sequence

nondet (x) {
   if (p[||mask])
      x = 5;
   else
      x = 6;
}

indicates that the programmer is aware of the potentially nondeterministic results and that the if and else bodies may be executed concurrently. Note that these bodies could be executed simultaneously even on a SIMD architecture. This type of parallelism (admittedly trivial in this example) is not supported in other synchronous languages. If the nondet assertion was not present in this example, the compiler could detect the nondeterminism and warn the user with an appropriate message.

The nondet assertion allows the programmer to designate any block of statements as being potentially nondeterministic. Such blocks may be parallelized by the compiler despite potential race conditions or hidden aliasing information. This feature should be especially useful when calling subroutines with flow graphs that are not visible to the calling function. Without the nondet feature, the compiler would typically have to make worst case assumptions concerning access to global variables and passed parameters.

Synchronization is provided through the use of barrier variables that are declared and initialized just as poly variables. Some examples of barrier definitions are:

barrier this[||N][||N];
barrier that[||#];

The first definition creates a barrier variable for a two- dimensional array of processes. The second definition defines a barrier for the full extent of processes assigned to the current program block.

The difference between barriers and parallel variables in XPC involves the side-effects of evaluating the variables. Evaluation of a barrier variable as an "lvalue" (i.e. a reference to a memory object, such as the left side of an assignment) causes a a barrier synchronization to effectively take place across the specified set of processes upon completion of the evaluation. A non-zero element in a barrier variable indicates that the corresponding process is to participate in the barrier synchronization. A barrier variable may also appear as an "rvalue" (i.e. a value that does not refer to an object in memory), in which case no barrier synchronization is invoked, but the current "status" of the barrier may be read. Thus, processes may set poly variables or integer arrays equal to the state of a barrier variable without actually invoking the barrier.

Storing into a barrier variable will therefore cause a barrier synchronization to take place in the specified set of processes, while reading a barrier variable will simply copy the state of a barrier into a parallel variable or integer array of the same shape and extent. The use of barrier variables does not infer a particular implementation of such synchronization; rather, it provides information to the compiler regarding the relative timing constraints of a particular algorithm. On a true SIMD machine, for example, barrier synchronization is inherent at the instruction level. Other machines may present a SIMD model of execution, but actually be implemented without true instruction-level synchronization. In such cases, barrier variables may prove invaluable as a means of ensuring fine- grain control of interprocess timing.

Contrary to most barrier implementations, XPC barriers are not necessarily costly to use, since barriers may be specified for any arbitrary set of processes. Unlike the Force, where all processes must reach the "barrier" before any processes may proceed with execution, XPC provides barriers as a general purpose synchronization mechanism. Critical regions may be specified using barriers and mono variables. For instance, the following code segment could be used to allow a single process to update the mono variable s:

mono int s;
poly int p[||64];
barrier first[||64];
  ...
s = p[||first[||?]];

5.2.6. Other Features

Conventional C types and constructs, such as enumerated types, bit fields, and assembler directives, are also supported in XPC. The semantics of such features are identical to those of the standard C language. For the sake of completeness, the grammar for these constructs is provided in Listings 5.8 and 5.9. These topics are not discussed further here.

<asm_operands> ::=
          /* empty */
        | <nonnull_asm_operands>
<nonnull_asm_operands> ::=
          <asm_operand>
        | <nonnull_asm_operands> ',' <asm_operand>
<asm_operand> ::=
          STRING '(' <expr> ')'
<asm_clobbers> ::=
          STRING
        | <asm_clobbers> ',' STRING
Listing 5.8.  Assembler directives

5.3. Summary

The current definition of XPC is intended to illustrate the feasibility of a unified parallel programming model. Semantics have been described for parallel execution of C language constructs, and a BNF grammar has been specified for an initial implementation.

Only three keywords have been added to the C language, two of which (mono and poly) are included only for programming clarity. A pair of delimiters are included to differentiate between mono and poly dimensions. Arrays are treated as first-class objects in XPC, eliminating the need for special index sets when accessing parallel processes. Barriers are treated as special variable types due to the unique side effects of evaluating such variables.

The fundamental characteristics of the XPC programming model involve the separation of synchronization from communication and control flow issues and the inclusion of a mechanism for the allowance of nondeterminism in a program. XPC programs are maintained as single code images, and process allocation is independent of the actual number of processors in the target architecture.

<string> ::=
          STRING
        | <string> STRING
<component_decl_list> ::=
          /* empty */
        | <component_decl_list> <component_decl> ';'
        | <component_decl_list> ';'
<component_decl> ::=
          <typed_typespecs> <components>
        | <nonempty_type_quals> <components>
<components> ::=
          /* empty */
        | <component_declarator>
        | <components> ',' <component_declarator>
<component_declarator> ::=
          <declarator>
        | <declarator> ':' <expr_no_commas>
        | ':' <expr_no_commas>
<enumlist> ::=
          <enumerator>
        | <enumlist> ',' <enumerator>
<enumerator> ::=
          <identifier>
        | <identifier> '=' <expr_no_commas>
<absdcl> ::=
        /* empty */
        | <absdcl1>
<nonempty_type_quals> ::=
          TYPE_QUAL
        | <nonempty_type_quals> TYPE_QUAL
<type_quals> ::=
          /* empty */
        | <type_quals> TYPE_QUAL
Listing 5.9.  Enumerated types and bit fields
<absdcl1> ::=
          '(' <absdcl1> ')'
        | '*' <type_quals> <absdcl1>  %prec UNARY
        | '*' <type_quals>  %prec UNARY
        | <absdcl1> '(' <parmlist_1>  %prec '.'
        | <absdcl1> <sdim> <expr> ']'  %prec '.'
        | <absdcl1> <sdim> ']'  %prec '.'
        | <absdcl1> PDIM <expr> ']'  %prec '.'
        | <absdcl1> PDIM ']'  %prec '.'
        | <absdcl1> PDIM '?' ']'  %prec '.'
        | <absdcl1> PDIM '#' ']'  %prec '.'
        | '(' <parmlist_1>  %prec '.'
        | '[' <expr> ']'  %prec '.'
        | '[' ']'  %prec '.'
        | <sdim> <expr> ']'  %prec '.'
        | <sdim> ']'  %prec '.'
        | PDIM <expr> ']'  %prec '.'
        | PDIM ']'  %prec '.'
        | PDIM '?' ']'  %prec '.'
        | PDIM '#' ']'  %prec '.'
Listing 5.9,  continued

6. SUMMARY AND CONCLUSIONS

The need for a unified programming model becomes increasingly apparent as different parallel architectures continue to proliferate. The traditional approach of specifying languages for a particular class of machines limits both the portability and the applicability of parallel algorithms. Thus far, explicitly parallel languages have shown a tendency to exclusively support either a synchronous or asynchronous model of execution. The exclusion of certain types of parallelism is no longer acceptable in programming models, nor is the need to use a different language for each type of existing architecture.

In this thesis, a number of synchronous and asynchronous languages were examined, allowing a series of observations to be made concerning the types of parallelism supported in each model.

Synchronous languages tend to express parallelism through data layout and algorithm topology. Synchronization is often inherent in the semantics of control flow and communication constructs, despite the fact that such constraints are not always required. Control constructs, in particular, often specify inconsistent semantics that prohibit the use of arbitrary control flow in a program.

Asynchronous languages offer much more flexible control flow, since processes are generally treated as independent sequential entities. The primary difficulty with asynchronous models involves the means of interprocess communication. Models that use a synchronous message passing scheme are often tied directly to a specific machine topology, whereas shared memory models often suffer from inefficient implementations.

From the examination of other languages, a set of characteristics desired in a unified model were identified. These properties include:

  1. Maintaining a single code image, regardless of the underlying architecture.
  2. Removing the distinction between parallel data classifications such as "shared" and "scalar" or "parallel" and "private."
  3. Separating the specification of synchronization from communication and control flow issues.
  4. Providing a mechanism for the specification of nondeterminism in program segments.

As an initial attempt at supporting a unified programming model, a language definition for XPC is provided. A BNF grammar is denoted for the language and a set of semantics for parallel constructs is discussed. Although the language is presented in preliminary form, it introduces a number of new language constructs, including variables for invoking barrier synchronizations, lexical directives for the specification of nondeterminism, and arrays that may be treated as first-class objects.

It remains to be seen whether or not the XPC model can be efficiently implemented on a broad class of systems. The generality of the language semantics make the language more portable, but will also likely make the compiler implementation more difficult. Arguably, another parallel language (such as XPC) is not the solution to the needs of the parallel processing software community. However, the issues raised in the attempt to define a unified model may help to highlight weaknesses in current approaches and eventually lead to a more flexible and expressive programming environment.

LIST OF REFERENCES

[AhC86]
Sudhir Ahuja, Nicholas Carriero, and David Gelernter, "Linda and Friends," Computer, August 1986, pp. 26-34.
[BaB68]
G. H. Barnes, R. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. A. Stokes, "The Illiac IV computer," IEEE Transactions on Computers, Vol. C-17, August 1968, pp. 746-757.
[Bab88]
R. G. Babb II, Programming Parallel Processors, Addison- Wesley, Reading, Massachusetts, 1988.
[CaD87]
T. Casavant, H. Dietz, and P. Sheu, "The PARSE Approach to Programming Non-shared Memory, Reconfiguarable, Parallel Computers," Fourth International Conference on Supercomputing, May 1989.
[DiK84]
H. Dietz and D. Klappholz, "Refining a Conventional Language for Race-Free Specification of Parallel Algorithms," 1984 International Conference on Parallel Processing, August 1984.
[Ell85]
J. R. Ellis, "Bulldog: A compiler for VLIW Architectures," ACM Doctoral Dissertation Award, MIT Press, 1985.
[Fer82]
C. Fernstrom, "Programming Techniques on the LUCAS Associative Array Computer," 1982 International Conference On Parallel Processing, August 1982, pp. 253-261.
[Fly66]
M. J. Flynn, "Very High-Speed Computing Systems," Proceedings of the IEEE, Vol. 54, December 1966, pp. 281-291.
[GeC85]
D. Gelernter, N. Carriero, S. Chandra, and S. Chang, "Parallel Programming in Linda," 1985 International Conference on Parallel Processing, August 1985, pp. 255-263.
[GeR86]
N. H. Gehani and W. D. Roome, "Concurrent C," Software: Practice and Experience, Volume 16, Number 9, 1986, pp. 821-844.
[Hil85]
D. W. Hillis, "The Connection Machine," MIT Press, Cambridge, Massachusetts, 1985.
[Hoa78]
C. A. R. Hoare, "Communicating Sequential Processes," Communications of the ACM, Volume 21, Number 8, August 1978, pp. 666-677.
[Inm86]
Inmos, "Occam Programming Manual," Inmos Limited, March 1986.
[Jor87]
Harry Jordan, "The Force," ECE Technical Report Number 87-1-1, Department of Electrical and Computer Engineering, University of Colorado, January 1987.
[KuS84]
D. J. Kuck, A. H. Sameb, R. Cytron, A. V. Veidenbaum, C. D. Polychronopoulos, G. Lee, T. McDaniel, B. R. Leasure, C. Beckman, J. R. B. Davies, and C. P. Kruskal, "The Effects of Program Restructuring, Algorithm Change, and Architecture Choice on Program Performance," 1984 International Conference on Parallel Processing, August 1984.
[KuS85b]
J. T. Kuehn and H. J. Siegel, "Extensions to the C programming language for SIMD/MIMD parallelism," 1985 International Conference on Parallel Processing, August 1985, pp. 232-235.
[LaL75]
D. H. Lawrie, T. Layman, D. Baer, and J M. Randal, "Glypnir - A Programming Language for Illiac IV," Communications of the ACM, Volume 18, Number 3, March 1975, pp. 157-164.
[Li84]
K-C. Li, "Vector C - a Programming Language for Vector Processing," Ph.D. Dissertation, Department of Computer Science, Purdue University, 1984.
[Lin82]
N. R. Lincoln, "Technology and Design Tradeoffs in the Creation of a Modern Supercomputer," IEEE Transactions on Computers, Vol. 31, 1982, pp. 349-362.
[McS85]
J. R. McGraw, S. K. Skedzielewski, S. Allan, R. Oldehoeft, J. Glauert, C. Kirkham, B. Noyce, and R. Thomas, "SISAL: Streams and Iteration in a Single Assignment Language," Language Reference Manual, Version 1.2, Computing Research Group, Lawrence Livermore National Laboratory, March 1985.
[MeV85]
P. Mehotra and J. Van Rosendale, "The Blaze Language: A Parallel Language for Scientific Programming," NASA Contractor Report, ICASE Report Number 85-29, May 1985.
[PeC85]
R. H. Perrott, D. Crookes, P. Milligan, and W. R. M. Purdy, "A compiler for an array and vector processing language," IEEE Transactions on Software Engineering, Vol. SE-11, May 1985, pp. 471-478.
[Per79]
R. H. Perrott, "A language for array and vector processors," ACM Transactions on Programming Languages and Systems, Vol. 1, October 1979, pp. 177-195.
[RoS87]
John R. Rose and Guy L. Steele Jr., "C* : An Extended C Language for Data Parallel Programming," Second International Conference on Supercomputing, Vol. II, May 1987, pp. 2-16.
[Set89]
R. Sethi, Programming Languages; Concepts and Constructs, Addison-Wesley, Reading, Massachusetts, 1989, pp. 359-374.
[SiS86]
Howard Jay Siegel, Thomas Schwederski, James T. Kuehn, and Nathaniel J. Davis IV, "An Overview of the PASM Parallel Processing System," Tutorial: Computer Architecture, IEEE Computer Society Press, Washington DC, 1986, pp. 387-407.
[Sta89]
Richard Stallman, "Internals of GNU CC," Gnu C Compiler, Version 1.34, Free Software Foundation, 1989.
[Str86]
Bjarne Stroustrup, The C++ Programming Language, Addison- Wesley, Reading, Massachusetts, 1986.

Hypertext Index

ACKNOWLEDGMENTS
ABSTRACT
Notice for HTML Users
1. Introduction
1.1. Parallel Programming Models
1.2. Parallel Execution Models
1.2.1. SIMD
1.2.2. MIMD
1.3. A Unified Model of Parallelism
1.4. Overview of This Document
2. SYNCHRONOUS LANGUAGES
2.1. Synchronous Constructs
2.2. Glypnir
2.2.1. Expression of Parallelism
2.2.2. Conditional Constructs
2.2.3. 3 Iterative Constructs
2.2.4. Assignments and Communication
2.2.5. Summary
2.3. Actus
2.3.1. Expression of Parallelism
2.3.2. Conditional Constructs
2.3.3. Iterative Constructs
2.3.4. Assignments and Communication
2.3.5. Summary
2.4. Parallel-C
2.4.1. Expression of Parallelism
2.4.2. Conditional Constructs
2.4.3. Iterative Constructs
2.4.4. Assignments and Communication
2.4.5. Summary
2.5. C*
2.5.1. Expression of Parallelism
2.5.2. Conditional Constructs
2.5.3. Iterative Constructs
2.5.4. Assignments and Communication
2.5.5. Summary
2.6. Conclusions
3. ASYNCHRONOUS LANGUAGES
3.1. Asynchronous Constructs
3.2. Occam
3.2.1. Specification of Processes
3.2.2. Communication Between Processes
3.2.3. Summary
3.3. Concurrent C
3.3.1. Specification of Processes
3.3.2. Communication Between Processes
3.3.3. Summary
3.4. Linda
3.4.1. Specification of Processes
3.4.2. Communication Between Processes
3.4.3. Summary
3.5. The Force
3.5.1. Specification of Processes
3.5.2. Communication Between Processes
3.5.3. Summary
3.6. 2 Conclusions
4. UNIFICATION OF SYNCHRONOUS AND ASYNCHRONOUS MODELS
4.1. Characteristics of a Unified Model
4.2. Control Structure
4.3. Data Management
4.3.1. Data Layout
4.3.2. Assignment Statements
4.3.3. Synchronization
4.4. Determinism
5. XPC
5.1. Introduction
5.2. Language Definition
5.2.1. Program Specification
5.2.2. Data Layout
5.2.3. Control Structure
5.2.4. Assignments and Communication
5.2.5. Determinism and Synchronization
5.2.6. Other Features
5.3. Summary
6. SUMMARY AND CONCLUSIONS
LIST OF REFERENCES
Hypertext Index