Genetic Scheduling of Router Operations for the MasPar MP-1 and MP-2

H. G. Dietz and T. M. Chung

Parallel Processing Laboratory
School of Electrical Engineering
Purdue University
West Lafayette, IN 47907-1285
hankd@ecn.purdue.edu
(317) 494 3357

Abstract

The MasPar MP-1 and MP-2 are massively-parallel SIMD supercomputers. Unlike many SIMD machines, both provide several high-performance communication mechanisms, including a global routing mechanism. The global router network is a multi-stage circuit-switching network which, in a 16,384 processing element machine, provides 1,024 routed data paths. Hardware within the network detects routing conflicts and selects a conflict-free set of connections for each router cycle. This mechanism achieves the theoretical minimum 16 router cycles for most permutations, but some permutations take many more cycles.

For example, a bit reversal permutation takes 70 router cycles when scheduled by hardware, but can be reduced to just 23 if statically scheduled. This paper presents a general-purpose router "compiler*U that can statically schedule arbitrary communications using a genetic search technique to improve upon the default hardware scheduling.

This work was supported in part by the Office of Naval Research (ONR) under grant number N00014-91-J-4013 and by the National Science Foundation (NSF) under award number 9015696-CDA.

Keywords:

Interconnection networks, Compilers, Genetic search algorithms, MasPar MP-1/MP-2, Static scheduling.

Notice for HTML Users

The HTML version of this paper differs from the standard postscript version (click here for the 531K compressed postscript version) in several ways, including a variety of minor formatting changes. The standard postscript version was formatted by groff -mgs as a single file. In contrast, this HTML document was generated using a script that replaces figures with in-line gif images that are hypertext linked to full-resolution postscript versions; the (nK .ps.Z) at the upper right of each in-line image provides both the size of the compressed postscript file and the hypertext link. There are also a variety of hypertext links inserted for navigation within the document, including a Hypertext Index at the end of this document (click here to jump to the Hypertext Index).

1. Introduction

Before defining the specific focus of this paper, it is useful to briefly overview the structure of the MasPar [Bla90] , as shown in Figure 1.

Image1 not displayable as  text. ( 9K .ps.Z )

There are 16,384 PEs (Processing Elements) in a fully configured MasPar MP-1 or MP-2. Each PE has an ALU, a large register file, and either 16K or 64K bytes of memory. The system is microcoded so that each PE appears to be a RISC processor supporting operations on 1, 8, 16, 32, and 64-bit data. Communication between PEs is supported by direct connections from each PE to its 8 neighbors in the 128 by 128 array of PEs -- but there is also a global router multistage interconnection network, so that each PE can directly communicate with any other PE. In addition, the array has access through the router and a shared I/O memory buffer to a variety of external I/O devices ranging from HiPPI interfaces to RAID storage.

Controlling the PE array is the ACU (Array Control Unit), a high-performance 32-bit RISC processor which also can perform scalar computations. The ACU is, in turn, controlled by the front-end machine, either an R3000-based DEC 5000 running Ultrix or a DEC Alpha running OSF. To the front-end machine, the MasPar is essentially seen as a peripheral. Hence, the front-end serves primarily as a user interface -- editing, compiling, debugging, etc., are supported by the front-end.

In this paper, we are concerned with only one aspect of the machine: the global router network. Although MasPar's MPL language [MCC91] provides a router[] construct that appears to perform an arbitrary communication pattern as a single operation, this is not how the communication actually occurs. There are only 1,024 data paths through which data can be sent, so any permutation communication that involves all 16,384 processing elements must take at least 16 "router cycles."

The assembly code generated for a typical MPL router[] operation is thus a tight loop that looks something like:

Image2 not displayable as text. ( 3K .ps.Z )

This code functions by first making the tflag a copy of the enable status of each processing element. Next, it attempts to open the router connections (using ropen) that were requested by processing elements with a tflag value of 1. Because there are more processing elements than there are router data paths, and there can be contention in the three switching stages, some connections are made, but others might fail. The rfetchc instruction then sends data through the connections that were made, closes the connections, and clears tflag on each processor that had a connection. The global or and branch instructions (gor1 and bne) simply cause the router operations to be repeated until all processing elements have a tflag value of 0, indicating that all communications have been completed.

This router[] code sequence takes about 100 clock ticks to execute each loop iteration on an MP-1. Of these 100 ticks, the ropen instruction takes 42 ticks and the rfetchc takes 17 ticks plus one tick for each data bit transmitted (32 for rfetchc32), for a total of 91 ticks. The looping overhead is clearly negligible, as is the updating of the variable __routerCount, which is incremented once for each time around the loop and can be examined by the MPL program to aid in evaluating performance. The number of times this loop is executed for a particular router operation is called the number of router cycles used by that operation, and counting router cycles is an accurate measure of communication performance.

As described above, the MasPar's router network hardware selects conflict-free communications for each router cycle as part of the ropen operation. The question is then, how wisely does the hardware select the conflict-free set so as to minimize the total number of router cycles required?

The surprising answer is that, for many communication patterns, the hardware does remarkably well at selecting which set of conflict-free communications should be executed in each router cycle. For example, given any permutation in which all processing elements communicate with processing elements whose IDs differ by a constant, the router hardware always yields the minimum possible 16 router cycles. However, there are many communication patterns for which the router hardware yields a schedule with as few as just 16 of the 1,024 data paths in the router network in use during some router cycles. It was this observation that led to the creation of the genetic scheduler presented in this paper.

The genetic scheduler presented here simply determines which set of conflict-free communications should be executed in each router cycle, and then replaces the usual single router[] loop with a series of communication phases that enforce a statically-determined ordering. Each processing element only attempts to use the router during the phase which matches its router phase number. Although determining a static phase ordering is expensive, implementing the ordering is very inexpensive. It is not uncommon for complex communication patterns (e.g., bit-reversal or shuffle permutations) to use less than half as many router cycles with phase ordering as would be required if the MasPar hardware alone was used to determine the schedule.

2. The Genetic Search

Because the MasPar MP-1 and MP-2 router hardware algorithm is not publically available, the only method available for determining how the hardware will schedule a communication pattern is to actually execute the pattern using the hardware.

Lacking any information that could be used to prune the search space for an ideal schedule, the search space is incredibly huge. Suppose that we know that a 23-cycle schedule exists for a particular permutation communication pattern involving all 16,384 processing elements. It is trivially shown that the number of possible schedules is then 2316,384. Clearly, an exhaustive search is not possible. A simulated annealing approach was proposed for a similar problem involving the Thinking Machines CM [Dah90] . However, our search space is very "bumpy," so most search methods would fail to find globally optimal (or even very good) schedules. These are precisely the charateristics for which genetic search algorithms perform best [Hol75] [HoA94] .

To map the router scheduling problem into a genetic search, we must define a method for representing a schedule, a fitness measure to drive natural selection, a method by which schedules can be combined by crossover, and a method by which individual schedules can be mutated.

2.1. The Schedule Representation

The representation of a router schedule within the genetic search is a hybrid of phase and cycle assignments for each processing element's communication operation. Every processing element's communication is assigned a phase, and each phase corresponds to a sequence of router cycles scheduled by the router hardware. Thus, once our genetic algorithm has assigned a set of communication operations to a particular phase, the router hardware algorithm will determine how many cycles that phase will require to execute.

For example, consider a bit-reversal permutation (the same example case used in Section 3). The hardware schedule for this communication is represented within our search by grouping the 16,384 processing elements into 70 cycles within a single phase. The number of processing elements within each cycle is:

Phase 0: 16 32 48 64 80 96 112 128 144 160 176
      192 208 240 272 304 304 304 304 304
      304 304 304 304 304 304 320 336 352
      336 320 304 304 304 304 304 304 304
      304 320 336 352 336 320 304 304 304
      304 304 304 304 304 304 304 304 272
      240 208 192 176 160 144 128 112 96 80
      64 48 32 16

The best schedule in generation 0 of the genetic search consists of 6 phases containing a total of 57 cycles:

Phase 0: 672 480 384 272 160 32
Phase 1: 688 656 288 256 256 256 272 288 208
         128 80 32
Phase 2: 416 288 287 223 192 129 95 113 78 48
         17 17 1
Phase 3: 432 544 464 432 512 400 288 224 80
Phase 4: 752 544 336 32
Phase 5: 624 400 400 224 160 48
Phase 6: 544 656 512 304 80 48 32

The final schedule (selected in generation 440) consists of 23 phases, each containing a single cycle:

Phase 0: 847
Phase 1: 832
Phase 2: 817
Phase 3: 816
Phase 4: 800
Phase 5: 782
Phase 6: 766
Phase 7: 751
Phase 8: 736
Phase 9: 721
Phase 10: 720
Phase 11: 704
Phase 12: 690
Phase 13: 687
Phase 14: 672
Phase 15: 672
Phase 16: 640
Phase 17: 638
Phase 18: 641
Phase 19: 623
Phase 20: 611
Phase 21: 610
Phase 22: 608

2.2. The Fitness Measure

In execution, each new phase requires the processing elements to compare their phase number to the current phase, so there is a small overhead for initiating a phase that can take multiple cycles. This overhead is about 5 clock ticks (or about 5% of one router cycle). However, if each phase contains only one cycle, there is no need for the usual router loop, and the overhead of the loop is replaced by the phase initiation overhead. In fact, the overhead for the single-cycle phases is even less than for the standard router loop. This overhead can be even further reduced by replacing the incrementing of __routerCount at each cycle with a single addition of the total number of cycles at the end of the routing sequence. The resulting code is about 5% faster to execute each router cycle than if the standard loop were used.

Given this, the obvious measure of a schedule's fitness reasonably accurately reflects the true performance: the negative of the total number of router cycles is the fitness measure. The closer to zero cycles, the better the performance. For generality, the program also allows for an additional overhead to be accumulated for each multi-cycle phase.

We measure the number of cycles in each phase of a schedule by using the MasPar's router hardware to determine for each processing element, within its assigned phase, in which cycle it will be allowed to communicate.

2.3. Crossover

The problem of combining two schedules to produce a new schedule exhibiting features from both parents can be solved in many ways. Our approach is very unusual, but it is simple, efficient, and effective. Our approach is based on the concepts of exploding a schedule and tupling a schedule.

To explode a schedule, we simply assign each cycle within each phase a unique new phase number. Because the router hardware selected these cycles by partitioning the communications within an original phase, we know that each of the resulting phases will execute in a single cycle.

Tupling is the inverse transformation of exploding. In this process, a schedule with many phases is repackaged by combining pairs of phases into single phases recursively until the desired number of phases remains. To enhance performance, the tupling process is biased to prefer merging relatively "empty" phases into other phases. Notice, however, that merging two phases has great and ponderous impact on how router cycles will be assigned when the router hardware attempts to schedule cycles within the resulting phase.

Given these basic operations, which incidentally execute with substantial internal parallelism on a MasPar MP-1 or MP-2, the crossover algorithm written in MPL is:

crossover(mom, dad, baby)
register int mom, dad, baby;
{
  /* First, explode mom and dad and determine how many
     PEs are enabled for each phase
  */
  plural register int emom = explode(mom);
  plural register int cmom = howmany(emom);
  register int mompmax = reduceMax16(emom);
  plural register int edad = explode(dad);
  plural register int cdad = howmany(edad);
  register int dadpmax = reduceMax16(edad);
  plural register int ebaby;
  /* Create exploded version of baby by, for each PE, taking the
     mom or dad phase that contained the most other PEs. So
     that mom and dad phases are not confused, we add a bias of
     mompmax to phase values taken from dad
  */
  ebaby = ((cmom > cdad) ? emom : (edad + mompmax));
  /* Now we have a little problem... we have an exploded version
     of the baby that not only may have twice as many phases as
     either parent, but could also have empty phase numbers.
     Resolve this by sorting (renumbering) the phases into
     descending order and then remapping (tupling) the resulting
     phases into a set of phases not larger than the smaller of
     mom and dad...
  */
  phase[baby] =
    tuple(sort(ebaby),
          (((mompmax < dadpmax) ? mompmax : dadpmax) -
           (rand() & 1)));
  /* Baby phases are selected, complete the other info */
  cycle[baby] = conflict(phase[baby]);
  merit[baby] = eval(baby);
}

Thus, the new (baby) schedule is created by combining portions of the most effectively scheduled cycles within both parent schedules (mom and dad). Although we tried a number of other methods for crossover, none came close to the efficiency of this method. Conceptually this is because the hardware mechanism for scheduling cannot be "turned-off" -- any crossover mechanism that is not sensitive to the hardware-generated groupings tends to generate schedules that the hardware will fragment into many cycles.

2.4. Mutation

As for crossover, there are many different ways to mutate a schedule. However, the method we found to be most effective is again based on the concepts of exploding and re-tupling. The MPL code is:

plural int
mutate(s)
register int s;
{
  /* Return a mutated version of schedule s created by exploding
     p and tupling the result to a randomly selected number of
     phases between 1 and the number of phases in the exploded
     version
  */
  plural register int ep = explode(s);
  return( tuple( sort(ep), ((rand() % (reduceMax16(ep) + 1)) + 1)));
}

This mutation essentially generates a new schedule by regrouping the cycles of a schedule to form a new schedule with no more phases than the schedule from which it was mutated.

2.5. The Complete Genetic Search

Combining the above to form the complete genetic search is a relatively straightforward matter. First, an initial population is synthesized from the hardware-generated schedule:

  1. Create an inital schedule by placing all communications in a single phase. This generates the hardware schedule.
  2. Create the rest of the first generation's population by creating POP-1 mutations of the initial schedule.
  3. Evaluate the merit (fitness) of each schedule.

The genetic search then proceeds as follows:

  1. Determine the best, worst, and average fitness of the population.
  2. If the best is more fit than the best found thus far, record this schedule as the new best found thus far.
  3. If the best schedule's fitness is perfect (i.e., already takes the theoretical minimum number of cycles possible based solely on the number of processing elements involved in the communication divided by the total number of router data paths), exit.
  4. If the maximum allowed number of generations has been reached, exit.
  5. Create the next generation's population by:
    1. Stochastically selecting the SURVIVE most fit schedules to survive from this generation to the next.
    2. Randomly pairing surviving schedules to create CHILD new schedules by crossover. Each new schedule is evaluated for fitness as it is created.
    3. Randomly selecting POP-(SURVIVE+CHILD) individual surviving or child schedules to mutate to form the remainder of the new population. Each new schedule is evaluated for fitness as it is created.
  6. Go to step 1.

It is perhaps surprising that, although the above algorithm appears to be completely serial, it actually makes good use of the MasPar's parallelism. This is because all of the operations on a schedule are done simultaneously on all 16,384 elements of the schedule. Because of this parallelism, and because the search converges quickly, typical time to find a good schedule is on the order of a few minutes.

3. Performance

The performance of the genetic scheduler described in this paper can be measured in two dimensions:

These issues are addressed in the following sections.

3.1. Performance of the Search

Because any genetic search algorithm will have stochastic properties, it is difficult to characterize the performance of the search. In order to get a better understanding of the typical performance and how widely performance varies, we focussed on changing a variety of parameters while using just two communication patterns:

Because bit-reversal seems more typical of truly randomly- selected permutations, we emphasize it in this paper's presentation.

Perhaps the best way to understand the performance of the genetic algorithm is to examine the profiles of the schedules generated for a typical attempt to schedule the bit-reversal permutation. Such a profile is presented in Figure 2.

The genetic search profiled in Figure 2 was conducted using a population that consisted of 256 survivors, 128 children, and 128 mutants -- a total of just 512 schedules. From the figure, it is clear that at least one of the 512 schedules of the very first generation yielded a significant improvement -- an 18.6% reduction. The figure also clearly shows that the number of router cycles drops quickly, reaching just 23 cycles by generation 440. Further, the total number of new schedules considered up to generation 440 is only 512 + (439 * 256), since each generation after the first contains only 256 new schedules. This is just 112,896 schedules -- a rather small fraction of the 2316,384 schedules in the search space.

440 generations worth of schedules is also relatively quick to search. Including the time to generate copious tracing output and regular dumps of the phase assignments for each processing element, Purdue's MasPar MP-1 still manages to create and examine roughly one new schedule every 0.02 seconds. Thus, the complete 440 generation search described above takes about half an hour. Actually, we let the above search run for 9,999 generations, but it did not find a better schedule after generation 440.

Image3 not displayable as  text. ( 10K .ps.Z )

We also ran a multitude of different combinations of population size and fractions of survivors, children, and mutants. However, we found that small changes in these values and ratios rarely made a significant difference in the progress of the search. It was noted, however, that making the number of survivors too small relative to the number of children tended to degrade the ability of the search to move past suboptimal solutions, and even caused a slow degeneration in the quality of the best schedule in the population.

It seems logical that changing the population make-up in direct response to changes in the average fitness, and range of fitness within a generation, could yield some improvement in the search. For example, one might increase the survival rate when many schedules are fit, or increase the mutation rate when too little genetic variety (a small range of fitness values) occurs. However, several adaptive methods for changing the population make-up as the search progressed were tested, and the fixed population make-up did as well or better.

3.2. Performance of the Resulting Schedule

To measure the performance of the genetically engineered schedule against that which would be created by the MasPar's hardware, a number of tests were performed.

An obvious test involved submitting a permutation to the genetic scheduler that the hardware already would schedule optimally. As expected, the genetic scheduler did not harm the schedule. In fact, it simply recognized that the theoretical minimum bound had been achieved and exited after generation 0.

A better test is to directly investigate how a schedule generated by the genetic search differs from one generated by the hardware. Figure 2 provides some insight into the qualities of the generated schedules. For example, it can be observed that the hardware creates schedules that seem to have a basic symmetry; the number of processing elements that are active in each consecutive cycle first increases and then symmetrically decreases. In contrast, there is little or no symmetry in the schedules created by genetic search, even to the point of an individual processing element being alone in a cycle. What is not obvious from this figure is that the symmetry can hurt performance because it implies that a regular communication pattern can "beat" against the regular pattern of the hardware's scheduling algorithm.

To better understand this, compare Figures 3 and 4. Both figures use gray shades to inidicate in which cycle each of the processing elements within the MasPar's 128x128 array will be connected according to a bit-reversal permutation. Darker colors are used for higher cycle numbers within each figure. Figure 3 shows the highly regular pattern of the hardware schedule; Figure 4 shows a fairly regular, but far more complex, pattern.

Image4 not displayable as  text. ( 126K .ps.Z )

Figure 3: Hardware Assignment of Router Cycles

Image5 not displayable as  text. ( 117K .ps.Z )

Figure 4: Genetic Assignment of Router Phases

Figures 5 and 6 provide similar images, but in these figures the gray shade for each processing element represents the fraction of the 1,024 router network data paths which are active at the time that this processing element performs its communication. Darker shades represent higher utilization. Aside from the fact that Figure 6 is uniformly darker than Figure 5 because the genetic schedule achieves higher utilization of the network, there is a far greater variation in the harwdare schedule's utilization -- and the regions of poor utilization follow a regular pattern. In effect, the "broken symmetry" of the schedule that was generated by genetic search allows it to pair together communications that the hardware would not have considered pairing because they were not part of a symmetric set of operations that could be scheduled together.

Image6 not displayable as  text. ( 110K .ps.Z )

Figure 5: Utilization Density Using Hardware Assignment

Image7 not displayable as  text. ( 115K .ps.Z )

Figure 6: Utilization Density Using Genetic Assignment

A more technically precise way to state this observation is that the router hardware resolves conflicts incrementally at each stage. When there is a conflict within a particular stage, the hardware does not know which communications would be conflict-free for the remaining stage(s). Thus, the hardware uses a fixed pattern for selecting which of the conflicting communications at each stage will be allowed to continue to the next stage. If this fixed pattern does not "line-up" with the appropriate groups of communications, the resulting schedule will contain extra cycles. In contrast, the genetic search essentially seeks these groups by using the complete network to define truly conflict-free sets of communications.

For these reasons, we are not surprised that the test permutations generally achieve a factor of 2X or better speedup over the hardware scheduling.

However, all of the above discussion has ignored communication patterns that are not permutations. There are two ways in which a communication pattern can fail to be a permutation:

Although we have not yet benchmarked the performance with non-permutation communication patterns, both of the above observations actually yield greater performance improvements for our technique when the communication is not a permutation.

Clearly, disabled processing elements do not add complexity to the search or to the solution. Further, only one of the processing elements involved in each race needs to be successfully routed; thus, we can select the processing element that best fit the schedule as the winner of the race.

For example, suppose every processing element wants to store into PE i's variable x. The MasPar's hardware will actually use 16,384 router cycles to accomplish this communication. However, it is trivial to recognize that PE i can store into its variable x without performing any communications at all, and the other communications are trivially eliminated because PE i was selected as the winner of the race. Likewise, if several different processing elements all wish to send to PE i while others are sending elsewhere, we can simply select as winner in the PE i race whichever of the processing elements was able to be scheduled in the cycle with the largest number of active processing elements.

We hope to better package the genetic scheduling tool and to test it with non-permutations in the near future.

4. Conclusion

This paper has presented a simple method for generating faster router schedules for permutation communications to be executed on a MasPar MP-1 or MP-2. The method is novel application of a genetic search algorithm to scheduling router cycles. It is unusual in that the router hardware is used to help determine how crossover and mutation operations should be performed, and it also is unusual in the way it uses parallelism to speed-up the genetic search.

Using this technique to schedule communication patterns that the hardware cannot optimally schedule typically yields a performance increases of 2X or more. For example, shuffle achieves a 2X speed-up, and bit-reversal achieves over 3X. These improved schedules are generated quickly, the genetic search generally yields a good schedule in less than a minute, and a nearly-optimal schedule in perhaps half an hour.

Although the approach as presented is specific to the MasPar hardware, the same principles should be applicable to statically scheduling any communication pattern for a wide range of network structures. For example, the extra-stage network in PASM [SiN87] could be statically scheduled in this way. The key network property required is that the network be synchronously controllable, so that communications assigned to distinct phases are known to be independent for the analysis (i.e., earlier communications do not alter the network's properties for later communications).

References

[Bla90]
T. Blank, "The MasPar MP-1 Architecture," 35th IEEE Computer Society International Conference (COMPCON), pp. 20-24, February 1990.
[Dah90]
Dahl, Thinking Machines internal memo, 1990.
[Hol75]
J. Holland, Adaptation in Natural and Artificial Systems, PhD thesis, University of Michigan, Ann Arbor, MI, 1975.
[HoA94]
E. Hou, N. Ansari, and H. Ren, "Genetic algorithm for multiprocessor scheduling," IEEE Transactions on Parallel and Distributed Systems, 5(2), pp. 113-120, February 1994.
[MCC91]
MasPar Computer Corporation, MasPar Programming Language (ANSI C compatible MPL) Reference Manual, Software Version 2.2, Document Number 9302-0001, Sunnyvale, California, November 1991.
[SiN87]
T. Schwederski, W. G. Nation, H. J. Siegel, and D. G. Meyer, "The Implementation of the PASM Prototype Control Hierarchy," Proc. of Second Int'l Conf. on Supercomputing, pp. I 418-427, 1987.

Hypertext Index

Abstract
Keywords:
Notice for HTML Users
1. Introduction
2. The Genetic Search
2.1. The Schedule Representation
2.2. The Fitness Measure
2.3. Crossover
2.4. Mutation
2.5. The Complete Genetic Search
3. Performance
3.1. Performance of the Search
3.2. Performance of the Resulting Schedule
4. Conclusion
References
Hypertext Index