
Obtaining Parallelism

A traditional computer program (regardless of the high-level language used) is written in an imperative manner: one or more sequences of instructions, with occasional conditional branches, are defined by the programmer and executed sequentially by a single processor. In such a program specification it is difficult to separate the dependencies (limits on parallelism) introduced by the programmer from those intrinsic to the algorithm.
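
To make the distinction concrete, the C sketch below (the function names are illustrative, not taken from any particular system) sums an array two ways: the imperative loop serializes every addition through a single accumulator, a dependency introduced by the program specification, while the divide-and-conquer version exposes the parallelism intrinsic to the reduction.

    /* Illustrative only: the imperative loop serializes the reduction
     * through the single accumulator `sum', a dependency imposed by the
     * specification rather than by the algorithm.
     */
    #include <stddef.h>

    double sum_sequential(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];        /* each iteration waits on the previous one */
        return sum;
    }

    /* The same computation written to expose the intrinsic parallelism:
     * the two recursive calls share no data and could run on separate
     * functional units or processors.  (Floating-point rounding aside,
     * the results agree.)
     */
    double sum_tree(const double *a, size_t n)
    {
        if (n == 0) return 0.0;
        if (n == 1) return a[0];
        size_t half = n / 2;
        double left  = sum_tree(a, half);             /* independent of ... */
        double right = sum_tree(a + half, n - half);  /* ... this call      */
        return left + right;
    }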

Still, a small amount of parallelism (2-4x) is typically obtained in a manner transparent to the programmer, by scheduling adjacent instructions in the instruction sequence to execute in parallel. This may be done either statically at compile time (a Very Long Instruction Word, or VLIW, approach) or dynamically at runtime (``superscalar'').
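
A minimal sketch of what this instruction-level parallelism looks like at the source level, assuming nothing more than a generic C compiler: the four multiply-adds in the loop body touch disjoint array elements, so a VLIW compiler can pack them into one wide instruction word, or a superscalar core can issue several of them in the same cycle, with no involvement from the programmer.

    /* Sketch of instruction-level parallelism in a basic block.  The four
     * multiply-adds touch disjoint elements of `x' and `y', carry no
     * dependencies on one another, and can be issued together by a VLIW
     * or superscalar processor.  (Tail elements when n is not a multiple
     * of 4 are ignored for brevity.)
     */
    void saxpy_unrolled(float *y, const float *x, float a, int n)
    {
        for (int i = 0; i + 3 < n; i += 4) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
    }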

Parallelism may also be obtained at other levels through several mechanisms, such as parallel threads or processes, usually targeting a particular computer system architecture and granularity of parallelism (typically coarse-grained). The partitioning of an algorithm into a relatively fixed number of separate threads must currently be done before compile time by the application programmer.
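
As an illustration of this coarse-grained style, the sketch below partitions a summation across a fixed number of threads; POSIX threads are assumed here only as a representative thread package, and the thread count and partitioning are chosen by the programmer before compilation, exactly the decision described above.

    /* Coarse-grained parallelism with a fixed, programmer-chosen number
     * of threads (POSIX threads used as a representative package).
     */
    #include <pthread.h>
    #include <stddef.h>

    #define NTHREADS 4              /* partitioning fixed before compilation */
    #define N        (1 << 20)

    static double data[N];
    static double partial[NTHREADS];
    static size_t ids[NTHREADS];

    static void *worker(void *arg)
    {
        size_t id    = *(size_t *)arg;
        size_t chunk = N / NTHREADS;
        size_t lo    = id * chunk;
        size_t hi    = (id == NTHREADS - 1) ? N : lo + chunk;
        double s     = 0.0;

        for (size_t i = lo; i < hi; i++)    /* one coarse-grained unit of work */
            s += data[i];
        partial[id] = s;
        return NULL;
    }

    double parallel_sum(void)
    {
        pthread_t tid[NTHREADS];
        double total = 0.0;

        for (size_t t = 0; t < NTHREADS; t++) {
            ids[t] = t;
            pthread_create(&tid[t], NULL, worker, &ids[t]);
        }
        for (size_t t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += partial[t];
        }
        return total;
    }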

While a fine-grained dataflow execution model would theoretically expose the greatest amount of parallelism in an application, the overhead of synchronizing every instruction has proven costly. A compromise seemingly well suited to today's processors is hybrid dataflow [16], in which the dataflow elements used for synchronization and scheduling represent short sequences of instructions that access global variables only at the start and end of the sequence. The instruction sequences are then scheduled and executed in a dataflow manner, amortizing the cost of synchronizing each sequence over the number of instructions it contains.
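
The toy scheduler below is only a sketch of this synchronization idea, not the hybrid dataflow design of [16]: each node carries a join counter and a short body that reads its operands at entry and delivers its result at exit, so the cost of synchronization is paid once per sequence rather than once per instruction.

    /* Sketch of sequence-level dataflow synchronization: a node fires once
     * all of its operands have arrived; its body then runs as an ordinary
     * sequential instruction stream with no further synchronization.
     */
    #include <stdio.h>

    #define MAX_INPUTS 2

    struct node {
        int   pending;                    /* inputs still outstanding         */
        double in[MAX_INPUTS];            /* operands, filled in by producers */
        double (*body)(const double *);   /* the short instruction sequence   */
        struct node *consumer;            /* where the result is delivered    */
        int   consumer_slot;
    };

    static void fire(struct node *n);

    /* Deliver a result to a consumer; fire it once all operands are present. */
    static void deliver(struct node *n, int slot, double value)
    {
        if (!n) { printf("result = %g\n", value); return; }
        n->in[slot] = value;
        if (--n->pending == 0)
            fire(n);
    }

    static void fire(struct node *n)
    {
        /* Synchronization happens only here, once per sequence. */
        double out = n->body(n->in);
        deliver(n->consumer, n->consumer_slot, out);
    }

    static double add(const double *in) { return in[0] + in[1]; }
    static double mul(const double *in) { return in[0] * in[1]; }

    int main(void)
    {
        /* Computes (a + b) * c with two dataflow-scheduled sequences. */
        struct node times = { 2, {0, 0}, mul, NULL,   0 };
        struct node plus  = { 2, {0, 0}, add, &times, 0 };

        deliver(&plus,  0, 3.0);   /* a */
        deliver(&plus,  1, 4.0);   /* b */
        deliver(&times, 1, 5.0);   /* c */
        return 0;
    }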


