(This material is not based on the material in PLoP'94.)
Also Known As
Pipes and Filters, Dataflow
Process a stream of information by feeding it through a network of
independent and reusable processing units.
You want to apply a series of operations to one or more streams of data.
For example, filtering an image, analyzing an oscilloscope signal,
summarizing a movie, or formatting simple text files.
The amount of data is unbounded.
Operations may consume unpredictable amounts of time or data
before producing output.
Intermediate results may need to be used in multiple places.
Give data sources and data operators the same interface. Use the Decorator
pattern to construct a dataflow network where operators decorate the
sources of their data.
Each data stream is indexed by time, which may be multidimensional.
Computation is initiated by requesting a time range from a node of the
network. If the node is a data source, this instructs it
to return the block of data corresponding to the given time range. If the
node is an operator, this instructs it to fetch data from its source and
return the processed output corresponding to the given time range.
This is the "pull" solution. See below for the "push" solution.
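The pull solution can be sketched in a few lines. This is only an illustration under stated assumptions: the method name `get`, the classes `Counter`, `Scale`, and `Decimate`, and the use of a one-dimensional integer time index are all hypothetical and do not come from any particular library.

```python
from abc import ABC, abstractmethod


class Node(ABC):
    """Common interface shared by sources and operators (hypothetical name)."""

    @abstractmethod
    def get(self, start: int, stop: int) -> list:
        """Return the block of data for the time range [start, stop)."""


class Counter(Node):
    """A data source: sample t of the stream is simply t."""

    def get(self, start, stop):
        return list(range(start, stop))


class Scale(Node):
    """An operator that decorates its source and multiplies each sample."""

    def __init__(self, source: Node, factor):
        self.source = source
        self.factor = factor

    def get(self, start, stop):
        # Fetch the same time range from the decorated source, then process it.
        return [x * self.factor for x in self.source.get(start, stop)]


class Decimate(Node):
    """An operator that contracts time: output t is input t * step."""

    def __init__(self, source: Node, step: int):
        self.source = source
        self.step = step

    def get(self, start, stop):
        if stop <= start:
            return []
        # Request a wider input range than the output range we were asked for.
        block = self.source.get(start * self.step, (stop - 1) * self.step + 1)
        return block[:: self.step]


network = Decimate(Scale(Counter(), 10), 2)
print(network.get(0, 4))  # [0, 20, 40, 60]
```

Because every node exposes the same `get`, the caller cannot tell whether it holds a source or a stack of operators, which is the point of using Decorator here.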
The network can process an unbounded amount of data, one block at a time.
Operators can expand or contract time, thereby processing different amounts
of data for the same request.
Multiple streams can be merged by synchronizing on the time index.
Time indexing eliminates unnecessary computation. For example, if the
interleaving of two streams is later downsampled by two, only one of the
two streams is needed, so only that one is computed.
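The interleave-then-downsample case can be made concrete with a small sketch. It assumes that a request carries an explicit sequence of integer time indices rather than a contiguous range, so that an operator can see exactly which samples are wanted; that request format, and the names `Source`, `Interleave`, and `Downsample`, are assumptions made for illustration.

```python
class Source:
    """A source that counts how many samples it was asked to produce."""

    def __init__(self, name):
        self.name = name
        self.computed = 0

    def get(self, times):
        # `times` is any sequence of integer time indices, e.g. a range.
        self.computed += len(times)
        return [f"{self.name}{t}" for t in times]


class Interleave:
    """Output sample 2k comes from `even`, sample 2k + 1 from `odd`."""

    def __init__(self, even, odd):
        self.even = even
        self.odd = odd

    def get(self, times):
        even_times = [t // 2 for t in times if t % 2 == 0]
        odd_times = [t // 2 for t in times if t % 2 == 1]
        # Only ask an input for data when some requested index needs it.
        even_data = iter(self.even.get(even_times) if even_times else [])
        odd_data = iter(self.odd.get(odd_times) if odd_times else [])
        return [next(even_data) if t % 2 == 0 else next(odd_data) for t in times]


class Downsample:
    """Output sample t is input sample t * factor."""

    def __init__(self, source, factor):
        self.source = source
        self.factor = factor

    def get(self, times):
        return self.source.get([t * self.factor for t in times])


a, b = Source("a"), Source("b")
network = Downsample(Interleave(a, b), 2)
print(network.get(range(4)))   # ['a0', 'a1', 'a2', 'a3']
print(a.computed, b.computed)  # 4 0
```

The downsampler only ever asks for even indices, so the interleaver never touches stream `b`.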
Intermediate results can be cached to allow efficient fan-out.
Nodes are loosely coupled. They are oblivious to the structure of the
network except for their immediate inputs. Nodes can take advantage of
Loops in the network are not allowed unless there are delay nodes which
If there is no fan-out, i.e. the network is a chain or collapsing tree,
then time indexing can be avoided. Computation is initiated by calling
Next(count) to get the next count units of data.
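A chain without time indexing might look like the following sketch. The lowercase method `next(count)` stands in for Next(count); the classes `Lines`, `Upper`, and `DropBlank`, and the convention that a short or empty block signals the end of the data, are assumptions made for this example.

```python
class Lines:
    """A source over a finite list; real sources may be unbounded."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def next(self, count):
        block = self._items[self._pos:self._pos + count]
        self._pos += len(block)
        return block  # shorter than `count` (or empty) once data runs out


class Upper:
    """A one-to-one operator: one input item per output item."""

    def __init__(self, source):
        self.source = source

    def next(self, count):
        return [s.upper() for s in self.source.next(count)]


class DropBlank:
    """An operator that consumes an unpredictable amount of input per output."""

    def __init__(self, source):
        self.source = source

    def next(self, count):
        out = []
        while len(out) < count:
            block = self.source.next(count - len(out))
            if not block:
                break  # source exhausted
            out.extend(s for s in block if s.strip())
        return out


chain = DropBlank(Upper(Lines(["a", "", "b", " ", "c", "d"])))
print(chain.next(2))  # ['A', 'B']
print(chain.next(5))  # ['C', 'D']
```

Each node keeps its own position implicitly, which is why this form only works when every node has a single consumer.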
Operations can be lazy or eager. A lazy operation does not
compute anything until its output is requested. An eager operation constantly
computes new output, putting it in a buffer, until the buffer space is
exhausted. Then it waits for someone to request the data and empty the
buffer. Eager operations are useful when operations can run in parallel
and their outputs are usually needed.
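The difference between lazy and eager nodes can be sketched with a bounded buffer and a background thread. This is a minimal illustration, not a prescribed design: the class names, the one-item `next()` interface, and the use of `None` as an end-of-stream marker are assumptions. It uses only the standard `queue` and `threading` modules.

```python
import queue
import threading


class Squares:
    """A lazy node: nothing is computed until next() is called."""

    def __init__(self, limit):
        self._n = 0
        self._limit = limit

    def next(self):
        if self._n >= self._limit:
            return None  # end of stream (a convention assumed for this sketch)
        self._n += 1
        return (self._n - 1) ** 2


class Eager:
    """Wraps a node and computes ahead in a background thread."""

    def __init__(self, source, buffer_size=4):
        self._buffer = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._fill, args=(source,), daemon=True
        )
        self._thread.start()

    def _fill(self, source):
        while True:
            item = source.next()
            # put() blocks while the buffer is full, so the worker waits
            # until a consumer takes something out.
            self._buffer.put(item)
            if item is None:
                return

    def next(self):
        return self._buffer.get()


node = Eager(Squares(6), buffer_size=2)
results = []
while (value := node.next()) is not None:
    results.append(value)
print(results)  # [0, 1, 4, 9, 16, 25]
```

In this sketch, calling `next()` again after the end marker has been consumed would block forever; a fuller version would need to remember that the stream has ended.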
Operators can submit new requests to their source while processing the old
data, providing opportunities for parallelism in the network. Nodes can
reject requests if they are too busy already, thus allowing the parallelism
to automatically adapt to the length of the network.
Nodes can cache their output in case it is requested again, e.g. due to
fan-out. Operators can flush their caches at will, but sources may need to
keep data around forever. Caches can be separate nodes.
If nodes can mutate, thereby changing the output for a particular time
index, then there must be a way to invalidate caches downstream. A node
which caches its output can use the Observer
pattern to accomplish this while maintaining decoupling.
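One way to combine a separate cache node with Observer-style invalidation is sketched below. The names `attach`, `notify_changed`, `invalidate`, `Gain`, and `Cache` are hypothetical, and the sketch takes the simplest policy of discarding the whole cache on any change rather than tracking which time indices were affected.

```python
class Node:
    """Base class that lets downstream caches observe upstream changes."""

    def __init__(self):
        self._observers = []

    def attach(self, observer):
        self._observers.append(observer)

    def notify_changed(self):
        for observer in self._observers:
            observer.invalidate()


class Gain(Node):
    """A mutable node: sample t is t * factor."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.calls = 0

    def set_factor(self, factor):
        self.factor = factor
        self.notify_changed()  # outputs for every time index have changed

    def get(self, start, stop):
        self.calls += 1
        return [t * self.factor for t in range(start, stop)]


class Cache(Node):
    """A separate cache node keyed by the requested time range."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        self._blocks = {}
        source.attach(self)  # the source only knows it has "an observer"

    def invalidate(self):
        self._blocks.clear()
        self.notify_changed()  # propagate to caches further downstream

    def get(self, start, stop):
        key = (start, stop)
        if key not in self._blocks:
            self._blocks[key] = self.source.get(start, stop)
        return self._blocks[key]


gain = Gain(2)
cached = Cache(gain)
print(cached.get(0, 3), cached.get(0, 3), gain.calls)  # [0, 2, 4] [0, 2, 4] 1
gain.set_factor(3)
print(cached.get(0, 3), gain.calls)                    # [0, 3, 6] 2
```

The mutating node never refers to `Cache` by type; it only calls `invalidate()` on whatever has attached itself, which keeps the nodes decoupled.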
Some of the optimizations used in the Interpreter pattern are automatic with Streams.
For example, dependencies are explicitly reflected in the structure of the
network. All nodes can run in parallel (though perhaps on different time
slices). Dead values are never computed by lazy nodes since they are never
asked for. Effect caching and partial evaluation can still be done,
however. For example, when an identical block of data is input to a node,
the output will also be identical. Unfortunately, such matches seem rare.
A different solution to the Streams problem is sometimes used, where what
comes out of the network is simply determined by what goes in. This "push"
solution is useful for interactive visualization. It is also used in
Instead of using Decorators, each node points to the nodes which
follow it in the network. Operators fetch data from their input
queue, process it, and send the result off to the input queues of the nodes
which need it. This approach relies on eager computation and therefore is
not appropriate for some applications, such as accessing a database. For
general-purpose programming, though, it is preferred since it allows loops
and procedure calls, i.e. using the same sub-network in multiple places.
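The push arrangement can be sketched as follows. The class `PushNode`, its `connect`, `send`, and `step` methods, and the single-threaded `run` scheduler are all assumptions chosen to keep the example small; a real push network would typically run its nodes concurrently.

```python
from collections import deque


class PushNode:
    """A node with an input queue and links to the nodes that follow it."""

    def __init__(self, func=None):
        self.func = func
        self.inbox = deque()
        self.successors = []

    def connect(self, node):
        self.successors.append(node)
        return node

    def send(self, item):
        self.inbox.append(item)

    def step(self):
        """Process one queued item, if any; return True if work was done."""
        if not self.inbox:
            return False
        result = self.process(self.inbox.popleft())
        for node in self.successors:  # fan-out: every successor gets the result
            node.send(result)
        return True

    def process(self, item):
        return self.func(item)


class Sink(PushNode):
    """Collects whatever reaches the end of the network."""

    def __init__(self):
        super().__init__()
        self.received = []

    def process(self, item):
        self.received.append(item)
        return item


def run(nodes):
    # A naive single-threaded scheduler: keep stepping until all queues drain.
    while any([node.step() for node in nodes]):
        pass


double = PushNode(lambda x: 2 * x)
plus_one = PushNode(lambda x: x + 1)
minus_one = PushNode(lambda x: x - 1)
sink = Sink()
double.connect(plus_one).connect(sink)
double.connect(minus_one).connect(sink)

for x in (1, 2, 3):
    double.send(x)
run([double, plus_one, minus_one, sink])
print(sorted(sink.received))  # [1, 3, 3, 5, 5, 7]
```

Here the links point forward, from producer to consumers, so what comes out is determined entirely by what was sent in.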
The Unix shell uses eager parallel processes to evaluate a pipeline
command. There is no fan-out so no time indexing is used. The interface
between processes is exceedingly simple: characters go from stdin to
stdout. Processes are so independent
that they can be written in any programming language. Unix manages the
buffers and suspends processes automatically, making the buffering
transparent. So it is equally fair to call this a push or a pull
solution.
MIT Scheme includes a special kind of list, called a stream, which is only
evaluated on demand. Functions on streams can be composed to form a
dataflow network as in the Streams pattern. There are lots of examples in
Structure and Interpretation of Computer Programs.
extends this idea by evaluating all expressions on demand.
The ImageVision library uses the Streams pattern for processing images. It
uses 3-D indexing (two space dimensions plus time for movies). Operations
are lazy and cached.
Opcode's Max, IRIS Explorer, and Khoros use the
Streams pattern to represent arbitrary dataflow programs for visualization
and multimedia. They use the alternative solution based on push rather
than pull.
Dataflow processors, such as MIT's Monsoon, use a parallel
machine language based on the Streams pattern (push variant). The MIT
course on Dataflow
Languages and Architectures discusses these in detail.
Streams uses the Streams pattern to compose musical patterns.
Last modified: Fri Sep 02 17:23:11 GMT 2005