The noticeable industry trend toward very long instruction word (VLIW) processors is not surprising. Earlier programmable DSPs with a small number (two or three) of functional units were very conscious of code size, employing CISC instruction sets that encoded commonly used parallel operations as single instructions (e.g. multiply-add). As the number of functional units integrated into a processor increases, the need for multiple instruction issue becomes critical.
A VLIW approach to multiple instruction issue is favored for two reasons. First, the overhead of the alternative ``super-scalar'' (run-time instruction parallelizing) approach is significant. Second, as Philips puts it, ``defining software compatibility at the source code level'' [7] is not a problem for video signal processors, whose software life cycle more closely resembles that of an embedded controller than of a mainstream microprocessor. To address the attendant expansion of program code, compression of the instruction stream, as used on the Philips TriMedia [8] or the Infinite Reality GE [24], is promising. Roll-your-own CISC!
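The code-size saving from instruction compression can be illustrated with a small sketch. The encoding below (a presence mask plus only the non-NOP operations per VLIW word) is an invented, simplified scheme, not the actual TriMedia format, but it captures the idea: empty issue slots cost no program memory.

```python
# Illustrative VLIW word compression: store a slot-presence mask and
# only the occupied slots; re-expand to full width at fetch time.
# The 5-slot width and the encoding are assumptions for illustration.

NOP = None

def compress(vliw_word):
    """Pack a VLIW word into (mask, ops) with NOP slots elided."""
    mask, ops = 0, []
    for slot, op in enumerate(vliw_word):
        if op is not NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops

def decompress(mask, ops, width=5):
    """Re-expand a compressed word to its full issue width."""
    it = iter(ops)
    return [next(it) if mask & (1 << slot) else NOP
            for slot in range(width)]

word = ["mul r1,r2,r3", NOP, "add r4,r1,r5", NOP, NOP]
packed = compress(word)
assert decompress(*packed) == word
# Only 2 operations (plus a small mask) are stored instead of 5 slots.
```

A sparsely filled word, common in real code, shrinks roughly in proportion to the number of empty slots.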
The advantages and disadvantages of group, or vector, instructions have already been mentioned above. The other noteworthy instruction set architecture feature commonly found on the VSPs is conditional execution of each operation (first seen in the IBM 604, 1952 [34]). This prevents disruption of the instruction fetch pipeline, which is long even without instruction compression. It can be carried too far, however: the instruction-stream overhead of supporting eight different guard registers (on the TMS320C6201) seems exorbitant.
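A minimal sketch of how guarded execution eliminates branches: every operation issues unconditionally, and the guard register only gates write-back. The register names and the program encoding here are illustrative, not the C6201's actual assembly.

```python
# Guarded (predicated) execution sketch: an if/else becomes straight-line
# code with no branch, so the fetch pipeline is never disrupted.

def execute(regs, program):
    """Run (guard, dest, fn, srcs) tuples; write-back only if guard true."""
    for guard, dest, fn, srcs in program:
        result = fn(*(regs[s] for s in srcs))   # always executes
        if guard is None or regs[guard]:        # conditional write-back
            regs[dest] = result

# if (a > b) c = a - b; else c = b - a;   -- with no branch at all:
regs = {"a": 3, "b": 7, "p": 0, "np": 0, "c": 0}
program = [
    (None, "p",  lambda a, b: int(a > b), ("a", "b")),  #       p  = a > b
    (None, "np", lambda p: 1 - p,         ("p",)),      #       np = !p
    ("p",  "c",  lambda a, b: a - b,      ("a", "b")),  # [p]   c  = a - b
    ("np", "c",  lambda b, a: b - a,      ("b", "a")),  # [np]  c  = b - a
]
execute(regs, program)
assert regs["c"] == 4
```

The cost is visible in the last two lines: both arms of the conditional occupy issue slots, which is why excessive guard-register support inflates the instruction stream.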
The specialized co-processors integrated onto the VSPs were varied. The one common co-processor was an MPEG-2 system bit-stream variable length decoder, found on the Philips TriMedia, the Chromatic Mpact, and the Samsung MSP. This reflects the fundamental difficulty of handling a data type that traditional processors aren't prepared to process. Since a stream of bits with variable-length fields is central to most efficient communications channels, a processing element or co-processor for parsing and manipulating it should become ubiquitous. Hopefully programmable architectures for manipulating bit-streams will become more common, supplanting the fixed-protocol architectures presently encountered.
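The awkwardness such a co-processor addresses is easy to see in software: extracting variable-length fields that cross byte boundaries takes many shift-and-mask steps per symbol on a conventional datapath. A minimal bit-reader sketch (the prefix-code table below is made up, not MPEG-2's):

```python
# Bit-stream parsing sketch: read variable-length fields, MSB first,
# across byte boundaries -- the operation a VLD co-processor hardwires.

class BitReader:
    def __init__(self, data):
        self.data, self.pos = data, 0   # position in bits

    def read(self, n):
        """Read n bits, MSB first, crossing byte boundaries as needed."""
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

# Decode a toy prefix (Huffman-style) code: 0 -> A, 10 -> B, 11 -> C.
def decode_symbol(br):
    return "A" if br.read(1) == 0 else ("B" if br.read(1) == 0 else "C")

br = BitReader(bytes([0b01011000]))     # fields: 0 | 10 | 11 | padding
assert [decode_symbol(br) for _ in range(3)] == ["A", "B", "C"]
```

Note that each decoded symbol's length determines where the next one starts, which is what makes the parse inherently serial and branch-heavy on a general-purpose machine.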
Other co-processors present on surveyed VSPs were:
The amount of memory and memory bandwidth required by a high performance video or graphics system is a problem. The Rambus solution addresses the bandwidth quite well through the use of advanced signalling techniques, but doesn't reduce the amount of memory required. Compressing all data before communication or storage, as proposed by Talisman, reduces both the size and bandwidth requirements. The tradeoff is the amount of processing power/area required (27% of the total in the case of Escalante). As the relative cost of processing/memory decreases, the compression approach should become more common. And one can always attempt to reduce memory requirements by changing the overall algorithm.
Table 9: Escalante area by functional unit

| Unit | Area (G) | Relative Area |
|---|---|---|
| Memory Interface | 0.22 | 4% |
| Compression/Decompression | 1.4 | 27% |
| Clip & Scan Convert | 0.76 | 15% |
| Texture Map & Composite | 1.3 | 26% |
| Display Generation | 1.0 | 20% |
| Interblock Routing | 0.47 | 8% |
| Total Usable Area | 5.1 | 100% |
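As a quick consistency check on Table 9, the relative areas can be reproduced from the absolute ones. The published figures are rounded, so the percentages only agree to within a point or two:

```python
# Cross-check of Table 9: each block's relative area should roughly
# equal its absolute area divided by the total usable area.

areas = {
    "Memory Interface":          0.22,
    "Compression/Decompression": 1.4,
    "Clip & Scan Convert":       0.76,
    "Texture Map & Composite":   1.3,
    "Display Generation":        1.0,
    "Interblock Routing":        0.47,
}
stated = {  # the "Relative Area" column, in percent
    "Memory Interface": 4, "Compression/Decompression": 27,
    "Clip & Scan Convert": 15, "Texture Map & Composite": 26,
    "Display Generation": 20, "Interblock Routing": 8,
}
total = sum(areas.values())                 # 5.15, quoted as 5.1
assert abs(total - 5.15) < 1e-9
for unit, area in areas.items():
    # rounding in the published areas leaves ~1-point discrepancies
    assert abs(100 * area / total - stated[unit]) < 2.0
```

The stated percentages sum to exactly 100, so the residual discrepancies come from rounding the individual area figures.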
The disappearance of the frame buffer in the Talisman architecture is
notable, but not new. Many earlier graphics architectures have used
display lists to generate the images one or more scan lines at a time,
either in order to support fast window or object (sprite)
manipulations, or to eliminate the need for frame buffer memory. The
viability of its replacement, the Image Layer Compositor (ILC), hinges
on the ready availability of multi-billion devices at
consumer prices, and on two other architectural aspects of Talisman:
Nonetheless, decompressing the image data, then performing bilinear
filtering and alpha blending of the image layers while
generating the display, exhibits a level of sophistication in the ILC
that is nearly unique. The cost (in silicon area) is substantial:
the percentage shown for Display Generation in Table
9 doesn't include the 2.4 Mbits of compositing
buffer required. All told, the area cost is roughly comparable to
the 3.3 G required for an equivalent frame buffer, but the
performance is vastly superior.
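The per-pixel work this implies for the ILC can be sketched as a bilinear sample of the (decompressed) layer followed by an "over" composite onto the output. The layer-to-screen mapping is reduced to nothing here for brevity; the real ILC applies a full affine warp per layer.

```python
# Sketch of the ILC's per-pixel datapath: bilinear filter, then
# premultiplied-alpha "over" compositing. Images are lists of rows
# of (r, g, b, a) tuples with components in [0, 1] -- an assumed,
# simplified representation.

def bilinear(img, x, y):
    """Sample img at real-valued (x, y), clamping at the edges."""
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0
    def px(i, j):
        return img[min(j, len(img) - 1)][min(i, len(img[0]) - 1)]
    return tuple(
        (1 - fx) * (1 - fy) * px(x0, y0)[c] + fx * (1 - fy) * px(x0 + 1, y0)[c]
        + (1 - fx) * fy * px(x0, y0 + 1)[c] + fx * fy * px(x0 + 1, y0 + 1)[c]
        for c in range(4))

def over(src, dst):
    """Composite premultiplied-alpha src over dst."""
    a = src[3]
    return tuple(src[c] + (1 - a) * dst[c] for c in range(4))

# Opaque red layer sampled halfway between two pixels, over black:
layer = [[(1.0, 0, 0, 1.0), (0.5, 0, 0, 1.0)]]
sample = bilinear(layer, 0.5, 0.0)          # red = (1.0 + 0.5) / 2
out = over(sample, (0, 0, 0, 1.0))
assert abs(out[0] - 0.75) < 1e-9
```

Doing this for every visible layer at every output pixel, at video rates, is what makes the ILC's area budget (and the omitted compositing buffer) so large.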
While there is no explicit frame buffer in the Talisman system, the memory
containing the image layer data (output from the polygon rendering stage)
does decouple the display generation from the image rendering.
This point isn't developed further in the architecture, as its
designers accepted a relatively low target video output resolution
(1344x1024 at 75fps), attainable with a single-chip solution.
Decoupling allows the display resolution to be scaled spatially by
simply scaling the ILC rather than the entire system. Since image
layer composition is easily parallelizable (the data having already
been partitioned into virtual buffers), multiple ILCs could be
employed. A redesign to distribute the memory among all the system
chips, instead of concentrating it all on the rendering engine as in
Escalante, would be necessary. I expect to see more systems taking
this approach, as the cost of processing silicon (relative to the cost
of memory silicon and bandwidth) drops.
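The parallelization argument can be made concrete with a sketch: because the layer data is already split into independent virtual buffers, disjoint strips of scan lines can be composited by separate ILCs with no shared state. The interleaved strip assignment and single-value "pixels" below are invented for illustration.

```python
# Multiple-ILC sketch: each worker composites only its own scan lines,
# then the strips are merged into the frame. Layers map scan line ->
# (intensity, alpha); a real system would carry full RGBA spans.

from concurrent.futures import ThreadPoolExecutor

def composite_strip(strip_rows, layers):
    """One ILC's share of the work: back-to-front 'over' per line."""
    out = {}
    for y in strip_rows:
        acc = 0.0
        for layer in layers:
            pix, alpha = layer[y]
            acc = pix * alpha + acc * (1 - alpha)
        out[y] = acc
    return out

height, n_ilcs = 8, 2
layers = [{y: (1.0, 0.5) for y in range(height)}]     # one 50% gray layer
strips = [range(i, height, n_ilcs) for i in range(n_ilcs)]

with ThreadPoolExecutor(n_ilcs) as pool:
    results = pool.map(composite_strip, strips, [layers] * n_ilcs)

frame = {}
for strip in results:
    frame.update(strip)
assert len(frame) == height and all(v == 0.5 for v in frame.values())
```

Because no strip reads another strip's output, display resolution scales with the number of ILCs, provided the distributed-memory redesign gives each one adequate layer bandwidth.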