The noticeable industry trend toward very long instruction word (VLIW) processors is not surprising. Earlier programmable DSPs with a small number (two or three) of functional units were very conscious of code size, employing CISC instruction sets that encoded commonly used parallel operations as single instructions (e.g. multiply-add). As the number of functional units integrated into a processor increases, the need for multiple instruction issue becomes critical.
A VLIW approach to multiple instruction issue is favored for two reasons. First, the overhead of the alternative ``super-scalar'' (run-time instruction parallelizing) approach is significant. Second, as Philips puts it, ``defining software compatibility at the source code level'' [7] is not a problem for video signal processors, whose software life cycle more closely resembles that of an embedded controller than of a mainstream microprocessor. To address the attendant expansion of program code, compression of the instruction stream, as used on the Philips TriMedia [8] or the Infinite Reality GE [24], is promising. Roll-your-own CISC!
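The code-size saving from instruction compression can be illustrated with a small sketch. The encoding below (a presence mask plus only the non-NOP operations per VLIW word) is an invented, simplified scheme, not the actual TriMedia format, but it captures the idea: empty issue slots cost no program memory.

```python
# Illustrative VLIW word compression: store a slot-presence mask and
# only the occupied slots; re-expand to full width at fetch time.
# The 5-slot width and the encoding are assumptions for illustration.

NOP = None

def compress(vliw_word):
    """Pack a VLIW word into (mask, ops) with NOP slots elided."""
    mask, ops = 0, []
    for slot, op in enumerate(vliw_word):
        if op is not NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops

def decompress(mask, ops, width=5):
    """Re-expand a compressed word to its full issue width."""
    it = iter(ops)
    return [next(it) if mask & (1 << slot) else NOP
            for slot in range(width)]

word = ["mul r1,r2,r3", NOP, "add r4,r1,r5", NOP, NOP]
packed = compress(word)
assert decompress(*packed) == word
# Only 2 operations (plus a small mask) are stored instead of 5 slots.
```

A sparsely filled word, common in real code, shrinks roughly in proportion to the number of empty slots.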
The advantages and disadvantages of group, or vector, instructions have already been mentioned above. The other noteworthy instruction set architecture feature commonly found on the VSPs is conditional execution of each operation (first seen in the IBM 604, 1952 [34]). This prevents disruption of the instruction fetch pipeline, which is long even without instruction compression. It can be carried too far, however: the instruction-stream overhead of supporting eight different guard registers (on the TMS320C6201) seems exorbitant.
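A minimal sketch of how guarded execution eliminates branches: every operation issues unconditionally, and the guard register only gates write-back. The register names and the program encoding here are illustrative, not the C6201's actual assembly.

```python
# Guarded (predicated) execution sketch: an if/else becomes straight-line
# code with no branch, so the fetch pipeline is never disrupted.

def execute(regs, program):
    """Run (guard, dest, fn, srcs) tuples; write-back only if guard true."""
    for guard, dest, fn, srcs in program:
        result = fn(*(regs[s] for s in srcs))   # always executes
        if guard is None or regs[guard]:        # conditional write-back
            regs[dest] = result

# if (a > b) c = a - b; else c = b - a;   -- with no branch at all:
regs = {"a": 3, "b": 7, "p": 0, "np": 0, "c": 0}
program = [
    (None, "p",  lambda a, b: int(a > b), ("a", "b")),  #       p  = a > b
    (None, "np", lambda p: 1 - p,         ("p",)),      #       np = !p
    ("p",  "c",  lambda a, b: a - b,      ("a", "b")),  # [p]   c  = a - b
    ("np", "c",  lambda b, a: b - a,      ("b", "a")),  # [np]  c  = b - a
]
execute(regs, program)
assert regs["c"] == 4
```

The cost is visible in the last two lines: both arms of the conditional occupy issue slots, which is why excessive guard-register support inflates the instruction stream.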
The specialized co-processors integrated onto the VSPs were varied. The one common co-processor was an MPEG-2 system bit-stream variable length decoder, found on the Philips TriMedia, the Chromatic Mpact, and the Samsung MSP. This reflects the fundamental difficulty of handling a data type that traditional processors aren't prepared to process. Since a stream of bits with variable-length fields is central to most efficient communications channels, a processing element or co-processor for parsing and manipulating it should become ubiquitous. Hopefully programmable architectures for manipulating bit-streams will become more common, supplanting the fixed-protocol architectures presently encountered.
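The awkwardness such a co-processor addresses is easy to see in software: extracting variable-length fields that cross byte boundaries takes many shift-and-mask steps per symbol on a conventional datapath. A minimal bit-reader sketch (the prefix-code table below is made up, not MPEG-2's):

```python
# Bit-stream parsing sketch: read variable-length fields, MSB first,
# across byte boundaries -- the operation a VLD co-processor hardwires.

class BitReader:
    def __init__(self, data):
        self.data, self.pos = data, 0   # position in bits

    def read(self, n):
        """Read n bits, MSB first, crossing byte boundaries as needed."""
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

# Decode a toy prefix (Huffman-style) code: 0 -> A, 10 -> B, 11 -> C.
def decode_symbol(br):
    return "A" if br.read(1) == 0 else ("B" if br.read(1) == 0 else "C")

br = BitReader(bytes([0b01011000]))     # fields: 0 | 10 | 11 | padding
assert [decode_symbol(br) for _ in range(3)] == ["A", "B", "C"]
```

Note that each decoded symbol's length determines where the next one starts, which is what makes the parse inherently serial and branch-heavy on a general-purpose machine.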
Other co-processors present on surveyed VSPs were:
The amount of memory and memory bandwidth required by a high performance video or graphics system is a problem. The Rambus solution addresses the bandwidth quite well through the use of advanced signalling techniques, but doesn't reduce the amount of memory required. Compressing all data before communication or storage, as proposed by Talisman, reduces both the size and bandwidth requirements. The tradeoff is the amount of processing power/area required (27% of the total in the case of Escalante). As the relative cost of processing/memory decreases, the compression approach should become more common. And one can always attempt to reduce memory requirements by changing the overall algorithm.
Table 9: Escalante area by functional unit

| Unit | Area (G) | Relative Area |
|---|---|---|
| Memory Interface | 0.22 | 4% |
| Compression/Decompression | 1.4 | 27% |
| Clip & Scan Convert | 0.76 | 15% |
| Texture Map & Composite | 1.3 | 26% |
| Display Generation | 1.0 | 20% |
| Interblock Routing | 0.47 | 8% |
| Total Usable Area | 5.1 | 100% |
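As a quick consistency check on Table 9, the relative areas can be reproduced from the absolute ones. The published figures are rounded, so the percentages only agree to within a point or two:

```python
# Cross-check of Table 9: each block's relative area should roughly
# equal its absolute area divided by the total usable area.

areas = {
    "Memory Interface":          0.22,
    "Compression/Decompression": 1.4,
    "Clip & Scan Convert":       0.76,
    "Texture Map & Composite":   1.3,
    "Display Generation":        1.0,
    "Interblock Routing":        0.47,
}
stated = {  # the "Relative Area" column, in percent
    "Memory Interface": 4, "Compression/Decompression": 27,
    "Clip & Scan Convert": 15, "Texture Map & Composite": 26,
    "Display Generation": 20, "Interblock Routing": 8,
}
total = sum(areas.values())                 # 5.15, quoted as 5.1
assert abs(total - 5.15) < 1e-9
for unit, area in areas.items():
    # rounding in the published areas leaves ~1-point discrepancies
    assert abs(100 * area / total - stated[unit]) < 2.0
```

The stated percentages sum to exactly 100, so the residual discrepancies come from rounding the individual area figures.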
The disappearance of the frame buffer in the Talisman architecture is
notable, but not new. Many earlier graphics architectures have used
display lists to generate the images one or more scan lines at a time,
either in order to support fast window or object (sprite)
manipulations, or to eliminate the need for frame buffer memory. The
viability of its replacement, the Image Layer Compositor (ILC), hinges
on the ready availability of multi-billion devices at
consumer prices, and on two other architectural aspects of Talisman:
Nonetheless, decompressing the image data, then performing bilinear
filtering and alpha blending of the image layers while
generating the display, exhibits a level of sophistication in the ILC
that is nearly unique. The cost (in silicon area) is substantial:
the percentage shown for Display Generation in Table
9 doesn't include the 2.4 Mbits of compositing
buffer required. All told, the area cost is roughly comparable to
the 3.3 G required for an equivalent frame buffer, but the
performance is vastly superior.
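The per-pixel work this implies for the ILC can be sketched as a bilinear sample of the (decompressed) layer followed by an "over" composite onto the output. The layer-to-screen mapping is reduced to nothing here for brevity; the real ILC applies a full affine warp per layer.

```python
# Sketch of the ILC's per-pixel datapath: bilinear filter, then
# premultiplied-alpha "over" compositing. Images are lists of rows
# of (r, g, b, a) tuples with components in [0, 1] -- an assumed,
# simplified representation.

def bilinear(img, x, y):
    """Sample img at real-valued (x, y), clamping at the edges."""
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0
    def px(i, j):
        return img[min(j, len(img) - 1)][min(i, len(img[0]) - 1)]
    return tuple(
        (1 - fx) * (1 - fy) * px(x0, y0)[c] + fx * (1 - fy) * px(x0 + 1, y0)[c]
        + (1 - fx) * fy * px(x0, y0 + 1)[c] + fx * fy * px(x0 + 1, y0 + 1)[c]
        for c in range(4))

def over(src, dst):
    """Composite premultiplied-alpha src over dst."""
    a = src[3]
    return tuple(src[c] + (1 - a) * dst[c] for c in range(4))

# Opaque red layer sampled halfway between two pixels, over black:
layer = [[(1.0, 0, 0, 1.0), (0.5, 0, 0, 1.0)]]
sample = bilinear(layer, 0.5, 0.0)          # red = (1.0 + 0.5) / 2
out = over(sample, (0, 0, 0, 1.0))
assert abs(out[0] - 0.75) < 1e-9
```

Doing this for every visible layer at every output pixel, at video rates, is what makes the ILC's area budget (and the omitted compositing buffer) so large.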
While there is no explicit frame buffer in the Talisman system, the memory
containing the image layer data (output from the polygon rendering stage)
does decouple the display generation from the image rendering.
This point isn't developed further in the architecture, as its
designers accepted a relatively low target video output resolution
(1344x1024 at 75fps), attainable with a single-chip solution.
Decoupling allows the display resolution to be scaled spatially by
simply scaling the ILC rather than the entire system. Since image
layer composition is easily parallelizable (the data having already
been partitioned into virtual buffers), multiple ILCs could be
employed. A redesign to distribute the memory among all the system
chips, instead of concentrating it all on the rendering engine as in
Escalante, would be necessary. I expect to see more systems taking
this approach, as the cost of processing silicon (relative to the cost
of memory silicon and bandwidth) drops.
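The parallelization argument can be made concrete with a sketch: because the layer data is already split into independent virtual buffers, disjoint strips of scan lines can be composited by separate ILCs with no shared state. The interleaved strip assignment and single-value "pixels" below are invented for illustration.

```python
# Multiple-ILC sketch: each worker composites only its own scan lines,
# then the strips are merged into the frame. Layers map scan line ->
# (intensity, alpha); a real system would carry full RGBA spans.

from concurrent.futures import ThreadPoolExecutor

def composite_strip(strip_rows, layers):
    """One ILC's share of the work: back-to-front 'over' per line."""
    out = {}
    for y in strip_rows:
        acc = 0.0
        for layer in layers:
            pix, alpha = layer[y]
            acc = pix * alpha + acc * (1 - alpha)
        out[y] = acc
    return out

height, n_ilcs = 8, 2
layers = [{y: (1.0, 0.5) for y in range(height)}]     # one 50% gray layer
strips = [range(i, height, n_ilcs) for i in range(n_ilcs)]

with ThreadPoolExecutor(n_ilcs) as pool:
    results = pool.map(composite_strip, strips, [layers] * n_ilcs)

frame = {}
for strip in results:
    frame.update(strip)
assert len(frame) == height and all(v == 0.5 for v in frame.values())
```

Because no strip reads another strip's output, display resolution scales with the number of ILCs, provided the distributed-memory redesign gives each one adequate layer bandwidth.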