previous up next index
Previous: Obtaining Parallelism Up: Problem Description Next: Specialized Processors

Memory Latency

Based on experience with current prototype systems, a system for processing structured video requires on the order of 3 to 20 frames of memory. A single device targeted for broadcast resolution will therefore require from 20 to 150 Mbits of memory, precluding the sole use of on-chip memory. Even if wafer-scale integration or multi-chip modules are used, the device memory would be physically separate from the processor, introducing a large access latency. This is a serious problem in any processing system -- the speed of a processor is irrelevant if it is stalled waiting for data. The latency problem is greatly exacerbated by parallel processing, where the data being read is more likely to reside remotely from the requesting processor, and accesses may conflict with those made by other processors.
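A rough calculation shows how the frame count translates into the cited memory range. The resolution and sample depth below are illustrative assumptions (720 x 480 pixels at 16 bits per pixel, roughly 4:2:2 broadcast video), not figures from the text:

```python
# Rough frame-buffer sizing for a structured-video processor.
# Assumed (not from the text): 720 x 480 broadcast resolution,
# 16 bits per pixel (approximately 4:2:2 YUV sampling).

WIDTH, HEIGHT = 720, 480
BITS_PER_PIXEL = 16

def frame_buffer_mbits(frames):
    """Total storage in megabits for the given number of frames."""
    return frames * WIDTH * HEIGHT * BITS_PER_PIXEL / 1e6

low = frame_buffer_mbits(3)    # ~17 Mbit for 3 frames
high = frame_buffer_mbits(20)  # ~111 Mbit for 20 frames
print(f"{low:.1f} - {high:.1f} Mbit")
```

Under these assumptions, 3 to 20 frames works out to roughly 17 to 111 Mbit, the same order of magnitude as the 20 to 150 Mbit range above (higher resolutions or deeper samples push the estimate toward the upper end).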

The typical approach to reducing latency is a cache -- a relatively small amount of lower-latency memory containing a copy of the sections of the slower memory that the processor is using or is likely to use. Although this approach can be very effective in single-processor systems, its performance on media processing is less than optimal, for two reasons: the large amount of data being processed, and atypical data access patterns. The data set accessed by a media algorithm usually exceeds the size of the cache, diminishing the likelihood that desired data will be found there. Furthermore, a cache improves performance not only by maintaining a local copy, but also by prefetching data at addresses linearly adjacent to the requested one. Unfortunately, this automatic prefetching can degrade performance when data is accessed sparsely (with a step between samples greater than the size of a cache line).
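The cost of sparse access can be made concrete with a small model of cache-line utilization. The line and sample sizes are illustrative assumptions (32-byte lines, 2-byte samples), not parameters from the text:

```python
# Sketch of why sparse (strided) access defeats line-based prefetching.
# Assumed values (not from the text): 32-byte cache lines, 2-byte samples.

LINE_BYTES = 32
SAMPLE_BYTES = 2

def line_utilization(stride_samples):
    """Fraction of each fetched cache line actually consumed when
    reading every stride_samples-th sample from memory."""
    stride_bytes = stride_samples * SAMPLE_BYTES
    if stride_bytes >= LINE_BYTES:
        # Each fetched line contributes only one useful sample.
        return SAMPLE_BYTES / LINE_BYTES
    # The stride fits within a line, so 1 of every stride_samples
    # samples in the line is used.
    return 1.0 / stride_samples

print(line_utilization(1))   # unit stride: every fetched byte is used
print(line_utilization(32))  # large stride: 1/16 of each line is used
```

With unit stride every prefetched byte is eventually consumed, but once the stride exceeds a line, 15 of every 16 fetched bytes (under these assumptions) are transferred and then discarded, wasting memory bandwidth.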

When using a cache in a multiprocessor system, great care must be exercised to ensure that the data in the caches and in memory remain consistent. The overhead of maintaining coherence does not scale well, requiring either that all processors monitor all memory accesses, or that lists of data copies be maintained and consulted. Cache coherence schemes have been shown useful in systems of 2 to 4 processors [1].
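The scaling problem can be sketched with a toy message-count model of the two approaches just named. The parameters and the model itself are illustrative assumptions, not measurements from the text:

```python
# Toy model of coherence traffic, illustrating why overhead scales poorly.
# Assumed model (not from the text): under bus snooping, every write must
# be observed by all other processors' caches; under a directory scheme,
# invalidations go only to the processors currently sharing the line.

def snoop_messages(processors, writes):
    """Snooping: each write is broadcast to every other processor."""
    return writes * (processors - 1)

def directory_messages(writes, avg_sharers):
    """Directory: each write invalidates only the recorded sharers,
    at the cost of maintaining a sharer list per memory block."""
    return writes * avg_sharers

# Snoop traffic grows linearly with processor count for the same workload:
print(snoop_messages(4, 1000))     # 3000 observations
print(snoop_messages(64, 1000))    # 63000 observations
# Directory traffic tracks the sharing degree instead:
print(directory_messages(1000, 2)) # 2000 invalidations
```

The model shows the trade-off stated above: snooping makes every processor pay for every write, while a directory bounds the traffic but must store and consult per-block copy lists.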


wad@media.mit.edu