The field of Computational Auditory Scene Analysis (CASA) is producing new, non-FFT-based representations of audio aimed at solving difficult auditory scene analysis problems. Frequency components are grouped using Gestalt principles, such as synchrony of onset and temporal proximity, as well as psychoacoustic principles such as harmonicity and critical-band masking effects.
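As a minimal sketch of this style of grouping, the following Python fragment clusters already-extracted partials by onset synchrony and harmonicity. The tolerances, the input data, and the function name are illustrative assumptions, not taken from any particular CASA system, and real systems operate on far richer track representations.

```python
# Hypothetical sketch: group partial tracks (frequency, onset time) using
# onset synchrony and harmonicity. Tolerances are illustrative assumptions.
import numpy as np

def group_partials(freqs_hz, onsets_s, onset_tol=0.03, harm_tol=0.03):
    """Group partials whose onsets are synchronous and whose frequencies
    lie near integer multiples of the lowest frequency in the group."""
    order = np.argsort(onsets_s)          # process partials in onset order
    groups = []
    for idx in order:
        placed = False
        for g in groups:
            f0 = min(freqs_hz[i] for i in g)          # candidate fundamental
            anchor_onset = onsets_s[g[0]]             # onset of first member
            ratio = freqs_hz[idx] / f0
            harmonic = abs(ratio - round(ratio)) < harm_tol * round(ratio)
            synchronous = abs(onsets_s[idx] - anchor_onset) < onset_tol
            if harmonic and synchronous:
                g.append(idx)
                placed = True
                break
        if not placed:
            groups.append([idx])          # start a new perceptual group
    return [sorted(g) for g in groups]

# Illustrative partials: harmonics of a 220 Hz note and a later 330 Hz note.
freqs = np.array([220.0, 440.0, 660.0, 330.0, 660.0, 990.0])
onsets = np.array([0.10, 0.11, 0.10, 0.50, 0.51, 0.50])
print(group_partials(freqs, onsets))      # -> [[0, 1, 2], [3, 4, 5]]
```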
Such representations are being employed to partition the time-frequency space, as represented in a Constant-Q spectrogram, into perceptually grouped regions. The potential of these grouped representations is enormous: all the salient components of auditory signals are encoded in three co-existing representations. However, such representations currently demand a prohibitive amount of computation, as well as specialist software tools, to form groups and tracks robustly.
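A rough sketch of this kind of partitioning is given below. It computes a Constant-Q spectrogram with librosa and uses a simple magnitude threshold plus connected-component labelling as a crude stand-in for perceptual grouping; the CQT parameters, the 40 dB threshold, and the example signal are assumptions for illustration only.

```python
# Hypothetical sketch: partition a Constant-Q spectrogram into connected
# time-frequency regions via thresholding and component labelling.
import numpy as np
import librosa
from scipy import ndimage

y, sr = librosa.load(librosa.example("trumpet"))        # any mono signal will do
C = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                       n_bins=84, bins_per_octave=12))  # Constant-Q magnitudes
C_db = librosa.amplitude_to_db(C, ref=np.max)

# Keep cells within 40 dB of the peak and group adjacent cells into regions.
mask = C_db > -40.0
labels, n_regions = ndimage.label(mask)
print(f"{n_regions} connected time-frequency regions")

# Summarise the first few regions by where they start in frequency and time.
cq_freqs = librosa.cqt_frequencies(84, fmin=librosa.note_to_hz("C1"),
                                    bins_per_octave=12)
for bin_slice, frame_slice in ndimage.find_objects(labels)[:5]:
    f_lo = cq_freqs[bin_slice.start]
    t_lo = librosa.frames_to_time(frame_slice.start, sr=sr, hop_length=512)
    print(f"region starts near {f_lo:.1f} Hz at {t_lo:.2f} s")
```

Even this toy example hints at the cost issue noted above: the Constant-Q transform and the subsequent region-forming pass are substantially more expensive than a single FFT-based spectrogram, and robust grouping requires far more elaborate machinery than a fixed threshold.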