Temporal event clustering in speech versus music

Abstract

Both speech and music can be organized as hierarchical, nested groupings of units. In speech, for instance, phonemes group to form syllables, which group to form words, which group to form sentences, and so on. In music, notes group to form phrases, which group to form chord progressions, which group to form verses, and so on. We present a new method for extracting events (amplitude peaks in the Hilbert envelopes of filter-bank outputs) from speech and music recordings, and for quantifying the degree of nesting in temporal clusters of events across timescales (using Allan Factor analysis). We applied this method to monologue speech recordings (TED talks) and to solo musical performances of similar length. We found that both types of recordings exhibit nested clustering, revealing similar organizational principles, but that clustering is more pronounced at shorter timescales (milliseconds) for speech and at longer timescales (seconds and above) for music.
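For concreteness, the following is a minimal sketch of the two stages named above, assuming a Python/SciPy environment. The filter-bank band edges, the peak-height threshold, and the window sizes are illustrative placeholders, not the parameters used in the study.

```python
# Sketch (not the authors' implementation): event extraction from filter-bank
# Hilbert envelopes, followed by Allan Factor analysis of the event train.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, find_peaks

def event_times(signal, fs, bands=((100, 400), (400, 1600), (1600, 6400))):
    """Detect events as amplitude peaks in the Hilbert envelope of each band.
    Band edges (Hz) are illustrative, not the study's filter bank."""
    times = []
    for lo, hi in bands:
        sos = butter(4, (lo, hi), btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfiltfilt(sos, signal)))          # amplitude envelope
        peaks, _ = find_peaks(env, height=env.mean() + env.std())  # illustrative threshold
        times.append(peaks / fs)
    return np.sort(np.concatenate(times))

def allan_factor(times, duration, window_sizes):
    """Allan Factor AF(T) = E[(N_{i+1} - N_i)^2] / (2 E[N_i]), where N_i is the
    event count in the i-th non-overlapping window of size T (seconds)."""
    afs = []
    for T in window_sizes:
        edges = np.arange(0.0, duration + T, T)
        counts = np.histogram(times, bins=edges)[0].astype(float)
        afs.append(np.mean(np.diff(counts) ** 2) / (2 * counts.mean()))
    return np.array(afs)
```

Under this reading, nested clustering shows up as Allan Factor values that grow with window size T; comparing the AF curves for speech and music recordings then indicates the timescales at which clustering is most pronounced in each.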
