One of the greatest remaining mysteries in cognitive science is how the rapid input of spoken language is mapped onto phonological and lexical representations over time. Attempts at psychologically tractable computational models of spoken word recognition tend either to ignore time or to transform the temporal input into a spatial representation. The latter is the approach taken in TRACE (McClelland & Elman, 1986), the model of spoken word recognition with the broadest and deepest coverage of phenomena in speech perception, spoken word recognition, and lexical parsing of multi-word sequences. TRACE reduplicates featural, phonemic, and lexical units at every time step in a potentially very large memory trace, and has rich interconnections (excitatory forward and backward connections between levels and inhibitory links within levels). This leads to an extreme proliferation of units and connections that grows dramatically as the lexicon or the memory trace grows. Our starting point is the observation that models of visual object recognition – including visual word recognition – have long grappled with the fundamental problem of how to model spatial invariance in human object recognition. We introduce a model that combines one aspect of TRACE – time-specific phoneme representations – with higher-level representations that have been used in visual word recognition – spatially (here, temporally) independent diphone and lexical units. This reduces the number of units and connections required by several orders of magnitude relative to TRACE. In this first report, we demonstrate that the model (dubbed TISK, for Time-Invariant String Kernel) achieves reasonable accuracy for the basic TRACE lexicon and successfully models the time course of phonological activation and competition. We close with a discussion of phenomena that the model does not yet successfully simulate (and why), and with novel predictions that follow from this architecture.
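To make the string-kernel idea concrete, the following is a minimal illustrative sketch (not the published TISK implementation): a word's phoneme sequence is mapped to the set of ordered phoneme pairs (open diphones) it contains. Because this set is the same no matter when in the input the word begins, a single diphone or lexical unit can respond to a word at any temporal position, which is what eliminates TRACE's need to reduplicate units at every time step. The function name and phoneme labels below are hypothetical choices for the example.

```python
from itertools import combinations

def open_diphones(phonemes):
    """Return the set of ordered phoneme pairs (open diphones) in a sequence.

    Each pair (p_i, p_j) with i < j activates the same diphone unit no
    matter where in time the phonemes occur -- a time-invariant code in
    the spirit of TISK's higher levels (an illustrative sketch only).
    """
    return {(a, b) for a, b in combinations(phonemes, 2)}

# Two tokens of the same word at different temporal positions yield the
# same diphone set, so one lexical unit can match either; a reordered
# sequence yields a different set, preserving order information.
print(open_diphones(["k", "ae", "t"]))   # diphones of /kaet/ ("cat")
print(open_diphones(["t", "ae", "k"]))   # diphones of /taek/ ("tack")
```

Note that, unlike a bag-of-phonemes code, the ordered pairs distinguish anagrams such as /kaet/ and /taek/, while remaining invariant to when the word occurs in the input stream.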