The neurocognition of multimodal interaction – the embedded, embodied, predictive processing of vocal and non-vocal communicative behaviour – has developed into an important subfield of cognitive science. This field, however, exhibits a glaring lacuna: the dearth of precise investigation of the meanings of the verbal and non-verbal communication signals that constitute multimodal interaction. Cognitively construable dialogue semantics provides a detailed and context-aware notion of meaning, and thereby contributes the content-based identity conditions needed to distinguish multimodal constituents that are defined syntactically or by form. We exemplify this by means of two novel empirical examples: dissociated uses of negative polarity utterances and head shaking, and attentional clarification requests addressing speaker/hearer roles. On this view, interlocutors are described as co-active agents, which motivates replacing sequential turn organisation as the basic organising principle with notions of leading and accompanying voices. We formulate the Multimodal Serialisation Hypothesis: multimodal natural language processing is driven in part by a notion of vertical relevance – the relevance of utterances occurring simultaneously – which, we suggest, supervenes on sequential (‘horizontal’) relevance – the relevance of utterances succeeding each other in time.