One challenge in building collaborative design tools that use speech and sketch input is distinguishing gesture pen strokes from those representing device structure, that is, object strokes. In previous work, we developed a gesture/object classifier that uses features computed from the pen strokes and the speech aligned with them. Experiments indicated that the speech features were the most important for distinguishing gestures, thus indicating the critical importance of the speech–sketch alignment. Consequently, we have developed a new alignment technique that employs a two-step process: the speech is first explicitly segmented (primarily into clauses), then the segments are aligned with the pen strokes. Our speech segmentation step is unique in that it uses sketch features for locating segment boundaries in multimodal dialog. In addition, it uses a single classifier to directly combine word-based, prosodic (pause), and sketch-based features. In the second step, segments are initially aligned with strokes based on temporal correlation, and then classifiers are used to detect and correct two common alignment errors. Our two-step technique has proven to be substantially more accurate at alignment than the existing technique that lacked explicit segmentation. It is more important that, for nearly all cases, our new technique results in greater gesture classification accuracy than the existing technique, and performed nearly as well as the benchmark manual speech–sketch alignment.