Published online by Cambridge University Press: 12 July 2004
Identification of cis-regulatory motifs has been difficult due to the short and variable length of the sequences that bind transcription factors. Using both sequence and microarray expression data, we present a method for identifying cis-regulatory motifs that uses regression trees to refine results from simple linear regression of expression levels on motif counts. Analysis of expression patterns from two separate datasets for genes showing significant differences in expression between the sexes in Drosophila melanogaster resulted in a model that identified known binding sites upstream of genes that are differentially expressed in the germline. We obtained a strong result for motif TCGATA, part of the larger, characterized binding site of dGATAb protein. We also identified an uncharacterized motif that is positively associated with sex-biased expression and was assembled from smaller motifs grouped by our model. A regression tree model provides a grouping of independent variables into multiple linear models, an advantage over a single multivariate model. In our case, this grouping of motifs suggests binding sites for cooperating factors in sex-specific expression, as well as a way of combining smaller motifs into larger binding sites.
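The two-stage idea described above (regress expression on motif counts, then let a tree group motifs) can be illustrated with a minimal sketch. This is not the authors' implementation: the sequences, expression values, gene names, and all function names are hypothetical, the second motif GTAC is chosen only for contrast, and the tree is reduced to a single split with mean-valued leaves rather than the per-leaf linear models the abstract mentions.

```python
from statistics import mean

def count_motif(seq, motif):
    """Count (possibly overlapping) occurrences of motif in seq."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)

def simple_regression(x, y):
    """Least-squares slope and intercept for y ~ intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return 0.0, my  # motif count is constant: no slope can be estimated
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(counts, y):
    """Return (score, motif, threshold) for the single split that minimizes
    the summed within-group SSE: one node of a regression tree."""
    best = None
    for motif, x in counts.items():
        for t in sorted(set(x))[:-1]:  # splitting above the max is pointless
            left = [yi for xi, yi in zip(x, y) if xi <= t]
            right = [yi for xi, yi in zip(x, y) if xi > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, motif, t)
    return best

# Hypothetical upstream sequences and sex-bias expression scores (toy data).
upstream = {
    "geneA": "TTCGATAGGTCGATA",
    "geneB": "ACGTTCGATAACGT",
    "geneC": "ACGTACGTACGTAC",
    "geneD": "GGGTCGATATCGATA",
}
expression = [2.0, 1.0, 0.0, 2.0]  # same order as the genes above
motifs = ["TCGATA", "GTAC"]

# Stage 1: one simple linear regression per motif.
counts = {m: [count_motif(s, m) for s in upstream.values()] for m in motifs}
fits = {m: simple_regression(counts[m], expression) for m in motifs}

# Stage 2: pick the motif and count threshold that best partition the genes.
split = best_split(counts, expression)

for m in motifs:
    print(m, counts[m], fits[m])
print("best split:", split)
```

In a fuller model, the split would be applied recursively and a separate regression fitted within each leaf, so that motifs appearing together along one branch suggest candidate cooperating sites.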