The term ‘data mining’ can be used to describe any process where useful information is extracted from data with a large background of ‘noise’. In the context of a genome project, several stages involve data mining. Amongst the sequence data, ‘signals’ need to be detected that indicate the presence of interesting features. Often this involves differentiating between transcribed and non-transcribed bases to predict coding regions. After detection, defining the roles of these sequences involves sifting through multiple lines of evidence. If these roles are accurately reflected in genome annotation, they can be used by researchers to frame queries and interrogate the data further.