Sub-clinical mastitis (SCM) affects milk composition. In this study, we hypothesise that large-scale mining of milk composition features with pattern recognition models can identify the best predictors of SCM among those features. To this end, we used data mining algorithms in a large-scale, longitudinal study to evaluate various milk production parameters as indicators of SCM. SCM is the most prevalent disease of dairy cattle and causes substantial economic loss for the dairy industry. Developing new techniques to diagnose SCM in its early stages would improve herd health and is of great importance. Test-day somatic cell count (SCC) is the most common indicator of SCM and the primary mastitis surveillance approach worldwide. However, test-day SCC fluctuates widely between days, which raises major concerns about its reliability. Consequently, there would be great benefit in identifying additional efficient indicators from large-scale, longitudinal studies. With this intent, data were collected at every milking (twice per day) for a period of 2 months from a single farm using in-line electronic equipment (346 248 records in total). The following data were analysed: milk volume, protein concentration, lactose concentration, electrical conductivity (EC), milking time and peak flow. Three SCC cut-offs were used to estimate the prevalence of SCM: Australian (≥ 250 000 cells/ml), European (≥ 200 000 cells/ml) and New Zealand (≥ 150 000 cells/ml). First, 10 different attribute weighting models (AWM) were applied to the data. When SCC was not included among the variables, lactose concentration emerged as the most important variable, followed by EC. For the first time, using attribute weighting models, we showed that the concentration of lactose in milk can be used as a strong indicator of SCM.
The development of machine-learning expert systems using two or more milk variables (such as lactose concentration and EC) may produce a predictive pattern for early SCM detection.
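As a minimal illustration of the labelling and weighting steps described above, the following Python sketch applies the three SCC cut-offs to a handful of records and ranks the remaining variables by a single, simple weighting scheme. It is a sketch under stated assumptions, not the procedure used in the study: the records, field names and units are invented for illustration, and absolute Pearson correlation stands in for the 10 attribute weighting models, which are not specified here.

```python
from math import sqrt

# SCC cut-offs (cells/ml) as quoted in the abstract.
SCC_CUTOFFS = {"Australian": 250_000, "European": 200_000, "New Zealand": 150_000}

# Made-up records for illustration only (not data from the study).
# Assumed units: lactose in %, EC in mS/cm, volume in litres.
TOY_RECORDS = [
    {"scc": 90_000, "lactose": 4.9, "ec": 5.2, "volume": 14.0},
    {"scc": 120_000, "lactose": 4.8, "ec": 5.4, "volume": 12.5},
    {"scc": 180_000, "lactose": 4.6, "ec": 5.9, "volume": 13.0},
    {"scc": 260_000, "lactose": 4.4, "ec": 6.3, "volume": 11.5},
    {"scc": 400_000, "lactose": 4.2, "ec": 6.8, "volume": 12.0},
]


def label_scm(scc, cutoff):
    """Return 1 if the SCC meets or exceeds the cut-off, else 0."""
    return 1 if scc >= cutoff else 0


def prevalence(records, cutoff):
    """Fraction of records labelled as SCM under the given cut-off."""
    return sum(label_scm(r["scc"], cutoff) for r in records) / len(records)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def correlation_weights(records, features, cutoff):
    """Weight each feature by |correlation| with the SCM label, scaled to [0, 1].

    SCC itself is used only to build the label, never as a feature.
    """
    labels = [label_scm(r["scc"], cutoff) for r in records]
    raw = {f: abs(pearson([r[f] for r in records], labels)) for f in features}
    top = max(raw.values()) or 1.0  # avoid division by zero if all weights are 0
    return {f: w / top for f, w in raw.items()}


if __name__ == "__main__":
    for name, cutoff in SCC_CUTOFFS.items():
        print(name, prevalence(TOY_RECORDS, cutoff))
    print(correlation_weights(TOY_RECORDS, ["lactose", "ec", "volume"],
                              SCC_CUTOFFS["European"]))
```

Lowering the cut-off can only keep or raise the estimated prevalence, which is why the choice of threshold matters when the same records are compared across the three standards.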