1. Introduction
Both genetic and common environmental effects are important in understanding sources of variation in quantitative phenotypes (e.g., risk or susceptibility factors for diseases). However, due to lack of evidence for the effects of shared environment, family studies or twin studies may result in over-estimation of genetic effects (Hopper, Reference Hopper, Spector, Snieder and MacGregor2000). Because of this, an extended twin design is often preferred (e.g., Stoel et al., Reference Stoel, De Geus and Boomsma2006) including phenotypes of the spouses (Eaves et al., Reference Eaves, Martin, Meyer, Corey and Cloninger1999). To account for the shared environmental effects, a large number of different types of individuals, their siblings and spouses need to be studied and information on the number of years spent together in a specific/shared environment, age up to which the environment (household) was shared and phenotypes need to be collected. To summarize, the estimated division between the genetics and environment in a population-specific manner, heritability is usually used as a summary statistics to describe the proportion of phenotypic variation attributable to genetic factors.
For quantitative traits, heritability estimation is often performed using a polygenic model (Henderson, Reference Henderson1984; Lynch and Walsh, Reference Lynch and Walsh1998; Abney et al., Reference Abney, McPeek and Ober2000). A polygenic model (or infinite locus or infinitesimal model) has been defined (Fisher, Reference Fisher1918) and used by assuming infinite number of additive unlinked loci (and infinite genome length), which implies 0·5 as a constant degree of genetic relationship between full siblings (Fisher, Reference Fisher1918). However, as is well known, the length of the genome is finite and specific to different species. The size of the human genome corresponds to an equivalent of 85 independent loci (Visscher et al., Reference Visscher, Medland, Ferreira, Morley, Zhu, Cornes, Montgomery and Martin2006). Because of this, Visscher et al. (Reference Visscher, Medland, Ferreira, Morley, Zhu, Cornes, Montgomery and Martin2006) first estimated the actual values of relationship coefficients (entries of the covariance matrix) using a set of neutral markers and then incorporated these values into the relationship matrix instead of a fixed 0·5 for full siblings. This practice resulted in much higher heritability estimates, than otherwise expected, based on collected phenotypes for human height, which is a typical quantitative trait (see Visscher et al., Reference Visscher, Medland, Ferreira, Morley, Zhu, Cornes, Montgomery and Martin2006; Xu, Reference Xu2006). The obtained heritability of h 2≈0·8 was close to estimates from twin studies (Silventoinen et al., Reference Silventoinen, Sammalisto, Perola, Boomsma, Cornes, Davis, Dunkel, De Lange, Harris, Hjelmborg, Luciano, Martin, Mortensen, Nisticò, Pedersen, Skytthe, Spector, Stazi, Willemsen and Kaprio2003).
For computational easiness, a finite polygenic model has been proposed as a finite locus approximation for polygenic model (Thompson and Skolnick, Reference Thompson, Skolnick, Pollack, Kempthorne and Bailey1977; Du et al., Reference Du, Hoeschele and Gage-Lahti1999; Du and Hoeschele, Reference Du and Hoeschele2000). However, the drawback in such models is that the heritability estimates are clearly dependent on the number of loci assumed in the model. Here, we propose another kind of finite approximation for a polygenic model, where instead of assuming infinite genome length, elements of the relationship matrix corresponding to full siblings (degree of genetic relationship between siblings) are treated as random variables and estimated simultaneously with the variance components. Unlike Visscher et al. (Reference Visscher, Medland, Ferreira, Morley, Zhu, Cornes, Montgomery and Martin2006), we do not utilize molecular marker information in estimation, but we assume the mean 0·5 and known standard deviation 0·0384 (cf. Guo, Reference Guo1996; Visscher et al. Reference Visscher, Medland, Ferreira, Morley, Zhu, Cornes, Montgomery and Martin2006) for the unknown coefficients of genetic relationship between full siblings. This formulation provides a more flexible model, which should be robust for deviations and pedigree errors in the data.
In the Indian Migration Study (IMS) (Lyngdoh et al., Reference Lyngdoh, Kinra, Shlomo, Reddy, Phabhalaran, Smith and Ebrahim2006), factory workers from four different parts of India are examined for various anthropometric and biochemical measurements apart from other characteristics. The spouses of the factory workers and siblings or relatives (in a very few cases when it was not possible to involve siblings) of both the factory workers and their spouses are examined. The main aim of IMS is to study the migration pattern in India; data on the place of origin, age when the place of origin was abandoned, age at the time of marriage and the number of years spent with the spouse in a common household are collected. It is to be noted that the siblings or relatives are taken from the place of origin of the respective factory worker or the spouse. Hence, IMS provided a unique opportunity to study the genetic as well as the shared environmental effects by modelling the dependency between the factory worker and his sibling/relative, factory worker and his spouse and the spouse and her sibling/relative (see Fig. 1).
The present work is motivated by IMS. In the setting of a simple linear mixed model, the effects of shared environment is modelled as a function of the number of years the environment was shared. In the present context, an environment refers to a common household in either a rural or an urban region.
In section 2, we construct a model describing the genetic and environmental effects on the phenotypes. We also define age-dependent proportions of total variations that are due to polygenic and environmental effects as meaningful summary statistics in the present context. In section 3, a Bayesian approach to the estimation of parameters is discussed and, in section 4, the proposed approach is applied to the IMS data. Finally, we conclude with the discussion section.
2. Model
In this section, we construct a model under the setting of the IMS, where the observation unit of the analysis is a group of four individuals defined by a male factory worker, his spouse, his sibling and his spouse's sibling. However, the idea presented here is very general and can be extended easily to other data types by appropriately specifying the relationships between the individuals within the unit of the analysis.
In the sequel, subscripts (1, 2, 3, 4) are used for the factory worker, his spouse, his sibling and his spouse's relative, respectively. Let a 4×1 vector of quantitative phenotypes corresponding to the factory worker i, i=1, 2,…, n and related observations be y i=(y i1, y i2, y i3, y i4)t, where t denotes the transpose. That is, y i1 corresponds to the measurement of the factory worker i, y i2 corresponds to the measurement of the spouse of the factory worker i and so forth. Let x 2=(x i1, x i2, x i3, x i4)t be the dummy variables or continuous measurements of the covariates of interest.
Consider a simple linear mixed model
where μ is the overall mean and β is the effect of covariates x i. Further, the additive genetic value or the polygenic effect vector, G i=(G i1, G i2, G i3, G i4)t, is assumed to be jointly (multivariate) normally distributed with the mean vector of zeros and variance–covariance matrix
Here, A i is the additive genetic relationship matrix (Henderson, Reference Henderson1984), which characterizes the degree to which related individuals jointly share their genomes identical-by-descent (IBD) and is given by
and sibling covariances c i(j, j′)∊[0, 1], (j, j′)=(1, 3), (2, 4) and i=1, 2,…, n are either taken from the common literature (when there is no inbreeding) as fixed constants equal to 0·5 or here are assumed to be independent and identically distributed random variables with expectation 0·5 and known variance (0·0384)2. Note that the fixed value 0·5 was originally derived as an asymptotic limiting value by assuming infinitely many loci (infinite genome length). Because all the genomes are of finite length in practice, the amount of IBD-sharing varies from one pair of siblings to another. Substantial deviations may also be expected in the presence of pedigree errors in the data. The detailed motivation for such a covariance structure is provided in Appendix A.
Further, the environment effect is modelled through the cumulative effects of the locations E i=(E i1, E i2, E i3, E i4)t. The effects E i are assumed to be drawn randomly from a multivariate normal distribution with mean vector zero and variance–covariance matrix Σei, which has elements
where the sum is over all the locations or households shared by j and j′, D il(j, j′) is the time (in years) (j, j′) spent together at location l for observation i, and σe2 is the constant multiplier of the duration matrix D i. Note that if (j, j′) did not share any location, then D il(j, j′)=0 and the corresponding observations on E i would be independent. This is the case for (j, j′) equal to {(1, 4), (4, 1), (3, 4), (4, 3)}. Similar identity is expected between observation (y i2, y i3), but in many Indian families siblings share the household even after they are married and hence the dependency between (y i2, y i3) and (y i2, y i3). This property is lacking from the settings of studies from developed countries. The rationale behind such a covariance structure is explained in Appendix A.
Excluding the interaction between G and E and under the assumption of independent and identical errors εi, the overall variance–covariance of y i, viz. Σi is then the sum of appropriate covariance terms of the polygenic effect, shared environmental effects and the model (1) error variance (σm2).
where I is a 4×4 identity matrix. The mean vector of y i is μi=μ+βx i.
In twin studies the heritability is traditionally defined directly from the observed differences between the measured correlations among monozygotic and dizygotic twins (Falconer, Reference Falconer1989). Here, we assume no dominance variance and thus focus on ‘narrow sense’ heritability (σg2/σy2). It is clear from (4) that the phenotypic variability (σy2) is modelled as an age-dependent quantity (σg2+age σe2+σm2) and this corresponds to the diagonal elements of the Σi. Note that each individual has lived with oneself for the number of years which are identical to his or her current age. In the present situation, the phenotypic variation turns out to be a function of the age when the phenotype was measured.
For a given age, we define the proportion of the total variation due to polygenic effect as
and the proportion of total variation due to environmental effect as
It is clear that heritability v g(age) is a decreasing function of age, but the rate at which it decreases with age depends on the environment variability σe2. In general, the heritability estimates of the same trait assessed in an environment with larger variability would be lower. Similarly, v e(age) is an increasing function of age and the rate is dependent on (σg2+σm2).
Omitting the age-dependency, a rather crude estimator of the heritability can also be given as
where is the empirical phenotypic variability, which can be estimated using sample variance of the observed phenotype data. Note that in expressions (5) and (6), σg2 is estimated under the proposed model, while in expressions (7) and (8), is estimated externally from the observed data. The other two variances (σe2, σm2) are as defined earlier under the proposed model.
3. Bayesian computation
The likelihood function is simply the product of n-independent 4-variate normal density
where θ=(μ, β, σg2, σe2, σm2) and |Σi| is the determinant of the matrix Σi. A priori independent normal priors are assigned to μ and βs, while mutually independent Gamma priors are assigned to the inverses of the three variance components. For discussions of informative and uninformative priors, see Gelman (Reference Gelman2006) and Dongen (Reference Dongen2006). The posterior distribution is proportional to the product of the likelihood and the prior densities
where k is the number of covariates, N(z; 0,10000) is univariate normal density with mean 0 and variance 10 000 evaluated at z and Γ(z; 0·01, 0·01) is the Gamma density with shape and scale parameters equal to 0·01 evaluated at z. The Gibbs sampler, applied to the posterior distribution of θ, involves specification of the following full conditional distributions:
where all the variables are being conditioned on their most recently sampled values. If sampling from any of these full conditional distributions is difficult, then a Metropolis–Hastings step can be used. The samples from the posterior distribution of θ (10) are generated by using the OpenBUGS 2.2.0 software (Spiegelhalter et al., Reference Spiegelhalter, Thomas, Best and Lunn2005; Thomas et al., Reference Thomas, O'Hara, Ligges and Stuartz2006).
Let (σg2(l), σe2(l), σm2(l)), l=1,…, M be the Markov Chain Monte Carlo (MCMC) samples from the posterior distribution. The sample for the proportions at MCMC round l, l=l,…, M is
The posterior median and credible intervals are then estimated using these samples.
When c i(1, 3) and c i(2,4) are allowed to be randomly distributed according to N(0·5, 0·03842), the above posterior distribution (10) is multiplied by
The full conditional distributions of c i(1, 3) and c i(2, 4) are
where
These unknown factors then get updated at each iteration. The missing data in y (under the missing at random assumption) are handled naturally in the Bayesian computation and they get updated or predicted at each iteration. The missing data in the covariates can be handled by introducing a prior distribution for covariates.
4. Data analyses
The IMS data are analysed here using three different models: (1) assuming a model y i=μ+εi and a general variance–covariance matrix Σ for each i and using the Wishart prior for it, (2) the proposed model y i=μ+βx i+G i+E i+εi with 0·5 as the constant degree of genetic relationship, that is, C i(1, 3) and C i(2, 4) are equal to 0·5 for all i, and (3) the proposed robust model y i=μ+βx i+G i+E i+εi when the degree of genetic relationship is random and distributed according to the normal distribution with mean 0·5 and variance (0·0384)2. We also estimate the proportions v g(age) and v e(age).
We analyse 372 (=n) household data each of which consists of four observations as shown in Figure 1. For the purpose of illustration, four phenotypes are analysed, viz. body height, weight, systolic blood pressure and total serum cholesterol. The covariates used are age when the phenotypes were measured, sex (1=man, 0=woman) and location (categorized as urban and rural). The total number of observations on the phenotypes is 1488. For cholesterol, 167 observations were missing, while for systolic blood pressure, four observations were missing. All the observations were available for height and weight. A natural question that arises is whether the correlation between phenotypes of spouses is larger for spouses that have been together for longer? To check this, a regression analysis of the phenotype of the factory worker y i1 on the age when the phenotype was collected, the phenotype of the spouse y i2, the duration they have lived together D i(1, 2) and the interaction y i2∗D i(1, 2) gave positive regression coefficients for the phenotype of the spouse, and the duration they have lived together. Hence, it may be concluded that the spouse correlation is larger for spouses that have been together longer. We also looked at the phenotypic variances among those below and above 40 years of age to see whether phenotypic variances increase with age and by grouping the pairs of observations, ((y i1, y i2), (y i1, y i3), (y i2, y i4)), according to the number of years spent in 5-year groups. We noticed that in our data, the phenotypic variances among those who are below 40 years of age are smaller than those above 40 years of age. There was a clear increase in the phenotypic covariations in the initial years of living together (up to 10 years) and for the later years no clear pattern was observed. This may indicate a more general model, where the shared environmental effect decreases with age (see the Discussion section).
In Appendix B, OpenBUGS code for analysing the proposed model is given. The model (1) gave an estimate of Σ with positive covariance between individuals 1 and 4. This is not desirable since these individuals neither are genetically related nor have a shared environment. Hence, this may indicate a possible lack of fit for models (2) and (3), which assume a specific structure of dependency under random mating (see the Discussion section).
Table 1 gives the posterior median and 95% credible intervals for three variance components and for four phenotypes using models (2) and (3) and Table 2 gives corresponding estimates of the proportions of variability due to polygenic effect and environmental effect at the age of 30 years along with their crude estimates. The models show similar results except for slight differences for cholesterol. This is due to the missing data for cholesterol. It is our observation that the convergence was faster with model (3) when there were missing data for phenotypes. For all the phenotypes considered, the variation due to the shared environment is less compared to the polygenic and residual variability. This is because the environmental variation is not simply σe2 but with a multiplier D i. The latter are rather close for height and weight, while polygenic variability is higher than the residual variability for systolic blood pressure and cholesterol.
Figures 2(a)–(h) show the posterior median and 95% credible intervals for the proportions v g(age) and v e(age) for different phenotypes and for different values of age. The curves due to the two models are overlapping in most cases. It is clear from Figure 3(a), which shows the posterior medians of v g(age), that the trend in the proportion is similar between weight, systolic blood pressure and cholesterol, while for height there is sudden decrease with age. This is because participants reach their maximal height in early adult life and from then onwards there is no further growth. Till age 20, the proportion of total variability due to polygenic effect is higher in systolic blood pressure compared to weight, but after the age of 30 the trend reverses. Similarly, v e(age) are shown in Figure 3(b) where height behaves differently compared to other three phenotypes. We also carried out the analysis using log-transformation for height and the proportions of total variations due to polygenic and environmental effects were close to the results obtained without log-transformation.
The posterior median and 95% credible intervals are given for some of the regression parameters in Table 3. Only the sex effect is seen on height with men tending to be taller than women. Body weight is affected by age and sex. With age, body weight increases and men have a higher weight compared to women. A similar trend is seen for systolic blood pressure with respect to age and sex. The regression coefficient of sex is negative for cholesterol, indicating that women have higher cholesterol compared to men.
5. Discussion
The household data with a typical structure of a married couple and their siblings are analysed so that the built-in dependency structure is used in specifying the variance–covariance structure. The dependency structure allows separation of overall variability into polygenic and shared environmental components. Like McGregor et al. (Reference McGregor, Knott, White and Visscher2003), our analysis represents multivariate analysis of heritability in the sense that all age strata are analysed jointly compared to age-stratified heritability estimation (Brown et al., Reference Brown, Beck, Lange, Davis, Kay, Langefelt and Rich2003). In our analyses, a general model that does not take into account the dependency structure but allows any positive definite Σi showed weak association between the observations (y i1, y i4), which may indicate a small amount of assortative mating present among the study subjects. In the proposed model, we assume that these two individuals are neither related nor have lived in the same household and the variability is directly proportional to the number of years lived in the same household. This reflects our understanding of how environmental exposures operate.
The model used in this paper is the first possible simplest model for the sibling pairs and spouse pairs association separating environmental and genetic variations. The unexplained variation is expressed in terms of the residual variance σm2. A more complex model allowing different dependencies between the sibling pairs and spouse pairs based on the age when lived together and allowing covariate information like socio-economic status (SES) influencing the traits/phenotypes might describe the data better. One such model could be developed under the assumption that the environmental dependency is stronger if the environment is shared early in life (that is, at a younger age). For example, a more complex modelling that allows the effect of the shared environment on the phenotypes to be larger in the initial years of life and to decrease with age could be carried out using appropriate functions, for example exp {−λt). The shared environment variance is then computed as the area under the function with respect to the Gaussian process. We may also need different models for different traits since the proposed model gives counter-intuitive results for height and the development of phenotypes with age could be different. The heritability obtained without adjusting for age (Table 2, ṽ g) is more comparable to the results reported from the twin studies, but the age-adjusted proportions are rather different. However, we must remember that modelling polygenic and environmental variations as time-dependent quantities is very rare and we must not compare mere numbers in the absolute sense. We think that the proposed approach opens up methodological questions and provides scope for further research in this area.
The estimates of heritability obtained here are not directly comparable to the ones reported in the literature. In the context of cholesterol, Mastropaolo et al. (Reference Mastropaolo, Matheny and Lang2001) pointed out that previous studies consist of older children and adult twin pairs have indicated environmental contribution to the variation of cholesterol among individuals of 7–68% in the populations evaluated. Our results are to be compared with the previously reported results keeping this in mind. In this paper, we defined heritability as a decreasing function of age, where its value is decreasing at a rate depending on the ratio of environmental and genetic variances. Thus, it is not surprising that height seems to be less genetic at age 100 than at age 20. However, this trend may be more pronounced if the information relevant to early life environment comes from siblings who have less variability in trait and later-life information comes more from spouses who will have greater variability.
The proposed method allows a comparison of various phenotypes between rural and urban areas by incorporating an appropriately defined covariate based on migration status. A more thorough analysis focusing on the rural and urban comparisons would be reported elsewhere since the main interest here is in estimating the relative genetic and environmental variations.
In summary, we have developed a model that uses the dependency between sibling pairs and spouse pairs to estimate the age-dependent proportions of total variation that are due to polygenic and environmental effects. Using the Bayesian approach, the variation of IBD-sharing could be incorporated into the model without using marker information and this is the first time such an idea has been tried. The approximate N(0·5, 0·0382) for IBD shared by siblings is prior information based on genomewide analysis from previous studies. The Bayesian method can sample the realized IBD from the conditional posterior distribution. This explains why the variation of IBD can be incorporated without using marker information. This idea is novel and appears to be useful in separating the genetic from environmental variances. The model would benefit from testing in populations who have experienced different exposures. As pointed out to us by the Editor, using spouse information to quantify the common environmental variance may be better than using the twin data because the twin data are hard to collect, while spouse and sibling data are easier to collect. We may have very large samples for spouse–sibling data compared to the twin data.
This work was done while the first author was a visiting researcher at the Department of Mathematics and Statistics, University of Helsinki. This work was supported by research grants (numbers 114786 and 202324) from the Academy of Finland. We would like to thank Dr Bijoy Joseph, Indic Society for Education and Development (INSEED), India for his help in preparing the data in the required format. We are grateful to the Editor and two anonymous referees for their constructive comments on the manuscript.
The IMS Group comprises: Professor K. Srinath Reddy, Dr Dorairaj Prabhakaran, Professor Tulsi Patel, Dr Lakshmy Ramakrishnan, Dr Ruby Gupta and Dr Tanica Lyngdoh (New Delhi, India); Professor R. C. Ahuja and Professor R. K. Saran (Lucknow, India); Dr Prashant Joshi and Dr N. M. Thakre (Nagpur, India); Dr K. V. R. Sarma, Professor S. Mohan Das, Dr R. K. Jain and Dr S. S. Potnis (Hyderabad, India); Professor Anura V. Kurpad, Dr Mario Vaz, Ms A. V. Barathi and Dr Murali Mohan (Bangalore, India); Dr Chittaranjan Yajnik (Pune, India); Dr Ian Baker, Professor George Davey Smith, Professor Yoav Ben Shlomo and Dr Kate Tilling (Bristol); and Professor Shah Ebrahim and Dr Sanjay Kinra (London School of Hygiene and Tropical Medicine).
The IMS is funded by Wellcome Trust project grant GR070797MF. We are grateful to the field staff conducting the migration study and to the participants.
6. Appendix A: Rationale behind the covariance structures
6.1. Covariance structure of polygenic effects
Consider a single locus with s allele effects (η1,…, ηs) We assume that these are independent and identically distributed normal variates with zero mean and variance σl2. Let us suppose that the human genome contains K such independent loci. Then G i1 can be expressed as the sum of the effects of the 2K alleles transmitted by parents and let this be Similarly, let be the genetic effect for a sibling of i1. It is easy to check that for j=1, 3,
The conditional distribution of (G i1, G i3) is the normal distribution with mean and covariance structure as specified above. The shared genome between the sibling pair is the proportion of IBD genes actually shared by them. The value 0·5 was originally obtained as an asymptotic limiting value by assuming infinite genome length. As stated in Xu (Reference Xu2006), the actual amount of IBD-sharing between human full siblings is not a constant but a variable with expectation equal to 0·5 and variance (0·0384)2. Here, the variance is specified by the length of the human genome. Hence, in our model, we allow the covariance between the polygenic effects of full siblings ij and ij′ to vary by a factor c i(j, j′)∊[0, 1] with mean 0·5 and variance (0·0384)2. For simplicity, during estimation we have assumed that c i(j, j′) has a normal distribution with mean 0·5 and variance (0·0384)2.
6.2. Covariance structure of environmental effects
Under the assumption that the environment changes over time, we define
where a(s) is the age at time s, f(a(s)) is a positive, monotone nonincreasing function of the age a(s), I ij(s; l) is an indicator function assuming the value 1 if the individuals i and j have lived together at location l at time s and is zero otherwise, {W ijl(s), s∊[0, ∞]} is a Gaussian process, with expected value E(W ijl(s))=0 for all (i,j,l) and s and covariance function cov (W ijl(s 1), W ij′l(s 2))=σe2 min (s 1, s 2) and cov (W ijl(s 1), W i′j′l′(s 2))=0, if i≠i′, l≠l′. Here, we take f(a(s))=1 but a more general function describing the model assumption of stronger environment if shared at younger ages can be modelled using, for example, exp {−λa(s)} for λ>0.
The variance–covariance matrix is then
where D il(j, j′) is time spent together by (j, j′) at the location l.