Author manuscript; available in PMC: 2014 Jan 2.
Published in final edited form as: Genet Epidemiol. 2010 Jan;34(1):10.1002/gepi.20435. doi: 10.1002/gepi.20435

Meta-Analysis of Genome-Wide Association Studies: No Efficiency Gain in Using Individual Participant Data

D Y Lin, D Zeng
PMCID: PMC3878085  NIHMSID: NIHMS141823  PMID: 19847795

Abstract

To identify genetic variants with modest effects on complex human diseases, a growing number of networks or consortia are created for sharing data from multiple genome-wide association studies on the same disease or related disorders. A central question in this enterprise is whether to obtain summary results or individual participant data from relevant studies. We show theoretically and numerically that meta-analysis of summary results is statistically as efficient as joint analysis of individual participant data (provided that both analyses are performed properly under the same modeling assumptions). We illustrate this equivalence with case-control data from the Finland-United States Investigation of NIDDM Genetics (FUSION) study. Collating only summary results will increase the number and representativeness of available studies, simplify data collection and analysis, reduce resource utilization, and accelerate discovery.

Keywords: complex diseases, GWAS consortia, joint analysis, mega analysis, SNPs, summary results

INTRODUCTION

Genome-wide association studies (GWAS) have yielded new findings for many complex human diseases. Because complex diseases are influenced by an array of genetic variants, most with small to moderate effects, it is difficult for one GWAS to provide unequivocal findings. Indeed, the odds ratios of disease with SNPs that have been observed in GWAS thus far are typically less than 1.5, and the majority of positive findings have emerged only after aggressive data sharing across multiple studies. For example, the initial findings from individual type 2 diabetes GWAS were ambiguous, but a number of disease loci with odds ratios of 1.1 to 1.4 were identified conclusively after combining results from several studies (Saxena et al. 2007; Zeggini et al. 2007; Scott et al. 2007; Zeggini et al. 2008).

Recognizing the need for and benefits of data sharing, GWAS investigators have formed various networks or consortia to share data on the same disease or related disorders (Kavvoura and Ioannidis 2008). For example, the Psychiatric GWAS Consortium we are involved with has enrolled 47 studies in 5 major disorders (The Psychiatric GWAS Consortium Steering Committee 2009). Some of these consortia have attempted to obtain raw data on individual participants, as opposed to the summary results that are used in traditional meta-analysis. The raw data from all available studies can then be analyzed simultaneously. Such analysis is commonly called joint analysis or mega-analysis. We will use the term mega-analysis and refer to the traditional method of combining summary results as meta-analysis.

A major motivation for obtaining raw, individual-level data is the general perception that mega-analysis is statistically more efficient than meta-analysis since it utilizes much more detailed information. However, obtaining raw data is difficult, costly and time-consuming. Some investigators are unwilling or unable to share raw data. For the Tobacco and Genetics Consortium we are involved with, the majority of the investigators were unable to provide raw data due to IRB issues and/or study policies that prohibit the sharing of raw data. Excluding studies that do not contribute raw data will reduce statistical power and limit the generalizability of the findings. Furthermore, the sheer scale of GWAS data poses significant practical challenges in storing and analyzing raw data from a large number of studies.

We show in this article that meta-analysis (when performed properly) is as efficient as mega-analysis in that the estimates of any genetic effect produced by the two methods have approximately the same variance. Thus, there is no need to obtain raw data. Even if raw data are available, one can analyze the data for each study separately and then combine the summary results through meta-analysis. This will greatly facilitate the analysis, especially if raw data are available only on a subset of studies.

METHODS

We wish to combine results from K studies with nk participants in the kth study. For the analysis of each SNP, the data consist of (Yki, Xki), where Yki is the disease status (1 = disease, 0 = no disease) for the ith participant of the kth study, and Xki is the corresponding genotype score. (Under the additive mode of inheritance, the genotype score is the number of minor alleles; under the dominant model, the genotype score indicates, by the values 1 versus 0, whether or not the individual has at least one minor allele; under the recessive model, the genotype score indicates, by the values 1 versus 0, whether or not the individual has two minor alleles. For an untyped SNP, the unknown genotype score may be imputed by the expected genotype score.) We assume the following logistic regression model:

$$\Pr(Y_{ki}=1)=\frac{e^{\alpha_k+\beta X_{ki}}}{1+e^{\alpha_k+\beta X_{ki}}}, \qquad (1)$$

where the αk’s are study-specific intercepts, and β is the log odds ratio representing a common genetic effect across studies.

Let β̂k be the maximum likelihood estimate of β by maximizing the likelihood function of the kth study

$$L(\alpha_k,\beta)=\prod_{i=1}^{n_k}\frac{e^{Y_{ki}(\alpha_k+\beta X_{ki})}}{1+e^{\alpha_k+\beta X_{ki}}},$$

and let Vk be the variance estimate of β̂k. Then the inverse-variance meta-analysis estimate of β is

$$\left(\sum_{k=1}^{K}V_k^{-1}\right)^{-1}\sum_{k=1}^{K}V_k^{-1}\hat\beta_k,$$

and its variance is estimated by

$$\left(\sum_{k=1}^{K}V_k^{-1}\right)^{-1}.$$
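The inverse-variance combination described above can be sketched in a few lines. This is a minimal illustration, not the authors' software: the function name `inverse_variance_meta` and the per-study numbers are hypothetical, chosen only to show the arithmetic.

```python
def inverse_variance_meta(estimates, variances):
    """Combine per-study estimates (beta_hat_k) using their variance estimates (V_k).

    Returns the pooled estimate and its estimated variance, i.e.
    (sum_k 1/V_k)^(-1) * sum_k (beta_hat_k / V_k) and (sum_k 1/V_k)^(-1).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one variance per estimate, and at least one study")
    weights = [1.0 / v for v in variances]          # V_k^{-1}
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight


# Hypothetical log odds ratios and variance estimates from two studies
# (illustrative numbers only, not taken from the paper).
pooled, pooled_var = inverse_variance_meta([0.30, 0.40], [0.010, 0.010])
print(pooled, pooled_var)
```

With equal variances the pooled estimate is the simple average of the two study estimates and the pooled variance is half the per-study variance; with unequal variances the more precise study receives more weight.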

To perform mega-analysis, we obtain the maximum likelihood estimate of β and its variance estimate by maximizing the joint likelihood function

$$\prod_{k=1}^{K}L(\alpha_k,\beta).$$

We show in the Appendix that the meta-analysis and mega-analysis estimates of β have approximately the same variance, so the two methods have approximately the same efficiency.
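The comparison can be tried on simulated data. The sketch below is a minimal illustration under stated assumptions, not the authors' simulation code: it draws participants prospectively from model (1) rather than using case-control sampling, uses hypothetical intercepts (−1.0 and −0.5) so that both outcomes are common, and fits the model by a hand-written Newton-Raphson routine (`fit_common_beta`, a name introduced here). The same routine gives the per-study fit when passed one study and the mega-analysis fit when passed all studies.

```python
import math
import random


def simulate_study(n, maf, alpha, beta, rng):
    """Draw n (Y, X) pairs from model (1); X is the additive genotype score."""
    data = []
    for _ in range(n):
        x = int(rng.random() < maf) + int(rng.random() < maf)
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
        data.append((1 if rng.random() < p else 0, x))
    return data


def fit_common_beta(studies, iters=25):
    """Newton-Raphson MLE with study-specific intercepts and a shared beta.

    Returns (alphas, beta, var_beta). The variance is the inverse of the
    summed per-study information for beta, evaluated at the last iterate
    before the final (negligible) update.
    """
    alphas = [0.0] * len(studies)
    beta = 0.0
    for _ in range(iters):
        score_b, info_b, parts = 0.0, 0.0, []
        for a, data in zip(alphas, studies):
            s_a = s_b = h_aa = h_ab = h_bb = 0.0
            for y, x in data:
                p = 1.0 / (1.0 + math.exp(-(a + beta * x)))
                v = p * (1.0 - p)
                s_a += y - p
                s_b += (y - p) * x
                h_aa += v
                h_ab += v * x
                h_bb += v * x * x
            # Eliminate the intercept of this study (Schur complement).
            score_b += s_b - h_ab / h_aa * s_a
            info_b += h_bb - h_ab * h_ab / h_aa
            parts.append((s_a, h_aa, h_ab))
        d_beta = score_b / info_b
        alphas = [a + (s_a - h_ab * d_beta) / h_aa
                  for a, (s_a, h_aa, h_ab) in zip(alphas, parts)]
        beta += d_beta
    return alphas, beta, 1.0 / info_b


rng = random.Random(2010)  # arbitrary seed
study1 = simulate_study(2000, 0.3, -1.0, math.log(1.4), rng)
study2 = simulate_study(2000, 0.2, -0.5, math.log(1.4), rng)

# Meta-analysis: fit each study separately, then inverse-variance weight.
per_study = [fit_common_beta([s]) for s in (study1, study2)]
weights = [1.0 / v for _, _, v in per_study]
meta_var = 1.0 / sum(weights)
meta_beta = meta_var * sum(w * b for w, (_, b, _) in zip(weights, per_study))

# Mega-analysis: maximize the joint likelihood over both studies.
_, mega_beta, mega_var = fit_common_beta([study1, study2])

print("meta:", meta_beta, math.sqrt(meta_var))
print("mega:", mega_beta, math.sqrt(mega_var))
```

On a single simulated data set the two estimates and the two standard errors should agree closely but not exactly, since the equivalence is asymptotic.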

We can add covariates to model (1) in both meta-analysis and mega-analysis. The covariates may include environmental factors or principal components (Price et al. 2006) used to adjust for population stratification. The numbers and types of covariates need not be the same across studies. Meta-analysis of covariate-adjusted genetic effects is approximately as efficient as mega-analysis using individual-level covariate data (see the Appendix for details).

If the effects of some covariates are the same across studies, then one can improve the efficiency of mega-analysis by incorporating this restriction into the joint likelihood function and thus estimating fewer parameters. However, the efficiency gain is usually minimal because the number of covariates is much smaller than the sample sizes of typical GWAS. Interestingly, one can achieve the same efficiency gain by performing a multivariate version of meta-analysis (see the Appendix for details). The multivariate version of meta-analysis is not generally recommended because it requires additional summary results and the assumption of common covariate effects may not be appropriate.

Both meta-analysis and mega-analysis assume a common genetic effect across studies. This assumption does not affect the validity of association testing since the genetic effects are all zero under the null hypothesis of no association. However, it is important to determine whether meta-analysis or mega-analysis is more powerful when the effect sizes are unequal among studies. We show in the Appendix that the estimates produced by meta-analysis and mega-analysis are approximately the same and their variance estimates are also approximately the same when the genetic effects are unequal across studies, so that the two methods have similar statistical powers.

RESULTS

SIMULATION STUDIES

To demonstrate the equivalence between meta-analysis and mega-analysis, we present here some simulation results on combining two case-control studies. We simulated data from model (1), in which the SNP of interest had population minor allele frequencies (MAFs) of 0.3 and 0.2 in studies 1 and 2, respectively, and Xki was the number of minor alleles. We set α1 = −3, α2 = −2.2, and β = log 1.4. We also considered unequal values of β for the two studies. Note that $e^{\beta}$ is the odds ratio (OR) of disease with the SNP under the additive mode of inheritance. We considered various combinations of the numbers of cases and controls for the two studies. For each combination of the simulation parameters, we generated 10 million data sets and performed meta-analysis and mega-analysis of each data set under model (1). The results are summarized in Table 1.

Table 1.

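Mean effect estimates, standard errors and powers at the $10^{-7}$ significance level for meta-analysis and mega-analysis of case-control data. Study 1 has MAF = 0.3 and study 2 has MAF = 0.2.

| Study 1 OR | Study 1 cases | Study 1 controls | Study 2 OR | Study 2 cases | Study 2 controls | Meta mean | Meta SE | Meta power | Mega mean | Mega SE | Mega power |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.4 | 1,000 | 1,000 | 1.4 | 1,000 | 1,000 | 1.402 | 0.076 | 0.812 | 1.402 | 0.076 | 0.814 |
| 1.4 | 1,500 | 1,500 | 1.4 | 500 | 500 | 1.402 | 0.074 | 0.865 | 1.402 | 0.074 | 0.866 |
| 1.4 | 500 | 500 | 1.4 | 1,500 | 1,500 | 1.402 | 0.079 | 0.745 | 1.402 | 0.079 | 0.747 |
| 1.4 | 750 | 1,500 | 1.4 | 1,500 | 750 | 1.402 | 0.076 | 0.814 | 1.402 | 0.076 | 0.815 |
| 1.4 | 1,500 | 750 | 1.4 | 750 | 1,500 | 1.402 | 0.076 | 0.812 | 1.402 | 0.076 | 0.814 |
| 1.5 | 1,000 | 1,000 | 1.3 | 1,000 | 1,000 | 1.411 | 0.077 | 0.840 | 1.411 | 0.077 | 0.843 |
| 1.5 | 1,500 | 1,500 | 1.3 | 500 | 500 | 1.459 | 0.077 | 0.967 | 1.459 | 0.077 | 0.967 |
| 1.5 | 500 | 500 | 1.3 | 1,500 | 1,500 | 1.359 | 0.076 | 0.543 | 1.360 | 0.076 | 0.550 |
| 1.5 | 750 | 1,500 | 1.3 | 1,500 | 750 | 1.408 | 0.076 | 0.830 | 1.408 | 0.076 | 0.841 |
| 1.5 | 1,500 | 750 | 1.3 | 750 | 1,500 | 1.413 | 0.077 | 0.850 | 1.414 | 0.078 | 0.847 |
| 1.3 | 1,000 | 1,000 | 1.5 | 1,000 | 1,000 | 1.383 | 0.075 | 0.736 | 1.383 | 0.075 | 0.741 |
| 1.3 | 1,500 | 1,500 | 1.5 | 500 | 500 | 1.338 | 0.070 | 0.594 | 1.339 | 0.070 | 0.599 |
| 1.3 | 500 | 500 | 1.5 | 1,500 | 1,500 | 1.436 | 0.081 | 0.858 | 1.437 | 0.081 | 0.861 |
| 1.3 | 750 | 1,500 | 1.5 | 1,500 | 750 | 1.386 | 0.075 | 0.755 | 1.386 | 0.076 | 0.748 |
| 1.3 | 1,500 | 750 | 1.5 | 750 | 1,500 | 1.380 | 0.074 | 0.720 | 1.381 | 0.074 | 0.737 |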

When the SNP effects are the same between the two studies, the mean estimates of the SNP effects and the standard errors are identical up to the third decimal point between meta-analysis and mega-analysis, and the powers are identical up to the second decimal point. When the SNP effects are different between the two studies, there are some slight differences between the two methods, and either method can be slightly more powerful than the other.
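To see how an effect estimate and its standard error translate into power at the 10⁻⁷ level, the following sketch applies the usual normal approximation for a two-sided Wald test. It rests on assumptions that the source does not state: that the SE column of Table 1 is on the odds-ratio scale (so that the delta method gives SE(log OR) ≈ SE/OR), and that power is computed for a Wald test. The function name `wald_power` is introduced here for illustration.

```python
import math
from statistics import NormalDist


def wald_power(beta, se, alpha=1e-7):
    """Approximate power of a two-sided Wald test of beta = 0 at level alpha."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)   # two-sided critical value
    shift = beta / se                                # noncentrality on the z scale
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


odds_ratio = 1.4
se_or_scale = 0.076                       # ASSUMED to be on the odds-ratio scale
se_log_scale = se_or_scale / odds_ratio   # delta-method conversion to the log scale
power = wald_power(math.log(odds_ratio), se_log_scale)
print(round(power, 3))
```

Under these assumptions the approximation lands in the neighborhood of 0.8, which is broadly in line with the first row of Table 1. Because power depends on the estimate and its standard error only, two methods that agree on both will have nearly the same power.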

FUSION DATA

For illustration with empirical data, we considered the Finland-United States Investigation of NIDDM Genetics (FUSION) study (Scott et al. 2007). The FUSION study genotyped 1,161 Finnish type 2 diabetes (T2D) cases and 1,174 Finnish normal glucose-tolerant (NGT) controls on 317,503 SNPs on the Illumina HumanHap300 BeadChip in stage 1 of a two-stage design. Based on the stage-1 results and the findings of other studies, the study genotyped 224 SNPs in an additional 1,204 Finnish T2D cases and 1,253 Finnish NGT controls. The subjects with missing genotypes on a particular SNP were excluded from the analysis of that SNP. All subjects have age and sex information.

We performed meta-analysis and mega-analysis of T2D status on the 224 SNPs that were genotyped in both stage 1 and stage 2 of the FUSION study. The results under the additive mode of inheritance are displayed in Figure 1. The individual estimates of odds ratios vary considerably between stages 1 and 2. The combined estimates of odds ratios and the corresponding standard error estimates are virtually identical between meta-analysis and mega-analysis, and consequently the two sets of p-values are virtually identical. The only noticeable differences lie in SNPs 114, 166 and 176, which have observed MAFs of approximately 0.9%, 1.6% and 3.1%. For SNPs with low MAFs, the individual estimates of genetic effects may be unstable, which may cause the combined estimates to be different between meta-analysis and mega-analysis. Such differences are unlikely to alter the rankings of the top SNPs because the p-values associated with rare SNPs tend to be non-significant.

Figure 1.

Analysis of stages 1 and 2 data from the FUSION study. The top left panel compares the individual estimates of odds ratios between stages 1 and 2; the top right panel compares the combined estimates of odds ratios between meta-analysis and mega-analysis; the bottom left panel compares the standard error estimates between the two methods; and the bottom right panel compares the − log10(p-values) between the two methods. In each panel, the red line indicates where the values on the two axes are equal.

For further illustration, we included age and sex as covariates in the logistic regression model. When age and sex are allowed to have different effects between stages 1 and 2, meta-analysis and mega-analysis again produce virtually identical results; see Figure 2. When age and sex are assumed to have common effects between stages 1 and 2 in mega-analysis, the results between the two methods are slightly more different; see Figure 3.

Figure 2.

Analysis of stages 1 and 2 data from the FUSION study adjusted for age and sex. The top left panel compares the individual estimates of odds ratios between stages 1 and 2; the top right panel compares the combined estimates of odds ratios between meta-analysis and mega-analysis; the bottom left panel compares the standard error estimates between the two methods; and the bottom right panel compares the − log10(p-values) between the two methods. Both meta-analysis and mega-analysis allow age and sex effects to be different between stages 1 and 2. In each panel, the red line indicates where the values on the two axes are equal.

Figure 3.

Analysis of stages 1 and 2 data from the FUSION study adjusted for age and sex. The top left panel compares the individual estimates of odds ratios between stages 1 and 2; the top right panel compares the combined estimates of odds ratios between meta-analysis and mega-analysis; the bottom left panel compares the standard error estimates between the two methods; and the bottom right panel compares the − log10(p-values) between the two methods. Mega-analysis assumes age and sex effects to be the same between stages 1 and 2 whereas meta-analysis does not. In each panel, the red line indicates where the values on the two axes are equal.

DISCUSSION

Publication bias is a major concern in meta-analysis of literature results. One may reduce or avoid this kind of bias by planning GWAS meta-analysis prospectively to take advantage of all available studies and all available SNPs. By using summary results rather than raw data, one can increase the number of available studies and thus enhance the power of the analysis and the generalizability of the findings.

In many applications, it is desirable to adjust for participant-level covariates, such as principal components and environmental exposures. Such data are not available in published reports. In a consortium setting, the covariate adjustments can be made within each study and the covariate-adjusted estimates of genetic effects can then be combined through meta-analysis. It is logistically much simpler to provide such adjusted estimates than to transfer raw data. Indeed, this is the strategy adopted by the Tobacco and Genetics Consortium and many other consortia. If the covariate effects are the same across studies, then the mega-analysis that incorporates that restriction tends to be more efficient than the traditional meta-analysis. However, the efficiency gain is generally minimal and the same efficiency gain can be achieved by using a multivariate version of meta-analysis (see the Appendix for details).

We have focused on binary traits. In a related paper, Olkin and Sampson (1998) showed that, for comparing treatments with respect to a continuous outcome in clinical trials, meta-analysis is equivalent to mega-analysis if the treatment effects and error variances are constant across trials. It follows from the arguments of the Appendix that all the conclusions of this article hold for quantitative traits and indeed for any traits under any study designs; the details are given in Lin and Zeng (2009).

By working with raw data, one can ensure that all studies use the same quality-control criteria and estimate the same quantities. However, such standardization and harmonization of information can also be achieved by requiring all participating investigators to follow a common set of guidelines on quality control and statistical analysis so that the data are filtered and analyzed in the same way across studies before summary results are submitted.

Acknowledgments

The authors are grateful to Drs. Michael Boehnke and Heather Stringham and other FUSION investigators for providing the data used in this article. They are also grateful to Dr. Kuo-Ping Li for his programming assistance. This research was supported by the National Institutes of Health.

APPENDIX

TECHNICAL DETAILS

We adopt the notation of the Methods section. Let α̂k and β̂k be the maximum likelihood estimates (MLEs) of αk and β based on the likelihood function of the kth study, and let α̃k and β̃ be the MLEs of αk and β based on the joint likelihood function. Note that β̃ is the mega-analysis estimate of β. Write θk = (αk, β), θ̂k = (α̂k, β̂k) and θ̃k = (α̃k, β̃). Also, define

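$$I_k(\theta_k)=\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}^{2}-\left\{\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}\right\}^{2}\Big/\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k),$$

where $\upsilon_{ki}(\theta_k)=e^{\alpha_k+\beta X_{ki}}/(1+e^{\alpha_k+\beta X_{ki}})^{2}$. According to the MLE theory (Cox and Hinkley 1979), the variances of β̂k and β̃ are estimated by $V_k=I_k^{-1}(\hat\theta_k)$ and

$$\mathrm{Var}(\tilde\beta)=\left\{\sum_{k=1}^{K}I_k(\tilde\theta_k)\right\}^{-1},$$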

respectively. The inverse-variance meta-analysis estimate of β is

$$\hat\beta=\left\{\sum_{k=1}^{K}I_k(\hat\theta_k)\right\}^{-1}\sum_{k=1}^{K}I_k(\hat\theta_k)\hat\beta_k, \qquad (2)$$

and its variance is estimated by

$$\mathrm{Var}(\hat\beta)=\left\{\sum_{k=1}^{K}I_k(\hat\theta_k)\right\}^{-1}.$$

Note that Var(β̂) takes the same form as Var(β̃): the only difference is that Ik is evaluated at θ̂k in the former and at θ̃k in the latter. Denote $n=\sum_{k=1}^{K}n_k$. Under model (1) of the Methods section, α̂k and α̃k converge to αk while β̂k and β̃ converge to β (as the sample sizes nk increase), so that β̂ also converges to β while $\mathrm{Var}(n^{1/2}\hat\beta)$ and $\mathrm{Var}(n^{1/2}\tilde\beta)$ converge to a common constant. Thus, $n^{1/2}(\hat\beta-\beta)$ and $n^{1/2}(\tilde\beta-\beta)$ are asymptotically normal with mean 0 and with a common variance, which implies that meta-analysis and mega-analysis are asymptotically equivalent.
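As a reading aid, the following standard calculation (not spelled out in the source) shows where the expression for Ik comes from: it is the information about β that remains after the study-specific intercept αk has been eliminated. The symbol $J_k$ for the full 2 × 2 information matrix is introduced here for illustration only.

```latex
% Information matrix of (alpha_k, beta) for study k under model (1);
% J_k is notation introduced for this note, not used in the paper.
\[
J_k(\theta_k)=
\begin{pmatrix}
\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k) & \sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}\\[4pt]
\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki} & \sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}^{2}
\end{pmatrix}.
\]
% The (beta, beta) element of the inverse is the reciprocal of the Schur complement:
\[
\bigl[J_k^{-1}(\theta_k)\bigr]_{\beta\beta}
=\left[\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}^{2}
-\frac{\bigl\{\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}\bigr\}^{2}}
{\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)}\right]^{-1}
=I_k^{-1}(\theta_k).
\]
```

Because the intercepts are study-specific, the joint information matrix has the same structure study by study, which is why the per-study quantities Ik simply add in the mega-analysis variance.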

To accommodate covariates, we extend equation (1) of the Methods section as follows:

$$\Pr(Y_{ki}=1)=\frac{e^{\alpha_k+\beta X_{ki}+\gamma_k^{\mathrm T}Z_{ki}}}{1+e^{\alpha_k+\beta X_{ki}+\gamma_k^{\mathrm T}Z_{ki}}}, \qquad (3)$$

where Zki is the vector of covariates for the ith participant of the kth study, and γk is the corresponding vector of log odds ratios. By incorporating the unit component into Zki and the intercept αk into γk, equation (3) can be written in a more compact form

$$\Pr(Y_{ki}=1)=\frac{e^{\beta X_{ki}+\gamma_k^{\mathrm T}Z_{ki}}}{1+e^{\beta X_{ki}+\gamma_k^{\mathrm T}Z_{ki}}}.$$

The likelihood functions given in the Methods section are modified to reflect the inclusion of covariates in the model. Write θk = (β, γk). Let θ̂k and θ̃k denote the MLEs of θk based on the likelihood function of the kth study and the joint likelihood function, respectively. Then all the results of the previous paragraph hold with the redefinition of

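$$I_k(\theta_k)=\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}^{2}-\left\{\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}Z_{ki}^{\mathrm T}\right\}\left\{\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)Z_{ki}Z_{ki}^{\mathrm T}\right\}^{-1}\left\{\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}Z_{ki}\right\},$$

where $\upsilon_{ki}(\theta_k)=e^{\beta X_{ki}+\gamma_k^{\mathrm T}Z_{ki}}/(1+e^{\beta X_{ki}+\gamma_k^{\mathrm T}Z_{ki}})^{2}$.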

If the effects of covariates are the same across studies, then equation (3) becomes

$$\Pr(Y_{ki}=1)=\frac{e^{\alpha_k+\beta X_{ki}+\gamma^{\mathrm T}Z_{ki}}}{1+e^{\alpha_k+\beta X_{ki}+\gamma^{\mathrm T}Z_{ki}}}. \qquad (4)$$

By expanding Xki to include Zki, equation (4) can be written as

$$\Pr(Y_{ki}=1)=\frac{e^{\alpha_k+\beta^{\mathrm T}X_{ki}}}{1+e^{\alpha_k+\beta^{\mathrm T}X_{ki}}},$$

in which the vector β represents both the genetic effect and the covariate effects. Redefine

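$$I_k(\theta_k)=\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}X_{ki}^{\mathrm T}-\left\{\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}\right\}\left\{\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k)X_{ki}^{\mathrm T}\right\}\Big/\sum_{i=1}^{n_k}\upsilon_{ki}(\theta_k),$$

where $\upsilon_{ki}(\theta_k)=e^{\alpha_k+\beta^{\mathrm T}X_{ki}}/(1+e^{\alpha_k+\beta^{\mathrm T}X_{ki}})^{2}$. By the arguments of the first paragraph, β̂ and β̃ are asymptotically normal with mean β and with a common covariance matrix. Thus, performing the multivariate version of meta-analysis on the vector of parameters β yields an estimate of the genetic effect that is asymptotically as efficient as the mega-analysis estimate when covariate effects are the same across studies.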
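For concreteness, the multivariate version of meta-analysis can be written out as follows. This is our reading of equation (2) with the matrix-valued Ik defined above; the source does not print the formula, so it should be taken as a sketch of the intended estimator rather than a quotation.

```latex
% Multivariate inverse-variance combination: beta is now a vector containing
% the genetic effect followed by the common covariate effects, and each
% I_k(\hat\theta_k) is the inverse of the estimated covariance matrix of \hat\beta_k.
\[
\hat\beta=\left\{\sum_{k=1}^{K}I_k(\hat\theta_k)\right\}^{-1}
\sum_{k=1}^{K}I_k(\hat\theta_k)\,\hat\beta_k,
\qquad
\mathrm{Var}(\hat\beta)=\left\{\sum_{k=1}^{K}I_k(\hat\theta_k)\right\}^{-1}.
\]
% The genetic-effect estimate is the first component of \hat\beta, and its
% variance estimate is the (1,1) element of Var(\hat\beta).
```

Under this reading, each study must report the full vector of estimates together with its estimated covariance matrix, not just the genetic-effect estimate and its standard error.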

Because model (3) has K sets of covariate effects whereas model (4) only has one set, mega-analysis is generally more efficient under model (4) than under model (3). Thus, univariate meta-analysis, which is asymptotically equivalent to mega-analysis under model (3), is generally less efficient than mega-analysis under model (4). However, the efficiency loss is minimal in large samples. Although one can avoid the efficiency loss by performing multivariate meta-analysis, it is more difficult to obtain multivariate than univariate summary statistics.

All the above results assume that the genetic effects are the same across studies. This assumption does not affect the type I error of association testing since all genetic effects are zero under the null hypothesis of no association. Nevertheless, it is of practical importance to determine the relative power of meta-analysis versus mega-analysis when genetic effects are unequal. By taking the differences between the score functions of $L_k(\alpha_k,\beta)$ and $\prod_{k=1}^{K}L_k(\alpha_k,\beta)$ and applying the mean-value theorem, we can show that

$$\tilde\beta=\left\{\sum_{k=1}^{K}I_k(\theta_k^{*})\right\}^{-1}\sum_{k=1}^{K}I_k(\theta_k^{*})\hat\beta_k,$$

where $\theta_k^{*}$ lies between θ̂k and θ̃k. Thus, β̃ takes the same form as β̂ shown in equation (2), the difference being that Ik is evaluated at $\theta_k^{*}$ in the former and at θ̂k in the latter. As indicated before, the only difference between Var(β̃) and Var(β̂) is that Ik is evaluated at θ̃k in the former and at θ̂k in the latter. Note that Ik depends on θk through υki(θk) only. It can be shown that υki(θk) does not change its values drastically when θk varies between θ̂k and θ̃k in case-control studies with modest genetic effects. Thus, β̂ and β̃ are approximately the same, and so are Var(β̂) and Var(β̃). Consequently, the power of meta-analysis is similar to that of mega-analysis even when genetic effects are unequal across studies.
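One way to see why υki changes little (an illustration, not the authors' proof) is to note that υ equals p(1 − p), where p is the fitted disease probability, and that p(1 − p) is nearly flat around p = 0.5, which is roughly where fitted probabilities sit in a balanced case-control sample. The sketch below uses hypothetical linear predictors: 0 for the balanced baseline and log 1.5 for a shift as large as the larger odds ratios considered in Table 1.

```python
import math


def upsilon(eta):
    """Logistic variance weight e^eta / (1 + e^eta)^2 at linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p * (1.0 - p)


# Hypothetical values for illustration only.
base = upsilon(0.0)                 # p = 0.5
shifted = upsilon(math.log(1.5))    # p = 0.6
relative_change = (base - shifted) / base
print(round(base, 4), round(shifted, 4), round(relative_change, 4))
```

Here the weight moves from 0.25 to 0.24, a change of about 4%, even though the linear predictor has moved by a full log odds ratio of 1.5. The difference between θ̂k and θ̃k is typically much smaller than that, so the weights entering Ik differ only slightly.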

References

  1. Cox DR, Hinkley DV. Theoretical Statistics. Chapman and Hall; 1979.
  2. Kavvoura FK, Ioannidis JPA. Methods for meta-analysis in genetic association studies: a review of their potential and pitfalls. Human Genetics. 2008;123:1–14. doi: 10.1007/s00439-007-0445-9.
  3. Lin DY, Zeng D. On the relative efficiency of using summary statistics versus individual level data in meta-analysis. 2009. Unpublished technical report. doi: 10.1093/biomet/asq006.
  4. Olkin I, Sampson A. Comparison of meta-analysis versus analysis of variance of individual patient data. Biometrics. 1998;54:317–22.
  5. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909. doi: 10.1038/ng1847.
  6. Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PIW, Chen H, Roix JJ, Kathiresan S, Hirschhorn JN, Daly MJ, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 2007;316:1331–1336. doi: 10.1126/science.1142358.
  7. Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science. 2007;316:1341–1345. doi: 10.1126/science.1142382.
  8. The Psychiatric GWAS Consortium Steering Committee. A framework for interpreting genome-wide association studies of psychiatric disorders. Molecular Psychiatry. 2008;14:10–17. doi: 10.1038/mp.2008.126.
  9. Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JRB, Rayner NW, Freathy RM, et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science. 2007;316:1336–1341. doi: 10.1126/science.1142364.
  10. Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, de Bakker PIW, Abecasis GR, Almgren P, Andersen G, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nature Genetics. 2008;40:638–645. doi: 10.1038/ng.120.
