The Power of Gene-based Rare Variant Methods to Detect Disease-Associated Variation and Test Hypotheses about Complex Disease

Citation:   Moutsianas, Agarwala et al. (2015)

Main manuscript:

Moutsianas L, Agarwala V, Fuchsberger C, Flannick J, Rivas MA, Gaulton KJ, Albers PK, the GoT2D Consortium, McVean G, Boehnke M, Altshuler D & McCarthy MI (2015) The Power of Gene-Based Rare Variant Methods to Detect Disease-Associated Variation and Test Hypotheses About Complex Disease. PLoS Genet 11(4): e1005165. doi:10.1371/journal.pgen.1005165

Relevant references:

HAPGEN2 simulations platform
Su, Z., Marchini, J. & Donnelly, P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27 (16), 2304-2305 (2011).
Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. & 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics 27 (15), 2156-2158 (2011).


Genome and exome sequencing in large cohorts enables characterization of the role of rare variation in complex diseases. Success in this endeavor, however, requires investigators to test a diverse array of genetic hypotheses which differ in the number, frequency and effect sizes of underlying causal variants. In this study, we evaluated the power of gene-based association methods to interrogate such hypotheses, and examined the implications for study design. We developed a flexible simulation approach, using 1000 Genomes data, to (a) generate sequence variation at human genes in up to 10K case-control samples, and (b) quantify the statistical power of a panel of widely used gene-based association tests under a variety of allelic architectures, locus effect sizes, and significance thresholds. For loci explaining ~1% of phenotypic variance underlying a common dichotomous trait, we find that all methods have low absolute power to achieve exome-wide significance (~5-20% power at α=2.5×10-6) in 3K individuals; even in 10K samples, power is modest (~60%). The combined application of multiple methods increases sensitivity, but does so at the expense of a higher false positive rate. MiST, SKAT-O, and KBAC have the highest individual mean power across simulated datasets, but we observe wide architecture-dependent variability in the individual loci detected by each test, suggesting that inferences about disease architecture from analysis of sequencing studies can differ depending on which methods are used. Our results imply that tens of thousands of individuals, extensive functional annotation, or highly targeted hypothesis testing will be required to confidently detect or exclude rare variant signals at complex disease loci.

Author list and affiliations

  • Loukas Moutsianas 1,*
  • Vineeta Agarwala 2-3,*
  • Christian Fuchsberger 4
  • Jason Flannick 3,5
  • Manuel A. Rivas 1
  • Kyle J. Gaulton 1
  • Patrick K. Albers 1
  • GoT2D Consortium 6
  • Gil McVean 1
  • Michael Boehnke 4
  • David Altshuler 3,5,7-8
  • Mark I. McCarthy 1,9
  1. Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
  2. Program in Biophysics, Harvard University, Cambridge, MA 02138, USA
  3. Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
  4. Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48103, USA
  5. Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA
  6. A full list of GoT2D Consortium members and affiliations appears in the Supplemental Note.
  7. Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
  8. Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  9. Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford OX3 7LJ, UK
* These authors contributed equally to this work


Description of data download and file structure

Download complete data
Download archive
(73983 files, 3.0Gb)
This page contains download links to all datasets used in the manuscript, together with the scripts that were used to generate them, and the results of all gene-based tests run on them (as well as of single variant tests). All files can be downloaded together as a single big file. Moreover, there are additional options for downloading datasets, scripts, or results only, and for downloading each of these by architecture, gene, or test.

  • Datasets

    The number, frequencies, and effect sizes of causal variants in each dataset vary, but all datasets across a set explain the same percentage of variance on the liability scale. The default is 1%, unless otherwise stated, and is included in the filename of each dataset. In all architectures but AR6, only deleterious effects were introduced (and all such datasets contain the infix "Dvar"). Datasets were a mixture of deleterious and protective effects were introduced contain the infix "Mvar".

    As an example,   STOX1_HM.Dvar0.005_54.cases.vcf.gz   and   STOX1_HM.Dvar0.005_54.controls.vcf.gz   in   AR2/DATASETS/Dvar0.005/STOX1_DIR/   contain the case and control datasets for simulation 54 of gene STOX1, simulated under AR2 with only deleterious effects having been introduced and with the total variance explained by the locus being 0.5%. All datasets are in GZIP compressed VCF format.

  • Results

    All reported results, unless otherwise stated, have been run with an allele-frequency cut-off for variant inclusion of 1%. This is true for all results but these in directories   AR2/RESULTS/Dvar0.01_MAFthr0.05   and   AR2/RESULTS/Dvar0.01_MAFthr0.005 , where MAF cut-offs of 5% and 0.5% were used, respectively.

    The scheme followed for the presentation of results is identical to that used by PSEQ, with result files obtained from other gene-based methods being re-formatted in the same way, for consistency. Each results file contains 9 columns, of which columns 1, 2, 5, 6, and 7 are of interest to the reader, containing the dataset being tested, the locus, the number of variants being tested, the test used, and the resulting p-value, respectively, in this order.

  • Scripts

    The scripts provided in this site are the ones which were used to produce the respective datasets. Although they cannot be re-run to produce identical datasets (due to intermediate files having been removed after dataset generation), they are informative with respect to the number, positions, and effect sizes of the variants which were introduced in each dataset.

    For instance, script   STOX1_HM.Dvar0.005_54.script   in   AR2/SCRIPTS/Dvar0.005/STOX1_DIR/   was used to create datasets   STOX1_HM.Dvar0.005_54.controls.vcf.gz   and   STOX1_HM.Dvar0.005_54.cases.vcf.gz , mentioned above. The relevant information is contained in line 5 of each script (where the HAPGEN2 command is run).

    A separate section on the website ("Dataset Generation", currently under construction) will contain required files and instructions for building datasets under various settings, including the ones employed in this work.

All files are GZIP compressed TAR archives — the file format is *.tar.gz

Please be aware that some browsers handle downloaded files differently, which can result in changed MD5 checksums


The datasets (cases and controls) are all provided in GZIP compressed VCF (Variant Call Format) version 4.1.
See   VCFtools   or   1000 Genomes Project   for specifications about the VCF format.


Please note that the scripts provided are the exact ones used for the generation of the datasets used in the analysis and cannot be run directly.
See   Information about dataset generation   for the generation of datasets with identical (or various other) characteristics to the ones provided.