By Mark McCarthy & Anubha Mahajan.
Today, Nature Genetics published our manuscript describing the latest in the series of genome-wide association analyses for type 2 diabetes that we (as the DIAGRAM consortium) have completed over the past decade. For this round, we assembled genome-wide association data from nearly 900,000 individuals from 32 studies, focusing on individuals of European descent. Just under 10% of these participants had type 2 diabetes, making this comfortably the largest such study yet conducted for type 2 diabetes.
In addition to increasing the number of samples tested, this analysis was the first T2D association analysis to take full advantage of the much more detailed imputation reference panels now available. By upgrading from the 1000 Genomes panel (of a few hundred European genomes) to the Haplotype Reference Consortium panel (of around 30,000 genomes), we were able to undertake a much more robust survey of the contribution to T2D risk made by low frequency alleles.
It’s worth pausing for a moment to contemplate the staggering volume of data that a study such as this generates, and the scale of the advance over the past decade. The type 2 diabetes analysis conducted in 2007 as part of the Wellcome Trust Case Control Consortium featured 500K SNPs and 5K individuals (a total of 2.5 billion genotypes). A decade on, the current study includes 27M SNPs and 900K individuals (around 25 trillion genotypes). If each genotype were a 1cm marble, 25 trillion would be enough to fill, from pitch to brim, around 20 stadia the size of Wembley.
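The genotype totals above are simply the number of variants multiplied by the number of individuals. As a quick check of that arithmetic, here is a minimal sketch using the rounded figures quoted in this post (the function and variable names are ours, purely for illustration). With these rounded inputs the current study comes out nearer 24 trillion, which the post rounds to 25 trillion.

```python
def genotype_count(n_variants: int, n_individuals: int) -> int:
    """Total genotypes = variants tested x individuals genotyped."""
    return n_variants * n_individuals


# Rounded figures quoted in the post (not exact study counts).
wtccc_2007 = genotype_count(500_000, 5_000)
current_study = genotype_count(27_000_000, 900_000)

# How many times more genotypes than a decade earlier.
fold_increase = current_study / wtccc_2007

print(f"2007 analysis:  {wtccc_2007:,} genotypes")
print(f"Current study:  {current_study:,} genotypes")
print(f"Fold increase:  {fold_increase:,.0f}x")
```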
These numbers would be expected to bring increased power to detect and characterise novel association signals, and the present study does not disappoint. The details are available in the manuscript, but the main “discovery” findings are these:
- We detect 243 loci at genome-wide significance, including 135 loci never previously implicated in type 2 diabetes predisposition.
- By performing conditional analyses around these primary signals, we identify a further 160 secondary signals within these loci, for a total of 403 significant signals across the 243 loci. (Note that this study was limited to individuals of European descent, so there is a set of about 40-50 additional signals that were first described in non-European samples: because of ethnic differences in allele frequency and effect size, not all of these reached genome-wide significance in the present study, so the total number of confirmed T2D signals sits around the 450 mark.)
- We detect far more signals at the lower ranges of the allele frequency spectrum (minor allele frequency <5%), including 14 low frequency and rare variants with estimated allelic odds ratios greater than 2.0. This harvest of low frequency and rare alleles of large impact provides interesting new alleles for functional studies. However, before diving into functional studies, it’s important to note that some of these signals will need further analysis to confirm that they are genuine association signals, given the diverse artefacts that can complicate rare variant analysis (and the difficulties inherent in obtaining external replication). In any case, these discoveries at the lower end of the frequency spectrum do not materially alter the conclusions about the genetic architecture of type 2 diabetes reached in previous studies (such as Fuchsberger et al, Nature 2016). Put simply, the T2D effect sizes we are seeing at these low frequency and rare variant signals remain comparatively modest, such that, both individually and collectively, these variants account for a much smaller proportion of the overall variance in disease risk than the common risk-variants that we continue to uncover through these same approaches. The overall picture here is one that is typical of a post-reproductive disease, where there has been very little if any selective pressure: under those circumstances, disease-associated variation will follow neutral expectation, and most of it will reside in common shared variation.
- At a sizeable proportion of loci, the increase in sample size, and the finer resolution of the imputation panel, has allowed us to dissect regional patterns of variant correlation (“linkage disequilibrium”) and home in on the specific variants most likely to be driving the association signal. At 18 loci, we could assign a “lead” variant that has more than 99% chance of being the causal variant (given the implicit assumptions behind the method). If we take more liberal thresholds for “proof”, then those numbers grow to 51 (for a >80% chance) or 100 (for a >50% chance). This “fine mapping” success offers a boost to functional studies that aim to characterise the variants that drive these signals. We are currently completing a large multi-ethnic association study for T2D (which combines these data from Europeans with equivalent GWAS analyses performed in T2D case-control samples of East Asian, South Asian, Hispanic, and African descent): we expect this to bring further resolution of causal variants by allowing us to capitalise on trans-ethnic differences in linkage disequilibrium.
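To give a flavour of what a ">99% chance of being the causal variant" means, here is a minimal sketch of one widely used fine-mapping calculation. It is not necessarily the exact method used in the paper. It assumes a single causal variant per signal, converts each variant's effect estimate and standard error into a Wakefield-style approximate Bayes factor, and normalises these into posterior probabilities. The prior standard deviation of 0.2 and the three example variants are invented for illustration.

```python
import math


def log_abf(beta: float, se: float, prior_sd: float = 0.2) -> float:
    """Log approximate Bayes factor in favour of association (Wakefield-style).

    prior_sd is an assumed prior standard deviation on the log odds ratio.
    """
    v = se ** 2          # variance of the effect estimate
    w = prior_sd ** 2    # prior variance of the true effect
    z = beta / se
    return 0.5 * math.log(v / (v + w)) + (z * z / 2.0) * (w / (v + w))


def posterior_probs(log_abfs: list[float]) -> list[float]:
    """Normalise Bayes factors into per-variant posterior probabilities,
    assuming exactly one causal variant and equal prior odds per variant."""
    m = max(log_abfs)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in log_abfs]
    total = sum(weights)
    return [wt / total for wt in weights]


def credible_set(pps: list[float], coverage: float = 0.99) -> list[int]:
    """Smallest set of variant indices whose posteriors sum to >= coverage."""
    order = sorted(range(len(pps)), key=lambda i: pps[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += pps[i]
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical (beta, standard error) pairs for three variants at one signal.
variants = [(0.12, 0.010), (0.05, 0.010), (0.03, 0.011)]
pps = posterior_probs([log_abf(b, s) for b, s in variants])
cs99 = credible_set(pps, coverage=0.99)
print("posterior probabilities:", [round(p, 4) for p in pps])
print("99% credible set (variant indices):", cs99)
```

When one variant's posterior exceeds the chosen threshold on its own, the credible set shrinks to that single variant, which is the situation described for the 18 loci above.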
The identification of GWAS signals and the causal variants that lie within them is, however, but a means to an end. Our loftier goals are an understanding of the processes and pathways that are central to type 2 diabetes pathogenesis (particularly those that might be amenable to preventative or therapeutic modulation), and improved tools for the clinical management of diabetes (particularly those that allow us to define individuals at greatest risk of diabetes and/or its complications). How do studies like this advance those goals?
- First, the identification of GWAS loci and the variants that drive them puts us on a path to defining the genes and proteins through which they exert their impact on diabetes pathogenesis. Those proteins become potential targets for the development of novel therapeutics. Alternatively, they provide entry points for the discovery of biomarkers that can be used to stratify risk and/or monitor therapy. For around 10% of the GWAS loci, the causal variant can be positioned within the coding sequence of a protein-coding gene: when that is so, we have a direct “address” for the relevant protein. For the remainder, the causal variant lies in non-coding sequence, and some detective work is required to work out which of the nearby genes is under the regulatory control of the diabetes risk-variant. Overall, according to our estimates, the running total of “target genes” for type 2 diabetes identified through these various approaches runs to ~70, each of them offering a piece in the complex puzzle of diabetes pathogenesis. (We are recompiling and updating this list for a review article we are writing, so watch this space.)
- Second, one of the key questions to emerge as the number of risk loci for complex diseases has increased is the extent to which this proliferation of genetic signals means a proliferation in the biology so implicated. This has been encapsulated recently in discussions around the “omnigenic” model proposed by Jonathan Pritchard and colleagues. Some have chosen to take a rather nihilistic interpretation which can be summarised as follows: “What is the point of finding all of these loci and understanding the biological processes they implicate, if it turns out that there are almost as many signals as there are genes, and we will end up finding all genes are somehow involved in diabetes development?” This isn’t the place for a detailed rebuttal of that view, but suffice it to say that such a nihilistic interpretation ignores the subtlety of genetic regulation: even if the same gene is “involved” in multiple diseases, that doesn’t mean that the direction of effect is the same, that the tissue distribution of effects is the same, nor that the effects occur at the same stages in development. The evidential counterargument to the nihilistic omnigenic perspective would be to demonstrate that the variants influencing type 2 diabetes risk are not the same (in terms of identity or function) as those influencing other diseases, and also to highlight ways in which the biology implicated by the growing number of type 2 diabetes loci “converges” on a limited set of processes. In this study, we show both of these. We demonstrate how we can use the combination of this genetic fine-mapping, and patterns of enrichment across regulatory annotations in diabetes-relevant tissues (notably the pancreatic islet) to further enhance the localisation of the variants that are driving these association signals, and to derive mechanistic insights into the processes involved at a growing number of these loci.
We show (as we and others have before) that T2D-risk variants are disproportionately located in DNA sequences involved in the regulation of islet-gene transcription, once again highlighting the central role of the regulation of insulin secretion. The expectation, in the near-term, is that the integration of these genetic and genomic approaches with other strategies for characterising regulatory variant function (including high throughput empirical assays, and machine-learning approaches) will get us closer to an understanding of “regulatory grammar” that we can use to further finesse our ability to pinpoint the causal variants, especially at those loci where tight linkage disequilibrium limits the power of genetic fine-mapping.
- Third, we have deployed the latest GWAS findings to explore the extent to which a composite measure of individual genetic risk (as quantified through a polygenic risk score) can stratify type 2 diabetes risk. We generated a polygenic risk score using the non-UK Biobank data from the present study, and then used it to place the ~400K European participants within UK Biobank into one of 40 ranked bins (ordered from lowest to highest T2D polygenic risk). The prevalence of T2D in UK Biobank is around 4% across the board, but the prevalence in those 40 bins ranged from just over 1% (in the lowest PRS bin) to over 10% (in the highest). The UK Biobank sample is relatively young and disproportionately healthy, but, if those differentials are sustained at the population level (something that needs to be tested), then given UK national lifetime diabetes prevalence rates of 15%, the implication is that there are over 1M individuals in the UK with a lifetime risk of diabetes (based purely on their genetics) that exceeds 50%.
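The polygenic risk score exercise described above can be sketched in a few lines. This is a toy illustration on simulated data, not our actual pipeline: the score is taken to be a weighted sum of risk-allele dosages (weights standing in for per-variant log odds ratios), individuals are ranked into 40 equal-sized bins, and prevalence is computed per bin. All numbers here (50 variants, 4,000 individuals, the weights and the simulated disease model) are invented for illustration.

```python
import math
import random


def polygenic_score(dosages: list[int], weights: list[float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per variant)."""
    return sum(d * w for d, w in zip(dosages, weights))


def assign_bins(scores: list[float], n_bins: int = 40) -> list[int]:
    """Rank individuals by score and place them in n_bins ordered bins
    (bin 0 = lowest polygenic risk, bin n_bins - 1 = highest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(scores)
    return bins


def prevalence_by_bin(bins: list[int], is_case: list[bool], n_bins: int = 40) -> list[float]:
    """Fraction of individuals with disease in each bin."""
    cases = [0] * n_bins
    totals = [0] * n_bins
    for b, c in zip(bins, is_case):
        totals[b] += 1
        cases[b] += int(c)
    return [c / t if t else float("nan") for c, t in zip(cases, totals)]


# Simulated data only: nothing below comes from UK Biobank or DIAGRAM.
rng = random.Random(0)
n_variants, n_people = 50, 4000
weights = [rng.uniform(0.02, 0.10) for _ in range(n_variants)]
dosages = [[rng.choice((0, 1, 2)) for _ in range(n_variants)] for _ in range(n_people)]
scores = [polygenic_score(d, weights) for d in dosages]

# Assumed logistic disease model so that higher scores carry higher risk.
mean_score = sum(scores) / n_people
is_case = [
    rng.random() < 1.0 / (1.0 + math.exp(-(-3.2 + 2.0 * (s - mean_score))))
    for s in scores
]

bins = assign_bins(scores)
prevalence = prevalence_by_bin(bins, is_case)
print(f"lowest bin prevalence:  {prevalence[0]:.3f}")
print(f"highest bin prevalence: {prevalence[-1]:.3f}")
```

In the real analysis the weights are derived from data that exclude UK Biobank, so that the score is evaluated in individuals who played no part in building it.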
There are, of course, a number of questions to be addressed before one could turn this into a public policy based around targeted intervention. Do those differential risk estimates hold up at the population level? How well do these risk scores perform in non-European populations? To what extent does the genetics merely capture information that is already manifest through classical diabetes risk factors such as BMI, family history and ethnicity? What are the relative social and health benefits (and costs) associated with strategies targeted to those at highest risk as opposed to population-wide public health interventions?
Finally, we believe that it is important not just to share our findings, but to encourage others to use these data to support their own research. To this end, we are making the data we have generated available to others in a variety of ways.
To coincide with the release of the paper you will be able to:
- Access all the tables via the web, enabling simple manipulations (sorting, filtering) at mccarthy.well.ox.ac.uk
- Access the summary level data at the DIAGRAM Consortium
- Access the association data and manipulate its display in the AMP T2D Genetics portal at type2diabetesgenetics.org (see figure below).