In our recent paper (Mahajan et al, Nature Genetics 2018), we asked a simple question: how reliably does the finding that a common coding variant is associated with type 2 diabetes allow the “obvious” inference implicating that gene in disease pathogenesis?

One of the major challenges faced by human genetics is to unravel the mechanisms whereby the associations discovered through large-scale GWAS and sequencing efforts exert their impact on disease predisposition. Characterising these processes is essential if we are to turn those discoveries into biological insights and novel translational opportunities.

Given the large number of such discoveries (for many complex traits the list of associated loci reaches into the low hundreds), there is an understandable desire to latch onto what appear to be the “low-hanging fruit” when it comes to mechanistic inference. This might be an association which involves a nonsynonymous coding variant in a gene for which, on the basis of the gene’s known role, a plausible link to disease pathology can be made. Or, it might be a regulatory variant which can be linked to a downstream effector through colocalisation with a cis-eQTL detected in analysis of RNA from a disease-relevant tissue or cell-line. Many of the specific hypotheses arising from these observations will turn out to be confirmed when subsequent functional experiments are performed: but others may, over time, be revealed to have been naïve and over-simplistic.

As we gather more and more large-scale functional data, we can expect to prioritise functional efforts to loci where there is accumulation of diverse types of data pointing to the same putative mechanism. But until then, how much can we rely on individual pieces of evidence to point us in the right direction?

This was one of the main things we considered in this study. The focus of the analysis was on detecting nonsynonymous coding variants associated with type 2 diabetes (T2D) in a large multi-ethnic (though mostly European-descent) meta-analysis, involving ~450,000 individuals (20% with T2D) who had been genotyped with the exome array or from whom equivalent data could be extracted.

The motivation for this study had been the opportunity that the exome array offered to peer into the low-frequency and (to a lesser degree) rare variant space that was, until recently at least, opaque to GWAS-based approaches. (The advent of very large reference panels such as HRC means that many such variants can now be accurately imputed from GWAS data, at least in Europeans, but at the time the exome array was designed, imputation of variants below 1-2% frequency was decidedly ropey, and exome sequence data were too meagre to offer any useful power.) The concept behind the exome array was therefore to delve into this space by populating an array with as complete an inventory of low-frequency and rare coding variants as was possible given the state of exome-sequence-based discovery when it was designed, around five years ago. The focus on the exome was an obvious one given that variants disrupting coding sequence are much more amenable to biological inference and follow-up. Importantly, coding variants are also more likely to be causal (“pound for pound”) than regulatory variants.

This last point is a crucial one. Across many complex traits, one of the consistent observations is that most signals map to noncoding (presumably regulatory) sequence. But not all. Across traits, around 5-10% of the overall GWAS signal is invested in (nonsynonymous) coding variants. Given that the exome constitutes only about 1.5% of the genome, and that not all coding variants are nonsynonymous, that translates to an enrichment of about 5-10-fold. In other words, if you find a common coding variant (by “coding” in this post, we mean nonsynonymous coding) amongst the set of variants that are contributing to one of these associations (and characteristically these associations involve a set of highly-correlated variants that are all in the mix as potential causal candidates), then, all else being equal, that coding variant has an increased chance of being the one responsible for the association signal. In fact, that often leads to the tacit assumption that the coding variant IS the one responsible for the association signal. And since it’s coding, the corollary is that the gene within which it resides is causally implicated in disease risk. Some have used the analogy that these coding variant signals serve as “smoking guns”, providing the vital evidence with which to assign “guilt” at an associated locus when there appear to be many possible perpetrators (in the form of variants contributing to the association signals, and/or genes nearby through which they might be acting).
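The enrichment arithmetic can be sketched in a few lines of Python. The 5-10% signal share and the 1.5% exome share come from the text above; the fraction of coding variation that is nonsynonymous (`NONSYN_FRACTION`) is an illustrative assumption, not a figure from the paper, and the function name is ours.

```python
def fold_enrichment(signal_share: float, sequence_share: float) -> float:
    """Share of GWAS signal carried by a variant class, divided by the
    share of sequence that class occupies."""
    if not 0.0 < sequence_share <= 1.0:
        raise ValueError("sequence_share must be in (0, 1]")
    return signal_share / sequence_share


EXOME_SHARE = 0.015     # from the text: the exome is about 1.5% of the genome
NONSYN_FRACTION = 0.7   # ASSUMPTION for illustration only, not from the paper

# Share of sequence in which a nonsynonymous change is possible (rough sketch)
nonsyn_share = EXOME_SHARE * NONSYN_FRACTION

for signal_share in (0.05, 0.10):  # from the text: 5-10% of GWAS signal
    fold = fold_enrichment(signal_share, nonsyn_share)
    print(f"signal share {signal_share:.0%}: about {fold:.1f}-fold enrichment")
```

Under that assumed fraction, the two ends of the range land close to the 5-10-fold figure quoted above; a different assumed fraction would shift the numbers somewhat.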

But that may be too simplistic. What if that coding variant sits on a long haplotype of highly correlated variants? That haplotype might carry tens, even hundreds, of other variants, all of which have similar evidence of association. Those variants will of course be non-coding, and often of uncertain functional impact: it is likely that few, if any, of them will have the functional “charisma” of the coding variant. But because there are so many of them, and because the boost for being a coding variant (that is, the approximately 5-10-fold enrichment in the chance of being causal) is not so massive, it may be MORE likely that one of the non-coding variants is driving the association.

An analogy may help. There are 32 teams entering the World Cup this summer. Germany are currently favourites (at odds of about 5/1). If those odds are correct, then it is clear that both of the following statements are true: (a) of all teams, Germany has the highest probability of winning; but (b) it’s more likely than not that one of the less-favoured teams will lift the trophy. Being the favourite does not necessarily equate to a high chance of winning.
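For anyone who wants to check the arithmetic, here is a minimal sketch that converts fractional odds into an implied probability. It ignores the bookmaker's margin, and the function name is ours.

```python
from fractions import Fraction


def implied_probability(against: int, in_favour: int = 1) -> Fraction:
    """Convert fractional odds 'against/in_favour' (e.g. 5/1) into the
    implied probability of winning, ignoring any bookmaker margin."""
    return Fraction(in_favour, against + in_favour)


p_favourite = implied_probability(5, 1)  # Germany at 5/1
p_field = 1 - p_favourite                # any of the other 31 teams

print(f"favourite: {p_favourite}, rest of the field: {p_field}")
```

Odds of 5/1 imply a win probability of 1/6, so the rest of the field collectively holds 5/6.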

In the same vein, when we have a coding variant sitting amongst a large set of equivalently-associated variants, it can be simultaneously true that: (a) of all the variants, the coding variant is the most likely to be causal; but (b) it’s more likely than not that it’s one of the non-coding variants on the haplotype that is actually responsible. And at loci where the prediction under (b) turns out to be the true state of nature, then any inferences about biology based around the coding variant association will turn out to be false and misleading (and potentially rather wasteful of follow-up effort).
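The same logic can be put into a toy calculation. The sketch below assumes exactly one causal variant per signal, identical association evidence for every variant on the haplotype, and a prior for the coding variant that is `enrichment` times that of each non-coding variant. These are simplifications for illustration, not the model used in the paper.

```python
def coding_variant_posterior(n_noncoding: int, enrichment: float = 10.0) -> float:
    """Probability that the single coding variant is causal when it shares
    a haplotype with `n_noncoding` equally associated non-coding variants.

    Assumes one causal variant, equal genetic evidence for all variants,
    and a prior weight of `enrichment` for the coding variant versus 1 for
    each non-coding variant.
    """
    if n_noncoding < 0 or enrichment <= 0:
        raise ValueError("n_noncoding must be >= 0 and enrichment > 0")
    return enrichment / (enrichment + n_noncoding)


for n in (5, 10, 50, 200):
    p_coding = coding_variant_posterior(n)
    p_each_noncoding = (1.0 - p_coding) / n
    print(f"{n:>3} non-coding variants: coding {p_coding:.3f}, "
          f"each non-coding {p_each_noncoding:.4f}")
```

With a 10-fold boost, the coding variant is always the single most likely candidate, yet once it has more than ten equally associated non-coding companions, the non-coding variants collectively become the better bet.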

In analysis of the 450,000 subjects typed for around 250,000 variants on the exome array (most of them coding, though the array does also include a proportion of non-coding variants that come in useful, such as known GWAS signals), we found 69 coding variant associations (representing 40 distinct signals at 38 genes) that reached the threshold for study-wide significance. Around 60% of those mapped to known GWAS loci, and 90% of those variants were common. (So despite our efforts to do a better job of scanning low-frequency variants, we picked up relatively few signals in that space: this speaks to the dominant role played by common variants in common disease predisposition, but that’s a story for another day).

Because the exome array provides only a very patchy view of the variation in the regulatory “sea” that sits between the coding “islands”, we had to turn to GWAS data to fill in the gaps, and to place those coding variant associations in their wider context. The fact that so many of our associations were common made this possible in a way that would not have worked if the coding variant associations had been rarer. We interrogated a set of around 500,000 European individuals with GWAS data, in whom we performed fine-mapping for 37 of the 40 coding variant signals (there were three where this wasn’t feasible for a variety of technical reasons). Fine-mapping seeks to break down the tight correlations between associated variants and home in on those that are most likely to be driving the association.

How would the coding variants fare in such an analysis? Would they rise to the top and be highlighted as the causal variants, or would they be supplanted by other (non-coding) variants on the same haplotype?

In the first analysis, we afforded the coding variants no special status. We used genetic evidence alone to try and tease apart causal variants from “hitch-hikers”. At only two loci (out of the 37) did the coding variants emerge with strong credentials as the causal variants for their respective association signals. But we felt this more than anything reflected strong correlations amongst common variant signals and the limited power (even in 500,000 subjects) to discriminate between them on genetic evidence alone.
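To make “genetic evidence alone” concrete, here is a minimal sketch of one common way to do this kind of fine-mapping: give every variant the same prior, turn per-variant log Bayes factors into posterior probabilities, and collect the smallest set of variants whose posteriors sum to a chosen coverage (a credible set). The log Bayes factors below are invented for illustration, and the post does not describe the exact procedure used in the paper, so treat this as a generic outline rather than our pipeline.

```python
import math


def posterior_probabilities(log_bfs: list[float]) -> list[float]:
    """Posterior probability of being causal for each variant, assuming a
    single causal variant and an equal prior for every variant."""
    m = max(log_bfs)                          # subtract max for numerical stability
    weights = [math.exp(lb - m) for lb in log_bfs]
    total = sum(weights)
    return [w / total for w in weights]


def credible_set(ppa: list[float], coverage: float = 0.99) -> list[int]:
    """Indices of the smallest set of variants whose posteriors reach `coverage`."""
    order = sorted(range(len(ppa)), key=lambda i: ppa[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += ppa[i]
        if cumulative >= coverage:
            break
    return chosen


# Invented example: variant 0 is the coding variant; the first four variants
# are in tight correlation and have identical evidence.
log_bfs = [12.0, 12.0, 12.0, 12.0, 11.0]
ppa = posterior_probabilities(log_bfs)
print([round(p, 3) for p in ppa])
print(credible_set(ppa, coverage=0.99))
```

When several variants carry identical evidence, as in this toy example, none of them can emerge with a high posterior on genetic data alone, which is the situation we faced at most loci.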

So, in the next round of analysis, we attempted to bring additional data to the table. In particular, we were keen to afford coding variants the boost that their enrichment across GWAS signals deserves (that is to recognise the fact that “pound for pound”, the average coding variant is more likely to be involved in a GWAS signal than the equivalent noncoding variant).

We used data from a large study by colleagues at deCODE, which had analysed GWAS for over 200 traits and estimated genome-wide enrichments not just for coding versus non-coding variants, but also for some of the finer categories within each (e.g. coding variants with different levels of predicted impact on protein function). We could use these estimates to up- or down-weight each of the variants on the risk haplotypes, and get a more detailed view of which variants were most likely to be driving the associations, one that was informed both by the genetic fine-mapping results and by these assessments of genome-wide functional enrichment.
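The reweighting step can be sketched as follows: each variant’s Bayes factor is multiplied by a prior weight for its annotation class before the posteriors are normalised. The class names and weights in `ILLUSTRATIVE_WEIGHTS` are placeholders chosen for this example; they are not the published enrichment estimates, and the single-causal-variant assumption is a simplification.

```python
import math

# PLACEHOLDER weights for illustration only; not the published estimates.
ILLUSTRATIVE_WEIGHTS = {"missense": 10.0, "regulatory": 2.0, "other": 1.0}


def weighted_posteriors(
    log_bfs: list[float],
    annotations: list[str],
    weights: dict[str, float],
) -> list[float]:
    """Posterior probability for each variant when the prior is proportional
    to the enrichment weight of its annotation class (one causal variant)."""
    log_w = [lb + math.log(weights[a]) for lb, a in zip(log_bfs, annotations)]
    m = max(log_w)                            # subtract max for numerical stability
    unnorm = [math.exp(x - m) for x in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# One missense variant on a haplotype with 20 equally associated,
# unannotated non-coding variants (invented numbers).
log_bfs = [12.0] * 21
annotations = ["missense"] + ["other"] * 20

flat = weighted_posteriors(log_bfs, annotations, {"missense": 1.0, "other": 1.0})
boosted = weighted_posteriors(log_bfs, annotations, ILLUSTRATIVE_WEIGHTS)
print(f"coding variant, flat prior:    {flat[0]:.3f}")
print(f"coding variant, boosted prior: {boosted[0]:.3f}")
```

In this toy case the boost lifts the coding variant from 1 in 21 to 1 in 3: a substantial gain, but still short of making it the probable culprit.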

This analysis allowed us to divide the 37 loci up into three more or less equal classes.

First, there were 16 “top dogs”. At these signals it was pretty clear that the coding variants were indeed causal. Once given the boost their coding status “deserved”, these coding variants rose to the top of the pile. Fortunately, this group included many of those loci (such as SLC30A8, GCKR and KCNJ11) where the same coding variant associations had already been found by GWAS, and where empirical studies (along with clues from additional rare, high impact coding alleles in those same genes) had clearly established the role of those variants (and those genes) with respect to T2D pathogenesis.

Second, there were the 13 “red herrings”. At these loci, we could be confident that the coding variants were not causal. Even when coding variants in these regions were given a boost in recognition of their “status”, we could clearly see that other (non-coding) variants nearby were much more likely to be causal. This is the class of loci where it might have been all too easy to fall victim to the simplistic assumption that the coding variant and the gene it sits within were causal, and to have embarked on expensive and time-consuming (and ultimately frustrating) experiments to characterise the mechanisms responsible. It’s worth noting that any data demonstrating that the specific coding variant has a material impact on the function of the encoded protein (which might have been cited in support of such functional work) are pretty much irrelevant to the story: the coding variants in this group are being excluded from a causal role largely on genetic grounds, in analyses that already assume the variant has the functional impact expected of a generic missense allele. In general, such functional studies carry little weight unless they can connect disruption of that gene and its protein directly to the disease trait of interest (rather than to some indirect molecular phenotype of unclear pathogenetic relevance).

That leaves the third group, the remaining eight signals (“ugly ducklings”?). These fall between the two in that the data neither compellingly support nor convincingly refute a contribution of the coding variants. A good example is at the PPARG gene. There’s not much doubt (given other data) that this is the causal gene here: PPARG encodes the target of thiazolidinedione drugs, hosts rare large-effect alleles that lead to a syndrome of lipodystrophy and diabetes, and the phenotype of the T2D GWAS signal is marked by insulin resistance (which fits with the rare variant phenotype). The assumption that the Pro12Ala allele is the driver of the common variant T2D association dates back to pre-GWAS days, even though it has been rather difficult to demonstrate that this variant influences PPARG function. Our analysis still gives Pro12Ala top billing in the coding-variant boosted analysis, but it also indicates that most of the evidence for causation sits in the regulatory variants around Pro12Ala. One explanation is that both coding and regulatory variants are in play at this signal (and perhaps at others in this group).
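One way to picture the three-way split is as a pair of cut-offs applied to the share of posterior probability that sits on coding variants at each signal. The cut-offs in this sketch (`upper=0.8`, `lower=0.2`) are illustrative placeholders and are not taken from this post; the function name and labels are likewise just for the example.

```python
def classify_signal(coding_ppa: float, upper: float = 0.8, lower: float = 0.2) -> str:
    """Assign a signal to one of three groups from the summed posterior
    probability carried by its coding variant(s).

    The default cut-offs are ILLUSTRATIVE placeholders only.
    """
    if not 0.0 <= coding_ppa <= 1.0:
        raise ValueError("coding_ppa must lie between 0 and 1")
    if coding_ppa > upper:
        return "top dog"        # coding variant very likely causal
    if coding_ppa < lower:
        return "red herring"    # coding variant very likely not causal
    return "ugly duckling"      # evidence neither supports nor refutes


# Invented posterior values, for illustration
for ppa in (0.95, 0.45, 0.03):
    print(f"coding posterior {ppa:.2f}: {classify_signal(ppa)}")
```

Any such cut-offs are somewhat arbitrary, which is one reason the middle group deserves cautious interpretation.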

However, it is important not to push those inferences too far. Bear in mind that the weights that we are bringing into these analyses (based on the deCODE data) remain “crude” and generic. They generate estimates of genome-wide enrichment for whole classes of variants (e.g. low impact missense coding variants). There may be millions of such variants in a class. We then apply those generic estimates to individual loci and use them to paint each of the variants at that locus in the appropriate “colour”. This clearly has limitations given that the classes are broad, and those enrichments represent averages across many variants whose true functional impacts will vary widely.

Imagine if, for example, one of the regulatory variants on the T2D risk haplotype at PPARG had a particularly strong impact on a key enhancer critical for PPARG expression. At the moment, the boost afforded to that variant (by virtue of being a non-coding variant within (any) DNase hypersensitivity site) would not do justice to that variant’s potential to be causal at this locus. At the same time, the generic boost offered to the Pro12Ala allele (by virtue of being a coding variant) might, given the limited evidence that it actually does anything directly to PPARG function, overestimate its potential to underlie the association. This would tilt the balance in favour of the non-coding variant and away from the equipoise between the two that arises from use of the generic weights. (The analogy would be the need to re-estimate World Cup odds in light of the knowledge that Germany’s star players are secretly nursing serious injuries, and that Brazil has a number of key players hitting peak form.)

The size of the haplotype is another factor to consider. It is clearly more difficult for a coding variant to “shine through” when it sits on a very large haplotype, harbouring many hundreds of “competing” non-coding variants, than it does when it sits on a much shorter haplotype.

This means that, whilst the overall picture seems clear, the inferences at individual loci, especially those in the third (ugly duckling) group, may well need to be updated as more detailed information becomes available. This includes more granular estimates of the enrichment properties of variant sub-classes. For regulatory variants, it almost certainly means more sophisticated tissue-specific annotations (with the tissue of interest matching the trait and locus being examined – e.g. adipose in the case of the T2D signal near PPARG).

More broadly, these data should make us a little more nervous of jumping to conclusions at GWAS loci based on single quanta of what might be termed “functionally-inferential” evidence, be that the presence of a coding variant in a credible set, an enticing cis-eQTL that links a regulatory variant to one of the nearby genes, or data from a mouse knockout experiment that recapitulates aspects of the phenotype.

Often, those data will provide vital clues that help to solve the mystery, at least in terms of defining the effector transcript through which a GWAS signal is mediated. They certainly provide a starting point for the functional experiments that should follow. But starting points are what they should be, not destinations in themselves. As more and more functional data accumulate (including knockout phenotypes for all genes in multiple animal models, and pooled CRISPR screens for relevant phenotypes in key cellular models) it will become increasingly possible to construct mechanistic narratives that are already replete with multiple lines of pre-existing evidence. And it will become easier to appreciate exactly how much weight we should place on any one piece of evidence (that is, to understand the sensitivity and specificity of each type of clue).

As we conclude in our paper: “The term “smoking gun” has often been used to describe the potential of functional coding variants to provide causal inference with respect to pathogenetic mechanisms.

This study provides a timely reminder that, even when a suspect with a smoking gun is found at the scene of a crime, it should not be assumed that they fired the fatal bullet.”

One last word. Studies like this are only possible because many hundreds of researchers from scores of labs in dozens of countries agreed to share their data, and because hundreds of thousands of participants signed up to those studies in the first place. It’s a privilege to be the ones able to bring these data together and tell this particular story, but we want to recognise the contributions of all those who have played a part.

Mark McCarthy, Anubha Mahajan

Anubha is the Senior Lead for genetic discovery efforts in T2D and related conditions working for the McCarthy Group at Wellcome Centre for Human Genetics, University of Oxford.

Mark is the Robert Turner Professor of Diabetic Medicine and a Group Head / PI at the Wellcome Trust Centre for Human Genetics, as well as a Grant Holding Senior Scientist and Consultant Physician.