By Mark McCarthy & Rachel Freathy.
Today, we published a paper in Nature in which we explore the genetic contribution to variation in birth weight. We describe how this information can help us to tease apart the contributions made by nature (i.e. inherited genetic variation) and nurture (i.e. the sum of environmental exposures) to the observed relationships between growth in early life and the predisposition to diseases such as diabetes and hypertension many decades later. A couple of blogs describing our main findings are available here and here.
At the heart of the study was a genome-wide association study (GWAS) of birth weight involving over 150,000 people for whom we had information on birth weight and on genetic variations throughout the genome (genome-wide genotypes). This wasn’t the first such GWAS effort for birth weight, but it was by far the largest. The EGG consortium, which has been leading these analyses for the past six years or so, had published previous GWAS in 2010 and 2013. These studies had involved ~11,000 and ~27,000 individuals in their GWAS discovery stages, and identified two and seven birth weight association signals, respectively.
Since the 2013 paper, the EGG consortium effort had been steadily growing as more and more data sets with birth weight phenotypes were genotyped, and as the researchers responsible for those data sets generously agreed to share them. The motivation here, of course, is that larger sample sizes typically bring greater power to detect additional signals of lesser effect.
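To give a rough feel for why sample size matters so much, here is a minimal sketch of an approximate power calculation for a single variant. It assumes a simple normal approximation in which the test statistic is shifted by the square root of (sample size × fraction of trait variance explained), and it uses the conventional genome-wide significance threshold of 5e-8. The function name `gwas_power` and the variance-explained value of 0.03% are illustrative assumptions, not figures from our paper; only the three sample sizes come from the studies described here.

```python
from statistics import NormalDist


def gwas_power(n, variance_explained, alpha=5e-8):
    """Approximate power of a two-sided test for one variant.

    Normal approximation: the test statistic has unit variance and a
    mean of sqrt(n * variance_explained), which is reasonable when the
    variant explains only a small fraction of trait variance.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = (n * variance_explained) ** 0.5
    # Probability of crossing the threshold in either direction.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Sample sizes roughly matching the 2013 GWAS, the expanded EGG
# meta-analysis, and the combined EGG + UK Biobank analysis.
# The variance explained (0.0003) is a hypothetical value.
for n in (27_000, 70_000, 150_000):
    print(n, round(gwas_power(n, variance_explained=0.0003), 3))
```

Under these assumptions, a signal that is almost undetectable at the 2013 sample size becomes likely to be found at 150,000, which is the general pattern that motivates ever larger meta-analyses.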
By the time we were ready to kick off the meta-analysis that, at the end of the day, provided around half the data contributing to the current (2016) paper, we had gathered GWAS and birth weight data on over 70,000 individuals. For the most part, these were birth cohorts, bringing the expectation that the birth weight phenotypes to which we had access would be of “high” quality. The birth measurements had been recorded contemporaneously by medical personnel, and additional data were available that allowed us to exclude potential outliers (e.g. twin and/or preterm births) and to adjust for important factors that influence birth weight (most obviously, gestational age). The increase in GWAS sample size (from ~27,000 to ~70,000) had the desired effect, growing the number of genome-wide significant loci for birth weight from the 7 reported in 2013 to around 20.
It was while we were compiling this “traditional” GWAS meta-analysis that the first tranche of GWAS data from UK Biobank was scheduled for release. We knew that around 50% of UK Biobank participants had provided self-reported birth weight data as part of the medical survey conducted at recruitment. Several of the investigators in EGG already had approvals in place to examine birth weight (and related metabolic phenotypes). It seemed natural to consider whether we could include these data in the meta-analysis. With that first tranche of GWAS data including nearly 150,000 individuals (half of them with birth weight), might we be able to double the size of our meta-analysis in one fell stroke?
UK Biobank, for those who aren’t familiar, enrolled, between 2006 and 2010, around 500,000 subjects from the UK, aged 40 to 69 years, to participate in a study of the contributions of genes and environment to human disease. All participants took part in a detailed clinical examination, answered a series of computer-based surveys about diverse aspects of lifestyle and health, and donated biosamples (blood, urine). All agreed to have their biobank data linked to evolving health information collected from hospital episode statistics, registry information and other sources. Selected groups of participants were targeted for repeat visits, and other subgroups for more intensive phenotypic analysis including measurements of physical activity and imaging (MRI). All 500,000 have genome-wide genotype data available. A bespoke genotyping array was used for the genotyping. It was designed to optimise genome-wide coverage of variation in UK populations (through imputation), whilst also being enriched for putatively functional genetic variants such as those in the coding regions of the DNA. (These data are being released in two tranches: the 150,000 we used here, and another 350,000 due to be released in the next few months). The scale, scope and diversity of the data within UK Biobank are remarkable in terms of the types of questions that can be addressed, many of them for the first time. The accessibility of the resource to the global research community has resulted in a data set that has rapidly become transformative for many research groups.
Inevitably, however, not all information is recorded with equal precision in UK Biobank, and birth weight might have been a case in point. The birth weight measures were based entirely on self-report (presumably recalled after an interval of several decades), raising concerns about precision. Crucially, gestational age was not available for adjustment. Worryingly, the distribution of raw birth weight values was decidedly uneven (see figure), presumably reflecting digit preference in recollected birth weight. In other words, many people would have recalled their birth weight as a whole number of imperial units (e.g. “around 7 lbs”). Information on birth weight was only present in around half of participants (and only 8% of those eligible had agreed to participate in UK Biobank in the first place), raising questions about representation. The contrast with the meticulously collected birth weight phenotype data available from the other EGG cohorts was marked.
Given these concerns, we were uncertain how to proceed. Would the increased sample size afforded by bringing UK Biobank and EGG together really improve our prospects for detecting real genetic associations? Or would combining the two types of data be detrimental to our power, with the less precise UK Biobank data diluting out true signals emerging from EGG?
There were some reasons for believing that we could rely on the UK Biobank data. Using the full UK Biobank data set, one of us (RF), working with colleagues in Exeter, had shown that the UK Biobank birth weight data had “face validity” based on the observed relationships to exposures known, from other epidemiological studies, to influence birth weight. So, UK Biobank birth weight measures showed expected variation with regard to the gender and ethnicity of the baby. Babies born to mothers who smoked during pregnancy were smaller than those born to mothers who did not. Of particular interest to us, UK Biobank participants who reported a paternal history of diabetes had birth weights below the average for the entire data set, whilst those with a maternal history tended to be on the large side. All of these observations encouraged us to have confidence in the robustness of the UK Biobank data.
To decide the best approach for the integration of the EGG and UK Biobank results, we set ourselves a test (and, crucially, agreed to be bound by the outcome). We considered the seven regions of the genome we had first reported to be associated with birth weight in the 2013 paper and compared the association effect sizes observed in the UK Biobank data set with those reported in the earlier paper. Bearing in mind that the latter would have been somewhat inflated by the winner’s curse, we agreed that, provided the effect sizes in UK Biobank exceeded (on average) 70% of those seen in the better-curated EGG birth cohorts, we would be safe to proceed to a full meta-analysis. (The alternative would have been to do some variation on the mutual “top-hit look-up” strategy.) We were reassured to see directionally consistent replication of all seven signals in UK Biobank, with effect size estimates around 75% of those seen in the birth cohorts.
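For readers who like to see the logic spelled out, the test can be sketched roughly as follows. This is one plausible reading of “on average” (the mean of the per-locus ratios), not a record of the exact calculation we ran. The effect sizes below are made-up numbers chosen only to illustrate the arithmetic, and the function name `replication_summary` is hypothetical.

```python
from statistics import mean

# Hypothetical per-locus effect sizes for seven loci, expressed in the
# same units and aligned to the same effect allele in both data sets.
# These are NOT the values from the paper.
egg_betas = [0.072, -0.058, 0.051, 0.047, -0.044, 0.040, 0.036]
ukb_betas = [0.055, -0.041, 0.040, 0.036, -0.031, 0.031, 0.027]


def replication_summary(discovery, replication, threshold=0.70):
    """Compare replication effect sizes against discovery effect sizes."""
    ratios = [rep / disc for disc, rep in zip(discovery, replication)]
    # A positive ratio means both estimates point in the same direction.
    same_direction = all(ratio > 0 for ratio in ratios)
    mean_ratio = mean(ratios)
    return {
        "ratios": ratios,
        "same_direction": same_direction,
        "mean_ratio": mean_ratio,
        "proceed": same_direction and mean_ratio > threshold,
    }


result = replication_summary(egg_betas, ukb_betas)
print(f"mean ratio: {result['mean_ratio']:.2f}, proceed: {result['proceed']}")
```

The point of fixing the threshold in advance is that the decision to pool the data cannot then be swayed by how attractive the pooled results turn out to be.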
On the basis of these findings, we proceeded to a full meta-analysis of the 150,000 samples. We quickly felt vindicated. The number of genome-wide significant loci jumped from 20 to 60. We found no evidence of heterogeneity of effect size for those loci between the European EGG cohorts and UK Biobank.
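As an illustration of what combining two sources and checking for heterogeneity can look like for a single locus, here is a minimal sketch using fixed-effect, inverse-variance weighting together with Cochran's Q. This is a standard textbook approach and is shown here as an assumption for illustration; it is not a description of the exact pipeline used in the paper. The function name `combine_two` and the example numbers are hypothetical.

```python
from statistics import NormalDist


def combine_two(beta_a, se_a, beta_b, se_b):
    """Fixed-effect inverse-variance meta-analysis of two estimates.

    Returns the pooled effect, its standard error, Cochran's Q and the
    heterogeneity p-value (1 degree of freedom for two studies).
    """
    w_a, w_b = 1 / se_a**2, 1 / se_b**2
    beta = (w_a * beta_a + w_b * beta_b) / (w_a + w_b)
    se = (w_a + w_b) ** -0.5
    q = w_a * (beta_a - beta) ** 2 + w_b * (beta_b - beta) ** 2
    # With 1 degree of freedom, Q is distributed as the square of a
    # standard normal, so the p-value comes from the normal tail.
    p_het = 2 * (1 - NormalDist().cdf(q**0.5))
    return beta, se, q, p_het


# Hypothetical effect estimates for one locus in two data sets.
beta, se, q, p_het = combine_two(0.050, 0.010, 0.040, 0.009)
print(f"pooled beta={beta:.4f}, se={se:.4f}, Q={q:.2f}, p_het={p_het:.2f}")
```

The pooled standard error is always smaller than either input, which is where the gain in power comes from. A small heterogeneity p-value would be a warning that the two sources disagree about the size of the effect.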
So how is it that so much information could be extracted from what appeared such a messy and imprecise phenotype? There are some lessons here that speak to the intrinsic value of UK Biobank as well as to some of the potential limitations of GWA meta-analysis approaches that build signal from multiple smaller cohorts. These lessons extend well beyond measures of birth weight.
Essentially, UK Biobank offers a “strength in numbers” that can compensate for what may appear quite marked phenotypic imprecision. Its generic benefits include:
- A (largely) representative population sample of 500,000;
- Unified phenotype (participant characteristic) collection with harmonized protocols and careful study-wide QC;
- Unified genotyping with a single efficient bespoke GWAS array, followed by a single imputation run with the same reference samples, and the capacity for a single, centralised analysis;
- A wealth of phenotypes that can be used for adjustment, inference and exploration.
These contrast with the heterogeneity of phenotypes, covariate adjustments, arrays, populations, imputation reference panels and analysis methods (and analysts!) that are a feature of many GWAS meta-analyses. As those GWAS meta-analyses get larger and more cumbersome (some involve over 100 participating studies), it is inevitable that, despite the best efforts of everyone involved, heterogeneity and errors creep in that can lead to some attenuation of the true association signal.
That is not to argue against the value of the GWAS meta-analysis approach. Rather it is to make the rather obvious point that, within reason, the more data the better. Our experience with birth weight demonstrates that the judicious combination of both kinds of data can prove hugely rewarding in terms of our ability both to discover genetic loci, and to characterise their impact.
There’s little wonder that UK Biobank has established itself so rapidly as a foundational resource for medical research, and that it has spurred the development of analogous data sets in many other countries. With GWAS data on the remaining 350,000 participants due in the coming months, along with a swath of biochemical measurements, and with the prospect of ever deeper genetic and genomic characterisation, the impact of this study is set to grow.