Informatics

Presented today at ASHG 2018 in San Diego, Chipendium is a web application for determining the identity of your microarray platform from a list of genotype variant ids.

It is common for genotyping datasets to lose record of the exact chip platform used, especially where there are numerous versions of each chip. Chipendium will detect the chip platform of your data. You can then strand align your dataset using Will Rayner’s strand website.

Upload your Illumina chip metadata and Chipendium will interrogate our catalogue to suggest the most likely chip platforms. Our catalogue currently contains over 166 chips (as of Oct 2018) and is updated regularly.
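As a rough illustration of the idea only (this is not the STAASIS logic, which is not described here), a catalogue lookup could score each chip by the fraction of uploaded variant ids it contains. The function name, the scoring rule, and the toy chip names and ids below are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def rank_chips(
    query_ids: Iterable[str],
    catalogue: Dict[str, Set[str]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rank catalogue chips by the fraction of query ids each chip contains."""
    query = set(query_ids)
    if not query:
        return []
    scores = [
        (chip, len(query & chip_ids) / len(query))
        for chip, chip_ids in catalogue.items()
    ]
    # Highest score first; ties are broken alphabetically for stable output.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top_n]


# Toy catalogue with made-up chip names and variant ids.
toy_catalogue = {
    "ChipA_v1": {"rs1", "rs2", "rs3", "rs4"},
    "ChipA_v2": {"rs1", "rs2", "rs3", "rs4", "rs5"},
    "ChipB": {"rs9", "rs10"},
}
ranking = rank_chips(["rs1", "rs2", "rs5"], toy_catalogue)
```

Here the two versions of the same hypothetical chip are separated only by the one id that the newer version adds, which is the situation described above where chip versions are easily confused.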

Implementation

Chipendium is a collaboration between two McCarthy group members. The server-side logic (STAASIS) was developed by Will Rayner in Perl as part of his work on strand alignment. The middleware and webapp were written by Neil Robertson in Java/JavaScript/HTML.

Find out more about Chipendium here.

Toppar, by Dr Thorhildur Juliusdottir, is a database-driven interactive association result browser written in JavaScript and HTML. It uses the Flot library for plotting, zooming and panning and DataTables for listing the data.

Toppar has now been published in Bioinformatics; the paper is available online at https://doi.org/10.1093/bioinformatics/btx840

Data integration and visualization help geneticists make sense of large amounts of data. To help facilitate interpretation of genetic association data we developed Toppar, a customizable visualization tool that stores results from association studies and enables browsing over multiple results, by combining features from existing tools and linking to appropriate external databases.

The McCarthy Group deployment of the Toppar web application can be found here.

We have developed ScatterShot, a web application for generating cluster plot images for chip genotyping experiments. This initial deployment has been customised for the UK Biobank dataset.

Our aim has been to produce a centralised web application so that the end user can simply access cluster plot images without the nuisance of accessing, downloading, storing, handling and configuring the underlying data itself, which in the case of UK Biobank is significant!

Find out more about ScatterShot for UK Biobank here

I am a software developer and data analyst for the McCarthy Group at the Wellcome Centre for Human Genetics and OCDEM.

My personal webpage at Wellcome can be found here

To impute and meta-analyse multiple data sets it is essential that the data are aligned to a common reference, almost always the forward strand of the current human genome build. To do this and/or to move data between different builds of the human genome I have created a set of files (here) for the most common Illumina and Affymetrix genotyping chips.

For the Illumina chips listed the zip file contains three files:

  • .strand file
  • .miss file
  • .multiple file

The strand file contains six columns:

  • SNP id
  • Chromosome
  • Position
  • Percentage match to genome
  • Strand
  • Alleles
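A minimal sketch of reading such a file, following the six columns in the order listed above. It assumes the columns are whitespace-delimited, that the strand is written as "+" or "-", and that the alleles are a short string such as "AG"; none of these details is stated here, and the example rows are invented.

```python
from typing import Dict, Iterable, NamedTuple


class StrandRecord(NamedTuple):
    chromosome: str
    position: int
    percent_match: float
    strand: str
    alleles: str


def parse_strand_lines(lines: Iterable[str]) -> Dict[str, StrandRecord]:
    """Parse .strand rows into a dictionary keyed by SNP id."""
    records = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed rows
        snp_id, chrom, pos, pct, strand, alleles = fields
        records[snp_id] = StrandRecord(chrom, int(pos), float(pct), strand, alleles)
    return records


# Invented example rows, not taken from a real strand file.
example_rows = [
    "rs0000001\t1\t12345\t100\t+\tAG",
    "rs0000002\t2\t67890\t98.5\t-\tAC",
]
strand_records = parse_strand_lines(example_rows)
```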

The SNP ids used are those from the annotation file and so are not necessarily the latest from dbSNP.

The alleles listed are the Illumina TOP alleles. If you are in any doubt whether your data file can be used with these strand files, checking the alleles of the non A/T, G/C SNPs against the strand file should confirm this for you. If there are differences, it is likely that your genotype file was created using a different set of alleles. In that case, if you can provide me with a list of the SNP ids and their alleles on the chip (a plink bim file is best), it is likely I can create a strand file for you.
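The check described above can be sketched as follows. A/T and G/C SNPs are skipped because their alleles read the same on either strand and so cannot reveal a mismatch. The function names and the input layout (dictionaries of SNP id to an allele pair) are assumptions for the example.

```python
from typing import Dict, List, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(a1: str, a2: str) -> bool:
    """True for A/T and G/C SNPs, whose alleles are each other's complement."""
    return COMPLEMENT.get(a1) == a2


def mismatching_snps(
    bim_alleles: Dict[str, Tuple[str, str]],
    strand_alleles: Dict[str, Tuple[str, str]],
) -> List[str]:
    """Return ids of non A/T, G/C SNPs whose allele pairs differ between sources."""
    mismatches = []
    for snp_id, (a1, a2) in bim_alleles.items():
        if snp_id not in strand_alleles or is_palindromic(a1, a2):
            continue
        if {a1, a2} != set(strand_alleles[snp_id]):
            mismatches.append(snp_id)
    return sorted(mismatches)


# Invented alleles: rs3 is reported on the opposite strand in the bim file.
bim = {"rs1": ("A", "G"), "rs2": ("A", "T"), "rs3": ("T", "C")}
strand = {"rs1": ("A", "G"), "rs2": ("A", "T"), "rs3": ("A", "G")}
differences = mismatching_snps(bim, strand)
```

An empty result suggests the data file uses the same alleles as the strand file; a long list suggests it was created with a different set of alleles.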

The .miss file gives the ids of the SNPs that did not reach the required threshold for mapping to the genome; the position and strand of the best match are given.

The .multiple file contains SNPs that had more than one high quality match to the genome (default >=90%); the number in the second column shows exactly how many matches there were. In these instances the highest percentage match is taken for the .strand file. If a second number is present, it shows the total number of matches that had identical >90% match percentages; in these cases the match used in the file is chosen at random and may not be the one expected.

As mentioned, the Illumina strand files by default assume that your genotype calling algorithm has standardised the genotype allele calls to the Illumina TOP strand; however, this is not always the case. In these instances the most likely alternative is that the alleles are derived from the Source Sequence in the annotation file. I have labelled these files, where they exist, as SourceStrand and will be adding more as time and demand allow.

The Affymetrix strand files are fewer in number as this information is provided by default by Affymetrix and is also present for multiple genome builds. Where there are files I have used the SNP_A identifiers as these are invariant between the different genome builds.

The poster presentation for this work from ASHG 2011 is here

At the current time one of the standard file formats for chip genotype data is a ped/map or plink binary ped (bed/bim/fam) format. This is normally converted to VCF format for imputation and further analysis.

An issue with this conversion is that Plink currently sets allele 1 as the minor allele and allele 2 as the major allele. Whilst this is somewhat close to the reference allele definition (in that the reference is usually the common allele), it is not always the case.

Therefore these files are designed to be used in plink with the --reference-allele (or --a1-allele, --a2-allele) command, allowing allele 1 or 2 to be set as the reference to ensure it is correctly assigned in any resulting VCF conversion.

NOTE: at the present time the files are not 100% correct for indels. Whilst the allele assignment is correct, the allele listed may be truncated and this may cause issues with plink.
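To illustrate why the major allele cannot simply be taken as the reference, the sketch below lists the SNPs where the reference allele is actually plink's allele 1 (the minor allele). It assumes that allele 2 would otherwise end up as the reference in the VCF; the function names and example alleles are invented.

```python
from typing import Dict, List, Tuple


def reference_is_allele_1(a1: str, a2: str, ref: str) -> bool:
    """True when the reference allele is allele 1 (minor) rather than allele 2."""
    if ref == a2:
        return False
    if ref == a1:
        return True
    raise ValueError(f"reference allele {ref!r} is neither {a1!r} nor {a2!r}")


def snps_needing_reassignment(
    bim_alleles: Dict[str, Tuple[str, str]],
    ref_alleles: Dict[str, str],
) -> List[str]:
    """List SNPs whose reference allele is not the major allele (allele 2)."""
    return sorted(
        snp_id
        for snp_id, (a1, a2) in bim_alleles.items()
        if snp_id in ref_alleles
        and reference_is_allele_1(a1, a2, ref_alleles[snp_id])
    )


# Invented example: for rs2 the reference allele C is the minor allele.
bim_alleles = {"rs1": ("G", "A"), "rs2": ("C", "T")}
ref_alleles = {"rs1": "A", "rs2": "C"}
to_reassign = snps_needing_reassignment(bim_alleles, ref_alleles)
```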

http://www.well.ox.ac.uk/~wrayner/strand/RefAlt.html

Rigorous data quality control prior to imputation is vital to ensure high quality output. To simplify validating that this QC has been done, we have developed a program to compare a plink .bim file against the HRC or 1000G reference panel SNP list. The program produces an overall summary as well as a set of files that can be applied to the data set to correct any issues found.

In addition to the reference panel file, the program requires the plink .bim file and the .frq file (from the plink --freq command).

The program checks for issues with:

  • Strand
  • Alleles
  • Position
  • Ref/Alt assignments
  • Frequency differences > 0.2 (this can be changed)

Flagged for removal are:

  • A/T & G/C SNPs where the bim file MAF > 0.4
  • SNPs with differing alleles
  • SNPs with > 0.2 allele frequency difference
  • SNPs not in reference panel
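The removal rules listed above can be sketched for a single variant as follows. This is an illustration, not the program itself: the function name and input layout are assumptions, and so is the choice to treat strand-flipped alleles as matching (on the basis that strand is checked separately).

```python
from typing import List, Optional, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flag_for_removal(
    alleles: Tuple[str, str],
    maf: float,
    allele_freq: float,
    panel: Optional[dict],
    max_freq_diff: float = 0.2,
    palindromic_maf: float = 0.4,
) -> List[str]:
    """Return the reasons (possibly none) for removing one variant.

    `panel` is None when the variant is absent from the reference panel,
    otherwise a dict with "alleles" (a pair) and "freq", the panel frequency
    of the same allele that `allele_freq` refers to.
    """
    if panel is None:
        return ["not in reference panel"]
    reasons = []
    a1, a2 = alleles
    if COMPLEMENT.get(a1) == a2 and maf > palindromic_maf:
        reasons.append("palindromic with MAF > 0.4")
    flipped = {COMPLEMENT.get(a1), COMPLEMENT.get(a2)}
    if set(panel["alleles"]) not in ({a1, a2}, flipped):
        reasons.append("differing alleles")
    if abs(allele_freq - panel["freq"]) > max_freq_diff:
        reasons.append("allele frequency difference")
    return reasons


# Invented values: alleles disagree and the frequencies are far apart.
example_reasons = flag_for_removal(
    ("A", "G"), 0.1, 0.1, {"alleles": ("A", "C"), "freq": 0.5}
)
```

The 0.2 frequency threshold is exposed as a parameter because, as noted in the list of checks, it can be changed.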

These removals can be run in their entirety using the plink command script created or individually as desired. There is also a frequency comparison file that can be plotted in your favourite graphical package (or even Excel) to gain an overview of the bim file frequency vs reference.

http://www.well.ox.ac.uk/~wrayner/tools/#Checking

Some programs such as zCall produce output with the alleles labelled as A/B. The files created on the link below are to update the A/B notation to the TOP alleles, thereby allowing the use of the strand and position files (Strand Files) to generate a data set on the forward strand of the genome build of choice.

These files are formatted for use in plink with the --update-alleles command.
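For reference, plink's update-alleles input has five fields per row: the variant id, the two old allele codes, and the new codes for each of them. The sketch below formats one such row; that A corresponds to the first listed TOP allele and B to the second is an assumption of the example, as are the function name and the SNP shown.

```python
from typing import Tuple


def update_alleles_row(snp_id: str, top_alleles: Tuple[str, str]) -> str:
    """Format one row mapping A to the first TOP allele and B to the second."""
    top_a, top_b = top_alleles
    return "\t".join([snp_id, "A", "B", top_a, top_b])


# Invented SNP with TOP alleles A and G.
row = update_alleles_row("rs0000001", ("A", "G"))
```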

http://www.well.ox.ac.uk/~wrayner/strand/ABtoTOPstrand.html

A detail from the McCarthy Group’s OWL LIMS. OWL interfaces with a FluidX XTR 96 rack scanner over TCP/IP: the scan is initiated by OWL, the sample tubes are scanned and the results are returned. The database is then interrogated and the technician is presented with a display summarizing the changes that have been made to the rack. The technician has the choice of cancelling or committing the rack scan.

The display is designed to give intuitive visual feedback to the user about the physical state of the rack. The featured image at the top of the page is a logic-defying test case to showcase all possible states. Actual use cases will typically be much more like the image below, in which a single new tube has been added to an existing rack.

Rack scan display from OWL

The single yellow circle is surrounded by a darker outer circle; this indicates a previously empty well into which a new, unallocated tube has been placed. The remaining 95 tubes were previously present and are unchanged.

In the image below we have a more complex use case involving tube shuffles, allocations and no-reads. This rack scan should not be committed; the rack should be cleaned and scanned again.

Rack scan display from OWL

Four of the wells (in orange) have not been read, so the rack will need to be rescanned. Three of these wells contained existing tubes. Six of the wells are coloured in blue, which indicates that these tubes have been shuffled within the rack. Nine new tubes have been allocated to the rack, all into previously empty wells.
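The comparison behind such a display can be sketched as below: each well in the new scan is compared with what the database previously held for the rack. This is not OWL's code; the state names, the NO_READ marker and the tube barcodes are assumptions for the example.

```python
from typing import Dict, Optional

NO_READ = "NO_READ"  # hypothetical marker for a well the scanner could not read


def classify_wells(
    previous: Dict[str, Optional[str]],
    scanned: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Map each scanned well to a state by comparing it with the stored rack."""
    previously_racked = {tube for tube in previous.values() if tube}
    states = {}
    for well, tube in scanned.items():
        before = previous.get(well)
        if tube == NO_READ:
            states[well] = "no_read"
        elif tube is None:
            states[well] = "removed" if before else "empty"
        elif tube == before:
            states[well] = "unchanged"
        elif tube in previously_racked:
            states[well] = "shuffled"  # same rack, different well
        else:
            states[well] = "new"
    return states


# Invented four-well example: two tubes swapped, one added, one unreadable.
previous = {"A1": "T1", "A2": "T2", "A3": None, "A4": "T4"}
scanned = {"A1": "T2", "A2": "T1", "A3": "T9", "A4": NO_READ}
states = classify_wells(previous, scanned)
```

Any "no_read" state in the result corresponds to the case above where the rack should be rescanned rather than committed.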

Images are rendered in the browser as SVG using the Snap.svg JavaScript library.
