Rigorous data quality control prior to imputation is vital to ensure high-quality output. To simplify validating that this QC has been done, we have developed a program that compares a plink .bim file against the HRC or 1000G reference panel SNP list. The program produces an overall summary as well as a set of files that can be applied to the data set to correct any issues found.
To run, the program requires the reference panel file, the plink .bim file and the .frq file produced by the plink --freq command.
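For orientation, the two plink inputs can be read with a few lines of Python. This is only a minimal sketch and not the program's own code: the function names `read_bim` and `read_frq` and the sample rows are hypothetical. It assumes the standard plink layouts, i.e. a headerless .bim with six columns (chromosome, SNP ID, cM position, bp position, allele 1, allele 2) and a .frq with a `CHR SNP A1 A2 MAF NCHROBS` header.

```python
import io

def read_bim(handle):
    """Return {snp_id: (chrom, pos, a1, a2)} from a plink .bim file (no header)."""
    variants = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        chrom, snp_id, _cm, pos, a1, a2 = fields[:6]
        variants[snp_id] = (chrom, int(pos), a1, a2)
    return variants

def read_frq(handle):
    """Return {snp_id: (a1, a2, maf)} from a plink .frq file (one header line)."""
    freqs = {}
    handle.readline()  # discard the header: CHR SNP A1 A2 MAF NCHROBS
    for line in handle:
        fields = line.split()
        if len(fields) < 6:
            continue
        _chrom, snp_id, a1, a2, maf, _nobs = fields[:6]
        if maf == "NA":
            continue  # plink writes NA when no genotypes were observed
        freqs[snp_id] = (a1, a2, float(maf))
    return freqs

# Hypothetical sample data standing in for mydata.bim and mydata.frq
bim_text = "1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tT\tC\n2\trs3\t0\t3000\tA\tT\n"
frq_text = (
    " CHR  SNP   A1   A2          MAF  NCHROBS\n"
    "   1  rs1    A    G       0.1200      200\n"
    "   1  rs2    T    C       0.3500      200\n"
    "   2  rs3    A    T           NA        0\n"
)

bim = read_bim(io.StringIO(bim_text))
frq = read_frq(io.StringIO(frq_text))
print(len(bim), "SNPs in bim;", len(frq), "with a usable frequency")
```

With real data, the `io.StringIO` objects would be replaced by opened file handles.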
The program checks for issues with:
- Ref/Alt assignments
- Allele frequency differences > 0.2 (this threshold can be changed)
Flagged for removal are:
- A/T & G/C SNPs where the bim file MAF > 0.4
- SNPs with differing alleles
- SNPs with > 0.2 allele frequency difference
- SNPs not in reference panel
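The removal rules listed above can be sketched in Python as follows. This is an illustrative simplification, not the program's actual logic: the function name `flag_for_removal`, the reason codes, the order of the checks and the shape of the reference entry (`(ref, alt, alt_frequency)`) are all assumptions. In particular, the sketch compares alleles exactly as given and makes no attempt to handle strand flips.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(a1, a2):
    """True for A/T and G/C SNPs, whose alleles are each other's complement."""
    return COMPLEMENT.get(a1) == a2

def flag_for_removal(bim_alleles, bim_maf, ref_entry,
                     max_diff=0.2, palindromic_maf=0.4):
    """Return a reason code if the SNP should be removed, otherwise None.

    bim_alleles: (a1, a2) from the .frq file, where bim_maf is the frequency of a1
    ref_entry:   None if the SNP is absent from the panel,
                 else (ref, alt, alt_frequency)  [assumed layout]
    """
    if ref_entry is None:
        return "not_in_reference"
    a1, a2 = bim_alleles
    ref, alt, alt_af = ref_entry
    # A/T and G/C SNPs near 50% frequency cannot be oriented reliably
    if is_palindromic(a1, a2) and bim_maf > palindromic_maf:
        return "palindromic_high_maf"
    # Alleles compared as given; strand flips are NOT handled in this sketch
    if {a1, a2} != {ref, alt}:
        return "differing_alleles"
    # Frequency of the bim a1 allele according to the reference panel
    ref_a1_freq = alt_af if a1 == alt else 1.0 - alt_af
    if abs(bim_maf - ref_a1_freq) > max_diff:
        return "frequency_difference"
    return None

# Hypothetical examples
print(flag_for_removal(("A", "G"), 0.10, None))
print(flag_for_removal(("A", "T"), 0.45, ("A", "T", 0.50)))
print(flag_for_removal(("A", "G"), 0.10, ("A", "C", 0.10)))
print(flag_for_removal(("A", "G"), 0.10, ("G", "A", 0.50)))
print(flag_for_removal(("A", "G"), 0.10, ("G", "A", 0.12)))
```

The `max_diff` parameter mirrors the adjustable 0.2 frequency threshold mentioned earlier; the 0.4 MAF cut-off for A/T and G/C SNPs is exposed the same way purely for illustration.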
These removals can be run in their entirety using the plink command script that the program creates, or individually as desired. There is also a frequency comparison file that can be plotted in your favourite graphical package (or even Excel) to gain an overview of the bim file frequencies versus the reference.