To impute and meta-analyse multiple data sets it is essential that the data are aligned to a common reference, almost always the forward strand of the current human genome build. To do this, and to move data between different builds of the human genome, I have created a set of files (here) for the most common Illumina and Affymetrix genotyping chips.
For each Illumina chip listed, the zip file contains three files:
- .strand file
- .miss file
- .multiple file
The strand file contains six columns:
- SNP id
- Chromosome
- Position
- Percentage match to genome
- Strand
- Alleles
The SNP ids used are those from the annotation file and so are not necessarily the latest from dbSNP.
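As a minimal sketch of how a strand file might be read, the Python below parses each line into a record and lists the SNPs mapped to the reverse strand, which are the ones that would need flipping to reach the forward strand. It assumes the columns are whitespace-separated and appear in the order SNP id, chromosome, position, percentage match, strand, alleles, with the strand written as `+` or `-`. The names `StrandRecord`, `parse_strand_lines` and `snps_to_flip` are hypothetical, and the two example lines are invented for illustration, not taken from a real chip.

```python
from typing import Iterable, List, NamedTuple


class StrandRecord(NamedTuple):
    snp_id: str
    chromosome: str
    position: int
    pct_match: float
    strand: str
    alleles: str


def parse_strand_lines(lines: Iterable[str]) -> List[StrandRecord]:
    """Parse strand-file lines, skipping any line without six fields."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # blank or malformed line
        snp_id, chrom, pos, pct, strand, alleles = fields
        records.append(
            StrandRecord(snp_id, chrom, int(pos), float(pct), strand, alleles)
        )
    return records


def snps_to_flip(records: Iterable[StrandRecord]) -> List[str]:
    """Return the ids of SNPs whose best match is on the reverse strand."""
    return [r.snp_id for r in records if r.strand == "-"]


# Invented example lines (assumed layout, not real data).
example_lines = [
    "rs0001\t1\t1000\t100\t+\tAG",
    "rs0002\t2\t2000\t98.5\t-\tAC",
]
records = parse_strand_lines(example_lines)
flip_ids = snps_to_flip(records)
print(flip_ids)
```

A list of ids like this could then be passed to whichever tool performs the strand flip in your pipeline.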
The alleles listed are the Illumina TOP alleles. If you are in any doubt whether your data file can be used with these strand files, comparing the alleles of the SNPs that are not A/T or G/C against the strand file should confirm this for you. If there are differences, it is likely that your genotype file was created using a different set of alleles. In that case, if you can provide me with a list of the SNP ids and their alleles on the chip (a plink bim file is best), it is likely I can create a strand file for you.
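The allele check described above could be sketched in Python as follows. A/T and G/C SNPs are skipped because they look the same on either strand and so cannot reveal a mismatch. The sketch assumes the genotype alleles have already been read into a dictionary of SNP id to allele pair (for example from the last two columns of a plink bim file) and the strand-file alleles into a dictionary of SNP id to a two-letter string such as `AG`. The function names and the example data are hypothetical.

```python
from typing import Dict, List, Tuple

# Allele pairs that read identically on both strands.
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(alleles: str) -> bool:
    """True for A/T and G/C SNPs, which cannot be used for the check."""
    return frozenset(alleles.upper()) in AMBIGUOUS


def mismatching_snps(
    bim_alleles: Dict[str, Tuple[str, str]],
    strand_alleles: Dict[str, str],
) -> List[str]:
    """Return ids of unambiguous SNPs whose alleles disagree with the strand file."""
    mismatches = []
    for snp_id, (a1, a2) in bim_alleles.items():
        expected = strand_alleles.get(snp_id)
        if expected is None or len(expected) != 2:
            continue  # SNP not in the strand file, or unexpected allele format
        if is_strand_ambiguous(expected):
            continue
        # plink writes '0' for a missing allele (e.g. monomorphic SNPs); ignore it.
        observed = {a1.upper(), a2.upper()} - {"0"}
        if not observed <= set(expected.upper()):
            mismatches.append(snp_id)
    return mismatches


# Invented example data.
bim = {
    "rs1": ("A", "G"),  # agrees with the strand file
    "rs2": ("T", "C"),  # complement of A/G, so a different allele set
    "rs3": ("A", "T"),  # ambiguous, skipped
    "rs4": ("0", "C"),  # one allele missing, remaining allele agrees
}
strand = {"rs1": "AG", "rs2": "AG", "rs3": "AT", "rs4": "AC"}
bad = mismatching_snps(bim, strand)
print(bad)
```

An empty result suggests the genotype file uses the same allele coding as the strand file. A long list of mismatches suggests it was created with a different set of alleles.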
The .miss file gives the ids of the SNPs that did not reach the required threshold for mapping to the genome. The position and strand of the best match are given.
The .multiple file contains SNPs that had more than one high-quality match to the genome (default >=90%); the number in the second column shows exactly how many matches were found. In these instances the match with the highest percentage is used in the .strand file. If a second number is present, it shows how many of those matches share an identical percentage match above 90%. In these cases the match used in the .strand file is chosen at random and may not be the one expected.
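Because a tied match means the position in the .strand file was picked at random, it can be useful to list those SNPs so they can be reviewed or excluded. The Python sketch below does this under an assumed layout: whitespace-separated fields with the SNP id first, the number of matches second, and the optional tie count third. The function name and the example lines are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def tied_multiple_matches(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map SNP id -> (total matches, tied matches) for lines that carry a tie count."""
    tied = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # no tie count, so the best match was unique
        tied[fields[0]] = (int(fields[1]), int(fields[2]))
    return tied


# Invented example lines (assumed layout, not real data).
multiple_lines = [
    "rs10\t2",     # two matches, one clear best
    "rs11\t3\t2",  # three matches, two tied for best
]
tied = tied_multiple_matches(multiple_lines)
print(tied)
```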
As mentioned, the Illumina strand files assume by default that your genotype calling algorithm has standardised the genotype allele calls to the Illumina TOP strand. This is not always the case. When it is not, the most likely alternative is that the alleles were derived from the Source Sequence in the annotation file. Where such files exist I have labelled them SourceStrand, and I will be adding more as time and demand allow.
The Affymetrix strand files are fewer in number because Affymetrix provides this information by default, and for multiple genome builds. Where there are files, I have used the SNP_A identifiers, as these are invariant between the different genome builds.
The poster presentation for this work from the ASHG in 2011 is here.