To impute and meta-analyse multiple data sets, it is essential that the data are aligned to a common reference, almost always the forward strand of the current human genome build. To do this, and/or to move data between different builds of the human genome, I have created a set of files (here) for the most common Illumina and Affymetrix genotyping chips.

For the Illumina chips listed the zip file contains three files:

  • .strand file
  • .miss file
  • .multiple file

The strand file contains six columns:

  • SNP id
  • Chromosome
  • Position
  • Percentage match to genome
  • Strand
  • Alleles
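As a rough illustration of how these six columns might be read, the sketch below parses one line of a strand file into a record. It assumes the columns are whitespace-delimited, appear in the order listed above, and that the strand is coded as `+` or `-`. The names `StrandRecord` and `parse_strand_line` and the sample line are hypothetical and are not taken from the distributed files.

```python
from typing import NamedTuple


class StrandRecord(NamedTuple):
    snp_id: str
    chromosome: str
    position: int
    pct_match: float
    strand: str
    alleles: str  # kept as the raw string; its exact layout is not assumed


def parse_strand_line(line: str) -> StrandRecord:
    """Split one line into the six columns (assumed whitespace-delimited)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    snp_id, chrom, pos, pct, strand, alleles = fields
    return StrandRecord(snp_id, chrom, int(pos), float(pct), strand, alleles)


# Made-up example line, for illustration only.
example = parse_strand_line("rs0000001\t1\t123456\t100\t-\tAG")
# Assumed coding: "-" marks a SNP that would need flipping to the forward strand.
needs_flip = example.strand == "-"
```

Keeping the chromosome as a string avoids problems with non-numeric chromosome labels.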

The SNP ids used are those from the annotation file and so are not necessarily the latest from dbSNP.

The alleles listed are the Illumina TOP alleles. If you are in any doubt whether your data file can be used with these strand files, checking the alleles of the SNPs that are not A/T or G/C against the strand file should confirm this for you. If there are differences then it is likely your genotype file has been created using a different set of alleles. In this case, if you can provide me with a list of the SNP ids and their alleles on the chip (a plink bim file is best), it is likely I can create a strand file for you.
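The check described above could be sketched roughly as follows. The idea is to compare allele pairs only for SNPs that are not A/T or G/C, since those two types look identical on either strand and so cannot reveal a mismatch. The function name `mismatched_snps`, the dictionary inputs and the SNP ids are hypothetical; in practice the pairs would be read from your bim file and the strand file.

```python
# Allele pairs that read the same on both strands and so are uninformative.
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def mismatched_snps(chip_alleles, strand_alleles):
    """Return ids of unambiguous SNPs whose alleles disagree with the strand file.

    Both arguments map SNP id -> (allele, allele); allele order is ignored.
    """
    mismatches = []
    for snp_id, pair in chip_alleles.items():
        expected = strand_alleles.get(snp_id)
        if expected is None:
            continue  # SNP not in the strand file; nothing to compare
        expected_set = frozenset(expected)
        if expected_set in AMBIGUOUS:
            continue  # A/T and C/G SNPs cannot confirm the allele coding
        # plink writes "0" for a missing allele (monomorphic SNPs); ignore it.
        observed_set = frozenset(pair) - {"0"}
        if not observed_set <= expected_set:
            mismatches.append(snp_id)
    return mismatches


# Made-up data, for illustration only.
bim = {"snp1": ("A", "G"), "snp2": ("T", "C"), "snp3": ("A", "T"), "snp4": ("0", "G")}
strand = {"snp1": ("G", "A"), "snp2": ("A", "G"), "snp3": ("A", "T"), "snp4": ("A", "G")}
problem_ids = mismatched_snps(bim, strand)
```

An empty result suggests the data file uses the same allele coding as the strand file. A long list suggests the genotypes were created with a different set of alleles.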

The .miss file gives the ids of the SNPs that did not reach the required threshold for mapping to the genome; the position and strand of the best match are given.

The .multiple file contains SNPs that had more than one high quality match to the genome (default >=90%), with the number in the second column showing exactly how many matches there were. In these instances the highest percentage match is taken for the .strand file. If a second number is present, it shows how many of those >90% matches had identical scores; in these cases the match used in the file will be random and may not be the one expected.

As mentioned, the Illumina strand files by default assume that your genotype calling algorithm has standardised the genotype allele calls to the Illumina TOP strand; however, this is not always the case. In these instances the most likely alternative is that the alleles are derived from the Source Sequence in the annotation file. I have labelled these files, where they exist, as SourceStrand and will be adding more as time, and demand, allow.

The Affymetrix strand files are fewer in number as this information is provided by default by Affymetrix and is also present for multiple genome builds. Where there are files I have used the SNP_A identifiers as these are invariant between the different genome builds.

The poster presentation for this work from the ASHG in 2011 is here




At the current time one of the standard file formats for chip genotype data is a ped/map or plink binary ped (bed/bim/fam) format. This is normally converted to VCF format for imputation and further analysis.

An issue with this conversion is that Plink currently sets allele 1 as the minor allele and allele 2 as the major allele. Whilst this is somewhat close to the reference allele definition (in that the reference is usually the common allele), it is not always the case.
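To make the problem concrete, the sketch below flags SNPs where the genome reference allele is not the allele plink holds as allele 2. It assumes that allele 2 is the one written as the reference on conversion to VCF. The function name `ref_is_not_a2` and the example SNP ids and alleles are hypothetical.

```python
def ref_is_not_a2(bim_alleles, ref_alleles):
    """List SNP ids where the reference allele differs from plink's allele 2.

    bim_alleles maps SNP id -> (allele 1, allele 2), as ordered in a bim file.
    ref_alleles maps SNP id -> reference allele on the forward strand.
    """
    flagged = []
    for snp_id, (a1, a2) in bim_alleles.items():
        ref = ref_alleles.get(snp_id)
        if ref is not None and ref != a2:
            flagged.append(snp_id)
    return flagged


# Made-up data: for snp2 the reference allele "C" is the minor allele (allele 1).
bim = {"snp1": ("A", "G"), "snp2": ("C", "T")}
ref = {"snp1": "G", "snp2": "C"}
flagged = ref_is_not_a2(bim, ref)
```

SNPs flagged in this way are the ones that would end up with the wrong reference allele if the minor/major ordering were carried over unchanged.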

Therefore these files are designed to be used in plink with the --reference-allele (or --a1-allele, --a2-allele) command, allowing allele 1 or 2 to be set as the reference to ensure it is correctly assigned in any resulting VCF conversion.

NOTE: at the present time the files are not 100% correct for indels. Whilst the allele assignment is correct, the allele listed may be truncated and this may cause issues with plink.

Some programs, such as zCall, produce output with the alleles labelled as A/B. The files available from the link below update the A/B notation to the TOP alleles, thereby allowing the use of the strand and position files (Strand Files) to generate a data set on the forward strand of the genome build of choice.

These files are formatted for use in plink with the --update-alleles command.
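For orientation, plink's --update-alleles input has five fields per line: the variant id, the two old allele codes, and the new codes for those two alleles in the same order. The sketch below builds such lines for an A/B to TOP conversion. The helper name `update_alleles_line`, the tab delimiter and the SNP ids and alleles are assumptions for illustration only; the files provided here are already in the required format.

```python
def update_alleles_line(snp_id, top_a, top_b):
    """Build one five-field line: id, old codes A and B, then their TOP alleles."""
    return "\t".join([snp_id, "A", "B", top_a, top_b])


# Made-up mapping of SNP id -> (TOP allele for A, TOP allele for B).
top_alleles = {"snp1": ("A", "G"), "snp2": ("T", "C")}
lines = [update_alleles_line(snp, a, b) for snp, (a, b) in top_alleles.items()]
```

Writing `lines` to a text file, one per line, would give a file of the general shape that --update-alleles reads.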