Question: GWAS analysis with Illumina HumanOmni5-4 BeadChip - recommended software and workflow
0
3.2 years ago by
Lna0
Lna0 wrote:

Hi,

I am completely new to sequence analysis and a want to perform a GWAS analysis using an Illumina HumanOmni5 BeadChip. I received data of my 16 samples, which incluce jpeg and idat files 3 tables in txt format (full data, snp and sample).

I was looking for a recommended Workflow (any suggestions would be of great help to me) and Tools to analyse my data. I understand  that I have to perform quality control of the Chip data before I can start the actual GWAS analysis. GWASTools seemed a proper tool for quality control and PLINK for subsequent analysis. I have some problems finding out about the details...

1. Which tools should I use (being new to the field)? Is there a tool (freeware) that can perform all necessary steps? What is the difference between the tasks GenomeStudio can perform and additional tools like GWASTools oder PLINK? Is PLINK also able to do all necessary qc steps?

2. Which of the data files for my Illumina-Chip do I have to read into GWASTools (or other programs)? Is it necessary to use the idat files (it seems it is not straightforward to use them in GWASTools)? It should be possible to write out txt files that include user selected columns of the the measured data in GenomeStudio (equal to the three tables that were already included). Can I work solely based on these files and which columns have to be included?

modified 3.2 years ago by Stephanie M. Gogarten640 • written 3.2 years ago by Lna0
Answer: GWAS analysis with Illumina HumanOmni5-4 BeadChip - recommended software and wor
1
3.2 years ago by
University of Washington
Stephanie M. Gogarten640 wrote:

GenomeStudio does calling of genotypes, meaning it takes the idat files and determines the genotype of each of your samples for each SNP on the array. It can output the resulting genotypes, as well as the intensity values that were used to call them, as plain text files. It sounds like you already have these. The text files can then be used as a starting point for GWASTools. The "Data Cleaning" vignette will walk you through how to convert the plain text to GDS format, which is used by GWASTools. To follow all the data cleaning steps in that document, you will need the genotype columns, the normalized intensity (usually called "X" and "Y"), the B allele frequency, and the Log R Ratio. Using all the intensity data allows you to do more QC in GWASTools than is possible in PLINK, as PLINK only works with genotypes.

The GWASTools Data Cleaning vignette also includes examples of analyzing relatedness and population structure with the SNPRelate package (which can help check the identity of your samples, and is needed for subsequent analysis if they are related or from different ancestral backgrounds). The last step also shows you how to do an association test. You can do several types of association testing in GWASTools itself (linear or logistic regression, and survival analysis). You can also export to PLINK format for further analysis. The two export options are 1) plinkWrite in GWASTools, which is slow (and writes the uncompressed PED format rather than compressed BED) but gives you more control over the sample identifiers and phenotypes in the output file, and 2) snpgdsGDS2BED in SNPRelate, which is much faster but you may have to use PLINK utilities to update your sample information later.

Thank you for your suggestions! I managed to read all my data and do all steps in the GWASTools vignette. When I started to work with plink, I compared the outcome of the two tools. I checked the number of SNPs which were included with my MAF and missing rate filter and I compared the p-values I obtained doing simple association tests with a quantitative trait and no covariates. The results differed considerably in for the filters as well as for the association results. For me it seems it should be quite clear how to compute the MAF and missingness. The number of excluded SNPs in case of the missingness is 2575 in GWASTools and 4307 in plink, which is not really close... In case of the association results it seems that the same method is applied in both cases. Shouldn't the results be the same? Am I doing anything wrong or is there an explanation for the difference?

If you are applying exactly the same filters and methods the results should be the same. It seems likely that either GWASTools or PLINK or both is not doing exactly what you think it's doing (different default options, perhaps). Are you certain that the samples and SNPs in both your GDS and PLINK files are exactly the same? Can you supply more details on the commands you are using in both cases?

In the meanwhile, I managed to track down one of the problems that leads to a difference between the missingness filter in Plink and GWASTools. There is quite a number of SNPs on the X-chromosome in my data set, which have theta values for males that are close to 0.5 except of 1.0, which implies heterozygosity where it shouldn't exist.

Example:

       Theta_1 Theta_4 Theta_8 Theta_9 Theta_10 Theta_11 Theta_12 Theta_13 Theta_14 Theta_15 Theta_16 Theta_19 Theta_20 Theta_22 Theta_25 Theta_26 Theta_27 Theta_28
0.08    0.06   0.065   0.701    0.059    0.694    0.065    0.703    0.068    0.059    0.698    0.684    0.666    0.692    0.069    0.681    0.665    0.677

Of course, the clustering leads to the following genotypes: AA, AA, AA, AB, AA and so on, where it should be AA, AA, AA, BB, AA ...

Plink automatically sets heterozygous genotypes to missing for male x-chromosomal data, which increases the number of filtered SNPs for a specific missing rate considerably. GWASTools seems to just keep the obviously wrong genotypes, or do I miss anything? This leads a the difference in filtered SNPs between the two programs. It would be very helpful if these erroneous SNPs could not only be tracked down, but could also be corrected.

The "alleleFrequency' function in GWASTools will ignore heterozygotes for males on the X chromosome, but assocRegression does not set any genotypes to missing automatically. I would suggest using the "hetBySnpSex" function to identify X chrom SNPs with high heterozygosity in males and filter those from your association test results.

If you want to correct these SNPs rather than just filtering them, you would need to go back to the raw data and change the genotype calls.

Concerning the association results, I am sure that the used samples are the same and even if the number of SNPs would be different (due to different filters maybe), this shouldn't make a difference for the result of a specific SNP, since association tests are independent for each SNP, are they? I'd rather think the problem are different default options or something like that. In both cases I used very simple commands:

GWASTools

assoc.list <- lapply(unique(chr), function(x) {

covar <- NULL
if(x != 25) covar <- c("sex")
start <- which(chr == x)[1]
end <-which(chr==x)[length(which(chr==x))]
assocRegression(genoData, outcome="cd34", covar=covar, model.type="linear",
snpStart=start, snpEnd=end)
})

plink --bfile q_all --geno 0.15 --maf 0.06 --make-bed --nonfounders --out q_filt

plink --adjust --assoc --bfile q_filt --out quant.filt --qt-means

q_all are the files I exported from GWASTools, including my quantitative trait variable.

I suspect the difference in P-values is due to the difference in the way plink codes the X chromosome for males: http://pngu.mgh.harvard.edu/~purcell/plink/faq.shtml#faq9

GWASTools codes males as (0,2), while plink codes them as (0,1).

Now I found another problem, which causes the differences between the missing rate in Plink and GWASTools. I generated the input ped file for Plink with Bioconductor:

phenotype.col="cd34", rs.col="rsID", mapdist.col=NULL)

For some SNPs, single data points are missing in the resulting ped-file.

Here is an example:

The genotypes of the SNP I obtain with getGenotype are: 0 0 1 0 0 0 1 0 1 0 0 0 0 2 0 1 1 1 0 0 0 0 0 0 0 0 1 0

which should come from the raw data:

B,B
B,B
A,B
B,B
B,B
B,B
A,B
B,B
A,B
B,B
B,B
B,B
B,B
A,A
B,B
A,B
A,B
A,B
B,B
B,B
B,B
B,B
B,B
B,B
B,B
B,B
A,B
B,B

When I check the ped file, I get:

A A
A A
A B
0 0
A A
A A
A B
A A
A B
A A
A A
A A
A A
B B
A A
A B
A B
A B
A A
A A
A A
A A
A A
A A
A A
A A
A B
A A

The data is suddenly missing in the forth line. I found SNPs for which half of the data was missing, which explains the higher missing rate in PLINK.

I'm not sure what's going on here, but you can try a few things to troubleshoot:

1) check that the "alleleA" and "alleleB" columns in your SNP annotation are correct

2) run "plinkCheck" to compare the file you produced to the original GDS

3) use SNPRelate::snpgdsGDS2BED and see if you have the same problem.

I ran plinkCheck to compare the files. This is the output, it implies everything should be ok:

Checking SNPs against map file...
OK
Checking sample data and genotypes in each line of ped file...
OK
Checking that all samples were found in ped file...
OK

Then I used SNPRelate::snpgdsGDS2BED to generate the ped-File by writing the bed file and recoding it to ped format.

snpgdsGDS2BED("geno.gds","Plink.gds",snp.id=snp.gds,snpfirstdim=TRUE,verbose=TRUE)

The resulting file seems to be ok. No additional missing genotypes are inserted. By counting the "A","B", and "0" entries in the files generated by plinkWrite and snpgdsGDS2BED directly, I obtain:

A: 102457425

B: 136753035

0: 345724

snpgdsGDS2BED:

A: 102488387

B: 136798925

0: 268872

which means that 76852 entries are missing in the Plink output.

It seems something is wrong with plinkWrite and plinkCheck is not working, or is there any parameter that could have influenced the output in this way?

Can you check if any of the SNPs are monomorphic? In GWASTools, monomorphic SNPs are coded B/B, B/B, etc for all samples (if B is the major allele). I believe PLINK will code monomorphic SNPs as 0/0: from the docs, "If the SNP is monomorphic, by default the allele code out will be 0 and all individuals will have a count of 0 (or NA)" (http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml).

Some of the erroneous SNPs are monomorphic, some are not. The example I posted above is obviously not, but there is one SNP, for example, which is monomorphic and for which six of 28 genotypes are missing in the ped-file.

Which chromosomes are affected? Are they autosomal, or sex chromosomes?