Hi,
I am completely new to sequence analysis and a want to perform a GWAS analysis using an Illumina HumanOmni5 BeadChip. I received data of my 16 samples, which incluce jpeg and idat files 3 tables in txt format (full data, snp and sample).
I was looking for a recommended Workflow (any suggestions would be of great help to me) and Tools to analyse my data. I understand that I have to perform quality control of the Chip data before I can start the actual GWAS analysis. GWASTools seemed a proper tool for quality control and PLINK for subsequent analysis. I have some problems finding out about the details...
1. Which tools should I use (being new to the field)? Is there a tool (freeware) that can perform all necessary steps? What is the difference between the tasks GenomeStudio can perform and additional tools like GWASTools oder PLINK? Is PLINK also able to do all necessary qc steps?
2. Which of the data files for my Illumina-Chip do I have to read into GWASTools (or other programs)? Is it necessary to use the idat files (it seems it is not straightforward to use them in GWASTools)? It should be possible to write out txt files that include user selected columns of the the measured data in GenomeStudio (equal to the three tables that were already included). Can I work solely based on these files and which columns have to be included?
Thank you for your answers!
Thank you for your suggestions! I managed to read all my data and do all steps in the GWASTools vignette. When I started to work with plink, I compared the outcome of the two tools. I checked the number of SNPs which were included with my MAF and missing rate filter and I compared the p-values I obtained doing simple association tests with a quantitative trait and no covariates. The results differed considerably in for the filters as well as for the association results. For me it seems it should be quite clear how to compute the MAF and missingness. The number of excluded SNPs in case of the missingness is 2575 in GWASTools and 4307 in plink, which is not really close... In case of the association results it seems that the same method is applied in both cases. Shouldn't the results be the same? Am I doing anything wrong or is there an explanation for the difference?
If you are applying exactly the same filters and methods the results should be the same. It seems likely that either GWASTools or PLINK or both is not doing exactly what you think it's doing (different default options, perhaps). Are you certain that the samples and SNPs in both your GDS and PLINK files are exactly the same? Can you supply more details on the commands you are using in both cases?
In the meanwhile, I managed to track down one of the problems that leads to a difference between the missingness filter in Plink and GWASTools. There is quite a number of SNPs on the X-chromosome in my data set, which have theta values for males that are close to 0.5 except of 1.0, which implies heterozygosity where it shouldn't exist.
Example:
Of course, the clustering leads to the following genotypes: AA, AA, AA, AB, AA and so on, where it should be AA, AA, AA, BB, AA ...
Plink automatically sets heterozygous genotypes to missing for male x-chromosomal data, which increases the number of filtered SNPs for a specific missing rate considerably. GWASTools seems to just keep the obviously wrong genotypes, or do I miss anything? This leads a the difference in filtered SNPs between the two programs. It would be very helpful if these erroneous SNPs could not only be tracked down, but could also be corrected.
The "alleleFrequency' function in GWASTools will ignore heterozygotes for males on the X chromosome, but assocRegression does not set any genotypes to missing automatically. I would suggest using the "hetBySnpSex" function to identify X chrom SNPs with high heterozygosity in males and filter those from your association test results.
If you want to correct these SNPs rather than just filtering them, you would need to go back to the raw data and change the genotype calls.
Concerning the association results, I am sure that the used samples are the same and even if the number of SNPs would be different (due to different filters maybe), this shouldn't make a difference for the result of a specific SNP, since association tests are independent for each SNP, are they? I'd rather think the problem are different default options or something like that. In both cases I used very simple commands:
GWASTools
Plink:
q_all are the files I exported from GWASTools, including my quantitative trait variable.
I suspect the difference in P-values is due to the difference in the way plink codes the X chromosome for males: http://pngu.mgh.harvard.edu/~purcell/plink/faq.shtml#faq9
GWASTools codes males as (0,2), while plink codes them as (0,1).
Also GWASTools does not adjust the P-values, so in the plink output using "--adjust" make sure you are comparing to unadjusted values.
Now I found another problem, which causes the differences between the missing rate in Plink and GWASTools. I generated the input ped file for Plink with Bioconductor:
plinkWrite(genoData, pedFile="Plink", individual.col="scanID",
phenotype.col="cd34", rs.col="rsID", mapdist.col=NULL)
For some SNPs, single data points are missing in the resulting ped-file.
Here is an example:
The genotypes of the SNP I obtain with getGenotype are: 0 0 1 0 0 0 1 0 1 0 0 0 0 2 0 1 1 1 0 0 0 0 0 0 0 0 1 0
which should come from the raw data:
B,B
B,B
A,B
B,B
B,B
B,B
A,B
B,B
A,B
B,B
B,B
B,B
B,B
A,A
B,B
A,B
A,B
A,B
B,B
B,B
B,B
B,B
B,B
B,B
B,B
B,B
A,B
B,B
When I check the ped file, I get:
A A
A A
A B
0 0
A A
A A
A B
A A
A B
A A
A A
A A
A A
B B
A A
A B
A B
A B
A A
A A
A A
A A
A A
A A
A A
A A
A B
A A
The data is suddenly missing in the forth line. I found SNPs for which half of the data was missing, which explains the higher missing rate in PLINK.
I'm not sure what's going on here, but you can try a few things to troubleshoot:
1) check that the "alleleA" and "alleleB" columns in your SNP annotation are correct
2) run "plinkCheck" to compare the file you produced to the original GDS
3) use SNPRelate::snpgdsGDS2BED and see if you have the same problem.
I ran plinkCheck to compare the files. This is the output, it implies everything should be ok:
Then I used SNPRelate::snpgdsGDS2BED to generate the ped-File by writing the bed file and recoding it to ped format.
The resulting file seems to be ok. No additional missing genotypes are inserted. By counting the "A","B", and "0" entries in the files generated by plinkWrite and snpgdsGDS2BED directly, I obtain:
plinkWrite:
snpgdsGDS2BED:
which means that 76852 entries are missing in the Plink output.
It seems something is wrong with plinkWrite and plinkCheck is not working, or is there any parameter that could have influenced the output in this way?
Can you check if any of the SNPs are monomorphic? In GWASTools, monomorphic SNPs are coded B/B, B/B, etc for all samples (if B is the major allele). I believe PLINK will code monomorphic SNPs as 0/0: from the docs, "If the SNP is monomorphic, by default the allele code out will be 0 and all individuals will have a count of 0 (or NA)" (http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml).
Some of the erroneous SNPs are monomorphic, some are not. The example I posted above is obviously not, but there is one SNP, for example, which is monomorphic and for which six of 28 genotypes are missing in the ped-file.
Which chromosomes are affected? Are they autosomal, or sex chromosomes?
Both. The "0"s are simply spread all over the file.