Question

selecting SNPs and samples in GENESIS

Stephanie M. Gogarten ▴ 870

@stephanie-m-gogarten-5121

University of Washington

This question was sent by email:

We are performing GWAS on whole-genome data and using the GENESIS package for the association analysis.

Just a few small questions:

How many SNPs should be used for the KING relationship matrix?
What is the best way to select the number of PCs to be used as covariates?
How many SNPs should be used to estimate PCs using the PC-AiR and PC-Relate methods?
Does the association model take care of NA in the phenotype data, or do the samples need to be removed before performing the association test?

genesis • 773 views

5.7 years ago Stephanie M. Gogarten ▴ 870

score 0 · Answer 1 · 2018-08-07

The devel version of GENESIS includes a vignette that may be helpful: http://bioconductor.org/packages/devel/bioc/vignettes/GENESIS/inst/doc/assoc_test_seq.html

1) How many SNPs should be used for the KING relationship matrix?

We recommend LD pruning to select SNPs. The SNPRelate function snpgdsLDpruning can be used for this. We usually set a minor allele frequency threshold in the pruning function to eliminate rare variants. After pruning, we usually end up with 200,000 - 300,000 SNPs.
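
As a rough sketch of the pruning step, the following uses the small example GDS file that ships with SNPRelate so that it runs as written. The MAF, missing-rate, window, and LD threshold values are illustrative assumptions, not settings prescribed in this answer, and the example file is far too small to yield the SNP counts quoted above.

```r
library(SNPRelate)

# Example data bundled with SNPRelate; substitute your own GDS file.
gds <- snpgdsOpen(snpgdsExampleFileName())

set.seed(100)  # pruning picks a random starting SNP on each chromosome
snpset <- snpgdsLDpruning(
  gds,
  method       = "corr",
  maf          = 0.01,       # assumed threshold: drops rare variants
  missing.rate = 0.01,       # assumed threshold
  slide.max.bp = 10e6,       # assumed sliding window size in base pairs
  ld.threshold = sqrt(0.1),  # assumed threshold, i.e. r^2 < 0.1
  verbose      = FALSE
)

# snpgdsLDpruning returns one vector of SNP IDs per chromosome.
pruned <- unlist(snpset, use.names = FALSE)
length(pruned)

snpgdsClose(gds)
```

The `pruned` vector can then be passed as the `snp.id` argument of SNPRelate functions that accept a SNP subset.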

2) What is the best way to select the number of PCs to be used as covariates?

You want to select PCs that are informative for distinguishing populations. A good way to do this is to make a parallel coordinates plot color-coded by population or self-identified race, as illustrated in the vignette. Look for the last PC that separates groups of colors instead of looking like noise.

3) How many SNPs should be used to estimate PCs using the PC-AiR and PC-Relate methods?

The recommendations for LD pruning apply here also. We often do another round of LD pruning using only unrelated samples (selected with the pcairPartition function).
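
A second pruning round restricted to unrelated samples might look like the sketch below. To keep it runnable on the SNPRelate example data, `unrel_ids` is a hypothetical stand-in (simply the first 200 sample IDs); in a real analysis it would be the `unrels` element of the list returned by `pcairPartition`. The thresholds are again assumptions.

```r
library(SNPRelate)  # also attaches gdsfmt, which provides read.gdsn/index.gdsn

gds <- snpgdsOpen(snpgdsExampleFileName())
all_ids <- read.gdsn(index.gdsn(gds, "sample.id"))

# Hypothetical placeholder for the unrelated set; in practice use
# something like: unrel_ids <- part$unrels, where part is the
# result of pcairPartition() on your kinship estimates.
unrel_ids <- all_ids[1:200]

set.seed(100)
snpset_unrel <- snpgdsLDpruning(
  gds,
  sample.id    = unrel_ids,  # LD is estimated from these samples only
  method       = "corr",
  maf          = 0.01,       # assumed threshold
  slide.max.bp = 10e6,       # assumed window size
  ld.threshold = sqrt(0.1),  # assumed threshold
  verbose      = FALSE
)
pruned_unrel <- unlist(snpset_unrel, use.names = FALSE)

snpgdsClose(gds)
```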

4) Does the association model take care of NA in the phenotype data, or do the samples need to be removed before performing the association test?

fitNullModel will remove any samples with NA in the phenotype data prior to fitting the null model. However, I recommend explicitly selecting non-missing samples with the sample.id argument, because it makes it much easier to keep track of exactly how many samples are being used in your analysis and reduces the possibility of errors.
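
One way to make the sample selection explicit is sketched here with a small made-up phenotype table. The column names `height` and `age` and all values are invented for illustration, and the call assumes the AnnotatedDataFrame method of fitNullModel, which requires a `sample.id` column. No covariance matrix is supplied, so this toy model ignores relatedness.

```r
library(Biobase)
library(GENESIS)

# Toy phenotype data (invented values); two samples have a missing outcome.
set.seed(1)
pheno <- data.frame(
  sample.id = paste0("S", 1:8),
  height    = c(rnorm(3, 170, 8), NA, rnorm(2, 170, 8), NA, rnorm(1, 170, 8)),
  age       = c(34, 51, 47, 29, 62, 45, 38, 55),
  stringsAsFactors = FALSE
)

# Keep samples with no missing values in the outcome or the covariate.
keep <- pheno$sample.id[complete.cases(pheno[, c("height", "age")])]
length(keep)  # record this number for your analysis log

annot <- AnnotatedDataFrame(pheno)
nullmod <- fitNullModel(
  annot,
  outcome   = "height",
  covars    = "age",
  sample.id = keep,
  verbose   = FALSE
)
```

Reusing the same `keep` vector in later steps makes it easy to confirm that every stage of the analysis uses the same set of samples.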