Question

Targeted RNASeq/CaptureSeq DE Analysis

0

Entering edit mode

tyssowski • 0

@tyssowski-11483

Last seen 7.2 years ago

Harvard Medical School

I am using EdgeR to find DE genes in my data. In order to reduce necessary sequencing depth, I have been enriching my sequencing libraries for 300 genes of interest before sequencing (described here: http://www.ncbi.nlm.nih.gov/pubmed/24705597) and then running EdgeR DE analysis on only those genes. I am comparing gene expression between two experimental conditions.

I have been getting a list of DE genes using the following function:

edgeRDElist = function(subset, groups){
y = DGEList(counts = subset, group=groups)
y = calcNormFactors(y)
y = estimateCommonDisp(y)
y = estimateTagwiseDisp(y)
et = exactTest(y)
topTags(et, 20)
de <- decideTestsDGE(et,adjust.method="fdr",p.value=0.05)
print(subset[de == -1,])
return(rownames(subset[de == -1,]))
}

Where subset is a matrix of read counts for a set of samples (consisting of biological replicates) and groups identifies the experimental condition.

Of the 300 genes that I am capturing (and therefore running the analysis on), 270 are “test” genes that could be DE between the two experimental conditions and 30 are “control” genes that I am reasonably sure should not change.

I am concerned that what I'm currently doing not appropriate for this analysis because it assumes that most of the genes are not DE. I am also concerned that using only a small number of genes is not compatible. Are these valid concerns? If so, is there anyway to get around them using EdgeR or another program?

Thanks!

Kelsey

EdgeR deseq2 RNA differential expression • 1.6k views

ADD COMMENT • link updated 7.6 years ago by Aaron Lun ★ 28k • written 7.6 years ago by tyssowski • 0

score 3 · Accepted Answer · 2016-09-15

The normalization step assumes that most of the genes in the matrix are not DE. This is usually a reasonable assumption for bulk RNA-seq, but in your case, it may be better to normalize based on your "control" genes, for which the assumption should be somewhat safer (depending on how confident you are in their non-DE'ness):

ycon <- y[control.genes,]
ycon <- calcNormFactors(ycon)
y$samples$norm.factors <- ycon$samples$norm.factors

This will apply the normalization factors computed from the control genes to the analysis for all genes (which is allowable, as long as the library sizes are unchanged between ycon and y, which should be the case if you use the code above). It would be better if you had more control genes, but if the variability is low and the coverage of the control genes is high, then it shouldn't be a problem to get reasonably precise estimates of the normalization factors.

More generally, 300 genes is plenty for empirical Bayes shrinkage to provide a benefit during dispersion estimation. You're still using information from multiple genes to help with inference, so I wouldn't worry about that. The only thing I would suggest is to switch to estimateDisp with glmQLFit and glmQLFTest instead of exactTest.