Hello,
I am analyzing a large cohort (1000+ samples) for differential expression analysis.
I am using Deseq2 to process raw counts from samples correcting by two confounding factors (namely sex and histology origin).
As output we obtain a shortlist of interesting candidates with strong adjusted p-value (e.g. p < e-14) and would like to plot the result so that I can visually show such a good result.
Searching around in the net (e.g. PCA of svaseq "cleaned" counts and DESeq2 SV subtracted counts are different? , or DESeq2 - Acquiring batch-corrected values for PCA and hierarchical clustering) I resolved to try to do this using limma batchEffectCorrection() on vst(dds).
Once I plot the output as a boxplot, it does not look as distribution close to a p.value < 1e-14 ( running a t.test on them results in 0.002)
I am pretty sure there is something I am missing in my processing.
Can you suggest a way to improve what I am doing so that what I see resembles the p.value obtained by Deseq analysis?
Here it's what I have done:
library(limma)
library(DESeq2)
library(ggplot2)
genes=read.table("raw_counts.csv",h=T,row.names=1)
info=read.table("matched_metadata.csv",sep='\t',h=T,row.names=1)
ddsMat = DESeqDataSetFromMatrix(countData=genes,
colData=info,
design=~donor_sex + histology + STATUS)
ddsMat <- estimateSizeFactors(ddsMat)
ddsMat <- estimateSizeFactors(ddsMat)
idx <- rowSums( counts(ddsMat, normalized=TRUE) >= 5 ) >= 8
ddsMat <- ddsMat[idx,]
#now run deseq2
dds <- DESeq(ddsMat, parallel=T,BPPARAM = param)
myres2 = results(dds_results, parallel=T,BPPARAM = param,lfcThreshold = 1)
#now VST+ batch removal of the 2 confounding factors
dds_vst= vst(dds,blind =F )
dds_batch_removed = removeBatchEffect(assay(dds_vst),batch= info$histology,
batch2=info$donor_sex ,
design=model.matrix(~info$STATUS))
#now plot this
a =as.data.frame(t(dds_batch_removed))
a$STATUS= info$STATUS
ggplot(a,aes(x=STATUS , y= ENSG00000171819.4)) + geom_boxplot()
--
My result for the top gene is like this in the image below, I think I'm doing something wrong in the process
https://drive.google.com/open?id=1bhfEzQlHj5c_ElhyrQQaSwSsrk83zpGk
Thanks a lot for your help
Mattia

Can you post the plotCounts() for this gene straight from DESeq2? This may help us see the differences better. Also, what is the mean of VST values in the two groups (after you've used removeBatchEffect)?
Hello,
thanks a lot for the prompt reply.
Here the image of the plot counts :
plotCounts(dds=dds,gene="ENSG00000171819.4", intgroup= 'STATUS')https://drive.google.com/open?id=1i16bFQgbVxC9775Xya5eMtpwhyc8zK8d
And here the results of the mean VST values in the two groups:
Finally, these are the results from the DESeq2 analysis for this gene just in case it may help.
> myres2[rownames(myres2)=='ENSG00000171819.4',] log2 fold change (MAP): STATUS Cases vs Control Wald test p-value: STATUS Cases vs Control DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000171819.4 75.91621 2.722376 0.1932445 8.912937 4.969837e-19 2.029135e-14Thanks again,
Mattia