Is there a conceptual difference between vst/rlog transforms and lfcShrink?
1
2
Entering edit mode
kieran.mace ▴ 40
@kieranmace-11647
Last seen 7.0 years ago

When performing a blind=FALSE vst or rlog transformation on the data, it returns log2 values, that have been "corrected" by accounting for gene expression noise.

Similarly, lfcShrink also returns log2(FC) values that have been shrunk due to gene expression noise?

I therefore wonder, are these two processes conceptually similar? are they in fact conceptually identical? If not, I'd like to gain an understanding for what their differences are.

deseq2 • 13k views
ADD COMMENT
3
Entering edit mode
@mikelove
Last seen 1 day ago
United States

rlog and lfcShrink are conceptually similar, see from the DESeq2 paper:

"The rlog transformation is calculated by fitting for each gene a GLM with a baseline expression (i.e., intercept only) and, computing for each sample, shrunken LFCs with respect to the baseline, using the same empirical Bayes procedure as before (Materials and methods)." 

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8

So instead of using the design, the rlog creates a matrix with an intercept and a coefficient for each sample. See Methods of DESeq2 paper for more details.

ADD COMMENT
0
Entering edit mode

... whereas the VST is conceptually different. It looks at the trend between variance and mean in the data, and then tries to find a strictly monotonous transformation of the data so that this trend is removed. In practice, the transformation will approach the logarithm function for high values and the square root function for small values (incl. 0), and smoothly interpolate inbetween.

ADD REPLY
0
Entering edit mode

That being said, being a novice in differential expression analysis and have been following DESEq2 pipeline, is there a better transformation method? Or when does rlog is more superior vst and vice versa if that's even a point?

Many thanks.

ADD REPLY
1
Entering edit mode

I recommend the VST now, which can be run quickly with vst(). rlog() is slow and relies more heavily on the assumption of the data distribution.

ADD REPLY
0
Entering edit mode

Thanks for the clarifications.

That brings me to an additional question, will the number and list of differentially expressed genes generated changes if we use either rlog or vst using the BLIND = FALSE argument or will it be the same? 

Please advise.

ADD REPLY
0
Entering edit mode

It will be different because the output is different. Using blind=FALSE avoids overestimating the variance, and so I tend to use this approach.

ADD REPLY
0
Entering edit mode

Hi Michael,

I'm having similar list and numbers of DEG's.

So I'm doing differential expression analysis on drought tolerance at reproductive-stage drought stress with two contrasting genotypes and two conditions with 4 replications.

My codes are as follows; Please and kindly comment if there's something wrong on the way I write it out.

##this data frame will become a deseq table#
colData <- data.frame(genotype=rep(c("IL","Swarna"),each=8, sep = "_"),condition=rep(rep(c("Control","Drought"),each=4),times=2),sep = "_")

rownames(colData) <- colnames(tx.all$counts)

 

dds <- DESeqDataSetFromTximport(tx.all, colData, formula(~genotype+condition+genotype:condition))

 

colData(dds)$condition<-relevel(colData(dds)$condition, ref = "Control")

 

dds$group<-factor(paste0(dds$genotype, dds$condition)) ##combine the factors of interest into a single factor with all combinations of the original factors##
design(dds) <- ~group ##change the design to include just this factor##

 

#apply the most minimal filtering rule: removing rows of the DESeqDataSet that have no counts, or only a single count across all samples##
nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

 

For the trasnformation i made a separate run using rlog and vst.

rld<-rlogTransformation(dds,blind=FALSE) 
head(assay(rld), 3)

vsd <- vst(dds, blind=FALSE)
head(assay(vsd), 3)

dds<-DESeq(dds, betaPrior = TRUE, parallel = TRUE)
resultsNames(dds)

 

res.05_NILD_NILC <- results(dds, contrast=c("group","ILDrought", "ILControl"), alpha=.05, parallel = TRUE)
res.05_SWAD_SWAC <- results(dds, contrast=c("group","SwarnaDrought", "SwarnaControl"), alpha=.05, parallel = TRUE)
res.05_NILC_SWAC <- results(dds, contrast=c("group","ILControl", "SwarnaControl"), alpha=.05, parallel = TRUE)
res.05_NILD_SWAD <- results(dds, contrast=c("group","ILDrought", "SwarnaDrought"), alpha=.05, parallel = TRUE)

 

resSig <- subset(res.05_NILD_SWAD, padj < 0.05)
summary(resSig)
table(resSig$padj < 0.05)

 

write.csv(as.data.frame(resSig),file="NILDroughtvsSwarnaDrought_gene-level_ressig-padjat0.05_vsd.csv")

I'm getting same list and number of DEG's when I used rlog or vst on a separate run. How should I go about correcting this? Please and kindly advise.

Many thanks,

Asher

 

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_1.1.1                tximport_1.2.0             genefilter_1.56.0         
 [4] pheatmap_1.0.8             RColorBrewer_1.1-2         ggplot2_2.2.1             
 [7] gplots_3.0.1               DESeq2_1.14.1              SummarizedExperiment_1.4.0
[10] Biobase_2.34.0             GenomicRanges_1.26.4       GenomeInfoDb_1.10.3       
[13] IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0   

 

ADD REPLY
3
Entering edit mode

rlog() and vst() are only for visualization and have no effect on the DESeqDataSet where the counts are stored and modeled when you run DESeq().

See the first sentence here: http://www.bioconductor.org/help/workflows/rnaseqGene/#exploratory-analysis-and-visualization

ADD REPLY
0
Entering edit mode

Thanks, Michael,

That is my understanding as well based on your paper and reading on the DESEq2 community. 

So using either rlog or vst will not (greatly) affect differential expression analysis at all and generating the same number and list of differentially expressed genes is okay and not at all connected with using either rlog or vst?

With my samples being n<30 (16 samples) would using rlog be better than suing vst?

Please advise.

Thanks and best regards,

Asher

ADD REPLY
1
Entering edit mode

You can really use either. See the vignette for all we have to say about their differences.

ADD REPLY
0
Entering edit mode

When sorting DE genes by fold change, which method would you recommend to look at the top X genes with the highest logFC: a) the logFC from reslts(DESeq()) b) the logFC from rlog() and/or vst() c) the logFC from lfcShrink() Could someone please shed some light into this? maybe pros/cons of each method or when is reasonable to use one over the other? thank you!

ADD REPLY
1
Entering edit mode

(c) the LFCs from lfcShrink are well tested for this purpose. Check out the apeglm Bioinformatics paper for numerous benchmarks.

(b) does have some shrinkage effects, but it's in some sense accidental. Methods like apeglm focus the shrinkage on the coefficient of interest, and the prior is adapted to that coefficient in particular. A major difference is that from (b) you also don't have a posterior, whereas with (c) you can generate posterior probabilities for the LFC estimates.

ADD REPLY
0
Entering edit mode

Thank you Michael for your quick response! So as a bench researcher, I usually prioritize genes I want to study after a DE analysis based on adjusted P-values and FC. Would you say that the best way to do this is to use the adjP values from the analysis of the unnormalized counts from DESeq() and to use the logFC from a lfcShrink()? I just want to use a table that non-bioinformatic savvy people can easily look at to make decisions on the bench side. Thank you for your help!

ADD REPLY
0
Entering edit mode

If you want a set of genes with an FDR bound, I would recommend DESeq() on counts followed by results() with an lfcThreshold of biologically relevant significance.

If you want to rank genes from largest effect to smallest, I would recommend lfcShrink().

We do have lfcThreshold inside of lfcShrink() which produces genes where the false sign rate is controlled. This is somewhat similar to FDR, but many more biomedical researchers are familiar with and grasp the meaning of FDR bounded sets.

ADD REPLY

Login before adding your answer.

Traffic: 498 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6