Question: Is there a conceptual difference between vst/rlog transforms and lfcShrink?
kieran.mace10 wrote:

A vst or rlog transformation run with blind=FALSE returns log2-scale values that have been "corrected" to account for gene expression noise.

Similarly, lfcShrink returns log2 fold change values that have been shrunk to account for gene expression noise.

I therefore wonder: are these two processes conceptually similar? Are they in fact conceptually identical? If not, I'd like to understand what their differences are.

written 9 months ago by kieran.mace10
Michael Love wrote:

rlog and lfcShrink are conceptually similar. From the DESeq2 paper:

"The rlog transformation is calculated by fitting for each gene a GLM with a baseline expression (i.e., intercept only) and, computing for each sample, shrunken LFCs with respect to the baseline, using the same empirical Bayes procedure as before (Materials and methods)." 

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8

So instead of using the design, the rlog creates a matrix with an intercept and a coefficient for each sample. See Methods of DESeq2 paper for more details.
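As a rough illustration of that last point, here is a minimal base R sketch of a matrix with an intercept column plus one indicator column per sample. The sample names are made up, and this is only meant to show the shape of such a matrix, not to reproduce DESeq2's internal code.

```r
# Illustrative only: one intercept column plus one indicator column per
# sample, as described for rlog above. Sample names are hypothetical.
samples <- c("s1", "s2", "s3", "s4")
n <- length(samples)

rlog_design <- cbind(intercept = 1, diag(n))
colnames(rlog_design)[-1] <- samples
rownames(rlog_design) <- samples

print(rlog_design)

# There are more columns (n + 1) than samples (n), so the matrix is not
# of full column rank.
print(dim(rlog_design))
print(qr(rlog_design)$rank)
```

Because the matrix has one more column than there are samples, the sample coefficients could not be estimated by an ordinary unpenalized fit; shrinking them toward zero is what makes the fit well defined.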

written 9 months ago by Michael Love

... whereas the VST is conceptually different. It looks at the trend between variance and mean in the data, and then tries to find a strictly monotonic transformation of the data so that this trend is removed. In practice, the transformation will approach the logarithm function for high values and the square root function for small values (incl. 0), and smoothly interpolate in between.
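To make the log / square-root behaviour concrete, here is a toy sketch in base R. It assumes a negative binomial with variance mu + alpha * mu^2 and a single constant dispersion alpha (an assumption for illustration only; DESeq2 fits a dispersion-mean trend instead). For that variance function, the variance-stabilizing transformation has the closed form (2 / sqrt(alpha)) * asinh(sqrt(alpha * x)). The names toy_vst and log_form are hypothetical.

```r
# Toy variance-stabilizing transformation for a constant dispersion alpha.
# This is NOT what vst() computes; it only illustrates the limiting shapes.
toy_vst <- function(x, alpha = 0.1) {
  (2 / sqrt(alpha)) * asinh(sqrt(alpha * x))
}

# Large-count approximation: a shifted and scaled logarithm.
log_form <- function(x, alpha = 0.1) {
  log(4 * alpha * x) / sqrt(alpha)
}

x_small <- c(0, 0.01, 0.1)
x_large <- c(1e4, 1e5, 1e6)

# Near zero the curve tracks 2 * sqrt(x).
print(cbind(x = x_small, toy = toy_vst(x_small), sqrt_form = 2 * sqrt(x_small)))

# For large counts it tracks the logarithmic form.
print(cbind(x = x_large, toy = toy_vst(x_large), log_form = log_form(x_large)))
```

The two printed tables should show the toy transformation agreeing closely with the square-root form at small values and with the logarithmic form at large values.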

written 9 months ago by Wolfgang Huber

That being said, as a novice in differential expression analysis who has been following the DESeq2 pipeline: is there a better transformation method? Or when is rlog superior to vst and vice versa, if that's even a meaningful question?

Many thanks.

written 9 months ago by tarun20

I recommend the VST now, which can be run quickly with vst(). rlog() is slow and relies more heavily on assumptions about the distribution of the data.

written 9 months ago by Michael Love

Thanks for the clarifications.

That brings me to an additional question: will the number and list of differentially expressed genes change depending on whether we use rlog or vst with the blind=FALSE argument, or will they be the same?

Please advise.

written 9 months ago by tarun20

It will be different because the output is different. Using blind=FALSE avoids overestimating the variance, and so I tend to use this approach.

written 9 months ago by Michael Love

Hi Michael,

I'm getting a similar list and number of DEGs.

I'm doing a differential expression analysis of drought tolerance under reproductive-stage drought stress, with two contrasting genotypes and two conditions, four replicates each.

My code is as follows; please comment if there's something wrong with the way I've written it.

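## this data frame becomes the sample table (colData) of the DESeqDataSet ##
## (no sep argument: rep() ignores it, and data.frame() would turn it into a spurious "sep" column) ##
colData <- data.frame(genotype=rep(c("IL","Swarna"),each=8),condition=rep(rep(c("Control","Drought"),each=4),times=2))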

rownames(colData) <- colnames(tx.all$counts)

 

dds <- DESeqDataSetFromTximport(tx.all, colData, formula(~genotype+condition+genotype:condition))

 

colData(dds)$condition<-relevel(colData(dds)$condition, ref = "Control")

 

dds$group<-factor(paste0(dds$genotype, dds$condition)) ##combine the factors of interest into a single factor with all combinations of the original factors##
design(dds) <- ~group ##change the design to include just this factor##

 

#apply the most minimal filtering rule: removing rows of the DESeqDataSet that have no counts, or only a single count across all samples##
nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

 

For the transformation I made separate runs using rlog and vst.

rld<-rlogTransformation(dds,blind=FALSE) 
head(assay(rld), 3)

vsd <- vst(dds, blind=FALSE)
head(assay(vsd), 3)

dds<-DESeq(dds, betaPrior = TRUE, parallel = TRUE)
resultsNames(dds)

 

res.05_NILD_NILC <- results(dds, contrast=c("group","ILDrought", "ILControl"), alpha=.05, parallel = TRUE)
res.05_SWAD_SWAC <- results(dds, contrast=c("group","SwarnaDrought", "SwarnaControl"), alpha=.05, parallel = TRUE)
res.05_NILC_SWAC <- results(dds, contrast=c("group","ILControl", "SwarnaControl"), alpha=.05, parallel = TRUE)
res.05_NILD_SWAD <- results(dds, contrast=c("group","ILDrought", "SwarnaDrought"), alpha=.05, parallel = TRUE)

 

resSig <- subset(res.05_NILD_SWAD, padj < 0.05)
summary(resSig)
table(resSig$padj < 0.05)

 

write.csv(as.data.frame(resSig),file="NILDroughtvsSwarnaDrought_gene-level_ressig-padjat0.05_vsd.csv")

I'm getting the same list and number of DEGs whether I use rlog or vst in separate runs. How should I go about correcting this? Please advise.

Many thanks,

Asher

 

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_1.1.1                tximport_1.2.0             genefilter_1.56.0         
 [4] pheatmap_1.0.8             RColorBrewer_1.1-2         ggplot2_2.2.1             
 [7] gplots_3.0.1               DESeq2_1.14.1              SummarizedExperiment_1.4.0
[10] Biobase_2.34.0             GenomicRanges_1.26.4       GenomeInfoDb_1.10.3       
[13] IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0   

 

written 9 months ago by tarun20

rlog() and vst() are only for visualization and have no effect on the DESeqDataSet where the counts are stored and modeled when you run DESeq().

See the first sentence here: http://www.bioconductor.org/help/workflows/rnaseqGene/#exploratory-analysis-and-visualization
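A minimal sketch of this point on simulated data, assuming DESeq2 is installed: the transformation returns a separate object, and the counts stored in the DESeqDataSet are left as they were. It uses makeExampleDESeqDataSet() for toy counts and calls varianceStabilizingTransformation() directly (vst() is a faster wrapper around it); the object names are only illustrative.

```r
# Sketch on simulated data; skipped if DESeq2 is not installed.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  library(DESeq2)
  set.seed(1)

  # Simulated count data with the default two-group design
  dds <- makeExampleDESeqDataSet(n = 2000, m = 8)
  counts_before <- counts(dds)

  # The transformation returns a new DESeqTransform object ...
  vsd <- varianceStabilizingTransformation(dds, blind = FALSE)

  # ... and the counts stored in dds are not modified by it.
  unchanged <- identical(counts_before, counts(dds))
  print(unchanged)
}
```

Since DESeq() models the counts in dds, and those counts are untouched, the list of differentially expressed genes does not depend on which transformation was run beforehand.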

written 9 months ago by Michael Love

Thanks, Michael,

That is my understanding as well, based on your paper and on reading in the DESeq2 community.

So using either rlog or vst will not (greatly) affect the differential expression analysis, and getting the same number and list of differentially expressed genes is okay and has nothing to do with whether I used rlog or vst?

With my sample size being n<30 (16 samples), would using rlog be better than using vst?

Please advise.

Thanks and best regards,

Asher

written 9 months ago by tarun20

You can really use either. See the vignette for all we have to say about their differences.

written 9 months ago by Michael Love