Question: Deseq2 rlog or 'regular' normalization?
0
3.7 years ago by
United States
sunil.mangalam0 wrote:

Hi,

I am facing some difficulties with data transformations of my single cell RNA-Seq data analyzed using DESeq2. This data set is different from typical RNA-Seq experiments.. For example, there is a subset of genes which will be present in one group and totally absent in the other, unlike typical data sets where down regulated genes will still be expressed at lower level. The variation between replicates is also high, and so, we have at least ten replicates for each condition. Even with all these limitations, I am able to get a meaningful result from this analysis.

But the rlog transformation is not optimal for my analysis. I get a warning that more than 10% of the genes have outliers and it suggests doing vst.The vst works for without any warning, but I am still worried if this is optimal or if there is a way to do a better transformation for making heat maps and PCA.

Also, is there a way to extract the normalized values without any transformation? DESeq used to output this, but this function is not in the DESeq2 vignette.

SunilSukumaran

Research Associate

Monell Chemocal Senses Center

USA

modified 3.7 years ago • written 3.7 years ago by sunil.mangalam0
Answer: Deseq2 rlog or 'regular' normalization?
0
3.7 years ago by
Simon Anders3.5k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.5k wrote:

If you just want normalized counts, use

counts( dds, normalized=TRUE )

This hasn't changed since DESeq1.

If you need advice on using VST or rlog, please first tell us what downstream analysis you want to perform with the transformed data.

Sorry, I got confused. Is rlg(dds) is logarithm transformation? Then if we just want to normalize the data without any transformation, counts( dds, normalized=TRUE ) would be the code?

rlog and vst give log2-like transformed counts. This is also covered in the vignette in detail.

Thank you,

Sorry, how I can explain the the second principal component between two data sets?? I mean I have applied plotPCA by data from rld, I want to know the reason behind the PCA2 that make two data sets separate not overlapped.

You may want to read some internet references on PCA (there are loads). Briefly PC2 is the dimension which captures the second most amount of variance in the samples, while being orthogonal (you can think, roughly, "pointing in a totally different direction") to PC1.

Thanks a lot

I was plotting PCA with un transformed data

dds=DESeqDataSetFromMatrix(countData = merged,colData = mycols, design        =       ~ condition)

dds.norm <-  estimateSizeFactors(dds)

plotPCA(dds.norm, intgroup = "condition")
Error in (function (classes, fdef, mtable)  :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"DESeqDataSet"’

How can I plot PCA without transformation please?

I also tried

mat=log2(counts(dds.norm, normalized=TRUE)+epsilon)

plotPCA(mat, intgroup = "condition")
Error in (function (classes, fdef, mtable)  :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"matrix"’

hi jivara,

We have extensive documentation. Please read that beforehand. There is a vignette:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

And a workflow for beginners:

http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html

Answer: Deseq2 rlog or 'regular' normalization?
0
3.7 years ago by
United States
sunil.mangalam0 wrote:

I want to do a principal component analysis, calculate Pearson's correlation coefficient between pairwise comparison of all samples, and a hierarchical clustering and heat map where I use only a subset of genes that pass a  differential expression cutoff (typically fold change +/-2, padj 0.05.

Reading through some other posts, I noticed some cases where using  DESeq(dds, betaPrior=FALSE) is advised in cases where differences between the two conditions are very large. Do you advice trying this setting?

Thanks

Sunil

the betaPrior advice is probably not relevant here, that was for a particular other case.

the rlog warning is thrown when the majority (90%) of the row sum occurs in a single sample for many rows (more than 10%) with medium to high expression. so this is very sparse data which does not resemble a negative binomial distribution for many genes.

It's not clear to me what's the best transformation, or how to deal with data which looks like, e.g.

0,1,0,0,0 vs 0,1,0,0,10000

You can imagine lots of scenarios: Is the 10000 the real signal? Are the 0's drop-outs (missing values) or low expression? Performance depends on all the assumptions about signal and noise and on the metrics for evaluation, which is an area of active research in single-cell.

The point of the warning is that, if you want to try a DESeq2 transformation you should use the VST and not the rlog.

Thanks Michael,

The behavior you describe is exactly what I see for many genes (14.2% in my dataset to be exact).I think there will be dropouts in many cases , for example when genes have a different intron/exon structure in the cell type we analyze  than what is in the gtf files we use for analysis.

I tried to derive Pearson's correlation coefficient with my vst values and it was not very meaningful. But the untransformed normalized values gave more meaningful results. But for a heatmap where I included only genes with padj <0.05, the vst values were better...

Sunil