Question

Deseq2 rlog or 'regular' normalization?

0

Entering edit mode

sunil.mangalam • 0

@sunilmangalam-8881

Last seen 8.6 years ago

United States

Hi,

I am facing some difficulties with data transformations of my single cell RNA-Seq data analyzed using DESeq2. This data set is different from typical RNA-Seq experiments.. For example, there is a subset of genes which will be present in one group and totally absent in the other, unlike typical data sets where down regulated genes will still be expressed at lower level. The variation between replicates is also high, and so, we have at least ten replicates for each condition. Even with all these limitations, I am able to get a meaningful result from this analysis.

But the rlog transformation is not optimal for my analysis. I get a warning that more than 10% of the genes have outliers and it suggests doing vst.The vst works for without any warning, but I am still worried if this is optimal or if there is a way to do a better transformation for making heat maps and PCA.

Also, is there a way to extract the normalized values without any transformation? DESeq used to output this, but this function is not in the DESeq2 vignette.

Thanks for your help!

SunilSukumaran

Research Associate

Monell Chemocal Senses Center

Philadelphia

USA

Deseq2 rlog transformation data visualization • 13k views

ADD COMMENT • link 10.3 years ago sunil.mangalam • 0

score 0 · Answer 1 · 2015-09-27

0

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 5.5 years ago

Zentrum für Molekularbiologie, Universi…

If you just want normalized counts, use

counts( dds, normalized=TRUE )

This hasn't changed since DESeq1.

If you need advice on using VST or rlog, please first tell us what downstream analysis you want to perform with the transformed data.

ADD COMMENT • link 10.3 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Sorry, I got confused. Is rlg(dds) is logarithm transformation? Then if we just want to normalize the data without any transformation, counts( dds, normalized=TRUE ) would be the code?

Please help me to be clear, I just need either normalization or transformation

ADD REPLY • link 7.7 years ago AZ ▴ 20

0

Entering edit mode

rlog and vst give log2-like transformed counts. This is also covered in the vignette in detail.

ADD REPLY • link 7.7 years ago Michael Love 43k

0

Entering edit mode

Thank you,

Sorry, how I can explain the the second principal component between two data sets?? I mean I have applied plotPCA by data from rld, I want to know the reason behind the PCA2 that make two data sets separate not overlapped.

ADD REPLY • link 7.7 years ago AZ ▴ 20

0

Entering edit mode

You may want to read some internet references on PCA (there are loads). Briefly PC2 is the dimension which captures the second most amount of variance in the samples, while being orthogonal (you can think, roughly, "pointing in a totally different direction") to PC1.

ADD REPLY • link 7.7 years ago Michael Love 43k

0

Entering edit mode

Thanks a lot

I was plotting PCA with un transformed data

dds=DESeqDataSetFromMatrix(countData = merged,colData = mycols, design = ~ condition)

dds.norm <- estimateSizeFactors(dds)

plotPCA(dds.norm, intgroup = "condition")
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"DESeqDataSet"’

How can I plot PCA without transformation please?

I also tried

mat=log2(counts(dds.norm, normalized=TRUE)+epsilon)

plotPCA(mat, intgroup = "condition")
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"matrix"’

ADD REPLY • link 7.7 years ago AZ ▴ 20

0

Entering edit mode

hi jivara,

We have extensive documentation. Please read that beforehand. There is a vignette:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

And a workflow for beginners:

http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html

ADD REPLY • link 7.7 years ago Michael Love 43k

score 0 · Answer 2 · 2015-09-27

0

Entering edit mode

sunil.mangalam • 0

@sunilmangalam-8881

Last seen 8.6 years ago

United States

Thanks a lot for for your reply, Simon.

I want to do a principal component analysis, calculate Pearson's correlation coefficient between pairwise comparison of all samples, and a hierarchical clustering and heat map where I use only a subset of genes that pass a differential expression cutoff (typically fold change +/-2, padj 0.05.

Reading through some other posts, I noticed some cases where using DESeq(dds, betaPrior=FALSE) is advised in cases where differences between the two conditions are very large. Do you advice trying this setting?

Thanks

Sunil

ADD COMMENT • link 10.3 years ago sunil.mangalam • 0

0

Entering edit mode

the betaPrior advice is probably not relevant here, that was for a particular other case.

the rlog warning is thrown when the majority (90%) of the row sum occurs in a single sample for many rows (more than 10%) with medium to high expression. so this is very sparse data which does not resemble a negative binomial distribution for many genes.

It's not clear to me what's the best transformation, or how to deal with data which looks like, e.g.

0,1,0,0,0 vs 0,1,0,0,10000

You can imagine lots of scenarios: Is the 10000 the real signal? Are the 0's drop-outs (missing values) or low expression? Performance depends on all the assumptions about signal and noise and on the metrics for evaluation, which is an area of active research in single-cell.

The point of the warning is that, if you want to try a DESeq2 transformation you should use the VST and not the rlog.

ADD REPLY • link 10.3 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael,

The behavior you describe is exactly what I see for many genes (14.2% in my dataset to be exact).I think there will be dropouts in many cases , for example when genes have a different intron/exon structure in the cell type we analyze than what is in the gtf files we use for analysis.

I tried to derive Pearson's correlation coefficient with my vst values and it was not very meaningful. But the untransformed normalized values gave more meaningful results. But for a heatmap where I included only genes with padj <0.05, the vst values were better...

Sunil

ADD REPLY • link 10.3 years ago sunil.mangalam • 0