Search
Question: DESeq2 normalization and group-wise or pair-wise comparisons
0
gravatar for bge
14 days ago by
bge0
bge0 wrote:

Hi,

I have data for 26 different tissues and I want to perform exhaustive
pairwise comparisons between them. I want to find normalized expression
levels for the samples and differentially expressed genes between tissues.

For both I intend to use DESeq2.

From my reading and testing, I believe that the samples have two
undesirable characteristics:

  o  the replicates of a two tissues show high 'within-group variability'.
     The manual explains to me that this can bias the dispersion calculations.

  o  the sample data have a wide range of coverage: when I run DESeq2 on all
     of the samples at once and calculate the size factors, I find a range
     of .1 to 41.

My questions are

  o  is it reasonable to use DESeq2 expression level values rather
     than using read counts to normalize samples?

  o  is there an objective way to decide whether to run the differential
     expression analysis 'all together or split into pairs of groups?'

(I have not found satisfying answers to these questions in my web searches
and manual reading.)

As always, I am grateful for assistance and consideration.

Brent

ADD COMMENTlink modified 14 days ago • written 14 days ago by bge0

Hi,

I had made a PCA plot and saw that replicates of most tissues cluster reasonably well but the replicates that belong to two tissues spread across about half of the vertical extent of the plot. I know that this is bad, it distorts the 'globally' calculated dispersion values. Is there an objective or semi-objective way to decide how much within-goup variation is too much? (Additionally, I made scatter plots comparing log2fc, p-values, and padjust to compare the two results for one of the pair-wise comparisons. These show large differences, especially the p-values and padjust, which show an alarming lack of correlation. More usefully, the 'split into pairs' analysis gives 878 significant genes whereas the 'all together' analysis gives 566 significant genes, which is consistent with exaggerated dispersion values in the 'all together' analysis. More reassuringly, about 85% of the 566 genes are common to the results of both analyses.)

Also, I used your plotDispEsts() function on both the two condition run and the all condition run. The two condition run shows a lot of dispersion shrinkage, perhaps by between 1/4 and 1/3, whereas the the all condition run shows nearly no shrinkage.

I garbled my first question. I want to find gene expression levels by condition. I can use something akin to RPKM or benefit from the DESeq2 size factors by using the gene expression mean value, mu_ij, divided by the sample size factor to find 'q_ij'. Is it reasonable to take the latter approach? I suppose this depends, again, on the nature of the data.

Thank you for taking the time to consider these questions.

Brent

 

ADD REPLYlink modified 13 days ago • written 14 days ago by bge0

hi Brent,

Depending on how many samples you have per tissue, you might consider to run DESeq2 on pairs, if those two tissues would distort the global dispersion value. The case where I think you still are better off estimating dispersion over all tissues is if you have low sample size per tissue, and here I might say, less than 5 samples per tissue. There is no perfect rule though. There are advantages and disadvantages to different approaches. 

Your second comment makes sense: shrinkage is reduced when there are more samples over which to estimate the parameter.

"I want to find gene expression levels by condition. I can use something akin to RPKM or benefit from the DESeq2 size factors by using the gene expression mean value, mu_ij, divided by the sample size factor to find 'q_ij'. Is it reasonable to take the latter approach? I suppose this depends, again, on the nature of the data."

The answer to this depends on how you made the gene count matrix. If you used tximport() upstream, you could just work with the txi$abundance table. Otherwise, you need to divide out gene length to obtain something that is on the scale of gene expression across genes. How did you input the count matrix?

ADD REPLYlink written 13 days ago by Michael Love15k

Hi Michael,

Regarding the together versus split question, thank you for pointing out the shrinkage dependence on the number of analyzed samples. I am looking at plotDispEsts() plots for analyses of tissue subsets and I was confusing the effects of within-group variability and the number of tissues. I will read yet again the DESeq2 paper.

Your advice on using the DESeq2 gene expression mean value data is very helpful too. I did not keep in mind the effect of gene length.

I am grateful for your illuminating comments.

Best Wishes,

Brent

ADD REPLYlink written 12 days ago by bge0
0
gravatar for Michael Love
14 days ago by
Michael Love15k
United States
Michael Love15k wrote:

Can you say more about the within-group variability? The issue described in the vignette is when some groups cluster tightly and others are very spread out in a PCA plot for example.

ADD COMMENTlink written 14 days ago by Michael Love15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 264 users visited in the last hour