Question

How do use disperions estimated from my within group estimates to calculate differential expression between groups?

0

Entering edit mode

rna seq ▴ 90

@rna-seq-4145

Last seen 7.7 years ago

Hello,

I have 8 samples. 4 control and 4 treated. I can obtain a within sample dispersion estimate if I make a dds object using only the control samples.

condition <- factor(c(rep("Control1", 2), rep("Control2", 2)))

And obtain dispersion estimate from this WITHIN group comparison.

I would like to use the dispersion estimates obtained from this WITHIN group comparison and apply it to the BETWEEN group comparison

condition <- factor(c(rep("Control", 4), rep("Treatment", 4)))

in order to perform differential expression analysis on the BETWEEN group comparison using the WITHIN group dispersion estimate.

Can someone tell me how to accomplish this.

TIA.

Deseq2 estimatedispersions • 1.6k views

ADD COMMENT • link 7.8 years ago • updated 7.7 years ago rna seq ▴ 90

score 0 · Answer 1 · 2016-07-21

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 55 minutes ago

United States

Can you explain why you want to do this, rather than using the standard dispersion estimates of DESeq2 which takes into account all of the samples and their expected counts given the design? You may end up with some spurious results if you ignore the variability of the treated samples.

ADD COMMENT • link 7.8 years ago Michael Love 41k

score 0 · Answer 2 · 2016-07-26

0

Entering edit mode

rna seq ▴ 90

@rna-seq-4145

Last seen 7.7 years ago

Hi Michael,

The majority of my genes are differentially expressed between conditions.

If I plot the estimated dispersion WITHIN conditions either treated or control, my dispersions are much less then when I include all samples both WITHIN and BETWEEN groups.

Perhaps I am missing something here/not understanding the model, but it seems reasonable to estimate the biological variance of a gene using only the within group variance of a gene especially if you have enough samples to perform this estimate?

Thanks for your help.

ADD COMMENT • link 7.8 years ago rna seq ▴ 90

0

Entering edit mode

hi,

Note that you can add a comment to an existing answer by clicking Add Comment. You've added a new answer, which is responding to the top post, which was your original question.

So you haven't posted your code, but typically, when you include a design formula that specifies which samples are in which group then the dispersion estimates are based off of the within group variance.

You should post all your code and your sessionInfo() so I and others can determine if there is an error.

ADD REPLY • link 7.8 years ago Michael Love 41k

0

Entering edit mode

Hi,

Here is my code:

b_frame<-read.csv("input_data.txt", header=T, row.names=1)
#b_frame<-na.omit(b_frame)
condition <- factor(c(rep("A", 3), rep("B", 3)))
##create column data
coldata <- data.frame(row.names=colnames(b_frame), condition)

##make DESeqDataSetFromMatrix
dds<-DESeqDataSetFromMatrix(countData=b_frame, colData=coldata, design=~condition)
dds<-estimateSizeFactors(dds)
dds <- DESeq(dds)
res <- results(dds)
table(res$padj<0.05)
## Order by adjusted p-value
res <- res[order(res$padj), ]
## Merge with normalized count data

resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
write.csv(resdata, file="A1_Jenning_DESeq2.csv")

##AND sessionInfo()

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

l

ADD REPLY • link 7.7 years ago rna seq ▴ 90

0

Entering edit mode

Previously you said you had 8 samples, 4 control and 4 treated, and within control there was some grouping of 2 vs 2. Here you have 3 control and 3 treated.

The point is, it matters which dispersion is estimated. If there are differences among the control and treated you can account for that with a batch term in the design. But I'm still not sure exactly how many samples you have and what is the difference between control1 and control2, so I can't give concrete advice.

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Let us say I have 8 samples. The first 4 are control and the second 4 are treatment.

b_frame<-read.csv("input_data.txt", header=T, row.names=1)
#b_frame<-na.omit(b_frame)
condition <- factor(c(rep("control", 4), rep("treatment", 4)))
##create column data
coldata <- data.frame(row.names=colnames(b_frame), condition)

##make DESeqDataSetFromMatrix
dds1<-DESeqDataSetFromMatrix(countData=b_frame, colData=coldata, design=~condition)
dds1<-estimateSizeFactors(dds1)
dds1 <- DESeq(dds1)

"I am not sure if I am accounting for the differences with a batch term here, I am assuming the condition term did this automatically?"

From the dds1 object I get the estimated dispersions.

I was curious to see what the estimated dispersions were from this file so I split the control samples in half so control1 is the first 2 samples of the original control and control2 is the second 2 samples of the original control and performed the same analysis

b_frame<-read.csv("input_data.txt", header=T, row.names=1)
c_frame<-b_frame[,1:4]
condition <- factor(c(rep("control1", 2), rep("control2", 2)))
##create column data
coldata <- data.frame(row.names=colnames(b_frame), condition)

##make DESeqDataSetFromMatrix
dds1<-DESeqDataSetFromMatrix(countData=c_frame, colData=coldata, design=~condition)
dds2<-estimateSizeFactors(dds2)
dds2 <- DESeq(dds2)

When I compare the estimatedDispersions from dds2 to dds1 the estimatedDispersions are much smaller in dds2.

Thanks

ADD REPLY • link 7.7 years ago rna seq ▴ 90

0

Entering edit mode

I'm sorry, I'm having a hard time following this question with the hypotheticals. The dispersion estimation depends on which samples you provide. If you provide the software with control and treated samples, it estimates using the within group variance of those samples. If you only provide it the control samples, and create an artificial division among those, it will use the within-(artificial)-group variance of those samples. If the control samples have less variability than the treatment samples, then the second estimate will be lower (although this is not the proper way to estimate control sample variability, because you've made an artificial division among these samples). Still, any lower estimate you derive from the control samples alone shouldn't be used to compare control and treated, because, as I've said previously, you want the model to use the variability of all the samples in the comparison, not a subset.

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

WIll the following setup estimate within group variances for the control and treatment?

library(DESeq2)

c_frame<-read.csv("data.csv", header=T, sep=",", row.names=1)

condition <- factor(c(rep("control", 4), rep("treatment", 4)))

coldata <- data.frame(row.names=colnames(c_frame), condition)

dds<-DESeqDataSetFromMatrix(countData=c_frame, colData=coldata, design=~condition)

dds<-estimateSizeFactors(dds)

dds <- DESeq(dds)

res <- results(dds)

Also, if I have a dataset the has a greater fold change dynamic range than RNAseq but also has a larger variance, is there a specific estimateDispersions parameter to use? Is there an appropriate test to use with the results functions?

Thanks for your patience and help.

ADD REPLY • link 7.7 years ago rna seq ▴ 90

0

Entering edit mode

"WIll the following setup estimate within group variances for the control and treatment?"

yes.

"Also, if I have a dataset the has a greater fold change dynamic range than RNAseq but also has a larger variance, is there a specific estimateDispersions parameter to use? Is there an appropriate test to use with the results functions?"

You don't have to do anything special. The DESeq() functions will handle estimation regardless of the range of fold changes and within-group variance.

ADD REPLY • link 7.7 years ago Michael Love 41k

score 0 · Answer 3 · 2016-08-03

0

Entering edit mode