Question

Confused in the usage of formula in DESeq or DESeq2

2

Entering edit mode

skyoflcz ▴ 20

@skyoflcz-6938

Last seen 9.5 years ago

China

Hi Michael,

I’ve tried to apply DESeq in two sample comparison, and it works well. It is really great for the NGS data analysis.

Now I tried to analysis multiple groups. Though I checked the section about glm model and refer to limma package and look for some answers in BioC posts, I was still stuck and confused. Will you help me to clarify it?

The first question:

For example, if the experiment contain two genotypes(”W“ and ”M“), and each with a factor Treatment (“U” and ”T” stand for “untreated“ and ”treated“ respectively). So I got four groups (“WU”,”WT”,”MU”,”MT”).

Now I want to find out genes response to treatment independent of genotypes, so I should compared WT vs WU, and get a differential expression gene set designed as A, while MT vs MU got a list B, and then find out the intersection of A and B(A∩B)?

If I want to find out genes that differentially response to treatment between genotype, i.e. the differences between set A and Set B as (A∪B-A∩B)?

Are those cases related to the formulas as follows:

fit0=fitNbinomGLMs(countdata,count~genotype+treatment)
fit1=fitNbinomGLMs(countdata,count~genotype+treatment+genotype:treatment

How should I design my formulas for that two cases (in DESeq or DESeq2)? And how to extract the results?

Some guys told me that they would like to design a group with four levels (“WU”,”WT”,”MU”,”MT”).

And code like:

fit0=fitNbinomGLMs (countdata, count~1)

fit1=fitNbinomGLMs (countdata,count~group)

And make pair contrast, and then combine results. Will it make the same thing?

The second question:

About the heat map. I read through the “Differential expression of RNA-Seq data at the gene level -the DESeq package”, and noticed that the heat map can be generated with raw counts after some kind of transform like “Variance stabilizing transformation” (DESeq2 might have different kind of transform). I may misunderstood, but the data for transform has not yet been tested? So whether it is significant is not yet supported by the statistic?

So can we just used the log2foldchange values?

fit0=fitNbinomGLMs (countdata, count~1)

fit1=fitNbinomGLMs (countdata,count~group)

pvalsGLM = nbinomGLMTest( fit1, fit0 )

padjGLM = p.adjust( pvalsGLM, method="BH" )

fit1.padj<-rbind (fit1, padjGLM)

After the filtration (padjGLM<0.1, make sure I did not misunderstand it, this value means, at least one group showed a significance with control?)

Use the log fold2change values to make a heatmap?

These questions trouble me a lot last week. I will appreciate your any kind help. Thank you.

Best wishes!

Chengzhi Liu

deseq2 deseq • 9.0k views

ADD COMMENT • link updated 6.9 years ago by bio2001 ▴ 60 • written 9.5 years ago by skyoflcz ▴ 20

score 3 · Answer 1 · 2014-10-28

3

Entering edit mode

Michael Love 41k

@mikelove

Last seen 19 hours ago

United States

"Now I want to find out (1) genes response to treatment independent of genotypes ... (2) genes that differentially response to treatment between genotype"

I'll give you the design and results for DESeq2 (we recommend users switch to DESeq2). I'll assume W is the base level for genotype and U is the base level for treatment (see vignette for instructions on this).

Use a design ~ genotype + treatment + genotype:treatment. The interaction term is for testing genes which respond differently to treatment across genotype. It accounts for the difference seen in the samples which are M genotype and T treatment beyond what would be expected given the M vs W effect in the U samples and the T vs U effect in the W samples. It helps to check a statistical reference on interaction models to understand how this works, or find a local statistician who can answer further questions.

To extract results, these are the genes which answer your question I labeled (2) above:

results(dds, name="genotypeM.treatmentT")

The treatment effect for genotype W is:

results(dds, contrast=c("treatment","T","U"))

The treatment effect for genotype M is these two terms combined. To build a results table for this:

results(dds, contrast=list(c("treatment_T_vs_U","genotypeM.treatmentT")))

"I may misunderstood, but the data for transform has not yet been tested? So whether it is significant is not yet supported by the statistic?"

Yes, the transformed data has nothing to do with statistical testing, unless you subset the transformed data to those rows with a low p-value from the tests.

From the DESeq2 vignette FAQ:

"5.4 How do I use the variance stabilized or rlog transformed data for differential testing? The variance stabilizing and rlog transformations are provided for applications other than differential testing, for example clustering of samples or other machine learning applications. For differential testing we recommend the DESeq function applied to raw counts as outlined in Section 1.3. "

ADD COMMENT • link 9.5 years ago Michael Love 41k

0

Entering edit mode

Thank you for your quick response and detailed instruction. Your kind answer really helps me a lot.

I now try to make it with DESeq2.

Could you possible help me to confirm my understanding? I hope it will not take you too much time.

“ I'll assume W is the base level for genotype and U is the base level for treatment”

Yes, you are right.

Follow your instruction, my codes are as following:

> head(lldata) #counts data

WU WT MU MT

gene_0001 192 422 440 307

gene_0002 554 613 554 1289

gene_0003 5274 564 13660 784

gene_0004 54 48 64 46

gene_0005 186 195 194 126

gene_0006 348 328 360 300

> genotype <- factor(rep(c("W","M"),each=2),level=c("W","M"))
> treatment<- factor(rep(c("U","T"),2),level=c("U","T"))
> colData <- data.frame(row.names=names(lldata),genotype=genotype,treatment=treatment)
> dds <- DESeqDataSetFromMatrix(
         countData=lldata,
         colData=colData,
         design=~genotype+treatment+genotype:treatment)
> dds <- DESeq(dds)
> resultsNames(dds)

[1] "Intercept"            "genotype_M_vs_W"      "treatment_T_vs_U"     "genotypeM.treatmentT"

Result1 <-results(dds, name="genotypeM.treatmentT")
Result2 <-results(dds,contrast=c("treatment","T","U")
Result3 <-results(dds, contrast=list(c("treatment_T_vs_U","genotypeM.treatmentT")))

Are these OK?

"To extract results, these are the genes which answer your question I labeled (2) above:”

So,

1. The Result2 and the Result3 give out genes that genotype-specific differentially response to treatment beyond the common effect of treatment?

2. Result1 lists the genes response to treatment independent of genotypes, i.e. the common effect of treatment (or the intersection of (MT-MU) and (WT-WU))?

3. In fact, I can’t quite catch which groups are compared by the argument ”contrast=c("treatment","T","U") .

“The variance stabilizing and rlog transformations are provided for applications other than differential testing, for example clustering of samples or other machine learning applications. ”

According to the vignette, it should be the raw counts to be fed to DESeq(1&2).So the transform data should not contain the gene length information.

4. I wonder when clustering, whether should gene length be taken into account? If so, is it reasonable to do the rpkm on the transformed data?

Thank you very much.

Liu,

ADD REPLY • link 9.5 years ago skyoflcz ▴ 20

0

Entering edit mode

"1. The Result2 and the Result3 give out genes that genotype-specific differentially response to treatment beyond the common effect of treatment?"

not exactly. I think I described these two results tables clearer in my post above.

The treatment effect for genotype W is:

results(dds, contrast=c("treatment","T","U"))

The treatment effect for genotype M is:

results(dds, contrast=list(c("treatment_T_vs_U","genotypeM.treatmentT")))

"2. Result1 lists the genes response to treatment independent of genotypes, i.e. the common effect of treatment (or the intersection of (MT-MU) and (WT-WU))?"

Not really. The interaction is the difference in the treatment effect between the genotypes.

It's not the same as intersecting a set of genes with small p-values from pairwise comparisons.

Interaction models are complicated, so you might benefit from speaking to a local statistician who can explain this, or looking up in a statistics reference, to understand what the results table represents.

"3. In fact, I can’t quite catch which groups are compared by the argument ”contrast=c("treatment","T","U") ."

This compares treated samples for genotype W with untreated samples for genotype W.

"4. I wonder when clustering, whether should gene length be taken into account? If so, is it reasonable to do the rpkm on the transformed data?"

Dividing out a common gene length (or subtracting out from the log counts) will have no effect on sample clustering. Dividing out gene length for clustering genes might make more sense, but we often go a step further and just center the data along the genes, to look for common patterns regardless of the range of counts (whether due to length or expression strength).

Note that v1.2 is 1 year old and 2 releases behind the current v1.6. We recommend you use the current version of Bioconductor.

ADD REPLY • link 9.5 years ago Michael Love 41k

0

Entering edit mode

It is really kind of you.

I’m sorry sometimes I might just have some misunderstandings in English.

Could you possible help me to answer these three more questions?

If a gene A was differential expression between WT and WU, and A also show differential between MT and MU, which result (result1, result2, result3 or none of the three) will only contain genes of that kind.

If a gene B show differential expression between WT and WU, but not between MT and MU, which result (result1, result2, result3 or none of the three) will only contain genes of that kind.

If a gene B show differential expression between MT and MU, but not between WT and WU, which result (result1, result2, result3 or none of the three) will only contain genes of that kind.

If these were answered, I think I will catch it…

Thanks a lot

ADD REPLY • link 9.5 years ago skyoflcz ▴ 20

1

Entering edit mode

these can't be answered so simply, because the size of the differential expression (the log fold change) matters here.

the tables you defined above are:

(1) interaction term, (2) treatment effect in W, (3) treatment effect in M

with enough statistical power, you would generally expect small p-values for these tables:

if a gene is DE in W, but not in M: (2) and (1)
if a gene is DE in M, but not in W: (3) and (1)
if a gene is DE in W and M, but a different size of difference: (1), (2), (3)
if a gene is DE in W and M, and the same size of difference: (2) and (3)

So you see, (1) will have small p-values in a number of cases. Note that these are general trends, and not hard rules. This does not mean that (1) will have small p-value whenever (2) has p-value < alpha and (3) has p-value > alpha. It's not determined like this.

ADD REPLY • link 9.5 years ago Michael Love 41k

0

Entering edit mode

I think I should update my statistic knowledge …

"if a gene is DE in W, but not in M: (2) and (1)

if a gene is DE in M, but not in W: (3) and (1)

if a gene is DE in W and M, but a different size of difference: (1), (2), (3)

if a gene is DE in W and M, and the same size of difference: (2) and (3) "

ADD REPLY • link 9.5 years ago skyoflcz ▴ 20

0

Entering edit mode

By checking the results I think I now understand it. It’s very kind of you to help me. Thank you very much!

ADD REPLY • link 9.5 years ago skyoflcz ▴ 20

0

Entering edit mode

Sorry to jump in here, but my question is similar enough that it doesn't really merit another thread. I'm interested in treatment effect in multiple genotypes, as above, but my treatment (and genotype) has more than two levels . There's a natural baseline for each treatment (U, T1, T2) , but not so for each genotype(W, U, V, Z). If I was working within limma, I'd have ~genotype/treatment-1 as my design, but I'm sticking with the ~a+b+a:b format in DESeq2. I can see that for my 'arbitrary baseline' genotype W your second form of results would (and does) work, but for the other genotypes, I don't get any resultsNames with 'vs' in them, so I can't use the 3rd format directly. Is contrast=list(c("treatmentT1", "genotypeV.treatmentT1") the correct format for getting the T1 effect in genotype V, for example. Doesn't seem very natural, sorry, coming from a limma background.

ADD REPLY • link 9.5 years ago Gavin Kelly ▴ 680

1

Entering edit mode

For your dataset, DESeq2 uses expanded model matrices (edit Oct 2015: note this no longer applies to current release v1.10), so that shrinkage is symmetric regardless of the way you order the factor levels of genotype or treatment. So you should use:

results(dds, contrast=list(c("treatmentT1", "genotypeV.treatmentT1"),c("treatmentU","genotypeV.treatmentU")))

for the T1 vs U effect for genotype V.

ADD REPLY • link 9.5 years ago • updated 8.5 years ago Michael Love 41k

score 0 · Answer 2 · 2014-10-29

Sorry, I forgot the sessionInfo in my comment.

>sessionInfo()

R version 3.0.2 (2013-09-25)

Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] RColorBrewer_1.0-5 gplots_2.14.2
[3] DESeq2_1.2.10 RcppArmadillo_0.4.450.1.0
[5] Rcpp_0.11.3 GenomicRanges_1.14.4
[7] XVector_0.2.0 IRanges_1.20.7
[9] Biobase_2.22.0 BiocGenerics_0.8.0
[11] pasilla_0.2.19 BiocInstaller_1.12.1

loaded via a namespace (and not attached):
[1] annotate_1.40.1 AnnotationDbi_1.24.0 bitops_1.0-6
[4] caTools_1.17.1 DBI_0.3.1 DESeq_1.14.0
[7] gdata_2.13.3 genefilter_1.44.0 geneplotter_1.40.0
[10] grid_3.0.2 gtools_3.4.1 KernSmooth_2.23-13
[13] lattice_0.20-29 locfit_1.5-9.1 RSQLite_1.0.0
[16] splines_3.0.2 stats4_3.0.2 survival_2.37-7
[19] tools_3.0.2 XML_3.98-1.1 xtable_1.7-4

score 0 · Answer 3 · 2017-06-09

Hi Michael,

I need help in clarifying one aspect in these comparisons.

I did run DESeq using a model similar to what was discussed in this thread.

~ genotype + treatment + genotype:treatment

I extract these results.

1. results(dds, name="genotypeM.treatmentT")

2. results(dds, contrast=c("treatment","T","U"))

3. results(dds, contrast=list(c("treatment_T_vs_U","genotypeM.treatmentT")))

Now I observe that the number of DE genes in #3 are less than #1.

#3 (the treatment effect for genotype 'M') have lesser number of DE genes than #1 (the interaction effect). Is this possible ? I think it may be due to more interaction effect than treatment effect.

But shouldn't the DE genes for treatment effect for genotype 'M' include the DE genes from #1 (and consequently larger) ?

Thanks for your suggestions.