Hi Michael,
I’ve tried to apply DESeq in two sample comparison, and it works well. It is really great for the NGS data analysis.
Now I tried to analysis multiple groups. Though I checked the section about glm model and refer to limma package and look for some answers in BioC posts, I was still stuck and confused. Will you help me to clarify it?
The first question:
For example, if the experiment contain two genotypes(”W“ and ”M“), and each with a factor Treatment (“U” and ”T” stand for “untreated“ and ”treated“ respectively). So I got four groups (“WU”,”WT”,”MU”,”MT”).
Now I want to find out genes response to treatment independent of genotypes, so I should compared WT vs WU, and get a differential expression gene set designed as A, while MT vs MU got a list B, and then find out the intersection of A and B(A∩B)?
If I want to find out genes that differentially response to treatment between genotype, i.e. the differences between set A and Set B as (A∪B-A∩B)?
Are those cases related to the formulas as follows:
fit0=fitNbinomGLMs(countdata,count~genotype+treatment)
fit1=fitNbinomGLMs(countdata,count~genotype+treatment+genotype:treatment
How should I design my formulas for that two cases (in DESeq or DESeq2)? And how to extract the results?
Some guys told me that they would like to design a group with four levels (“WU”,”WT”,”MU”,”MT”).
And code like:
fit0=fitNbinomGLMs (countdata, count~1)
fit1=fitNbinomGLMs (countdata,count~group)
And make pair contrast, and then combine results. Will it make the same thing?
The second question:
About the heat map. I read through the “Differential expression of RNA-Seq data at the gene level -the DESeq package”, and noticed that the heat map can be generated with raw counts after some kind of transform like “Variance stabilizing transformation” (DESeq2 might have different kind of transform). I may misunderstood, but the data for transform has not yet been tested? So whether it is significant is not yet supported by the statistic?
So can we just used the log2foldchange values?
fit0=fitNbinomGLMs (countdata, count~1)
fit1=fitNbinomGLMs (countdata,count~group)
pvalsGLM = nbinomGLMTest( fit1, fit0 )
padjGLM = p.adjust( pvalsGLM, method="BH" )
fit1.padj<-rbind (fit1, padjGLM)
After the filtration (padjGLM<0.1, make sure I did not misunderstand it, this value means, at least one group showed a significance with control?)
Use the log fold2change values to make a heatmap?
These questions trouble me a lot last week. I will appreciate your any kind help. Thank you.
Best wishes!
Chengzhi Liu
Thank you for your quick response and detailed instruction. Your kind answer really helps me a lot.
I now try to make it with DESeq2.
Could you possible help me to confirm my understanding? I hope it will not take you too much time.
“ I'll assume W is the base level for genotype and U is the base level for treatment”
Yes, you are right.
Follow your instruction, my codes are as following:
WU WT MU MT
gene_0001 192 422 440 307
gene_0002 554 613 554 1289
gene_0003 5274 564 13660 784
gene_0004 54 48 64 46
gene_0005 186 195 194 126
gene_0006 348 328 360 300
> genotype <- factor(rep(c("W","M"),each=2),level=c("W","M")) > treatment<- factor(rep(c("U","T"),2),level=c("U","T")) > colData <- data.frame(row.names=names(lldata),genotype=genotype,treatment=treatment) > dds <- DESeqDataSetFromMatrix( countData=lldata, colData=colData, design=~genotype+treatment+genotype:treatment) > dds <- DESeq(dds) > resultsNames(dds) [1] "Intercept" "genotype_M_vs_W" "treatment_T_vs_U" "genotypeM.treatmentT"
Are these OK?
"To extract results, these are the genes which answer your question I labeled (2) above:”
So,
1. The Result2 and the Result3 give out genes that genotype-specific differentially response to treatment beyond the common effect of treatment?
2. Result1 lists the genes response to treatment independent of genotypes, i.e. the common effect of treatment (or the intersection of (MT-MU) and (WT-WU))?
3. In fact, I can’t quite catch which groups are compared by the argument ”contrast=c("treatment","T","U") .
“The variance stabilizing and rlog transformations are provided for applications other than differential testing, for example clustering of samples or other machine learning applications. ”
According to the vignette, it should be the raw counts to be fed to DESeq(1&2).So the transform data should not contain the gene length information.
4. I wonder when clustering, whether should gene length be taken into account? If so, is it reasonable to do the rpkm on the transformed data?
Thank you very much.
Liu,
"1. The Result2 and the Result3 give out genes that genotype-specific differentially response to treatment beyond the common effect of treatment?"
not exactly. I think I described these two results tables clearer in my post above.
The treatment effect for genotype W is:
The treatment effect for genotype M is:
"2. Result1 lists the genes response to treatment independent of genotypes, i.e. the common effect of treatment (or the intersection of (MT-MU) and (WT-WU))?"
Not really. The interaction is the difference in the treatment effect between the genotypes.
It's not the same as intersecting a set of genes with small p-values from pairwise comparisons.
Interaction models are complicated, so you might benefit from speaking to a local statistician who can explain this, or looking up in a statistics reference, to understand what the results table represents.
"3. In fact, I can’t quite catch which groups are compared by the argument ”contrast=c("treatment","T","U") ."
This compares treated samples for genotype W with untreated samples for genotype W.
"4. I wonder when clustering, whether should gene length be taken into account? If so, is it reasonable to do the rpkm on the transformed data?"
Dividing out a common gene length (or subtracting out from the log counts) will have no effect on sample clustering. Dividing out gene length for clustering genes might make more sense, but we often go a step further and just center the data along the genes, to look for common patterns regardless of the range of counts (whether due to length or expression strength).
Note that v1.2 is 1 year old and 2 releases behind the current v1.6. We recommend you use the current version of Bioconductor.
It is really kind of you.
I’m sorry sometimes I might just have some misunderstandings in English.
Could you possible help me to answer these three more questions?
If a gene A was differential expression between WT and WU, and A also show differential between MT and MU, which result (result1, result2, result3 or none of the three) will only contain genes of that kind.
If a gene B show differential expression between WT and WU, but not between MT and MU, which result (result1, result2, result3 or none of the three) will only contain genes of that kind.
If a gene B show differential expression between MT and MU, but not between WT and WU, which result (result1, result2, result3 or none of the three) will only contain genes of that kind.
If these were answered, I think I will catch it…
Thanks a lot
these can't be answered so simply, because the size of the differential expression (the log fold change) matters here.
the tables you defined above are:
(1) interaction term, (2) treatment effect in W, (3) treatment effect in M
with enough statistical power, you would generally expect small p-values for these tables:
So you see, (1) will have small p-values in a number of cases. Note that these are general trends, and not hard rules. This does not mean that (1) will have small p-value whenever (2) has p-value < alpha and (3) has p-value > alpha. It's not determined like this.
I think I should update my statistic knowledge …
"if a gene is DE in W, but not in M: (2) and (1)
if a gene is DE in M, but not in W: (3) and (1)
if a gene is DE in W and M, but a different size of difference: (1), (2), (3)
if a gene is DE in W and M, and the same size of difference: (2) and (3) "
By checking the results I think I now understand it. It’s very kind of you to help me. Thank you very much!
Sorry to jump in here, but my question is similar enough that it doesn't really merit another thread. I'm interested in treatment effect in multiple genotypes, as above, but my treatment (and genotype) has more than two levels . There's a natural baseline for each treatment (U, T1, T2) , but not so for each genotype(W, U, V, Z). If I was working within limma, I'd have ~genotype/treatment-1 as my design, but I'm sticking with the
~a+b+a:b
format in DESeq2. I can see that for my 'arbitrary baseline' genotype W your second form of results would (and does) work, but for the other genotypes, I don't get any resultsNames with 'vs' in them, so I can't use the 3rd format directly. Iscontrast=list(c("treatmentT1", "genotypeV.treatmentT1")
the correct format for getting the T1 effect in genotype V, for example. Doesn't seem very natural, sorry, coming from a limma background.For your dataset, DESeq2 uses expanded model matrices (edit Oct 2015: note this no longer applies to current release v1.10), so that shrinkage is symmetric regardless of the way you order the factor levels of genotype or treatment. So you should use:
results(dds, contrast=list(c("treatmentT1", "genotypeV.treatmentT1"),c("treatmentU","genotypeV.treatmentU")))
for the T1 vs U effect for genotype V.