Question: A couple of questions about baseMean, RPKM, and multi-factor analysis for DESeq2
0
gravatar for xfwang
4.8 years ago by
xfwang20
United States
xfwang20 wrote:

 

 

Dear DESeq2 community,

I am trying to use DESeq2 to detect DEG for my RNAseq data because it can deal with multi-factor designs, although there is not only multi-factor analysis in my experiment data. Let me clarify my datasets firstly.

For my first and second dataset, there is only one factor with 2 levels (They are unbalanced experiments. One treatment level is of 4 biological replication and the other is of 5). So, what I need is just to do 2 pairwise comparisons for each of them by reading the count table (the counts are raw counts got from Htseq-count) by

STR_dds <- DESeqDataSetFromMatrix(countData = STR_count, colData = STR_colData, design = ~ trt )

PFC_dds <- DESeqDataSetFromMatrix(countData = PFC_count, colData = PFC_colData, design = ~ trt ), then doing the differential expression analysis

STR_dds <-DESeq(STR_dds)

PFC_dds <-DESeq(PFC_dds).

In the meantime, I also used  Cuffdiff to do the tests. However, when I got the results from these 2 pipelines (Cuffdiff and DESeq2), they are so different. Both of them are with FDR at 0.05. For one of the dataset, there are 325 genes detected as DEG from Cuffdiff, while it is only 6 genes that are differentially expressed by DESeq2. For another dataset, there are 25 DEG from Cuffdiff, while it is only 1 gene from DESeq2. It looks like either Cufflinks is too liberal, or makes false positive calls, or HTseq-DEseq2 is to conservative, or makes false negative calls. It is hard for me to make a conclusion based on the different results. 

So, I am trying to see the expression levels for each gene and check the correlation between cuffdiff and DESeq2 results. However, it is RPKM (my data is single-end) from cuffdiff (cufflinks) but baseMean from DESeq2. So, my question is how DESeq2 to calculate the baseMean for each gene? Is there any relationship to RPKM? For baseMean, it is a just average of the normalized count values, taken over all samples. On my understanding, it is normalized only by library size not genes length, right? Could I get FPKM from DESeq2? If I can't, I can also calculate it manually based on gene count and length. But, I wonder know if I could get it from DESeq2 and my answer is no but I don't know the exact reason.

For my third dataset, it is multi-factor design. There are 2 factors in my experiment, genotype and condition, with 2 levels for each of them. I fit a model with interaction. Here are the codes that I used to read data in:

GT_Con_colData <- data.frame(row.names = colnames(GT_Con_countData), GT = c(rep('WT', 11), rep('NEO', 11)), Con = c(rep('MS', 5), rep('NMS', 6), rep('MS', 6), rep('NMS', 5)))

GT_Con_dds <- DESeqDataSetFromMatrix(countData = GT_Con_countData , colData = GT_Con_colData, design = formula(~ GT+Con+GT:Con) )

Then, I did test by

GT_Con_dds <- DESeq(GT_Con_dds)

GT_Con_dds_res <- results(GT_Con_dds)​

The result shows that there is nothing significant. How to explain the results and what is the next step based on this non-significant results. Is it showing there is no interaction effect for my data? Actually, for this test, my full model is GT+Con+GT:Con (H1), and my reduced model is GT+Con (H0). Since we fail to reject our H0, we can conclude that the interaction effect is not significant for this dataset. Is my understanding correct? If yes, then the null model is enough to fit my data. Could I just drop/remove the interaction from my model, then the full model is GT+Con? Does is it matter the order of GT and Con because I see from vignette the variable of interest should be put at the end of the formula? In this case, what is the reduced model? Also, could you illuminate what is the code should I use  if I am interested in all of contrasts based on this non-signifciant interaction effect?

I hope I make it clear for my questions in this big email (actually, I did some research online and in DESeq2 community. It really helps but I am still confused about my questions). Thank you so much!

Best,

Xiaofei 

 

deseq2 • 5.0k views
ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by xfwang20
Answer: A couple of questions about baseMean, RPKM, and multi-factor analysis for DESeq2
1
gravatar for Michael Love
4.8 years ago by
Michael Love25k
United States
Michael Love25k wrote:

hi Xiaofei,

"However, when I got the results from these 2 pipelines (Cuffdiff and DESeq2), they are so different."

This is not so surprising. There are dozens of differences in the statistical procedures of these software. Furthermore p-values are tail probabilities which are highly sensitive to changes in models. My recommendation would be to look at plots of example genes to get a better sense of what is in these sets.

"However, it is RPKM (my data is single-end) from cuffdiff (cufflinks) but baseMean from DESeq2."

baseMean isn't the best thing to look at for examining DESeq2 results. I would recommend looking at counts(dds, normalized=TRUE) which gives you a sense of the spread within each group, outlier counts, etc. See the vignette of the current release (v1.6) for an example of plotCounts() function.

"The result shows that there is nothing significant. How to explain the results and what is the next step based on this non-significant results. Is it showing there is no interaction effect for my data?"

Yes, without specifying any other arguments to results(), the last term is tested, which in this case is the interaction effect. This means you did not observe a statistical significant interaction effect.

"Actually, for this test, my full model is GT+Con+GT:Con (H1), and my reduced model is GT+Con (H0)"

This sounds like you are referring to a likelihood ratio test, but you performed a Wald test. They typically produce similar results, but you should be clear that you performed a Wald test, as this is the default for DESeq() followed by results(). There is more information on this in the vignette section on likelihood ratio tests and in the R help page for ?DESeq.

"Does is it matter the order of GT and Con because I see from vignette the variable of interest should be put at the end of the formula? "

The order does not matter, you can use the 'contrast' or 'name' argument of results() to specify which result table to build. The order only matters for when you call results() without specifying any arguments, as the software needs to pick a variable and factor levels to contrast. Instead of picking randomly, it picks the last variable and the last factor level over the first factor level.

"Also, could you illuminate what is the code should I use  if I am interested in all of contrasts based on this non-signifciant interaction effect?"

Please first read the full R help page for ?results for information on how to produce results tables.

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Michael Love25k

Hi Michael,

Thank you so much for responses to my questions so quickly! Actually, I read the vignette again and did more research for this project. But, I still have some questions in my mind about DESeq2 (I will 'Add comment' for clarifications here and also ask some new questions). 

"This is not so surprising. There are dozens of differences in the statistical procedures of these software. Furthermore p-values are tail probabilities which are highly sensitive to changes in models. My recommendation would be to look at plots of example genes to get a better sense of what is in these sets."

What do you mean by 'example genes'? Do you mean the DEG detected from Cuffdiff? Also, you recommended me to look at the plots, what are kind of plots here?

"This sounds like you are referring to a likelihood ratio test, but you performed a Wald test. They typically produce similar results, but you should be clear that you performed a Wald test, as this is the default for DESeq() followed by results(). There is more information on this in the vignette section on likelihood ratio tests and in the R help page for ?DESeq."

For likelihood ratio test and Wald test, both of them are to do statistical inference and they can both be used to test whether an effect exists or not. By default for DESeq() followed by restuls(), I performed a Wald test. But, I remember LR test can be easily extended to multiple parameters while Wald test is not appropriate for multiple parameters, correct? So, if there are multiple levels for a treatment factor in an experiment, my understanding is that we have to change the default performance for DESeq2, correct?

 

 

ADD REPLYlink written 4.8 years ago by xfwang20

"What do you mean by 'example genes'? Do you mean the DEG detected from Cuffdiff? Also, you recommended me to look at the plots, what are kind of plots here?"

I meant, genes which you are curious about, for example those not significant in one or the other method. The plotCounts function is used in the vignette.

"But, I remember LR test can be easily extended to multiple parameters while Wald test is not appropriate for multiple parameters, correct? So, if there are multiple levels for a treatment factor in an experiment, my understanding is that we have to change the default performance for DESeq2, correct?"

This is not correct. The default method can handle all kinds of design. In the vignette, the default DESeq method is used in a multifactor design with type of sequencing and with the condition, for example. Having multiple levels for a factor is also handled by the default method, for example in the Examples in ?results. You might have read a older post talking about v1.2, where in the devel branch (v1.3) we added behavior to make fold changes from factors with 3 or more levels symmetric to changes in the base level. But the default methods have always been with simple and complex designs in mind.

ADD REPLYlink written 4.8 years ago by Michael Love25k

I double check my notes for Categorical Analysis which is a class for statistics. I see that Wald test is not for multiple parameters. For example,

H0: ß1=ß2=0

H1: at least one of ß1 and ß2 is not 0, then Wald test is not for this testing because wald test is not for multiple parameters.

Anyway, I may be wrong to understand this for DESeq2, we can ignore this and I can understand it for DESeq2 based on your explanation. 

Thanks a lot for replying to me so quick and also at weekend! :)

ADD REPLYlink written 4.8 years ago by xfwang20

Oh I see now what you meant by multiple parameters. Math is always best for resolving these things :) Yes, the Wald cannot test the null of multiple coefficients being equal to zero while the LRT can.

ADD REPLYlink written 4.8 years ago by Michael Love25k

So, my understanding is that if we have multiple levels (>2) for a factor, it is similar to this test (H0: ß1=ß2=0 VS H1: at least one of ß1 and ß2 is not 0) when we want to test significance for a main effect. That is why I think I need to change the default Wald test to LRT when I have a factor with more than 2 levels. Could you tell me where I am stuck when I apply this understanding to DESeq2?

ADD REPLYlink written 4.8 years ago by xfwang20
Answer: A couple of questions about baseMean, RPKM, and multi-factor analysis for DESeq2
0
gravatar for xfwang
4.8 years ago by
xfwang20
United States
xfwang20 wrote:

Here are some new questions:

First, as I said I did pairwise comparisons on my first and second dataset (there is only one factor with 2 levels for each of them), and the results from Cuffdiff and DESeq2 are so different. I think I found the reason for this big difference. The majority of DEG detected by Cuffdiff are set to NA in output of DESeq2. For one of my dataset, there are 25 DEG detected by Cuffdiff and all of them are set to NA by DESeq2. For another dataset, there are 325 genes detected as differentially expressed. Of those, 269 are set to NA by DESeq2. 

On my understanding, there are a total of 3 reasons for NA. After I extracted the information from DESeq2 restuls for the genes which are detected as differentially expressed by Cuffdiff, I found the baseMean column are non-zero, and also both p value and adjusted p value are set to NA. So, I conclude that the majority of differentially expressed genes detected by Cuffdiff are with extreme count outlier detected by DESeq2. If my understanding is right, could you give me some advice about how to deal with these genes which are DEG in Cuffdiff but set to NA by DESeq2?

(Also, I see the "results" function will flag genes which contain a Cook's distance above a cutoff, but I don't see these flags in "results" output, could you tell me how to see the flags?) (One more thing, I have one more question about "independent filtering" here, I see the "results" function will perform independent filtering by default and only the adjusted p values for the genes which do not pass the filter threshold are set to NA. But in the output of "results" for my data, all of rows/genes with NA adjusted p values are also with NA p values. So, it means that the NAs are due to either all samples of zero counts or a sample with an extreme count outlier. But, I don' think my data does not need to filter out the weakly-expressed genes because of multiple testing. Do you know why does this happen?)

Second, my questions are about multi-factor design analysis by DESeq2. My experiment is 2-way factorial design. So, it is 2-way ANOVA analysis. As I said, there is no interaction effect (GT x Con) by Wald test. But I don't know what should I do for the downstream analysis. Actually, there are 2 ways in my mind and I am not sure which one should I go?

The first way is without model selection and still go with this design = formula(~ GT+Con+GT:Con). By running this 

> resultsNames(GT_Con_dds)

[1] "Intercept"     "GT_WT_vs_NEO"  "Con_NMS_vs_MS" "GTWT.ConNMS"  ,

I got the available names for testing significance (statistical inference) on my understanding. By AVONA table, it is from bottom to top to look at the statistical inference. By running 'results(GT_Con_dds, name="GT_WT_vs_NEO") and  results(GT_Con_dds, name="Con_NMS_vs_MS")', there are 3 and 114 rows/genes for each of them, respectively. So, there are main effects (GT and Con) on the response (gene expression level) or the main effects are significant. Then, I need to look at the marginal comparisons because the interaction effect is not significant. Take GT effect for example, the code should be results(GT_Con_dds, contrast=c("GT", "WT", "NEO")), correct? 

(BTW, if the interaction effect is significant, I need to look at the conditional comparisons. Still take GT effect for example, this time I need to compare of WT vs NEO for each level of Con. The codes should be results(GT_Con_dds, contrast=list("MS.GT", "MS.NEO")) and results(GT_Con_dds, contrast=list("NMS.GT", "NMS.NEO")). Is my understanding right?

The second way in my mind is associated with model selection. Because of the principle of parsimony, we are trying to find an "easy" model that is complex enough to fit the data well. In my data analysis, there is no evidence for significant interaction effect. So, I may go to re-fit the model without interaction by this 

GT_Con_dds_non_interaction <- DESeqDataSetFromMatrix(countData = GT_Con_countData , colData = GT_Con_colData, design = formula(~ GT+Con) )

and to see the marginal comparisons for main effects (GT and Con), correct?

I got stuck to move forward on my analysis because of the questions I described here. Thank you so much for your help and I am sorry I make my description so long.

Thanks a lot!

Best,

Xiaofei

 

 

ADD COMMENTlink written 4.8 years ago by xfwang20

"If my understanding is right, could you give me some advice about how to deal with these genes which are DEG in Cuffdiff but set to NA by DESeq2?"

Look at some example genes with plotCounts, and if you decide to turn off outlier filtering, in other words that you think the filtering is unnecessary, you can set cooksCutoff=FALSE in results().

"could you tell me how to see the flags"

There's no extra column which is a flag. We chose not to add additional columns to the results table because the information is contained in the values for baseMean, pvalue and padj. The logic is described in the vignette (it sounds like you've seen this already). 

"But in the output of "results" for my data, all of rows/genes with NA adjusted p values are also with NA p values. "

So then, it means that the best filtering threshold chosen by DESeq2 was a mean count of zero (essentially, not filtering is better than filtering for your dataset). The optimal threshold is chosen automatically to maximize the number of significant genes, so you don't have to worry about this. But if you want to turn it off you can set independentFiltering=FALSE in results(). And you can use summary(res) to get more information on the filtering and outlier numbers.

Note that the same results table can be built using name="GT_WT_vs_NEO" or contrast=c("GT", "WT", "NEO").

I don't have a recommendation for the first or second model, it's up to you as the analyst.

ADD REPLYlink written 4.8 years ago by Michael Love25k

"There's no extra column which is a flag. We chose not to add additional columns to the results table because the information is contained in the values for baseMean, pvalue and padj. The logic is described in the vignette (it sounds like you've seen this already). "

Actually, I don't see there is flag information contained in the values for baseMean, pvalue, and padj. What do they look like? In another word, what is the difference for baseMean, pvalue, and padj (look like) between with and without outlier genes? How do they annotated/flagged on baseMean, pvalue, and padj for these genes with outlier?

ADD REPLYlink written 4.8 years ago by xfwang20
Answer: A couple of questions about baseMean, RPKM, and multi-factor analysis for DESeq2
0
gravatar for xfwang
4.8 years ago by
xfwang20
United States
xfwang20 wrote:

Dear Michael,

I sorry I am still struggling about the different results between DESeq2 and Cuffdiff. Here is a new question about how DESeq2 calculate log2FoldChange?

The columns "FIN_Nor_mean" and "VEH_Nor_mean" are got from DESeq2 by code "PFC_dds_countsNor <- counts(PFC_dds, normalized=TRUE)" and "write.table(PFC_dds_countsNor, "PFC_dds_countsNor.txt", quote=FALSE, sep='\t')". After got PFC_dds_countsNor.txt, I extracted genes1-25 (which are significant genes in cuffdiff output) information from it. Then, I calculated the normalized counts for FIN and VEH, respectively. Also, I calculated the "baseMean_cal" for them, which is same as "baseMean" column got from DESeq2. However, when I calculated the log2FoldChange by this log2(BEH_Nor_mean/FIN_Nor_mean), the results are in column "log2FC-cal", it is similar to the log2FC from Cuffdiff but it is different with "log2FoldChange" column (the results got from DESeq2 by "PFC_dds <-DESeq(PFC_dds)" and PFC_res <- results(PFC_dds)). So, I am wondering how DESeq2 calculate log2FoldChange? Why it is different with my calculation? Is there something wrong with my calculation?

Also, the p_values are so different between DESeq2 and cuffdiff?

Thanks a lot!

Best,

Xiaofei

gene FIN_Nor_mean VEH_Nor_mean baseMean log2FoldChange lfcSE stat pvalue padj   baseMean_cal NorMeanFC log2FC_cal diff_log2FC diff_test_stat diff_p_value diff_q_value
gene1 816.4047786 498.9497995 675.3136768 -0.472398767 0.191803282 -2.462933702 0.013780539 0.999597155   675.3136768 0.611154923 -0.710389957 -0.703128 -2.42941 5e-05 0.0063866
gene2 1277.781604 490.8505676 928.0344766 -0.489961315 0.196507376 -2.493348216 0.012654468 0.999597155   928.0344766 0.384142772 -1.380285486 -1.37272 -4.56181 5e-05 0.0063866
gene3 127.2974789 34.7072995 86.14628804 -0.899048363 0.203536115 -4.417144157 1.00E-05 0.066692357   86.14628804 0.272647187 -1.874892825 -1.86653 -3.6175 5e-05 0.0063866
gene4 40.01446525 226.4063237 122.8552912 0.121464238 0.092279704 1.31626168 0.188086228 0.999597155   122.8552912 5.658111943 2.50032072 2.6239 4.35182 5e-05 0.0063866
gene5 773.7401993 436.457829 623.8369236 -0.478322867 0.20067041 -2.383624308 0.017143094 0.999597155   623.8369236 0.564088346 -0.826006964 -0.817289 -2.72896 5e-05 0.0063866
gene6 1573.313176 538.1434523 1113.237743 -0.722191048 0.20338784 -3.550807396 0.000384051 0.661459471   1113.237743 0.342044712 -1.547743168 -1.5446 -5.56294 5e-05 0.0063866
gene7 1626.941227 249.0617302 1014.55034 -0.470140989 0.166336082 -2.826452227 0.004706676 0.999597155   1014.55034 0.15308588 -2.70758687 -2.69789 -7.33934 5e-05 0.0063866
gene8 1561.771122 1055.675408 1336.839693 -0.359363942 0.195594701 -1.837288739 0.066167281 0.999597155   1336.839693 0.675947579 -0.565016728 -0.556128 -2.1236 5e-05 0.0063866
gene9 220.6347689 109.9580226 171.4451039 -0.586134637 0.200344005 -2.925641007 0.003437474 0.999597155   171.4451039 0.498371237 -1.004707289 -1.0101 -2.46675 5e-05 0.0063866
gene10 249.043545 394.8955972 313.8666793 0.421332629 0.196530911 2.143849158 0.032044975 0.999597155   313.8666793 1.585648796 0.665073265 0.680747 1.9863 5e-05 0.0063866
gene11 133.7911486 465.8820426 281.3871015 0.192478969 0.127718927 1.507051248 0.131797522 0.999597155   281.3871015 3.482158928 1.79998205 1.80874 4.3624 5e-05 0.0063866
gene12 867.3980164 561.8838276 731.6139325 -0.503811612 0.161344106 -3.12259074 0.001792668 0.999597155   731.6139325 0.647780854 -0.626422267 -0.612023 -2.19264 5e-05 0.0063866
gene13 899.9394186 583.6803221 759.3798201 -0.463527744 0.177810759 -2.606859942 0.009137673 0.999597155   759.3798201 0.648577349 -0.624649456 -0.617155 -2.22082 5e-05 0.0063866
gene14 41469.99639 10694.58068 27792.03385 -0.236061973 0.14108987 -1.673131974 0.094301318 0.999597155   27792.03385 0.257887186 -1.955188004 -1.94441 -3.41448 5e-05 0.0063866
gene15 210.5000153 109.5918313 165.6519335 -0.60649158 0.194549321 -3.117418122 0.001824426 0.999597155   165.6519335 0.520626239 -0.941680071 -0.936517 -2.25834 5e-05 0.0063866
gene16 2355.677972 688.2062908 1614.579447 -0.196228912 0.135026913 -1.453257781 0.146152195 0.999597155   1614.579447 0.292147865 -1.775229347 -1.79129 -5.2087 5e-05 0.0063866
gene17 440.3590298 271.3607996 365.2487052 -0.397932397 0.201312434 -1.976690607 0.048076613 0.999597155   365.2487052 0.616226264 -0.698467923 -0.674676 -1.9768 5e-05 0.0063866
gene18 43.53285186 14.57336894 30.66197056 -0.467173825 0.189281515 -2.468142891 0.013581609 0.999597155   30.66197056 0.334767154 -1.57877011 -1.52615 -2.22642 5e-05 0.0063866
gene19 6.263961314 31.82320349 17.6236245 0.191332259 0.115852699 1.651513176 0.098633832 0.999597155   17.6236245 5.080363989 2.344931865 2.39314 2.51521 5e-05 0.0063866
gene20 109116.4269 19621.34374 69340.83436 -0.268185407 0.138644969 -1.934332043 0.053072316 0.999597155   69340.83436 0.179820255 -2.475372557 -2.54984 -3.23927 5e-05 0.0063866
gene21 227.8464609 128.5476063 183.7136367 -0.409305932 0.203535526 -2.010980294 0.044327538 0.999597155   183.7136367 0.564185223 -0.825759215 -0.809061 -1.99248 5e-05 0.0063866
gene22 343.1492346 521.1119991 422.2437966 0.347988028 0.201269492 1.728965602 0.083815252 0.999597155   422.2437966 1.518616236 0.602757338 0.614397 1.92253 0.00015 0.0169908
gene23 9164.703977 13165.17772 10942.69231 0.179206011 0.192970945 0.928668359 0.353060983 0.999597155   10942.69231 1.436508779 0.52256681 0.531002 1.73328 0.0003 0.029622
gene24 840.0714403 599.3251559 733.0730917 -0.374723505 0.171546117 -2.184389317 0.028933653 0.999597155   733.0730917 0.713421653 -0.487173091 -0.483784 -1.7334 0.0005 0.0436294
gene25 730.1533528 528.776576 640.6525631 -0.265997777 0.201287651 -1.321480853 0.186341083 0.999597155   640.6525631 0.72419934 -0.465541233 -0.468006 -1.60779 0.0005 0.0436294
ADD COMMENTlink written 4.8 years ago by xfwang20

I don't know how to upload an image. If the contents of the table do not make sense, please let me know.

ADD REPLYlink written 4.8 years ago by xfwang20

Sorry, there is a type  log2(BEH_Nor_mean/FIN_Nor_mean) should be  log2(VEH_Nor_mean/FIN_Nor_mean)

ADD REPLYlink written 4.8 years ago by xfwang20

"when I calculated the log2FoldChange by this log2(BEH_Nor_mean/FIN_Nor_mean), the results are in column "log2FC-cal", it is similar to the log2FC from Cuffdiff but it is different with "log2FoldChange" column"

In DESeq2, log fold changes are moderated toward zero, where there is more moderation when the information from the gene's counts is low (high variance of counts, or low counts).  For more details, read about LFC shrinkage in the vignette or in the paper: http://genomebiology.com/2014/15/12/550/abstract . These genes have only moderately large test statistics, which tells me that the counts likely have high within-group variance. If you want to compare the results without LFC shrinkage, you can use the betaPrior=FALSE argument to DESeq().

"Also, the p_values are so different between DESeq2 and cuffdiff?"

There are dozens of differences in the statistical procedures of these software.

ADD REPLYlink written 4.8 years ago by Michael Love25k
Answer: A couple of questions about baseMean, RPKM, and multi-factor analysis for DESeq2
0
gravatar for xfwang
4.8 years ago by
xfwang20
United States
xfwang20 wrote:

Hi Michael,

I am sorry to bother you again. 

I am trying to see what is going on by using LRT ( they might produce similar results). But I don't know how to express the reduced model for one factor analysis in DESeq2. If there is one treatment factor say 'trt', then the full model is "full=~trt" but what is the reduced model, is it “reduced= ~ 1”, I understand it in statistics which should be an unknown fixed value, but I don’t know how to express it in DESeq2, why it is “~1”, if it is in right expression?

Thanks a lot!

Best,

Xiaofei

ADD COMMENTlink written 4.8 years ago by xfwang20
Yes, reduced is ~1.
ADD REPLYlink written 4.8 years ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 175 users visited in the last hour