Question

A couple of questions about baseMean, RPKM, and multi-factor analysis for DESeq2

0

Entering edit mode

xfwang ▴ 20

@xfwang-7176

Last seen 8.0 years ago

United States

Dear DESeq2 community,

I am trying to use DESeq2 to detect DEG for my RNAseq data because it can deal with multi-factor designs, although there is not only multi-factor analysis in my experiment data. Let me clarify my datasets firstly.

For my first and second dataset, there is only one factor with 2 levels (They are unbalanced experiments. One treatment level is of 4 biological replication and the other is of 5). So, what I need is just to do 2 pairwise comparisons for each of them by reading the count table (the counts are raw counts got from Htseq-count) by

STR_dds <- DESeqDataSetFromMatrix(countData = STR_count, colData = STR_colData, design = ~ trt )

PFC_dds <- DESeqDataSetFromMatrix(countData = PFC_count, colData = PFC_colData, design = ~ trt ), then doing the differential expression analysis

STR_dds <-DESeq(STR_dds)

PFC_dds <-DESeq(PFC_dds).

In the meantime, I also used Cuffdiff to do the tests. However, when I got the results from these 2 pipelines (Cuffdiff and DESeq2), they are so different. Both of them are with FDR at 0.05. For one of the dataset, there are 325 genes detected as DEG from Cuffdiff, while it is only 6 genes that are differentially expressed by DESeq2. For another dataset, there are 25 DEG from Cuffdiff, while it is only 1 gene from DESeq2. It looks like either Cufflinks is too liberal, or makes false positive calls, or HTseq-DEseq2 is to conservative, or makes false negative calls. It is hard for me to make a conclusion based on the different results.

So, I am trying to see the expression levels for each gene and check the correlation between cuffdiff and DESeq2 results. However, it is RPKM (my data is single-end) from cuffdiff (cufflinks) but baseMean from DESeq2. So, my question is how DESeq2 to calculate the baseMean for each gene? Is there any relationship to RPKM? For baseMean, it is a just average of the normalized count values, taken over all samples. On my understanding, it is normalized only by library size not genes length, right? Could I get FPKM from DESeq2? If I can't, I can also calculate it manually based on gene count and length. But, I wonder know if I could get it from DESeq2 and my answer is no but I don't know the exact reason.

For my third dataset, it is multi-factor design. There are 2 factors in my experiment, genotype and condition, with 2 levels for each of them. I fit a model with interaction. Here are the codes that I used to read data in:

GT_Con_colData <- data.frame(row.names = colnames(GT_Con_countData), GT = c(rep('WT', 11), rep('NEO', 11)), Con = c(rep('MS', 5), rep('NMS', 6), rep('MS', 6), rep('NMS', 5)))

GT_Con_dds <- DESeqDataSetFromMatrix(countData = GT_Con_countData , colData = GT_Con_colData, design = formula(~ GT+Con+GT:Con) )

Then, I did test by

GT_Con_dds <- DESeq(GT_Con_dds)

GT_Con_dds_res <- results(GT_Con_dds)

The result shows that there is nothing significant. How to explain the results and what is the next step based on this non-significant results. Is it showing there is no interaction effect for my data? Actually, for this test, my full model is GT+Con+GT:Con (H1), and my reduced model is GT+Con (H0). Since we fail to reject our H0, we can conclude that the interaction effect is not significant for this dataset. Is my understanding correct? If yes, then the null model is enough to fit my data. Could I just drop/remove the interaction from my model, then the full model is GT+Con? Does is it matter the order of GT and Con because I see from vignette the variable of interest should be put at the end of the formula? In this case, what is the reduced model? Also, could you illuminate what is the code should I use if I am interested in all of contrasts based on this non-signifciant interaction effect?

I hope I make it clear for my questions in this big email (actually, I did some research online and in DESeq2 community. It really helps but I am still confused about my questions). Thank you so much!

Best,

Xiaofei

deseq2 • 7.7k views

ADD COMMENT • link 9.4 years ago • updated 9.3 years ago xfwang ▴ 20

score 1 · Answer 1 · 2014-12-16

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

hi Xiaofei,

"However, when I got the results from these 2 pipelines (Cuffdiff and DESeq2), they are so different."

This is not so surprising. There are dozens of differences in the statistical procedures of these software. Furthermore p-values are tail probabilities which are highly sensitive to changes in models. My recommendation would be to look at plots of example genes to get a better sense of what is in these sets.

"However, it is RPKM (my data is single-end) from cuffdiff (cufflinks) but baseMean from DESeq2."

baseMean isn't the best thing to look at for examining DESeq2 results. I would recommend looking at counts(dds, normalized=TRUE) which gives you a sense of the spread within each group, outlier counts, etc. See the vignette of the current release (v1.6) for an example of plotCounts() function.

"The result shows that there is nothing significant. How to explain the results and what is the next step based on this non-significant results. Is it showing there is no interaction effect for my data?"

Yes, without specifying any other arguments to results(), the last term is tested, which in this case is the interaction effect. This means you did not observe a statistical significant interaction effect.

"Actually, for this test, my full model is GT+Con+GT:Con (H1), and my reduced model is GT+Con (H0)"

This sounds like you are referring to a likelihood ratio test, but you performed a Wald test. They typically produce similar results, but you should be clear that you performed a Wald test, as this is the default for DESeq() followed by results(). There is more information on this in the vignette section on likelihood ratio tests and in the R help page for ?DESeq.

"Does is it matter the order of GT and Con because I see from vignette the variable of interest should be put at the end of the formula? "

The order does not matter, you can use the 'contrast' or 'name' argument of results() to specify which result table to build. The order only matters for when you call results() without specifying any arguments, as the software needs to pick a variable and factor levels to contrast. Instead of picking randomly, it picks the last variable and the last factor level over the first factor level.

"Also, could you illuminate what is the code should I use if I am interested in all of contrasts based on this non-signifciant interaction effect?"

Please first read the full R help page for ?results for information on how to produce results tables.

ADD COMMENT • link 9.4 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thank you so much for responses to my questions so quickly! Actually, I read the vignette again and did more research for this project. But, I still have some questions in my mind about DESeq2 (I will 'Add comment' for clarifications here and also ask some new questions).

"This is not so surprising. There are dozens of differences in the statistical procedures of these software. Furthermore p-values are tail probabilities which are highly sensitive to changes in models. My recommendation would be to look at plots of example genes to get a better sense of what is in these sets."

What do you mean by 'example genes'? Do you mean the DEG detected from Cuffdiff? Also, you recommended me to look at the plots, what are kind of plots here?

"This sounds like you are referring to a likelihood ratio test, but you performed a Wald test. They typically produce similar results, but you should be clear that you performed a Wald test, as this is the default for DESeq() followed by results(). There is more information on this in the vignette section on likelihood ratio tests and in the R help page for ?DESeq."

For likelihood ratio test and Wald test, both of them are to do statistical inference and they can both be used to test whether an effect exists or not. By default for DESeq() followed by restuls(), I performed a Wald test. But, I remember LR test can be easily extended to multiple parameters while Wald test is not appropriate for multiple parameters, correct? So, if there are multiple levels for a treatment factor in an experiment, my understanding is that we have to change the default performance for DESeq2, correct?

ADD REPLY • link 9.4 years ago xfwang ▴ 20

0

Entering edit mode

"What do you mean by 'example genes'? Do you mean the DEG detected from Cuffdiff? Also, you recommended me to look at the plots, what are kind of plots here?"

I meant, genes which you are curious about, for example those not significant in one or the other method. The plotCounts function is used in the vignette.

"But, I remember LR test can be easily extended to multiple parameters while Wald test is not appropriate for multiple parameters, correct? So, if there are multiple levels for a treatment factor in an experiment, my understanding is that we have to change the default performance for DESeq2, correct?"

This is not correct. The default method can handle all kinds of design. In the vignette, the default DESeq method is used in a multifactor design with type of sequencing and with the condition, for example. Having multiple levels for a factor is also handled by the default method, for example in the Examples in ?results. You might have read a older post talking about v1.2, where in the devel branch (v1.3) we added behavior to make fold changes from factors with 3 or more levels symmetric to changes in the base level. But the default methods have always been with simple and complex designs in mind.

ADD REPLY • link 9.4 years ago Michael Love 41k

0

Entering edit mode

I double check my notes for Categorical Analysis which is a class for statistics. I see that Wald test is not for multiple parameters. For example,

H0: ß1=ß2=0

H1: at least one of ß1 and ß2 is not 0, then Wald test is not for this testing because wald test is not for multiple parameters.

Anyway, I may be wrong to understand this for DESeq2, we can ignore this and I can understand it for DESeq2 based on your explanation.

Thanks a lot for replying to me so quick and also at weekend! :)

ADD REPLY • link 9.4 years ago xfwang ▴ 20

0

Entering edit mode

Oh I see now what you meant by multiple parameters. Math is always best for resolving these things :) Yes, the Wald cannot test the null of multiple coefficients being equal to zero while the LRT can.

ADD REPLY • link 9.4 years ago Michael Love 41k

0

Entering edit mode

So, my understanding is that if we have multiple levels (>2) for a factor, it is similar to this test (H0: ß1=ß2=0 VS H1: at least one of ß1 and ß2 is not 0) when we want to test significance for a main effect. That is why I think I need to change the default Wald test to LRT when I have a factor with more than 2 levels. Could you tell me where I am stuck when I apply this understanding to DESeq2?

ADD REPLY • link 9.4 years ago xfwang ▴ 20

score 0 · Answer 2 · 2014-12-21

Here are some new questions:

First, as I said I did pairwise comparisons on my first and second dataset (there is only one factor with 2 levels for each of them), and the results from Cuffdiff and DESeq2 are so different. I think I found the reason for this big difference. The majority of DEG detected by Cuffdiff are set to NA in output of DESeq2. For one of my dataset, there are 25 DEG detected by Cuffdiff and all of them are set to NA by DESeq2. For another dataset, there are 325 genes detected as differentially expressed. Of those, 269 are set to NA by DESeq2.

On my understanding, there are a total of 3 reasons for NA. After I extracted the information from DESeq2 restuls for the genes which are detected as differentially expressed by Cuffdiff, I found the baseMean column are non-zero, and also both p value and adjusted p value are set to NA. So, I conclude that the majority of differentially expressed genes detected by Cuffdiff are with extreme count outlier detected by DESeq2. If my understanding is right, could you give me some advice about how to deal with these genes which are DEG in Cuffdiff but set to NA by DESeq2?

(Also, I see the "results" function will flag genes which contain a Cook's distance above a cutoff, but I don't see these flags in "results" output, could you tell me how to see the flags?) (One more thing, I have one more question about "independent filtering" here, I see the "results" function will perform independent filtering by default and only the adjusted p values for the genes which do not pass the filter threshold are set to NA. But in the output of "results" for my data, all of rows/genes with NA adjusted p values are also with NA p values. So, it means that the NAs are due to either all samples of zero counts or a sample with an extreme count outlier. But, I don' think my data does not need to filter out the weakly-expressed genes because of multiple testing. Do you know why does this happen?)

Second, my questions are about multi-factor design analysis by DESeq2. My experiment is 2-way factorial design. So, it is 2-way ANOVA analysis. As I said, there is no interaction effect (GT x Con) by Wald test. But I don't know what should I do for the downstream analysis. Actually, there are 2 ways in my mind and I am not sure which one should I go?

The first way is without model selection and still go with this design = formula(~ GT+Con+GT:Con). By running this

> resultsNames(GT_Con_dds)

[1] "Intercept" "GT_WT_vs_NEO" "Con_NMS_vs_MS" "GTWT.ConNMS" ,

I got the available names for testing significance (statistical inference) on my understanding. By AVONA table, it is from bottom to top to look at the statistical inference. By running 'results(GT_Con_dds, name="GT_WT_vs_NEO") and results(GT_Con_dds, name="Con_NMS_vs_MS")', there are 3 and 114 rows/genes for each of them, respectively. So, there are main effects (GT and Con) on the response (gene expression level) or the main effects are significant. Then, I need to look at the marginal comparisons because the interaction effect is not significant. Take GT effect for example, the code should be results(GT_Con_dds, contrast=c("GT", "WT", "NEO")), correct?

(BTW, if the interaction effect is significant, I need to look at the conditional comparisons. Still take GT effect for example, this time I need to compare of WT vs NEO for each level of Con. The codes should be results(GT_Con_dds, contrast=list("MS.GT", "MS.NEO")) and results(GT_Con_dds, contrast=list("NMS.GT", "NMS.NEO")). Is my understanding right?

The second way in my mind is associated with model selection. Because of the principle of parsimony, we are trying to find an "easy" model that is complex enough to fit the data well. In my data analysis, there is no evidence for significant interaction effect. So, I may go to re-fit the model without interaction by this

GT_Con_dds_non_interaction <- DESeqDataSetFromMatrix(countData = GT_Con_countData , colData = GT_Con_colData, design = formula(~ GT+Con) )

and to see the marginal comparisons for main effects (GT and Con), correct?

I got stuck to move forward on my analysis because of the questions I described here. Thank you so much for your help and I am sorry I make my description so long.

Thanks a lot!

Best,

Xiaofei

score 0 · Answer 3 · 2015-01-12

Dear Michael,

I sorry I am still struggling about the different results between DESeq2 and Cuffdiff. Here is a new question about how DESeq2 calculate log2FoldChange?

The columns "FIN_Nor_mean" and "VEH_Nor_mean" are got from DESeq2 by code "PFC_dds_countsNor <- counts(PFC_dds, normalized=TRUE)" and "write.table(PFC_dds_countsNor, "PFC_dds_countsNor.txt", quote=FALSE, sep='\t')". After got PFC_dds_countsNor.txt, I extracted genes1-25 (which are significant genes in cuffdiff output) information from it. Then, I calculated the normalized counts for FIN and VEH, respectively. Also, I calculated the "baseMean_cal" for them, which is same as "baseMean" column got from DESeq2. However, when I calculated the log2FoldChange by this log2(BEH_Nor_mean/FIN_Nor_mean), the results are in column "log2FC-cal", it is similar to the log2FC from Cuffdiff but it is different with "log2FoldChange" column (the results got from DESeq2 by "PFC_dds <-DESeq(PFC_dds)" and PFC_res <- results(PFC_dds)). So, I am wondering how DESeq2 calculate log2FoldChange? Why it is different with my calculation? Is there something wrong with my calculation?

Also, the p_values are so different between DESeq2 and cuffdiff?

Thanks a lot!

Best,

Xiaofei

gene	FIN_Nor_mean	VEH_Nor_mean	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	baseMean_cal	NorMeanFC	log2FC_cal	diff_log2FC	diff_test_stat	diff_p_value	diff_q_value
gene1	816.4047786	498.9497995	675.3136768	-0.472398767	0.191803282	-2.462933702	0.013780539	0.999597155	675.3136768	0.611154923	-0.710389957	-0.703128	-2.42941	5e-05	0.0063866
gene2	1277.781604	490.8505676	928.0344766	-0.489961315	0.196507376	-2.493348216	0.012654468	0.999597155	928.0344766	0.384142772	-1.380285486	-1.37272	-4.56181	5e-05	0.0063866
gene3	127.2974789	34.7072995	86.14628804	-0.899048363	0.203536115	-4.417144157	1.00E-05	0.066692357	86.14628804	0.272647187	-1.874892825	-1.86653	-3.6175	5e-05	0.0063866
gene4	40.01446525	226.4063237	122.8552912	0.121464238	0.092279704	1.31626168	0.188086228	0.999597155	122.8552912	5.658111943	2.50032072	2.6239	4.35182	5e-05	0.0063866
gene5	773.7401993	436.457829	623.8369236	-0.478322867	0.20067041	-2.383624308	0.017143094	0.999597155	623.8369236	0.564088346	-0.826006964	-0.817289	-2.72896	5e-05	0.0063866
gene6	1573.313176	538.1434523	1113.237743	-0.722191048	0.20338784	-3.550807396	0.000384051	0.661459471	1113.237743	0.342044712	-1.547743168	-1.5446	-5.56294	5e-05	0.0063866
gene7	1626.941227	249.0617302	1014.55034	-0.470140989	0.166336082	-2.826452227	0.004706676	0.999597155	1014.55034	0.15308588	-2.70758687	-2.69789	-7.33934	5e-05	0.0063866
gene8	1561.771122	1055.675408	1336.839693	-0.359363942	0.195594701	-1.837288739	0.066167281	0.999597155	1336.839693	0.675947579	-0.565016728	-0.556128	-2.1236	5e-05	0.0063866
gene9	220.6347689	109.9580226	171.4451039	-0.586134637	0.200344005	-2.925641007	0.003437474	0.999597155	171.4451039	0.498371237	-1.004707289	-1.0101	-2.46675	5e-05	0.0063866
gene10	249.043545	394.8955972	313.8666793	0.421332629	0.196530911	2.143849158	0.032044975	0.999597155	313.8666793	1.585648796	0.665073265	0.680747	1.9863	5e-05	0.0063866
gene11	133.7911486	465.8820426	281.3871015	0.192478969	0.127718927	1.507051248	0.131797522	0.999597155	281.3871015	3.482158928	1.79998205	1.80874	4.3624	5e-05	0.0063866
gene12	867.3980164	561.8838276	731.6139325	-0.503811612	0.161344106	-3.12259074	0.001792668	0.999597155	731.6139325	0.647780854	-0.626422267	-0.612023	-2.19264	5e-05	0.0063866
gene13	899.9394186	583.6803221	759.3798201	-0.463527744	0.177810759	-2.606859942	0.009137673	0.999597155	759.3798201	0.648577349	-0.624649456	-0.617155	-2.22082	5e-05	0.0063866
gene14	41469.99639	10694.58068	27792.03385	-0.236061973	0.14108987	-1.673131974	0.094301318	0.999597155	27792.03385	0.257887186	-1.955188004	-1.94441	-3.41448	5e-05	0.0063866
gene15	210.5000153	109.5918313	165.6519335	-0.60649158	0.194549321	-3.117418122	0.001824426	0.999597155	165.6519335	0.520626239	-0.941680071	-0.936517	-2.25834	5e-05	0.0063866
gene16	2355.677972	688.2062908	1614.579447	-0.196228912	0.135026913	-1.453257781	0.146152195	0.999597155	1614.579447	0.292147865	-1.775229347	-1.79129	-5.2087	5e-05	0.0063866
gene17	440.3590298	271.3607996	365.2487052	-0.397932397	0.201312434	-1.976690607	0.048076613	0.999597155	365.2487052	0.616226264	-0.698467923	-0.674676	-1.9768	5e-05	0.0063866
gene18	43.53285186	14.57336894	30.66197056	-0.467173825	0.189281515	-2.468142891	0.013581609	0.999597155	30.66197056	0.334767154	-1.57877011	-1.52615	-2.22642	5e-05	0.0063866
gene19	6.263961314	31.82320349	17.6236245	0.191332259	0.115852699	1.651513176	0.098633832	0.999597155	17.6236245	5.080363989	2.344931865	2.39314	2.51521	5e-05	0.0063866
gene20	109116.4269	19621.34374	69340.83436	-0.268185407	0.138644969	-1.934332043	0.053072316	0.999597155	69340.83436	0.179820255	-2.475372557	-2.54984	-3.23927	5e-05	0.0063866
gene21	227.8464609	128.5476063	183.7136367	-0.409305932	0.203535526	-2.010980294	0.044327538	0.999597155	183.7136367	0.564185223	-0.825759215	-0.809061	-1.99248	5e-05	0.0063866
gene22	343.1492346	521.1119991	422.2437966	0.347988028	0.201269492	1.728965602	0.083815252	0.999597155	422.2437966	1.518616236	0.602757338	0.614397	1.92253	0.00015	0.0169908
gene23	9164.703977	13165.17772	10942.69231	0.179206011	0.192970945	0.928668359	0.353060983	0.999597155	10942.69231	1.436508779	0.52256681	0.531002	1.73328	0.0003	0.029622
gene24	840.0714403	599.3251559	733.0730917	-0.374723505	0.171546117	-2.184389317	0.028933653	0.999597155	733.0730917	0.713421653	-0.487173091	-0.483784	-1.7334	0.0005	0.0436294
gene25	730.1533528	528.776576	640.6525631	-0.265997777	0.201287651	-1.321480853	0.186341083	0.999597155	640.6525631	0.72419934	-0.465541233	-0.468006	-1.60779	0.0005	0.0436294

score 0 · Answer 4 · 2015-01-16

0

Entering edit mode

xfwang ▴ 20

@xfwang-7176

Last seen 8.0 years ago

United States

Hi Michael,

I am sorry to bother you again.

I am trying to see what is going on by using LRT ( they might produce similar results). But I don't know how to express the reduced model for one factor analysis in DESeq2. If there is one treatment factor say 'trt', then the full model is "full=~trt" but what is the reduced model, is it “reduced= ~ 1”, I understand it in statistics which should be an unknown fixed value, but I don’t know how to express it in DESeq2, why it is “~1”, if it is in right expression?

Thanks a lot!

Best,

Xiaofei

ADD COMMENT • link 9.3 years ago xfwang ▴ 20

0

Entering edit mode

Yes, reduced is ~1.

ADD REPLY • link 9.3 years ago Michael Love 41k