Understanding differences in results with and without setting a pairing factor
@solgakarbitechnionacil-6453

Hello,

I am working on an RNA-Seq experiment in which 3 different cell types are analyzed. For each cell type I have 2 biological replicates, each representing a different pool of mice.

I first ran DESeq2 without setting the 'Mice' factor, i.e. without pairing the 3 cell types from the same mouse pool.

The colData in this case is:

         CellType
Sample_1 type1
Sample_2 type1
Sample_3 type2
Sample_4 type2
Sample_5 type3
Sample_6 type3

I used the design ~CellType and obtained results for each of the 3 pairwise comparisons between cell types.

An example of a results command I used is:

res = results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05)
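
For completeness, the full setup was along these lines (a sketch, with cts standing in for the raw count matrix, whose construction I omit here):

library(DESeq2)

# colData without the pairing factor
coldata <- data.frame(CellType = factor(rep(c("type1", "type2", "type3"), each = 2)),
                      row.names = paste0("Sample_", 1:6))

dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ CellType)
dds <- DESeq(dds)
res <- results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05)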

Then I created a dataset that includes the 'Mice' factor, with the following colData:

         CellType Mice
Sample_1 type1    m1
Sample_2 type1    m2
Sample_3 type2    m1
Sample_4 type2    m2
Sample_5 type3    m1
Sample_6 type3    m2

with the design ~Mice + CellType and the same results commands.
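
In code, the only changes from the sketch above are the extra column and the design:

coldata$Mice <- factor(rep(c("m1", "m2"), times = 3))
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ Mice + CellType)
dds <- DESeq(dds)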

As expected, I get different results, but for some genes I can't understand the differences.

For example, for one gene I get the following expression values:

Sample    CellType Mice Expression
Sample_1  type1    m1   114.38
Sample_2  type1    m2    82.07
Sample_3  type2    m1     0
Sample_4  type2    m2     0
Sample_5  type3    m1     0
Sample_6  type3    m2     0

That is, this gene is expressed only in the type1 cells.

When requesting the comparison results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05), I get the following statistics:

- dds without setting the 'Mice' factor: log2FoldChange = 3.77127, padj = 1.38E-07

- dds with setting the 'Mice' factor: log2FoldChange = 0.313526, padj = 0.87749

Interestingly, there are several such examples in which one cell type shows no expression of a gene while another shows varying expression levels, and I get similar results when comparing them: with the pairing factor the dataset gives non-significant results, with a strange log2FoldChange close to 0.313.

 

Another example is the following gene:

Sample    CellType Mice Expression
Sample_1  type1    m1     0
Sample_2  type1    m2   354.35
Sample_3  type2    m1     0
Sample_4  type2    m2    24.95
Sample_5  type3    m1     0
Sample_6  type3    m2    16.18

Here only one of the pools shows any difference between the cell types.

Without the Mice pairing factor I get the expected non-significant result: log2FoldChange = 1.02, padj = 0.44

and with it: log2FoldChange = 1.25, padj = 0.000169

Am I doing something wrong? Is there an explanation for this that I am missing?

 

Thank you very much,

Olga

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255    LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C                  
[5] LC_TIME=Hebrew_Israel.1255    

attached base packages:
 [1] grid      splines   parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] matrixStats_0.51.0         ggrepel_0.6.5              PoiClaClu_1.0.2            gridExtra_2.2.1           
 [5] reshape_0.8.6              VennDiagram_1.6.17         futile.logger_1.4.3        colorRamps_2.3            
 [9] colorspace_1.3-2           randomcoloR_1.0.0          cowplot_0.7.0              Hmisc_4.0-2               
[13] Formula_1.2-1              survival_2.40-1            lattice_0.20-34            NOISeq_2.18.0             
[17] Matrix_1.2-8               pheatmap_1.0.8             gplots_3.0.1               RColorBrewer_1.1-2        
[21] DESeq2_1.14.1              SummarizedExperiment_1.4.0 Biobase_2.34.0             GenomicRanges_1.26.4      
[25] GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0       
[29] ggplot2_2.2.1              XLConnect_0.2-12           XLConnectJars_0.2-12      

loaded via a namespace (and not attached):
 [1] jsonlite_1.3         gtools_3.5.0         assertthat_0.1       latticeExtra_0.6-28  RSQLite_1.1-2        backports_1.0.5     
 [7] digest_0.6.12        XVector_0.14.1       checkmate_1.8.2      htmltools_0.3.5      plyr_1.8.4           XML_3.98-1.5        
[13] genefilter_1.56.0    zlibbioc_1.20.0      xtable_1.8-2         scales_0.4.1         gdata_2.17.0         BiocParallel_1.8.1  
[19] htmlTable_1.9        tibble_1.2           annotate_1.52.1      nnet_7.3-12          lazyeval_0.2.0       magrittr_1.5        
[25] memoise_1.0.0        foreign_0.8-67       tools_3.3.3          data.table_1.10.4    stringr_1.2.0        V8_1.3              
[31] munsell_0.4.3        locfit_1.5-9.1       cluster_2.0.5        AnnotationDbi_1.36.2 lambda.r_1.1.9       caTools_1.17.1      
[37] RCurl_1.95-4.8       htmlwidgets_0.8      labeling_0.3         bitops_1.0-6         base64enc_0.1-3      gtable_0.2.0        
[43] DBI_0.6              curl_2.4             knitr_1.15.1         futile.options_1.0.0 KernSmooth_2.23-15   rJava_0.9-8         
[49] stringi_1.1.3        Rcpp_0.12.10         geneplotter_1.52.0   rpart_4.1-10         acepack_1.4.1 


DESeq2 rna-seq paired samples
@mikelove

hi Olga,

I don't see anything surprising about the results here, from my interpretation of what the model does when you change the design. Of course, it's important to keep in mind that you have very few samples in this dataset for estimating the four coefficients, so the model has to do the best it can to interpret changes as either true differences or variability, in the absence of information from the counts for individual genes. In combination with each gene's information from the likelihood, there is sharing of information across genes. In situations like the one you show, where one pool of mice or one cell type has expression and the other has 0's, the model leans heavily (this happens automatically through DESeq2's use of Bayes' formula) on the estimation of inherent biological variability from other genes with more information in the counts.
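
One way to see this sharing of information is to look at the dispersion plot, where genes with little information in the counts are pulled toward the fitted trend:

plotDispEsts(dds)   # black: gene-wise estimates, red: fitted trend, blue: final (shrunken) values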

Update: I just noticed you are showing the adjusted p-values. These depend on the p-values of other genes, which change when you change the design. So it's not enough to look at the counts for one gene to figure out changes in the adjusted p-value (in addition to the information-sharing explanation above). A simple example:

> p.adjust(c(0,0,.05),method="BH")[3]
[1] 0.05
> p.adjust(c(1,1,.05),method="BH")[3]
[1] 0.15

The second case makes perfect sense to me: what looks like biological variability when you don't inform the model about the mice pools is now explained, and the type1/type2 ratio then appears more significant.

The most important thing to appreciate is that you have the bare minimum sample size for statistical testing, and so the model does the best it can by learning about inherent biological variability from other genes with more information.

@solgakarbitechnionacil-6453

Hi Michael,

First of all thank you very much for the response.

Regarding the second case, I can see the reason for that.

But for the first case I still can't understand why a whole set of genes with the same overall pattern, expressed only in one cell type (in both mice pools), not only got a much higher, non-significant padj but also a log2FoldChange that dropped from ~3 or higher to ~0.3. (For that specific gene the p-value also changed from 4.63E-10 without the 'Mice' factor to 0.3.)

In this case (with a limited number of samples), which of the models would you recommend? (It is not possible at this time to increase the number of replicates.)

Thank you again,

Olga. 

 

@mikelove

I would recommend the model including the mouse group, as it helps to explain the differences seen in the counts.

@solgakarbitechnionacil-6453

This was also our initial thought.

But we are still worried about the cases I presented above, since it seems that this way we 'miss' the interesting genes specific to a single cell type, which is one of the goals of the experiment.

@mikelove

Can you run DESeq() with betaPrior=FALSE to see if it helps these cases? This is the default from version 1.16 onward anyway.
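
That is, something like:

dds <- DESeq(dds, betaPrior=FALSE)
res <- results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05)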

@solgakarbitechnionacil-6453

I have tried running DESeq() with betaPrior=FALSE. In some cases it indeed changed the statistics, but in the cases where a gene is expressed specifically in a single cell type, the log2FoldChange is extremely high while both the p-value and the adjusted p-value are very far from significant.

For example, for the following genes I show the statistics for comparing cell type 1 vs. 2, where one of the compared cell types shows zero expression while the third type shows some expression:

Sample    CellType Mice Expression gene 1  Expression gene 2
Sample_1  type1    m1          0                40.85
Sample_2  type1    m2          0                27.04
Sample_3  type2    m1         33.79              0
Sample_4  type2    m2         59.26              0
Sample_5  type3    m1         26.39             10.55
Sample_6  type3    m2          9.78              4.52

gene 1 - without 'Mice' factor: log2FoldChange = -2.259, p-value = 0.000199, padj = 0.0089

gene 1 - with 'Mice' factor, betaPrior=TRUE: log2FoldChange = -1.88, p-value = 0.0019, padj = 0.06

gene 1 - with 'Mice' factor, betaPrior=FALSE: log2FoldChange = -7.5, p-value = 0.00015, padj = 0.0068

and similar statistics were calculated for gene 2.

But there is still a whole set of cell-type-specific genes where, even after setting betaPrior=FALSE, the log2FoldChange changed drastically, yet both the p-value and padj are extremely high and very different from the results without the pairing factor. I still don't understand why this is happening.

Thank you,

Olga

@mikelove

I would go with betaPrior=FALSE (as this is the default going forward). For the remaining genes where adding the mice group increases the p-value, can you show an example?

@solgakarbitechnionacil-6453

Hi Michael,

Here are 4 examples for such genes:

Sample    CellType Mice gene 1  gene 2  gene 3  gene 4
Sample_1  type1    m1   114.38   75.57   83.74    0
Sample_2  type1    m2    82.07   97.52  139.04    0
Sample_3  type2    m1     0       0       0      82.27
Sample_4  type2    m2     0       0       0      45.22
Sample_5  type3    m1     0       0       0       0
Sample_6  type3    m2     0       0       0       0

And the statistics for all three cases: no 'Mice' factor, 'Mice' factor with betaPrior=TRUE, and 'Mice' factor with betaPrior=FALSE:

                                          gene 1    gene 2    gene 3    gene 4
No 'Mice':               log2FoldChange    3.77      3.54      3.97     -2.87
No 'Mice':               pvalue           4.63E-10  5.09E-09  5.13E-11  1.63E-06
No 'Mice':               padj             1.38E-07  1.06E-06  1.96E-08  0.00015
'Mice', betaPrior=TRUE:  log2FoldChange    0.313     0.313     0.313    -0.3128
'Mice', betaPrior=TRUE:  pvalue            0.309     0.31      0.309     0.31
'Mice', betaPrior=TRUE:  padj              0.87      0.87      0.87      0.87
'Mice', betaPrior=FALSE: log2FoldChange    8.44      8.279     8.62     -7.88
'Mice', betaPrior=FALSE: pvalue            0.07      0.084     0.07      0.099
'Mice', betaPrior=FALSE: padj              0.515     0.533     0.49      0.57

As you can see, the log2FC, p-value and padj are almost identical across all these genes when the 'Mice' factor is added.

 

Olga

@mikelove

Are the library sizes very different for these samples? A zero does not carry the same information when the library sizes are very different, which is a good thing.
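
You can check quickly with:

colSums(counts(dds))   # raw library sizes (total counts per sample)
sizeFactors(dds)       # the size factors DESeq2 estimated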

@solgakarbitechnionacil-6453

The library sizes (# counted reads) do differ more than I usually see:

Sample    CellType Library size  Size factor
Sample_1  type1     2.5M          0.49
Sample_2  type1     5.5M          1.03
Sample_3  type2     4M            0.68
Sample_4  type2     3.5M          0.64
Sample_5  type3    10M            1.89
Sample_6  type3    12M            2.66

The type 1 and type 2 samples have lower coverage than the type 3 samples.

Does that explain the results? Still, I don't see how adding the pairing factor changed them.

 

Thank you,

Olga

@solgakarbitechnionacil-6453

Another note:

When comparing type 1 samples (with lower library size) to type 3 samples (with higher library size) I see the same results (padj = 0.87) for these genes as in the comparison presented above (type 1 vs. type 2).

Olga

@mikelove

I looked into this a bit more with some simulated counts on my end. It's difficult to accurately estimate the dispersion in this case when you have 6 samples and 4 coefficients to estimate (so only 2 residual degrees of freedom), and then 4 of the 6 samples have 0's. This leads to the big reduction in significance for these genes. If you had more replication, you wouldn't see such a difference in p-values between with and without the mice factor.

On my end, if you use the MAP estimator for such genes, you get more confident inference (so more like the 'no mice' results, even after adding the 'mice' factor), because the dispersion is moderated toward a more reasonable value given the other genes in the experiment. These genes are most likely getting a large dispersion estimate, such that they are classified as dispersion outliers, and then the MAP is not used.

You can enforce the MAP dispersion estimator with this code:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)             # gene-wise dispersion estimates
dds <- estimateDispersionsFit(dds)                 # fit the mean-dispersion trend
dds <- estimateDispersionsMAP(dds, outlierSD=Inf)  # always use the MAP estimate (no outlier flagging)
dds <- nbinomWaldTest(dds)
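
and then pull the contrasts as before, e.g.:

res <- results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05)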

I don't think you will need betaPrior=FALSE, but you can add it if you like. It will be FALSE in the next release by default.

I think this may be a more reasonable setting for this dataset: you have little information per gene, and the MAP estimator does much better. At least in my opinion it gives more reasonable results for this case.

@solgakarbitechnionacil-6453

Thank you very much for all your help!

I ran it with the MAP estimator. I will look into the results more deeply, but from a quick look, the genes we were discussing indeed give results more similar to those without the pairing factor.
