Understanding differences in results with and without setting a pairing factor
@solgakarbitechnionacil-6453

Hello,

I am working on an RNA-Seq experiment in which 3 different cell types are analyzed. For each cell type I have 2 biological replicates, each representing a different pool of mice.

I first ran DESeq2 without setting the 'Mice' factor, i.e. without pairing the 3 cell types from the same mouse pool.

The colData in this case is:

         CellType
Sample_1 type1
Sample_2 type1
Sample_3 type2
Sample_4 type2
Sample_5 type3
Sample_6 type3

I used the design ~CellType and obtained results for each of the 3 pairwise comparisons between cell types.

An example of a results command I used is:

res = results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05)
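
For completeness, the full setup was along these lines (a sketch, with cts standing in for the raw count matrix, whose construction I omit here):

library(DESeq2)

# colData without the pairing factor
coldata <- data.frame(CellType = factor(rep(c("type1", "type2", "type3"), each = 2)),
                      row.names = paste0("Sample_", 1:6))

dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ CellType)
dds <- DESeq(dds)
res <- results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05)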

Then I created a dataset that includes the 'Mice' factor, with the following colData:

         CellType Mice
Sample_1 type1    m1
Sample_2 type1    m2
Sample_3 type2    m1
Sample_4 type2    m2
Sample_5 type3    m1
Sample_6 type3    m2

with the design ~Mice + CellType and the same results commands.
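
In code, the only changes from the sketch above are the extra column and the design:

coldata$Mice <- factor(rep(c("m1", "m2"), times = 3))
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ Mice + CellType)
dds <- DESeq(dds)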

As expected, I get different results, but for some genes I can't understand the differences.

For example, for one gene I get the following expression values:

Sample    CellType Mice Expression
Sample_1  type1    m1   114.38
Sample_2  type1    m2    82.07
Sample_3  type2    m1     0
Sample_4  type2    m2     0
Sample_5  type3    m1     0
Sample_6  type3    m2     0

That is, this gene is expressed only in the type1 cells.

When requesting the comparison results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05), I get the following statistics:

- dds without setting the 'Mice' factor: log2FoldChange = 3.77127, padj = 1.38E-07

- dds with setting the 'Mice' factor: log2FoldChange = 0.313526, padj = 0.87749

Interestingly, there are several such examples in which one cell type shows no expression of a gene while another shows varying expression levels, and I get similar results when comparing them: with the pairing factor the dataset gives non-significant results, with a strange log2FoldChange close to 0.313.

 

Another example is the following gene:

Sample    CellType Mice Expression
Sample_1  type1    m1     0
Sample_2  type1    m2   354.35
Sample_3  type2    m1     0
Sample_4  type2    m2    24.95
Sample_5  type3    m1     0
Sample_6  type3    m2    16.18

Here only one of the pools shows any difference between the cell types.

Without the Mice pairing factor I get the expected non-significant result: log2FoldChange = 1.02, padj = 0.44

and with it: log2FoldChange = 1.25, padj = 0.000169

Am I doing something wrong? Is there an explanation for this that I am missing?

 

Thank you very much,

Olga

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255    LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C                  
[5] LC_TIME=Hebrew_Israel.1255    

attached base packages:
 [1] grid      splines   parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] matrixStats_0.51.0         ggrepel_0.6.5              PoiClaClu_1.0.2            gridExtra_2.2.1           
 [5] reshape_0.8.6              VennDiagram_1.6.17         futile.logger_1.4.3        colorRamps_2.3            
 [9] colorspace_1.3-2           randomcoloR_1.0.0          cowplot_0.7.0              Hmisc_4.0-2               
[13] Formula_1.2-1              survival_2.40-1            lattice_0.20-34            NOISeq_2.18.0             
[17] Matrix_1.2-8               pheatmap_1.0.8             gplots_3.0.1               RColorBrewer_1.1-2        
[21] DESeq2_1.14.1              SummarizedExperiment_1.4.0 Biobase_2.34.0             GenomicRanges_1.26.4      
[25] GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0       
[29] ggplot2_2.2.1              XLConnect_0.2-12           XLConnectJars_0.2-12      

loaded via a namespace (and not attached):
 [1] jsonlite_1.3         gtools_3.5.0         assertthat_0.1       latticeExtra_0.6-28  RSQLite_1.1-2        backports_1.0.5     
 [7] digest_0.6.12        XVector_0.14.1       checkmate_1.8.2      htmltools_0.3.5      plyr_1.8.4           XML_3.98-1.5        
[13] genefilter_1.56.0    zlibbioc_1.20.0      xtable_1.8-2         scales_0.4.1         gdata_2.17.0         BiocParallel_1.8.1  
[19] htmlTable_1.9        tibble_1.2           annotate_1.52.1      nnet_7.3-12          lazyeval_0.2.0       magrittr_1.5        
[25] memoise_1.0.0        foreign_0.8-67       tools_3.3.3          data.table_1.10.4    stringr_1.2.0        V8_1.3              
[31] munsell_0.4.3        locfit_1.5-9.1       cluster_2.0.5        AnnotationDbi_1.36.2 lambda.r_1.1.9       caTools_1.17.1      
[37] RCurl_1.95-4.8       htmlwidgets_0.8      labeling_0.3         bitops_1.0-6         base64enc_0.1-3      gtable_0.2.0        
[43] DBI_0.6              curl_2.4             knitr_1.15.1         futile.options_1.0.0 KernSmooth_2.23-15   rJava_0.9-8         
[49] stringi_1.1.3        Rcpp_0.12.10         geneplotter_1.52.0   rpart_4.1-10         acepack_1.4.1 


DESeq2 rna-seq paired samples
@mikelove

hi Olga,

I don't see anything surprising about the results here, from my interpretation of what the model does when you change the design. Of course, it's important to keep in mind that you have very few samples in this dataset for estimating the four coefficients, so the model has to do the best it can to interpret changes as either true differences or variability, in the absence of information from the counts for individual genes. In combination with each gene's information from the likelihood, there is sharing of information across genes. In situations like the one you show, where one pool of mice or one cell type has expression and the other has 0's, the model leans heavily (this happens automatically through DESeq2's use of Bayes' formula) on the estimation of inherent biological variability from other genes with more information in the counts.
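
One way to see this sharing of information is to look at the dispersion plot, where genes with little information in the counts are pulled toward the fitted trend:

plotDispEsts(dds)   # black: gene-wise estimates, red: fitted trend, blue: final (shrunken) values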

Update: I just noticed you are showing the adjusted p-values. These depend on the p-values of other genes, which change when you change the design. So it's not enough to look at the counts for one gene to figure out changes in the adjusted p-value (in addition to the information-sharing explanation above). A simple example:

> p.adjust(c(0,0,.05),method="BH")[3]
[1] 0.05
> p.adjust(c(1,1,.05),method="BH")[3]
[1] 0.15

The second case makes perfect sense to me: what looks like biological variability when you don't inform the model about the mice pools is now explained, and the type1/type2 ratio then appears more significant.

The most important thing to appreciate is that you have the bare minimum sample size for statistical testing, and so the model does the best it can by learning about inherent biological variability from other genes with more information.

@solgakarbitechnionacil-6453

Hi Michael,

First of all thank you very much for the response.

Regarding the second case, I can see the reason for that.

But for the first case I still can't understand why a whole set of genes with the same overall pattern, expressed only in one cell type (in both mice pools), not only got a much higher, non-significant padj but also a log2FoldChange that dropped from ~3 or higher to ~0.3. (For that specific gene the p-value also changed from 4.63E-10 without the 'Mice' factor to 0.3.)

In this case (with a limited number of samples), which of the models would you recommend? (It is not possible at this time to increase the number of replicates.)

Thank you again,

Olga. 

 

@mikelove

I would recommend the model including the mouse group, as it helps to explain the differences seen in the counts.

@solgakarbitechnionacil-6453

This was also our initial thought.

But we are still worried about the cases I presented above, since it seems that this way we 'miss' the interesting genes specific to a single cell type, which is one of the goals of the experiment.

@mikelove

Can you run DESeq() with betaPrior=FALSE to see if it helps these cases? This is the default from version 1.16 onward anyway.
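
That is, something like:

dds <- DESeq(dds, betaPrior=FALSE)
res <- results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05)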

@solgakarbitechnionacil-6453

I have tried running DESeq() with betaPrior=FALSE. In some cases it indeed changed the statistics, but in the cases where a gene is expressed specifically in a single cell type, the log2FoldChange is extremely high while both the p-value and the adjusted p-value are very far from significant.

For example, for the following genes I show the statistics for comparing cell type 1 vs. 2, where one of the compared cell types shows zero expression while the third type shows some expression:

Sample    CellType Mice Expression gene 1  Expression gene 2
Sample_1  type1    m1          0                40.85
Sample_2  type1    m2          0                27.04
Sample_3  type2    m1         33.79              0
Sample_4  type2    m2         59.26              0
Sample_5  type3    m1         26.39             10.55
Sample_6  type3    m2          9.78              4.52

gene 1 - without 'Mice' factor: log2FoldChange = -2.259, p-value = 0.000199, padj = 0.0089

gene 1 - with 'Mice' factor, betaPrior=TRUE: log2FoldChange = -1.88, p-value = 0.0019, padj = 0.06

gene 1 - with 'Mice' factor, betaPrior=FALSE: log2FoldChange = -7.5, p-value = 0.00015, padj = 0.0068

and similar statistics were calculated for gene 2.

But there is still a whole set of cell-type-specific genes where, even after setting betaPrior=FALSE, the log2FoldChange changed drastically, yet both the p-value and padj are extremely high and very different from the results without the pairing factor. I still don't understand why this is happening.

Thank you,

Olga

@mikelove

I would go with betaPrior=FALSE (as this is the default going forward). For the remaining genes where adding the mice group increases the p-value, can you show an example?

@solgakarbitechnionacil-6453

Hi Michael,

Here are 4 examples for such genes:

Sample    CellType Mice gene 1  gene 2  gene 3  gene 4
Sample_1  type1    m1   114.38   75.57   83.74    0
Sample_2  type1    m2    82.07   97.52  139.04    0
Sample_3  type2    m1     0       0       0      82.27
Sample_4  type2    m2     0       0       0      45.22
Sample_5  type3    m1     0       0       0       0
Sample_6  type3    m2     0       0       0       0

And the statistics for all three cases: no 'Mice' factor, 'Mice' factor with betaPrior=TRUE, and 'Mice' factor with betaPrior=FALSE:

                                          gene 1    gene 2    gene 3    gene 4
No 'Mice':               log2FoldChange    3.77      3.54      3.97     -2.87
No 'Mice':               pvalue           4.63E-10  5.09E-09  5.13E-11  1.63E-06
No 'Mice':               padj             1.38E-07  1.06E-06  1.96E-08  0.00015
'Mice', betaPrior=TRUE:  log2FoldChange    0.313     0.313     0.313    -0.3128
'Mice', betaPrior=TRUE:  pvalue            0.309     0.31      0.309     0.31
'Mice', betaPrior=TRUE:  padj              0.87      0.87      0.87      0.87
'Mice', betaPrior=FALSE: log2FoldChange    8.44      8.279     8.62     -7.88
'Mice', betaPrior=FALSE: pvalue            0.07      0.084     0.07      0.099
'Mice', betaPrior=FALSE: padj              0.515     0.533     0.49      0.57

As you can see, the log2FC, p-value and padj are almost identical across all these genes when the 'Mice' factor is added.

 

Olga

@mikelove

Are the library sizes very different for these samples? A zero does not carry the same information when the library sizes are very different, which is a good thing.
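
You can check quickly with:

colSums(counts(dds))   # raw library sizes (total counts per sample)
sizeFactors(dds)       # the size factors DESeq2 estimated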

@solgakarbitechnionacil-6453

The library sizes (# counted reads) do differ more than I usually see:

Sample    CellType Library size  Size factor
Sample_1  type1     2.5M          0.49
Sample_2  type1     5.5M          1.03
Sample_3  type2     4M            0.68
Sample_4  type2     3.5M          0.64
Sample_5  type3    10M            1.89
Sample_6  type3    12M            2.66

The type 1 and type 2 samples have lower coverage than the type 3 samples.

Does that explain the results? Still, I don't see how adding the pairing factor changed them.

 

Thank you,

Olga

@solgakarbitechnionacil-6453

Another note:

When comparing type 1 samples (with lower library size) to type 3 samples (with higher library size) I see the same results (padj = 0.87) for these genes as in the comparison presented above (type 1 vs. type 2).

Olga

@mikelove

I looked into this a bit more with some simulated counts on my end. It's difficult to accurately estimate the dispersion in this case when you have 6 samples and 4 coefficients to estimate (so only 2 residual degrees of freedom), and then 4 of the 6 samples have 0's. This leads to the big reduction in significance for these genes. If you had more replication, you wouldn't see such a difference in p-values between with and without the mice factor.

On my end, if you use the MAP estimator for such genes, you get more confident inference (so more like the 'no mice' results, even after adding the 'mice' factor), because the dispersion is moderated toward a more reasonable value given the other genes in the experiment. These genes are most likely getting a large dispersion estimate, such that they are classified as dispersion outliers, and then the MAP is not used.

You can enforce the MAP dispersion estimator with this code:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)             # gene-wise dispersion estimates
dds <- estimateDispersionsFit(dds)                 # fit the mean-dispersion trend
dds <- estimateDispersionsMAP(dds, outlierSD=Inf)  # always use the MAP estimate (no outlier flagging)
dds <- nbinomWaldTest(dds)
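
and then pull the contrasts as before, e.g.:

res <- results(dds, contrast = c("CellType", "type1", "type2"), alpha = 0.05)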

I don't think you will need betaPrior=FALSE, but you can add it if you like. It will be FALSE in the next release by default.

I think this may be a more reasonable setting for this dataset: you have little information per gene, and the MAP estimator does much better. At least in my opinion it gives more reasonable results for this case.

@solgakarbitechnionacil-6453

Thank you very much for all your help!

I ran it with the MAP estimator. I will look into the results more deeply, but from a quick look, the genes we were discussing indeed give results more similar to those without the pairing factor.
