Question

DESEq2 - Why is "my gene" not differentially expressed?

1

Entering edit mode

Jane Merlevede ▴ 90

@jane-merlevede-5019

Last seen 6.1 years ago

Dear all,

I have an RNASeq experiment of 12 samples for 2 conditions (6 samples per condition).
Reads are paired-end, 100 bp.

I performed differential expression analysis using DESeq2 and was surprised not to find a gene as significantly differentially expressed (it has been reported in an array and another RNASeq experiments studying the same conditions-but different samples).

Here is the DESeq2 procedure I followed:

library("DESeq2")
DataFrame=data.frame(Nom,Fichier,HIST)

################################################
###              Model                      ###
################################################

dds_raw=DESeqDataSetFromHTSeqCount(DataFrame,"/home/Results/Data/26022016", design= ~HIST)
str(colData(dds_raw)$HIST)

dds_raw$HIST = relevel(dds_raw$HIST, ref="Cond1")
dds <- DESeq(dds_raw, minReplicatesForReplace=6)


#####################################
### Differential expression Tests ###
#####################################
resMLE <- results(dds, addMLE=TRUE, cooksCutoff=FALSE,alpha=0.05)
summary(resMLE)
resMLEOrdered <- resMLE[order(resMLE$padj),]

write.table(resMLEOrdered, file="ResMLETable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")
write.table(counts(dds,normalized=TRUE), file="NormalizedTable",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")

and the sessionInfo

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.10.1              RcppArmadillo_0.6.500.4.0  Rcpp_0.12.3                SummarizedExperiment_1.0.2 Biobase_2.30.0             GenomicRanges_1.22.3      
 [7] GenomeInfoDb_1.6.1         IRanges_2.4.6              S4Vectors_0.8.7            BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3           XVector_0.10.0       futile.options_1.0.0 tools_3.2.0          zlibbioc_1.16.0      rpart_4.1-10        
 [9] RSQLite_1.0.0        annotate_1.48.0      gtable_0.1.2         lattice_0.20-33      DBI_0.3.1            gridExtra_2.0.0      genefilter_1.52.1    cluster_2.0.3       
[17] locfit_1.5-9.1       grid_3.2.0           nnet_7.3-11          AnnotationDbi_1.32.3 XML_3.98-1.3         survival_2.38-3      BiocParallel_1.4.3   foreign_0.8-66      
[25] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.48.0   ggplot2_2.0.0        lambda.r_1.1.7       Hmisc_3.17-1         scales_0.3.0         splines_3.2.0       
[33] xtable_1.8-0         colorspace_1.2-6     acepack_1.3-3.3      munsell_0.4.2

Here are the normalized counts and stats for my gene of interest:

90.7033407596	0	0	2.3672116923	0	0		939.0444914676	404.8903299633	665.5848352517	0	1034.9196508454	438.2486305103

baseMean	log2FoldChange	lfcMLE	lfcSE	stat	pvalue	padj
297.9798742075	-1.0090565228	-5.2239343815	0.5743666929	-1.7568158725	0.0789491987	0.4351815966

The estimated log2FC is just above 1 but the adjusted p-value is very high, especially compare to this other gene, which is similar to my gene of interest, but with less magnitude and with lower counts:

4.1228791254	0	0.8533500773	0	0	0.7423895021		99.6303301923	89.0758725919	128.7791432768	0	69.6249931217	43.9908663202

baseMean	log2FoldChange	lfcMLE	lfcSE	stat	pvalue	padj
36.4016520173	-3.353879756	-6.1662095289	0.6214945118	-5.3964752585	0.000000068	0.000052119

In condition 2, one value is 0, as in condition 1. But it is also the case for the second gene. It seems weird to me that this second gene is differential but not my gene of interest. I am probably missing something here.
I have posted this question on Seqanswers (http://seqanswers.com/forums/showthread.php?t=66831) and got few comments. I tried for example to exclude the "outlier sample" from the second condition but still, this gene is not differential.

Thank you in advance for any help,
Jane

deseq2 differential gene expression • 5.5k views

ADD COMMENT • link 8.7 years ago Jane Merlevede ▴ 90

score 0 · Answer 1 · 2016-03-16

It's hard to say what the reason is that you can't reproduce published results. There are so many technical and/or biological factors involved that can cause variability.

If you're not sure about your own RNAseq methods/results, check with q-PCR if the gene of interest is more or less similar expressed (in the same RNA samples).

score 0 · Answer 2 · 2016-03-16

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 3 hours ago

United States

Can you send the dds object to me via email and I'll take a look? It could just be a high level of variability in your data though.

My email can be found through maintainer("DESeq2")

You can use: save(dds, file="dds.rda")

ADD COMMENT • link 8.7 years ago Michael Love 42k

1

Entering edit mode

Hi Jane,

Thanks for sending the dds. So I took a look at this gene, and tried using a likelihood ratio test to see if it would be less sensitive to the outlier count and the p-value is still not small enough to pass multiple test correction:

> res[8443,]
log2 fold change (MLE): HIST H3F3A vs HIST1H3B
LRT p-value: '~ HIST' vs '~ 1'
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE      stat     pvalue      padj
      <numeric>      <numeric> <numeric> <numeric>  <numeric> <numeric>
HOXD8  297.9799      -5.223934  2.053248  4.532549 0.03325605 0.3089909

The problem is that the normalized counts are quite variable for this gene: 0-1035 in one group vs 0-91 in the other, despite that looking by eye it seems very likely to be DE if you ignore the 0 in the first group. Unfortunately, the negative binomial with 6 samples here is sensitive to the variation it sees and only gives this a p-value of 0.07 for the Wald test and 0.03 for the LRT, neither of which will pass multiple test correction.

You could try edgeR robust, which would likely downweight the outlier count in the first group and have more confident inference in this example.

ADD REPLY • link 8.7 years ago Michael Love 42k

0

Entering edit mode

Thank you a lot Michael for your time and help.

When I remove the "outlier sample" from the second condition (6 vs 5), the normalized counts for this gene vary in these ranges: 0-91 in one group vs 409-1035 in the other, but the adjusted p-value is still no significant (http://seqanswers.com/forums/showthread.php?t=66831). I am very surprised.

Anyway, it will be the occasion to test edgeR.

ADD REPLY • link 8.7 years ago Jane Merlevede ▴ 90

1

Entering edit mode

Actually, now that I look at it, I think it's the count of 91 which gives the you a large dispersion estimate for this gene.

Note that you do get a small p-value 0.02, just not small enough for multiple test correction.

Using the edgeR robust sample weighting will help.

ADD REPLY • link 8.7 years ago Michael Love 42k

score 0 · Answer 3 · 2016-03-17

0

Entering edit mode

Jane Merlevede ▴ 90

@jane-merlevede-5019

Last seen 6.1 years ago

Exact: when removing the sample with 91 as normalized count for my gene, it has now an adjusted p-value of e-54!

ADD COMMENT • link 8.7 years ago Jane Merlevede ▴ 90

1

Entering edit mode

Is this good research ethics?? Changing your data in order to get the results that fit your hypothesis?

Isn't RNAseq suppose to be an objective research tool? Hypothesis generating and so forth?

ADD REPLY • link 8.7 years ago b.nota ▴ 370

0

Entering edit mode

I am not changing the data: I did not say I will remove this sample for ever.

I am trying to understand the reason why a gene, which is supposed to be deregulated, is not found. This is a basic approach of each analysis I think: first, you check if what is know is found. And then, you can benefit from the whole analysis to look for novelty.

As in Exome sequencing: if you do not find a mutation found by Sanger sequencing, you have to adjust your parameters or at least be sure of the reason why is it not found as well. I am not that lucky to have a good analysis from a single trial.

ADD REPLY • link 8.7 years ago Jane Merlevede ▴ 90

1

Entering edit mode

Even if it can be done, you're maybe skewing a little bit the data.

To me (but I can be wrong, especially because I don't know which gene/system you're studying) you need to consider the two opposite opinions here:

It is good practice to check for what is known. So if you don't find something that is well known due to heterogeneity of one sample it is maybe a good idea to remove it to see what is left.

At the same time, this sample is maybe "true" and the observed variation is something that you should take into account even if it contradicts your hypothesis. So by removing it you try to fit your data to your starting hypothesis which is not so good...Maybe in the other papers you're referring to they did not use as many samples as you did and got lucky finding this gene without much variation in their other samples.

I would tried to investigate the reason why this sample is different from the other, no just directly remove it to get cleaner data. Maybe just technical artifact/contamination that would give you an unbiased reason to exclude it.

ADD REPLY • link 8.7 years ago Radek ▴ 90

1

Entering edit mode

As I said previously, I won't remove that sample. For sure, I won't remove one sample based on a single gene count!

I tried to understand why this gene was not detected and if it was due to a particular sample.

We cannot conclude that "this sample is different from the others" based on this lonely count. Clustering and PCA can give some indications whether "this sample is different from the others".

ADD REPLY • link 8.7 years ago Jane Merlevede ▴ 90

0

Entering edit mode

Just to echo the others, you should not remove any samples for the final analysis.

I imagine you were doing so to explore how it affects the testing.

But you need to go with an final approach that uses all of the samples. Hence i recommended the weighting approach in edgeR robust.

ADD REPLY • link 8.7 years ago Michael Love 42k

0

Entering edit mode

Yes, thank you Michael, I have started to look at edgeR robust. It will take some time since I never tried it.

ADD REPLY • link 8.7 years ago Jane Merlevede ▴ 90