edgeR: very low p-value and very high variance within the group of replicates
2
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
I have a question about you edgeR, in order to fix a problem that may be very simple for you. I'm using `edgeR` to perform differential expression analysis from RNA-seq experiment. I have 6 samples of tumor cell, same tumor and same treatment: 3 patient with good prognosis and 3 patient with bad prognosis. I want to compare the gene expression among the two groups. I ran the `edgeR` pakage like follow: x <- read.delim("my_reads_count.txt", row.names="GENE") group <- factor(c(1,1,1,2,2,2)) y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) y <- estimateCommonDisp(y) y <- estimateTagwiseDisp(y) et <- exactTest(y) I obtained a very odd results: in some cases I had a very low p-value and FDR but looking at the raw data it is obvious that the difference between the two groups can't be significant. This is an example for `my_reads_count.txt`: GENE sample1_1 sample1_2 sample1_3 sample2_1 sample2_2 sample2_3 ENSG00000198842 0 3 2 2 6666 3 ENSG00000257017 3 3 25 2002 29080 4 And for `my_edgeR_results.txt`: GENE logFC logCPM PValue FDR ENSG00000198842 9.863211e+00 5.4879462930 5.368843e-07 1.953612e-04 ENSG00000257017 9.500927e+00 7.7139869397 8.072384e-10 7.171947e-07 It seems very odd that "0 3 2" versus "2 6666 3" is significant.... I would like that the variance within the group is considered. Can you help me? Some suggestion? Are there some alternative function in edgeR that is capable to fix my problem?? -- output of sessionInfo(): > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.0.8 limma_3.12.3 tweeDEseqCountData_1.0.8 Biobase_2.16.0 BiocGenerics_0.2.0 BiocInstaller_1.8.3 loaded via a namespace (and not attached): [1] tools_2.15.0 -- Sent via the guest posting facility at bioconductor.org.
edgeR edgeR • 1.7k views
ADD COMMENT
0
Entering edit mode
Mark Robinson ▴ 880
@mark-robinson-4908
Last seen 5.5 years ago
Hi Valentina, We have seen a few instances like this. Despite the high variance of this gene, it's dispersion is "shrunk" towards the dispersion for genes of similar average abundance. This makes it a bit sensitive to outliers, but in general this sharing of information has positive effects on the power. A few thoughts/suggestions: 1. Does a plotMDS(y) suggest that 'sample2_2' is much different than the other replicates? The two genes you show appear to be quite different for that sample. 2. I would suggest upgrading to R-3.0.x and edgeR 3.2.x; I believe some defaults have changed; in particular, the prior.df parameter (that controls the degree of moderation) in estimateTagwiseDisp() has been decreased, which will dampen the effect of these "outlier" observations somewhat. You may wish to lower prior.df even further. 3. A PhD student of mine has a promising solution for this that offers a good intermediate between resistance to outliers and power, but it's not quite ready for prime time. Stay tuned, it should be ready soon. 4. Consider adding estimateTrendedDisp() to your pipeline, like: y <- estimateCommonDisp(y) y <- estimateTrendedDisp(y) y <- estimateTagwiseDisp(y) Best regards, Mark ---------- Prof. Dr. Mark Robinson Bioinformatics, Institute of Molecular Life Sciences University of Zurich http://tiny.cc/mrobin On 19.06.2013, at 15:50, valentina [guest] <guest at="" bioconductor.org=""> wrote: > > I have a question about you edgeR, in order to fix a problem that may be very simple for you. > > I'm using `edgeR` to perform differential expression analysis from RNA-seq experiment. > > I have 6 samples of tumor cell, same tumor and same treatment: 3 patient with good prognosis and 3 patient with bad prognosis. I want to compare the gene expression among the two groups. > > I ran the `edgeR` pakage like follow: > > x <- read.delim("my_reads_count.txt", row.names="GENE") > group <- factor(c(1,1,1,2,2,2)) > y <- DGEList(counts=x,group=group) > y <- calcNormFactors(y) > y <- estimateCommonDisp(y) > y <- estimateTagwiseDisp(y) > et <- exactTest(y) > > I obtained a very odd results: in some cases I had a very low p-value and FDR but looking at the raw data it is obvious that the difference between the two groups can't be significant. > > This is an example for `my_reads_count.txt`: > > GENE sample1_1 sample1_2 sample1_3 sample2_1 sample2_2 sample2_3 > ENSG00000198842 0 3 2 2 6666 3 > ENSG00000257017 3 3 25 2002 29080 4 > > And for `my_edgeR_results.txt`: > > GENE logFC logCPM PValue FDR > ENSG00000198842 9.863211e+00 5.4879462930 5.368843e-07 1.953612e-04 > ENSG00000257017 9.500927e+00 7.7139869397 8.072384e-10 7.171947e-07 > > It seems very odd that "0 3 2" versus "2 6666 3" is significant.... > I would like that the variance within the group is considered. Can you help me? Some suggestion? > Are there some alternative function in edgeR that is capable to fix my problem?? > > -- output of sessionInfo(): > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.0.8 limma_3.12.3 tweeDEseqCountData_1.0.8 Biobase_2.16.0 BiocGenerics_0.2.0 BiocInstaller_1.8.3 > > loaded via a namespace (and not attached): > [1] tools_2.15.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Simon Anders ★ 3.7k
@simon-anders-3855
Last seen 3.7 years ago
Zentrum für Molekularbiologie, Universi…
Hi Valentina > I obtained a very odd results: in some cases I had a very low p-value > and FDR but looking at the raw data it is obvious that the difference > between the two groups can't be significant. Maybe also give DESeq2 a try. We recently implemented some changes to improve its robustness against outliers of the kind that you are describing. Don't be too surprised, though, if you don't find any useful hits: Usually, it takes samples from dozens of patients to find genes which gene which differ between sub-groups with different prognosis. With only 6 samples, you are severely underpowered. Simon
ADD COMMENT

Login before adding your answer.

Traffic: 603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6