edgeR: very low p-value and very high variance within the group of replicates

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

I have a question about you edgeR, in order to fix a problem that may be very simple for you. I'm using `edgeR` to perform differential expression analysis from RNA-seq experiment. I have 6 samples of tumor cell, same tumor and same treatment: 3 patient with good prognosis and 3 patient with bad prognosis. I want to compare the gene expression among the two groups. I ran the `edgeR` pakage like follow: x <- read.delim("my_reads_count.txt", row.names="GENE") group <- factor(c(1,1,1,2,2,2)) y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) y <- estimateCommonDisp(y) y <- estimateTagwiseDisp(y) et <- exactTest(y) I obtained a very odd results: in some cases I had a very low p-value and FDR but looking at the raw data it is obvious that the difference between the two groups can't be significant. This is an example for `my_reads_count.txt`: GENE sample1_1 sample1_2 sample1_3 sample2_1 sample2_2 sample2_3 ENSG00000198842 0 3 2 2 6666 3 ENSG00000257017 3 3 25 2002 29080 4 And for `my_edgeR_results.txt`: GENE logFC logCPM PValue FDR ENSG00000198842 9.863211e+00 5.4879462930 5.368843e-07 1.953612e-04 ENSG00000257017 9.500927e+00 7.7139869397 8.072384e-10 7.171947e-07 It seems very odd that "0 3 2" versus "2 6666 3" is significant.... I would like that the variance within the group is considered. Can you help me? Some suggestion? Are there some alternative function in edgeR that is capable to fix my problem?? -- output of sessionInfo(): > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.0.8 limma_3.12.3 tweeDEseqCountData_1.0.8 Biobase_2.16.0 BiocGenerics_0.2.0 BiocInstaller_1.8.3 loaded via a namespace (and not attached): [1] tools_2.15.0 -- Sent via the guest posting facility at bioconductor.org.

edgeR edgeR • 1.7k views

ADD COMMENT • link updated 10.8 years ago by Simon Anders ★ 3.7k • written 10.8 years ago by Guest User ★ 13k

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 5.5 years ago

Hi Valentina, We have seen a few instances like this. Despite the high variance of this gene, it's dispersion is "shrunk" towards the dispersion for genes of similar average abundance. This makes it a bit sensitive to outliers, but in general this sharing of information has positive effects on the power. A few thoughts/suggestions: 1. Does a plotMDS(y) suggest that 'sample2_2' is much different than the other replicates? The two genes you show appear to be quite different for that sample. 2. I would suggest upgrading to R-3.0.x and edgeR 3.2.x; I believe some defaults have changed; in particular, the prior.df parameter (that controls the degree of moderation) in estimateTagwiseDisp() has been decreased, which will dampen the effect of these "outlier" observations somewhat. You may wish to lower prior.df even further. 3. A PhD student of mine has a promising solution for this that offers a good intermediate between resistance to outliers and power, but it's not quite ready for prime time. Stay tuned, it should be ready soon. 4. Consider adding estimateTrendedDisp() to your pipeline, like: y <- estimateCommonDisp(y) y <- estimateTrendedDisp(y) y <- estimateTagwiseDisp(y) Best regards, Mark ---------- Prof. Dr. Mark Robinson Bioinformatics, Institute of Molecular Life Sciences University of Zurich http://tiny.cc/mrobin On 19.06.2013, at 15:50, valentina [guest] <guest at="" bioconductor.org=""> wrote: > > I have a question about you edgeR, in order to fix a problem that may be very simple for you. > > I'm using `edgeR` to perform differential expression analysis from RNA-seq experiment. > > I have 6 samples of tumor cell, same tumor and same treatment: 3 patient with good prognosis and 3 patient with bad prognosis. I want to compare the gene expression among the two groups. > > I ran the `edgeR` pakage like follow: > > x <- read.delim("my_reads_count.txt", row.names="GENE") > group <- factor(c(1,1,1,2,2,2)) > y <- DGEList(counts=x,group=group) > y <- calcNormFactors(y) > y <- estimateCommonDisp(y) > y <- estimateTagwiseDisp(y) > et <- exactTest(y) > > I obtained a very odd results: in some cases I had a very low p-value and FDR but looking at the raw data it is obvious that the difference between the two groups can't be significant. > > This is an example for `my_reads_count.txt`: > > GENE sample1_1 sample1_2 sample1_3 sample2_1 sample2_2 sample2_3 > ENSG00000198842 0 3 2 2 6666 3 > ENSG00000257017 3 3 25 2002 29080 4 > > And for `my_edgeR_results.txt`: > > GENE logFC logCPM PValue FDR > ENSG00000198842 9.863211e+00 5.4879462930 5.368843e-07 1.953612e-04 > ENSG00000257017 9.500927e+00 7.7139869397 8.072384e-10 7.171947e-07 > > It seems very odd that "0 3 2" versus "2 6666 3" is significant.... > I would like that the variance within the group is considered. Can you help me? Some suggestion? > Are there some alternative function in edgeR that is capable to fix my problem?? > > -- output of sessionInfo(): > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.0.8 limma_3.12.3 tweeDEseqCountData_1.0.8 Biobase_2.16.0 BiocGenerics_0.2.0 BiocInstaller_1.8.3 > > loaded via a namespace (and not attached): > [1] tools_2.15.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.8 years ago Mark Robinson ▴ 880

0

Entering edit mode

Simon Anders ★ 3.7k

@simon-anders-3855

Last seen 3.7 years ago

Zentrum für Molekularbiologie, Universi…

Hi Valentina > I obtained a very odd results: in some cases I had a very low p-value > and FDR but looking at the raw data it is obvious that the difference > between the two groups can't be significant. Maybe also give DESeq2 a try. We recently implemented some changes to improve its robustness against outliers of the kind that you are describing. Don't be too surprised, though, if you don't find any useful hits: Usually, it takes samples from dozens of patients to find genes which gene which differ between sub-groups with different prognosis. With only 6 samples, you are severely underpowered. Simon

ADD COMMENT • link 10.8 years ago Simon Anders ★ 3.7k

Login before adding your answer.