edgeR on differentially expressed genes with low read counts

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hi, I am using the package edgeR to determine differentially expressed genes in RNA-seq data. I have 8 samples, corresponding to 3 patients (2 conditions per patient and 2 patients duplicated). I performed the contrast of these two conditions, following the user guide (section 11 Case study: Oral carcinomas vs matched normal tissue). When analyzing some border line differentially expressed genes (FDR ~ 0.02) I found that in some cases, the read counts was really low, e.g. only one sample with 2 reads, and the others (7) with 0 counts. > countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",colna mes(countsTable))] 61_CT.poli 61_CT.tot 67_CT.poli 67_CT.tot 70_CT.poli 70_CT.tot 61m2_CT.tot 67m2_CT.tot ENSG00000207696 0 0 0 0 2 0 0 0 The design matrix used was: > design patientsCT61 patientsCT67 patientsCT70 as.factor(1 - (as.numeric(groupsCT) - 1))1 1 1 0 0 0 2 0 1 0 0 3 0 0 1 0 4 1 0 0 0 5 0 1 0 0 6 1 0 0 1 7 0 1 0 1 8 0 0 1 1 The result for the gene in question was: > tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",] table.logConc table.logFC table.PValue table.FDR adjust.method comparison ENSG00000207696 -18.34 7.726 0.005222 0.03941 BH as.factor(1 - (as.numeric(groupsCT) - 1))1 When investigating the result of the gene, I found that the glm-fitted values looked like this: > format((glmfit.DGEensCT$fitted.values)[rownames(glmfit.DGEensCT$fitt ed.values)=="ENSG00000207696",],scientific=FALSE,digits=4) 61_CT.tot 67_CT.tot 70_CT.tot 61m2_CT.tot 67m2_CT.tot 61_CT.poli 67_CT.poli 70_CT.poli "0.00008988" "0.00004048" "0.06229493" "0.00026943" "0.00014959" "0.03479503" "0.03522840" "1.84247667" Since I was not able to find the details used in edgeR to calculate the model easily, I was wondering if from this glm-fitted values is it reasonable to obtain such a low FDR? I used the common dispersion for the variance estimate: DGEensCT.CR<-estimateCRDisp(DGEensCT,design) glmfit.DGEensCT <- glmFit(DGEensCT, design,dispersion = DGEensCT.CR$CR.common.dispersion) lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT) Could someone help me with this? Thank you very much in advanced! Cheers, Luc??a -- output of sessionInfo(): R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.6.0 edgeR_2.0.5 loaded via a namespace (and not attached): [1] RCurl_1.4-3 XML_3.2-0 limma_3.6.9 -- Sent via the guest posting facility at bioconductor.org.

edgeR edgeR • 1.0k views

ADD COMMENT • link updated 12.4 years ago by Gordon Smyth 50k • written 12.4 years ago by Guest User ★ 13k

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.0 years ago

United States

You should filter out the genes with extremely low total counts as you will not have enough power to achieve significance. Naomi At 08:54 AM 11/29/2011, Luc??a [guest] wrote: >Hi, >I am using the package edgeR to determine >differentially expressed genes in RNA-seq data. >I have 8 samples, corresponding to 3 patients (2 >conditions per patient and 2 patients >duplicated). I performed the contrast of these >two conditions, following the user guide >(section 11 Case study: Oral carcinomas vs >matched normal tissue). When analyzing some >border line differentially expressed genes (FDR >~ 0.02) I found that in some cases, the read >counts was really low, e.g. only one sample with >2 reads, and the others (7) with 0 counts. > > > countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",colna mes(countsTable))] > 61_CT.poli 61_CT.tot 67_CT.poli > 67_CT.tot 70_CT.poli 70_CT.tot 61m2_CT.tot 67m2_CT.tot >ENSG00000207696 0 0 0 > 0 2 0 0 0 > >The design matrix used was: > > design > patientsCT61 patientsCT67 patientsCT70 > as.factor(1 - (as.numeric(groupsCT) - 1))1 >1 1 0 0 > 0 >2 0 1 0 > 0 >3 0 0 1 > 0 >4 1 0 0 > 0 >5 0 1 0 > 0 >6 1 0 0 > 1 >7 0 1 0 > 1 >8 0 0 1 > 1 > >The result for the gene in question was: > > tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",] > table.logConc table.logFC > table.PValue table.FDR adjust.method comparison >ENSG00000207696 -18.34 7.726 >0.005222 0.03941 BH as.factor(1 - (as.numeric(groupsCT) - 1))1 > >When investigating the result of the gene, I >found that the glm-fitted values looked like this: > > > > format((glmfit.DGEensCT$fitted.values)[rownames(glmfit.DGEensCT$fitt ed.values)=="ENSG00000207696",],scientific=FALSE,digits=4) > 61_CT.tot 67_CT.tot 70_CT.tot > 61m2_CT.tot 67m2_CT.tot 61_CT.poli 67_CT.poli 70_CT.poli >"0.00008988" "0.00004048" "0.06229493" >"0.00026943" "0.00014959" "0.03479503" "0.03522840" "1.84247667" > >Since I was not able to find the details used in >edgeR to calculate the model easily, I was >wondering if from this glm-fitted values is it >reasonable to obtain such a low FDR? >I used the common dispersion for the variance estimate: >DGEensCT.CR<-estimateCRDisp(DGEensCT,design) >glmfit.DGEensCT <- glmFit(DGEensCT, >design,dispersion = DGEensCT.CR$CR.common.dispersion) >lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT) > >Could someone help me with this? >Thank you very much in advanced! > >Cheers, >Luc??a > > -- output of sessionInfo(): > >R version 2.12.0 (2010-10-15) >Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > >locale: >[1] C > >attached base packages: >[1] stats graphics grDevices utils datasets methods base > >other attached packages: >[1] biomaRt_2.6.0 edgeR_2.0.5 > >loaded via a namespace (and not attached): >[1] RCurl_1.4-3 XML_3.2-0 limma_3.6.9 > >-- >Sent via the guest posting facility at bioconductor.org. > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.4 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

Hi Naomi On 11/29/2011 06:33 PM, Naomi Altman wrote: > You should filter out the genes with extremely low total counts as you > will not have enough power to achieve significance. I think you were to quick here to recommend filtering. Filtering should help to increase power, but it should not be necessary to get type-I error control. Lucia's problem is not that she has a lack of power, rather the opposite. She is worried that the reported p values are too optimistic, and I would agree that the example she quotes looks rather suspicious: > > [...] When analyzing >> some border line differentially expressed genes (FDR ~ 0.02) I found >> that in some cases, the read counts was really low, e.g. only one >> sample with 2 reads, and the others (7) with 0 counts. >> > >> countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",coln ames(countsTable))] >> >> 61_CT.poli 61_CT.tot 67_CT.poli 67_CT.tot 70_CT.poli 70_CT.tot >> 61m2_CT.tot 67m2_CT.tot >> ENSG00000207696 0 0 0 0 2 0 0 0 Something seems to have gone wrong here. Simon

ADD REPLY • link 12.4 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 34 minutes ago

WEHI, Melbourne, Australia

Dear Luca, The edgeR function glmLRT() conducts a likelihood ratio test, a standard procedure for generalized linear models. It is virtually impossible to get the p-value that you give from the counts that you give, regardless of the dispersion estimation, although the p-value does depend on the total library sizes, which I can't determine from your email. So I would guess that there is a code error somewhere in the analysis that you've used, perhaps the different data tables have got out of sync during the analysis somehow. I can see that the design matrix doesn't seem to agree with the columns of the data matrix. For example, patient CT61 seems to be associated with columns one, two and seven of your data, but is associated with libraries one, four and six in your design matrix. There may be other issues as well. As Naomi has mentioned, I would recommend that any gene with very low counts be filtered from the data table before analysis, because such genes can never legitimately be significantly differentially expressed. For your experiment, I would probably choose to keep genes with at least 1 count-per-million in at least three libraries. See the edgeR User's Guide for more information about this. However this doesn't seem to your main problem. If you are still having a problem and need further advice, please give complete uninterrupted code that takes you from the original data table to topTags results. Best wishes Gordon ------------------- original message ---------------------- Luca [guest] guest at bioconductor.org Tue Nov 29 14:54:51 CET 2011 Hi, I am using the package edgeR to determine differentially expressed genes in RNA-seq data. I have 8 samples, corresponding to 3 patients (2 conditions per patient and 2 patients duplicated). I performed the contrast of these two conditions, following the user guide (section 11 Case study: Oral carcinomas vs matched normal tissue). When analyzing some border line differentially expressed genes (FDR ~ 0.02) I found that in some cases, the read counts was really low, e.g. only one sample with 2 reads, and the others (7) with 0 counts. > countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",colname s(countsTable))] 61_CT.poli 61_CT.tot 67_CT.poli 67_CT.tot 70_CT.poli 70_CT.tot 61m2_CT.tot 67m2_CT.tot ENSG00000207696 0 0 0 0 2 0 0 0 The design matrix used was: > design patientsCT61 patientsCT67 patientsCT70 as.factor(1 - (as.numeric(groupsCT) - 1))1 1 1 0 0 0 2 0 1 0 0 3 0 0 1 0 4 1 0 0 0 5 0 1 0 0 6 1 0 0 1 7 0 1 0 1 8 0 0 1 1 The result for the gene in question was: > tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",] table.logConc table.logFC table.PValue table.FDR adjust.method comparison ENSG00000207696 -18.34 7.726 0.005222 0.03941 BH as.factor(1 - (as.numeric(groupsCT) - 1))1 When investigating the result of the gene, I found that the glm-fitted values looked like this: > format((glmfit.DGEensCT$fitted.values)[rownames(glmfit.DGEensCT$fitted .values)=="ENSG00000207696",],scientific=FALSE,digits=4) 61_CT.tot 67_CT.tot 70_CT.tot 61m2_CT.tot 67m2_CT.tot 61_CT.poli 67_CT.poli 70_CT.poli "0.00008988" "0.00004048" "0.06229493" "0.00026943" "0.00014959" "0.03479503" "0.03522840" "1.84247667" Since I was not able to find the details used in edgeR to calculate the model easily, I was wondering if from this glm-fitted values is it reasonable to obtain such a low FDR? I used the common dispersion for the variance estimate: DGEensCT.CR<-estimateCRDisp(DGEensCT,design) glmfit.DGEensCT <- glmFit(DGEensCT, design,dispersion = DGEensCT.CR$CR.common.dispersion) lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT) Could someone help me with this? Thank you very much in advanced! Cheers, Luca -- output of sessionInfo(): R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.6.0 edgeR_2.0.5 loaded via a namespace (and not attached): [1] RCurl_1.4-3 XML_3.2-0 limma_3.6.9 -- Sent via the guest posting facility at bioconductor.org. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.4 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Luca, Also, I notice from the sessionInfo (thanks for including it) that you're using versions of R and edgeR from two Bioconductor releases ago, i.e., more than a year out of date. You're using a very, very early version of the edgeR glm code. It's important to install the current software, although it will not fix your main issue, which arises from another cause. Best wishes Gordon On Wed, 30 Nov 2011, Gordon K Smyth wrote: > Dear Luca, > > The edgeR function glmLRT() conducts a likelihood ratio test, a standard > procedure for generalized linear models. It is virtually impossible to get > the p-value that you give from the counts that you give, regardless of the > dispersion estimation, although the p-value does depend on the total library > sizes, which I can't determine from your email. So I would guess that there > is a code error somewhere in the analysis that you've used, perhaps the > different data tables have got out of sync during the analysis somehow. I > can see that the design matrix doesn't seem to agree with the columns of the > data matrix. For example, patient CT61 seems to be associated with columns > one, two and seven of your data, but is associated with libraries one, four > and six in your design matrix. There may be other issues as well. > > As Naomi has mentioned, I would recommend that any gene with very low counts > be filtered from the data table before analysis, because such genes can never > legitimately be significantly differentially expressed. For your experiment, > I would probably choose to keep genes with at least 1 count-per- million in at > least three libraries. See the edgeR User's Guide for more information about > this. However this doesn't seem to your main problem. > > If you are still having a problem and need further advice, please give > complete uninterrupted code that takes you from the original data table to > topTags results. > > Best wishes > Gordon > > ------------------- original message ---------------------- > > Luca [guest] guest at bioconductor.org > Tue Nov 29 14:54:51 CET 2011 > > Hi, > I am using the package edgeR to determine differentially expressed genes in > RNA-seq data. I have 8 samples, corresponding to 3 patients (2 conditions per > patient and 2 patients duplicated). I performed the contrast of these two > conditions, following the user guide (section 11 Case study: Oral carcinomas > vs matched normal tissue). When analyzing some border line differentially > expressed genes (FDR ~ 0.02) I found that in some cases, the read counts was > really low, e.g. only one sample with 2 reads, and the others (7) with 0 > counts. >> > countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",colna mes(countsTable))] > 61_CT.poli 61_CT.tot 67_CT.poli 67_CT.tot 70_CT.poli > 70_CT.tot 61m2_CT.tot 67m2_CT.tot > ENSG00000207696 0 0 0 0 2 0 > 0 0 > > The design matrix used was: >> design > patientsCT61 patientsCT67 patientsCT70 as.factor(1 - (as.numeric(groupsCT) > - 1))1 > 1 1 0 0 0 > 2 0 1 0 0 > 3 0 0 1 0 > 4 1 0 0 0 > 5 0 1 0 0 > 6 1 0 0 1 > 7 0 1 0 1 > 8 0 0 1 1 > > The result for the gene in question was: >> tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",] > table.logConc table.logFC table.PValue table.FDR > adjust.method comparison > ENSG00000207696 -18.34 7.726 0.005222 0.03941 BH > as.factor(1 - (as.numeric(groupsCT) - 1))1 > > When investigating the result of the gene, I found that the glm- fitted values > looked like this: > >> > format((glmfit.DGEensCT$fitted.values)[rownames(glmfit.DGEensCT$fitt ed.values)=="ENSG00000207696",],scientific=FALSE,digits=4) > 61_CT.tot 67_CT.tot 70_CT.tot 61m2_CT.tot 67m2_CT.tot 61_CT.poli > 67_CT.poli 70_CT.poli > "0.00008988" "0.00004048" "0.06229493" "0.00026943" "0.00014959" "0.03479503" > "0.03522840" "1.84247667" > > Since I was not able to find the details used in edgeR to calculate the model > easily, I was wondering if from this glm-fitted values is it reasonable to > obtain such a low FDR? > I used the common dispersion for the variance estimate: > DGEensCT.CR<-estimateCRDisp(DGEensCT,design) > glmfit.DGEensCT <- glmFit(DGEensCT, design,dispersion = > DGEensCT.CR$CR.common.dispersion) > lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT) > > Could someone help me with this? > Thank you very much in advanced! > > Cheers, > Luca > > -- output of sessionInfo(): > > R version 2.12.0 (2010-10-15) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] biomaRt_2.6.0 edgeR_2.0.5 > > loaded via a namespace (and not attached): > [1] RCurl_1.4-3 XML_3.2-0 limma_3.6.9 > > -- > Sent via the guest posting facility at bioconductor.org. > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD REPLY • link 12.4 years ago Gordon Smyth 50k

Login before adding your answer.