DEXSeq on two-exon genes: how to specify a formula without redundant terms
1
0
Entering edit mode
@narayanan-manikandan-nihniaid-e-5829
Last seen 9.1 years ago
United States
Hi DEXSeq users/developers, I have used DEXSeq successfuly for genes with many exons and really like the diagnostic/visualization plots that come with it. Recently though, for genes with two testable exons, I am getting the "Underdetermined model; cannot estimate dispersions." error. I figure this is due to redundant terms in my formula as shown in PS below. So my questions are: 1) Is there a way to specify the formula count ~ sample + (condition + batch) * exon so that redundant terms 'condition + batch' are removed? 2) If not, is it safe to change ncol(mm) to qr(mm)$rank (i.e., rank of model matrix to remove redundant terms) in this piece of code in estimateExonDispersionsForModelFrame: if (nrow(mm) <= ncol(mm)) stop("Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.") Would changing the code this way violate any assumptions of the DEXSeq model? Thank you, Mani PS: # condition + batch terms are redundant as sample term is already present! > formulaDispersion count ~ sample + (condition + batch) * exon > design(ecs) condition batch Untr_biorep1 Untr biorep1 LPS_biorep1 LPS biorep1 Untr_biorep2 Untr biorep2 LPS_biorep2 LPS biorep2 > colnames(model.matrix(formulaDispersion, mf)) [1] "(Intercept)" "sampleLPS_biorep2" "sampleUntr_biorep1" [4] "sampleUntr_biorep2" "conditionUntr" "batchbiorep2" [7] "exonE002" "conditionUntr:exonE002" "batchbiorep2:exonE002"
safe DEXSeq safe DEXSeq • 1.6k views
0
Entering edit mode
@narayanan-manikandan-nihniaid-e-5829
Last seen 9.1 years ago
United States
Hi, Would someone be kind enough to clarify the two questions raised about DEXSeq in my earlier email below? Would attaching any further information help? I am working with DEXSeq 1.6.0, and here are the model frame and model matrices if it makes it easier. Thanks! >mf sample exon sizeFactor condition batch dispersion count 1 Untr_biorep1 E001 1.0517803 Untr biorep1 0.13523549 25 2 Untr_biorep1 E002 1.0517803 Untr biorep1 0.01656906 708 3 LPS_biorep1 E001 1.0010474 LPS biorep1 0.13523549 20 4 LPS_biorep1 E002 1.0010474 LPS biorep1 0.01656906 442 5 Untr_biorep2 E001 0.8920483 Untr biorep2 0.13523549 26 6 Untr_biorep2 E002 0.8920483 Untr biorep2 0.01656906 947 7 LPS_biorep2 E001 1.1090401 LPS biorep2 0.13523549 25 8 LPS_biorep2 E002 1.1090401 LPS biorep2 0.01656906 884 >mm (Intercept) sampleLPS_biorep2 sampleUntr_biorep1 sampleUntr_biorep2 1 1 0 1 0 2 1 0 1 0 3 1 0 0 0 4 1 0 0 0 5 1 0 0 1 6 1 0 0 1 7 1 1 0 0 8 1 1 0 0 conditionUntr batchbiorep2 exonE002 conditionUntr:exonE002 1 1 0 0 0 2 1 0 1 1 3 0 0 0 0 4 0 0 1 0 5 1 1 0 0 6 1 1 1 1 7 0 1 0 0 8 0 1 1 0 batchbiorep2:exonE002 1 0 2 0 3 0 4 0 5 0 6 1 7 0 8 1 attr(,"assign") [1] 0 1 1 1 2 3 4 5 6 attr(,"contrasts") attr(,"contrasts")$sample [1] "contr.treatment" attr(,"contrasts")$condition [1] "contr.treatment" attr(,"contrasts")$batch [1] "contr.treatment" attr(,"contrasts")$exon [1] "contr.treatment" From: Narayanan, Manikandan (NIH/NIAID) [E] Sent: Thursday, May 16, 2013 11:14 AM To: bioconductor@r-project.org Subject: DEXSeq on two-exon genes: how to specify a formula without redundant terms Hi DEXSeq users/developers, I have used DEXSeq successfuly for genes with many exons and really like the diagnostic/visualization plots that come with it. Recently though, for genes with two testable exons, I am getting the "Underdetermined model; cannot estimate dispersions." error. I figure this is due to redundant terms in my formula as shown in PS below. So my questions are: 1) Is there a way to specify the formula count ~ sample + (condition + batch) * exon so that redundant terms 'condition + batch' are removed? 2) If not, is it safe to change ncol(mm) to qr(mm)$rank (i.e., rank of model matrix to remove redundant terms) in this piece of code in estimateExonDispersionsForModelFrame: if (nrow(mm) <= ncol(mm)) stop("Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.") Would changing the code this way violate any assumptions of the DEXSeq model? Thank you, Mani PS: # condition + batch terms are redundant as sample term is already present! > formulaDispersion count ~ sample + (condition + batch) * exon > design(ecs) condition batch Untr_biorep1 Untr biorep1 LPS_biorep1 LPS biorep1 Untr_biorep2 Untr biorep2 LPS_biorep2 LPS biorep2 > colnames(model.matrix(formulaDispersion, mf)) [1] "(Intercept)" "sampleLPS_biorep2" "sampleUntr_biorep1" [4] "sampleUntr_biorep2" "conditionUntr" "batchbiorep2" [7] "exonE002" "conditionUntr:exonE002" "batchbiorep2:exonE002" [[alternative HTML version deleted]]
0
Entering edit mode
Dear Manikandan Narayanan, Sorry for the delayed reply! Thanks for your e-mail! It has led us to spot two minor bugs and therefore modified the DEXSeq code in two aspects: Firstly, redundant terms were removed in DEXSeq in a very naive way: a lm.fit model was first fitted and the columns from the model matrix with NA coefficients were removed from the model matrix, now a proper function has been written for this purpose (with basically the code to what you also proposed). Secondly, our "if" statement that you mentioned was wrongly located within the function, which was causing your error message. It has been relocated and now it seems to work for its purpose ( I tested with your model frame and formula). You will find the changes in the last version of the svn (1.7.2). Let me know if it works for you or if you have additional questions. Best regards, Alejandro Reyes > Hi, > Would someone be kind enough to clarify the two questions raised about DEXSeq in my earlier email below? > > Would attaching any further information help? I am working with DEXSeq 1.6.0, and here are the model frame and model matrices if it makes it easier. Thanks! > >> mf > sample exon sizeFactor condition batch dispersion count > 1 Untr_biorep1 E001 1.0517803 Untr biorep1 0.13523549 25 > 2 Untr_biorep1 E002 1.0517803 Untr biorep1 0.01656906 708 > 3 LPS_biorep1 E001 1.0010474 LPS biorep1 0.13523549 20 > 4 LPS_biorep1 E002 1.0010474 LPS biorep1 0.01656906 442 > 5 Untr_biorep2 E001 0.8920483 Untr biorep2 0.13523549 26 > 6 Untr_biorep2 E002 0.8920483 Untr biorep2 0.01656906 947 > 7 LPS_biorep2 E001 1.1090401 LPS biorep2 0.13523549 25 > 8 LPS_biorep2 E002 1.1090401 LPS biorep2 0.01656906 884 > >> mm > (Intercept) sampleLPS_biorep2 sampleUntr_biorep1 sampleUntr_biorep2 > 1 1 0 1 0 > 2 1 0 1 0 > 3 1 0 0 0 > 4 1 0 0 0 > 5 1 0 0 1 > 6 1 0 0 1 > 7 1 1 0 0 > 8 1 1 0 0 > conditionUntr batchbiorep2 exonE002 conditionUntr:exonE002 > 1 1 0 0 0 > 2 1 0 1 1 > 3 0 0 0 0 > 4 0 0 1 0 > 5 1 1 0 0 > 6 1 1 1 1 > 7 0 1 0 0 > 8 0 1 1 0 > batchbiorep2:exonE002 > 1 0 > 2 0 > 3 0 > 4 0 > 5 0 > 6 1 > 7 0 > 8 1 > attr(,"assign") > [1] 0 1 1 1 2 3 4 5 6 > attr(,"contrasts") > attr(,"contrasts")$sample > [1] "contr.treatment" > > attr(,"contrasts")$condition > [1] "contr.treatment" > > attr(,"contrasts")$batch > [1] "contr.treatment" > > attr(,"contrasts")$exon > [1] "contr.treatment" > > > > > > > > From: Narayanan, Manikandan (NIH/NIAID) [E] > Sent: Thursday, May 16, 2013 11:14 AM > To: bioconductor at r-project.org > Subject: DEXSeq on two-exon genes: how to specify a formula without redundant terms > > Hi DEXSeq users/developers, > I have used DEXSeq successfuly for genes with many exons and really like the diagnostic/visualization plots that come with it. Recently though, for genes with two testable exons, I am getting the "Underdetermined model; cannot estimate dispersions." error. > > I figure this is due to redundant terms in my formula as shown in PS below. So my questions are: > > 1) Is there a way to specify the formula count ~ sample + (condition + batch) * exon so that redundant terms 'condition + batch' are removed? > > 2) If not, is it safe to change ncol(mm) to qr(mm)$rank (i.e., rank of model matrix to remove redundant terms) in this piece of code in estimateExonDispersionsForModelFrame: > if (nrow(mm) <= ncol(mm)) > stop("Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.") > > Would changing the code this way violate any assumptions of the DEXSeq model? > > > Thank you, > Mani > > > PS: # condition + batch terms are redundant as sample term is already present! >> formulaDispersion > count ~ sample + (condition + batch) * exon > >> design(ecs) > condition batch > Untr_biorep1 Untr biorep1 > LPS_biorep1 LPS biorep1 > Untr_biorep2 Untr biorep2 > LPS_biorep2 LPS biorep2 > >> colnames(model.matrix(formulaDispersion, mf)) > [1] "(Intercept)" "sampleLPS_biorep2" "sampleUntr_biorep1" > [4] "sampleUntr_biorep2" "conditionUntr" "batchbiorep2" > [7] "exonE002" "conditionUntr:exonE002" "batchbiorep2:exonE002" > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Great, thank you Alejandro - I will check out the code and let you know how it goes. In the mean time, I also found a small typo in the DEXSeqHTML function. The line results[whichis.na(results$pvalue)), ][, c("pvalue", "padjust")] = 1 breaks with an error if there are no NA pvalues, and can be patched by changing to: results[whichis.na(results$pvalue)),c("pvalue", "padjust")] = 1 While you are at it, it would also be useful if extraCols is indexed by geneIDs as rownames - that way we can send in extra info for as many genes in any order and it would be properly added to genetable for display in the DEXSeqHTML report. In detail, if you could change: genetable <- cbind(extraCols, genetable) to genetable <- cbind(extraCols[match(genetable$geneID, rownames(extraCols)),], genetable]) or something similar and change the documentation appropriatley, that would be quite helpful to users too! Thanks, Mani ________________________________________ From: Alejandro Reyes [alejandro.reyes@embl.de] Sent: Friday, May 24, 2013 8:19 AM To: Narayanan, Manikandan (NIH/NIAID) [E] Cc: bioconductor at r-project.org Subject: Re: [BioC] DEXSeq on two-exon genes: how to specify a formula without redundant terms Dear Manikandan Narayanan, Sorry for the delayed reply! Thanks for your e-mail! It has led us to spot two minor bugs and therefore modified the DEXSeq code in two aspects: Firstly, redundant terms were removed in DEXSeq in a very naive way: a lm.fit model was first fitted and the columns from the model matrix with NA coefficients were removed from the model matrix, now a proper function has been written for this purpose (with basically the code to what you also proposed). Secondly, our "if" statement that you mentioned was wrongly located within the function, which was causing your error message. It has been relocated and now it seems to work for its purpose ( I tested with your model frame and formula). You will find the changes in the last version of the svn (1.7.2). Let me know if it works for you or if you have additional questions. Best regards, Alejandro Reyes > Hi, > Would someone be kind enough to clarify the two questions raised about DEXSeq in my earlier email below? > > Would attaching any further information help? I am working with DEXSeq 1.6.0, and here are the model frame and model matrices if it makes it easier. Thanks! > >> mf > sample exon sizeFactor condition batch dispersion count > 1 Untr_biorep1 E001 1.0517803 Untr biorep1 0.13523549 25 > 2 Untr_biorep1 E002 1.0517803 Untr biorep1 0.01656906 708 > 3 LPS_biorep1 E001 1.0010474 LPS biorep1 0.13523549 20 > 4 LPS_biorep1 E002 1.0010474 LPS biorep1 0.01656906 442 > 5 Untr_biorep2 E001 0.8920483 Untr biorep2 0.13523549 26 > 6 Untr_biorep2 E002 0.8920483 Untr biorep2 0.01656906 947 > 7 LPS_biorep2 E001 1.1090401 LPS biorep2 0.13523549 25 > 8 LPS_biorep2 E002 1.1090401 LPS biorep2 0.01656906 884 > >> mm > (Intercept) sampleLPS_biorep2 sampleUntr_biorep1 sampleUntr_biorep2 > 1 1 0 1 0 > 2 1 0 1 0 > 3 1 0 0 0 > 4 1 0 0 0 > 5 1 0 0 1 > 6 1 0 0 1 > 7 1 1 0 0 > 8 1 1 0 0 > conditionUntr batchbiorep2 exonE002 conditionUntr:exonE002 > 1 1 0 0 0 > 2 1 0 1 1 > 3 0 0 0 0 > 4 0 0 1 0 > 5 1 1 0 0 > 6 1 1 1 1 > 7 0 1 0 0 > 8 0 1 1 0 > batchbiorep2:exonE002 > 1 0 > 2 0 > 3 0 > 4 0 > 5 0 > 6 1 > 7 0 > 8 1 > attr(,"assign") > [1] 0 1 1 1 2 3 4 5 6 > attr(,"contrasts") > attr(,"contrasts")$sample > [1] "contr.treatment" > > attr(,"contrasts")$condition > [1] "contr.treatment" > > attr(,"contrasts")$batch > [1] "contr.treatment" > > attr(,"contrasts")$exon > [1] "contr.treatment" > > > > > > > > From: Narayanan, Manikandan (NIH/NIAID) [E] > Sent: Thursday, May 16, 2013 11:14 AM > To: bioconductor at r-project.org > Subject: DEXSeq on two-exon genes: how to specify a formula without redundant terms > > Hi DEXSeq users/developers, > I have used DEXSeq successfuly for genes with many exons and really like the diagnostic/visualization plots that come with it. Recently though, for genes with two testable exons, I am getting the "Underdetermined model; cannot estimate dispersions." error. > > I figure this is due to redundant terms in my formula as shown in PS below. So my questions are: > > 1) Is there a way to specify the formula count ~ sample + (condition + batch) * exon so that redundant terms 'condition + batch' are removed? > > 2) If not, is it safe to change ncol(mm) to qr(mm)$rank (i.e., rank of model matrix to remove redundant terms) in this piece of code in estimateExonDispersionsForModelFrame: > if (nrow(mm) <= ncol(mm)) > stop("Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.") > > Would changing the code this way violate any assumptions of the DEXSeq model? > > > Thank you, > Mani > > > PS: # condition + batch terms are redundant as sample term is already present! >> formulaDispersion > count ~ sample + (condition + batch) * exon > >> design(ecs) > condition batch > Untr_biorep1 Untr biorep1 > LPS_biorep1 LPS biorep1 > Untr_biorep2 Untr biorep2 > LPS_biorep2 LPS biorep2 > >> colnames(model.matrix(formulaDispersion, mf)) > [1] "(Intercept)" "sampleLPS_biorep2" "sampleUntr_biorep1" > [4] "sampleUntr_biorep2" "conditionUntr" "batchbiorep2" > [7] "exonE002" "conditionUntr:exonE002" "batchbiorep2:exonE002" > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
0
Entering edit mode
[bioc-devel added and subject header changed] Dear List & Developers Mani's post reminded me of the pitfall posed by using expressions such as x = x[ -which(someExpression) ] which tends to have unintended consequences in particular if all elements of 'someExpression' are FALSE. Compare: x=1:10 > x[ -which(x==17) ] integer(0) > x[ x!=17 ] [1] 1 2 3 4 5 6 7 8 9 10 A quick search in Bioconductor source code found this idiom in multiple packages, including: IRanges Biostrings beadarray BeadDataPackR BioNet affycoretools annotate copynumber CRImage DiffBind flowStats geNetClassifier GGtools maigesPack maskBAD MineICA MmPalateMiRNA Mulcom nem phyloseq plgem RBGL girafe Ringo RIPSeeker rnaSeqMap SomatiCA Starr VanillaICE I cannot tell whether this poses an actual bug threat in any of these cases, but developers might want to double-check. Best wishes Wolfgang On May 24, 2013, at 5:08 pm, "Narayanan, Manikandan (NIH/NIAID) [E]" <manikandan.narayanan at="" nih.gov=""> wrote: > Great, thank you Alejandro - I will check out the code and let you know how it goes. > > In the mean time, I also found a small typo in the DEXSeqHTML function. The line > results[whichis.na(results$pvalue)), ][, c("pvalue", "padjust")] = 1 > breaks with an error if there are no NA pvalues, and can be patched by changing to: > results[whichis.na(results$pvalue)),c("pvalue", "padjust")] = 1 > > While you are at it, it would also be useful if extraCols is indexed by geneIDs as rownames - that way we can send in extra info for as many > genes in any order and it would be properly added to genetable for display in the DEXSeqHTML report. In detail, if you could change: > genetable <- cbind(extraCols, genetable) > to > genetable <- cbind(extraCols[match(genetable$geneID, rownames(extraCols)),], genetable]) > or something similar and change the documentation appropriatley, that would be quite helpful to users too! > > > Thanks, > Mani > > > > ________________________________________ > From: Alejandro Reyes [alejandro.reyes at embl.de] > Sent: Friday, May 24, 2013 8:19 AM > To: Narayanan, Manikandan (NIH/NIAID) [E] > Cc: bioconductor at r-project.org > Subject: Re: [BioC] DEXSeq on two-exon genes: how to specify a formula without redundant terms > > Dear Manikandan Narayanan, > > Sorry for the delayed reply! > > Thanks for your e-mail! It has led us to spot two minor bugs and > therefore modified the DEXSeq code in two aspects: > > Firstly, redundant terms were removed in DEXSeq in a very naive way: a > lm.fit model was first fitted and the columns from the model matrix with > NA coefficients were removed from the model matrix, now a proper > function has been written for this purpose (with basically the code to > what you also proposed). Secondly, our "if" statement that you mentioned > was wrongly located within the function, which was causing your error > message. It has been relocated and now it seems to work for its purpose > ( I tested with your model frame and formula). > > You will find the changes in the last version of the svn (1.7.2). Let me > know if it works for you or if you have additional questions. > > Best regards, > Alejandro Reyes > > >> Hi, >> Would someone be kind enough to clarify the two questions raised about DEXSeq in my earlier email below? >> >> Would attaching any further information help? I am working with DEXSeq 1.6.0, and here are the model frame and model matrices if it makes it easier. Thanks! >> >>> mf >> sample exon sizeFactor condition batch dispersion count >> 1 Untr_biorep1 E001 1.0517803 Untr biorep1 0.13523549 25 >> 2 Untr_biorep1 E002 1.0517803 Untr biorep1 0.01656906 708 >> 3 LPS_biorep1 E001 1.0010474 LPS biorep1 0.13523549 20 >> 4 LPS_biorep1 E002 1.0010474 LPS biorep1 0.01656906 442 >> 5 Untr_biorep2 E001 0.8920483 Untr biorep2 0.13523549 26 >> 6 Untr_biorep2 E002 0.8920483 Untr biorep2 0.01656906 947 >> 7 LPS_biorep2 E001 1.1090401 LPS biorep2 0.13523549 25 >> 8 LPS_biorep2 E002 1.1090401 LPS biorep2 0.01656906 884 >> >>> mm >> (Intercept) sampleLPS_biorep2 sampleUntr_biorep1 sampleUntr_biorep2 >> 1 1 0 1 0 >> 2 1 0 1 0 >> 3 1 0 0 0 >> 4 1 0 0 0 >> 5 1 0 0 1 >> 6 1 0 0 1 >> 7 1 1 0 0 >> 8 1 1 0 0 >> conditionUntr batchbiorep2 exonE002 conditionUntr:exonE002 >> 1 1 0 0 0 >> 2 1 0 1 1 >> 3 0 0 0 0 >> 4 0 0 1 0 >> 5 1 1 0 0 >> 6 1 1 1 1 >> 7 0 1 0 0 >> 8 0 1 1 0 >> batchbiorep2:exonE002 >> 1 0 >> 2 0 >> 3 0 >> 4 0 >> 5 0 >> 6 1 >> 7 0 >> 8 1 >> attr(,"assign") >> [1] 0 1 1 1 2 3 4 5 6 >> attr(,"contrasts") >> attr(,"contrasts")$sample >> [1] "contr.treatment" >> >> attr(,"contrasts")$condition >> [1] "contr.treatment" >> >> attr(,"contrasts")$batch >> [1] "contr.treatment" >> >> attr(,"contrasts")$exon >> [1] "contr.treatment" >> >> >> >> >> >> >> >> From: Narayanan, Manikandan (NIH/NIAID) [E] >> Sent: Thursday, May 16, 2013 11:14 AM >> To: bioconductor at r-project.org >> Subject: DEXSeq on two-exon genes: how to specify a formula without redundant terms >> >> Hi DEXSeq users/developers, >> I have used DEXSeq successfuly for genes with many exons and really like the diagnostic/visualization plots that come with it. Recently though, for genes with two testable exons, I am getting the "Underdetermined model; cannot estimate dispersions." error. >> >> I figure this is due to redundant terms in my formula as shown in PS below. So my questions are: >> >> 1) Is there a way to specify the formula count ~ sample + (condition + batch) * exon so that redundant terms 'condition + batch' are removed? >> >> 2) If not, is it safe to change ncol(mm) to qr(mm)$rank (i.e., rank of model matrix to remove redundant terms) in this piece of code in estimateExonDispersionsForModelFrame: >> if (nrow(mm) <= ncol(mm)) >> stop("Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.") >> >> Would changing the code this way violate any assumptions of the DEXSeq model? >> >> >> Thank you, >> Mani >> >> >> PS: # condition + batch terms are redundant as sample term is already present! >>> formulaDispersion >> count ~ sample + (condition + batch) * exon >> >>> design(ecs) >> condition batch >> Untr_biorep1 Untr biorep1 >> LPS_biorep1 LPS biorep1 >> Untr_biorep2 Untr biorep2 >> LPS_biorep2 LPS biorep2 >> >>> colnames(model.matrix(formulaDispersion, mf)) >> [1] "(Intercept)" "sampleLPS_biorep2" "sampleUntr_biorep1" >> [4] "sampleUntr_biorep2" "conditionUntr" "batchbiorep2" >> [7] "exonE002" "conditionUntr:exonE002" "batchbiorep2:exonE002" >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 411 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6