affy code archeology: expresso behavior for liwong/invariantset

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 11.4 years ago

Hello, I am in the midst of updating some inherited legacy code from R 2.8 and affy_1.20.2 to R 2.15.2 and affy_1.36.0, trying to fix some longstanding bugs. Everything is going well except for some very different results coming out of an expresso call in one of my regression tests, specifically from: eset <- expresso(afbatch, normalize.method="invariantset", bg.correct=FALSE, pmcorrect.method="pmonly",summary.method="liwong", verbose=TRUE) It's reporting expression values approx. 0.4-0.7x the values in the earlier setup (examples below). I take for granted that many things have changed both at the R level and at the package level given the large leap in versions, but the difference seemed a bit odd since the results for other methods (RMA, GCRMA, MAS5) remained reasonably consistent in the update. It's quite possible that there is a package conflict or that I've accidentally broken something in the code base, though the legacy code itself is largely unchanged and the other regression tests seem to hold up pretty well. I have looked at a number of things already, but before heading further down that path I wanted to check whether the implementation itself might have changed in a way where these differences would be expected. Has there been a major change to liwong/invariantset somewhere along the way between 1.20.2 and 1.36.0? Checking with other members of the team, we're actually completely OK with the differences if they are in line with known changes to the affy package and can be explained to our users (this is a GenePattern module). Here is some example output for reference. This is for a cut-down data set; I've seen similar results with 20 samples. I can provide more if it would be helpful. >From the original setup (using write.table(exprs(eset)): "CL20030502207AA.CEL" "CL20030502208AA.CEL" "CL20030502307AA.CEL" "CL20030502308AA.CEL" "1007_s_at" 228.013212425507 214.877883873475 287.677963272274 306.997621485651 "1053_at" 193.206766296132 168.017787218035 169.430151901596 157.819950438341 <...snip...> >From the new setup: "CL20030502207AA.CEL" "CL20030502208AA.CEL" "CL20030502307AA.CEL" "CL20030502308AA.CEL" "1007_s_at" 500.169414439674 461.734001304198 704.700664579873 735.43106514776 "1053_at" 321.921661422307 251.464504650157 261.491260842992 228.793486504634 <...snip...> Thanks in advance! Regards, David -- David Eby Consultant Cancer Informatics Development Broad Institute of MIT and Harvard 7 Cambridge Center, Cambridge, MA 02142, USA http://www.broadinstitute.org/cancer -- output of sessionInfo(): R version 2.15.2 (2012-10-26) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] methods stats graphics grDevices utils datasets base other attached packages: [1] makecdfenv_1.36.0 gcrma_2.30.0 Biostrings_2.26.2 [4] affy_1.36.0 preprocessCore_1.20.0 affyio_1.26.0 [7] zlibbioc_1.4.0 AnnotationDbi_1.20.1 Biobase_2.18.0 [10] IRanges_1.16.2 RSQLite_0.11.2 DBI_0.2-5 [13] BiocGenerics_0.4.0 spatial_7.3-5 rpart_3.1-55 [16] nnet_7.3-5 nlme_3.1-105 mgcv_1.7-21 [19] Matrix_1.0-9 MASS_7.3-22 lattice_0.20-10 [22] KernSmooth_2.23-8 foreign_0.8-51 cluster_1.14.3 [25] class_7.3-5 boot_1.3-7 loaded via a namespace (and not attached): [1] BiocInstaller_1.8.3 grid_2.15.2 parallel_2.15.2 [4] splines_2.15.2 stats4_2.15.2 hgu133acdf_2.10.0 is not shown above but was also loaded at a later point. -- Sent via the guest posting facility at bioconductor.org.

Regression gcrma Regression gcrma • 2.4k views

ADD COMMENT • link updated 13.0 years ago by James W. MacDonald 68k • written 13.0 years ago by Guest User ★ 13k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 8 days ago

United States

On 01/24/2013 01:41 PM, David Eby [guest] wrote: > > Hello, > > I am in the midst of updating some inherited legacy code from R 2.8 and affy_1.20.2 to R 2.15.2 and affy_1.36.0, trying to fix some longstanding bugs. > That's a lot of water under the bridge (4 1/2 years? do you still have a working installation?)... > Everything is going well except for some very different results coming out of an expresso call in one of my regression tests, specifically from: > eset <- expresso(afbatch, normalize.method="invariantset", bg.correct=FALSE, pmcorrect.method="pmonly",summary.method="liwong", verbose=TRUE) ...there is this at least in svn affy$ svn log -r57229 ---------------------------------------------------------------------- -- r57229 | jmacdon | 2011-08-04 12:18:56 -0700 (Thu, 04 Aug 2011) | 1 line fixed bug in normalize.AffyBatch.invariantset ---------------------------------------------------------------------- -- and I've added Jim to the email for any insights. Martin > It's reporting expression values approx. 0.4-0.7x the values in the earlier setup (examples below). I take for granted that many things have changed both at the R level and at the package level given the large leap in versions, but the difference seemed a bit odd since the results for other methods (RMA, GCRMA, MAS5) remained reasonably consistent in the update. > > It's quite possible that there is a package conflict or that I've accidentally broken something in the code base, though the legacy code itself is largely unchanged and the other regression tests seem to hold up pretty well. I have looked at a number of things already, but before heading further down that path I wanted to check whether the implementation itself might have changed in a way where these differences would be expected. Has there been a major change to liwong/invariantset somewhere along the way between 1.20.2 and 1.36.0? > > Checking with other members of the team, we're actually completely OK with the differences if they are in line with known changes to the affy package and can be explained to our users (this is a GenePattern module). > > Here is some example output for reference. This is for a cut-down data set; I've seen similar results with 20 samples. I can provide more if it would be helpful. >>From the original setup (using write.table(exprs(eset)): > "CL20030502207AA.CEL" "CL20030502208AA.CEL" "CL20030502307AA.CEL" "CL20030502308AA.CEL" > "1007_s_at" 228.013212425507 214.877883873475 287.677963272274 306.997621485651 > "1053_at" 193.206766296132 168.017787218035 169.430151901596 157.819950438341 > <...snip...> > >>From the new setup: > "CL20030502207AA.CEL" "CL20030502208AA.CEL" "CL20030502307AA.CEL" "CL20030502308AA.CEL" > "1007_s_at" 500.169414439674 461.734001304198 704.700664579873 735.43106514776 > "1053_at" 321.921661422307 251.464504650157 261.491260842992 228.793486504634 > <...snip...> > > Thanks in advance! > Regards, > David > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 13.0 years ago Martin Morgan 25k

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Hi David, On 1/28/2013 10:14 AM, Martin Morgan wrote: > On 01/24/2013 01:41 PM, David Eby [guest] wrote: >> >> Hello, >> >> I am in the midst of updating some inherited legacy code from R 2.8 >> and affy_1.20.2 to R 2.15.2 and affy_1.36.0, trying to fix some >> longstanding bugs. >> > > That's a lot of water under the bridge (4 1/2 years? do you still have > a working installation?)... > >> Everything is going well except for some very different results >> coming out of an expresso call in one of my regression tests, >> specifically from: >> eset <- expresso(afbatch, normalize.method="invariantset", >> bg.correct=FALSE, pmcorrect.method="pmonly",summary.method="liwong", >> verbose=TRUE) > > ...there is this at least in svn > > affy$ svn log -r57229 > -------------------------------------------------------------------- ---- > r57229 | jmacdon | 2011-08-04 12:18:56 -0700 (Thu, 04 Aug 2011) | 1 line > > fixed bug in normalize.AffyBatch.invariantset > -------------------------------------------------------------------- ---- > > and I've added Jim to the email for any insights. Thanks for the detective work, Martin. This was on my to-do list for today, and now I can just press the easy button. I don't recall how this came up, but there was definitely a bug. The basic idea is that we want to say which of the arrays is the reference, based on one of four criteria. The default is to base this on the array that has the median overall intensity (where overall intensity is defined as the mean intensity). Prior to the fix we weren't correctly figuring out which array had the median overall intensity, and were instead simply picking the trunc(N/2) array, where N = # arrays. So if you had 8 arrays, this function would choose the 4th array as reference regardless. Now it actually chooses the array with the median overall intensity, so the difference between your old results and the new is caused by switching the array that is used as reference. Best, Jim > > Martin > > >> It's reporting expression values approx. 0.4-0.7x the values in the >> earlier setup (examples below). I take for granted that many things >> have changed both at the R level and at the package level given the >> large leap in versions, but the difference seemed a bit odd since the >> results for other methods (RMA, GCRMA, MAS5) remained reasonably >> consistent in the update. >> >> It's quite possible that there is a package conflict or that I've >> accidentally broken something in the code base, though the legacy >> code itself is largely unchanged and the other regression tests seem >> to hold up pretty well. I have looked at a number of things already, >> but before heading further down that path I wanted to check whether >> the implementation itself might have changed in a way where these >> differences would be expected. Has there been a major change to >> liwong/invariantset somewhere along the way between 1.20.2 and 1.36.0? >> >> Checking with other members of the team, we're actually completely OK >> with the differences if they are in line with known changes to the >> affy package and can be explained to our users (this is a GenePattern >> module). >> >> Here is some example output for reference. This is for a cut-down >> data set; I've seen similar results with 20 samples. I can provide >> more if it would be helpful. >>> From the original setup (using write.table(exprs(eset)): >> "CL20030502207AA.CEL" "CL20030502208AA.CEL" "CL20030502307AA.CEL" >> "CL20030502308AA.CEL" >> "1007_s_at" 228.013212425507 214.877883873475 287.677963272274 >> 306.997621485651 >> "1053_at" 193.206766296132 168.017787218035 169.430151901596 >> 157.819950438341 >> <...snip...> >> >>> From the new setup: >> "CL20030502207AA.CEL" "CL20030502208AA.CEL" "CL20030502307AA.CEL" >> "CL20030502308AA.CEL" >> "1007_s_at" 500.169414439674 461.734001304198 704.700664579873 >> 735.43106514776 >> "1053_at" 321.921661422307 251.464504650157 261.491260842992 >> 228.793486504634 >> <...snip...> >> >> Thanks in advance! >> Regards, >> David >> > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 13.0 years ago James W. MacDonald 68k

0

Entering edit mode

Hi Jim and Martin, Thanks for looking into this. That explanation makes a lot of sense and would certainly cause the new version to behave differently. In fact, it sounds like the former version could give different results depending on the order of the arrays. Reproducible results is a big part of what GenePattern is trying to do, so it's good to know about that. I'll be sure to document the change for our users. Thanks again for the help. I really appreciate it! Regards, David -- David Eby Consultant Cancer Informatics Development Broad Institute of MIT and Harvard 7 Cambridge Center, Cambridge, MA 02142, USAhttp://www.broadinstitute.org/cancer On Mon, Jan 28, 2013 at 11:08 AM, James W. MacDonald <jmacdon@uw.edu> wrote: > Hi David, > > > On 1/28/2013 10:14 AM, Martin Morgan wrote: > >> On 01/24/2013 01:41 PM, David Eby [guest] wrote: >> >>> >>> Hello, >>> >>> I am in the midst of updating some inherited legacy code from R 2.8 and >>> affy_1.20.2 to R 2.15.2 and affy_1.36.0, trying to fix some longstanding >>> bugs. >>> >>> >> That's a lot of water under the bridge (4 1/2 years? do you still have a >> working installation?)... >> >> Everything is going well except for some very different results coming >>> out of an expresso call in one of my regression tests, specifically from: >>> eset <- expresso(afbatch, normalize.method="**invariantset", >>> bg.correct=FALSE, pmcorrect.method="pmonly",**summary.method="liwong", >>> verbose=TRUE) >>> >> >> ...there is this at least in svn >> >> affy$ svn log -r57229 >> ------------------------------**------------------------------** >> ------------ >> r57229 | jmacdon | 2011-08-04 12:18:56 -0700 (Thu, 04 Aug 2011) | 1 line >> >> fixed bug in normalize.AffyBatch.**invariantset >> ------------------------------**------------------------------** >> ------------ >> >> and I've added Jim to the email for any insights. >> > > Thanks for the detective work, Martin. This was on my to-do list for > today, and now I can just press the easy button. > > I don't recall how this came up, but there was definitely a bug. The basic > idea is that we want to say which of the arrays is the reference, based on > one of four criteria. The default is to base this on the array that has the > median overall intensity (where overall intensity is defined as the mean > intensity). > > Prior to the fix we weren't correctly figuring out which array had the > median overall intensity, and were instead simply picking the trunc(N/2) > array, where N = # arrays. So if you had 8 arrays, this function would > choose the 4th array as reference regardless. > > Now it actually chooses the array with the median overall intensity, so > the difference between your old results and the new is caused by switching > the array that is used as reference. > > Best, > > Jim > >> >> Martin >> >> >> >> It's reporting expression values approx. 0.4-0.7x the values in the >>> earlier setup (examples below). I take for granted that many things have >>> changed both at the R level and at the package level given the large leap >>> in versions, but the difference seemed a bit odd since the results for >>> other methods (RMA, GCRMA, MAS5) remained reasonably consistent in the >>> update. >>> >>> It's quite possible that there is a package conflict or that I've >>> accidentally broken something in the code base, though the legacy code >>> itself is largely unchanged and the other regression tests seem to hold up >>> pretty well. I have looked at a number of things already, but before >>> heading further down that path I wanted to check whether the implementation >>> itself might have changed in a way where these differences would be >>> expected. Has there been a major change to liwong/invariantset somewhere >>> along the way between 1.20.2 and 1.36.0? >>> >>> Checking with other members of the team, we're actually completely OK >>> with the differences if they are in line with known changes to the affy >>> package and can be explained to our users (this is a GenePattern module). >>> >>> Here is some example output for reference. This is for a cut-down data >>> set; I've seen similar results with 20 samples. I can provide more if it >>> would be helpful. >>> >>>> From the original setup (using write.table(exprs(eset)): >>>> >>> "CL20030502207AA.CEL" "CL20030502208AA.CEL" "CL20030502307AA.CEL" >>> "CL20030502308AA.CEL" >>> "1007_s_at" 228.013212425507 214.877883873475 287.677963272274 >>> 306.997621485651 >>> "1053_at" 193.206766296132 168.017787218035 169.430151901596 >>> 157.819950438341 >>> <...snip...> >>> >>> From the new setup: >>>> >>> "CL20030502207AA.CEL" "CL20030502208AA.CEL" "CL20030502307AA.CEL" >>> "CL20030502308AA.CEL" >>> "1007_s_at" 500.169414439674 461.734001304198 704.700664579873 >>> 735.43106514776 >>> "1053_at" 321.921661422307 251.464504650157 261.491260842992 >>> 228.793486504634 >>> <...snip...> >>> >>> Thanks in advance! >>> Regards, >>> David >>> >>> >> >> > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > [[alternative HTML version deleted]]

ADD REPLY • link 13.0 years ago David Eby ▴ 30

0

Entering edit mode

Dear List, I am using DEXSeq to test for differential exon usage on a dataset with 3 samples (2 belong to a treatment group and 1 to a control group) and have 2 main questions. 1) Unbalanced design where one out of two groups has no replicates Name condition a1 control b1 treatment b2 treatment Now, I have read in a couple of other threads that there is no point in running DEXSeq when there are no replicates (unlike in case of DESeq - which I have already run for these). However, the researcher still wants me to try it out. Thus, I was wondering if there in an unbalanced design, is it still possible to run DEXSeq and would estimateDispersions still work? When I run this function, I don't get errors but when I run fitDispersionFunction is doesn't like it. ######## > ecs <- estimateDispersions(ecs, nCores=8) Estimating Cox-Reid exon dispersion estimates using 8 cores. (Progress report: one dot per 100 genes) ...................................................................... ...................................................................... ....................> > > > ecs <- fitDispersionFunction(ecs) Error in fitDispersionFunction(ecs) : no CR dispersion estimations found, please first call estimateDispersions function > > head(fData(ecs)) geneID exonID testable dispBeforeSharing ENSG00000000003:E001 ENSG00000000003 E001 TRUE NA ENSG00000000003:E002 ENSG00000000003 E002 TRUE NA ENSG00000000003:E003 ENSG00000000003 E003 TRUE NA ENSG00000000003:E004 ENSG00000000003 E004 TRUE NA ENSG00000000003:E005 ENSG00000000003 E005 TRUE NA ENSG00000000003:E006 ENSG00000000003 E006 TRUE NA dispFitted dispersion pvalue padjust chr start end ENSG00000000003:E001 NA NA NA NA X 99883667 99884983 ENSG00000000003:E002 NA NA NA NA X 99885756 99885863 ENSG00000000003:E003 NA NA NA NA X 99887482 99887537 ENSG00000000003:E004 NA NA NA NA X 99887538 99887565 ENSG00000000003:E005 NA NA NA NA X 99888402 99888438 ENSG00000000003:E006 NA NA NA NA X 99888439 99888536 strand transcripts ENSG00000000003:E001 - ENST00000373020 ENSG00000000003:E002 - ENST00000373020 ENSG00000000003:E003 - ENST00000373020 ENSG00000000003:E004 - ENST00000373020;ENST00000496771 ENSG00000000003:E005 - ENST00000373020;ENST00000496771 ENSG00000000003:E006 - ENST00000494424;ENST00000373020;ENST00000496771 ######## Also, I read in a thread somewhere where Simon suggested to set fData(ecs)$dispersion <- .1, if it is necessary for a 1 vs 1, but does not fully recommend it. Is this what I need to do? 2) How does one work with multiple groups (or conditions)? In the vignette, there is nothing specified and neither can I see anything in the testforDEU function. Unlike DESeq (or edgeR) where one can specify. In DESeq: for e.g. if there are 3 groups: Control, Treatment & Mutant: res1 <- nbinomTest(cds, "Control", "Treatment") and/or res2 <- nbinomTest(cds, "Control", "Mutant") Many Thanks, Natasha sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] WriteXLS_2.3.0 gdata_2.12.0 DEXSeq_1.4.0 Biobase_2.18.0 [5] BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] biomaRt_2.14.0 gtools_2.7.0 hwriter_1.3 plyr_1.7.1 RCurl_1.95-3 [6] statmod_1.4.16 stringr_0.6.1 tools_2.15.2 XML_3.95-0.1

ADD REPLY • link 13.0 years ago Natasha ▴ 440

0

Entering edit mode

Hi, I'll try to offer some suggestions until the authors have time to respond. Hopefully I don't lead you too far off the correct path: On Mon, Jan 28, 2013 at 11:31 AM, Natasha Sahgal <nsahgal at="" well.ox.ac.uk=""> wrote: > Dear List, > > I am using DEXSeq to test for differential exon usage on a dataset with 3 samples (2 belong to a treatment group and 1 to a control group) and have 2 main questions. > > 1) Unbalanced design where one out of two groups has no replicates > > Name condition > a1 control > b1 treatment > b2 treatment > > Now, I have read in a couple of other threads that there is no point in running DEXSeq when there are no replicates (unlike in case of DESeq - which I have already run for these). However, the researcher still wants me to try it out. > > Thus, I was wondering if there in an unbalanced design, is it still possible to run DEXSeq and would estimateDispersions still work? When I run this function, I don't get errors but when I run fitDispersionFunction is doesn't like it. > ######## >> ecs <- estimateDispersions(ecs, nCores=8) > Estimating Cox-Reid exon dispersion estimates using 8 cores. (Progress report: one dot per 100 genes) > .................................................................... ...................................................................... ......................> >> >> >> ecs <- fitDispersionFunction(ecs) > Error in fitDispersionFunction(ecs) : > no CR dispersion estimations found, please first call estimateDispersions function >> >> head(fData(ecs)) > geneID exonID testable dispBeforeSharing > ENSG00000000003:E001 ENSG00000000003 E001 TRUE NA > ENSG00000000003:E002 ENSG00000000003 E002 TRUE NA [cut] The portion of the fData table you pasted look to be all NA for the `dispBeforeSharing` estimate, which (if true) would suggest that something is going wrong during the estimateDisperion step. After you run `ecs <- estimateDispersions(ecs, nCores=8)`, what fraction of your exons has an estimated dispersion? eg: R> sum(!is.na(fData(ecs)$dispBeforeSharing)) / nrow(fData(ecs)) Hopefully it's greater than 0? A quick check seems to suggest that one could do what you need to (2 reps in condA and 1) using DEXSeq as is. Try it by creating a similar unbalanced data set from the pasilla package using only the single-read data: R> library(DEXSeq) R> data("pasillaExons", package="pasilla") R> p <- pasillaExons[, pData(pasillaExons)$type == 'single-read'] R> pData(p) <- droplevels(pData(p)) R> pData(p)[, 1:3] sizeFactor condition type treated1fb | NA treated single-read untreated1fb| NA untreated single-read untreated2fb| NA untreated single-read R> p <- estimateSizeFactors(p) R> p <- estimateDispersions(p) R> p <- fitDispersionFunction(p) R> sum(!is.na(fData(p)$dispBeforeSharing)) / nrow(fData(p)) [1] 0.7248996 > Also, I read in a thread somewhere where Simon suggested to set fData(ecs)$dispersion <- .1, if it is necessary for a 1 vs 1, but does not fully recommend it. Is this what I need to do? > > > 2) How does one work with multiple groups (or conditions)? > In the vignette, there is nothing specified and neither can I see anything in the testforDEU function. Unlike DESeq (or edgeR) where one can specify. In DESeq: for e.g. if there are 3 groups: Control, Treatment & Mutant: > > res1 <- nbinomTest(cds, "Control", "Treatment") > and/or > res2 <- nbinomTest(cds, "Control", "Mutant") If you want to explicitly only test for two conditions (and ignore the other ones), I believe that, currently, your only recourse is to subset your ExonCountSet to include only the two conditions you want to test. Make sure you do the "droplevels" trick on the pData like I did above (this will remove factors in your `condition` column that represent the conditions you've removed). After that, you can do the testForDEU. btw, the above was run on R 2.15.2 and DEXSeq 1.4.0 HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, Thanks for the rather quick response. Based on what you said, I did check for what fraction of exons have an estimated dispersion, unfortunately it is zero! > sum(!is.na(fData(ecs)$dispBeforeSharing)) / nrow(fData(ecs)) [1] 0 Also tried the subsetting the pasilla dataset for a similar design as you mentioned, and yes it did appear to work. Many Thanks, Natasha -----Original Message----- From: Steve Lianoglou [mailto:mailinglist.honeypot@gmail.com] Sent: 28 January 2013 21:25 To: Natasha Sahgal Cc: bioconductor at r-project.org Subject: Re: [BioC] DEXSeq estimating dispersions and multiple conditions Hi, I'll try to offer some suggestions until the authors have time to respond. Hopefully I don't lead you too far off the correct path: On Mon, Jan 28, 2013 at 11:31 AM, Natasha Sahgal <nsahgal at="" well.ox.ac.uk=""> wrote: > Dear List, > > I am using DEXSeq to test for differential exon usage on a dataset with 3 samples (2 belong to a treatment group and 1 to a control group) and have 2 main questions. > > 1) Unbalanced design where one out of two groups has no replicates > > Name condition > a1 control > b1 treatment > b2 treatment > > Now, I have read in a couple of other threads that there is no point in running DEXSeq when there are no replicates (unlike in case of DESeq - which I have already run for these). However, the researcher still wants me to try it out. > > Thus, I was wondering if there in an unbalanced design, is it still possible to run DEXSeq and would estimateDispersions still work? When I run this function, I don't get errors but when I run fitDispersionFunction is doesn't like it. > ######## >> ecs <- estimateDispersions(ecs, nCores=8) > Estimating Cox-Reid exon dispersion estimates using 8 cores. (Progress > report: one dot per 100 genes) > ...................................................................... > ...................................................................... > ....................> >> >> >> ecs <- fitDispersionFunction(ecs) > Error in fitDispersionFunction(ecs) : > no CR dispersion estimations found, please first call > estimateDispersions function >> >> head(fData(ecs)) > geneID exonID testable dispBeforeSharing > ENSG00000000003:E001 ENSG00000000003 E001 TRUE NA > ENSG00000000003:E002 ENSG00000000003 E002 TRUE NA [cut] The portion of the fData table you pasted look to be all NA for the `dispBeforeSharing` estimate, which (if true) would suggest that something is going wrong during the estimateDisperion step. After you run `ecs <- estimateDispersions(ecs, nCores=8)`, what fraction of your exons has an estimated dispersion? eg: R> sum(!is.na(fData(ecs)$dispBeforeSharing)) / nrow(fData(ecs)) Hopefully it's greater than 0? A quick check seems to suggest that one could do what you need to (2 reps in condA and 1) using DEXSeq as is. Try it by creating a similar unbalanced data set from the pasilla package using only the single- read data: R> library(DEXSeq) R> data("pasillaExons", package="pasilla") p <- pasillaExons[, R> pData(pasillaExons)$type == 'single-read'] R> pData(p) <- droplevels(pData(p)) R> pData(p)[, 1:3] sizeFactor condition type treated1fb | NA treated single-read untreated1fb| NA untreated single-read untreated2fb| NA untreated single-read R> p <- estimateSizeFactors(p) R> p <- estimateDispersions(p) R> p <- fitDispersionFunction(p) R> sum(!is.na(fData(p)$dispBeforeSharing)) / nrow(fData(p)) [1] 0.7248996 > Also, I read in a thread somewhere where Simon suggested to set fData(ecs)$dispersion <- .1, if it is necessary for a 1 vs 1, but does not fully recommend it. Is this what I need to do? > > > 2) How does one work with multiple groups (or conditions)? > In the vignette, there is nothing specified and neither can I see anything in the testforDEU function. Unlike DESeq (or edgeR) where one can specify. In DESeq: for e.g. if there are 3 groups: Control, Treatment & Mutant: > > res1 <- nbinomTest(cds, "Control", "Treatment") and/or > res2 <- nbinomTest(cds, "Control", "Mutant") If you want to explicitly only test for two conditions (and ignore the other ones), I believe that, currently, your only recourse is to subset your ExonCountSet to include only the two conditions you want to test. Make sure you do the "droplevels" trick on the pData like I did above (this will remove factors in your `condition` column that represent the conditions you've removed). After that, you can do the testForDEU. btw, the above was run on R 2.15.2 and DEXSeq 1.4.0 HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan- Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.0 years ago Natasha ▴ 440

0

Entering edit mode

Hello again, I think I found the reason for that strange behaviour. When Steve mentioned the drop.levels case whilst subsetting data, I decided to check my code from the beginning. The design for my ecs object accidentally had an additional level. On correcting that and rerunning everything again it appears to be fine. sum(!is.na(fData(ecs)$dispBeforeSharing)) / nrow(fData(ecs)) [1] 0.4693243 Best, Natasha -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Natasha Sahgal Sent: 29 January 2013 10:34 To: Steve Lianoglou Cc: bioconductor at r-project.org Subject: Re: [BioC] DEXSeq estimating dispersions and multiple conditions Hi Steve, Thanks for the rather quick response. Based on what you said, I did check for what fraction of exons have an estimated dispersion, unfortunately it is zero! > sum(!is.na(fData(ecs)$dispBeforeSharing)) / nrow(fData(ecs)) [1] 0 Also tried the subsetting the pasilla dataset for a similar design as you mentioned, and yes it did appear to work. Many Thanks, Natasha -----Original Message----- From: Steve Lianoglou [mailto:mailinglist.honeypot@gmail.com] Sent: 28 January 2013 21:25 To: Natasha Sahgal Cc: bioconductor at r-project.org Subject: Re: [BioC] DEXSeq estimating dispersions and multiple conditions Hi, I'll try to offer some suggestions until the authors have time to respond. Hopefully I don't lead you too far off the correct path: On Mon, Jan 28, 2013 at 11:31 AM, Natasha Sahgal <nsahgal at="" well.ox.ac.uk=""> wrote: > Dear List, > > I am using DEXSeq to test for differential exon usage on a dataset with 3 samples (2 belong to a treatment group and 1 to a control group) and have 2 main questions. > > 1) Unbalanced design where one out of two groups has no replicates > > Name condition > a1 control > b1 treatment > b2 treatment > > Now, I have read in a couple of other threads that there is no point in running DEXSeq when there are no replicates (unlike in case of DESeq - which I have already run for these). However, the researcher still wants me to try it out. > > Thus, I was wondering if there in an unbalanced design, is it still possible to run DEXSeq and would estimateDispersions still work? When I run this function, I don't get errors but when I run fitDispersionFunction is doesn't like it. > ######## >> ecs <- estimateDispersions(ecs, nCores=8) > Estimating Cox-Reid exon dispersion estimates using 8 cores. (Progress > report: one dot per 100 genes) > ...................................................................... > ...................................................................... > ....................> >> >> >> ecs <- fitDispersionFunction(ecs) > Error in fitDispersionFunction(ecs) : > no CR dispersion estimations found, please first call > estimateDispersions function >> >> head(fData(ecs)) > geneID exonID testable dispBeforeSharing > ENSG00000000003:E001 ENSG00000000003 E001 TRUE NA > ENSG00000000003:E002 ENSG00000000003 E002 TRUE NA [cut] The portion of the fData table you pasted look to be all NA for the `dispBeforeSharing` estimate, which (if true) would suggest that something is going wrong during the estimateDisperion step. After you run `ecs <- estimateDispersions(ecs, nCores=8)`, what fraction of your exons has an estimated dispersion? eg: R> sum(!is.na(fData(ecs)$dispBeforeSharing)) / nrow(fData(ecs)) Hopefully it's greater than 0? A quick check seems to suggest that one could do what you need to (2 reps in condA and 1) using DEXSeq as is. Try it by creating a similar unbalanced data set from the pasilla package using only the single- read data: R> library(DEXSeq) R> data("pasillaExons", package="pasilla") p <- pasillaExons[, R> pData(pasillaExons)$type == 'single-read'] R> pData(p) <- droplevels(pData(p)) R> pData(p)[, 1:3] sizeFactor condition type treated1fb | NA treated single-read untreated1fb| NA untreated single-read untreated2fb| NA untreated single-read R> p <- estimateSizeFactors(p) R> p <- estimateDispersions(p) R> p <- fitDispersionFunction(p) R> sum(!is.na(fData(p)$dispBeforeSharing)) / nrow(fData(p)) [1] 0.7248996 > Also, I read in a thread somewhere where Simon suggested to set fData(ecs)$dispersion <- .1, if it is necessary for a 1 vs 1, but does not fully recommend it. Is this what I need to do? > > > 2) How does one work with multiple groups (or conditions)? > In the vignette, there is nothing specified and neither can I see anything in the testforDEU function. Unlike DESeq (or edgeR) where one can specify. In DESeq: for e.g. if there are 3 groups: Control, Treatment & Mutant: > > res1 <- nbinomTest(cds, "Control", "Treatment") and/or > res2 <- nbinomTest(cds, "Control", "Mutant") If you want to explicitly only test for two conditions (and ignore the other ones), I believe that, currently, your only recourse is to subset your ExonCountSet to include only the two conditions you want to test. Make sure you do the "droplevels" trick on the pData like I did above (this will remove factors in your `condition` column that represent the conditions you've removed). After that, you can do the testForDEU. btw, the above was run on R 2.15.2 and DEXSeq 1.4.0 HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan- Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 13.0 years ago Natasha ▴ 440

Login before adding your answer.