About subsampling of VST in lumi
Pan Du ★ 1.2k
@pan-du-2010
Last seen 10.3 years ago
Hi Ligia,

Thanks for your report. Yes, we use down-sampling to speed up the parameter estimation. If you want to use all the data points, you can set the "nSupport" parameter of the vst function to the length of the vector. I will add this to the vignette or help file. Thanks!

Pan

On 12/14/07 5:18 AM, "ligia at ebi.ac.uk" wrote:

> Dear Pan Du,
>
> From what I understand when looking at "vst", the random subsampling that
> affects my data occurs at step 4 below:
>
> 1 if (c3 != 0) {
> 2     selInd <- selInd & (std^2 > c3)
> 3     dd <- data.frame(y = sqrt(std[selInd]^2 - c3), x1 = u[selInd])
> 4     if (nrow(dd) > 5000) dd <- dd[sample(1:nrow(dd), 5000), ]
> 5     lmm <- lm(y ~ x1, dd)
> 6     c1 <- lmm$coef[2]
> 7     c2 <- lmm$coef[1]
> 8 }
>
> because my "dd" matrix has around 5500 rows. Maybe it would be nice to
> have the option to turn this off, or to add an option that sets the
> maximum value allowed for nrow(dd)...
>
> Cheers,
> Ligia
>
>> Dear Ligia
>>
>> I believe this is because they randomly subsample the data to "speed
>> processing"; see the man page and the nSupport parameter.
>>
>> I cc Pan Du with the suggestion to make the explanation of this in the
>> man page clearer. Is there an option to switch off the random
>> subsampling?
>>
>> Best wishes
>> Wolfgang
>>
>> ligia at ebi.ac.uk wrote:
>>> Hi Wolfgang,
>>>
>>> I noticed a peculiar behaviour in the lumi package: when I apply the
>>> variance stabilizing transformation, it gives slightly different
>>> results each time I run the method. See below for a subset of the data:
>>>
>>> > load("dat.rda")
>>> > library("lumi")
>>>
>>> > x1 <- lumiT(dat, method="vst", ifPlot=!TRUE)
>>> 2007-12-13 10:56:35 , processing array 1
>>> 2007-12-13 10:56:35 , processing array 2
>>> 2007-12-13 10:56:35 , processing array 3
>>> 2007-12-13 10:56:35 , processing array 4
>>>
>>> > x2 <- lumiT(dat, method="vst", ifPlot=!TRUE)
>>> 2007-12-13 10:56:36 , processing array 1
>>> 2007-12-13 10:56:36 , processing array 2
>>> 2007-12-13 10:56:36 , processing array 3
>>> 2007-12-13 10:56:37 , processing array 4
>>>
>>> > table(exprs(x1)==exprs(x2))
>>>
>>> FALSE  TRUE
>>> 88705     3
>>>
>>> > range(exprs(x1)-exprs(x2))
>>> [1] -0.05682931  0.03592777
>>>
>>> > sessionInfo()
>>> R version 2.7.0 Under development (unstable) (2007-11-29 r43558)
>>> i686-pc-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] tools     stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>>  [1] lumi_1.5.10            annotate_1.15.6        AnnotationDbi_1.1.6
>>>  [4] RSQLite_0.6-0          DBI_0.2-3              mgcv_1.3-29
>>>  [7] affy_1.15.7            preprocessCore_0.99.12 affyio_1.5.7
>>> [10] Biobase_1.17.6
>>>
>>> Cheers,
>>> Ligia
>>
>> --
>> Best wishes
>> Wolfgang
>>
>> ------------------------------------------------------------------
>> Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber

---------------------------------------------------
Pan Du, PhD
Research Assistant Professor
Robert H. Lurie Comprehensive Cancer Center
Northwestern University
676 ST Clair St., #1200
Chicago, IL 60611
Office (312)695-4781
dupan at northwestern.edu
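The run-to-run differences reported in this thread can be reproduced in miniature with base R alone. The sketch below is only an illustration: the data frame `dd` is simulated (the real one is built inside vst from bead-level means and standard deviations), and `fit_capped` is a hypothetical wrapper around the quoted subsample-then-fit step, not a lumi function.

```r
# Simulated stand-in for the `dd` data frame quoted in the thread (assumption:
# the real values come from the bead-level mean `u` and standard deviation `std`).
set.seed(1)
n  <- 5500
x1 <- runif(n, min = 100, max = 10000)
dd <- data.frame(x1 = x1, y = 5 + 0.1 * x1 + rnorm(n, sd = 20))

# Hypothetical wrapper around the quoted step: cap the fit at `cap` random rows.
fit_capped <- function(dd, cap = 5000) {
  if (nrow(dd) > cap) dd <- dd[sample(1:nrow(dd), cap), ]
  coef(lm(y ~ x1, dd))
}

# Two plain calls draw different random subsets, so the fitted
# coefficients are not expected to match exactly.
a <- fit_capped(dd)
b <- fit_capped(dd)
print(identical(a, b))

# Resetting the seed before each call repeats the same subset and the same fit.
set.seed(42); s1 <- fit_capped(dd)
set.seed(42); s2 <- fit_capped(dd)
print(identical(s1, s2))
```

By the same logic, calling set.seed() immediately before each lumiT() call should make the transformed values repeatable as long as the subsampling step is in place, although that only hides the randomness rather than removing it.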
Cancer lumi • 927 views
@ligiaebiacuk-1794
Last seen 10.3 years ago
Hi Pan,

Thanks for your email. The problem I reported is not due to the downsampling step controlled via the "nSupport" parameter, but to a subsequent step in "vst": if the number of selected probes with high variance (selInd) is above 5000, then only a random subset of 5000 of these probes is used (the steps I mentioned in my last email) to fit the linear model between the variance and the mean of the probe beads. Couldn't this value (5000) be just another parameter to "vst"?

Thanks for your help,
Ligia
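One way to picture the suggestion of turning the hard-coded 5000 into a parameter is the sketch below. It is an assumption-laden illustration, not lumi code: the function name `fit_variance_model` and the argument `maxFitPoints` are hypothetical, and the input data are simulated.

```r
# Hypothetical helper (not part of lumi): fit y ~ x1 on at most `maxFitPoints`
# rows. Passing maxFitPoints = Inf switches the random subsampling off.
fit_variance_model <- function(dd, maxFitPoints = 5000) {
  stopifnot(is.data.frame(dd), maxFitPoints >= 2)
  if (nrow(dd) > maxFitPoints) {
    dd <- dd[sample(seq_len(nrow(dd)), maxFitPoints), ]
  }
  lmm <- lm(y ~ x1, dd)
  # Same assignment as in the quoted code: c1 is the slope, c2 the intercept.
  c(c1 = unname(coef(lmm)[2]), c2 = unname(coef(lmm)[1]))
}

# Simulated data with about as many rows as the `dd` described in the thread.
set.seed(1)
x1 <- runif(5500, min = 100, max = 10000)
dd <- data.frame(x1 = x1, y = 5 + 0.1 * x1 + rnorm(5500, sd = 20))

# With the cap disabled, every row is used and repeated fits agree.
full1 <- fit_variance_model(dd, maxFitPoints = Inf)
full2 <- fit_variance_model(dd, maxFitPoints = Inf)
print(identical(full1, full2))
```

Keeping 5000 as the default would preserve the existing behaviour for current users, while `Inf` gives a deterministic fit at the cost of a somewhat slower lm() call on large arrays.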
Thanks, Ligia! I will make the change. We will probably just remove the sub-sampling step by default.

Have a nice weekend,
Pan