Surrogate variable analysis fails with “subscript out of bounds”

0

Entering edit mode

Vebjorn Ljosa ▴ 20

@vebjorn-ljosa-5373

Last seen 9.7 years ago

(This is a crossposting of a question I have asked on stackoverflow.com, so far with no responses: http://stackoverflow.com/questions/11252724/surrogate-variable- analysis-fails-with-subscript-out-of-bounds) I am trying to apply surrogate variable analysis using [Bioconductor's sva package][1]. The example in [the vignette][2] works fine, but when I try it with real data, I get a "subscript out of bounds" error in `irwsva.build`: ? ? $ R ? ? R version 2.15.0 (2012-03-30) ? ? ? ? ? > trainData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData. txt') ? ? > trainpheno <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno .txt') ? ? > testData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/testData.t xt') ? ? > trainData <- as.matrix(trainData) ? ? > testData <- as.matrix(testData) ? ? > library(sva) ? ? > trainMod <- model.matrix(~as.factor(label), trainpheno) ? ? > num.sv(trainData, trainMod) ? ? [1] 8 ? ? > trainMod0 <- model.matrix(~1, trainpheno) ? ? > trainSv <- sva(trainData, trainMod, trainMod0) ? ? Number of significant surrogate variables is: ?8 ? ? Iteration (out of 5 ):1 ?2 ?3 ?4 ?5 ?Error in irwsva.build(dat = dat, mod = mod, mod0 = mod0, n.sv = n.sv, ?: ? ? ? subscript out of bounds An attempt to narrow it down with `debug()` revealed that `fast.svd` is being called on a 453 x 100 matrix of all zeros. (The dimensions 453 x 100 are the same as my training set.) This results in a `V` that is 100 x 0; the "subscript out of bounds" error is because `irwsva.build` attempts to index into `V`. There must be something about my data that causes this behaviour?but what? As a possible workaround, I tried calling `sva` with `method="two- step"`: ? ? > trainSv <- sva(trainData, trainMod, trainMod0, method='two- step') ? ? Number of significant surrogate variables is: ?8 That worked, but I need to subsequently call `fsva`. That failed because calling `sva` with `method="two-step"` resulted in `trainSv$pprob.b` being NULL. So how do my data differ from those in the vignette? The training and test data are matrices in both cases. In the vignette the training matrix is 22283 x 30; in my case, it is 453 x 100. In the vignette, the variable of interest (_cancer_) is binary; in my case, the dependent variable can take 12 different values. The last difference seems to be important, for if I reduce the range to [0, 7], it works: ? ? > trainMod <- model.matrix(~as.factor(label), trainpheno %% 8) ? ? > trainSv <- sva(trainData, trainMod, trainMod0) ? ? Number of significant surrogate variables is: ?9 ? ? Iteration (out of 5 ):1 ?2 ?3 ?4 ?5 ?> Thinking that perhaps 100 samples (columns) is just insufficient for 12 classes, I tried a similar dataset with 293 columns. (The data are from the same experiment, but profile the 293 individual samples rather than the 100 treatments.) It did not help: ? ? > trainData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData3 .txt') ? ? > trainpheno <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno .txt') ? ? > trainData <- as.matrix(trainData) ? ? > trainMod <- model.matrix(~as.factor(label), trainpheno) ? ? > trainMod0 <- model.matrix(~1, trainpheno) ? ? > trainSv <- sva(trainData, trainMod, trainMod0) ? ? Number of significant surrogate variables is: ?11 ? ? Iteration (out of 5 ):1 ?2 ?3 ?4 ?5 ?Error in irwsva.build(dat = dat, mod = mod, mod0 = mod0, n.sv = n.sv, ?: ? ? ? subscript out of bounds If I limit sva to one iteration, it is able to run to completion, but I don't know if I can trust the results: ? ? > trainSv <- sva(trainData, trainMod, trainMod0, B=1) ? ? Number of significant surrogate variables is: ?11 ? ? Iteration (out of 1 ):1 ?> Does anyone understand `irwsva` well enough to say why this is happening? Is there anything I can do to make it work on my data? Thanks, Vebjorn ? [1]: http://www.bioconductor.org/packages/release/bioc/html/sva.html ? [2]: http://www.bioconductor.org/packages/release/bioc/vignettes/sva /inst/doc/sva.pdf

sva sva • 1.7k views

ADD COMMENT • link updated 11.8 years ago by Jeff Leek ▴ 650 • written 11.8 years ago by Vebjorn Ljosa ▴ 20

0

Entering edit mode

Jeff Leek ▴ 650

@jeff-leek-5015

Last seen 3.2 years ago

United States

Hi Vebjorn, This problem is likely because of the small number of genes/features you are considering (453) and the high dimension of the response variable (12). With so many different levels of the response variable, many features are likely significantly associated with the response. Part of the iteration in the sva algorithm is to downweight features strongly associated with the response, so the whole data set is being down-weighted to 0. I would suggest running only one iteration of sva. Usually it takes a very small number of iterations to converge, and since your data are relatively low dimensional in the number of features, this may be the best that you can do if you are doing artifact discovery. Best, Jeff On Mon, Jul 2, 2012 at 2:12 PM, Vebjorn Ljosa <ljosa at="" broad.mit.edu=""> wrote: > (This is a crossposting of a question I have asked on > stackoverflow.com, so far with no responses: > http://stackoverflow.com/questions/11252724/surrogate-variable- analysis-fails-with-subscript-out-of-bounds) > > I am trying to apply surrogate variable analysis using [Bioconductor's > sva package][1]. The example in [the vignette][2] works fine, but when > I try it with real data, I get a "subscript out of bounds" error in > `irwsva.build`: > > $ R > > R version 2.15.0 (2012-03-30) > ? > > trainData <- > read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainDat a.txt') > > trainpheno <- > read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainphe no.txt') > > testData <- > read.table('http://www.broadinstitute.org/~ljosa/svaproblem/testData .txt') > > trainData <- as.matrix(trainData) > > testData <- as.matrix(testData) > > library(sva) > > trainMod <- model.matrix(~as.factor(label), trainpheno) > > num.sv(trainData, trainMod) > [1] 8 > > trainMod0 <- model.matrix(~1, trainpheno) > > trainSv <- sva(trainData, trainMod, trainMod0) > Number of significant surrogate variables is: 8 > Iteration (out of 5 ):1 2 3 4 5 Error in irwsva.build(dat = > dat, mod = mod, mod0 = mod0, n.sv = n.sv, : > subscript out of bounds > > An attempt to narrow it down with `debug()` revealed that `fast.svd` > is being called on a 453 x 100 matrix of all zeros. (The dimensions > 453 x 100 are the same as my training set.) This results in a `V` that > is 100 x 0; the "subscript out of bounds" error is because > `irwsva.build` attempts to index into `V`. There must be something > about my data that causes this behaviour?but what? > > As a possible workaround, I tried calling `sva` with `method="two- step"`: > > > trainSv <- sva(trainData, trainMod, trainMod0, method='two- step') > Number of significant surrogate variables is: 8 > > That worked, but I need to subsequently call `fsva`. That failed > because calling `sva` with `method="two-step"` resulted in > `trainSv$pprob.b` being NULL. > > So how do my data differ from those in the vignette? The training and > test data are matrices in both cases. In the vignette the training > matrix is 22283 x 30; in my case, it is 453 x 100. In the vignette, > the variable of interest (_cancer_) is binary; in my case, the > dependent variable can take 12 different values. > > The last difference seems to be important, for if I reduce the range > to [0, 7], it works: > > > trainMod <- model.matrix(~as.factor(label), trainpheno %% 8) > > trainSv <- sva(trainData, trainMod, trainMod0) > Number of significant surrogate variables is: 9 > Iteration (out of 5 ):1 2 3 4 5 > > > Thinking that perhaps 100 samples (columns) is just insufficient for > 12 classes, I tried a similar dataset with 293 columns. (The data are > from the same experiment, but profile the 293 individual samples > rather than the 100 treatments.) It did not help: > > > trainData <- > read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainDat a3.txt') > > trainpheno <- > read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainphe no.txt') > > trainData <- as.matrix(trainData) > > trainMod <- model.matrix(~as.factor(label), trainpheno) > > trainMod0 <- model.matrix(~1, trainpheno) > > trainSv <- sva(trainData, trainMod, trainMod0) > Number of significant surrogate variables is: 11 > Iteration (out of 5 ):1 2 3 4 5 Error in irwsva.build(dat = > dat, mod = mod, mod0 = mod0, n.sv = n.sv, : > subscript out of bounds > > If I limit sva to one iteration, it is able to run to completion, but > I don't know if I can trust the results: > > > trainSv <- sva(trainData, trainMod, trainMod0, B=1) > Number of significant surrogate variables is: 11 > Iteration (out of 1 ):1 > > > Does anyone understand `irwsva` well enough to say why this is > happening? Is there anything I can do to make it work on my data? > > Thanks, > Vebjorn > > [1]: http://www.bioconductor.org/packages/release/bioc/html/sva.html > [2]: http://www.bioconductor.org/packages/release/bioc/vignettes/s va/inst/doc/sva.pdf > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.8 years ago Jeff Leek ▴ 650

0

Entering edit mode

On Mon, Jul 2, 2012 at 2:22 PM, Jeff Leek <jtleek at="" gmail.com=""> wrote: > > This problem is likely because of the small number of genes/features > you are considering (453) and the high dimension of the response > variable (12). With so many different levels of the response variable, > many features are likely significantly associated with the response. > Part of the iteration in the sva algorithm is to downweight features > strongly associated with the response, so the whole data set is being > down-weighted to 0. > > I would suggest running only one iteration of sva. Usually it takes a > very small number of iterations to converge, and since your data are > relatively low dimensional in the number of features, this may be the > best that you can do if you are doing artifact discovery. Thanks for responding. As I tried to use only one iteration, I soon came across a dataset where even that was too much: http://www.broadinstitute.org/~ljosa/svaproblem/trainData4.txt http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno4.txt Would it be reasonable to detect the case when `dats` is all zeros after the first iteration and to return an empty set of surrogate variables in that case? Vebjorn

ADD REPLY • link 11.8 years ago Vebjorn Ljosa ▴ 20

Login before adding your answer.