Question

R help with large matrices

Ian Roberts ▴ 70

@ian-roberts-2237

Last seen 11.3 years ago

Dear All,

I'm having memory problems producing large matrices and wonder if there is a better way to do what I need? I'm a novice R user, so here's what I've done

I need to permute a scoring matrix up to 19,300 events. That is, I have a vector of length 19,300 results and need to compare each with each other for all possible permutations therein of the type N!/(N-n)!

I've tried writing my own, and the permutations function of package gtools, however both run into trouble with vectors in excess of 1000 events.

Essentially, the score matrix provides start and stop clone numbers for an ordered gene list. It works well for BAC arrays, but is failing for Agilent 244K oligo arrays!!!

Thanks for any advice.

Ian

    # FUNCTION: getSegMatrix
    # Get matrix of clone comparisons. Version 1. Feb 2008. Ian Roberts. ir210 at cam.ac.uk
    #
    # Generate a scoring matrix of clone start and stop numbers that compares
    # each clone with every other clone for that chromosome
    # E.g for a chr containing 5 clones, the matrix would be:
    #  x 1 2 3 4 5
    # +1 2 3 4 5 x
    # +2 3 4 5 x x
    # +3 4 5 x x x
    # +4 5 x x x x
    # +5 x x x x x
    # Arguments are clone start number, clone stop number and nosClones.

    getSegMatrix<-function(regStart,regStop,nosClones) {
      inVec<-c(regStart:regStop)
      nosRows<-length(inVec)
      rowNameVec<-paste("+",1:nosRows,sep="")
      scoreMatrix<-matrix(NA,nrow=nosClones,ncol=nosClones,
                          dimnames=list(c(rowNameVec),c(regStart:regStop)))
      for (i in 1:(nosClones-1)){
        cloneStart<-inVec
        tempEnd<-cloneStart+i
        cloneEnd<-tempEnd[tempEnd <= regStop]
        scoreMatrix[i,1:length(cloneEnd)]<-cloneEnd
      }
      scoreMatrix[1,1:nosClones]<-NA
      return(scoreMatrix)
    }

oligo BAC • 1.5k views

updated 17.7 years ago by Sean Davis 21k • written 17.7 years ago by Ian Roberts ▴ 70

score 0 · Answer 1 · 2008-04-09

Sean Davis 21k

@sean-davis-490

Last seen 10 months ago

United States

On Wed, Apr 9, 2008 at 6:31 AM, Ian Roberts <ir210 at cam.ac.uk> wrote:

> Dear All,
>
> I'm having memory problems producing large matrices and wonder if there is a
> better way to do what I need? I'm a novice R user, so here's what I've done
>
> I need to permute a scoring matrix up to 19,300 events.
> That is, I have a vector of length 19,300 results and need to compare each with
> each other for all possible permutations therein of the type N!/(N-n)!

Unless you have a pretty big machine, you will probably not be able to fit a 19,300 x 19,300 member matrix into memory. Why not use a random sampling of a set size (say, 1000 events)? sample() is the function that chooses random samples.

That said, you may want to describe what you are trying to do, rather than asking how to do it. There may be a Bioconductor package that already answers the question you are trying to answer. In particular, there are numerous CGH array packages.

Sean

> I've tried writing my own, and the permutations function of package gtools,
> however both run into trouble with vectors in excess of 1000 events.
>
> Essentially, the score matrix provides start and stop clone numbers for an
> ordered gene list. It works well for BAC arrays, but is failing for Agilent
> 244K oligo arrays!!!
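For readers following along, here is a minimal sketch of the sampling idea using base R's `sample()`. The `events` vector is a hypothetical stand-in for the 19,300 per-clone results, and the fixed seed is an assumption added only to make the draw reproducible. Whether a random subset suits an analysis of contiguous genomic regions is a separate question that this sketch does not settle.

```r
# Hypothetical stand-in for the 19,300 per-clone results (not real data).
n_events <- 19300
events   <- seq_len(n_events)      # clone index numbers 1..19300

set.seed(1)                        # assumption: fixed seed for reproducibility

# Draw 1000 distinct events without replacement, then sort so the
# subset keeps the original clone order along the chromosome.
subset_idx <- sort(sample(events, size = 1000, replace = FALSE))

length(subset_idx)                 # 1000
```

A 1000 x 1000 matrix built on `subset_idx` has one million cells, which is small compared with the roughly 372 million cells of the full 19,300 x 19,300 case.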

17.7 years ago Sean Davis 21k

> That said, you may want to describe what you are trying to do ...
> ... In particular, there are numerous CGH array packages.

We use snapCGH, the output of which is segmented data with a call status. To automate the determination of common regions of CNI and also minimum regions of CNI across the whole sample set, additional functions are required.

The approach we've taken is to systematically look at the call state of each clone across the sample set, and iterations of clones from a starting point to the end of the chromosome for each sample.

Score matrix here actually provides the index number of the start and stop clones for a particular genomic region ... that need to be scored. In essence:

         1  2  3  4  5
    +1   2  3  4  5  6
    +2   3  4  5  6  7
    +3   4  5  6  7  8
    +4   5  6  7  8  9

The matrix column headers are clone index numbers along the length of a particular chromosome, and the rows give increments for the length of genomic region being assayed. Hence, the matrix is looped through (col x row) and coordinates used to retrieve the call states of the samples contained therein.

That's how I'm currently generating my 'all permutations' index coordinates for the calls comparison and I think this is the bit that is bringing me down: it's too memory intensive...

A second large matrix function records the outcome of the comparison of call states between the index points of score matrix. In fact, there are two results matrix lists (one for gain, and one for loss): resultGain[[1]] stores the binary outcome, while resultGainP[[1]] records the percentage of samples that brought about the outcome.

> Unless you have a pretty big machine, you will probably not be able to
> fit a 19,300 x 19,300 member matrix into memory.

The work is being undertaken on CamGrid.
http://www.escience.cam.ac.uk/projects/camgrid/

Thanks for any suggestions!
Ian
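Since each cell of the score matrix described above is just the column's clone index plus the row's increment, one possible way around the memory cost is to compute the stop index on demand instead of storing it. The sketch below is only an illustration of that idea, not the method used in the thread: `segEnd` is a hypothetical helper (it is not part of snapCGH or gtools), and the 5-clone chromosome is a toy example.

```r
# Hypothetical helper: stop-clone index for a region that begins at clone
# `start` and extends `offset` clones further, or NA past `regStop`.
segEnd <- function(start, offset, regStop) {
  end <- start + offset
  ifelse(end <= regStop, end, NA_integer_)
}

# Visit every (start, stop) pair of a toy 5-clone chromosome without
# allocating the 5 x 5 index matrix.
regStart <- 1L
regStop  <- 5L
nPairs   <- 0L
for (offset in seq_len(regStop - regStart)) {
  starts <- regStart:(regStop - offset)      # starts whose stop is in range
  stops  <- segEnd(starts, offset, regStop)  # matching stop indices
  nPairs <- nPairs + length(starts)
  # ... compare call states between starts[k] and stops[k] here ...
}
nPairs   # regions visited: choose(5, 2) = 10
```

This removes only the index matrix. The result matrices (resultGain, resultGainP) still hold one value per region, so for n clones they would need about n(n-1)/2 entries each unless the results are summarised as they are produced.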

17.7 years ago Ian Roberts ▴ 70

Ian,

>> Unless you have a pretty big machine, you will probably not be able to
>> fit a 19,300 x 19,300 member matrix into memory.
>
> The work is being undertaken on CamGrid.
> http://www.escience.cam.ac.uk/projects/camgrid/

The problem is if you can't split the task, which you probably can't do, the grid is irrelevant. You will need gigabytes of address space in a single machine just to fit the matrix: 19300^2 = 372.5M; but your values aren't bytes, they're either float (4 bytes, therefore 1.4G), or, more likely, double (8 bytes; 2.8G).

Chances are you will run out of addressable memory if you're trying to execute this on a 32 bit platform (or on a 64 bit platform (such as amd64) while using a 32-bit (x86) binary of R).

--
Atro Tossavainen (Mr.)            / The Institute of Biotechnology at
Systems Analyst, Techno-Amish &   / the University of Helsinki, Finland,
+358-9-19158939  UNIX Dinosaur    / employs me, but my opinions are my own.
<URL:http://www.helsinki.fi/%7Eatossava/> NO FILE ATTACHMENTS
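The arithmetic above can be checked directly in R. This is a small sketch, with `gib` a hypothetical helper that converts bytes to GiB. In R the two relevant storage sizes are 4 bytes per cell for an integer matrix and 8 bytes per cell for a double (numeric) matrix.

```r
n     <- 19300
cells <- n^2                       # 372,490,000 cells

# Hypothetical helper: convert a byte count to GiB (2^30 bytes).
gib <- function(bytes) bytes / 2^30

round(gib(cells * 4), 2)           # integer matrix: 1.39
round(gib(cells * 8), 2)           # double matrix:  2.78

# Confirm the per-cell cost empirically on a small double matrix.
small    <- matrix(0, nrow = 1000, ncol = 1000)
per_cell <- as.numeric(object.size(small)) / length(small)
per_cell                           # close to 8 bytes per cell
```

These figures cover a single matrix only. Any copy R makes while filling or subsetting it would need a similar amount again.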

17.7 years ago Atro Tossavainen ▴ 160