Question

M values; and dist functions


john herbert ▴ 560


It would be helpful to get some clarification on some supposedly simple array facts.

Part 1)

For 2 colour arrays, M values!

Am I correct in thinking that M values from a two colour array are the same as log2 fold change? Cy5 = case, Cy3 = control and M = log2(case/control), so a log fold change of -1 is 2 fold down-regulated etc.?

For a 1 colour array, M values will arise from array1 = case and array2 = control, so log2(array1/array2) is again the equivalent of log2 fold change.

With both these scenarios, am I right in stating that the raw fluorescent signals are themselves log2 transformed to make plot distributions close to normal?

Part 2)

I use the marray package to extract array data and normalise it, and an impute package to replace missing values for M values. I make myself an expression set object using "new".

I then want to generate a dist object:

> dd = dist(exprs(exampleSet))
Error in vector("double", length) :
  cannot allocate vector of length 582309001

or

> dd = dist(exampleSet)
Error in vector("double", length) :
  cannot allocate vector of length 582309001

It is probable I need to reduce the data set first, as most genes are not differentially expressed (as is the assumption with microarrays). It would be great to understand these types of things more.

Thank you.
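The M-value definition in Part 1 can be illustrated with a few made-up numbers. This is only a sketch: the intensities are hypothetical, and the labelling (Cy5 = case, Cy3 = control) is the convention assumed in the question, not a fixed rule.

```r
# Hypothetical background-corrected intensities for three spots
cy5 <- c(2000, 500, 1000)    # case channel (assumed labelling)
cy3 <- c(1000, 1000, 1000)   # control channel (assumed labelling)

# M value: log2 ratio of case over control
M <- log2(cy5 / cy3)
M        # 1 -1 0

# Back-transform to a plain ratio: M = -1 is a ratio of 0.5, i.e. 2-fold down
ratio <- 2^M
ratio    # 2.0 0.5 1.0

# One-colour equivalent: difference of log2 intensities from two arrays
array1 <- cy5                # case array (hypothetical)
array2 <- cy3                # control array (hypothetical)
M_one_colour <- log2(array1) - log2(array2)
all.equal(M, M_one_colour)   # TRUE
```

With a dye swap the labelling is reversed, so the sign of M flips for that array.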


updated 14.7 years ago by Steve Lianoglou ★ 13k • written 14.7 years ago by john herbert ▴ 560

score 0 · Answer 1 · 2011-04-26

Hi John,

On Tue, Apr 26, 2011 at 6:24 AM, john herbert <arraystruggles at gmail.com> wrote:
> It would be helpful to get some clarification on some, supposedly, simple
> array facts;
>
> Part1)
>
> For 2 colour arrays, Mvalues!
>
> Am I correct in thinking that M values from a two colour array are the same
> as log2 fold change?
> Cy5 = case, Cy3 = control and M = log2( case/control), so a log fold change
> of -1 is 2 fold down-regulated etc?

That is correct, with the (obvious) exception that there are no hard and fast rules for what type of samples are labeled with cy5 or cy3 ... and often times there are dye swaps done to control for bias ... those scenarios are (I think) covered in the limma manual, btw.

> For a 1 colour array, Mvalues will arise from array1 = case and array 2 =
> control
> So log2(array1/array2) is again the equivalent of log2 fold change.

Yup ... just make sure you normalize your arrays together.

> with both these scenarios, I am right in stating that the raw fluorescent
> signals are themselves log2 transformed to make plot distributions close to
> normal?

Yes.

> Part2)
>
> I use the marray package to extract array data, normalise and an impute
> package to replace missing values for M values.
> I make myself an expression set object using "new"
>
> I then want to generate a dist object
>
>> dd = dist(exprs(exampleSet))
> Error in vector("double", length) :
>   cannot allocate vector of length 582309001
>
> or
>
>> dd = dist(exampleSet)
> Error in vector("double", length) :
>   cannot allocate vector of length 582309001
>
> It is probable I need to reduce the data set first as most genes are not
> differentially expressed (as is the assumption with microarrays).
> It would be great to understand these types of things more.

Hmmm ...

Well, R does have a limit on vector length that is a result of it using 32bit integers (for indexing, I guess) -- so, I think the max size of a vector is 2^31 - 1 == .Machine$integer.max == 2147483647.

But that's still bigger than 582309001. This works on my cpu, for instance:

R> x <- integer(582309001)

It takes a ton (~2gb) of memory, but it still works (I don't have enough free memory to make a "numeric" (aka double) vector of that size, though). Maybe you're hitting the memory limits of your machine?

How much RAM do you have? Are you running R in 32-bit or 64-bit mode? What's the result of:

R> sessionInfo()

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
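The length in the error message can be unpacked with a little arithmetic. For a matrix with n rows, dist() stores n(n-1)/2 pairwise distances as doubles, so a requested length of 582309001 implies 34127 rows, and the result alone would need roughly 4.3 GiB. That is more than a 32-bit process can address (at most 4 GiB), whatever the installed RAM. A minimal sketch in base R, with illustrative variable names:

```r
# Length that dist() tried to allocate, taken from the error message
len <- 582309001

# dist() on n rows needs n * (n - 1) / 2 values; solve that for n
n <- (1 + sqrt(1 + 8 * len)) / 2
n                        # 34127 rows (genes) in the input matrix

# Check the round trip
n * (n - 1) / 2 == len   # TRUE

# Each double takes 8 bytes
bytes <- len * 8
gib <- bytes / 2^30
round(gib, 2)            # 4.34

# Upper bound on addressable memory for a 32-bit process, in GiB
limit_32bit_gib <- 2^32 / 2^30
gib > limit_32bit_gib    # TRUE: cannot fit, regardless of installed RAM
```

By the same arithmetic, the integer vector of that length needs 4 bytes per element, about 2.2 GiB, which matches the "~2gb" observed above.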

score 0 · Answer 2 · 2011-04-27

Hi John,

As an aside -- when replying to bioc emails, make sure you hit "reply all" so the email is sent back to the list.

Comments inline:

On Wed, Apr 27, 2011 at 11:17 AM, john herbert <arraystruggles at gmail.com> wrote:
> Hello Steve,
> Thank you very much for answering the questions.
>
> "That is correct, with the (obvious) exception that there are no hard
> and fast rules for what type of samples are labeled with cy5 or cy3
> ... and often times there are dye swaps done to control for bias ...
> those scenarios are (I think) covered in the limma manual, btw."
>
> Yes, I understand about dye swapping and how they are portrayed in a design
> matrix; though I am far from fully understanding notation and design
> matrices.
>
> For heatmap, clustering, do you have to separate out your dye intensities to
> make the columns of data (separate out samples)? Most example data
> generating heatmaps is affy, which is like one colour.

I think you generally want to be plotting fold-change in your heatmaps, so just make sure your numerators/denominators are what you expect them to be.

> "Hmmm ...
>
> Well, R does have a limit on vector length that is a result of it
> using 32bit integers (for indexing, I guess) -- so, I think the max
> size of a vector is 2^31 - 1 == .Machine$integer.max == 2147483647.
>
> But that's still bigger than 582309001. This works on my cpu, for instance:
>
> R> x <- integer(582309001)
>
> It takes a ton (~2gb) of memory, but it still works (I don't have
> enough free memory to make a "numeric" (aka double) vector of that
> size, though). Maybe you're hitting the memory limits of your machine?
>
> How much RAM do you have? Are you running R in 32-bit or 64-bit mode?
> What's the result of:
>
> R> sessionInfo()"
>
> My machine is medium powerful, with CoreI7 and 12Gb of RAM.

Nice!

> My session info is at the bottom (machine is 64 bit) and R is 32 bit
> (as some packages are not 64 friendly).

I'm a bit surprised that you've found some packages to be non 64-bit friendly ... I haven't used 32bit R in years and haven't had any problems that I can really remember (related to 64 bit, that is). I'm also not running on Windows, so maybe there are some 64bit problems with Windows that I'm not aware of (maybe others can comment).

Anyway, I'd try again using 64bit R -- I think it should work for you. Try to run the minimal amount of code necessary so you don't have to load any of the 64-bit problematic packages. As far as I know, you shouldn't have any problems with any Bioconductor packages breaking in R-64bit, and `dist` is in base-R, so ... everything should be OK.

Might be as good a time as any to upgrade to R-2.13 and run a 'minimal' Bioconductor install so you're not tempted to run anything "extraneous".

> I am off a few days now but I could try and regenerate the R and analyses
> that broke dist, it could be a code/concept problem.

Dollars to donuts it's the cpu architecture ... :-)

-steve

> Best wishes,
>
> John.
>
> # Session information.
> R version 2.12.1 (2010-12-16)
> Platform: i386-pc-mingw32/i386 (32-bit)
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
> attached base packages:
> [1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
>  [1] ALL_1.4.7                 gplots_2.8.0              caTools_1.11              bitops_1.0-4.1
>  [5] gdata_2.8.1               gtools_2.6.2              DESeq_1.2.1               locfit_1.5-6
>  [9] akima_0.5-4               arrayQualityMetrics_3.2.4 affyPLM_1.26.0            preprocessCore_1.12.0
> [13] gcrma_2.22.0              affy_1.28.0               UsingR_0.1-13             MASS_7.3-9
> [17] ggplot2_0.8.9             proto_0.3-8               reshape_0.8.3             plyr_1.4
> [21] bioDist_1.22.0            impute_1.24.0             flexclust_1.3-1           modeltools_0.2-17
> [25] scatterplot3d_0.3-33      pvclust_1.2-2             cluster_1.13.2            ks_1.8.1
> [29] misc3d_0.7-1              rgl_0.92.798              mvtnorm_0.9-95            KernSmooth_2.23-4
> [33] Agi4x44PreProcess_1.10.0  genefilter_1.32.0         convert_1.26.0            marray_1.28.0
> [37] geneplotter_1.28.0        lattice_0.19-13           annotate_1.28.0           AnnotationDbi_1.12.0
> [41] CCl4_1.0.9                limma_3.6.9               vsn_3.18.0                Biobase_2.10.0
> loaded via a namespace (and not attached):
>  [1] affyio_1.18.0       beadarray_2.0.6     Biostrings_2.18.2   DBI_0.2-5           hwriter_1.3
>  [6] IRanges_1.8.8       latticeExtra_0.6-14 RColorBrewer_1.0-2  RSQLite_0.9-4       simpleaffy_2.26.1
> [11] splines_2.12.1      survival_2.36-2     SVGAnnotation_0.9-0 tools_2.12.1        XML_3.2-0.2
> [16] xtable_1.5-6

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
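Two ideas from this thread can be sketched in base R: checking whether the running R build is 32-bit or 64-bit, and reducing the matrix before calling dist(), as John proposed. The matrix below is simulated, the gene and array names are hypothetical, and keeping the 200 most variable genes is an arbitrary illustrative cut-off rather than a recommendation made in the thread.

```r
# Pointer size in bits: 32 for a 32-bit R build, 64 for a 64-bit build
bits <- .Machine$sizeof.pointer * 8
cat("This R session is", bits, "bit\n")

# Simulated M values: 2000 hypothetical genes (rows) x 10 arrays (columns)
set.seed(1)
m <- matrix(rnorm(2000 * 10), nrow = 2000,
            dimnames = list(paste("gene", 1:2000, sep = ""),
                            paste("array", 1:10, sep = "")))

# Keep the 200 genes with the largest variance across arrays
gene_var <- apply(m, 1, var)
keep <- order(gene_var, decreasing = TRUE)[1:200]
m_small <- m[keep, ]

# dist() compares rows, so this gives gene-by-gene distances
dd_genes <- dist(m_small)
length(dd_genes)     # 200 * 199 / 2 = 19900 pairwise distances

# For sample-by-sample distances, transpose first
dd_samples <- dist(t(m))
length(dd_samples)   # 10 * 9 / 2 = 45 pairwise distances
```

If the goal is to cluster arrays rather than genes, the transposed call is the relevant one, and its result is tiny no matter how many genes are on the array.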