Question: normalisation + sam
0
9.5 years ago by
Mike Dewar60
Mike Dewar60 wrote:
Hi, I'm trying to use siggenes - sam() to look at differentially expressed genes of data taken from the Immunological Genome project (http://www.immgen.org/). A problem with this is that I have to perform the preprocessing of the original CEL files as they do not make the processed data available on GEO. To do this I'm using the aroma suite of packages to perform quantile normalisation of this data set (so far 128 arrays) in fixed memory (i.e. my laptop). This is a good thing, as it has forced me to learn a little about array preprocessing, and a bad thing as I'm new to all this and might be going horribly wrong. When it comes time to look for differentially expressed genes, I'm using siggenes - sam() and I'm getting some strange results. I'm using (what I think would be considered) many classes (28), where each class has at least 3 examples, and thus throwing out some of the arrays. The results I'm getting look like: SAM Analysis for the Multi-Class Case with 28 Classes Delta p0 False Called FDR 1 0.1 0.014 23355.05 23532 0.013997 2 0.2 0.014 21805.98 23498 0.013087 3 0.3 0.014 15923.83 23169 0.009693 4 0.4 0.014 9637.47 22343 0.006083 5 0.5 0.014 5527.75 21330 0.003655 6 0.6 0.014 3060.73 20221 0.002135 7 0.7 0.014 1703.82 19205 0.001251 8 0.8 0.014 953.02 18256 0.000736 9 0.9 0.014 536.81 17382 0.000436 10 1.0 0.014 307.19 16730 0.000259 11 1.1 0.014 176 16143 0.000154 which I think implies that many many genes are differentially expressed. Using plot(sam.out, 1.2) shows a pretty much vertical line starting at the origin, with no genes observably behaving like the null model. Even if I only try this on 2 classes, and hence throwing out most of the data, I'm still not getting sensible results. Now I'm hoping that I'm doing something wrong, and that not 16K of my genes are differentially expressed. However, I'm having difficulty figuring out what it might be. The one striking thing between my data set and the golub example set is that golub seems to be zero-mean - is this a requirement for sam()? Any other ideas of what to look for, or what other information I could provide to help this question make sense, would be greatly appreciated. Thanks in advance, Mike Dewar - - - Dr Michael Dewar Postdoctoral Research Scientist Applied Mathematics Columbia University http://www.columbia.edu/~md2954/ [[alternative HTML version deleted]]
preprocessing siggenes • 785 views
modified 9.5 years ago by Wolfgang Huber13k • written 9.5 years ago by Mike Dewar60
0
9.5 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:
Hi Mike can you provide a reproducible example (R script) and the output of sessionInfo() for the 2-groups comparison? This would include a pointer to the 6(?) CEL files on the web. For the 28-classes case, I doubt that hypothesis testing of the null hypothesis "mean in all classes is the same" is the most useful data analytic approach. Perhaps you can precisize your question. Best wishes Wolfgang > Hi, > > I'm trying to use siggenes - sam() to look at differentially expressed genes of data taken from the Immunological Genome project (http://www.immgen.org/). A problem with this is that I have to perform the preprocessing of the original CEL files as they do not make the processed data available on GEO. > > To do this I'm using the aroma suite of packages to perform quantile normalisation of this data set (so far 128 arrays) in fixed memory (i.e. my laptop). This is a good thing, as it has forced me to learn a little about array preprocessing, and a bad thing as I'm new to all this and might be going horribly wrong. > > When it comes time to look for differentially expressed genes, I'm using siggenes - sam() and I'm getting some strange results. I'm using (what I think would be considered) many classes (28), where each class has at least 3 examples, and thus throwing out some of the arrays. The results I'm getting look like: > > SAM Analysis for the Multi-Class Case with 28 Classes > > Delta p0 False Called FDR > 1 0.1 0.014 23355.05 23532 0.013997 > 2 0.2 0.014 21805.98 23498 0.013087 > 3 0.3 0.014 15923.83 23169 0.009693 > 4 0.4 0.014 9637.47 22343 0.006083 > 5 0.5 0.014 5527.75 21330 0.003655 > 6 0.6 0.014 3060.73 20221 0.002135 > 7 0.7 0.014 1703.82 19205 0.001251 > 8 0.8 0.014 953.02 18256 0.000736 > 9 0.9 0.014 536.81 17382 0.000436 > 10 1.0 0.014 307.19 16730 0.000259 > 11 1.1 0.014 176 16143 0.000154 > > which I think implies that many many genes are differentially expressed. Using plot(sam.out, 1.2) shows a pretty much vertical line starting at the origin, with no genes observably behaving like the null model. Even if I only try this on 2 classes, and hence throwing out most of the data, I'm still not getting sensible results. > > Now I'm hoping that I'm doing something wrong, and that not 16K of my genes are differentially expressed. However, I'm having difficulty figuring out what it might be. The one striking thing between my data set and the golub example set is that golub seems to be zero-mean - is this a requirement for sam()? > > Any other ideas of what to look for, or what other information I could provide to help this question make sense, would be greatly appreciated. > > Thanks in advance, > > Mike Dewar > > - - - > Dr Michael Dewar > Postdoctoral Research Scientist > Applied Mathematics > Columbia University > http://www.columbia.edu/~md2954/ > > > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
Hi Wolfgang, To get from the CEL files to a Bioconductor ExpressionSet is quite longwinded and, as it's not using Bioconductor, maybe not something I should abuse this list with. Instead, I've put the expression set I've generated up online at http://www.columbia.edu/~md2954/immgen.data Then, to generate the results I'm seeing, I use the code below. I'm pretty new to R, too, so any general hints also welcome. In the code below I'm interested in seeing which genes that have symbols like "CD4" are differentially expressed. Note that if I just use all the genes (and cut out the code that selects "^Cd\\d+" genes) then I still get results implying that my genes are not distributed in the way sam is expecting. To precisize my question then, does the result of SAM, as expressed in the below code, imply that my array data is badly normalised? If not, does it imply some horrific misunderstanding on my part on how these things should work? Mike library(siggenes) library(Biobase) # PUT THE FOLDER WHERE YOU STORED immgen.data, i.e. #path = "/Users/mike/Data/Immgen/userData/GSE15907" path = "." # USE THE SEPARATOR APPROPRIATE FOR YOUR SYSTEM (must learn more about this) sep = "/" # load the data filename = "immgen.data" load(paste(path,filename,sep=sep)) # loads a ExpressionSet called immgen # form a vector which corresponds to the class of each array p = pData(immgen) cl = colnames(p)[apply(p, 1, which)] # remove "phenotype" genes # any gene that directly encodes a surface marker is highlighted regexp = "^Cd\\d+" marker_indices = grep(regexp,fData(immgen)$symbol) markers = immgen[marker_indices,] # class labels: classes <- colnames(pData(markers)) # find out how many of each class we have count <- lapply(pData(markers),sum) # chuck out all the experiments with less than 4 examples. This will leave just two classes. to_keep = as.logical(sapply( as.data.frame(t( # the next line chooses those phenotypes with more than n examples # set n to 2 to leave 28 classes n = 3 pData(markers)[,classes[count>n]] )), sum # the sum is for each row, hence the transposition above )) # now to_keep has a 1 next to each array we want and a 0 to those we don't # hence we can use to_keep to index the markers expression set markers <- markers[,to_keep] # and then chuck out classes we've rejected cl <- cl[to_keep] # then apply sam to the markers sam.out = sam(data=markers,cl=cl,var.equal=FALSE) print(sam.out, seq(0.1, 5, 0.1)) plot(sam.out, 0.5) On 22 Apr 2010, at 15:41, Wolfgang Huber wrote: > Hi Mike > > can you provide a reproducible example (R script) and the output of sessionInfo() for the 2-groups comparison? This would include a pointer to the 6(?) CEL files on the web. > > For the 28-classes case, I doubt that hypothesis testing of the null hypothesis "mean in all classes is the same" is the most useful data analytic approach. Perhaps you can precisize your question. > > Best wishes > Wolfgang > >> Hi, >> I'm trying to use siggenes - sam() to look at differentially expressed genes of data taken from the Immunological Genome project (http://www.immgen.org/). A problem with this is that I have to perform the preprocessing of the original CEL files as they do not make the processed data available on GEO. To do this I'm using the aroma suite of packages to perform quantile normalisation of this data set (so far 128 arrays) in fixed memory (i.e. my laptop). This is a good thing, as it has forced me to learn a little about array preprocessing, and a bad thing as I'm new to all this and might be going horribly wrong. When it comes time to look for differentially expressed genes, I'm using siggenes - sam() and I'm getting some strange results. I'm using (what I think would be considered) many classes (28), where each class has at least 3 examples, and thus throwing out some of the arrays. The results I'm getting look like: >> SAM Analysis for the Multi-Class Case with 28 Classes >> Delta p0 False Called FDR >> 1 0.1 0.014 23355.05 23532 0.013997 >> 2 0.2 0.014 21805.98 23498 0.013087 >> 3 0.3 0.014 15923.83 23169 0.009693 >> 4 0.4 0.014 9637.47 22343 0.006083 >> 5 0.5 0.014 5527.75 21330 0.003655 >> 6 0.6 0.014 3060.73 20221 0.002135 >> 7 0.7 0.014 1703.82 19205 0.001251 >> 8 0.8 0.014 953.02 18256 0.000736 >> 9 0.9 0.014 536.81 17382 0.000436 >> 10 1.0 0.014 307.19 16730 0.000259 >> 11 1.1 0.014 176 16143 0.000154 >> which I think implies that many many genes are differentially expressed. Using plot(sam.out, 1.2) shows a pretty much vertical line starting at the origin, with no genes observably behaving like the null model. Even if I only try this on 2 classes, and hence throwing out most of the data, I'm still not getting sensible results. >> Now I'm hoping that I'm doing something wrong, and that not 16K of my genes are differentially expressed. However, I'm having difficulty figuring out what it might be. The one striking thing between my data set and the golub example set is that golub seems to be zero-mean - is this a requirement for sam()? >> Any other ideas of what to look for, or what other information I could provide to help this question make sense, would be greatly appreciated. >> Thanks in advance, >> Mike Dewar >> - - - >> Dr Michael Dewar >> Postdoctoral Research Scientist Applied Mathematics >> Columbia University >> http://www.columbia.edu/~md2954/ >> [[alternative HTML version deleted]] >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > > > Wolfgang Huber > EMBL > http://www.embl.de/research/units/genome_biology/huber > > > - - - Dr Michael Dewar Postdoctoral Research Scientist Applied Mathematics Columbia University http://www.columbia.edu/~md2954/ [[alternative HTML version deleted]] ADD REPLYlink written 9.5 years ago by Mike Dewar60 Answer: normalisation + sam 0 9.5 years ago by EMBL European Molecular Biology Laboratory Wolfgang Huber13k wrote: Hi Mike I am afraid it does look like a normalisation problem. siggenes is right, given the data. Have a look at the heatmap in the PDF file at http://www-huber.embl.de/users/whuber/bioc-list/100423/ and the R script in the same directory, which derives from your code below. Best wishes Wolfgang Mike Dewar scripsit 23/04/10 17:11: > Hi Wolfgang, > > To get from the CEL files to a Bioconductor ExpressionSet is quite > longwinded and, as it's not using Bioconductor, maybe not something I > should abuse this list with. Instead, I've put the expression set I've > generated up online at > > http://www.columbia.edu/~md2954/immgen.data > > Then, to generate the results I'm seeing, I use the code below. I'm > pretty new to R, too, so any general hints also welcome. In the code > below I'm interested in seeing which genes that have symbols like "CD4" > are differentially expressed. Note that if I just use all the genes (and > cut out the code that selects "^Cd\\d+" genes) then I still get results > implying that my genes are not distributed in the way sam is expecting. > > To precisize my question then, does the result of SAM, as expressed in > the below code, imply that my array data is badly normalised? If not, > does it imply some horrific misunderstanding on my part on how these > things should work? > > Mike > > library(siggenes) > library(Biobase) > # PUT THE FOLDER WHERE YOU STORED immgen.data, i.e. > #path = "/Users/mike/Data/Immgen/userData/GSE15907" > path = "." > # USE THE SEPARATOR APPROPRIATE FOR YOUR SYSTEM (must learn more about this) > sep = "/" > # load the data > filename = "immgen.data" > load(paste(path,filename,sep=sep)) # loads a ExpressionSet called immgen > # form a vector which corresponds to the class of each array > p = pData(immgen) > cl = colnames(p)[apply(p, 1, which)] > # remove "phenotype" genes > # any gene that directly encodes a surface marker is highlighted > regexp = "^Cd\\d+" > marker_indices = grep(regexp,fData(immgen)$symbol) > markers = immgen[marker_indices,] > # class labels: > classes <- colnames(pData(markers)) > # find out how many of each class we have > count <- lapply(pData(markers),sum) > # chuck out all the experiments with less than 4 examples. This will > leave just two classes. > to_keep = as.logical(sapply( > as.data.frame(t( > # the next line chooses those phenotypes with more than n examples > # set n to 2 to leave 28 classes > n = 3 > pData(markers)[,classes[count>n]] > )), > sum # the sum is for each row, hence the transposition above > )) > # now to_keep has a 1 next to each array we want and a 0 to those we don't > # hence we can use to_keep to index the markers expression set > markers <- markers[,to_keep] > # and then chuck out classes we've rejected > cl <- cl[to_keep] > # then apply sam to the markers > sam.out = sam(data=markers,cl=cl,var.equal=FALSE) > print(sam.out, seq(0.1, 5, 0.1)) > plot(sam.out, 0.5) > > > > On 22 Apr 2010, at 15:41, Wolfgang Huber wrote: > >> Hi Mike >> >> can you provide a reproducible example (R script) and the output of >> sessionInfo() for the 2-groups comparison? This would include a >> pointer to the 6(?) CEL files on the web. >> >> For the 28-classes case, I doubt that hypothesis testing of the null >> hypothesis "mean in all classes is the same" is the most useful data >> analytic approach. Perhaps you can precisize your question. >> >> Best wishes >> Wolfgang >> >>> Hi, >>> I'm trying to use siggenes - sam() to look at differentially >>> expressed genes of data taken from the Immunological Genome project >>> (http://www.immgen.org/). A problem with this is that I have to >>> perform the preprocessing of the original CEL files as they do not >>> make the processed data available on GEO. To do this I'm using the >>> aroma suite of packages to perform quantile normalisation of this >>> data set (so far 128 arrays) in fixed memory (i.e. my laptop). This >>> is a good thing, as it has forced me to learn a little about array >>> preprocessing, and a bad thing as I'm new to all this and might be >>> going horribly wrong. When it comes time to look for differentially >>> expressed genes, I'm using siggenes - sam() and I'm getting some >>> strange results. I'm using (what I think would be considered) many >>> classes (28), where each class has at least 3 examples, and thus >>> throwing out some of the arrays. The results I'm getting look like: >>> SAM Analysis for the Multi-Class Case with 28 Classes >>> Delta p0 False Called FDR >>> 1 0.1 0.014 23355.05 23532 0.013997 >>> 2 0.2 0.014 21805.98 23498 0.013087 >>> 3 0.3 0.014 15923.83 23169 0.009693 >>> 4 0.4 0.014 9637.47 22343 0.006083 >>> 5 0.5 0.014 5527.75 21330 0.003655 >>> 6 0.6 0.014 3060.73 20221 0.002135 >>> 7 0.7 0.014 1703.82 19205 0.001251 >>> 8 0.8 0.014 953.02 18256 0.000736 >>> 9 0.9 0.014 536.81 17382 0.000436 >>> 10 1.0 0.014 307.19 16730 0.000259 >>> 11 1.1 0.014 176 16143 0.000154 >>> which I think implies that many many genes are differentially >>> expressed. Using plot(sam.out, 1.2) shows a pretty much vertical line >>> starting at the origin, with no genes observably behaving like the >>> null model. Even if I only try this on 2 classes, and hence throwing >>> out most of the data, I'm still not getting sensible results. >>> Now I'm hoping that I'm doing something wrong, and that not 16K of my >>> genes are differentially expressed. However, I'm having difficulty >>> figuring out what it might be. The one striking thing between my data >>> set and the golub example set is that golub seems to be zero-mean - >>> is this a requirement for sam()? >>> Any other ideas of what to look for, or what other information I >>> could provide to help this question make sense, would be greatly >>> appreciated. >>> Thanks in advance, >>> Mike Dewar >>> - - - >>> Dr Michael Dewar >>> Postdoctoral Research Scientist Applied Mathematics >>> Columbia University >>> http://www.columbia.edu/~md2954/ >>> [[alternative HTML version deleted]] >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch <mailto:bioconductor at="" stat.math.ethz.ch=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> -- >> >> >> Wolfgang Huber >> EMBL >> http://www.embl.de/research/units/genome_biology/huber >> >> >> > > - - - > Dr Michael Dewar > **Postdoctoral Research Scientist > Applied Mathematics > Columbia University > http://www.columbia.edu/~md2954/ > > > > > > -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
Hi Mike I added another heatmap, not using your 97 "marker" genes, but 200 random ones instead. The clustering result for the samples is the same: http://www-huber.embl.de/users/whuber/bioc-list/100423 Also, your fData(immgen) is ill-formatted. It should be a data.frame with as many rows as features, instead it has as many columns, with one row: > dim(immgen) Features Samples 23532 128 > dim(fData(immgen)) [1] 23532 1 Best wishes Wolfgang Wolfgang Huber ha scritto: > > > Hi Mike > > I am afraid it does look like a normalisation problem. siggenes is > right, given the data. Have a look at the heatmap in the PDF file at > http://www-huber.embl.de/users/whuber/bioc-list/100423/ > and the R script in the same directory, which derives from your code below. > > Best wishes > Wolfgang > > > Mike Dewar scripsit 23/04/10 17:11: >> Hi Wolfgang, >> >> To get from the CEL files to a Bioconductor ExpressionSet is quite >> longwinded and, as it's not using Bioconductor, maybe not something I >> should abuse this list with. Instead, I've put the expression set I've >> generated up online at >> http://www.columbia.edu/~md2954/immgen.data >> >> Then, to generate the results I'm seeing, I use the code below. I'm >> pretty new to R, too, so any general hints also welcome. In the code >> below I'm interested in seeing which genes that have symbols like >> "CD4" are differentially expressed. Note that if I just use all the >> genes (and cut out the code that selects "^Cd\\d+" genes) then I still >> get results implying that my genes are not distributed in the way sam >> is expecting. >> To precisize my question then, does the result of SAM, as expressed in >> the below code, imply that my array data is badly normalised? If not, >> does it imply some horrific misunderstanding on my part on how these >> things should work? >> >> Mike >> >> library(siggenes) >> library(Biobase) >> # PUT THE FOLDER WHERE YOU STORED immgen.data, i.e. >> #path = "/Users/mike/Data/Immgen/userData/GSE15907" >> path = "." >> # USE THE SEPARATOR APPROPRIATE FOR YOUR SYSTEM (must learn more about >> this) >> sep = "/" >> # load the data >> filename = "immgen.data" >> load(paste(path,filename,sep=sep)) # loads a ExpressionSet called immgen >> # form a vector which corresponds to the class of each array >> p = pData(immgen) >> cl = colnames(p)[apply(p, 1, which)] >> # remove "phenotype" genes >> # any gene that directly encodes a surface marker is highlighted >> regexp = "^Cd\\d+" >> marker_indices = grep(regexp,fData(immgen)$symbol) >> markers = immgen[marker_indices,] >> # class labels: >> classes <- colnames(pData(markers)) >> # find out how many of each class we have >> count <- lapply(pData(markers),sum) >> # chuck out all the experiments with less than 4 examples. This will >> leave just two classes. to_keep = as.logical(sapply( >> as.data.frame(t( >> # the next line chooses those phenotypes with more than n examples >> # set n to 2 to leave 28 classes n = 3 >> pData(markers)[,classes[count>n]] )), >> sum # the sum is for each row, hence the transposition above >> )) >> # now to_keep has a 1 next to each array we want and a 0 to those we >> don't >> # hence we can use to_keep to index the markers expression set >> markers <- markers[,to_keep] >> # and then chuck out classes we've rejected >> cl <- cl[to_keep] >> # then apply sam to the markers >> sam.out = sam(data=markers,cl=cl,var.equal=FALSE) >> print(sam.out, seq(0.1, 5, 0.1)) >> plot(sam.out, 0.5) >> >> >> >> On 22 Apr 2010, at 15:41, Wolfgang Huber wrote: >> >>> Hi Mike >>> >>> can you provide a reproducible example (R script) and the output of >>> sessionInfo() for the 2-groups comparison? This would include a >>> pointer to the 6(?) CEL files on the web. >>> >>> For the 28-classes case, I doubt that hypothesis testing of the null >>> hypothesis "mean in all classes is the same" is the most useful data >>> analytic approach. Perhaps you can precisize your question. >>> >>> Best wishes >>> Wolfgang >>> >>>> Hi, >>>> I'm trying to use siggenes - sam() to look at differentially >>>> expressed genes of data taken from the Immunological Genome project >>>> (http://www.immgen.org/). A problem with this is that I have to >>>> perform the preprocessing of the original CEL files as they do not >>>> make the processed data available on GEO. To do this I'm using the >>>> aroma suite of packages to perform quantile normalisation of this >>>> data set (so far 128 arrays) in fixed memory (i.e. my laptop). This >>>> is a good thing, as it has forced me to learn a little about array >>>> preprocessing, and a bad thing as I'm new to all this and might be >>>> going horribly wrong. When it comes time to look for differentially >>>> expressed genes, I'm using siggenes - sam() and I'm getting some >>>> strange results. I'm using (what I think would be considered) many >>>> classes (28), where each class has at least 3 examples, and thus >>>> throwing out some of the arrays. The results I'm getting look like: >>>> SAM Analysis for the Multi-Class Case with 28 Classes >>>> Delta p0 False Called FDR >>>> 1 0.1 0.014 23355.05 23532 0.013997 >>>> 2 0.2 0.014 21805.98 23498 0.013087 >>>> 3 0.3 0.014 15923.83 23169 0.009693 >>>> 4 0.4 0.014 9637.47 22343 0.006083 >>>> 5 0.5 0.014 5527.75 21330 0.003655 >>>> 6 0.6 0.014 3060.73 20221 0.002135 >>>> 7 0.7 0.014 1703.82 19205 0.001251 >>>> 8 0.8 0.014 953.02 18256 0.000736 >>>> 9 0.9 0.014 536.81 17382 0.000436 >>>> 10 1.0 0.014 307.19 16730 0.000259 >>>> 11 1.1 0.014 176 16143 0.000154 >>>> which I think implies that many many genes are differentially >>>> expressed. Using plot(sam.out, 1.2) shows a pretty much vertical >>>> line starting at the origin, with no genes observably behaving like >>>> the null model. Even if I only try this on 2 classes, and hence >>>> throwing out most of the data, I'm still not getting sensible results. >>>> Now I'm hoping that I'm doing something wrong, and that not 16K of >>>> my genes are differentially expressed. However, I'm having >>>> difficulty figuring out what it might be. The one striking thing >>>> between my data set and the golub example set is that golub seems to >>>> be zero-mean - is this a requirement for sam()? >>>> Any other ideas of what to look for, or what other information I >>>> could provide to help this question make sense, would be greatly >>>> appreciated. >>>> Thanks in advance, >>>> Mike Dewar >>>> - - - >>>> Dr Michael Dewar >>>> Postdoctoral Research Scientist Applied Mathematics >>>> Columbia University >>>> http://www.columbia.edu/~md2954/ >>>> [[alternative HTML version deleted]] >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch <mailto:bioconductor at="" stat.math.ethz.ch=""> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> -- >>> >>> >>> Wolfgang Huber >>> EMBL >>> http://www.embl.de/research/units/genome_biology/huber >>> >>> >>> >> >> - - - >> Dr Michael Dewar >> **Postdoctoral Research Scientist Applied Mathematics >> Columbia University >> http://www.columbia.edu/~md2954/ >> >> >> >> >> >> > > -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber ADD REPLYlink written 9.5 years ago by Wolfgang Huber13k Wolfgang Huber wrote > Also, your fData(immgen) is ill-formatted. It should be a data.frame > with as many rows as features, instead it has as many columns, with one > row: > > dim(immgen) > Features Samples > 23532 128 > > dim(fData(immgen)) > [1] 23532 1 For the record....: It is ill-formatted, but in a different way: > str(fData(immgen)) 'data.frame': 23532 obs. of 1 variable:$ symbol:List of 23532 ..$10344622: chr "Gm6123" ..$ 10344624: chr "Lypla1" ..$10344633: chr "Tcea1" ..$ 10344637: chr "Atp6v1h" I.e. "symbol" is a list of 23532 character vectors of length 1 each, while it should be a single character vectors of length 23532 (you probably used an 'lapply' somewhere an 'sapply' would have been better). This is probably a tangential problem, not sure it has anything to do with those 'batch effects' in your data. Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber