Analysis with MBNI re-mapped (custom) CDF files


Guido Hooiveld ★ 4.1k

@guido-hooiveld-2020

Last seen 10 days ago

Wageningen University, Wageningen, the …

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20070131/efc837e5/attachment.pl


Updated 17.9 years ago by Seth Falcon ★ 7.4k • written 17.9 years ago by Guido Hooiveld ★ 4.1k


Seth Falcon ★ 7.4k

@seth-falcon-992

Last seen 10.3 years ago

"Hooiveld, Guido" <guido.hooiveld at wur.nl> writes:

> Outcomes:
> library(mouse4302probe)
> a <- as.data.frame(mouse4302probe)
> b <- as.factor(a[,4])
> table(table(b))
>
>     8     9    10    11    20    21
>     1     5    20 45032    40     3
>
> - How can I extract the names of (let's say) the 230 probesets that
> consist of 3 probes?

Here's one way:

    library("mouse4302probe")
    a <- as.data.frame(mouse4302probe)
    b <- as.factor(a[,4])
    zz <- table(b)                           # number of probes per probe set
    zz[1:3]
    probes <- dimnames(zz)$b                 # probe set names
    probes[1:3]
    counts <- as.vector(zz)                  # matching probe counts
    counts[1:3]
    probesByCount <- split(probes, counts)   # group names by probe count
    names(probesByCount)
    probesByCount[["21"]]

I'll let others chime in on your other questions.

+ seth

17.9 years ago • Seth Falcon ★ 7.4k
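The table-and-split approach in Seth's reply can be tried without installing a probe package. The sketch below uses a made-up vector of probe set names (the names `A_at` to `D_at` and their probe counts are assumptions for illustration only) in place of column 4 of the probe data frame.

```r
# Toy stand-in for the probe set name column of a probe package.
# All names and counts here are invented for illustration.
probe_set <- c(rep("A_at", 3), rep("B_at", 11), rep("C_at", 3), rep("D_at", 21))

counts <- table(probe_set)                            # probes per probe set
by_count <- split(names(counts), as.vector(counts))   # names grouped by count

names(by_count)                                # the distinct probe set sizes
by_count[["3"]]                                # probe sets with exactly 3 probes
names(counts)[as.vector(counts) == 3]          # same selection by logical indexing
```

Indexing the list by the count as a character string (`"3"`, `"21"`) is what picks out the probe sets of a given size.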


James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 6 days ago

United States

Hi Guido,

First off, I would like to thank you for the honorary doctorate ;-D

Hooiveld, Guido wrote:
> Dear list,
>
> Because I like the underlying idea, I have begun using the re-mapped
> CDF files provided by the MBNI. However, triggered by a remark made
> by Dr MacDonald "... note that there are some downsides to using
> these cdfs, mainly that the standard errors of your estimates will be
> highly variable, since the probesets for these cdfs are quite
> variable in size (unlike the stock affy chip, where the vast majority
> have 11 probes)" from this thread
> http://article.gmane.org/gmane.science.biology.informatics.conductor/11282,
> I determined the number of probes that map to a probe set for both
> the default Affymetrix CDF file and the Entrez Gene-based re-mapped
> CDF file for the Mouse430_2 array.
>
> Outcomes:
> library(mouse4302probe)
> a <- as.data.frame(mouse4302probe)
> b <- as.factor(a[,4])
> table(table(b))
>
>     8     9    10    11    20    21
>     1     5    20 45032    40     3
>
> library(mm430mmentrezgprobe)
> a <- as.data.frame(mm430mmentrezgprobe)
> b <- as.factor(a[,4])
> table(table(b))
>
>    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
>  230  213  219  283  419  663 1265 1741 5092  284  261  234  193  205  206  255
>
>   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34
>  412  569  639 1249  121   98   96   91   72   89  113  122  173  166  279   38
>
>   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49   50
>   39   30   32   36   20   35   41   46   40   50   18   15   10    6    8    9
>
>   51   52   53   54   55   56   57   58   60   61   62   63   64   65   66   67
>    9   14   13   12   18    6    6    1    4    3    4    2    2    2    1    1
>
>   68   70   71   73   74   75   76   80   89
>    3    3    3    3    2    2    1    1    1
>
> This indeed confirms Dr MacDonald's observations, which I would like
> to address in more detail... However, as a biologist with limited
> experience with statistics & R/BioC, I do have some (practical)
> questions:
>
> - How can I extract the names of (let's say) the 230 probesets that
> consist of 3 probes?

    > library(mm430mmentrezgcdf)
    > a <- as.list(mm430mmentrezgcdf)
    > b <- lapply(a, function(x) dim(x)[1])
    > d <- names(b[which(b == 3)])
    > length(d)
    [1] 230
    > d[1:10]
     [1] "76826_at"  "12523_at"  "67804_at"  "11489_at"  "66414_at"  "382562_at" "225651_at"
     [8] "385407_at" "269587_at" "21960_at"

> - When applying RMA, probe set expression levels are summarized
> according to Median Polish. What is the minimum number of probes (x)
> that have to be summarized to obtain a robust average using Median
> Polish? In other words, probe sets consisting of fewer than x probes
> are better not dealt with?

The median will be robust regardless of the number of probes. The real issue is that the expression values you calculate will have varying standard errors that depend on the number of probes in the probesets. If you then do e.g., univariate t-tests over all the probesets, you are ignoring the fact that some of these estimates have much larger standard errors than others. For instance, you might get a really large t-statistic for a probeset that only had three probes and rank that as more significant than a probeset with a slightly smaller t-statistic, but 50 probes.

So probably you wouldn't want to use some fixed cutoff where you say that five probes is bad, but six is good. Instead you might want to do some weighting of the t-statistic based on the number of probes, or some calculation of the standard error. I'll leave that sort of stuff for the people with real PhDs ;-p

> - Can the standard error of the estimated expression according to RMA
> be extracted from an eSet? If so, how could this be propagated into
> the statistical analysis (eg. limma) that is used to identify DEGs?

You don't get a standard error from RMA. You might be able to do something with the residuals from the median polish fit to try and estimate the standard error, but that would take some hacking of the code and might not even be statistically valid, depending on what you do.
You could use the affyPLM package to fit your expression values. This uses a slightly different model, and fits the data using iteratively re-weighted least squares rather than median polish. However, you do get weights as well as the residuals, which you might be able to use. I don't think there is anything in limma that you can use directly to weight your results however.

The simplest alternative is to just use the MBNI cdfs as is, with the realization that you may be throwing some false positives into your list of interesting genes (or ordering things incorrectly because you are ignoring the standard errors). In truth I am not sure this is any worse than using the stock affy cdfs and ignoring the fact that a certain proportion of the probesets contain probes that either don't interrogate the transcript of interest, or bind to multiple transcripts, or bind to nothing at all. In reality the stock cdfs have the same (or greater) problems as the MBNI cdfs, it's just convenient to ignore that fact.

Best,

Jim

> FYI: as a biologist I have concluded that re-mapping improved my
> analyses: when comparing the lists of most regulated genes based on
> analyses with Affy or re-mapped CDF, the latter identified genes that
> were missing in the Affy top-list, although those genes were expected
> to be present based on prior knowledge. However, this only applies to
> the top-regulated genes (that are expressed at relatively high
> levels), I haven't carefully evaluated the complete lists yet.
>
> Guido
>
> ------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
>
> tel: (+)31 317 485788
> fax: (+)31 317 483342
>
> internet: http://nutrigene.4t.com
> email: guido.hooiveld at wur.nl

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

17.9 years ago • James W. MacDonald 67k
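Jim's point that the precision of a median-polish summary depends on probe set size can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated with no true difference between arrays, the noise level and array count are arbitrary, `simulate_difference` is a hypothetical helper, and base R's `medpolish()` stands in for the summarization step of RMA (no background correction or normalization is done).

```r
set.seed(1)

# Hypothetical helper: simulate one probe set (rows = probes, columns = arrays)
# with no true difference between arrays, summarise it by median polish, and
# return the estimated log2 difference between array 1 and array 2.
simulate_difference <- function(n_probes, n_arrays = 6, noise_sd = 0.5) {
  probe_effect <- rnorm(n_probes)   # probe affinities
  noise <- matrix(rnorm(n_probes * n_arrays, sd = noise_sd), n_probes, n_arrays)
  x <- 8 + probe_effect + noise     # probe_effect recycles down each column
  fit <- medpolish(x, trace.iter = FALSE)
  expr <- fit$overall + fit$col     # per-array expression summary
  expr[1] - expr[2]
}

# Spread of the estimated difference for small, stock-sized and large probe sets
probe_counts <- c(3, 11, 50)
spread <- sapply(probe_counts,
                 function(n) sd(replicate(500, simulate_difference(n))))
names(spread) <- probe_counts
spread
```

The spread of the estimated difference shrinks as the number of probes grows, which is why a t-statistic from a 3-probe probeset is not directly comparable with one from a 50-probe probeset.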

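To give a feel for what a probe-level fit by iteratively re-weighted least squares returns, as opposed to median polish, here is a minimal sketch on simulated data. It does not use affyPLM: `MASS::rlm()` is swapped in as a generic robust regression, and the model, the data and the injected outlier are all assumptions made for illustration.

```r
library(MASS)  # provides rlm(); used here as a stand-in, this is not affyPLM

set.seed(42)
n_probes <- 11
n_arrays <- 4

# One row per probe-by-array measurement (all values are simulated)
d <- expand.grid(probe = factor(seq_len(n_probes)),
                 array = factor(seq_len(n_arrays)))
array_effect <- c(0, 0, 1, 1)[as.integer(d$array)]
probe_effect <- rnorm(n_probes)[as.integer(d$probe)]
d$y <- 8 + array_effect + probe_effect + rnorm(nrow(d), sd = 0.3)
d$y[5] <- d$y[5] + 4   # make probe 5 on array 1 an outlier

# Robust probe-level model: log2 intensity ~ array + probe, fitted by IRLS.
# Sum-to-zero contrasts make each array coefficient the expression estimate
# at the average probe effect.
fit <- rlm(y ~ 0 + array + probe, data = d,
           contrasts = list(probe = "contr.sum"))

coefs <- summary(fit)$coefficients
array_rows <- grep("^array", rownames(coefs))
coefs[array_rows, c("Value", "Std. Error")]   # estimates with standard errors
round(fit$w[1:11], 2)                         # IRLS weights for array 1
```

Unlike median polish, this kind of fit yields a standard error per array estimate and a weight per probe measurement, with the outlying measurement down-weighted. Whether and how such quantities should be carried into a downstream test is left open, as in Jim's reply.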