Some Genefilter questions


Amy Mikhail ▴ 460

@amy-mikhail-1317

Last seen 9.6 years ago

Dear Bioconductors,

I am analysing 6 PlasmodiumAnopheles genechips, which have only Anopheles mosquito samples hybridised to them (i.e. they are not infected mosquitoes). The 6 chips include 3 replicates, each consisting of two time points. The design matrix is as follows:

> design
     M15d M43d
[1,]    1    0
[2,]    0    1
[3,]    1    0
[4,]    0    1
[5,]    1    0
[6,]    0    1

I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5 (in affy). Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and 0 DE genes, respectively... far fewer than I was expecting.

As this affy chip contains probesets for both mosquito and malaria parasite genes, I am wondering:

(a) if it is better to remove all the parasite probesets before my analysis;

(b) if so, at what stage I should do this (before or after normalisation and background correction, or does it matter?)

(c) how would I filter out these probesets using genefilter (all the parasite affy IDs begin with Pf. - could I use this prefix in the affy IDs to filter out the probesets, and if so how?)

Secondly, I did not add any of the polyA controls to my samples. I would like to know:

(d) Do any of the bg correct / normalisation methods I tried utilise Affymetrix control probesets, and if so, how?

(e) Should I also filter out the control sets - again, if so at what stage in the analysis, and what would be appropriate code to use?

I did try the code for non-specific filtering (on my RMA dataset) from pg. 232 of the Bioconductor monograph, but the reduction in the number of probesets was quite drastic:

> f1 <- pOverA(0.25, log2(100))
> f2 <- function(x) (IQR(x) > 0.5)
> ff <- filterfun(f1, f2)
> selected <- genefilter(Baseage.transformed, ff)
> sum(selected)
[1] 404      ### (The original no. of probesets is 22,726) ###
> Baseage.sub <- Baseage.transformed[selected, ]

Also, I understood from the monograph that "100" was to filter out fluorescence intensities less than this, but I am not clear whether this refers to raw intensities or log2 values?

All the parasite probesets have raw intensities <35 .... so could I apply this as a simple filter, and would this have to be on raw (rather than normalised) data?

Apologies for the long posting...

Looking forward to any replies,
Regards,
Amy

> sessionInfo()
R version 2.4.0 (2006-10-03)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
 [1] "tcltk"     "splines"   "tools"     "methods"   "stats"
 [6] "graphics"  "grDevices" "utils"     "datasets"  "base"

other attached packages:
plasmodiumanophelescdf   tkWidgets      DynDoc   widgetTools  agahomology
              "1.14.0"    "1.12.0"    "1.12.0"      "1.10.0"     "1.14.2"
               affyPLM       gcrma matchprobes      affydata      annaffy
              "1.10.0"     "2.6.0"     "1.6.0"      "1.10.0"      "1.6.0"
                  KEGG          GO       limma   geneplotter     annotate
              "1.14.0"    "1.14.0"     "2.9.1"      "1.12.0"     "1.12.0"
                  affy      affyio  genefilter      survival      Biobase
              "1.12.0"     "1.2.0"    "1.12.0"        "2.29"     "1.12.0"

-------------------------------------------
Amy Mikhail
Research student
University of Aberdeen
Zoology Building
Tillydrone Avenue
Aberdeen AB24 2TZ
Scotland
Email: a.mikhail at abdn.ac.uk
Phone: 00-44-1224-272880 (lab)
       00-44-1224-273256 (office)
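On the scale question in the post above: the cut-off passed to pOverA is compared directly with whatever values are in the expression matrix, so on RMA output (which is already on the log2 scale) log2(100), roughly 6.64, corresponds to a raw intensity of 100. The sketch below is a base-R stand-in for that filter as the genefilter documentation describes it (keep a probeset when at least a proportion p of its values exceed A). The helper name p_over_a and the toy matrix are invented for illustration and are not part of genefilter.

```r
# Base-R stand-in for genefilter::pOverA (hypothetical helper, illustration only):
# TRUE when at least a proportion p of the values in x are greater than A.
p_over_a <- function(x, p, A) mean(x > A, na.rm = TRUE) >= p

# Toy log2-scale expression matrix: 3 probesets x 4 arrays (made-up values)
x <- rbind(
  low  = c(4.0, 4.5, 5.0, 4.2),   # never above log2(100)
  one  = c(4.0, 7.0, 5.0, 4.2),   # above the cut-off in 1 of 4 arrays (25%)
  high = c(9.0, 9.5, 8.8, 9.1)    # always above
)

threshold <- log2(100)            # cut-off expressed on the log2 scale
keep <- apply(x, 1, p_over_a, p = 0.25, A = threshold)
keep
```

With only 6 arrays, p = 0.25 means a probeset needs at least 2 arrays above the cut-off, which is one reason a combined intensity-plus-IQR filter can remove most probesets.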

GO Survival cdf genefilter geneplotter tkWidgets affy affydata widgetTools gcrma affyPLM • 1.9k views

ADD COMMENT • link updated 17.4 years ago by rgentleman ★ 5.5k • written 17.4 years ago by Amy Mikhail ▴ 460


rgentleman ★ 5.5k

@rgentleman-7725

Last seen 9.0 years ago

United States

Hi,

Amy Mikhail wrote:
> As this affy chip contains probesets for both mosquito and malaria
> parasite genes, I am wondering:
>
> (a) if it is better to remove all the parasite probesets before my analysis;

Yes, if you don't intend to use them, and they are not relevant to your analysis. There is no point in doing p-value corrections for tests you know are not interesting/relevant a priori.

> (b) if so at what stage I should do this (before or after normalisation
> and background correction, or does it matter?)

After both and prior to analysis - otherwise you are likely to need to do some serious tweaking of the normalization code.

> (c) how would I filter out these probesets using genefilter (all the
> parasite affy IDs begin with Pf. - could I use this prefix in the affy IDs
> to filter out the probesets, and if so how?)

You don't need genefilter at all; this is a subsetting problem. If you had an ExpressionSet you would do something like:

    parasites = grepl("^Pf", featureNames(myExpressionSet))
    mySubset = myExpressionSet[!parasites, ]

> Secondly, I did not add any of the polyA controls to my samples. I would
> like to know:
>
> (d) Do any of the bg correct / normalisation methods I tried utilise
> affymetrix control probesets, and if so, how?

I doubt it.

> (e) Should I also filter out the control sets - again, if so at what stage
> in the analysis and what would be an appropriate code to use?

Same place as you filter the parasite genes and pretty much in the same way. They are likely to start with AFFX.

> I did try the code for non-specific filtering (on my RMA dataset) from pg.
> 232 of the bioconductor monograph, but the reduction in the number of
> probesets was quite drastic;
>
>> f1 <- pOverA(0.25, log2(100))
>> f2 <- function(x) (IQR(x) > 0.5)

That is a typo in the text - you probably want to filter out those with IQR below the median, not for some fixed value.

>> ff <- filterfun(f1, f2)
>> selected <- genefilter(Baseage.transformed, ff)
>> sum(selected)
> [1] 404 ###(The original no. of probesets is 22,726)###
>> Baseage.sub <- Baseage.transformed[selected, ]
>
> Also, I understood from the monograph that "100" was to filter out
> fluorescence intensities less than this, but I am not clear if this is
> from raw intensities or log2 values?

Raw - 100 on the log2 scale is larger than can be represented in the image file formats used. And don't do that - it is not a good idea - filter on variability.

> All the parasite probesets have raw intensities <35 .... so could I apply
> this as a simple filter, and would this have to be on raw (rather than
> normalised data)?

Best wishes
Robert

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
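The prefix-based subsetting suggested in this reply can be tried out on a plain matrix without any Bioconductor packages. This is only a sketch: the matrix `expr` stands in for `exprs(myExpressionSet)`, and the probeset IDs are invented apart from the `Pf` and `AFFX` prefixes mentioned in the thread. With a real ExpressionSet the same logical vector, built from `featureNames()`, would presumably be used as the row index.

```r
# Toy expression matrix standing in for exprs(myExpressionSet);
# the probeset IDs are made up, only the Pf / AFFX prefixes come from the thread.
ids <- c("Ag.2R.2004.0_CDS_at", "Pf.1.1.0_CDS_at",
         "AFFX-BioB-5_at", "Ag.3R.1526.1_a_at")
expr <- matrix(seq_len(8), nrow = 4,
               dimnames = list(ids, c("M15d", "M43d")))

# Logical vector: TRUE for parasite (Pf...) or control (AFFX...) probesets
unwanted <- grepl("^(Pf|AFFX)", rownames(expr))

# Keep everything else; drop = FALSE keeps a matrix even if one row remains
kept <- expr[!unwanted, , drop = FALSE]
rownames(kept)
```

A logical index from `grepl()` is used rather than the integer positions from `grep()` because `!` applied to positions gives all FALSE, and negative indexing with `-grep(...)` removes every row when nothing matches.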

ADD COMMENT • link 17.4 years ago rgentleman ★ 5.5k


Hi Amy,

Don't you just love it when you get one response suggesting you do one thing (remove malarial genes after pre-processing) and another response suggesting the opposite? Although I think in this case Robert was suggesting you remove them after pre-processing because it was easier than trying to modify either the normalization code or the cdf environment, which is what Jim pointed out to you. I ran into this same problem with having probesets for other species on the soybean array, which is why I used Ariel's code.

I think that if you're using a mixed species array but only put one of the species on it, then you should remove the other species' probesets BEFORE doing the normalization because they really have no bearing on the transcriptome you're trying to measure. On the other hand, if you also want to filter your species' probesets based on presence/absence, minimum cutoff, variation, etc.*, then you should filter these genes AFTER doing the pre-processing because these probesets do contain information about the transcriptome, even if it is just 'not detectably expressed'.

Cheers,
Jenny

* Contrary to Robert, I prefer to filter on presence/absence (using Affy's calls) rather than variability :) I don't know if there is any documentation on which may be "better"...

Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801 USA
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu
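For readers who want to see what a presence/absence filter of the kind mentioned in the footnote looks like, here is a minimal base-R sketch. It assumes a character matrix of detection calls ("P" present, "M" marginal, "A" absent), one row per probeset. In the affy package such a matrix would probably come from `exprs(mas5calls(...))` on the raw AffyBatch, but that step is not shown here, and the matrix, the probeset names and the cut-off `min_present` are all invented for illustration.

```r
# Made-up detection calls for 3 probesets on 6 arrays
calls <- rbind(
  probeA = c("P", "P", "A", "P", "P", "P"),
  probeB = c("A", "A", "A", "A", "A", "A"),
  probeC = c("A", "P", "A", "P", "A", "M")
)
colnames(calls) <- paste0(rep(c("M15d", "M43d"), 3), "_", rep(1:3, each = 2))

# Keep a probeset when it is called present on at least min_present arrays.
# min_present = 2 is an arbitrary choice for this example, not a recommendation.
min_present <- 2
n_present <- rowSums(calls == "P")
keep <- n_present >= min_present
keep
```

Note that probeC is present only at the second time point (M43d); requiring presence in every array would discard exactly the kind of on/off gene a two-group comparison is looking for, so the cut-off is best chosen with the design in mind.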

ADD REPLY • link 17.4 years ago Jenny Drnevich ★ 2.2k


Hi,

It may be worth pointing out that a related question can have a huge impact on normalization of certain glass arrays. One of the standard protocols on the Agilent 44K human arrays causes several hundred control spots to light up extremely brightly in the green channel, but remain completely off in the red channel. If you leave these control spots in the data set when you normalize between channels (i.e., within arrays), every known normalization method breaks, in the precise sense that it will systematically distort the comparison between the red and green channels. If you then model the data incorporating a dye effect, you will think that almost every gene exhibits a dye bias. On the other hand, if you remove these control spots before normalizing between channels, then modeling the dye bias suggests that it rarely exists....

As for the question originally asked here, I would not expect the foreign species probes to break the normalization (unless they somehow light up in one group of samples but not in the other). So, my own bias would be to keep them for background correction and normalization, but remove them before the rest of the analysis.

Best,
Kevin

ADD REPLY • link 17.4 years ago Kevin R. Coombes ▴ 140


Hi all,

Jenny, just wanted to clarify what you said; you reckon if I only want to remove the foreign species probesets I should do this before preprocessing, but if I want to remove e.g. absent calls from my own species probes I should do this after preprocessing. Is this right? Also, how do I create the character vector of my parasite probesets for your code?

Robert, I tried subsetting after preprocessing but before analysis ... it made no difference to the order of probesets, however the numbers changed slightly (all the probesets had slightly higher adjusted P.values after removing the parasite probes). See below:

(a) Toptable for full dataset:

                        ID         M         A          t      P.Value   adj.P.Val        B
5808   Ag.2R.2004.0_CDS_at -1.870657  9.585064 -16.705963 2.730301e-07 0.006216623 4.207052
12128    Ag.3R.1526.1_a_at -1.129926  9.969329 -13.778759 1.140079e-06 0.010670646 3.731215
6675  Ag.2R.274.0_UTR_a_at -2.967667  9.851482 -13.392310 1.405944e-06 0.010670646 3.650675
6676  Ag.2R.274.1_CDS_a_at -1.871438  9.486805 -12.842425 1.913317e-06 0.010891076 3.526999
7614    Ag.2R.354.0_UTR_at -1.266767  8.481348 -11.394707 4.581189e-06 0.020119389 3.141374
4531    Ag.2L.992.0_CDS_at  2.026152  9.203893  11.167484 5.301785e-06 0.020119389 3.071661
7990  Ag.2R.424.0_CDS_a_at  1.240622  9.747394  10.326106 9.329289e-06 0.030345512 2.787711
7615     Ag.2R.354.16_a_at -2.045494  9.100215 -10.046394 1.135967e-05 0.032331041 2.683414
13171  Ag.3R.2423.0_CDS_at -0.962208  6.088883  -9.672024 1.489835e-05 0.032613809 2.535235
1233     Ag.2L.1092.1_a_at  0.967778 11.195894   9.604850 1.565626e-05 0.032613809 2.507552
3645    Ag.2L.387.0_CDS_at -1.291859  6.257007  -9.596269 1.575616e-05 0.032613809 2.503991
6674  Ag.2R.274.0_CDS_s_at -1.748227  8.217272  -9.022044 2.439458e-05 0.046286683 2.252335

(b) Toptable for dataset minus parasite probesets:

                        ID          M         A          t      P.Value   adj.P.Val          B
5808   Ag.2R.2004.0_CDS_at -1.8706568  9.585064 -16.460263 4.609906e-07 0.008415383 4.22498712
12128    Ag.3R.1526.1_a_at -1.1299262  9.969329 -13.637285 1.764053e-06 0.013877872 3.73030514
6675  Ag.2R.274.0_UTR_a_at -2.9676671  9.851482 -13.144767 2.289137e-06 0.013877872 3.61989734
6676  Ag.2R.274.1_CDS_a_at -1.8714376  9.486805 -12.626803 3.040892e-06 0.013877872 3.49400490
7614    Ag.2R.354.0_UTR_at -1.2667670  8.481348 -11.227966 6.932125e-06 0.024830944 3.09513993
4531    Ag.2L.992.0_CDS_at  2.0261521  9.203893  10.968142 8.161362e-06 0.024830944 3.01011426
7990  Ag.2R.424.0_CDS_a_at  1.2406222  9.747394  10.167325 1.380828e-05 0.036010013 2.72261326
7615     Ag.2R.354.16_a_at -2.0454939  9.100215  -9.863084 1.702538e-05 0.038169133 2.60232832
13171  Ag.3R.2423.0_CDS_at -0.9622079  6.088883  -9.542971 2.135453e-05 0.038169133 2.46851929
1233     Ag.2L.1092.1_a_at  0.9677780 11.195894   9.475125 2.242393e-05 0.038169133 2.43915802
3645    Ag.2L.387.0_CDS_at -1.2918594  6.257007  -9.440086 2.299975e-05 0.038169133 2.42385347
6674  Ag.2R.274.0_CDS_s_at -1.7482273  8.217272  -8.858858 3.545759e-05 0.053939852 2.15526082

Why would the adjusted P values be higher in the second case (number of parasite probes removed was about 4,000)?

Regards,
Amy

-------------------------------------------
Amy Mikhail
Research student
University of Aberdeen
Zoology Building
Tillydrone Avenue
Aberdeen AB24 2TZ
Scotland
Email: a.mikhail at abdn.ac.uk
Phone: 00-44-1224-272880 (lab)
       00-44-1224-273256 (office)
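The numbers in the two tables above allow a quick sanity check of where the increase comes from. The sketch below uses only the top-ranked row of each table and the Benjamini-Hochberg rule that, for the smallest p-value, the adjusted value is p times the number of tests (as long as no lower-ranked test gives a smaller value, which holds here). It suggests that the multiple-testing burden did shrink after removing the parasite probesets, but the raw p-values themselves grew by more. A possible reason, not confirmed anywhere in this thread, is that limma's empirical Bayes step estimates its prior variance from all probesets in the fit, so dropping several thousand low-intensity probesets changes the moderated t-statistics (note that the t column differs between the tables while M and A do not).

```r
# Top-ranked probeset (Ag.2R.2004.0_CDS_at) in the two tables above
p_full <- 2.730301e-07; adj_full <- 0.006216623   # full dataset
p_sub  <- 4.609906e-07; adj_sub  <- 0.008415383   # parasite probesets removed

# For the rank-1 test, BH gives adj = p * n, so n is approximately adj / p
n_full <- round(adj_full / p_full)
n_sub  <- round(adj_sub / p_sub)
c(n_full = n_full, n_sub = n_sub)

# Had the raw p-value stayed the same, fewer tests would have lowered the adjusted value
adj_if_p_unchanged <- p_full * n_sub
adj_if_p_unchanged < adj_full

# The raw p-value grew by more than the number of tests shrank
c(p_ratio = p_sub / p_full, n_ratio = n_full / n_sub)
```

So the change in adj.P.Val is the net effect of two opposing shifts: fewer tests (which helps) and larger raw p-values from the refitted model (which hurts more, in this data set).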

ADD REPLY • link 17.4 years ago Amy Mikhail ▴ 460


An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20061130/f234fd61/attachment.pl

ADD REPLY • link 17.4 years ago Lourdusamy A Anbarasu ▴ 30


Hi,

Lourdusamy A Anbarasu wrote:
> Dear Dr. Robert,
>
> You have mentioned that filtering on variability is preferred
> to filtering on raw intensity values. I have also read your previous post on this
> issue. For filters based on CV, are there any recommended cut-off values?

Not really. A widely held, but AFAIK undocumented, belief is that in any given tissue/cell about 40% of the genome is expressed at any time. So, I usually choose the median - that is somewhat conservative with respect to the above cited statistic - but this is a personal preference. I have not seen any research (and I think it would be hard).

best wishes
Robert

> Thanks in advance.
>
> Best regards,
> Anbarasu
>
> --
> Lourdusamy A Anbarasu
> Dipartimento Medicina Sperimentale e Sanita Pubblica
> Via Scalzino 3
> 62032 Camerino (MC)

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org

ADD REPLY • link 17.4 years ago rgentleman ★ 5.5k
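For readers who want to see the "keep the probesets whose IQR is above the median" idea from the reply above in code, here is a minimal base-R sketch. It is only an illustration under stated assumptions: the matrix `expr` is simulated, the names `probe_iqr`, `keep` and `expr_filtered` are made up for this example, and a real analysis would apply the same logic to the expression matrix extracted from the normalised data.

```r
# Toy log2 expression matrix: 8 probesets (rows) x 6 arrays (columns).
# All values are simulated; nothing here comes from the poster's data.
# Rows 1-4 are drawn with a small spread, rows 5-8 with a larger one.
set.seed(42)
expr <- matrix(rnorm(8 * 6, mean = 8, sd = rep(c(0.1, 1), each = 4)),
               nrow = 8,
               dimnames = list(paste0("probe", 1:8), paste0("chip", 1:6)))

# Interquartile range of each probeset across the arrays
probe_iqr <- apply(expr, 1, IQR)

# Data-driven cut-off: keep probesets whose IQR exceeds the median IQR,
# rather than comparing against a fixed value such as 0.5
keep <- probe_iqr > median(probe_iqr)
expr_filtered <- expr[keep, , drop = FALSE]

sum(keep)            # number of probesets retained
dim(expr_filtered)
```

Because the cut-off is the median of the observed IQRs, roughly half of the probesets are kept whatever the scale of the data, which is why this is less drastic than a fixed threshold.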

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

Hi Amy,

Amy Mikhail wrote:
> Dear Bioconductors,
>
> I am analysing 6 PlasmodiumAnopheles genechips, which have only Anopheles
> mosquito samples hybridised to them (i.e. they are not infected
> mosquitoes). The 6 chips include 3 replicates, each consisting of two
> time points. The design matrix is as follows:
>
> > design
>      M15d M43d
> [1,]    1    0
> [2,]    0    1
> [3,]    1    0
> [4,]    0    1
> [5,]    1    0
> [6,]    0    1
>
> I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5 (in affy).
> Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and 0
> DE genes, respectively... much less than I was expecting.
>
> As this affy chip contains probesets for both mosquito and malaria
> parasite genes, I am wondering:
>
> (a) if it is better to remove all the parasite probesets before my analysis;

Probably. It's not the easiest thing to do. Here is a link to some code you can use:

http://article.gmane.org/gmane.science.biology.informatics.conductor/9869/match=remove+probes+cdf

Read what Ariel and Jenny write there very closely so you don't make mistakes.

> (b) if so at what stage I should do this (before or after normalisation
> and background correction, or does it matter?)

Before doing anything, most likely, which is what the above code will do for you.

> (c) how would I filter out these probesets using genefilter (all the
> parasite affy IDs begin with Pf. - could I use this prefix in the affy IDs
> to filter out the probesets, and if so how?)
>
> Secondly, I did not add any of the polyA controls to my samples. I would
> like to know:
>
> (d) Do any of the bg correct / normalisation methods I tried utilise
> affymetrix control probesets, and if so, how?

No.

> (e) Should I also filter out the control sets - again, if so at what stage
> in the analysis and what would be an appropriate code to use?

No, there aren't enough of them to have an effect on your data.

> I did try the code for non-specific filtering (on my RMA dataset) from pg.
> 232 of the bioconductor monograph, but the reduction in the number of
> probesets was quite drastic;
>
> > f1 <- pOverA(0.25, log2(100))
> > f2 <- function(x) (IQR(x) > 0.5)
> > ff <- filterfun(f1, f2)
> > selected <- genefilter(Baseage.transformed, ff)
> > sum(selected)
> [1] 404 ###(The original no. of probesets is 22,726)###
> > Baseage.sub <- Baseage.transformed[selected, ]
>
> Also, I understood from the monograph that "100" was to filter out
> fluorescence intensities less than this, but I am not clear if this is
> from raw intensities or log2 values?

It has to be data on the natural scale. The intensities for an Affy chip come from a 16-bit TIFF image, which means the brightest value can be at most 2^16 - 1 (65535), which on the log2 scale is just under 16, so you cannot even have a value that approaches 100 on the log scale.

Best,

Jim

> All the parasite probesets have raw intensities <35 .... so could I apply
> this as a simple filter, and would this have to be on raw (rather than
> normalised data)?
>
> Apologies for the long posting...
>
> Looking forward to any replies,
> Regards,
> Amy

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.

ADD COMMENT • link 17.4 years ago James W. MacDonald 65k
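The scale argument in the answer above can be checked with a few lines of arithmetic. The sketch below also includes a base-R stand-in for the "at least a proportion p of samples above A" rule, written from my understanding of what `pOverA()` tests; the helper `p_over_a` and the simulated matrix `log_expr` are hypothetical and exist only for this illustration.

```r
# Largest value a 16-bit image can store, and its log2
max_raw <- 2^16 - 1
log2(max_raw)                # just under 16

# The monograph's threshold is a raw intensity of 100, moved to the log2 scale
threshold <- log2(100)       # about 6.64

# Base-R stand-in for the "proportion p of samples over A" rule, applied to a
# toy log2-scale matrix (simulated values, not real data)
set.seed(1)
log_expr <- matrix(runif(5 * 6, min = 4, max = 12), nrow = 5,
                   dimnames = list(paste0("probe", 1:5), paste0("chip", 1:6)))

p_over_a <- function(x, p, A) mean(x > A) >= p   # hypothetical helper
passes <- apply(log_expr, 1, p_over_a, p = 0.25, A = threshold)
passes
```

So `pOverA(0.25, log2(100))` is meant for data that are already on the log2 scale (as RMA output is), with the cut-off corresponding to 100 on the raw scale.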

Hi again,

Some parts of my answer and of Jim's are in disagreement - it might be nice to hear other points of view here.

The question is really whether there is anything to be gained by removing the probes (probesets) we know are not involved before normalization and background correction.

Clearly these probes will help with background correction, but they could substantially interfere with normalization. I don't personally think (no evidence at all though) that this is a problem - but would love to see some quantitative comparisons of results that took both approaches to see if the end results are qualitatively different.

best wishes
Robert

ADD REPLY • link 17.4 years ago rgentleman ★ 5.5k
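To make the open question above concrete, here is a toy sketch of why the order of subsetting and normalisation can matter. It is not RMA: it uses a hand-written quantile normalisation on simulated probeset-level values, and the `Ag.` prefix, the function `quantile_normalise` and all numbers are assumptions made for the illustration. It says nothing about how large the effect would be on real arrays.

```r
# Minimal quantile normalisation: every column is forced onto the same
# distribution, the mean of the sorted columns.
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

# Simulated log2 data: 100 "mosquito" rows and 20 low-intensity "parasite" rows
set.seed(7)
mosquito <- matrix(rnorm(100 * 6, mean = 8, sd = 1.5), ncol = 6)
parasite <- matrix(rnorm(20 * 6, mean = 5, sd = 0.3), ncol = 6)
full <- rbind(mosquito, parasite)
rownames(full) <- c(paste0("Ag.", 1:100), paste0("Pf.", 1:20))  # hypothetical IDs
colnames(full) <- paste0("chip", 1:6)
is_parasite <- grepl("^Pf", rownames(full))

# Approach 1: normalise everything, then drop the parasite rows
norm_then_subset <- quantile_normalise(full)[!is_parasite, ]

# Approach 2: drop the parasite rows, then normalise what is left
subset_then_norm <- quantile_normalise(full[!is_parasite, ])

# Largest difference between the two results for the same probesets
max(abs(norm_then_subset - subset_then_norm))
```

Running both routes on the real arrays and comparing the resulting gene lists would be the quantitative comparison asked for above.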

Hi all,

I am curious to see how they compare too - as soon as I have the subsetting and character vector sorted I will try both and let you know how it turns out.

Out of interest - would it also be possible to carry out the background correction on the full dataset, then remove the parasite probesets, then normalise? (And how would one separate these functions in expresso or affyPLM, since there is a placeholder for bg.correct but not for normalisation?)

Regards,
Amy

-------------------------------------------
Amy Mikhail
Research student
University of Aberdeen
Zoology Building
Tillydrone Avenue
Aberdeen AB24 2TZ
Scotland
Email: a.mikhail at abdn.ac.uk
Phone: 00-44-1224-272880 (lab)
       00-44-1224-273256 (office)

ADD REPLY • link 17.4 years ago Amy Mikhail ▴ 460

Hi Amy,

Amy Mikhail wrote:
> Hi all,
>
> I am curious to see how they compare too - as soon as I have the
> subsetting and character vector sorted I will try both and let you know
> how it turns out.
>
> Out of interest - would it also be possible to carry out the background
> correction on the full dataset, then remove the parasite probesets, then
> normalise? (and how would one separate these functions in expresso or
> AffyPLM, since there is a placeholder for bg.correct but not for
> normalisation?)

Yes, it should be possible. Pretty much all the functions for computing expression values (except mas5()) have a 'normalize' and a 'background' argument that you can set to FALSE if you don't want to do that step. So you could do something like:

abatch <- ReadAffy()
abatch.bg <- bg.correct.rma(abatch)
## subset the cdf and AffyBatch using Ariel and Jenny's code
eset <- rma(abatch.bg, background=FALSE)

Best,

Jim

> Regards,
> Amy

ADD REPLY • link 17.4 years ago James W. MacDonald 65k

Hi Robert and Jim,

Many thanks for your advice. I have some more questions...

First, I tried what Robert suggested on my expression set. However I got a strange result:

> load("E:\\Amy - Bioconductor analysis\\03. Base age\\Affymetrix - Base Age results & analysis\\Baseage - RMA normalised.RData")
> ls()
[1] "Data" "eset" "phenodata" "x" "xy" "y"
> parasites = grep("^Pf", featureNames(eset))
> parasites
 [1] 18192 18193 18194 18195 18196 18197 18198 18199 18200 18201 18202 18203
[13] 18204 18205 18206 18207 18208 18209 18210 18211 18212 18213 18214 18215
[25] 18216 18217 18218 18219 18220 18221 18222 18223 18224 18225 18226 18227
### this list continues until no. 4,514 ###

I was expecting the parasite affy IDs to be listed here, but these are (I think) the probeset numbers (I can't tell if they are the right ones or not...)?

> mossie.sub = eset[!parasites,]
> mossie.sub
Expression Set (exprSet) with
        0 genes
        6 samples
         phenoData object with 3 variables and 6 cases
         varLabels
                Name: short name of datasets for graphs
                Population: Age of adult mosquitoes (in days) included in the sample
                Replicate: Replicate number of the experiment

So now it has removed all the genes... I don't understand why this would happen since the subset called "parasites" only contains a fraction of the total number of probesets (4,514 out of 22,769).

Next, I wanted to try Jim's suggestion on the raw data. I can follow Jenny's post up to:

"all you need now is your affybatch object, and a character vector of probe set names"

I have an affybatch object, but how do I create a character vector for the probesets I want to remove?

I'm still not very R-literate, so I tried using the same code as before, except with the raw data instead of my expression set, but the "featureNames" bit was a problem:

> parasites = grep("^Pf", featureNames(data))
Error in function (classes, fdef, mtable) :
        unable to find an inherited method for function "featureNames", for signature "function"

Any ideas?

Regards,
Amy

ADD REPLY • link 17.4 years ago Amy Mikhail ▴ 460
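A note on the error at the end of the post above: R is case sensitive, and the `ls()` output lists an object called `Data`, whereas `data` (lower case) is a function that ships with R. That would explain the `for signature "function"` message. The base-R sketch below illustrates the difference with a small stand-in matrix and also shows one way to get a character vector of matching names with `grep(..., value = TRUE)`. The matrix and its row names are invented for the example; on the real AffyBatch one would call the appropriate accessor for probeset names (`featureNames(Data)` or `geneNames(Data)`, assuming these are available for that object in the installed version of affy).

```r
# Stand-in for the poster's object; in the thread, `Data` would be the AffyBatch.
# Row names are hypothetical probeset IDs.
Data <- matrix(1:6, nrow = 3,
               dimnames = list(c("Pf.1", "Ag.1", "Ag.2"), c("chip1", "chip2")))

is.function(data)   # TRUE: lower-case `data` is the function utils::data()
is.function(Data)   # FALSE: upper-case `Data` is the object created above

# value = TRUE returns the matching names themselves rather than their positions,
# which gives the character vector of probeset names to remove
parasite_ids <- grep("^Pf", rownames(Data), value = TRUE)
parasite_ids
```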

Hi, Amy Mikhail wrote: > Hi Robert and Jim, > > Many thanks for your advice. I have some more questions... > > First, I tried what Robert suggested on my expression set. However I got > a strange result: > >> load("E:\\Amy - Bioconductor analysis\\03. Base age\\Affymetrix - Base > Age results & analysis\\Baseage - RMA normalised.RData") >> ls() > [1] "Data" "eset" "phenodata" "x" "xy" "y" > >> parasites = grep("^Pf", featureNames(eset)) >> parasites > [1] 18192 18193 18194 18195 18196 18197 18198 18199 18200 18201 18202 > 18203 > [13] 18204 18205 18206 18207 18208 18209 18210 18211 18212 18213 18214 > 18215 > [25] 18216 18217 18218 18219 18220 18221 18222 18223 18224 18225 18226 > 18227 ### this list continues untill no. 4,514 ### you can tell by using featureNames(eset)[parasites] the values in the parasites vector are the indices of the features > > I was expexting the parasite affy IDs to be listed here, but these are (I > think) the probeset numbers (I can't tell if they are the right ones or > not...)? > >> mossie.sub = eset[!parasites,] oops - should have been mossie.sub = eset[-parasites,] my mistake - I keep thinking grep returns a logical vector for some reason. >> mossie.sub > Expression Set (exprSet) with > 0 genes > 6 samples > phenoData object with 3 variables and 6 cases > varLabels > Name: short name of datasets for graphs > Population: Age of adult mosquitoes (in days) included in > the sample > Replicate: Replicate number of the experiment > > So now it has removed all the genes... I don't understand why this would > happen since the subset called "parasites" only contains a fraction of the > total number of probesets (4,514 out of 22,769). > > Next, I wanted to try Jim's suggestion on the raw data. I can follow > Jenny's post up to: > > " all you need now is your affybatch object, and a character vector of > probe set names" > > I have an affybatch object, but how do I create a character vector for the > probesets I want to remove? 
> > I'm still not very R-literate, so tried using the same code as previous > except with the raw data instead of my expression set but the > "featureNames" bit was a problem: > >> parasites = grep("^Pf", featureNames(data)) > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function "featureNames", > for signature "function" > > Any ideas? > > Regards, > > Amy > > -------------------------------------------------------------------- ------- > >> Hi Amy, >> >> Amy Mikhail wrote: >>> Dear Bioconductors, >>> >>> I am annalysing 6 PlasmodiumAnopheles genechips, which have only >>> Anopheles >>> mosquito samples hybridised to them (i.e. they are not infected >>> mosquitoes). The 6 chips include 3 replicates, each consisting of two >>> time points. The design matrix is as follows: >>> >>> >>>> design >>> M15d M43d >>> [1,] 1 0 >>> [2,] 0 1 >>> [3,] 1 0 >>> [4,] 0 1 >>> [5,] 1 0 >>> [6,] 0 1 >>> >>> >>> I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5 (in >>> affy). >>> Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and >>> 0 >>> DE genes, respectively... much less than I was expecting. >>> >>> As this affy chip contains probesets for both mosquito and malaria >>> parasite genes, I am wondering: >>> >>> (a) if it is better to remove all the parasite probesets before my >>> analysis; >> Probably. It's not the easiest thing to do. Here is a link to some code >> you can use: >> >> http://article.gmane.org/gmane.science.biology.informatics.conducto r/9869/match=remove+probes+cdf >> >> Read what Ariel and Jenny write there very closely so you don't make >> mistakes. >> >>> (b) if so at what stage I should do this (before or after normalisation >>> and background correction, or does it matter?) >> Before doing anything, most likely, which is what the above code will do >> for you. >> >>> (c) how would I filter out these probesets using genefilter (all the >>> parasite affy IDs begin with Pf. 
>>> - could I use this prefix in the affy IDs to filter out the probesets,
>>> and if so how?)
>>>
>>> Secondly, I did not add any of the polyA controls to my samples. I would
>>> like to know:
>>>
>>> (d) Do any of the bg correct / normalisation methods I tried utilise
>>> affymetrix control probesets, and if so, how?
>> No.
>>
>>> (e) Should I also filter out the control sets - again, if so at what stage
>>> in the analysis and what would be an appropriate code to use?
>> No, there aren't enough of them to have an effect on your data.
>>
>>> I did try the code for non-specific filtering (on my RMA dataset) from pg.
>>> 232 of the bioconductor monograph, but the reduction in the number of
>>> probesets was quite drastic;
>>>
>>>> f1 <- pOverA(0.25, log2(100))
>>>> f2 <- function(x) (IQR(x) > 0.5)
>>>> ff <- filterfun(f1, f2)
>>>> selected <- genefilter(Baseage.transformed, ff)
>>>> sum(selected)
>>> [1] 404 ###(The original no. of probesets is 22,726)###
>>>
>>>> Baseage.sub <- Baseage.transformed[selected, ]
>>>
>>> Also, I understood from the monograph that "100" was to filter out
>>> fluorescence intensities less than this, but I am not clear if this is
>>> from raw intensities or log2 values?
>> It has to be data on the natural scale. The intensities for an Affy chip
>> come from a 16-bit TIFF image, which means the brightest value can be
>> 2^16, which in log2 scale is 16, so you cannot even have a value that
>> approaches 100 on the log scale.
>>
>> Best,
>>
>> Jim
>>
>>> All the parasite probesets have raw intensities <35 .... so could I apply
>>> this as a simple filter, and would this have to be on raw (rather than
>>> normalised data)?
>>>
>>> Apologies for the long posting...
>>>
>>> Looking forward to any replies,
>>> Regards,
>>> Amy
>>>
>>>> sessionInfo()
>>> R version 2.4.0 (2006-10-03)
>>> i386-pc-mingw32
>>>
>>> locale:
>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] "tcltk" "splines" "tools" "methods" "stats"
>>> "graphics" "grDevices" "utils" "datasets" "base"
>>>
>>> other attached packages:
>>> plasmodiumanophelescdf  tkWidgets  DynDoc       widgetTools  agahomology
>>> "1.14.0"                "1.12.0"   "1.12.0"     "1.10.0"     "1.14.2"
>>> affyPLM                 gcrma      matchprobes  affydata     annaffy
>>> "1.10.0"                "2.6.0"    "1.6.0"      "1.10.0"     "1.6.0"
>>> KEGG                    GO         limma        geneplotter  annotate
>>> "1.14.0"                "1.14.0"   "2.9.1"      "1.12.0"     "1.12.0"
>>> affy                    affyio     genefilter   survival     Biobase
>>> "1.12.0"                "1.2.0"    "1.12.0"     "2.29"       "1.12.0"
>>>
>>> -------------------------------------------
>>> Amy Mikhail
>>> Research student
>>> University of Aberdeen
>>> Zoology Building
>>> Tillydrone Avenue
>>> Aberdeen AB24 2TZ
>>> Scotland
>>> Email: a.mikhail at abdn.ac.uk
>>> Phone: 00-44-1224-272880 (lab)
>>>        00-44-1224-273256 (office)
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> Affymetrix and cDNA Microarray Core
>> University of Michigan Cancer Center
>> 1500 E. Medical Center Drive
>> 7410 CCGC
>> Ann Arbor MI 48109
>> 734-647-5623
>>
>> **********************************************************
>> Electronic Mail is not secure, may not be read every day, and should not
>> be used for urgent or sensitive issues.
>>
>
> -------------------------------------------
> Amy Mikhail
> Research student
> University of Aberdeen
> Zoology Building
> Tillydrone Avenue
> Aberdeen AB24 2TZ
> Scotland
> Email: a.mikhail at abdn.ac.uk
> Phone: 00-44-1224-272880 (lab)
>        00-44-1224-273256 (office)
>

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
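
For anyone following along, here is a minimal base-R sketch of why eset[!parasites,] came back with 0 genes and why eset[-parasites,] is the fix. The probeset IDs in ids are made up for illustration; with real data, featureNames(eset) would take their place. The variable names parasite.ids, keep.wrong, keep.neg and keep.lgl are also just illustrative.

```r
# Hypothetical probeset IDs standing in for featureNames(eset)
ids <- c("Ag.2L.1.0_at", "Ag.3R.7.1_s_at", "Pf.1.12.0_at",
         "Pf.4.220.0_at", "AFFX-BioB-5_at")

# grep() returns the integer positions of the matches, not TRUE/FALSE
parasites <- grep("^Pf", ids)

# Indexing with those positions recovers the IDs as a character vector;
# grep("^Pf", ids, value = TRUE) gives the same strings in one step
parasite.ids <- ids[parasites]

# "!" turns every non-zero integer into FALSE, and the short logical
# vector is recycled over all rows, so nothing at all is selected
keep.wrong <- ids[!parasites]

# Negative integer indices drop the matched positions instead
keep.neg <- ids[-parasites]

# grepl() (not in very old R releases) returns a logical vector,
# so negating it with "!" behaves as intended
keep.lgl <- ids[!grepl("^Pf", ids)]

print(parasites)
print(parasite.ids)
print(length(keep.wrong))
print(keep.neg)
```

One caveat with negative indexing: if the pattern matches nothing, parasites is empty and ids[-parasites] returns an empty vector rather than everything, so the logical form is the safer of the two. As for the featureNames(data) error, the ls() output lists the object as "Data" with a capital D; lower-case data is R's built-in function for loading datasets, which would explain the signature "function" in the message. Whether featureNames() or geneNames() is the right accessor for an AffyBatch in that affy version is an assumption to check against the package documentation.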

17.4 years ago rgentleman ★ 5.5k
