Breaking the "most genes not differentially expressed" assumption

0

Entering edit mode

Paolo Innocenti ▴ 320

@paolo-innocenti-2191

Last seen 9.8 years ago

Hi all, I have dataset of 120 Affy arrays, 60 males and 60 females. The expression profiles of the 2 groups differs dramatically, i.e. if I run a standard RMA + limma, I have ~90% of the genes differentially expressed. Also, downregulated genes are twice as many than upregulated genes, although if I impose a cutoff of two-fold difference in expression, they are almost equal (15% up and 15% down). This is clearly breaking the assumption that most of the genes on the array should not be differentially expressed, but the result is in line with the current knowledge of sex-biased gene expression in my model organism. I have done some quality control plots, available here: - Boxplot: http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_boxplot.pdf - Frequency histogram: http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_histogram.pdf - RLE and NUSE plots: http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_RLEandNUSE1.pdf - CorrelationPlot: http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf - PCA, after RMA normalization: http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_pca.pdf Now, my questions are: 1) Is my issue really a issue? If so, how can I perform a robust normalization of my arrays? 2) Is there a tool to assess how "robust" your pre-processing method is in respect to this issue? 3) Sex-biased gene expression is not the only biological question in my experiment. Is the massive size of this effect going to affect the "detectability" of other smaller effects? (through normalization or correction for multiple testing or other?) Thanks, paolo -- Paolo Innocenti Department of Animal Ecology, EBC Uppsala University Norbyv?gen 18D 75236 Uppsala, Sweden

Normalization affy limma Normalization affy limma • 1.2k views

ADD COMMENT • link updated 15.1 years ago by Wolfgang Huber ★ 13k • written 15.2 years ago by Paolo Innocenti ▴ 320

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 6 days ago

United States

On Mon, Apr 27, 2009 at 11:25 AM, Paolo Innocenti <paolo.innocenti@ebc.uu.se> wrote: > Hi all, > > I have dataset of 120 Affy arrays, 60 males and 60 females. > The expression profiles of the 2 groups differs dramatically, i.e. if I run > a standard RMA + limma, I have ~90% of the genes differentially expressed. You'll probably want to provide the code that you used to perform the RMA and limma analyses. It is difficult to comment on the results of an analysis without seeing what was done to produce them. Also, be sure to provide sessionInfo() in case there are questions about what packages are used. Sean > Also, downregulated genes are twice as many than upregulated genes, > although if I impose a cutoff of two-fold difference in expression, they are > almost equal (15% up and 15% down). > This is clearly breaking the assumption that most of the genes on the array > should not be differentially expressed, but the result is in line with the > current knowledge of sex-biased gene expression in my model organism. > > I have done some quality control plots, available here: > - Boxplot: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_boxplot.pdf > > - Frequency histogram: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_histogram.pdf > > - RLE and NUSE plots: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_RLEandNUSE1.pdf > > - CorrelationPlot: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf > > - PCA, after RMA normalization: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_pca.pdf > > Now, my questions are: > 1) Is my issue really a issue? If so, how can I perform a robust > normalization of my arrays? > > 2) Is there a tool to assess how "robust" your pre-processing method is in > respect to this issue? > > 3) Sex-biased gene expression is not the only biological question in my > experiment. Is the massive size of this effect going to affect the > "detectability" of other smaller effects? (through normalization or > correction for multiple testing or other?) > > Thanks, > paolo > > > -- > Paolo Innocenti > Department of Animal Ecology, EBC > Uppsala University > NorbyvÃ¤gen 18D > 75236 Uppsala, Sweden > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 15.2 years ago Sean Davis 21k

0

Entering edit mode

Thanks for the quick reply, here is my code, run in a fresh session: require(affy) miame <- read.MIAME("../stat_input_rawdata/hemiarray.miame") phenodata <- read.AnnotatedDataFrame( "../stat_input_rawdata/hemiarray.phenodata",sep=" ") data <- ReadAffy( filenames=phenodata$filenames, sampleNames=sampleNames(phenodata), phenoData=phenodata, description=miame, celfile.path="../stat_input_rawdata") eset <- rma(data) require(limma) f.sex <- factor(pData(data)$sex) design <- model.matrix(~0 + f.sex) colnames(design) <- levels(f.sex) fit <- lmFit(eset,design) contrast <- makeContrasts(M-F, levels=design) fit2 <- contrasts.fit(fit,contrast) fit2 <- eBayes(fit2) pval <- 0.0001 adjusted <- "fdr" res <- decideTests(fit2, adjust=adjusted, method="separate", p.value=pval) My results: > summary(res) M - F -1 11653 0 1442 1 5857 And my sessioInfo(): > sessionInfo() R version 2.8.0 (2008-10-20) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] tools stats graphics grDevices utils datasets methods [8] base other attached packages: [1] affy_1.20.0 Biobase_2.2.1 limma_2.16.3 loaded via a namespace (and not attached): [1] affyio_1.10.1 preprocessCore_1.4.0 ---- Best, paolo Sean Davis wrote: > > > On Mon, Apr 27, 2009 at 11:25 AM, Paolo Innocenti > <paolo.innocenti at="" ebc.uu.se="" <mailto:paolo.innocenti="" at="" ebc.uu.se="">> wrote: > > Hi all, > > I have dataset of 120 Affy arrays, 60 males and 60 females. > The expression profiles of the 2 groups differs dramatically, i.e. > if I run a standard RMA + limma, I have ~90% of the genes > differentially expressed. > > > You'll probably want to provide the code that you used to perform the > RMA and limma analyses. It is difficult to comment on the results of an > analysis without seeing what was done to produce them. Also, be sure to > provide sessionInfo() in case there are questions about what packages > are used. > > Sean > > > > Also, downregulated genes are twice as many than upregulated genes, > although if I impose a cutoff of two-fold difference in expression, > they are almost equal (15% up and 15% down). > This is clearly breaking the assumption that most of the genes on > the array should not be differentially expressed, but the result is > in line with the current knowledge of sex-biased gene expression in > my model organism. > > I have done some quality control plots, available here: > - Boxplot: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_boxplot.pdf > > - Frequency histogram: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_histogram.pdf > > - RLE and NUSE plots: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_RLEandNUSE1.pdf > > - CorrelationPlot: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf > > - PCA, after RMA normalization: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_pca.pdf > > Now, my questions are: > 1) Is my issue really a issue? If so, how can I perform a robust > normalization of my arrays? > > 2) Is there a tool to assess how "robust" your pre-processing method > is in respect to this issue? > > 3) Sex-biased gene expression is not the only biological question in > my experiment. Is the massive size of this effect going to affect > the "detectability" of other smaller effects? (through normalization > or correction for multiple testing or other?) > > Thanks, > paolo > > > -- > Paolo Innocenti > Department of Animal Ecology, EBC > Uppsala University > Norbyv?gen 18D > 75236 Uppsala, Sweden > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch <mailto:bioconductor at="" stat.math.ethz.ch=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Paolo Innocenti Department of Animal Ecology, EBC Uppsala University Norbyv?gen 18D 75236 Uppsala, Sweden

ADD REPLY • link 15.2 years ago Paolo Innocenti ▴ 320

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 4 days ago

EMBL European Molecular Biology Laborat…

Hi Paolo Some suggestions here: 1.) The correlation plot http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf looks bizarre. Can you explain what it shows, and why you think it is consistent with a successful experiment? 2.) How does the array index relate to whether the sample is male/female? Could it be that further experimental factors (time, lab, reagent batch) are confounded with sex? 3.) I am puzzled by your sessionInfo(). How could you run "rma" without having a cdf package loaded? 4.) You could try using different normalisation methods. The quantile normalisation used within rma is rather aggressive. You could try methods based on affine linear or local polynomial regression. Best wishes Wolfgang ------------------------------------------------ Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber Paolo Innocenti ha scritto: > Hi all, > > I have dataset of 120 Affy arrays, 60 males and 60 females. > The expression profiles of the 2 groups differs dramatically, i.e. if I > run a standard RMA + limma, I have ~90% of the genes differentially > expressed. Also, downregulated genes are twice as many than upregulated > genes, although if I impose a cutoff of two-fold difference in > expression, they are almost equal (15% up and 15% down). > This is clearly breaking the assumption that most of the genes on the > array should not be differentially expressed, but the result is in line > with the current knowledge of sex-biased gene expression in my model > organism. > > I have done some quality control plots, available here: > - Boxplot: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_boxplot.pdf > > - Frequency histogram: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_histogram.pdf > > - RLE and NUSE plots: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_RLEandNUSE1.pdf > > - CorrelationPlot: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf > > - PCA, after RMA normalization: > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_pca.pdf > > Now, my questions are: > 1) Is my issue really a issue? If so, how can I perform a robust > normalization of my arrays? > > 2) Is there a tool to assess how "robust" your pre-processing method is > in respect to this issue? > > 3) Sex-biased gene expression is not the only biological question in my > experiment. Is the massive size of this effect going to affect the > "detectability" of other smaller effects? (through normalization or > correction for multiple testing or other?) > > Thanks, > paolo > >

ADD COMMENT • link 15.1 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Hi Wolfgang and list, thanks for the suggestion: Wolfgang Huber wrote: > 1.) The correlation plot > http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf looks > bizarre. Can you explain what it shows, and why you think it is > consistent with a successful experiment? If you refer to the "chessboard" effect, it should simply show higher correlation within males and within females: the samples are indexed as 4 males, 4 females, 4 males, 4 females, ecc... Moreover, the first set of 8 samples are genetically more similar than the second set of 8, etc... (I can't see that effect in the plot though). In a recent paper, Ayroles, J. F., Carbone, M. A., Stone, E. A., Jordan, K. W., Lyman, R. F., Magwire, M. M., Rollmann, S. M., Duncan, L. H., Lawrence, F., Anholt, R. R. H., & Mackay, T. F. C. 2009. Systems genetics of complex traits in Drosophila melanogaster. Nat Genet 41: 299-307. they found 88% of the genes differentially expressed in M vs. F. (consistent with my results), and the correlation plot on their data (ArrayExpress E-MEXP-1594), after the same preprocessing, looks like this: http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot_ayroles. pdf (here the samples are 2 males, 2 females, 2 males, ...) > 2.) How does the array index relate to whether the sample is > male/female? Could it be that further experimental factors (time, lab, > reagent batch) are confounded with sex? As I said, 4 males, 4 females, ecc... I don't think there's any other possible "lab" effect: flies were flash-frozen in 1 hour interval, RNA extraction were performed in batches balanced for sex (half-half) and replicates, and randomized within sex within a 3 days interval, arrays run in batches balanced for sex and replicates within 3-4 weeks. Same fresh reagents, same protocols, same guy (me). BUT, flies are sexually dimorphic, other than for gene expression, for SIZE as well. This means that extracting from whole flies gave a consistent difference in total yield, females giving on average 3 times as much as males. I'm no expert here, but might it be that difference in initial quantity (even though RNA has been diluted/concentrated to reach the same concentration before hybridization) have an effect on the results (relative difference in degradation rate when stored for 2 weeks in -80 relatively to quantity, just wild-guessing...)? So with pre-processing I might be wiping out just technical errors instead of biological variation (would that be corroborated by the fact that I still find a HUGE effect?) > 3.) I am puzzled by your sessionInfo(). How could you run "rma" without > having a cdf package loaded? My mistake: I think a did a mistake in saving-reloading environments (don't know what I did exactly). Running the same code in another fresh session without mistakes gives the sessionInfo() attached below (with the cdf). > 4.) You could try using different normalisation methods. The quantile > normalisation used within rma is rather aggressive. You could try > methods based on affine linear or local polynomial regression. I'll give it a try with a few other normalisation methods. But the question remains: with an "aggressive method" I find a huge effect. If I apply a softer one, what am I expected to find? At the end of the day: do my data look normal, and: am I breaking assumptions? Thanks a lot for your help, and thanks in advance for any additional help, best, paolo > Best wishes > Wolfgang > > sessionInfo() R version 2.8.0 (2008-10-20) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] tools stats graphics grDevices utils datasets methods [8] base other attached packages: [1] limma_2.16.3 drosophila2cdf_2.3.0 affy_1.20.0 [4] Biobase_2.2.1 loaded via a namespace (and not attached): [1] affyio_1.10.1 preprocessCore_1.4.0 > ------------------------------------------------ > Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber > > > > Paolo Innocenti ha scritto: >> Hi all, >> >> I have dataset of 120 Affy arrays, 60 males and 60 females. >> The expression profiles of the 2 groups differs dramatically, i.e. if >> I run a standard RMA + limma, I have ~90% of the genes differentially >> expressed. Also, downregulated genes are twice as many than >> upregulated genes, although if I impose a cutoff of two-fold >> difference in expression, they are almost equal (15% up and 15% down). >> This is clearly breaking the assumption that most of the genes on the >> array should not be differentially expressed, but the result is in >> line with the current knowledge of sex-biased gene expression in my >> model organism. >> >> I have done some quality control plots, available here: >> - Boxplot: >> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_boxplot.pdf >> >> - Frequency histogram: >> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_histogram.pdf >> >> - RLE and NUSE plots: >> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_RLEandNUSE1.pdf >> >> - CorrelationPlot: >> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_correlationplot.pdf >> >> - PCA, after RMA normalization: >> http://www.iee.uu.se/zooekol/pdf/hemiarray_qc_pca.pdf >> >> Now, my questions are: >> 1) Is my issue really a issue? If so, how can I perform a robust >> normalization of my arrays? >> >> 2) Is there a tool to assess how "robust" your pre-processing method >> is in respect to this issue? >> >> 3) Sex-biased gene expression is not the only biological question in >> my experiment. Is the massive size of this effect going to affect the >> "detectability" of other smaller effects? (through normalization or >> correction for multiple testing or other?) >> >> Thanks, >> paolo >> >> > > > > -- Paolo Innocenti Department of Animal Ecology, EBC Uppsala University Norbyv?gen 18D 75236 Uppsala, Sweden

ADD REPLY • link 15.1 years ago Paolo Innocenti ▴ 320

Login before adding your answer.