Is a subset of my arrays from degraded RNA?

Peter Davidsen ▴ 210

@peter-davidsen-4584

Last seen 8.9 years ago

Dear List,

Although I do realise that my question has more to do with actual data interpretation than with coding using BioC packages, I'm hoping for some input from other users with experience in microarray data analysis.

In order to support my explanation below, I have made a PDF with diagnostic plots. I will refer to specific slides as I go along. The presentation can be downloaded here: https://db.tt/jBqPNxIN

At the moment I'm analysing some microarray data as part of a collaboration. Unfortunately, I have very little knowledge about the actual generation/processing of these samples, which could have helped address my question.

By doing a boxplot on the raw Affymetrix chip data (from the U133plus2 platform), I noticed 2 'batches' based on differences in signal intensities. Hierarchical clustering using all probesets on the array supports this division (Pages 1 and 2). Notably, this separation into batches (i.e. a high and a low intensity batch) can partially be traced back to the ScanDate of the arrays. That is, the ~100 samples were scanned over three consecutive days; all samples scanned on the first day belong to the high intensity batch, whereas all samples scanned on day 3 belong to the low intensity batch. The samples scanned on day 2 are split roughly in half between the high and the low intensity batch.

When I do an RLE plot (Page 3 - top), the median value for most of the samples from the low intensity batch is between 0.1 and 0.2 (and not zero as expected). Further, whereas ~40% of the probesets are called "present" in the high intensity batch using the simpleaffy package, only ~30-35% are called present in the low intensity batch (Page 3 - bottom).

Now, when I do boxplots specifically for the AFFX control probesets, I find that the intensity is in fact higher in all low intensity samples (Page 4). Furthermore, when I focus on the Affy hybridization controls (i.e. bioB, bioC, bioD and creX), the line plot looks good and the signal intensity is comparable between samples in the two batches (Page 5, left side). If I instead plot the poly-A controls, I again see a significant difference in intensity between batches (with low-intensity samples having a higher signal). In addition, the signal values consistently follow the order Phe<Lys<Thr<Dap (Page 5, right side). NB: I'm a bit unsure as to the importance of the latter observation.

The QC plots presented above suggest to me that the RNA from the low intensity samples could potentially suffer from an RNA degradation issue. However, neither the 3'/5' ratios for beta-actin and GAPDH nor the RNA degradation plots using affyPLM support my assumption regarding degraded RNA (Pages 6 and 7). In fact, the ratios for GAPDH indicate a higher signal intensity at the 5-prime end, which I find a bit odd. However, when I instead take advantage of the recent AffyRNADegradation package, I do get a small yet significant difference between batches in terms of the computed decay value (aka parameter d) (Page 8).

I then tried to normalize the two batches of samples independently (using RMA). This allowed me to compare the mean signal intensity for each probeset across the chip, as the biological samples are indeed comparable between batches. A scatterplot (Page 9) clearly demonstrates that many probesets lie close to the diagonal line despite the overall difference in intensity described on the first page. Further, by correlating the expression of specific probesets with an established physiological variable, it is apparent that the slight drop in signal intensity does not affect the strong association with the physiological variable (Page 10). If I instead focus on another representative probeset, one that is farther from the diagonal line, the correlation with the same physiological readout is clearly weaker in the low intensity batch (Page 11).

Could it be that only a smaller subset of the transcripts is significantly affected by RNA degradation in the low intensity samples? And how could I potentially demonstrate this?

In relation to this question: when I do MA plots against a "pseudo reference chip" representing the probeset-wise medians across all ~100 RMA normalized samples, it also becomes apparent that a fraction of the probesets for most of the low intensity samples lie far below the M=0 line (see Page 12 for a representative example). However, to my surprise, only a very small fraction of probesets are consistently below M=-1.5. In other words, different low-intensity samples have different "outlying" probesets compared to the overall median.

To summarize, I have now put forward various QC plots that show that the low intensity samples are overall different. As I'm unsure which way forward is best (NB: my aim is to do a standard DEG analysis), I would appreciate any thoughts or comments from members of this list.

Kind regards,
Peter
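The pseudo-reference comparison described in the question can be illustrated with a small sketch. It is written in plain Python rather than with the BioC packages named in the post, and everything in it is an assumption made for illustration: the function names, the toy probeset IDs and the toy log2 values are hypothetical, and the per-sample median of M is only an RLE-style summary, not the exact RLE computed by affyPLM.

```python
from statistics import median

def pseudo_reference(expr):
    """Probeset-wise median across all samples (the 'pseudo reference chip').

    expr maps a probeset ID to a list of log2 expression values,
    one value per sample, in the same sample order for every probeset.
    """
    return {probe: median(values) for probe, values in expr.items()}

def m_values(expr, sample_index):
    """M = log2 value of one sample minus the pseudo-reference, per probeset."""
    ref = pseudo_reference(expr)
    return {probe: values[sample_index] - ref[probe]
            for probe, values in expr.items()}

def rle_median(expr, sample_index):
    """Median of M over all probesets for one sample (RLE-style summary)."""
    return median(m_values(expr, sample_index).values())

def consistently_low(expr, sample_indices, threshold=-1.5):
    """Probesets whose M falls below the threshold in every listed sample."""
    ref = pseudo_reference(expr)
    return sorted(
        probe for probe, values in expr.items()
        if all(values[i] - ref[probe] < threshold for i in sample_indices)
    )

# Hypothetical toy data: 3 probesets x 6 samples; samples 4 and 5 play the
# role of the low intensity batch.
expr = {
    "probe_a": [8.0, 8.0, 8.0, 8.0, 8.0, 8.0],
    "probe_b": [10.0, 10.0, 10.0, 10.0, 7.0, 7.5],
    "probe_c": [6.0, 6.0, 6.0, 6.0, 6.0, 4.0],
}

print(consistently_low(expr, [4, 5]))  # ['probe_b']
print(rle_median(expr, 5))             # -2.0
```

In the toy data only `probe_b` is below M=-1.5 in both low intensity samples, while `probe_c` is low in just one of them. That is the distinction the question draws between probesets that are consistently low and probesets that are outlying in individual arrays only.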

Microarray GO Clustering affy affyPLM simpleaffy • 1.4k views

10.0 years ago Peter Davidsen ▴ 210

Hi Peter,

I don't think this is an issue of partially degraded samples. If that were the case I wouldn't expect the AFFX control probes to be so much brighter in the 'low intensity' set of samples.

The bioB, bioC, etc. probes are hybridization controls, and are added to the hyb cocktail after all the processing, but before hybing the cDNA to the array. The Lys, Phe, Thr, etc. probes are poly-A controls, and are added prior to the IVT step (right after mRNA purification, but prior to reverse transcribing to cDNA). Since the hyb controls all look similar between batches, but the poly-A controls do not, it implies to me that something screwy happened during processing for the low intensity batch.

In my experience, differences like this are unfixable (one of the best predictors of success, IMO, is to use plotDensity() to compare arrays; if they aren't all similarly shaped and relatively close to each other, it usually spells doom).

There are packages intended to account for this sort of thing, however. You might look at the fRMA or SCAN.UPC packages, which are intended to allow you to process arrays from different batches together. You might also look into the sva package, which is intended to detect and remove batch effects.

This is of course dependent upon the wet lab having done a reasonable job of randomizing the samples into different groups. If you have most or all of one sample type in the low intensity sample group, then, well, ugh.

Best,

Jim

On 7/24/2014 11:01 AM, Peter Davidsen wrote:
> Dear List,
>
> Although I do realise that my question has more to do with actual data
> interpretation than with coding using BioC packages, I'm hoping for some
> input from other users with experience in microarray data analysis.
> [...]

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
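The randomization caveat in the reply can be checked before any batch correction is attempted. The following is a minimal sketch in plain Python, not part of sva or any other package mentioned above; the function names and the batch and group labels are hypothetical. It cross-tabulates intensity batch against sample type and lists the sample types that occur in only one batch, which is the fully confounded case the reply warns about.

```python
from collections import Counter

def batch_by_group(batches, groups):
    """Count samples for each (batch, group) pair."""
    return Counter(zip(batches, groups))

def groups_confined_to_one_batch(batches, groups):
    """Return the sample groups whose samples all fall in a single batch."""
    seen = {}
    for batch, group in zip(batches, groups):
        seen.setdefault(group, set()).add(batch)
    return sorted(group for group, found in seen.items() if len(found) == 1)

# Hypothetical labels, one entry per array.
batches = ["high", "high", "low", "low", "low"]
balanced = ["control", "treated", "control", "treated", "treated"]
confounded = ["control", "control", "treated", "treated", "treated"]

print(groups_confined_to_one_batch(batches, balanced))    # []
print(groups_confined_to_one_batch(batches, confounded))  # ['control', 'treated']
```

An empty result only means that every sample type appears in both batches. A heavily unbalanced table from `batch_by_group` can still make batch and biology hard to separate.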

10.0 years ago James W. MacDonald 66k
