RNA degradation plots don't make any sense for Affy's random primer arrays.
Back in the good 'ol days (you know, the 2000's), Affy array processing was based on oligo-dT primers that were used for converting mRNA to cDNA, starting at the poly-A tail. Because of this, the probes for each gene were clustered near the 3' end of the gene, to ensure that you would be able to get reasonable measurements. In other words, there is no guarantee that the RT will proceed to the end of the mRNA (and the mRNA may have been partially degraded, starting from the 5' end), so you want to put the probes right close to where you start the RT to ensure you have enough cDNA to get a good measurement.
Since the Exon ST arrays, Affy has changed to using random primers for generating the cDNA, so there is no longer a 3' bias to the resulting labeled cDNA, and no expectation that you would be able to detect any mRNA degradation. You could hypothetically figure out where each of the probes 'belong' in each transcript and come up with your own version of the plot, but it wasn't particularly useful for the 3' biased arrays and I would imagine will be even less so for the random primer arrays.
That's awesome! They don't like the RNA Integrity Number? That, like, everybody else uses? That's being a true contrarian.
Anyway, you could do some sort of bootleg approximation, with emphasis on the bootleg part of that statement. First, do note that the Gene ST arrays have (in general) four probes per probe set region (PSR), which is the 'probeset' summarization level, and they have any number of PSRs that they summarize to make an estimate for a transcript. So this isn't going to be as easy as it was for the 3' biased arrays. First, let's extract our data. I will be using Rat Gene 2.0 ST arrays, but it's all the same. First let's get set up.
> library(oligo)
> dat <- read.celfiles(list.celfiles())
> con <- db(pd.ragene.2.0.st)
Now let's get all the probe IDs (fids) per transcript-level probeset:
So now we have, for each probeset, in PSR order, the fid values. In the example above we have four fids that all go into one fsetid (PSR), that are then combined into a meta_fsetid. The order of the PSRs is correct, but we aren't sure that the fids are in genomic order. And I don't know how to ensure that they are, so let's ignore that, shall we?
We now just want to extract the probe expression values for a set of these probesets, in order (which we already have), and then plot. So let's just use a subset,
So this list has the rows for the probes for each probeset, in genomic-ish order, so we want to extract and compute a mean expression level for each probeset-position. Put another way, we want the mean expression for all the probes that come first in a probeset, by sample, the mean for the second set of probes, etc. So we end up with a mean expression value for each relative position over all this subset of probesets.
Not only is that a mouthful, this code is going to be a whopper. First get the expression values from our GeneFeatureSet, then we'll extract the probes in order, and compute the mean values. Note that I have 36 samples in my GeneFeatureSet, so I iterate over each sample here.
FWIW: after removing the NAs from the transcript-level probeset table (z), this code is also working nicely for mouse Gene ST1.1 arrays (pd.mogene.1.1.st.v1)
Specifically; after the dbGetQuery (code box 2), and before making the GRanges object (box 3), add this line :
z <- z[!is.na(z$start) | !is.na(z$stop),]
Otherwise an error will occur:
> zgr <- GRanges(paste0("chr", z$chrom), IRanges(z$start, z$stop), strand, meta_fsetid = z$meta_fsetid, fsetid = z$fsetid, fid = z$fid)
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
solving row 200731: range cannot be determined from the supplied arguments (too many NAs)
Is the exprs(dat) is that using affy or oligo? and what level was the exprs computed? as i have computed it at the core level according to the oligo package. I am pretty new to this area and have been asked to do this plot for this type of chip for my analysis.
The chips were run using mogene.2.0.st array and pd.mogene.2.0.st annotation
The problem is the number 36; as James indicated this line of code iterates over each of the 36 samples he has in his dataset. If you have a different number of samples you have to change it in that line, i.e. to dim(vals)[2].
The intensity values that are used to generate the 'RNA digestion plot' are the raw, non-normalized expression values. Recall that you read in the raw data using dat <- read.celfiles(list.celfiles()), and subsequently you extract the expression values from the non-normalized GeneFeatureSet using the function exprs() [vals <- exprs(dat)].
Finally, as was also discussed above, you could wonder what the value is of a RNA digestion plot for these arrays since in the labeling procedure random primers are used.
Thanks for your comments really helpful! Yep i was using the RMA normalised data so now it works thank you! I have been told to do this plot even though its not really suitable for this type of chip - i dont think it will tell me anything different that i dont already know from my NUSE, RLE, PCA but there you go! :)
Hi James,
Thank you for your answer.
One of the reviews of our paper insisted on showing (figure) no RNA degradation in our sample despite of RIN > 9.
Would you please suggest a plot which could satisfy him/her.
Thank you very much.
Anita
That's awesome! They don't like the RNA Integrity Number? That, like, everybody else uses? That's being a true contrarian.
Anyway, you could do some sort of bootleg approximation, with emphasis on the bootleg part of that statement. First, do note that the Gene ST arrays have (in general) four probes per probe set region (PSR), which is the 'probeset' summarization level, and they have any number of PSRs that they summarize to make an estimate for a transcript. So this isn't going to be as easy as it was for the 3' biased arrays. First, let's extract our data. I will be using Rat Gene 2.0 ST arrays, but it's all the same. First let's get set up.
Now let's get all the probe IDs (fids) per transcript-level probeset:
Now let's make a GRanges object and sort EDIT:I forgot to make the strand
And split that out by meta_fsetid (transcript-level probeset ID)
So now we have, for each probeset, in PSR order, the fid values. In the example above we have four fids that all go into one fsetid (PSR), that are then combined into a meta_fsetid. The order of the PSRs is correct, but we aren't sure that the fids are in genomic order. And I don't know how to ensure that they are, so let's ignore that, shall we?
We now just want to extract the probe expression values for a set of these probesets, in order (which we already have), and then plot. So let's just use a subset,
Lots of these are pretty short, so let's just use those with >= 20 probes, and also subset longer probesets to just 20 probes.
So this list has the rows for the probes for each probeset, in genomic-ish order, so we want to extract and compute a mean expression level for each probeset-position. Put another way, we want the mean expression for all the probes that come first in a probeset, by sample, the mean for the second set of probes, etc. So we end up with a mean expression value for each relative position over all this subset of probesets.
Not only is that a mouthful, this code is going to be a whopper. First get the expression values from our GeneFeatureSet, then we'll extract the probes in order, and compute the mean values. Note that I have 36 samples in my GeneFeatureSet, so I iterate over each sample here.
Thank you very, very much for this detailed help.
Anita
FWIW: after removing the NAs from the transcript-level probeset table (z), this code is also working nicely for mouse Gene ST1.1 arrays (
pd.mogene.1.1.st.v1
)Specifically; after the
dbGetQuery
(code box 2), and before making the GRanges object (box 3), add this line :Otherwise an error will occur:
Hi,
I am trying to follow this step and i am getting an error throwing up:
Error in vals[x, y] : subscript out of bounds
This is when i run the following line:
Is the exprs(dat) is that using affy or oligo? and what level was the exprs computed? as i have computed it at the core level according to the oligo package. I am pretty new to this area and have been asked to do this plot for this type of chip for my analysis.
The chips were run using mogene.2.0.st array and pd.mogene.2.0.st annotation
Thanks
Danielle
To analyze your data set you likely copied all of James's code above "as is". Then you also copied this line:
The problem is the number 36; as James indicated this line of code iterates over each of the 36 samples he has in his dataset. If you have a different number of samples you have to change it in that line, i.e. to
dim(vals)[2]
.The intensity values that are used to generate the 'RNA digestion plot' are the raw, non-normalized expression values. Recall that you read in the raw data using
dat <- read.celfiles(list.celfiles())
, and subsequently you extract the expression values from the non-normalized GeneFeatureSet using the functionexprs()
[vals <- exprs(dat)
].Finally, as was also discussed above, you could wonder what the value is of a RNA digestion plot for these arrays since in the labeling procedure random primers are used.
Hi Guido,
Thanks for your comments really helpful! Yep i was using the RMA normalised data so now it works thank you! I have been told to do this plot even though its not really suitable for this type of chip - i dont think it will tell me anything different that i dont already know from my NUSE, RLE, PCA but there you go! :)
Thanks for your help,
Danielle