Search
Question: Comparison between limma and edgeR of reads ratio
0
4 weeks ago by
catpaw.tw0 wrote:

Hi All,

Thanks in advance for the help! I am still new to edgeR and welcome for any suggestions!

I am trying to figure out which method is better off to analyze my data. I am interesting in translational change before vs. after stress in yeast(s. cerevisiae). I carried out polysome profiles before and after stress, then RNA-seq of collected fractions of whole polysome profiles (total 9 fractions per polysome profile). During RNA sample extraction, I spike-in equal amount of RNA (from S.pombe) for later normalization. I used the normalized counts (based on spike-in counts), summed up the reads of polysome fractions(poly) and reads of the whole polysome profile(total), and tried to use the ratio (poly/total) to access the translation change. I have 2 biological replicates and 2 conditions, before and after stress. I have used the ratio number in limma and got reasonable results. Meanwhile, I also get suggestions to use edgeR with summed poly-counts as input and total counts as offsets. But, when I run the edgeR, it crushed after a certain code. Here is my code for running edgeR:

> x = DGEList(counts = inp) # inp contains normalized reads of summed polysome fractions
> x\$offest = off # off is the sum of total reads, have the same dimension as inp
> x = estimateCommonDisp(x)

>x = estimateGLMTrendedDisp(x, design)

The R is crushed on this step.

Any suggestions and thoughts with my codes? I am also wondering what people think about using limma vs. edgeR for this analysis.

Thanks,

Elisha

modified 29 days ago by Gordon Smyth33k • written 4 weeks ago by catpaw.tw0
1. I assume you mean that "R crashed", not that you were crushed.
2. You will have to give some more detail on your experimental design. What are the biological conditions? How many replicates are present per condition?
3. You will also have to give more details on the nature of polysome-seq, this is not a well-known protocol. I would guess that you're sequencing RNA before and after polysome enrichment, hence the "poly-counts" and "total counts".
4. When you say that "these counts have already been normalized with Spike-in", I don't know what that means here. edgeR has no concept of "normalized counts", you must supply the raw counts.

Hi Aaron,

Thanks for the comments, I have edited the original post. Let me know if anything is unclear. And yes, I understand that edgeR only takes raw counts, but under my case, I am worried that using raw counts will cause biases due to tech variations among fractions reads. For example, one fraction of polysomes has raw read at 5 and the other one is 50, but after normalization, they are actually 3 vs. 4. If I sum up the raw counts, the one have higher counts will contribute more than the others, and I will get mislead by one particular fraction. Let me know if you think I shouldn't use edgeR or the normalized counts and welcome any thoughts and suggestions. Thanks!

From reading "summed up the reads of polysome fractions(poly) and reads of the whole polysome profile(total)", I still don't understand the difference between the poly and total reads. What are the fractions? Why do you need to collect 9 of them? If the 9 fractions are different, why do you sum them together? What's the difference between the summed fractions and the whole polysome profile?

Polysome profile is a technique used to study the mRNAs associated with ribosomes. The polysome fraction indicates the amount of mRNAs that is associated with ribosomes, and the total (whole polysome profiles) represents the total mRNAs amount (no matter they are free or engaged with ribosomes). It's common to use the ratio as a proximate estimate of a certain transcript's translation status in molecular biology. We already did a spline test, and I'd like to use the sum up reads as an alternative way to analyze my data.

It seems like something is wrong with the offset setting, are you familiar with offset?

I was one of the authors of the scaleOffset function, so yes, I would say I am familiar with offsets.

1
4 weeks ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

To summarize your experimental design and data:

• You have 2 conditions, before and after stress.
• Each condition has two biological replicates.
• Each replicate has two counts - ribosome-bound and total - for each gene.

I would not use the total counts to compute offsets here, as edgeR treats the supplied offsets as known values. This means that variability in the total counts is not properly modelled. Instead, I would do something like this:

condition <- rep(c("before", "after"), each=4)
replicate <- gl(4, 2)
count <- rep(c("ribo", "total"), 4)
design <- model.matrix(~0 + replicate + count:condition)
design <- design[,!grepl("counttotal", colnames(design))] # get to full rank

The first four columns of design represent the log-total expression in each replicate, and are not useful. The second last column (countribo:conditionafter) represents the log-fold change of the ribosomal count over the total count in the after-stress condition, while the last column (countribo:conditionbefore) represents the log-fold change of the ribosomal count over the total count in the before stress condition. It is then simple to test for changes in ribosomal/total ratios between conditions:

colnames(design) <- make.names(colnames(design)) # syntactically valid.
con <- makeContrasts(countribo.conditionafter - countribo.conditionbefore, levels=design)

Edit: The advice above assumes that the total and ribosome-bound counts came from two separate experiments, much like an IP and input control in ChIP-seq experiments. However, if the total count was obtained by adding the ribosome-bound count to a non-ribosome-bound count, then the total and ribosome-bound counts would be correlated. This would not be good as it would invalidate independence assumptions in the model. Instead, you should model the non-ribosome-bound count directly:

count <- rep(c("ribo", "nonribo"), 4)

You may also be interested in the modelMatrixMeth function, which provides a convenient way of setting up the design matrix above. It was initially designed for methylation experiments (methylated vs. unmethylated counts) but is equally applicable here.

Hi Aaron,

Thanks so much for the suggestion, I will work on it. Just one more question, when you said "log-fold change", are you saying to use log-counts as input in edgeR? And, yes, you are correct, the total is the sum of ribosome bound plus the non-bound. The poly and total counts are correlated.

Use raw counts as input to edgeR. A log-link GLM is used, so the coefficients are automatically on the log-scale.

0
29 days ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

Following up on Aaron's answer, you can analyse your bound vs non-bound counts in edgeR as one would analyse methylated counts, see:

http://www.statsci.org/smyth/pubs/edgeRMethylationPreprintJan2018.pdf

Your data is exactly analogous, with your bound counts corresponding to methylated counts and non-bound counts corresponding to unmethylated counts.