CPM after RUVseq
1
0
Entering edit mode
Jess • 0
@8c50de0e
Last seen 21 days ago
United States

Hello! I have a pipeline for targeted RNASeq data that includes edgeR for initial normalization followed by RUVSeq (RUVr). Output of RUVr is expected: df$W and df$NormalizedCounts. However, the normalized counts are in logCPM and I would like to back log them for data exploration purposes. This level of stats is a bit over my head (I'm a PhD physiology student) so I am struggling to understand what direction to take my code to back log them. If I simply take the normalized counts and apply 10^(x+1) I get counts that seem unreasonably high. This makes sense to me given that "The normalized counts are indeed simply the residuals from ordinary least squares regression of logY on W (with the offset if needed)". However, I don't understand how to apply that equation with the data in R to work back to the CPM. Can someone point me to some general R code or help with the concept of what to do to get back to CPM rather than logCPM?

Thank you! JK

RUVSeq CPM Normalization • 259 views
1
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

The normalizedCounts aren't logCPM. They are, as advertised, normalizedCounts.

> example("RUVr")
<snip>
Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000078095  943  665  696  151   241   310
ENSDARG00000078169   97  183  151  132   156   126
ENSDARG00000090312    0    0    0    0     0     0
ENSDARG00000088587    1   12    2    2     5     3
ENSDARG00000092475   12   32    0    4     1     1
ENSDARG00000015065  363  268 6300  288   324   285
> head(y$counts) <------------------------- That's the input count matrix Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13 ENSDARG00000078095 618 1490 796 168 35 851 ENSDARG00000078169 113 258 153 105 125 89 ENSDARG00000090312 0 0 0 0 0 0 ENSDARG00000088587 0 31 2 3 0 14 ENSDARG00000092475 9 62 0 4 0 2 ENSDARG00000015065 204 678 7440 348 30 1118  Both the input and output are counts. As further evidence, > summary(normCounts(seqRUVr)) Ctl1 Ctl3 Ctl5 Trt9 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 Median : 20.5 Median : 24.0 Median : 14.0 Median : 13.5 Mean : 1994.3 Mean : 1072.1 Mean : 1957.9 Mean : 4702.5 3rd Qu.: 190.5 3rd Qu.: 204.8 3rd Qu.: 205.2 3rd Qu.: 201.0 Max. :421074.0 Max. :195040.0 Max. :527618.0 Max. :1314806.0 Trt11 Trt13 Min. : 0 Min. : 0 1st Qu.: 0 1st Qu.: 0 Median : 14 Median : 13 Mean : 3526 Mean : 4293 3rd Qu.: 192 3rd Qu.: 197 Max. :1075880 Max. :1314591  If they were logCPM, they wouldn't exceed maybe 17 - 18, and there would be negative values, like this: > summary(cpm(normCounts(seqRUVr), log = TRUE)) Ctl1 Ctl3 Ctl5 Trt9 Min. :-0.6751 Min. :-0.6751 Min. :-0.6751 Min. :-0.6751 1st Qu.:-0.6751 1st Qu.:-0.6751 1st Qu.:-0.6751 1st Qu.:-0.6751 Median : 3.3273 Median : 4.4010 Median : 2.8429 Median : 1.7021 Mean : 3.5232 Mean : 4.2210 Mean : 3.4572 Mean : 2.6139 3rd Qu.: 6.4611 3rd Qu.: 7.4554 3rd Qu.: 6.5944 3rd Qu.: 5.3135 Max. :17.5609 Max. :17.3460 Max. :17.9129 Max. :17.9660 Trt11 Trt13 Min. :-0.6751 Min. :-0.6751 1st Qu.:-0.6751 1st Qu.:-0.6751 Median : 2.0916 Median : 1.7653 Mean : 2.8922 Mean : 2.7434 3rd Qu.: 5.6580 3rd Qu.: 5.4144 Max. :18.0921 Max. :18.0972  0 Entering edit mode Thank you James! It hadn't occurred to me to input CPM rather than logCPM which is why my df$NormalizedCounts didn't exceed 17ish in value and have negative values.

Having utilized RUVr with both CPM and logCPM, inputting the logCPM gives us output that is more reflective of what we'd expect biologically. If I used edgeR to normalize initially (which used log2 to make my logCPM input), can I 2^(df\$NormalizedCounts) to back convert or is this problematic because edgeR utilizes library size?

0
Entering edit mode

You shouldn't use CPM or logCPM for any of the steps! To use RUVr you need to fit a preliminary model using edgeR, which takes counts as input, and then you give RUVr counts as well. This is pretty clearly spelled out in the help page for RUVr. For example

Usage:

RUVr(x, cIdx, k, residuals, center=TRUE, round=TRUE, epsilon=1, tolerance=1e-8, isLog=FALSE)

Arguments:

x: Either a genes-by-samples numeric matrix or a
SeqExpressionSet object containing the read counts.


Nothing about CPM or logCPM there, just counts. And below that it says

residuals: A genes-by-samples matrix of residuals obtained from a
first-pass regression of the counts on the covariates of
interest, usually the negative binomial deviance residuals
obtained from 'edgeR' with the 'residuals' method.


Again, no CPM or logCPM here. Just counts. And the example shows how to do it, using some data from the zebrafishRNASeq package that you can use as an example for your data.

0
Entering edit mode

Also, you might be confused about something. CPM is counts/million counts, which uses library size to scale normalize and logCPM is just the logs of the CPM values. And edgeR doesn't normalize anything, really, unless you are talking about CPM or logCPM values. Usually you use TMM to estimate the library size, and that is used as an offset in the GLM rather than explicitly normalizing the data first.

0
Entering edit mode

Ah I think I know where the communication breakdown is occurring. When I think of RUVr, I'm including fitting the preliminary model in edgeR, not just the RUVr line; I've been following along with the help page and tried the zebrafishRNASeq package initially.

When I switch to my data I've been also using my predecessor's code: she used TMM on raw data to estimate library size df <-calcNormFactors(df), obtained the matrix of logCPM counts <- cpm(df, log=T) and then used that count matrix to fit the design as described in the help page line for line (except higher k).

design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
res <- residuals(fit, type="deviance")

set4 <- RUVr(set, genes, k=3, res)
pData(set4)


Since she used a logCPM count matrix to fit a model prior to RUVr(), the normalized counts came back logCPM in appearance.

I’m guessing this pipeline is inappropriate since it essentially renormalized counts that were already normalized.

1
Entering edit mode

I am unfamiliar with a counts function (your counts(set)), so it's not clear what exactly you are putting into that DGEList. But if it's logCPM values, then for sure it's wrong.

0
Entering edit mode

Thank you, it was!