CPM after RUVseq
Entering edit mode
Jess • 0
Last seen 28 days ago
United States

Hello! I have a pipeline for targeted RNASeq data that includes edgeR for initial normalization followed by RUVSeq (RUVr). Output of RUVr is expected: df$W and df$NormalizedCounts. However, the normalized counts are in logCPM and I would like to back log them for data exploration purposes. This level of stats is a bit over my head (I'm a PhD physiology student) so I am struggling to understand what direction to take my code to back log them. If I simply take the normalized counts and apply 10^(x+1) I get counts that seem unreasonably high. This makes sense to me given that "The normalized counts are indeed simply the residuals from ordinary least squares regression of logY on W (with the offset if needed)". However, I don't understand how to apply that equation with the data in R to work back to the CPM. Can someone point me to some general R code or help with the concept of what to do to get back to CPM rather than logCPM?

Thank you! JK

RUVSeq CPM Normalization • 276 views
Entering edit mode
Last seen 56 minutes ago
United States

The normalizedCounts aren't logCPM. They are, as advertised, normalizedCounts.

> example("RUVr")
RUVr> head(normCounts(seqRUVr))
                   Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000078095  943  665  696  151   241   310
ENSDARG00000078169   97  183  151  132   156   126
ENSDARG00000090312    0    0    0    0     0     0
ENSDARG00000088587    1   12    2    2     5     3
ENSDARG00000092475   12   32    0    4     1     1
ENSDARG00000015065  363  268 6300  288   324   285
> head(y$counts)                                 <------------------------- That's the input count matrix
                   Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000078095  618 1490  796  168    35   851
ENSDARG00000078169  113  258  153  105   125    89
ENSDARG00000090312    0    0    0    0     0     0
ENSDARG00000088587    0   31    2    3     0    14
ENSDARG00000092475    9   62    0    4     0     2
ENSDARG00000015065  204  678 7440  348    30  1118

Both the input and output are counts. As further evidence,

> summary(normCounts(seqRUVr))
      Ctl1               Ctl3               Ctl5               Trt9          
 Min.   :     0.0   Min.   :     0.0   Min.   :     0.0   Min.   :      0.0  
 1st Qu.:     0.0   1st Qu.:     0.0   1st Qu.:     0.0   1st Qu.:      0.0  
 Median :    20.5   Median :    24.0   Median :    14.0   Median :     13.5  
 Mean   :  1994.3   Mean   :  1072.1   Mean   :  1957.9   Mean   :   4702.5  
 3rd Qu.:   190.5   3rd Qu.:   204.8   3rd Qu.:   205.2   3rd Qu.:    201.0  
 Max.   :421074.0   Max.   :195040.0   Max.   :527618.0   Max.   :1314806.0  
     Trt11             Trt13        
 Min.   :      0   Min.   :      0  
 1st Qu.:      0   1st Qu.:      0  
 Median :     14   Median :     13  
 Mean   :   3526   Mean   :   4293  
 3rd Qu.:    192   3rd Qu.:    197  
 Max.   :1075880   Max.   :1314591

If they were logCPM, they wouldn't exceed maybe 17 - 18, and there would be negative values, like this:

> summary(cpm(normCounts(seqRUVr), log = TRUE))
      Ctl1              Ctl3              Ctl5              Trt9        
 Min.   :-0.6751   Min.   :-0.6751   Min.   :-0.6751   Min.   :-0.6751  
 1st Qu.:-0.6751   1st Qu.:-0.6751   1st Qu.:-0.6751   1st Qu.:-0.6751  
 Median : 3.3273   Median : 4.4010   Median : 2.8429   Median : 1.7021  
 Mean   : 3.5232   Mean   : 4.2210   Mean   : 3.4572   Mean   : 2.6139  
 3rd Qu.: 6.4611   3rd Qu.: 7.4554   3rd Qu.: 6.5944   3rd Qu.: 5.3135  
 Max.   :17.5609   Max.   :17.3460   Max.   :17.9129   Max.   :17.9660  
     Trt11             Trt13        
 Min.   :-0.6751   Min.   :-0.6751  
 1st Qu.:-0.6751   1st Qu.:-0.6751  
 Median : 2.0916   Median : 1.7653  
 Mean   : 2.8922   Mean   : 2.7434  
 3rd Qu.: 5.6580   3rd Qu.: 5.4144  
 Max.   :18.0921   Max.   :18.0972
Entering edit mode

Thank you James!

It hadn't occurred to me to input CPM rather than logCPM which is why my df$NormalizedCounts didn't exceed 17ish in value and have negative values.

Having utilized RUVr with both CPM and logCPM, inputting the logCPM gives us output that is more reflective of what we'd expect biologically. If I used edgeR to normalize initially (which used log2 to make my logCPM input), can I 2^(df$NormalizedCounts) to back convert or is this problematic because edgeR utilizes library size?

Entering edit mode

You shouldn't use CPM or logCPM for any of the steps! To use RUVr you need to fit a preliminary model using edgeR, which takes counts as input, and then you give RUVr counts as well. This is pretty clearly spelled out in the help page for RUVr. For example


     RUVr(x, cIdx, k, residuals, center=TRUE, round=TRUE, epsilon=1, tolerance=1e-8, isLog=FALSE)


       x: Either a genes-by-samples numeric matrix or a
          SeqExpressionSet object containing the read counts.

Nothing about CPM or logCPM there, just counts. And below that it says

residuals: A genes-by-samples matrix of residuals obtained from a
          first-pass regression of the counts on the covariates of
          interest, usually the negative binomial deviance residuals
          obtained from 'edgeR' with the 'residuals' method.

Again, no CPM or logCPM here. Just counts. And the example shows how to do it, using some data from the zebrafishRNASeq package that you can use as an example for your data.

Entering edit mode

Also, you might be confused about something. CPM is counts/million counts, which uses library size to scale normalize and logCPM is just the logs of the CPM values. And edgeR doesn't normalize anything, really, unless you are talking about CPM or logCPM values. Usually you use TMM to estimate the library size, and that is used as an offset in the GLM rather than explicitly normalizing the data first.

Entering edit mode

Ah I think I know where the communication breakdown is occurring. When I think of RUVr, I'm including fitting the preliminary model in edgeR, not just the RUVr line; I've been following along with the help page and tried the zebrafishRNASeq package initially.

When I switch to my data I've been also using my predecessor's code: she used TMM on raw data to estimate library size df <-calcNormFactors(df), obtained the matrix of logCPM counts <- cpm(df, log=T) and then used that count matrix to fit the design as described in the help page line for line (except higher k).

design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
res <- residuals(fit, type="deviance")

set4 <- RUVr(set, genes, k=3, res)

Since she used a logCPM count matrix to fit a model prior to RUVr(), the normalized counts came back logCPM in appearance.

I’m guessing this pipeline is inappropriate since it essentially renormalized counts that were already normalized.

Entering edit mode

I am unfamiliar with a counts function (your counts(set)), so it's not clear what exactly you are putting into that DGEList. But if it's logCPM values, then for sure it's wrong.

Entering edit mode

Thank you, it was!


Login before adding your answer.

Traffic: 323 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6