Question

tximport question on what counts to use for LogFC calculations

0

Entering edit mode

mailmpunta • 0

@ea293cee

Last seen 3.1 years ago

Italy

Hi,

from what I understand then (note: this is a follow up to a question asked on the tximport github) the cpm counts calculated in the last line of the tximport edgeR vignette, that is:

cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE)

cannot be used directly (I mean as they are) to calculate logFC between expression levels of a gene of interest in two different samples (say I have only two samples, pre and post-treatment). Instead one should further process y using something like:

y <- estimateDisp(y)
et <- exactTest(y)

In other words, neither the cmps calculated as above nor the count values found in y can be used as they are for between samples comparisons, is that correct?

Thank you very much for your help with this matter! Marco

tximport • 1.5k views

ADD COMMENT • link updated 3.1 years ago by Michael Love 43k • written 3.1 years ago by mailmpunta • 0

score 0 · Answer 1 · 2021-11-28

0

Entering edit mode

swbarnes2 ★ 1.4k

@swbarnes2-14086

Last seen 4 days ago

San Diego

CPMs can't be used for proper DE assessments. But you can't do proper DE assessments with 1 vs 1 sample either, even if you had counts properly normalized.

ADD COMMENT • link 3.1 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

Apologies, I should have been more expansive in my first post. I understand that 1 vs 1 cannot give you e.g. reliable p-values for differences but my goal here is simply to create a heatmap of pre vs post treatment logFC across individual patients. My question about cmps is linked to this snippet of code from the tximport vignette:

cts <- txi$counts
normMat <- txi$length

# Obtaining per-observation scaling factors for length, adjusted to avoid
# changing the magnitude of the counts.
normMat <- normMat/exp(rowMeans(log(normMat)))
normCts <- cts/normMat

# Computing effective library sizes from scaled counts, to account for
# composition biases between samples.
library(edgeR)
eff.lib <- calcNormFactors(normCts) * colSums(normCts)

# Combining effective library sizes with the length factors, and calculating
# offsets for a log-link GLM.
normMat <- sweep(normMat, 2, eff.lib, "*")
normMat <- log(normMat)

# Creating a DGEList object for use in edgeR.
y <- DGEList(cts)
y <- scaleOffset(y, normMat)
# filtering
keep <- filterByExpr(y)
## Warning in filterByExpr.DGEList(y): All samples appear to belong to the same
## group.
y <- y[keep, ]
# y is now ready for estimate dispersion functions see edgeR User's Guide
For creating a matrix of CPMs within edgeR, the following code chunk can be used:

cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE)

where the calcNormFactors function is used to calculate the TMM effective library size given that the default 'method' argument in calcNormFactor appears to be "TMM". In fact from what I understand TMM counts can be obtained by simply running:

y <- calcNormFactors(y)
y <- edgeR:cpm(y)

where y is a DGEList object. If what I just wrote is correct (which might not be) then the 'cpms' calculated in the tximport vignette are not the classic CPMs, but rather more akin to TMMs. Also I don't really understand why not using simply the two lines of code above (calcNormFactors+cpm) to get from tximport$counts (run with the option countsFromAbundance="no") to a y object that can be used in downstream analysis.

My impression is that for what I want (that is, a heatmap of pre vs post treatment logFC where columns are patients and rows are genes from a list of genes of interest, I could just run the following:


cts <- txi$counts
colnames(cts) <- colnames(txi$length)
cts_df <- as.data.frame(cts)
y <- DGEList(cts)
y <- calcNormFactors(y, method = "TMM")
TMM_log_counts <- cpm(y, log = "TRUE")

and then take the difference between the log values in the TMM_log_counts matrix.

Does this sound reasonable or am I (very possibly) missing something?

Thanks for your help! Marco

ADD REPLY • link 3.1 years ago mailmpunta • 0

score 0 · Answer 2 · 2021-11-29

"In other words, neither the cmps calculated as above nor the count values"

I think you could use the CPMs and it would give a qualitatively similar answer. I think LFC from edgeR standard pipeline will give a better estimate of LFC for certain designs as it takes into account the precision of the observations for the parameter estimate. But for two group comparisons you may get something similar with CPM.