The correlation between normalized counts and TPM based on 2-factor analysis by DESeq2
1
0
Entering edit mode
hequn • 0
@3c7c0ace
Last seen just now
France

Enter the body of text here Hello everyone,

I have some questions about DESeq2 2-factor analysis. One of our ongoing project is about RNA-seq data from patient and control (2 genotypes) with or without stimulation(not stimulation, stimulation 1 and 2, 3 conditions in total), for each sample, we have different number of technical replicates, and we are very interested in interaction between genotype and condition. I tried to carry on my analysis by 2-factor mode using DESeq2 with design ~ condition + genotype + condition:genotype with collapsing the technical replicates firstly. The “results” function was used to extract individual interaction term(element from resultsNames(object)) from object has been called by DESeq. Some top DEGs were selected to do visualization by “plotcounts” function, and we found that normalized counts for some genes are not consistent with their TPM. Then we checked the correlations between log-transformed TPM and normalized counts based on 2-factor, and the correlation for single sample based on expressed genes is good

(figure 1)

while the correlation for the ratio of each group, for example, correlation of log( normalized counts of (patient-stimulation 1/ctrl-stimulation 1) ) and log( TPM (patient-stimulation 1/ctrl-stimulation 1) ), based on 2-factor mode is not good

(figure 2).

Do you how to explain this question? Do I need to take some action to correct the normalization based on two factor?

Thank you in advance.

Code should be placed in three backticks as shown below


# include your problematic code here with any corresponding output
# please also include the results of running the following in an R session

sessionInfo( )

DESeq2 RNASeq • 157 views
1
Entering edit mode

How did you compute TPMs? In what ways were the results vastly different? That TPMs and normalized counts don't fully correlate isn't surprising, there's a reason TPMs aren't used for statistics.

0
Entering edit mode

Thank you so much for you response.

TPM is calculated by the company, I do not know which software they used to generate them. I had used kallisto to produce TPMs before. The difference is that the correlation between the ratio of normalized counts and TPM are very good based on one-factor as shown in figure 3.

0
Entering edit mode

Could you post your code where you make these plots?

0
Entering edit mode

This is the code for figure 1

normalized_counts = as.data.frame(counts(dds, normalized = TRUE))
normalized_counts$ID = rownames(normalized_counts) TPM = read.csv("TPM.csv") rownames(TPM) = TPM$X
TPM = TPM[,-c(1,8,13)]
TPM$ID = rownames(TPM) normlized_counts.TPM = merge(TPM, normalized_counts, by.x = "ID", by.y = "ID") rownames(normlized_counts.TPM) = normlized_counts.TPM$ID
normlized_counts.TPM = normlized_counts.TPM[,-1]


normlized_counts.TPM.log = log2(normlized_counts.TPM + 0.5 )

ggscatter(normlized_counts.TPM.log, x = "TPM", y = "normalizd counts",
add = "reg.line", conf.int = TRUE,
cor.method = "pearson",
title = "Correlation between TPM and normalized counts for C21 sample two factor",
xlab = "log2(TPM + 0.5)", ylab = "log2(normalized counts + 0.5)", xlim = c(0, 20),
add.params = list(color = "blue", fill = "lightgray")) +
stat_cor(method = "pearson", label.x = 3, label.y = 20)


This code for figure 2

  tpm_2.vs.5.two.factor.TPM = merge(res.2.vs.5.two.factor.norm.counts, TPM, by.x = "ID", by.y = "ID")
tpm_2.vs.5.two.factor.TPM[,2:ncol(tpm_2.vs.5.two.factor.TPM)] = apply(tpm_2.vs.5.two.factor.TPM[,2:ncol(tpm_2.vs.5.two.factor.TPM)],
2, as.numeric)
tpm_2.vs.5.two.factor.TPM$log.2.vs.5.two.factor = log2( (((tpm_2.vs.5.two.factor.TPM[,"H180_A_stimulate.collapse"] + tpm_2.vs.5.two.factor.TPM[,"H226_A_stimulate.collapse"] ) )/2)/ (((tpm_2.vs.5.two.factor.TPM[,"C21_A_stimulate.collapse"] + tpm_2.vs.5.two.factor.TPM[,"C64_A_stimulate.collapse"] + tpm_2.vs.5.two.factor.TPM[,"C96_A_stimulate.collapse"]) )/3 + 0.01) + 0.01  ) tpm_2.vs.5.two.factor.TPM$log.2.vs.5.TPM = log2( (((tpm_2.vs.5.two.factor.TPM[,"L.collapse"] + tpm_2.vs.5.two.factor.TPM[,"M.collapse"] ) )/2)/ (((tpm_2.vs.5.two.factor.TPM[,"D.collapse"] + tpm_2.vs.5.two.factor.TPM[,"E.collapse"] + tpm_2.vs.5.two.factor.TPM[,"F.collapse"]) )/3 + 0.01) + 0.01
)

ggscatter(tpm_2.vs.5.two.factor.TPM, x = "normalized.counts", y = "TPM",
add = "reg.line", conf.int = TRUE,
#cor.coef = TRUE,
cor.method = "pearson",
title = "Correlation between log(Patient-A / Ctrl-A) two factor \nVS log(Patient-A / Ctrl-A) TPM",
xlab = "log2(normalized counts(Patient/(Ctrl+0.01) +0.01))", ylab = "log2(TPM(Patient/(Ctrl+0.01) +0.01))",
xlim = c(0, 15), ylim = c(0, 15),
add.params = list(color = "blue", fill = "lightgray")) +
stat_cor(method = "pearson", label.x = 3, label.y = 15)

0
Entering edit mode
@mikelove
Last seen 38 minutes ago
United States

My first guess is that there is a bug in the merging code. I don't see anything in particular, but you've got a lot of code and whereas things are tightly linked in DESeq2 functions, there could have been a slip in the merging or construction of LFCs.

But in general, the LFC using counts take into account precision of the data, so won't be correlated with r=1 if you do something manually with TPMs.

0
Entering edit mode

Thank you very much for your reply. I agree with you, what I do not understand is that the correlation by one-factor mode seems good but not good for two-factor.