Hi everyone,
I have some questions regarding the normalizations applied by DESeq2 and some contradictory results I have obtained, I hope you can help me. Our experimental design is as follows: we have libraries of small RNAs from 10 healthy individuals and 20 infected patients (10 with mild symptoms and 10 with severe symptoms), which are sampled at 2 different time points. We want to see how the RNA families vary between the different severity groups and time points and to obtain possible biomarkers. The strategy followed for the analysis is as follows:
- We generate a matrix of counts containing each of the sequences present in the 50 samples (rows) and the absolute no. of counts for each sample (column), using in-house scripts. We also obtain the same matrix with the normalized counts in RPMs.
- To be able to work in R with this matrix, and since we are only interested in potential endogenous biomarkers, we filter the matrix keeping only what maps against human, and eliminating rows with very low counts.
- This matrix is the input we give to
DESEqDataSetFromMatrix()
.
We then perform the various contrasts we are interested in. For example:
acute_vs_mild_agesex <- DESeqDataSetFromMatrix(countData = counts_table_human_sexage,
colData = sampletable_2,
design = ~ Sex + Age_group + Group)
acute_vs_mild_mild_agesex$Group <- relevel(acute_vs_mild_mild_agesex$Group, ref ="T1_mild")
acute_vs_mild_mild_agesex$Age_group <- factor(acute_vs_mild_mild_agesex$Age_group, c(1,2,3,4))
acute_vs_mild_agesex2 <- DESeq(acute_vs_mild_agesex)
res_T1acute_vs_mild_agesex <- results(acute_vs_mild_agesex2, contrast =c("Group", "T1_acute", "T1_mild"), lfcThreshold = 0.585)
shrunk_T1acute_vs_mild_agesex <- lfcShrink(acute_vs_mild_agesex2, coef="Group_T1_acute_vs_T1_mild", res=res_T1acute_vs_mild_agesex)
With this contrast, what I intend to do is to obtain the DE sequences between acute and mild patients (in time 1), controlling for the variables sex and age group. It should be noted that my groups of patients are not balanced in the age group category: one of the groups presents mostly older patients. Here my first doubt arises: is it correct to include the age group variable in the design even though it is not balanced?
But my main doubt appears when reviewing the results of the contrast. DESEq2 returns 1113 sequences as significant DE (padj < 0.05). But in some cases, when I go to the counts tables (both the one from DESeq2 and the ones I get originally) and check the normalized counts of X sequence of interest, I don't see the fold change that DESeq2 tells me. For example:
Result:
seq baseMean log2FoldChange lfcSE stat pvalue padj
seqX 2527.6030 3.348650 0.8625103 3.204194 1.354413e-03 0.0240597523
Normalized DESeq2 counts of the sequence X
Acute T1: 2.476341e+03 1.378931e+03 1657.99238 2.440383e+03 2.010505e+03 1.016864e+03 2.047027e+03 1.581439e+03 1.468407e+03 2006.66562
Mild T1: 9.940807e+02 7.19070704e+02 1.26202034e+03 2.559517e+03 6.664843e+02 9.851480e+02 720.272552 522.275026 2471.92615 1.765345e+04
Counts in the original matrix (normalized in RPMs).
Acute: 605.636 137.624 101.907 136.753 782.274 316.689 287.133 150.494 370.804 59.239
Mild: 322.246 172.691 359.175 783.238 251.813 381.638 35.494 109.578 62.841 666.775
Is this behavior to be expected, the fact that I cannot observe with the naked eye in the normalized counts a differential expression of the sequences? Can it be due to the normalizations and internal transformations that DESeq2 performs on the counts or is it a false positive?
I hope I have explained myself, otherwise I will provide all the extra info needed.
Thank you very much!
Thank you very much for your quick response! Following the steps you suggested in the second answer I could visualize the change noted by DESeq2.