Hello, I am interested in correlating counts between two insect species. For each species, I imported kallisto
results and summarized to the gene-level using tximport
, created a DESeqDataSet object, and normalized the counts using the "median ratio method":
counts_sp1 <- DESeq2::counts(dds_sp1, normalized = TRUE)
counts_sp2 <- DESeq2::counts(dds_sp2, normalized = TRUE)
Then, I used biomaRt
to retrieve orthology information and created a count matrix including the two species; I replaced the original gene identifiers of each species with unique ortholog identifiers. For example:
| ortho_id | sp1_rep1 | sp1_rep2 | sp1_rep3 | sp1_rep4 | sp2_rep1 | sp2_rep2 | sp2_rep3 | sp2_rep4 |
|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| ortho01 | 1130.25 | 1313.85 | 1662.45 | 1921.63 | 1135.85 | 1211.97 | 927.94 | 1258.67 |
| ortho02 | 680.00 | 1276.19 | 921.22 | 1182.74 | 1357.92 | 1124.72 | 1445.95 | 1072.90 |
| ortho03 | 797.55 | 639.31 | 712.38 | 712.36 | 748.45 | 795.46 | 667.38 | 890.66 |
| ortho04 | 1187.66 | 1522.17 | 1095.07 | 1201.00 | 1337.27 | 1392.64 | 1429.92 | 1404.23 |
| ortho05 | 3472.52 | 3878.38 | 3335.18 | 3553.46 | 806.13 | 796.10 | 969.89 | 858.79 |
Lastly, I used the count matrix including both species to compute size factors again and then divided the count values of each sample by its corresponding size factor:
ortho_size_factors <- DESeq2::estimateSizeFactorsForMatrix(ortho_matrix)
norm_ortho_matrix <- sweep(ortho_matrix, 2, ortho_size_factors, FUN = "/")
Here is the result:
Is there something fundamentally wrong with the last normalization step? I am planning to use the normalized counts of a subset of genes to calculate Spearman correlation coefficients between species. Any help would be greatly appreciated.
Yes, I used
DESeqDataSetFromTximport
:ortho_matrix
only includes the normalized counts of both species. I useddplyr
functions to bind the two data frames of normalized counts (e.g.,dplyr::inner_join(ortho_counts_sp1, ortho_counts_sp2, by = "ortho_id"
).Thank you for your help.
Ok this seems fine then, at least, what you have in the end is counts, scaled by effective gene length, and then scaled using median ratio method to remove global differences due to sequencing depth. I think that you will get the same output perhaps with just putting counts() into
ortho_matrix
, not normalized counts, because you run size factor estimation again.