Question

Normalizing counts for cross-species correlation test

0

Entering edit mode

fl ▴ 20

@fl-16173

Last seen 4 months ago

Germany

Hello, I am interested in correlating counts between two insect species. For each species, I imported kallisto results and summarized to the gene-level using tximport, created a DESeqDataSet object, and normalized the counts using the "median ratio method":

counts_sp1 <- DESeq2::counts(dds_sp1, normalized = TRUE)
counts_sp2 <- DESeq2::counts(dds_sp2, normalized = TRUE)

Then, I used biomaRt to retrieve orthology information and created a count matrix including the two species; I replaced the original gene identifiers of each species with unique ortholog identifiers. For example:

| ortho_id | sp1_rep1 | sp1_rep2 | sp1_rep3 | sp1_rep4 | sp2_rep1 | sp2_rep2 | sp2_rep3 | sp2_rep4 |
|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| ortho01  | 1130.25  | 1313.85  | 1662.45  | 1921.63  | 1135.85  | 1211.97  | 927.94   | 1258.67  |
| ortho02  | 680.00   | 1276.19  | 921.22   | 1182.74  | 1357.92  | 1124.72  | 1445.95  | 1072.90  |
| ortho03  | 797.55   | 639.31   | 712.38   | 712.36   | 748.45   | 795.46   | 667.38   | 890.66   |
| ortho04  | 1187.66  | 1522.17  | 1095.07  | 1201.00  | 1337.27  | 1392.64  | 1429.92  | 1404.23  |
| ortho05  | 3472.52  | 3878.38  | 3335.18  | 3553.46  | 806.13   | 796.10   | 969.89   | 858.79   |

Lastly, I used the count matrix including both species to compute size factors again and then divided the count values of each sample by its corresponding size factor:

ortho_size_factors <- DESeq2::estimateSizeFactorsForMatrix(ortho_matrix)
norm_ortho_matrix <- sweep(ortho_matrix, 2, ortho_size_factors, FUN = "/")

Here is the result:

enter image description here

Is there something fundamentally wrong with the last normalization step? I am planning to use the normalized counts of a subset of genes to calculate Spearman correlation coefficients between species. Any help would be greatly appreciated.

deseq2 • 698 views

ADD COMMENT • link updated 4.3 years ago by Michael Love 43k • written 4.3 years ago by fl ▴ 20

score 0 · Answer 1 · 2020-08-13

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

"created a DESeqDataSet object"

How? Using DESeqDataSetFromTximport?

And what is in ortho_matrix?

ADD COMMENT • link 4.3 years ago Michael Love 43k

0

Entering edit mode

Yes, I used DESeqDataSetFromTximport:

txi_sp1 <- tximport::tximport(files_sp1,
  type = "kallisto",
  tx2gene = tx2gene_sp1,
  txOut = FALSE
)

dds_sp1 <- DESeq2::DESeqDataSetFromTximport(txi_sp1,
  colData = coldata_sp1,
  design = ~1
)

dds_sp1 <- DESeq2::estimateSizeFactors(dds_sp1)
keep <- rowSums(DESeq2::counts(dds_sp1) >= 10) >= 4
dds_sp1 <- dds_sp1[keep, ]

counts_sp1 <- DESeq2::counts(dds_sp1, normalized = TRUE)

ortho_matrix only includes the normalized counts of both species. I used dplyr functions to bind the two data frames of normalized counts (e.g., dplyr::inner_join(ortho_counts_sp1, ortho_counts_sp2, by = "ortho_id").

Thank you for your help.

ADD REPLY • link 4.3 years ago fl ▴ 20

1

Entering edit mode

Ok this seems fine then, at least, what you have in the end is counts, scaled by effective gene length, and then scaled using median ratio method to remove global differences due to sequencing depth. I think that you will get the same output perhaps with just putting counts() into ortho_matrix, not normalized counts, because you run size factor estimation again.

ADD REPLY • link 4.3 years ago Michael Love 43k