I've been working on a meta-transcriptomic pipeline for a set of incubation experiments (for phytoplankton) from a 2019 research cruise in the California coast, and am at a stage now where I am needing to convert my alignment output (Salmon) into a matrix file (containing raw counts) used for DESeq2. I already have my phylodb and keggannot BLAST outputs and am hoping to merge these with the counts, lengths, and abundances file that was the result of Tximport.
I've already calculated TPM from the raw counts using the count columns (from each sample) and divided them by the avg_gene_length (derived from the length columns). The result gave me a TPM matrix which I then merged with my phylodb and kegg outputs. I did all that on R on the cluster since my files are too big (25G) to use on R itself, but had realized that DESeq2 does not require me to normalize my counts and only requires raw counts. My questions are:
- Would calculating the TPM be considered normalizing in this case? If so, would I have to go back and only merge my raw counts output (directly from tximport) with the phylodb/kegg outputs before feeding my counts into DESeq2?
- Does DESeq2 perform the normalization steps for me? If so, why would I not need to incorporate the length columns or the avg_gene_length column and only the raw counts?
Here is my tximport R script that I submitted as an sbatch job on the cluster
BiocManager::install(‘tximport’) library(DESeq2) library(tidyverse) library(RColorBrewer) library(pheatmap) library(tximport) library(ggplot2) library(ggrepel) samples <- list.files(path = “/pine/scr/j/o/johnsony/PUP_quants”, full.names = T) files <- file.path(samples, “quant.sf”) names(files) <- str_replace(samples, “/pine/scr/j/o/johnsony/PUP_quants”, “”) %>% str_replace(“.salmon”, “”) tx2gene <- read.delim("/pine/scr/j/o/johnsony/annotations/kegg_annotations/mega_kegg.tsv") txi.keggannot.transcriptlevel <- tximport(files, type = “salmon”, tx2gene = tx2gene[,c(“query”, “KO”)], txOut = TRUE) txi.keggannot.transcriptlevel.csv <- write.csv(txi.keggannot.transcriptlevel, “txi.keggannot.transcriptlevel.csv”)
Here is my TPM calculation script (which I'm wondering if I need to perform before DESeq2):
kegg_tl <- read.csv(‘txi.keggannot.transcriptlevel.csv’) rownames(kegg_tl) <- kegg_tl$X colnames(kegg_tl)[colnames(kegg_tl) == ‘X’] <- ‘query’ x <- kegg_tl %>% transmute(kegg_tl, avg_gene_length = rowMeans(select(kegg_tl, starts_with(“length”)), na.rm = TRUE)) rownames(x) <- rownames(kegg_tl) y <- x[36:69] / x$avg_gene_length tpm.mat <- t( t(y) * 1e6 / colSums(y) ) tpm.mat <- data.frame(tpm.mat) tpm.mat$TrinityID <- rownames(tpm.mat) tpm_mat.csv <- write.csv(tpm.mat, “tpm_mat.csv”)
Sorry if these all seem like beginner's questions, but I do hope to have more clarification in regards to DESeq2 before I can properly perform statistics on my samples. Please let me know if you have any advice or suggestions. Thank you!