Tximport/DESeq2 Normalization Process
johnsony

Hi all,

I've been working on a meta-transcriptomic pipeline for a set of phytoplankton incubation experiments from a 2019 research cruise off the California coast, and I am now at the stage of converting my Salmon output into a matrix file (containing raw counts) for DESeq2. I already have my phylodb and keggannot BLAST outputs and am hoping to merge these with the counts, lengths, and abundances file produced by tximport.

I've already calculated TPM from the raw counts by taking the count columns (one per sample), dividing them by avg_gene_length (derived from the length columns), and rescaling each column to one million. This gave me a TPM matrix, which I then merged with my phylodb and kegg outputs. I did all of that in R on the cluster, since my files are too big (25 GB) to work with locally, but I then realized that DESeq2 does not want me to normalize my counts and only takes raw counts. My questions are:

  1. Would calculating the TPM be considered normalizing in this case? If so, would I have to go back and only merge my raw counts output (directly from tximport) with the phylodb/kegg outputs before feeding my counts into DESeq2?
  2. Does DESeq2 perform the normalization steps for me? If so, why would I only need the raw counts and not the length columns or the avg_gene_length column?

Here is my tximport R script, which I submitted as an sbatch job on the cluster:

# install/load packages
BiocManager::install("tximport")
library(DESeq2)
library(tidyverse)
library(RColorBrewer)
library(pheatmap)
library(tximport)
library(ggplot2)
library(ggrepel)

# one Salmon output directory per sample, each containing a quant.sf
samples <- list.files(path = "/pine/scr/j/o/johnsony/PUP_quants", full.names = TRUE)
files <- file.path(samples, "quant.sf")
names(files) <- str_replace(samples, "/pine/scr/j/o/johnsony/PUP_quants", "") %>%
  str_replace(".salmon", "")

# transcript-to-KO mapping from the KEGG annotation
tx2gene <- read.delim("/pine/scr/j/o/johnsony/annotations/kegg_annotations/mega_kegg.tsv")

# import the Salmon quantifications at the transcript level (txOut = TRUE)
txi.keggannot.transcriptlevel <- tximport(files, type = "salmon",
                                          tx2gene = tx2gene[, c("query", "KO")],
                                          txOut = TRUE)
write.csv(txi.keggannot.transcriptlevel, "txi.keggannot.transcriptlevel.csv")
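
As an aside on that last write.csv call: it flattens the tximport list into one wide table, which is handy for merging with annotations but loses the list structure that DESeqDataSetFromTximport expects. If the goal is just to reuse the txi object in a later R session, saveRDS keeps it intact (the file name here is arbitrary):

# saveRDS()/readRDS() round-trip the R object exactly, unlike write.csv()
saveRDS(txi.keggannot.transcriptlevel, "txi.keggannot.transcriptlevel.rds")

# later, e.g. in the DESeq2 script:
txi <- readRDS("txi.keggannot.transcriptlevel.rds")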

Here is my TPM calculation script (which I'm wondering if I need to perform before DESeq2):

# read the flattened tximport output back in; the first column (X) holds the transcript IDs
kegg_tl <- read.csv("txi.keggannot.transcriptlevel.csv")
rownames(kegg_tl) <- kegg_tl$X

colnames(kegg_tl)[colnames(kegg_tl) == "X"] <- "query"

# average effective length per transcript across samples
x <- kegg_tl %>%
  mutate(avg_gene_length = rowMeans(select(kegg_tl, starts_with("length")), na.rm = TRUE))

rownames(x) <- rownames(kegg_tl)

# length-normalize the per-sample count columns (positions 36:69 in this data)
y <- x[36:69] / x$avg_gene_length

# rescale each column to sum to one million -> TPM
tpm.mat <- t(t(y) * 1e6 / colSums(y))
tpm.mat <- data.frame(tpm.mat)
tpm.mat$TrinityID <- rownames(tpm.mat)
write.csv(tpm.mat, "tpm_mat.csv")
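
As a quick sanity check on the TPM matrix (assuming TrinityID is the only non-numeric column), each sample's column should sum to roughly one million:

# every TPM column should sum to ~1e6
colSums(tpm.mat[, colnames(tpm.mat) != "TrinityID"])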

Sorry if these all seem like beginner's questions, but I'd like to be clear on how DESeq2 handles this before I run statistics on my samples. Please let me know if you have any advice or suggestions. Thank you!

tximport Normalization DESeq2
@mikelove

Are all the samples quantified against the same reference (e.g. the same Salmon index)?

Usually, you would just let tximport and DESeq2 take care of all these steps, simply with:

txi <- tximport(files, ...)
dds <- DESeqDataSetFromTximport(txi, coldata, ~condition)

The point of tximport is that users don't have to do the guesswork on import and scaling factors themselves.
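
For illustration, a minimal end-to-end sketch (the sample names, condition labels, and paths here are made up):

library(tximport)
library(DESeq2)

# hypothetical Salmon output directories, one per sample
files <- file.path("quants", c("Q1", "Q2", "Q3", "Q4"), "quant.sf")
names(files) <- c("Q1", "Q2", "Q3", "Q4")

# import transcript-level estimates, as in the script above (txOut = TRUE)
txi <- tximport(files, type = "salmon", txOut = TRUE)

# sample table: rownames must match colnames(txi$counts)
coldata <- data.frame(
  condition = factor(c("control", "control", "treated", "treated")),
  row.names = names(files)
)

# DESeq2 builds the dataset from the raw counts plus the length matrix;
# normalization happens inside DESeq(), no TPM step needed beforehand
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~condition)
dds <- DESeq(dds)
res <- results(dds)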


Hi Mike,

Yes, all of the samples were quantified against the same reference (the same Salmon index), and I ended up with a quant.sf file for each individual sample. I ran tximport on those with the first R script shown above and created the txi.keggannot.transcriptlevel.csv file, which includes the abundance, counts, and length columns for all of the samples combined. Is it just the counts columns that I feed into DESeq2? I think another problem I have is file size. Since this is a meta-transcriptome, I'm not sure whether it can run in RStudio or whether I have to run DESeq2 on the cluster. The output I have is 25 GB, and the only way I can manipulate it and create a file that can be fed into DESeq2 is through the LongLeaf cluster.

Thank you!


re: "Is it just the counts columns that I use to feed into DESeq2?"

No, the line of code I show above (dds <- ...) uses more than the counts; it also uses the lengths.
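
To see where the lengths go, here is an illustrative check (assuming counts were imported with the default countsFromAbundance = "no"): the average transcript lengths carried in by tximport become gene- and sample-specific normalization factors once size factors are estimated:

dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~condition)
dds <- estimateSizeFactors(dds)   # uses the length matrix to build offsets

# matrix of gene- and sample-specific factors, in place of one size factor per sample
head(normalizationFactors(dds))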

I'd recommend minimal filtering from the start because you probably have a lot of transcripts that are not expressed, and you can drop these even before importing into DESeq2.

E.g.:

# keep rows with TPM >= tpm_threshold in X or more samples:
keep <- rowSums(txi$abundance >= tpm_threshold) >= X 
filt <- lapply(txi[1:3], function(mat) mat[keep,])
filt$countsFromAbundance <- txi$countsFromAbundance # keep this metadata
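
For example, with arbitrary cutoffs of TPM >= 1 in at least 3 samples:

keep <- rowSums(txi$abundance >= 1) >= 3
filt <- lapply(txi[1:3], function(mat) mat[keep,])
filt$countsFromAbundance <- txi$countsFromAbundance
nrow(txi$counts) - nrow(filt$counts)   # number of transcripts dropped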

Hi Mike,

I've tried using that code. I first made a sample table from the colnames of my txi file, and it looks something like this:

          sampleTypewideshelf
countsQ1          wideshelfT0
countsQ2          wideshelfT0
countsQ3          wideshelfT0
countsQ4      wideshelfT1Ctrl
countsQ7        wideshelfT1Fe
countsQ10      wideshelfT1DFB
...

But whenever I use this script:

ddswideshelf <- DESeqDataSetFromTximport(txi.mat.wideshelf, colData = sampleTypewideshelf, ~condition)

I get an error saying:

Error in round(txi$counts) : 
  non-numeric argument to mathematical function

My txi.mat.wideshelf dataset has the following columns; each sample is represented by a Q number, and the counts columns contain the raw counts from my quant.sf output:

 [1] "Domain"              "Phylum"              "Class"              
 [4] "Order"               "Family"              "Genus"              
 [7] "Species"             "Organism"            "TrinityID"          
[10] "KO"                  "countsQ1"            "countsQ2"           
[13] "countsQ3"            "countsQ4"            "countsQ7"           
[16] "countsQ10"           "countsQ13"           "countsQ14"          
[19] "countsQ15"           "countsQ16"           "countsQ17"          
[22] "countsQ18"           "countsQ19"           "lengthQ1"           
[25] "lengthQ2"            "lengthQ3"            "lengthQ4"           
[28] "lengthQ7"            "lengthQ10"           "lengthQ13"          
[31] "lengthQ14"           "lengthQ15"           "lengthQ16"          
[34] "lengthQ17"           "lengthQ18"           "lengthQ19"          
[37] "countsFromAbundance"

I checked online to see what the error is about, and it seems like I'm passing data frames into a function that only recognizes vectors? If so, how can I convert my raw counts into a vector? Thanks!


See here:

https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#DESeq2

The input to DESeqDataSetFromTximport is the output from tximport. Above you are trying to provide something else to this function. The paradigm is:

txi <- tximport(...)
dds <- DESeqDataSetFromTximport(txi, ...)
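
Applied to the example above, that means keeping the tximport list (abundance, counts, and length matrices) intact instead of passing the merged annotation table; a sketch with illustrative sample names and the condition labels shown earlier:

# txi is the list returned by tximport(); subset columns but keep the list structure
wideshelf_samples <- c("Q1", "Q2", "Q3", "Q4", "Q7", "Q10")   # illustrative subset
txi.wideshelf <- lapply(txi[c("abundance", "counts", "length")],
                        function(mat) mat[, wideshelf_samples])
txi.wideshelf$countsFromAbundance <- txi$countsFromAbundance

# colData rows must line up with the columns of txi.wideshelf$counts,
# and the design variable must be a column of colData
sampleTypewideshelf <- data.frame(
  condition = factor(c("wideshelfT0", "wideshelfT0", "wideshelfT0",
                       "wideshelfT1Ctrl", "wideshelfT1Fe", "wideshelfT1DFB")),
  row.names = wideshelf_samples
)

ddswideshelf <- DESeqDataSetFromTximport(txi.wideshelf,
                                         colData = sampleTypewideshelf,
                                         design = ~condition)
ddswideshelf <- DESeq(ddswideshelf)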