Hi all,
I've been working on a metatranscriptomic pipeline for a set of phytoplankton incubation experiments from a 2019 research cruise off the California coast, and I'm now at the stage where I need to convert my alignment output (from Salmon) into a raw-count matrix for DESeq2. I already have my PhyloDB and KEGG (keggannot) BLAST outputs and am hoping to merge these with the counts, lengths, and abundances produced by tximport.
I've already calculated TPM from the raw counts by taking the count columns (one per sample) and dividing them by avg_gene_length (derived from the length columns). That gave me a TPM matrix, which I then merged with my PhyloDB and KEGG outputs. I did all of this in R on the cluster, since my files (25 GB) are too large to work with in a local R session, but then realized that DESeq2 does not want normalized counts and only requires raw counts. My questions are:
- Would calculating the TPM be considered normalizing in this case? If so, would I have to go back and only merge my raw counts output (directly from tximport) with the phylodb/kegg outputs before feeding my counts into DESeq2?
- Does DESeq2 perform the normalization steps for me? If so, why would I not need to incorporate the length columns or the avg_gene_length column and only the raw counts?
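For context, my current understanding (which may be wrong) is that DESeq2 derives its own per-sample size factors from the raw counts internally, so no prior TPM normalization would be needed. A toy sketch of what I mean (made-up counts, just for illustration):

```r
library(DESeq2)

# hypothetical toy count matrix: 10 genes x 4 samples
cts <- matrix(rpois(40, lambda = 10), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("s", 1:4)))
coldata <- data.frame(condition = factor(c("A", "A", "B", "B")),
                      row.names = colnames(cts))

dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                              design = ~ condition)
# DESeq2 computes median-of-ratios size factors from the raw counts itself
dds <- estimateSizeFactors(dds)
sizeFactors(dds)  # per-sample normalization factors
```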
Here is the tximport R script that I submitted as an sbatch job on the cluster:
BiocManager::install("tximport")
library(DESeq2)
library(tidyverse)
library(RColorBrewer)
library(pheatmap)
library(tximport)
library(ggplot2)
library(ggrepel)
# locate the per-sample Salmon quantification directories and quant.sf files
samples <- list.files(path = "/pine/scr/j/o/johnsony/PUP_quants", full.names = TRUE)
files <- file.path(samples, "quant.sf")
names(files) <- str_replace(samples, "/pine/scr/j/o/johnsony/PUP_quants", "") %>%
  str_replace(".salmon", "")
# transcript-to-gene (here, query-to-KO) mapping from the KEGG annotations
tx2gene <- read.delim("/pine/scr/j/o/johnsony/annotations/kegg_annotations/mega_kegg.tsv")
txi.keggannot.transcriptlevel <- tximport(files, type = "salmon", tx2gene = tx2gene[, c("query", "KO")], txOut = TRUE)
write.csv(txi.keggannot.transcriptlevel, "txi.keggannot.transcriptlevel.csv")
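A side note on my own workaround (not required): writing the tximport list to CSV flattens it into one wide table, whereas saveRDS keeps the abundance, counts, and length matrices intact as a list that can be reloaded later:

```r
# Alternative to write.csv (assumption: you want to reuse the txi list later):
# saveRDS preserves the list structure (abundance, counts, length matrices),
# which write.csv would flatten into a single wide data frame.
saveRDS(txi.keggannot.transcriptlevel, "txi.keggannot.transcriptlevel.rds")

# later, e.g. in another session or sbatch job:
txi <- readRDS("txi.keggannot.transcriptlevel.rds")
```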
Here is my TPM calculation script (which I'm wondering whether I need to run before DESeq2):
kegg_tl <- read.csv("txi.keggannot.transcriptlevel.csv")
rownames(kegg_tl) <- kegg_tl$X
colnames(kegg_tl)[colnames(kegg_tl) == "X"] <- "query"
# add the mean of the per-sample length columns as avg_gene_length
# (mutate, not transmute, so the count columns are kept for the next step)
x <- kegg_tl %>%
  mutate(avg_gene_length = rowMeans(select(., starts_with("length")), na.rm = TRUE))
rownames(x) <- rownames(kegg_tl)
# columns 36:69 hold the per-sample raw counts
y <- x[36:69] / x$avg_gene_length
tpm.mat <- t(t(y) * 1e6 / colSums(y))
tpm.mat <- data.frame(tpm.mat)
tpm.mat$TrinityID <- rownames(tpm.mat)
write.csv(tpm.mat, "tpm_mat.csv")
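As a sanity check on the calculation above (my own addition, not part of the original script), every sample's TPM column should sum to one million:

```r
# Sanity check (my addition): by construction, each TPM column sums to 1e6.
numeric_cols <- sapply(tpm.mat, is.numeric)   # skip the TrinityID column
colSums(tpm.mat[, numeric_cols])              # expect ~1e6 per sample
```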
Sorry if these seem like beginner's questions, but I'd like some clarification about DESeq2 before I can properly perform statistics on my samples. Please let me know if you have any advice or suggestions. Thank you!
Hi Mike,
Yes, all of the samples were quantified against the same reference (Salmon index). I ended up with a quant.sf file for each sample, ran tximport on them with the first R script shown above, and created the txi.keggannot.transcriptlevel.csv file that includes the abundance, counts, and length columns for all of the samples combined. Is it just the counts columns that I feed into DESeq2? I think another problem I have is file size. Since this is a metatranscriptome, I'm not sure it can run in RStudio locally, or would I have to run DESeq2 through the cluster? The output I have is 25 GB, and the only way I can manipulate it into a file that can be fed into DESeq2 is through the LongLeaf cluster.
Thank you!
re: "Is it just the counts columns that I use to feed into DESeq2?"
No, the line of code I show above (
dds <- ...
) uses more than the counts; it also uses the lengths. I'd recommend minimal filtering from the start, because you probably have a lot of transcripts that are not expressed, and you can drop these even before importing into DESeq2.
E.g.:
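(The original code example appears to have been lost here; a minimal sketch of that kind of pre-filtering, assuming `txi` is the list returned by tximport, might look like this:)

```r
# Sketch of pre-filtering before DESeq2 (assumes `txi` is the tximport list):
# drop transcripts with zero counts across all samples, keeping the three
# matrices in sync.
keep <- rowSums(txi$counts) > 0
txi$abundance <- txi$abundance[keep, ]
txi$counts    <- txi$counts[keep, ]
txi$length    <- txi$length[keep, ]
```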
Hi Mike,
I've tried using the script, and had made a sampleTable first from the colnames of my txi file, and it looks something like this:
But whenever I use this script:
I get an error saying:
my txi.mat.wideshelf dataset looks like this, with each sample represented by a Q number, and the counts containing the raw counts from my quant.sf output:
I checked online to see what the error is about, and it seems like I'm trying to pass these data frames into a function that only recognizes vectors? If so, how can I convert my raw counts into a vector? Thanks!
See here:
https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#DESeq2
The input to
DESeqDataSetFromTximport
is the output from
tximport
. Above you are trying to provide something else to this function. The paradigm is:
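A sketch of that paradigm, following the tximport vignette linked above (the sampleTable columns and the design formula are placeholders for your own experiment):

```r
# Paradigm from the tximport vignette: pass the whole tximport list,
# not a hand-built matrix, to DESeqDataSetFromTximport.
library(tximport)
library(DESeq2)

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi,
                                colData = sampleTable,   # one row per sample
                                design  = ~ condition)   # placeholder design
```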