Question

DESeq Input from Salmon + tximport/tximeta

0

Entering edit mode

LuciaNhu • 0

@6c040327

Last seen 21 days ago

Germany

Hi,

I am writing to inquire about the input for DESeq from Salmon using tximport/tximeta.

I ran the nf-core/rnaseq pipeline for RNA data analysis. Unfortunately, I mistakenly deleted the quant.sf and transcriptome BAM files, and I only have the final outputs: gene_lengths, gene_tpm, gene_counts, gene_counts_length_scaled.tsv, and gene_counts_scaled.tsv.

From my observations of the nf-core pipeline, the process is as follows: quant.sf -> tximport -> summarizeToGene (tximeta).

In the code, the nf-core team defines the pattern for file names based on the quantification type and imports the transcript-level quantifications using tximport. They also create a SummarizedExperiment object and process gene-level data using summarizeToGene if a mapping (tx2gene) is available.

The nf-core team uses gene_counts_length_scaled.tsv as the input for the following code:

dds <- DESeqDataSetFromMatrix(countData=round(counts), colData=coldata, design=~1)
dds <- estimateSizeFactors(dds)

Since the gene_counts_length_scaled.tsv file has already been normalized based on library size and gene length, the nf-core team visualizes the heatmap and PCA without rerunning dds <- DESeq(dds).

In my case, I need to compare two groups based on a specific list of genes, so I need to run dds <- DESeq(dds) manually. DESeq requires raw counts, which leads me to believe that my input should be gene_counts.tsv.

Is my solution correct?

Best regards,

Nhu

tximportData summarizeToGene tximport DESeq2 salmon • 621 views

ADD COMMENT • link 7 weeks ago • updated 6 weeks ago LuciaNhu • 0

score 1 · Answer 1 · 2024-11-04

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

Given nf-core's pipeline here, you can use rounded counts from gene_counts_length_scaled.tsv.

ADD COMMENT • link 7 weeks ago Michael Love 43k

0

Entering edit mode

Thank you to the author for the quick reply. If I use gene_counts_length_scaled.tsv as a metric that has already normalized for library size and the average transcript length, what additional components will dds <- DESeq(dds) continue to normalize? Will this affect my final results? I think that pre-normalized counts (like length-scaled counts) can lead to over-normalization when DESeq2 applies its own methods on top of existing adjustments, potentially distorting expression values.

I want to observe which genes are upregulated and downregulated, so I need the log2FoldChange and padj. Below is the nf-core code.

# Define pattern for file names based on quantification type
pattern <- ifelse('$quant_type' == "kallisto", "abundance.tsv", "quant.sf")
fns <- list.files('quants', pattern = pattern, recursive = T, full.names = T)
names <- basename(dirname(fns))
names(fns) <- names
dropInfReps <- '$quant_type' == "kallisto"

# Import transcript-level quantifications
txi <- tximport(fns, type = '$quant_type', txOut = TRUE, dropInfReps = dropInfReps)

# Read transcript and sample data
transcript_info <- read_transcript_info('$tx2gene')

# Make coldata just to appease the summarizedexperiment
coldata <- data.frame(files = fns, names = names)
rownames(coldata) <- coldata[["names"]]

# Create initial SummarizedExperiment object
se <- create_summarized_experiment(txi[["counts"]], txi[["abundance"]], txi[["length"]],
    DataFrame(coldata), transcript_info\$transcript)

# Setting parameters for writing tables
params <- list(
    list(obj = se, slot = "abundance", suffix = "transcript_tpm.tsv"),
    list(obj = se, slot = "counts", suffix = "transcript_counts.tsv"),
    list(obj = se, slot = "length", suffix = "transcript_lengths.tsv")
)

# Process gene-level data if tx2gene mapping is available
if ("tx2gene" %in% names(transcript_info) && !is.null(transcript_info\$tx2gene)) {
    tx2gene <- transcript_info\$tx2gene
    gi <- summarizeToGene(txi, tx2gene = tx2gene)
    gi.ls <- summarizeToGene(txi, tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
    gi.s <- summarizeToGene(txi, tx2gene = tx2gene, countsFromAbundance = "scaledTPM")

    gene_info <- transcript_info\$gene[match(rownames(gi[[1]]), transcript_info\$gene[["gene_id"]]),]
    rownames(gene_info) <- NULL
    col_data_frame <- DataFrame(coldata)

    # Create gene-level SummarizedExperiment objects
    gse <- create_summarized_experiment(gi[["counts"]], gi[["abundance"]], gi[["length"]],
        col_data_frame, gene_info)
    gse.ls <- create_summarized_experiment(gi.ls[["counts"]], gi.ls[["abundance"]], gi.ls[["length"]],
        col_data_frame, gene_info)
    gse.s <- create_summarized_experiment(gi.s[["counts"]], gi.s[["abundance"]], gi.s[["length"]],
        col_data_frame, gene_info)

    params <- c(params, list(
        list(obj = gse, slot = "length", suffix = "gene_lengths.tsv"),
        list(obj = gse, slot = "abundance", suffix = "gene_tpm.tsv"),
        list(obj = gse, slot = "counts", suffix = "gene_counts.tsv"),
        list(obj = gse.ls, slot = "counts", suffix = "gene_counts_length_scaled.tsv"),
        list(obj = gse.s, slot = "counts", suffix = "gene_counts_scaled.tsv")
    ))
}

ADD REPLY • link 6 weeks ago LuciaNhu • 0

2

Entering edit mode

If I use gene_counts_length_scaled.tsv as a metric that has already normalized for library size and the average transcript length

No, only for average transcript length, not for depth or libary composition. The output is still raw counts, so you can use DESeq2 without any modifications.

ADD REPLY • link 6 weeks ago ATpoint ★ 4.6k

1

Entering edit mode

Agree.

for library size and the average transcript length, what additional components will dds <- DESeq(dds) continue to normalize? Will this affect my final results? I think that pre-normalized counts (like length-scaled counts) can lead to over-normalization when DESeq2 applies its own methods on top of existing adjustments, potentially distorting expression values.

The way that DESeq2 works with average transcript length correction, it cannot "over-normalize".

ADD REPLY • link 6 weeks ago Michael Love 43k

0

Entering edit mode

Thank you both. Special thanks from Prof. Michael Love. It's rare to find tool authors who respond to users so quickly and supportively. Many thanks for your tool and your contribution.

ADD REPLY • link 6 weeks ago LuciaNhu • 0