What type of counts data to import for performing Isoform analysis in edgeR
@mohammedtoufiq91-17679

Hi,

I am interested in performing isoform analysis on short read data (150bp) using the edgeR package, following the example from the edgeR User's Guide, section "4.6 Differential transcript expression of human lung adenocarcinoma cell lines."

I ran the nf-core/rnasplice pipeline and obtained counts and TPM values (the output files are listed below).

Following example 4.6 in the edgeR User's Guide, I tried importing the "quant.sf" files but experienced difficulties. Based on the example, I imported scaled counts as suggested:

library(edgeR)

# Define the path to the scaled transcript counts from nf-core/rnasplice
file_path <- "/projects/salmon/tximport/salmon.merged.transcript_counts_scaled.tsv"

# Import the TSV file (transcripts in rows, samples in columns)
scaled.counts <- read.delim(file_path, header = TRUE, sep = "\t", row.names = 1)

# Create DGEList object ('Samples_metadata' holds the sample information)
y <- DGEList(counts = scaled.counts, samples = Samples_metadata)
dim(y)

Which of the following files makes the most sense to import for "4.6 Differential transcript expression of human lung adenocarcinoma cell lines"?

Counts from nf-core/rnasplice:

  • salmon.merged.transcript_counts.tsv: Matrix of isoform-level raw counts across all samples.
  • salmon.merged.transcript_counts_scaled.tsv: Matrix of isoform-level scaled raw counts across all samples.
  • salmon.merged.transcript_counts_length_scaled.tsv: Matrix of isoform-level length-scaled raw counts across all samples.
  • salmon.merged.transcript_counts_dtu_scaled.tsv: Matrix of isoform-level dtu scaled raw counts across all samples.

TPMs from nf-core/rnasplice:

  • salmon.merged.transcript_tpm.tsv: Matrix of isoform-level TPM values across all samples.
  • salmon.merged.transcript_tpm_scaled.tsv: Matrix of isoform-level scaled TPM values across all samples.
  • salmon.merged.transcript_tpm_length_scaled.tsv: Matrix of isoform-level length-scaled TPM values across all samples.
  • salmon.merged.transcript_tpm_dtu_scaled.tsv: Matrix of isoform-level dtu scaled TPM values across all samples.

limma isoform edgeR salmon R
ATpoint (Germany)

You don't need any of these files. You need the catchSalmon function, which reads quant.sf together with the inferential (bootstrap/Gibbs) replicates. Where you can find that in nf-core I cannot tell, but this is what the User's Guide describes; see also ?catchSalmon.
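
A minimal sketch of what that looks like, following section 4.6 of the edgeR User's Guide (the directory names are placeholders; each directory must contain quant.sf plus bootstrap or Gibbs replicates):

library(edgeR)

# Each element of quant.dirs is one Salmon output directory (placeholder names)
quant.dirs <- c("sample1", "sample2", "sample3", "sample4")

# Read quant.sf and the inferential replicates; this estimates a per-transcript
# overdispersion reflecting read-to-transcript assignment uncertainty
catch <- catchSalmon(paths = quant.dirs)

# "Divided" counts: raw transcript counts divided by the estimated overdispersion
div.counts <- catch$counts / catch$annotation$Overdispersion

# The divided counts then go into edgeR's regular gene-level pipeline
y <- DGEList(counts = div.counts, genes = catch$annotation)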


ATpoint Thank you. I did want to try using the quant.sf files from the Salmon run, following edgeR User's Guide example 4.6 (Differential transcript expression of human lung adenocarcinoma cell lines). I tried the steps below, but the Overdispersion column contains NA values, and after computing scaled.counts the table is populated with NAs (see the output below).

> setwd("/projects/RNA/salmon/")
> quant <- dirname(list.files(".", "quant.sf", recursive = TRUE, full.names = TRUE))
> print(quant)
[1] "./DD21_progenitors" "./DD39_progenitors" "./DD48_progenitors" "./DD98_progenitors"
> catch <- catchSalmon(paths = quant)
Reading ./DD21_progenitors, 427366 transcripts, 0 none samples
Reading ./DD39_progenitors, 427366 transcripts, 0 none samples
Reading ./DD48_progenitors, 427366 transcripts, 0 none samples
Reading ./DD98_progenitors, 427366 transcripts, 0 none samples
> scaled.counts <- catch$counts/catch$annotation$Overdispersion


Probably nf-core did not run Salmon with bootstrap/Gibbs replicates. Check the code, and see the Salmon docs for how to turn that on.
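
One quick way to check whether the quant directories already contain inferential replicates is to inspect Salmon's aux_info/meta_info.json (a sketch, assuming Salmon's standard output layout and the quant vector defined above):

library(jsonlite)

# Number of bootstrap/Gibbs draws that Salmon recorded for each sample;
# 0 means no inferential replicates were generated
sapply(quant, function(d) {
  fromJSON(file.path(d, "aux_info", "meta_info.json"))$num_bootstraps
})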


OP needs to run Salmon with --numGibbsSamples=50. I suspect that nf-core just won't do that, so Salmon needs to be run directly.
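
For example, Salmon could be run directly for each sample along these lines (a sketch only; the index path and FASTQ names are placeholders, shown here as a call from R):

# Placeholder sample names, index and FASTQ paths; adjust to your own data
samples <- c("DD21_progenitors", "DD39_progenitors", "DD48_progenitors", "DD98_progenitors")
for (s in samples) {
  system2("salmon", c("quant",
                      "-i", "salmon_index",                # placeholder transcriptome index
                      "-l", "A",                           # auto-detect library type
                      "-1", paste0(s, "_R1.fastq.gz"),     # placeholder FASTQ files
                      "-2", paste0(s, "_R2.fastq.gz"),
                      "--numGibbsSamples", "50",           # Gibbs replicates for catchSalmon
                      "-p", "8",
                      "-o", s))                            # one output directory per sample
}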

Gordon Smyth (WEHI, Melbourne, Australia)

I haven't used nf-core but, from browsing their documentation, they don't seem to have any pipeline for differential transcript expression. To run the edgeR pipeline, you should run Salmon directly with --numGibbsSamples=50. Otherwise, edgeR has no way to measure the assignment uncertainty of each transcript count. See

Baldoni PL, Chen L, Smyth GK (2024). Faster and more accurate assessment of differential transcript expression with Gibbs sampling and edgeR v4. NAR Genomics and Bioinformatics 6(4), lqae151. https://doi.org/10.1093/nargab/lqae151

Note that the edgeR concept of "scaled counts" means "divided counts" and is quite different from scaled counts in nf-core.


Gordon Smyth, I am exploring workflows for differential transcript expression analysis and have been using the nf-core/rnasplice pipeline, which outputs a variety of files, including downstream results:

https://nf-co.re/rnasplice/1.0.4/
https://nf-co.re/rnasplice/1.0.4/docs/output/

In particular, I have been working with the salmon.merged.transcript_counts_length_scaled.tsv file, which provides a matrix of isoform-level length-scaled raw counts across all samples. I wanted to ask whether you think the workflow below is appropriate for differential transcript expression analysis.

In the meantime, I will also re-run Salmon with the --numGibbsSamples=50 option to account for transcript count uncertainty before proceeding.

library(edgeR)

# Create DGEList object ('scaled.counts' and 'Samples_metadata' as imported above;
# 'design' is the model matrix built from the sample metadata, e.g. with model.matrix())
y <- DGEList(counts = scaled.counts,
             samples = Samples_metadata)

# Filtering
keep <- filterByExpr(y, design)
table(keep)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- normLibSizes(y, method = "TMM")

## Obtain log-CPM values
logCPM <- cpm(y, log = TRUE, prior.count = 1)
class(logCPM)
logCPM <- as.data.frame(logCPM)

## Dispersion estimation
y <- estimateDisp(y, design, robust = TRUE)
y$common.dispersion
plotBCV(y)

fit <- glmQLFit(y, design, robust = TRUE)
plotQLDisp(fit)

## Differential expression
contr <- makeContrasts(Treatment - Control, levels = design)
qlf <- glmQLFTest(fit, contrast = contr)
is.de <- decideTests(qlf, adjust.method = "none", p.value = 0.005)
summary(is.de)

# Get the full results table
results <- topTags(qlf, n = Inf)

Your insights would be greatly appreciated.


I've told you that you have to run Salmon directly, and this really is important. The whole point of edgeR's DTE pipeline is that edgeR is able to estimate the read-to-transcript ambiguity from the Gibbs (or bootstrap) samples, and that the resulting divided counts can be analysed by edgeR's regular gene-level pipeline. If you don't have the Gibbs samples, then the whole point of this pipeline is lost. You can't follow the User's Guide case study but omit the critical step. I will edit the User's Guide to emphasize this point more in the case study.

It is inappropriate to input the salmon.merged.transcript_counts_length_scaled.tsv file to an edgeR analysis. It will give poor results: poor statistical power overall, but also potential false discoveries for transcripts that are short or that overlap other transcripts. ATpoint and I have both told you that you cannot use any of the output files from nf-core. If I understand the nf-core and tximport documentation correctly, then I expect the salmon.merged.transcript_counts_length_scaled.tsv file will be especially bad in an edgeR pipeline because, not only is read-to-transcript ambiguity not being measured, but the length scaling interferes with the mean-variance trend as well.


Gordon Smyth, I will re-run Salmon with the --numGibbsSamples=50 option. Thank you.
