Question

Transcript expression analysis in edgeR from gtf file obtained from merging long read rna-seq and standard gencode gtf

0

Entering edit mode

mohammedtoufiq91 ▴ 10

@mohammedtoufiq91-17679

Last seen 5 weeks ago

United States

Hi,

I performed isoform expression analysis on our Illumina short-read RNA-Seq data (150 bp paired-end) using the edgeR package, following the approach outlined in section 4.6 of the edgeR User's Guide ("Differential transcript expression of human lung adenocarcinoma cell lines").

Transcript quantification was done using Salmon with --numGibbsSamples=50 to generate Gibbs replicates. I used the catchSalmon() function to import both the quant.sf files and the inferential replicates. edgeR utilizes these Gibbs samples to estimate read-to-transcript ambiguity, allowing the resulting adjusted counts to be analyzed using its standard gene-level pipeline.

For transcript annotation, I used a merged GTF file that combines isoforms identified from PacBio long-read IsoSeq data with the GENCODE v44 annotation. This merged GTF was used to build the hg38 STAR index, which was then used for alignment, quantification and finally transcript analysis in edgeR.

As we know that edgeR utilizes these Gibbs samples to estimate read-to-transcript ambiguity and measure the assignment uncertainty of each transcript count. Mostly likely, it seems like that use case in the edgeR user guide was performed using standard gtf file [without long read data]. What if the isoforms or transcripts constructed by merging, and moreover these isoforms are similar with certian base pair changes either at the 5' or 3' UTR. How does edgeR perform here in read-to-transcript ambiguity? do you have use cases or come across any use cases where gtf file is created with long read and known standard gtf file?

What is your opinion on leveraging this edgeR workflow for the gft file [merged gtf file from long read rna-seq data + standard gencode annotation]?

Thank you,

Toufiq

RNASeq edgeR LongRead Transcript • 760 views

ADD COMMENT • link 8 weeks ago mohammedtoufiq91 ▴ 10

0

Entering edit mode

I assume that you have run Salmon on the BAM files from STAR. The transcripts quantified by Salmon and the RTA dispersions estimated by edgeR will depend on the transcript FASTA file. The GTF file used by STAR will make relatively little difference. Did you make a merged FASTA file as well, or only a merged GTF?

ADD REPLY • link 8 weeks ago Gordon Smyth 53k

0

Entering edit mode

Gordon Smyth thank you for the response.

I created the STAR index using the standard hg38.fa file and merged gtf file.

Yes, I run Salmon on the BAM files from STAR. I made only the merged gtf file but did not make merged fasta file for creating the STAR index.

Should I re-make STAR index with the merged transcript FASTA file and merged gtf file and then re-run the STAR to get my bam files > salmon quantification > Input to edgeR

ADD REPLY • link 8 weeks ago mohammedtoufiq91 ▴ 10

0

Entering edit mode

No, you haven't quite answered my question yet. I am not asking about the genome FASTA file (hg38.fa) that is input to STAR but rather about the transcript FASTA (gencode.v48.transcripts.fa.gz) that is input to Salmon. It is the Salmon annotation that is important, not so much how STAR is run. You don't actually need STAR at all, it is optional, because Salmon can run well enough on the raw FASTQ files.

I'm also not sure what you mean by the standard hg38.fa file, because there is no file by that name on Gencode. Do you mean GRCh38.p14.genome.fa.gz?

ADD REPLY • link 8 weeks ago Gordon Smyth 53k

0

Entering edit mode

Gordon Smyth thanks.

Sorry for the confusion, Yes, that's correct, the standard hg38.fa file I used is the GENCODE GRCh38.p14 genome FASTA (genome.fa.gz). I renamed it after downloading it from the GENCODE website for consistency in my pipeline.

I utilized the nf-core/rnasplice pipeline (v1.0.4), which includes Salmon quantification as part of its workflow: https://nf-co.re/rnasplice/1.0.4

This pipeline offers two options for quantification:

Direct quantification using Salmon on FASTQ files.
- Alignment-based quantification, where reads are first aligned using STAR to generate BAM files, followed by Salmon quantification on those BAMs.

I chose the alignment-based approach because I needed BAM files for additional downstream analyses.

Importantly, the pipeline automatically generates transcript fasta and Salmon index files from merged gtf and genome fasta files by default. This merged FASTA and merged gtf is then used for Salmon quantification on the aligned BAM files.

In summary, yes, merged transcript FASTA file was used for salmon quantification.

ADD REPLY • link 8 weeks ago mohammedtoufiq91 ▴ 10

score 1 · Answer 1 · 2025-08-11

1

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 27 minutes ago

WEHI, Melbourne, Australia

I share your concern that merging annotations might cause excess ambiguity. I do not have direct experience here, but I suspect that merging de novo transcripts from IsoSeq with known transcripts from Gencode might produce a lot of duplicates or near duplicates, and that will have a deleterious effect on quantification and on the downstream edgeR analysis. Expression will be incorrectly spread over duplicates and RTA dispersion will be high. I believe there are tools to reduce the duplicates, which you might investigate.

The issues here are really to do with IsoSeq and Salmon, so you might seek guidance from the authors of those packages, or perhaps ask for help on Biostars. Nothing can be done to solve the problem at the edgeR analysis stage -- edgeR just uses what it gets from Salmon.

Having said that, edgeR can still analyse the merged transcripts ok, it just might not be as efficient as including only strictly truly novel transcripts from IsoSeq.

ADD COMMENT • link 8 weeks ago Gordon Smyth 53k

0

Entering edit mode

Gordon Smyth thank you very much for the feedback. I will try to seek guidance from the author's of those packages as well. Anyways, this is how the preliminary data looks (attached image).

ADD REPLY • link 8 weeks ago mohammedtoufiq91 ▴ 10

0

Entering edit mode

You should compute divided counts before estimating edgeR dispersions. These plots look like they are using raw counts.

You could also try running Salmon on just the Gencode annotation and compare the distribution of RTA dispersions from catchSalmon with and without the IsoSeq transcripts.

ADD REPLY • link 8 weeks ago Gordon Smyth 53k

0

Entering edit mode

Gordon Smyth thanks. The above plots are obtained by edgeR steps provided below:

# Load required libraries
library(edgeR)
library(readr)
library(rtracklayer)


## Set directory from "star_salmon" 
setwd("/projects/RNAsplice_23pal003004_ss_v1/star_salmon/")
quant <- dirname(list.files(".", "quant.sf", recursive = TRUE, full.names = TRUE))
print(quant)
catch <- catchSalmon(paths = quant)
scaled.counts <- catch$counts/catch$annotation$Overdispersion
colnames(scaled.counts)

gtf <- rtracklayer::import('/projects/RNAsplice_23pal003004_ss_v1/LUAD_v3.gencode_v44.merged.gtf')
gtf <- as.data.frame(gtf)

# Create sample information
Samples_metadata <- data.frame(
  Samples = c("A1_progenitors", "B1_progenitors", "A2_progenitors", "B2_progenitors", "A3_progenitors", "B3_progenitors"),
  Donor = c("A1", "B1", "A2", "B2", "A3", "B3",),
  Category = c("A", "B", "A", "B", "A", "B"), 
  stringsAsFactors = FALSE
)

colnames(Samples_metadata)
## Make rownames
rownames(Samples_metadata) <- Samples_metadata$Samples
rownames(Samples_metadata)

## Map samples
all(colnames(scaled.counts) == rownames(Samples_metadata))


# Create DGEList object
y <- DGEList(counts = scaled.counts,
             samples = Samples_metadata,
             group = Samples_metadata$Category,
             genes = catch$annotation)
dim(y)

## Add gene annotations
gtf <- gtf[match(rownames(y), gtf$transcript_id),]
colnames(gtf )
y$genes <- cbind(y$genes,gtf[,c(11:13)])
head(y$genes)
## Map
all(rownames(y) == y$genes$transcript_id)
identical(rownames(y), y$genes$transcript_id)

## Build design matrix
Category <- factor(Samples_metadata$Category, levels = c("A", "B"))
design <- model.matrix(~0+Category)
colnames(design) <- levels(Category)
all(rownames(Samples_metadata) == colnames(y))

# Filter low expression transcripts
keep <- filterByExpr(y, design)
table(keep)
y <- y[keep, , keep.lib.sizes=FALSE]
dim(y)

y <- normLibSizes(y, method = "TMM")
y$samples

## Plot PCA
plotMDS(y, dim.plot = c(1,2))
plotMDS(y, dim.plot = c(2,3))
plotMDS(y, dim.plot = c(3,1))

##  Dispersion estimation
## We estimate NB dispersions using the estimateDisp function and visualize the estimated values with plotBCV.
y <- estimateDisp(y, design, robust=TRUE)
y$common.dispersion
0.3066385
plotBCV(y)

fit <- glmQLFit(y, design, robust=TRUE)
plotQLDisp(fit)

## Differential expression
## Differentially expressed transcripts are tested between groups using
## the QL F-test. We use the makeContrasts function to create the relevant contrast.
contr <- makeContrasts(A - B, levels=design)
qlf <- glmQLFTest(fit, contrast=contr)

ADD REPLY • link 8 weeks ago mohammedtoufiq91 ▴ 10

1

Entering edit mode

OK, I see you have used divided counts. There is a subset of transcripts with very high BCVs. however, which is curious.

ADD REPLY • link 8 weeks ago Gordon Smyth 53k