Hi,
I performed isoform expression analysis on our Illumina short-read RNA-Seq data (150 bp paired-end) using the edgeR package, following the approach outlined in section 4.6 of the edgeR User's Guide ("Differential transcript expression of human lung adenocarcinoma cell lines").
Transcript quantification was done using Salmon with --numGibbsSamples=50 to generate Gibbs replicates. I used the catchSalmon() function to import both the quant.sf files and the inferential replicates. edgeR uses these Gibbs samples to estimate read-to-transcript ambiguity, allowing the resulting adjusted counts to be analyzed using its standard gene-level pipeline.
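To make the "adjusted counts" step concrete: as I understand section 4.6 of the User's Guide, each transcript's counts are divided by the overdispersion that catchSalmon() estimates for that transcript. The following is only a minimal Python sketch of that arithmetic, not edgeR code. The transcript names, counts and overdispersion values are made up for illustration, and `scale_counts` is a hypothetical helper.

```python
# Illustrative values only: made-up counts for two transcripts in three samples.
counts = {
    "tx_A": [120.0, 90.0, 150.0],  # unambiguous transcript
    "tx_B": [40.0, 55.0, 35.0],    # transcript sharing most reads with a near-identical isoform
}

# Hypothetical per-transcript overdispersion estimates (1.0 = no assignment ambiguity).
overdispersion = {"tx_A": 1.0, "tx_B": 5.0}


def scale_counts(counts, overdispersion):
    """Divide each transcript's counts by its overdispersion estimate."""
    scaled = {}
    for transcript, row in counts.items():
        od = overdispersion[transcript]
        scaled[transcript] = [value / od for value in row]
    return scaled


scaled_counts = scale_counts(counts, overdispersion)
```

In this sketch a transcript with high assignment ambiguity ends up with much smaller effective counts, which is why near-identical isoforms in a merged annotation are a concern.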
For transcript annotation, I used a merged GTF file that combines isoforms identified from PacBio long-read IsoSeq data with the GENCODE v44 annotation. This merged GTF was used to build the hg38 STAR index, which was then used for alignment, quantification and finally transcript analysis in edgeR.
As noted above, edgeR uses these Gibbs samples to estimate read-to-transcript ambiguity and to measure the assignment uncertainty of each transcript count. The use case in the edgeR User's Guide was most likely run with a standard GTF file, without long-read data. What happens if the transcript set is built by merging annotations, and some of the resulting isoforms are nearly identical, differing by only a few base pairs in the 5' or 3' UTR? How does edgeR handle read-to-transcript ambiguity in that case? Have you come across any use cases where the GTF file was created from long-read data combined with a standard annotation?
What is your opinion on using this edgeR workflow with such a GTF file (long-read RNA-seq isoforms merged with the standard GENCODE annotation)?
Thank you,
Toufiq
I assume that you have run Salmon on the BAM files from STAR. The transcripts quantified by Salmon and the RTA dispersions estimated by edgeR will depend on the transcript FASTA file. The GTF file used by STAR will make relatively little difference. Did you make a merged FASTA file as well, or only a merged GTF?
Gordon Smyth, thank you for the response.
I created the STAR index using the standard hg38.fa file and the merged GTF file. Yes, I ran Salmon on the BAM files from STAR. I made only the merged GTF file and did not make a merged FASTA file for creating the STAR index.
Should I rebuild the STAR index with the merged transcript FASTA file and the merged GTF file, then re-run STAR to get my BAM files, run Salmon quantification, and use the output as input to edgeR?
No, you haven't quite answered my question yet. I am not asking about the genome FASTA file (hg38.fa) that is input to STAR but rather about the transcript FASTA (gencode.v48.transcripts.fa.gz) that is input to Salmon. It is the Salmon annotation that is important, not so much how STAR is run. You don't actually need STAR at all, it is optional, because Salmon can run well enough on the raw FASTQ files.
I'm also not sure what you mean by the standard hg38.fa file, because there is no file by that name on Gencode. Do you mean GRCh38.p14.genome.fa.gz?
Gordon Smyth, thanks.
Sorry for the confusion. Yes, that's correct: the standard hg38.fa file I used is the GENCODE GRCh38.p14 genome FASTA (genome.fa.gz). I renamed it after downloading it from the GENCODE website for consistency in my pipeline.
I utilized the nf-core/rnasplice pipeline (v1.0.4), which includes Salmon quantification as part of its workflow: https://nf-co.re/rnasplice/1.0.4
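This pipeline offers two options for quantification: alignment-based (Salmon run on the STAR BAM files) or alignment-free (Salmon run directly on the FASTQ files).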
I chose the alignment-based approach because I needed BAM files for additional downstream analyses.
Importantly, the pipeline automatically generates the transcript FASTA and Salmon index files from the merged GTF and genome FASTA files by default. The merged transcript FASTA and merged GTF are then used for Salmon quantification on the aligned BAM files.
In summary, yes, a merged transcript FASTA file was used for Salmon quantification.
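One way to double-check this is to confirm that every transcript_id in the merged GTF has a matching record in the transcript FASTA that Salmon actually used. Below is a minimal Python sketch of such a check. It assumes GENCODE-style FASTA headers, where the transcript ID is the text before the first `|` or whitespace, and the function names are hypothetical helpers, not part of Salmon, edgeR or nf-core/rnasplice.

```python
import re

# Matches the transcript_id attribute in column 9 of a GTF line.
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(lines):
    """Collect transcript IDs from the attribute column of GTF lines."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = _TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def fasta_transcript_ids(lines):
    """Collect transcript IDs from FASTA header lines.

    Assumption: the ID is the header text up to the first '|' or whitespace.
    """
    ids = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(re.split(r"[|\s]", header, maxsplit=1)[0])
    return ids


def compare_annotations(gtf_lines, fasta_lines):
    """Report transcript IDs that appear in only one of the two files."""
    gtf_ids = gtf_transcript_ids(gtf_lines)
    fasta_ids = fasta_transcript_ids(fasta_lines)
    return {
        "only_in_gtf": sorted(gtf_ids - fasta_ids),
        "only_in_fasta": sorted(fasta_ids - gtf_ids),
        "shared": sorted(gtf_ids & fasta_ids),
    }
```

Both arguments can be open file handles, for example from `open()` or `gzip.open(path, "rt")` for compressed files. If `only_in_gtf` is non-empty, some merged isoforms were not present in the FASTA that Salmon quantified against.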