How to build and map to a "pre-mRNA" reference transcriptome for single nucleus RNA-seq data?
Entering edit mode
Last seen 2.9 years ago
USA/Baltimore/Johns Hopkins


What is the best way to create a reference transcriptome for both "pre-mRNA" and mature RNA in Bioconductor? I've got some single-nucleus RNA-seq (snRNA-seq) data (full-length transcripts from smart-seq2 plate-based protocol) and I would like to have a reference transcriptome that allows reads to map to both introns and exons as the protocol will capture pre-mRNA in the nucleus and mature RNA outside the nucleus sometimes.

At first, I just blindly tried out salmon for quantification, using an ensembl reference (cDNA), and tximeta to get the annotation automatically.

# download ensemble fastq cdna file 

# create salmon index (3-5 mins)
salmon index -t Homo_sapiens.GRCh38.cdna.all.fa.gz -i ensemble.grch38_salmon-index-v0.14.1

Then, I remembered the cDNA file will have introns already cut out and this won't work. This paper on bioRxiv ( points to CellRanger's suggestion ( of "listing each gene transcript locus as an exon. Thus, these intronic reads will be included in the UMI counts for each gene and barcode."

Should I follow these steps and create a custom GTF file? Any advice would be greatly appreciated.


tximeta ensembldb single-cell snRNA-seq tximport • 2.4k views
Entering edit mode

This is perhaps a more general discussion point and not really specific to this bioconductor support forum. Anyways, I would start by just mapping the data to the complete reference genome using bwa. Following this I would evaluate how much of my reads that map to different areas of the genome and make my call on howto proceed based on that.

Entering edit mode

That's fair. I wasn't sure if this was something that the authors of tximport, tximeta, ensembldb have come across and thought I would ask on the Bioc support site. I'm happy to withdraw my question if it's not relevant enough to this community.

Entering edit mode

I think implicit in the question is the desire to do this in or with R / Bioconductor, so the question is appropriate here.

Entering edit mode
Last seen 4 hours ago
United States

If you want the sequence, you can use the transcripts() function on a TxDb or EndDb to obtain a GRanges per transcript and then Biostrings::getSeq() to get the sequence. To obtain the matching genome, I’d look up reference genomes available with AnnotationHub:

ah <- AnnotationHub()

Adding on to this, some backend details: we've been working on how tximeta will identify the transcripts in cases where additional sequence is involved. For example, in this preprint, Rob Patro's group worked on adding decoy genomic sequence during quantification to improve estimation against the standard transcripts. As far as tximeta and identifying the transcript provenance, we will end up having multiple checksums for analyses like this one, where we hash the standard reference transcripts for use with reference source APIs like refget, and then also hash every sequence that goes into the index for reproducibility of the analysis. I think this is already implemented in latest Salmon for decoy genomic sequences, but we'd probably take a similar approach for ERCC spike-ins, viral sequence, fusion genes, etc.

Entering edit mode
Last seen 12 months ago

Hi Stephanie ,

Just trying to extrapolate from your question, if I understand it correctly, you wan't to quantify snRNA-seq SMART-seq2 data; since the single-nuclei data is suppose to have high fraction of pre-mRNA molecules, you wan't the aligner / quantification tool to be aware of the full genome (or at least the intronic) region to be able to reduce the FP exonic mappings ? If my understanding is correct, then I think you can skip the separate GTF creation step all together. In the latest beta release of salmon (Stable to come out next week but feel free to try), you can give both the full Genome and transcriptome to the salmon index (just as you shared above) and the tool will take care of the FP. For more information you can checkout the details here, I am guessing within a week we are updating the preprint too.

Hope it helps, if this was actually the question ? Please feel free to correct or rephrase my understanding of the question.

Entering edit mode
Robert Castelo ★ 3.0k
Last seen 5 weeks ago
Barcelona/Universitat Pompeu Fabra

I recently had to deal with intronic reads from a bulk total RNA sequencing experiment (see Fig. 1a from this preprint) and derived an intronic annotation for that purpose using Bioconductor. Our aim was to count strictly intronic reads. If this is also your aim, then the script below may be useful for you.


txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
introns <- unlist(intronsByTranscript(txdb))
strictintrons <- setdiff(introns, exons(txdb))
ov <- findOverlaps(strictintrons, introns, type="within")
l <- split(subjectHits(ov), queryHits(ov))
txnegs <- select(txdb, keys=names(introns)[unlist(l)],
                 columns=c("TXNAME", "GENEID"), keytype="TXID")
strictintrons$tx_name <- CharacterList(relist(txnegs$TXNAME, l))
strictintrons$tx_name <- sapply(strictintrons$tx_name, paste, collapse=":")
strictintrons$gene_id <- CharacterList(relist(txnegs$GENEID, l))
strictintrons$gene_id <- sapply(strictintrons$gene_id, paste, collapse=":")
names(strictintrons) <- paste(seqnames(strictintrons),
stopifnot(all(!duplicated(names(strictintrons)))) ## QC
GRanges object with 301900 ranges and 1 metadata column:
                                           seqnames        ranges strand |
                                              <Rle>     <IRanges>  <Rle> |
                chr1:12228:12612:+             chr1   12228-12612      + |
                chr1:12722:12974:+             chr1   12722-12974      + |
                chr1:13053:13220:+             chr1   13053-13220      + |
                chr1:30040:30266:+             chr1   30040-30266      + |
                chr1:30668:30975:+             chr1   30668-30975      + |
                               ...              ...           ...    ... .
  chrUn_GL000213v1:134117:134275:- chrUn_GL000213v1 134117-134275      - |
  chrUn_GL000213v1:134391:138766:- chrUn_GL000213v1 134391-138766      - |
    chrUn_GL000219v1:54294:78841:- chrUn_GL000219v1   54294-78841      - |
    chrUn_GL000219v1:78953:79936:- chrUn_GL000219v1   78953-79936      - |
    chrUn_GL000219v1:80029:83212:- chrUn_GL000219v1   80029-83212      - |
                chr1:12228:12612:+ ENST00000456328.2:ENST00000450305.2
                chr1:12722:12974:+ ENST00000456328.2:ENST00000450305.2
                chr1:13053:13220:+ ENST00000456328.2:ENST00000450305.2
                chr1:30040:30266:+                   ENST00000473358.1
                chr1:30668:30975:+ ENST00000473358.1:ENST00000469289.1
                               ...                                 ...
  chrUn_GL000213v1:134117:134275:- ENST00000616049.4:ENST00000616157.1
  chrUn_GL000213v1:134391:138766:- ENST00000616049.4:ENST00000616157.1
    chrUn_GL000219v1:54294:78841:-                   ENST00000612919.1
    chrUn_GL000219v1:78953:79936:-                   ENST00000612919.1
    chrUn_GL000219v1:80029:83212:-                   ENST00000612919.1
                chr1:12228:12612:+ 100287102:100287102
                chr1:12722:12974:+ 100287102:100287102
                chr1:13053:13220:+ 100287102:100287102
                chr1:30040:30266:+                  NA
                chr1:30668:30975:+               NA:NA
                               ...                 ...
  chrUn_GL000213v1:134117:134275:- 100288966:100288966
  chrUn_GL000213v1:134391:138766:- 100288966:100288966
    chrUn_GL000219v1:54294:78841:-              283788
    chrUn_GL000219v1:78953:79936:-              283788
    chrUn_GL000219v1:80029:83212:-              283788
  seqinfo: 595 sequences (1 circular) from hg38 genome

## if you want to export this annotation as a GTF file
export(strictintrons, "strictintrons.gtf")

## if you would use this annotation to count reads
## using 'summarizeOverlaps()' from the pkg 'GenomicAlignments'
## one may have to tune here the arguments when
## calling 'BamFileList()' and 'summarizeOverlaps()'
bfl <- BamFileList(bamfiles)
se <- summarizeOverlaps(strictintrons, bfl)




Login before adding your answer.

Traffic: 152 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6