Question

How to build and map to a "pre-mRNA" reference transcriptome for single nucleus RNA-seq data?

2

Entering edit mode

Stephanie Hicks ▴ 80

@stephanie-hicks-7843

Last seen 5.5 years ago

USA/Baltimore/Johns Hopkins

Hi,

What is the best way to create a reference transcriptome for both "pre-mRNA" and mature RNA in Bioconductor? I've got some single-nucleus RNA-seq (snRNA-seq) data (full-length transcripts from smart-seq2 plate-based protocol) and I would like to have a reference transcriptome that allows reads to map to both introns and exons as the protocol will capture pre-mRNA in the nucleus and mature RNA outside the nucleus sometimes.

At first, I just blindly tried out salmon for quantification, using an ensembl reference (cDNA), and tximeta to get the annotation automatically.

# download ensemble fastq cdna file 
wget ftp://ftp.ensembl.org/pub/release-98/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

# create salmon index (3-5 mins)
salmon index -t Homo_sapiens.GRCh38.cdna.all.fa.gz -i ensemble.grch38_salmon-index-v0.14.1

Then, I remembered the cDNA file will have introns already cut out and this won't work. This paper on bioRxiv (https://www.biorxiv.org/content/biorxiv/early/2019/09/12/761429.full.pdf) points to CellRanger's suggestion (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references#premrna) of "listing each gene transcript locus as an exon. Thus, these intronic reads will be included in the UMI counts for each gene and barcode."

Should I follow these steps and create a custom GTF file? Any advice would be greatly appreciated.

Stephanie

tximeta ensembldb single-cell snRNA-seq tximport • 5.1k views

ADD COMMENT • link updated 6.3 years ago by Robert Castelo ★ 3.4k • written 6.3 years ago by Stephanie Hicks ▴ 80

0

Entering edit mode

This is perhaps a more general discussion point and not really specific to this bioconductor support forum. Anyways, I would start by just mapping the data to the complete reference genome using bwa. Following this I would evaluate how much of my reads that map to different areas of the genome and make my call on howto proceed based on that.

ADD REPLY • link 6.3 years ago thokall ▴ 160

1

Entering edit mode

That's fair. I wasn't sure if this was something that the authors of tximport, tximeta, ensembldb have come across and thought I would ask on the Bioc support site. I'm happy to withdraw my question if it's not relevant enough to this community.

ADD REPLY • link 6.3 years ago Stephanie Hicks ▴ 80

2

Entering edit mode

I think implicit in the question is the desire to do this in or with R / Bioconductor, so the question is appropriate here.

ADD REPLY • link 6.3 years ago Martin Morgan 25k

score 0 · Answer 1 · 2019-10-11

If you want the sequence, you can use the transcripts() function on a TxDb or EndDb to obtain a GRanges per transcript and then Biostrings::getSeq() to get the sequence. To obtain the matching genome, I’d look up reference genomes available with AnnotationHub:

ah <- AnnotationHub()
display(ah)

Adding on to this, some backend details: we've been working on how tximeta will identify the transcripts in cases where additional sequence is involved. For example, in this preprint, Rob Patro's group worked on adding decoy genomic sequence during quantification to improve estimation against the standard transcripts. As far as tximeta and identifying the transcript provenance, we will end up having multiple checksums for analyses like this one, where we hash the standard reference transcripts for use with reference source APIs like refget, and then also hash every sequence that goes into the index for reproducibility of the analysis. I think this is already implemented in latest Salmon for decoy genomic sequences, but we'd probably take a similar approach for ERCC spike-ins, viral sequence, fusion genes, etc.

score 0 · Answer 2 · 2019-10-11

Hi Stephanie ,

Just trying to extrapolate from your question, if I understand it correctly, you wan't to quantify snRNA-seq SMART-seq2 data; since the single-nuclei data is suppose to have high fraction of pre-mRNA molecules, you wan't the aligner / quantification tool to be aware of the full genome (or at least the intronic) region to be able to reduce the FP exonic mappings ? If my understanding is correct, then I think you can skip the separate GTF creation step all together. In the latest beta release of salmon (Stable to come out next week but feel free to try), you can give both the full Genome and transcriptome to the salmon index (just as you shared above) and the tool will take care of the FP. For more information you can checkout the details here, I am guessing within a week we are updating the preprint too.

Hope it helps, if this was actually the question ? Please feel free to correct or rephrase my understanding of the question.

score 0 · Answer 3 · 2019-10-11

I recently had to deal with intronic reads from a bulk total RNA sequencing experiment (see Fig. 1a from this preprint) and derived an intronic annotation for that purpose using Bioconductor. Our aim was to count strictly intronic reads. If this is also your aim, then the script below may be useful for you.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(rtracklayer)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
introns <- unlist(intronsByTranscript(txdb))
strictintrons <- setdiff(introns, exons(txdb))
ov <- findOverlaps(strictintrons, introns, type="within")
l <- split(subjectHits(ov), queryHits(ov))
txnegs <- select(txdb, keys=names(introns)[unlist(l)],
                 columns=c("TXNAME", "GENEID"), keytype="TXID")
strictintrons$tx_name <- CharacterList(relist(txnegs$TXNAME, l))
strictintrons$tx_name <- sapply(strictintrons$tx_name, paste, collapse=":")
strictintrons$gene_id <- CharacterList(relist(txnegs$GENEID, l))
strictintrons$gene_id <- sapply(strictintrons$gene_id, paste, collapse=":")
names(strictintrons) <- paste(seqnames(strictintrons),
                              start(strictintrons),
                              end(strictintrons),
                              strand(strictintrons),
                              sep=":")
stopifnot(all(!duplicated(names(strictintrons)))) ## QC
strictintrons
GRanges object with 301900 ranges and 1 metadata column:
                                           seqnames        ranges strand |
                                              <Rle>     <IRanges>  <Rle> |
                chr1:12228:12612:+             chr1   12228-12612      + |
                chr1:12722:12974:+             chr1   12722-12974      + |
                chr1:13053:13220:+             chr1   13053-13220      + |
                chr1:30040:30266:+             chr1   30040-30266      + |
                chr1:30668:30975:+             chr1   30668-30975      + |
                               ...              ...           ...    ... .
  chrUn_GL000213v1:134117:134275:- chrUn_GL000213v1 134117-134275      - |
  chrUn_GL000213v1:134391:138766:- chrUn_GL000213v1 134391-138766      - |
    chrUn_GL000219v1:54294:78841:- chrUn_GL000219v1   54294-78841      - |
    chrUn_GL000219v1:78953:79936:- chrUn_GL000219v1   78953-79936      - |
    chrUn_GL000219v1:80029:83212:- chrUn_GL000219v1   80029-83212      - |
                                                               tx_name
                                                           <character>
                chr1:12228:12612:+ ENST00000456328.2:ENST00000450305.2
                chr1:12722:12974:+ ENST00000456328.2:ENST00000450305.2
                chr1:13053:13220:+ ENST00000456328.2:ENST00000450305.2
                chr1:30040:30266:+                   ENST00000473358.1
                chr1:30668:30975:+ ENST00000473358.1:ENST00000469289.1
                               ...                                 ...
  chrUn_GL000213v1:134117:134275:- ENST00000616049.4:ENST00000616157.1
  chrUn_GL000213v1:134391:138766:- ENST00000616049.4:ENST00000616157.1
    chrUn_GL000219v1:54294:78841:-                   ENST00000612919.1
    chrUn_GL000219v1:78953:79936:-                   ENST00000612919.1
    chrUn_GL000219v1:80029:83212:-                   ENST00000612919.1
                                               gene_id
                                           <character>
                chr1:12228:12612:+ 100287102:100287102
                chr1:12722:12974:+ 100287102:100287102
                chr1:13053:13220:+ 100287102:100287102
                chr1:30040:30266:+                  NA
                chr1:30668:30975:+               NA:NA
                               ...                 ...
  chrUn_GL000213v1:134117:134275:- 100288966:100288966
  chrUn_GL000213v1:134391:138766:- 100288966:100288966
    chrUn_GL000219v1:54294:78841:-              283788
    chrUn_GL000219v1:78953:79936:-              283788
    chrUn_GL000219v1:80029:83212:-              283788
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

## if you want to export this annotation as a GTF file
export(strictintrons, "strictintrons.gtf")

## if you would use this annotation to count reads
## using 'summarizeOverlaps()' from the pkg 'GenomicAlignments'
## one may have to tune here the arguments when
## calling 'BamFileList()' and 'summarizeOverlaps()'
bfl <- BamFileList(bamfiles)
se <- summarizeOverlaps(strictintrons, bfl)

cheers,

robert.