Question: (Closed) intronsByTranscript and fiveUTRsByTranscript - not recognized annotations
0
17 months ago by
felix.ernst0 wrote:

Hi,

I am having a bit of trouble to get the introns and exons from a transcript. Have a look at this snippet.

library(GenomicFeatures)

file <- "test.gff3"
txdb <- makeTxDbFromGFF(file)

print(transcripts(txdb))
print(intronsByTranscript(txdb))
print(fiveUTRsByTranscript(txdb))
print(threeUTRsByTranscript(txdb))

intronsByTranscript returns to many results, because it includes the 5'/3' UTR regions, whereas fiveUTRsByTranscript and threeUTRsByTranscript do not contain any data. What would be the problem using this style of getting the data?

On the bigger picture: I want to shift coordinates from mapped reads so that the intron sequences are removed and global coordinates are converted to local transcript reads. I achieved this in part using the code:

transcript <- transcripts(txdb, filter = list(tx_name = "YAL030W_mRNA"))
names(transcript) <- GenomeInfoDb::seqnames(transcript)
data <- GAlignments() # this is read in from BamFile

reads <- reads[hits$xHits,] IRanges::ranges(reads) <- IRanges::ranges(hits)​ The missing bit is to shift the reads, so that the intron gap is closed. Is there any build function for this I missed? Thanks for any help and advice, Best regards, Felix test.gff3 contains the following text: ##gff-version 3 ##source-version rtracklayer 1.38.0 ##date 2018-02-02 chrI SGD chromosome 1 230218 . - . ID=chrI;Name=chrI; chrI SGD gene 87263 87857 . + . ID=YAL030W;Name=YAL030W;gene=SNC1 chrI SGD mRNA 87263 87857 . + . ID=YAL030W_mRNA;Name=YAL030W_mRNA;Parent=YAL030W; chrI SGD five_prime_UTR 87263 87285 . + . ID=YAL030W_5UTR;Name=YAL030W_5UTR;Parent=YAL030W_mRNA; chrI SGD CDS 87286 87387 . + 0 Name=YAL030W_CDS;Parent=YAL030W_mRNA; chrI SGD intron 87388 87500 . + . Name=YAL030W_intron;Parent=YAL030W_mRNA; chrI SGD CDS 87501 87752 . + 0 Name=YAL030W_CDS;Parent=YAL030W_mRNA; chrI SGD three_prime_UTR 87753 87857 . + . ID=YAL030W_3UTR;Name=YAL030W_3UTR;Parent=YAL030W_mRNA; genomicfeatures • 303 views ADD COMMENTlink modified 17 months ago by Michael Lawrence11k • written 17 months ago by felix.ernst0 TxDb expects UTR regions to be included in the exon annotation. Reminder: exon != CDS ADD REPLYlink modified 17 months ago • written 17 months ago by felix.ernst0 Answer: intronsByTranscript and fiveUTRsByTranscript - not recognized annotations 0 17 months ago by United States Michael Lawrence11k wrote: Not sure about the TxDb UTR issue. For mapping into the exons, extract the transcript using exonsBy() instead of transcripts(). Also, you'll want to be careful about assigning the ranges (start/end) back into the reads GRanges, because those coordinates will not be valid on the original seqnames. ADD COMMENTlink written 17 months ago by Michael Lawrence11k The issue with the UTRs seems to be that the GenomicFeatures package ignores the three/five_prime_utr features and instead relies on exon features (and their difference from the CDS features). The GFF3 spec says: UTRs, splice sites and translational start and stop sites. These are implied by the combination of exon and CDS and do not need to be explicitly annotated as part of the canonical gene. In the case of annotating predicted splice or translational start/stop sites independently of a particular gene, it is suggested that they be attached directly to the genomic sequence and not to a gene or a subpart of a gene. That implies to me that UTR annotations are optional, while exon annotations are required. ADD REPLYlink written 17 months ago by Michael Lawrence11k > exonsBy(txdb) GRangesList object of length 1:$1
GRanges object with 2 ranges and 3 metadata columns:
seqnames         ranges strand |   exon_id   exon_name exon_rank
<Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
[1]     chrI [87286, 87387]      + |         1 YAL030W_CDS         1
[2]     chrI [87501, 87752]      + |         2 YAL030W_CDS         2

-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

That doesn't seem to work, since the exons ranges do not include the UTR sequences. The boundaries of the mRNA are not taken into account.

Am I doing something wrong or is it a bug?

edit: removing the UTR annotations does not change the result. That would have been very weird, since the sequence types *_prime_UTR are part of the SOFA specification and are also present in probably the most widely used genome annotation for H.sapiens hg*

The problem is not the UTR ranges (those are just ignored) but rather the absence of "exon" ranges. You'll need those for this to work. The GenomicFeatures maintainers might decide to support the utr + CDS way of specifying exons, but currently explicit "exon" ranges are required.

Thanks Michael. I checked again, You are right :/ :(