Problem converting transcript to genomic coordinates with Annotation Hub
Entering edit mode
nattzy94 ▴ 20
Last seen 16 hours ago


I am having a problem with AnnotationHub when trying to convert transcript coordinates to genome coordinates.

Some information first on the context of where the data comes from: I started with a GTF file assembled from RNA-Seq data using the Ensembl reference transcriptome (v. 99) as reference. I then translated these transcripts in 3 frames to generate a theoretical translatome. This protein database was then used to search mass spec data to identify proteins in my samples. I next mapped the identified proteins back to their transcript IDs and sequences and was able to get transcript coordinates for each protein sequence. Now, I would like to convert the transcript positions to genomic positions.

I used the Annotation Hub package with the following code:

    ah = AnnotationHub()
    ahDb <- query(ah, pattern = c("Homo sapiens", "EnsDb", 99))
    ahEdb <- ahDb[[1]]

    df_protein_coding['genome.start'] <- NA
    df_protein_coding['genome.end'] <- NA
    row.names(df_protein_coding) <- NULL

    for (i in 1:nrow(df_protein_coding)){

      start = df_protein_coding['tx_start'][[1]][i]
      end = df_protein_coding['tx_end'][[1]][i]
      chr = df_protein_coding[''][[1]][i]

      rng_tx <- IRanges(start = start, end = end, names = tx_id)
      edbx <- filter(ahEdb, filter = ~ seq_name == chr)

      rng_gnm <- transcriptToGenome(rng_tx, edbx)
      a <- GRangesList(rng_gnm)
      df_protein_coding['genome.start'][[1]][i] = start(a)[[1]]
      df_protein_coding['genome.end'][[1]][i] = end(a)[[1]]

However, I run into this error:

Error in df_protein_coding["genome.start"][[1]][i] <- start(a)[[1]] : 
  replacement has length zero
In addition: There were 24 warnings (use warnings() to see them)

The code seems to have worked for the first few entires. Here is the truncated version for the entries that worked:

header 1 2 3
transcript id ENST00000295379 ENST00000582693 ENST00000582693
transcript biotype protein_coding protein_coding protein_coding
gene id ENSG00000163217 ENSG00000265491 ENSG00000265491
gene symbol BMP10 RNF115 RNF115
Chromosome/scaffold name 2 1 1
protein match ENST00000295379_(+1)_26 ENST00000582693_(+3)_2 ENST00000582693_(+3)_2
tx_start 3019 51 51
tx_end 3183 344 344
genome.start 68863762 145823772 145823772
genome.end 68863926 145824045 145824045

The table has been transposed for easier formatting.

The entry that AnnotationHub has an issue with is a sequence from position 8698 to 8760 of ENST00000515408. Not sure where I have gone wrong here as it seems that the transcript coordinates are out of bounds from the one that AnnotationHub has. Does anyone have any suggestions on how to solve this? Would appreciate any help with this issue.

AnnotationHub • 584 views
Entering edit mode

It's hard to say what happened by just looking at the error message. Could you please also provide the last values of the variables start, end, tx_id and chr that caused the error?

As a side note, I'm not sure you need the edbx <- filter... call, since you're anyway mapping just the coordinates from a single transcript to genomic coordinates.

Entering edit mode

Thanks for the reply Johannes. The last value that caused the error has the following attributes:

transcript start: 8698

transcript end: 8760

tx_id: ENST00000515408

Chr: 5

gene id: ENSG00000172795

gene symbol: DCP2

Let me know if you need more information.

As a side note, I'm not sure you need the edbx <- filter... call, since you're anyway mapping just the coordinates from a single transcript to genomic coordinates.

Noted, I will remove this code.

Entering edit mode
Johannes Rainer ★ 1.9k
Last seen 1 day ago

The length of transcript ENST00000515408 is only 8029 nucleotides,

transcriptLengths(filter(ahEdb, filter = ~ tx_id == "ENST00000515408"))
                          tx_id         tx_name         gene_id nexon tx_len
ENST00000515408 ENST00000515408 ENST00000515408 ENSG00000172795    10   8029

but you try to map nucleotides 8698 to 8760 to the genome. And this results in an empty GRanges:

rng_tx <- IRanges(start = 8698, end = 8760, names = "ENST00000515408")
transcriptToGenome(rng_tx, ahEdb)
GRangesList object of length 1:
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  seqinfo: no sequences

Warning messages:
1: The within transcript/CDS coordinates are outside the region defined by the provided exons 
2: In transcriptToGenome(rng_tx, ahEdb) :
  1 transcript(s) could either not be found in the database or the specified range is outside the transcript's sequence

I would suggest you to re-check how you calculate the within-transcript coordinates. Also, always ensure that you're using the same Ensembl version for all your steps - could be that the 5' and 3' UTRs for some transcripts change over Ensembl releases (although I guess that's happening much less frequent than some years ago).

best, jo

Entering edit mode

Thanks Johannes, the transcript is indeed out of bounds. I think my sequence matches to the 3' UTR of the transcript. Is there any way to match this using AnnotationHub?

Entering edit mode

Actually, the sequence is outside the 3' UTR - what exactly do you mean with match this using AnnotationHub?

Entering edit mode

I mean does AnnotationHub support mapping of 5' or 3' UTR regions from their mRNA transcript coordinates to genomic coordinates? I understand these are non-coding regions so it might not be possible.

Entering edit mode

ensembldb allows you to map any coordinate within a transcript (including the 5' and 3' UTR) to genomic coordinates, it doesn't have to be within the coding region. It's a different thing if you map from transcript coordinates to protein coordinates - then you can obviously only map coordinates within the coding region, but for mappings of transcripts to genome it can be any position (as long as it is within the transcript).

Note also that AnnotationHub does not do the mapping, it's the ensembldb package that provides this functionality using transcript and protein annotations from EnsDb databases that you can get from AnnotationHub.

Entering edit mode

Hi Johannes,

Sorry to revive an old thread. I am facing the above problem as I am using a custom GTF assembled from in-house RNA-Seq data. In this GTF, the transcript, ENST00000515408, spans 10,091 bases whereas the one on ensemble spans 8,029 bases.

Is it possible to make a transcript database object using my own gtf instead of the ensembl gtf so that the transcript length is corrected for my use?

Edit: I managed to find a way using makeTxDbFromGFF to generate a txdb from my custom gtf and then use mapFromTranscripts to convert the coordinates. Apologies and thanks for your help!


Login before adding your answer.

Traffic: 233 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6