Question

Problem converting transcript to genomic coordinates with Annotation Hub

0

Entering edit mode

nattzy94 ▴ 20

@nattzy94-23466

Last seen 17 months ago

Singapore

Hi,

I am having a problem with AnnotationHub when trying to convert transcript coordinates to genome coordinates.

Some information first on the context of where the data comes from: I started with a GTF file assembled from RNA-Seq data using the Ensembl reference transcriptome (v. 99) as reference. I then translated these transcripts in 3 frames to generate a theoretical translatome. This protein database was then used to search mass spec data to identify proteins in my samples. I next mapped the identified proteins back to their transcript IDs and sequences and was able to get transcript coordinates for each protein sequence. Now, I would like to convert the transcript positions to genomic positions.

I used the Annotation Hub package with the following code:

    ah = AnnotationHub()
    ahDb <- query(ah, pattern = c("Homo sapiens", "EnsDb", 99))
    ahEdb <- ahDb[[1]]

    df_protein_coding['genome.start'] <- NA
    df_protein_coding['genome.end'] <- NA
    row.names(df_protein_coding) <- NULL

    for (i in 1:nrow(df_protein_coding)){

      start = df_protein_coding['tx_start'][[1]][i]
      end = df_protein_coding['tx_end'][[1]][i]
      tx_id=df_protein_coding['transcript.id'][[1]][i]
      chr = df_protein_coding['Chromosome.scaffold.name'][[1]][i]

      rng_tx <- IRanges(start = start, end = end, names = tx_id)
      edbx <- filter(ahEdb, filter = ~ seq_name == chr)

      rng_gnm <- transcriptToGenome(rng_tx, edbx)
      a <- GRangesList(rng_gnm)
      df_protein_coding['genome.start'][[1]][i] = start(a)[[1]]
      df_protein_coding['genome.end'][[1]][i] = end(a)[[1]]
    }

However, I run into this error:

Error in df_protein_coding["genome.start"][[1]][i] <- start(a)[[1]] : 
  replacement has length zero
In addition: There were 24 warnings (use warnings() to see them)

The code seems to have worked for the first few entires. Here is the truncated version for the entries that worked:

header	1	2	3
peptide	VINSNCTR	NGGGFGGR	NGGGFGGR
transcript id	ENST00000295379	ENST00000582693	ENST00000582693
transcript biotype	protein_coding	protein_coding	protein_coding
gene id	ENSG00000163217	ENSG00000265491	ENSG00000265491
gene symbol	BMP10	RNF115	RNF115
Chromosome/scaffold name	2	1	1
protein match	ENST00000295379_(+1)_26	ENST00000582693_(+3)_2	ENST00000582693_(+3)_2
tx_start	3019	51	51
tx_end	3183	344	344
genome.start	68863762	145823772	145823772
genome.end	68863926	145824045	145824045

The table has been transposed for easier formatting.

The entry that AnnotationHub has an issue with is a sequence from position 8698 to 8760 of ENST00000515408. Not sure where I have gone wrong here as it seems that the transcript coordinates are out of bounds from the one that AnnotationHub has. Does anyone have any suggestions on how to solve this? Would appreciate any help with this issue.

AnnotationHub • 1.8k views

ADD COMMENT • link 2.4 years ago • updated 23 months ago nattzy94 ▴ 20

0

Entering edit mode

It's hard to say what happened by just looking at the error message. Could you please also provide the last values of the variables start, end, tx_id and chr that caused the error?

As a side note, I'm not sure you need the edbx <- filter... call, since you're anyway mapping just the coordinates from a single transcript to genomic coordinates.

ADD REPLY • link 2.4 years ago Johannes Rainer ★ 2.0k

1

Entering edit mode

Thanks for the reply Johannes. The last value that caused the error has the following attributes:

transcript start: 8698

transcript end: 8760

tx_id: ENST00000515408

Chr: 5

gene id: ENSG00000172795

gene symbol: DCP2

Let me know if you need more information.

As a side note, I'm not sure you need the edbx <- filter... call, since you're anyway mapping just the coordinates from a single transcript to genomic coordinates.

Noted, I will remove this code.

ADD REPLY • link 2.4 years ago nattzy94 ▴ 20

score 2 · Accepted Answer · 2021-12-09

2

Entering edit mode

Johannes Rainer ★ 2.0k

@johannes-rainer-6987

Last seen 20 days ago

Italy

The length of transcript ENST00000515408 is only 8029 nucleotides,

transcriptLengths(filter(ahEdb, filter = ~ tx_id == "ENST00000515408"))
                          tx_id         tx_name         gene_id nexon tx_len
ENST00000515408 ENST00000515408 ENST00000515408 ENSG00000172795    10   8029

but you try to map nucleotides 8698 to 8760 to the genome. And this results in an empty GRanges:

rng_tx <- IRanges(start = 8698, end = 8760, names = "ENST00000515408")
transcriptToGenome(rng_tx, ahEdb)
GRangesList object of length 1:
$ENST00000515408
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

Warning messages:
1: The within transcript/CDS coordinates are outside the region defined by the provided exons 
2: In transcriptToGenome(rng_tx, ahEdb) :
  1 transcript(s) could either not be found in the database or the specified range is outside the transcript's sequence

I would suggest you to re-check how you calculate the within-transcript coordinates. Also, always ensure that you're using the same Ensembl version for all your steps - could be that the 5' and 3' UTRs for some transcripts change over Ensembl releases (although I guess that's happening much less frequent than some years ago).

best, jo

ADD COMMENT • link 2.4 years ago Johannes Rainer ★ 2.0k

0

Entering edit mode

Thanks Johannes, the transcript is indeed out of bounds. I think my sequence matches to the 3' UTR of the transcript. Is there any way to match this using AnnotationHub?

ADD REPLY • link 2.4 years ago nattzy94 ▴ 20

0

Entering edit mode

Actually, the sequence is outside the 3' UTR - what exactly do you mean with match this using AnnotationHub?

ADD REPLY • link 2.4 years ago Johannes Rainer ★ 2.0k

0

Entering edit mode

I mean does AnnotationHub support mapping of 5' or 3' UTR regions from their mRNA transcript coordinates to genomic coordinates? I understand these are non-coding regions so it might not be possible.

ADD REPLY • link 2.4 years ago nattzy94 ▴ 20

0

Entering edit mode

ensembldb allows you to map any coordinate within a transcript (including the 5' and 3' UTR) to genomic coordinates, it doesn't have to be within the coding region. It's a different thing if you map from transcript coordinates to protein coordinates - then you can obviously only map coordinates within the coding region, but for mappings of transcripts to genome it can be any position (as long as it is within the transcript).

Note also that AnnotationHub does not do the mapping, it's the ensembldb package that provides this functionality using transcript and protein annotations from EnsDb databases that you can get from AnnotationHub.

ADD REPLY • link 2.4 years ago Johannes Rainer ★ 2.0k

1

Entering edit mode

Hi Johannes,

Sorry to revive an old thread. I am facing the above problem as I am using a custom GTF assembled from in-house RNA-Seq data. In this GTF, the transcript, ENST00000515408, spans 10,091 bases whereas the one on ensemble spans 8,029 bases.

Is it possible to make a transcript database object using my own gtf instead of the ensembl gtf so that the transcript length is corrected for my use?

Edit: I managed to find a way using makeTxDbFromGFF to generate a txdb from my custom gtf and then use mapFromTranscripts to convert the coordinates. Apologies and thanks for your help!

ADD REPLY • link 23 months ago nattzy94 ▴ 20