Question

Many transcripts missing from tx2gene created from Ensembl GTF file

0

Entering edit mode

Nicholas • 0

@3611f731

Last seen 6 months ago

United States

I think I may know the answer to this question, but wanted to pose it anyways. I am interested in both protein coding and polyadenylated lncRNAs a set of samples. I used kallisto to generate a single index using both the cDNA and lncRNA FASTA files from Ensembl. After pseudoaligning my samples with this index, I created a tx2gene file from the GTF associated with my previously used fasta files:

gtf.hs.ensembl <- rtracklayer::import('Homo_sapiens.GRCh38.113.gtf') 

tx2gene.gtf <- mcols(gtf.hs.ensembl)[,c(7,10)] %>%
  as_tibble() %>%
  dplyr::rename(gene_name = gene_name, target_id = transcript_id) %>%
  dplyr::select(target_id, gene_name) %>%
  na.omit() %>%
  distinct(target_id, .keep_all = T)

I then used tximport and got an output saying ~135,000 transcripts were missing from tx2gene:

> Txi_gene <- tximport(path_combined, 
+                      type = "kallisto", 
+                      tx2gene = tx2gene.gtf, 
+                      txOut = FALSE, 
+                      countsFromAbundance = "lengthScaledTPM",
+                      ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
transcripts missing from tx2gene: 135227
summarizing abundance
summarizing counts
summarizing length

Is this simply because most of the lncRNAs included in the FASTA files are not annotated with gene names? Should I summarize to the transcript level with txOut = TRUE?

kallisto tximport ensembl • 1.1k views

ADD COMMENT • link updated 6 months ago by Abbott • 0 • written 7 months ago by Nicholas • 0

score 0 · Answer 1 · 2025-06-20

0

Entering edit mode

Gaurav • 0

@14b5011f

Last seen 7 months ago

India

Read the replies of these posts:
https://support.bioconductor.org/p/9162008/#9162038
https://www.biostars.org/p/422750/

This code will help:
> txdb <- makeTxDbFromGFF(file="gencode.v47.annotation.gtf")
> saveDb(x=txdb, file = "gencode.v47.annotation.TxDb")
> k <- keys(txdb, keytype = "TXNAME")
> tx2gene <- select(txdb, k, "GENEID", "TXNAME")

ADD COMMENT • link 7 months ago Gaurav • 0

0

Entering edit mode

@Monkey Mart Thank you so much

ADD REPLY • link 7 months ago • updated 6 months ago Abbott • 0

0

Entering edit mode

I was under the impression that we want to summarize to gene names, not gene IDs. I didn't see a gene name key.

ADD REPLY • link 7 months ago Nicholas • 0