Many transcripts missing from tx2gene created from Ensembl GTF file
1
0
Entering edit mode
Nicholas • 0
@3611f731
Last seen 3 months ago
United States

I think I may know the answer to this question, but wanted to pose it anyways. I am interested in both protein coding and polyadenylated lncRNAs a set of samples. I used kallisto to generate a single index using both the cDNA and lncRNA FASTA files from Ensembl. After pseudoaligning my samples with this index, I created a tx2gene file from the GTF associated with my previously used fasta files:

gtf.hs.ensembl <- rtracklayer::import('Homo_sapiens.GRCh38.113.gtf') 

tx2gene.gtf <- mcols(gtf.hs.ensembl)[,c(7,10)] %>%
  as_tibble() %>%
  dplyr::rename(gene_name = gene_name, target_id = transcript_id) %>%
  dplyr::select(target_id, gene_name) %>%
  na.omit() %>%
  distinct(target_id, .keep_all = T)

I then used tximport and got an output saying ~135,000 transcripts were missing from tx2gene:

> Txi_gene <- tximport(path_combined, 
+                      type = "kallisto", 
+                      tx2gene = tx2gene.gtf, 
+                      txOut = FALSE, 
+                      countsFromAbundance = "lengthScaledTPM",
+                      ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
transcripts missing from tx2gene: 135227
summarizing abundance
summarizing counts
summarizing length

Is this simply because most of the lncRNAs included in the FASTA files are not annotated with gene names? Should I summarize to the transcript level with txOut = TRUE?

kallisto tximport ensembl • 903 views
ADD COMMENT
0
Entering edit mode
Gaurav • 0
@14b5011f
Last seen 3 months ago
India
Read the replies of these posts:
https://support.bioconductor.org/p/9162008/#9162038
https://www.biostars.org/p/422750/

This code will help:
> txdb <- makeTxDbFromGFF(file="gencode.v47.annotation.gtf")
> saveDb(x=txdb, file = "gencode.v47.annotation.TxDb")
> k <- keys(txdb, keytype = "TXNAME")
> tx2gene <- select(txdb, k, "GENEID", "TXNAME")
ADD COMMENT
0
Entering edit mode

@Monkey Mart Thank you so much

ADD REPLY
0
Entering edit mode

I was under the impression that we want to summarize to gene names, not gene IDs. I didn't see a gene name key.

ADD REPLY

Login before adding your answer.

Traffic: 466 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6