Question

Which GTF file to use for creation of tx2gene files?

Nicholas • 0

@3611f731

United States

When pseudoaligning with kallisto, I use tximport to load the outputs into R. I usually build my tx2gene object from the Ensembl GTF file. I was wondering which GTF file is better to use:

Homo_sapiens.GRCh38.113.gtf OR Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf

I ran tximport with tx2gene objects created from both GTF files. Far fewer transcripts were missing from tx2gene with the Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf file (7,129) than with the Homo_sapiens.GRCh38.113.gtf file (25,650).

Most of what I have seen documented uses the Homo_sapiens.GRCh38.113.gtf file, but is it not better to use the chr_patch_hapl_scaff since it seems to include more transcripts that kallisto mapped reads to?

Code showing difference in number of transcripts:

library(dplyr)         # %>%, rename, distinct, as_tibble
library(rtracklayer)   # import(); also attaches GenomicRanges, which provides mcols()

# building tx2gene from normal gtf file
gtf.hs.ensembl <- rtracklayer::import('Homo_sapiens.GRCh38.113.gtf')

# select columns by name rather than by position (columns 10 and 7 in these files)
tx2gene.gtf <- mcols(gtf.hs.ensembl)[, c("transcript_id", "gene_name")] %>%
  as_tibble() %>%
  dplyr::rename(target_id = transcript_id) %>%
  na.omit() %>%
  distinct(target_id, .keep_all = TRUE)

# building tx2gene from chr_patch_hapl_scaff gtf file
gtf.hs.ensembl.hapl.scaff <- rtracklayer::import('Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf')

tx2gene.gtf.hapl.scaff <- mcols(gtf.hs.ensembl.hapl.scaff)[, c("transcript_id", "gene_name")] %>%
  as_tibble() %>%
  dplyr::rename(target_id = transcript_id) %>%
  na.omit() %>%
  distinct(target_id, .keep_all = TRUE)


> # running tximport with normal gtf
> Txi_gene <- tximport(path, 
+                      type = "kallisto", 
+                      tx2gene = tx2gene.gtf, 
+                      txOut = FALSE, 
+                      countsFromAbundance = "lengthScaledTPM",
+                      ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
transcripts missing from tx2gene: 25650
summarizing abundance
summarizing counts
summarizing length
> 
> # running tximport with chr_patch_hapl_scaff gtf
> Txi_gene <- tximport(path, 
+                      type = "kallisto", 
+                      tx2gene = tx2gene.gtf.hapl.scaff, 
+                      txOut = FALSE, 
+                      countsFromAbundance = "lengthScaledTPM",
+                      ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
transcripts missing from tx2gene: 7129
summarizing abundance
summarizing counts
summarizing length
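
To see which transcripts sit behind the "transcripts missing from tx2gene" count, one option is to compare the kallisto target IDs against the tx2gene table directly. The sketch below is only an illustration: `find_missing_tx` is a hypothetical helper (not part of tximport), the toy file and IDs are made up, and the version stripping assumes that `ignoreTxVersion = TRUE` amounts to dropping everything from the first "." onward.

```r
# Hypothetical helper (not part of tximport): return kallisto target IDs
# that have no matching entry in the first column of a tx2gene table.
find_missing_tx <- function(abundance_file, tx2gene, ignore_version = TRUE) {
  # kallisto's abundance.tsv is tab-separated with a `target_id` column
  ids <- read.delim(abundance_file, stringsAsFactors = FALSE)$target_id
  if (ignore_version) {
    # assumption: mimic ignoreTxVersion by removing the ".N" version suffix
    ids <- sub("\\..*", "", ids)
  }
  setdiff(ids, tx2gene[[1]])
}

# Toy data standing in for one sample's abundance.tsv and a tx2gene table
toy_file <- tempfile(fileext = ".tsv")
write.table(
  data.frame(target_id  = c("ENST0001.1", "ENST0002.3", "ENST0003.1"),
             length     = c(1000, 800, 1200),
             eff_length = c(850, 650, 1050),
             est_counts = c(10, 0, 5),
             tpm        = c(3.1, 0, 1.2)),
  toy_file, sep = "\t", quote = FALSE, row.names = FALSE
)
toy_tx2gene <- data.frame(target_id = c("ENST0001", "ENST0003"),
                          gene_name = c("GENE_A", "GENE_B"))

missing_ids <- find_missing_tx(toy_file, toy_tx2gene)
print(missing_ids)
```

With real data, a call such as `find_missing_tx(path[1], tx2gene.gtf)` would apply, assuming `path[1]` points to one sample's `abundance.tsv`. The returned IDs can then be looked up in the chr_patch_hapl_scaff GTF to check which sequences they are annotated on.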

tximport • 750 views

updated 6 months ago by Michael Love • written 6 months ago by Nicholas

score 0 · Answer 1 · 2025-06-20

Michael Love 43k

@mikelove

United States

I use the top GTF and FASTA from GENCODE. I do not include haplotype chromosomes in the reference.

written 6 months ago by Michael Love