Which GTF file to use for creation of tx2gene files?
Nicholas • 0
@3611f731
United States

When pseudoaligning with kallisto, I use tximport to load the outputs into R. I usually build my tx2gene object from the Ensembl GTF file. Which of these two GTF files is the better choice?

Homo_sapiens.GRCh38.113.gtf OR Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf

I ran tximport with tx2gene objects built from each GTF file. Far fewer transcripts were missing from tx2gene with the Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf file (7129 versus 25650).

Most of the documentation I have seen uses the Homo_sapiens.GRCh38.113.gtf file, but would it not be better to use the chr_patch_hapl_scaff file, since it seems to include more of the transcripts that kallisto mapped reads to?

Code showing difference in number of transcripts:

# load the packages used below
library(rtracklayer)
library(dplyr)

# building tx2gene from normal gtf file
gtf.hs.ensembl <- rtracklayer::import('Homo_sapiens.GRCh38.113.gtf')

# select metadata columns by name rather than by position (7 = gene_name, 10 = transcript_id)
tx2gene.gtf <- mcols(gtf.hs.ensembl)[, c("gene_name", "transcript_id")] %>%
  as.data.frame() %>%
  as_tibble() %>%
  dplyr::rename(target_id = transcript_id) %>%
  dplyr::select(target_id, gene_name) %>%
  na.omit() %>%  # drops gene-level rows and transcripts whose gene has no name
  distinct(target_id, .keep_all = TRUE)

# building tx2gene from chr_patch_hapl_scaff gtf file
gtf.hs.ensembl.hapl.scaff <- rtracklayer::import('Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf')

tx2gene.gtf.hapl.scaff <- mcols(gtf.hs.ensembl.hapl.scaff)[, c("gene_name", "transcript_id")] %>%
  as.data.frame() %>%
  as_tibble() %>%
  dplyr::rename(target_id = transcript_id) %>%
  dplyr::select(target_id, gene_name) %>%
  na.omit() %>%
  distinct(target_id, .keep_all = TRUE)
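
One thing worth checking in the code above: na.omit() removes every row where gene_name is NA, so transcripts of unnamed genes never reach tx2gene. Whether that explains any of the missing transcripts here is an assumption, not something verified against these files. A minimal sketch of a fallback to gene_id, using a toy data frame (all IDs and the helper name build_tx2gene are hypothetical) in place of the imported GTF metadata:

```r
# Toy stand-in for the metadata columns of an imported Ensembl GTF
# (hypothetical values; real input would come from rtracklayer::import()).
gtf_meta <- data.frame(
  transcript_id = c("ENST0001", "ENST0001", "ENST0002", NA, "ENST0003"),
  gene_id       = c("ENSG0001", "ENSG0001", "ENSG0002", "ENSG0003", "ENSG0003"),
  gene_name     = c("GENEA", "GENEA", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a two-column transcript-to-gene table.
build_tx2gene <- function(meta) {
  # Gene-level rows have no transcript_id; drop them.
  meta <- meta[!is.na(meta$transcript_id), ]
  # Use gene_name where present, otherwise fall back to gene_id.
  label <- ifelse(is.na(meta$gene_name), meta$gene_id, meta$gene_name)
  out <- data.frame(target_id = meta$transcript_id,
                    gene_name = label,
                    stringsAsFactors = FALSE)
  # Keep one row per transcript.
  out <- out[!duplicated(out$target_id), ]
  rownames(out) <- NULL
  out
}

tx2gene_fallback <- build_tx2gene(gtf_meta)
print(tx2gene_fallback)
```

With the fallback, transcripts of unnamed genes are summarized under their Ensembl gene ID instead of being dropped.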


> # running tximport with normal gtf
> Txi_gene <- tximport(path, 
+                      type = "kallisto", 
+                      tx2gene = tx2gene.gtf, 
+                      txOut = FALSE, 
+                      countsFromAbundance = "lengthScaledTPM",
+                      ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
transcripts missing from tx2gene: 25650
summarizing abundance
summarizing counts
summarizing length
> 
> # running tximport with chr_patch_hapl_scaff gtf
> Txi_gene <- tximport(path, 
+                      type = "kallisto", 
+                      tx2gene = tx2gene.gtf.hapl.scaff, 
+                      txOut = FALSE, 
+                      countsFromAbundance = "lengthScaledTPM",
+                      ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 
transcripts missing from tx2gene: 7129
summarizing abundance
summarizing counts
summarizing length
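
To see which transcripts are behind the "transcripts missing from tx2gene" count, the IDs in the quantification can be compared directly with the tx2gene table. The sketch below uses toy IDs (all hypothetical) and assumes that ignoreTxVersion = TRUE strips everything from the first "." onward in the quantification IDs:

```r
# Hypothetical kallisto target IDs (versioned, as in an Ensembl cDNA FASTA).
quant_ids <- c("ENST0001.2", "ENST0002.1", "ENST0009.4")

# Hypothetical tx2gene table with unversioned transcript IDs.
tx2gene_toy <- data.frame(
  target_id = c("ENST0001", "ENST0002", "ENST0003"),
  gene_name = c("GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)

# Strip the version suffix; assumed to mirror ignoreTxVersion = TRUE.
strip_version <- function(ids) sub("\\..*", "", ids)

# Quantified transcripts with no entry in tx2gene.
missing_ids <- setdiff(strip_version(quant_ids), tx2gene_toy$target_id)

# tx2gene entries that never appear in the quantification.
unused_ids <- setdiff(tx2gene_toy$target_id, strip_version(quant_ids))

print(missing_ids)
print(unused_ids)
```

On real data, quant_ids could be taken from the target_id column of one sample's abundance.tsv. Looking up the missing IDs in each GTF would then show whether they sit on patch or haplotype sequences or were dropped for another reason.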
@mikelove
United States

I use the top GTF and FASTA from GENCODE. I do not include haplotype chromosomes in the reference.
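
For anyone staying with the Ensembl files from the question rather than GENCODE, a rough sketch of separating transcripts on patch and haplotype sequences from the rest is below. It assumes those sequences can be recognized by a "CHR_" prefix in the GTF seqname column, which should be checked against the actual file. The table and IDs are hypothetical.

```r
# Hypothetical transcript table: one row per transcript with its seqname.
tx_table <- data.frame(
  target_id = c("ENST0001", "ENST0002", "ENST0003", "ENST0004"),
  seqname   = c("1", "CHR_HSCHR6_MHC_COX_CTG1", "KI270728.1", "CHR_HG2232_PATCH"),
  stringsAsFactors = FALSE
)

# Assumption: patch and haplotype sequence names start with "CHR_".
is_alt <- startsWith(tx_table$seqname, "CHR_")

# Transcripts on the primary chromosomes and unplaced scaffolds.
primary_tx <- tx_table$target_id[!is_alt]

# Transcripts on patch or haplotype sequences.
alt_tx <- tx_table$target_id[is_alt]

print(primary_tx)
print(alt_tx)
```

If the kallisto index was built from a FASTA that contains the alt_tx transcripts, they will show up as missing from a tx2gene built on the primary GTF. Keeping the index and the annotation on the same set of sequences avoids that mismatch.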
