When pseudoaligning with kallisto, I use tximport to load the outputs into R.
I usually build my tx2gene object from the Ensembl GTF file.
I was wondering which GTF file is better to use:
Homo_sapiens.GRCh38.113.gtf OR Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf
I tried running tximport with tx2gene objects created from both GTF files, and far fewer transcripts were missing from tx2gene with the Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf file (7129) than with the Homo_sapiens.GRCh38.113.gtf file (25650).
Most of the documentation I have seen uses the Homo_sapiens.GRCh38.113.gtf file, but isn't it better to use the chr_patch_hapl_scaff file, since it seems to cover more of the transcripts present in the kallisto output?
Code showing the difference in the number of missing transcripts:
library(dplyr)  # provides %>%, rename(), distinct() and as_tibble()

# building tx2gene from the normal GTF file
gtf.hs.ensembl <- rtracklayer::import('Homo_sapiens.GRCh38.113.gtf')
# select metadata columns by name instead of position (previously columns 7 and 10)
tx2gene.gtf <- S4Vectors::mcols(gtf.hs.ensembl)[, c("transcript_id", "gene_name")] %>%
  as_tibble() %>%
  dplyr::rename(target_id = transcript_id) %>%
  # na.omit() drops rows with NA in either column: gene-level rows
  # (no transcript_id) and any transcript without a gene_name
  na.omit() %>%
  distinct(target_id, .keep_all = TRUE)

# building tx2gene from the chr_patch_hapl_scaff GTF file
gtf.hs.ensembl.hapl.scaff <- rtracklayer::import('Homo_sapiens.GRCh38.113.chr_patch_hapl_scaff.gtf')
tx2gene.gtf.hapl.scaff <- S4Vectors::mcols(gtf.hs.ensembl.hapl.scaff)[, c("transcript_id", "gene_name")] %>%
  as_tibble() %>%
  dplyr::rename(target_id = transcript_id) %>%
  na.omit() %>%
  distinct(target_id, .keep_all = TRUE)
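To see which IDs tximport would report as missing, the kallisto target IDs can be compared against a tx2gene table directly. This is only a minimal sketch on toy data: `kallisto_ids` and `tx2gene_toy` are made-up stand-ins for the `target_id` column of an `abundance.tsv` and for a tx2gene table, and stripping everything after the first dot is an assumption about what `ignoreTxVersion = TRUE` does.

```r
# Toy stand-ins (hypothetical values, not real kallisto output)
kallisto_ids <- c("ENST00000000001.1", "ENST00000000002.3", "ENST00000000003.2")
tx2gene_toy <- data.frame(
  target_id = c("ENST00000000001", "ENST00000000002"),
  gene_name = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# Drop the version suffix (assumed to mirror ignoreTxVersion = TRUE)
ids_no_version <- sub("\\..*", "", kallisto_ids)

# IDs present in the kallisto output but absent from tx2gene
missing_ids <- setdiff(ids_no_version, tx2gene_toy$target_id)
length(missing_ids)
missing_ids
```

Running the same comparison against each of the two tx2gene tables would show exactly which transcript IDs account for the difference, rather than only how many there are.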
> # running tximport with normal gtf
> Txi_gene <- tximport(path,
+ type = "kallisto",
+ tx2gene = tx2gene.gtf,
+ txOut = FALSE,
+ countsFromAbundance = "lengthScaledTPM",
+ ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10
transcripts missing from tx2gene: 25650
summarizing abundance
summarizing counts
summarizing length
>
> # running tximport with chr_patch_hapl_scaff gtf
> Txi_gene <- tximport(path,
+ type = "kallisto",
+ tx2gene = tx2gene.gtf.hapl.scaff,
+ txOut = FALSE,
+ countsFromAbundance = "lengthScaledTPM",
+ ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10
transcripts missing from tx2gene: 7129
summarizing abundance
summarizing counts
summarizing length
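The number of missing transcripts alone does not say how much signal they carry. Below is a rough sketch, again on toy data, of checking how many missing transcripts have any estimated counts and what fraction of the total counts they hold. `abundance_toy` is a hypothetical stand-in for one sample's `abundance.tsv` (the `target_id` and `est_counts` column names follow kallisto's output), `tx2gene_ids` stands in for the `target_id` column of a tx2gene table, and all values are invented for illustration.

```r
# Toy stand-in for one sample's abundance.tsv (invented values)
abundance_toy <- data.frame(
  target_id  = c("ENST00000000001.1", "ENST00000000002.3",
                 "ENST00000000003.2", "ENST00000000004.1"),
  est_counts = c(900, 80, 20, 0),
  stringsAsFactors = FALSE
)
# Toy stand-in for the transcript IDs in a tx2gene table
tx2gene_ids <- c("ENST00000000001", "ENST00000000002")

# Flag transcripts whose version-stripped ID is not in tx2gene
is_missing <- !(sub("\\..*", "", abundance_toy$target_id) %in% tx2gene_ids)

# How many are missing, and how many of those have non-zero counts
n_missing <- sum(is_missing)
n_missing_expressed <- sum(is_missing & abundance_toy$est_counts > 0)

# Fraction of all estimated counts assigned to missing transcripts
frac_counts_missing <- sum(abundance_toy$est_counts[is_missing]) /
  sum(abundance_toy$est_counts)
```

If most of the missing transcripts turn out to have zero or very low counts, the choice of GTF may matter less for the gene-level summary than the raw number of missing IDs suggests.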