I think I may know the answer to this question, but wanted to pose it anyways. I am interested in both protein coding and polyadenylated lncRNAs a set of samples. I used kallisto
to generate a single index using both the cDNA and lncRNA FASTA files from Ensembl. After pseudoaligning my samples with this index, I created a tx2gene
file from the GTF associated with my previously used fasta files:
gtf.hs.ensembl <- rtracklayer::import('Homo_sapiens.GRCh38.113.gtf')
tx2gene.gtf <- mcols(gtf.hs.ensembl)[,c(7,10)] %>%
as_tibble() %>%
dplyr::rename(gene_name = gene_name, target_id = transcript_id) %>%
dplyr::select(target_id, gene_name) %>%
na.omit() %>%
distinct(target_id, .keep_all = T)
I then used tximport
and got an output saying ~135,000 transcripts were missing from tx2gene
:
> Txi_gene <- tximport(path_combined,
+ type = "kallisto",
+ tx2gene = tx2gene.gtf,
+ txOut = FALSE,
+ countsFromAbundance = "lengthScaledTPM",
+ ignoreTxVersion = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10
transcripts missing from tx2gene: 135227
summarizing abundance
summarizing counts
summarizing length
Is this simply because most of the lncRNAs included in the FASTA files are not annotated with gene names? Should I summarize to the transcript level with txOut = TRUE
?
@Monkey Mart Thank you so much
I was under the impression that we want to summarize to gene names, not gene IDs. I didn't see a gene name key.