Hello, I have performed the quantification of the transcripts following salmon workflow, but I do not understand why I am missing some transcripts in the tximort() call.
Here is my workflow: Transcriprome used for building the index:
hsa_GRC38h.fa.gz
ftp://ftp.ensembl.org/pub/release-96/fasta/homosapiens/cdna/Homosapiens.GRCh38.cdna.all.fa.gz
Built the index of the transcriptome using:
salmon index -t hsa_GRC38h.fa.gz -i hsas_GRCh38_index
Quantified the transcripts using:
salmon quant -i hsas_GRCh38_index -l A \
-1 ${samp}_L00{1,2,3,4}_R1_001.fastq.gz \
-2 ${samp}_L00{1,2,3,4}_R2_001.fastq.gz \
-p 8 --validateMappings --gcBias -o quants_gcBias/${i}_quant
then I moved to R version 3.6.0.
The quant.sf
files looks ok to me:
read_tsv(files[1])
# A tibble: 177,034 x 5
Name Length EffectiveLength TPM NumReads
<chr> <dbl> <dbl> <dbl> <dbl>
1 ENST00000631435.1 12 13 0 0
2 ENST00000434970.2 9 10 0 0
3 ENST00000448914.1 13 13 0 0
4 ENST00000415118.1 8 9 0 0
5 ENST00000604446.1 23 24 0 0
6 ENST00000603693.1 19 19 0 0
7 ENST00000603935.1 31 32 0 0
8 ENST00000604102.1 31 32 0 0
9 ENST00000604838.1 17 17 0 0
10 ENST00000439842.1 11 12 0 0
# ... with 177,024 more rows
I build the tx2gene file from the GTF file of the same version of the transcriptome Homo_sapiens.GRCh38.96.gtf
ftp://ftp.ensembl.org/pub/release-96/gtf/homosapiens/Homosapiens.GRCh38.96.gtf.gz using the following:
Txdb <- makeTxDbFromGFF(file="../Homo_sapiens.GRCh38.96.gtf",
dataSource="ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz",
organism="Homo sapiens")
k <- keys(Txdb, keytype = "TXNAME")
tx2gene <- select(Txdb, k, "GENEID", "TXNAME")
head(tx2gene)
TXNAME GENEID
1 ENST00000456328 ENSG00000223972
2 ENST00000450305 ENSG00000223972
3 ENST00000473358 ENSG00000243485
4 ENST00000469289 ENSG00000243485
5 ENST00000607096 ENSG00000284332
6 ENST00000606857 ENSG00000268020
I then Import and I got the following:
txi <- tximport(files, type="salmon", tx2gene=tx2gene,ignoreTxVersion = T)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12
transcripts missing from tx2gene: 8508
summarizing abundance
summarizing counts
summarizing length
I am not sure why I am missign some transcripts.
also I have tried to characterize the identity of the missing transcript and I am worried I might lose some information
# load the input of salmon ranscript, remove the verison
input <- read_tsv(files[1])%>%
.[["Name"]]%>%
str_extract(pattern = "ENST\\d+")
# identify the transcript in the count but not in the lookup table
missing <- dplyr::setdiff(input,tx2gene$TXNAME)
length(missing)
[1] 8508
# get more info about the transcripts
missing_df <- getBM(attributes=c("ensembl_transcript_id","transcript_biotype","entrezgene_id","hgnc_symbol"),
filters="ensembl_transcript_id",
values=missing,
mart=ensembl)
# summarise the info
missing_df%>%
#group_by(gene_biotype)%>%
group_by(transcript_biotype)%>%
summarise(n = n())%>%
mutate(prop = n/sum(n))%>%
print(n=40)
# A tibble: 25 x 3
transcript_biotype n prop
<chr> <int> <dbl>
1 IG_C_gene 11 0.00122
2 IG_C_pseudogene 2 0.000221
3 IG_J_gene 5 0.000553
4 IG_J_pseudogene 1 0.000111
5 IG_V_gene 52 0.00575
6 IG_V_pseudogene 68 0.00752
7 lncRNA 1268 0.140
8 non_stop_decay 4 0.000443
9 nonsense_mediated_decay 468 0.0518
10 polymorphic_pseudogene 31 0.00343
11 processed_pseudogene 305 0.0338
12 protein_coding 4869 0.539
13 pseudogene 20 0.00221
14 retained_intron 1363 0.151
15 rRNA_pseudogene 4 0.000443
16 TR_C_gene 2 0.000221
17 TR_D_gene 1 0.000111
18 TR_J_gene 14 0.00155
19 TR_V_gene 37 0.00409
20 TR_V_pseudogene 11 0.00122
21 transcribed_processed_pseudogene 14 0.00155
22 transcribed_unitary_pseudogene 2 0.000221
23 transcribed_unprocessed_pseudogene 96 0.0106
24 unitary_pseudogene 2 0.000221
25 unprocessed_pseudogene 387 0.0428
here is my session info
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.40.0 GenomicFeatures_1.36.4 AnnotationDbi_1.45.1 tximportData_1.12.0
[5] tximport_1.12.3 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.0.1
[9] purrr_0.3.2 readr_1.3.1 tidyr_0.8.3 tibble_2.1.1
[13] ggplot2_3.1.1 tidyverse_1.2.1 DESeq2_1.24.0 SummarizedExperiment_1.14.0
[17] DelayedArray_0.10.0 BiocParallel_1.17.18 matrixStats_0.54.0 Biobase_2.43.1
[21] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 IRanges_2.17.6 S4Vectors_0.21.24
[25] BiocGenerics_0.29.2
loaded via a namespace (and not attached):
[1] nlme_3.1-139 bitops_1.0-6 lubridate_1.7.4 bit64_0.9-7 RColorBrewer_1.1-2
[6] progress_1.2.0 httr_1.4.0 tools_3.6.0 backports_1.1.4 utf8_1.1.4
[11] R6_2.4.0 rpart_4.1-15 Hmisc_4.2-0 DBI_1.0.0 lazyeval_0.2.2
[16] colorspace_1.4-1 nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3
[21] prettyunits_1.0.2 curl_3.3 bit_1.1-14 compiler_3.6.0 cli_1.1.0
[26] rvest_0.3.3 htmlTable_1.13.1 xml2_1.2.0 rtracklayer_1.44.1 scales_1.0.0
[31] checkmate_1.9.1 genefilter_1.66.0 digest_0.6.18 Rsamtools_2.0.0 foreign_0.8-71
[36] XVector_0.23.2 base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6 readxl_1.3.1
[41] htmlwidgets_1.3 rlang_0.3.4 rstudioapi_0.10 RSQLite_2.1.1 generics_0.0.2
[46] jsonlite_1.6 acepack_1.4.1 RCurl_1.95-4.12 magrittr_1.5 GenomeInfoDbData_1.2.1
[51] Formula_1.2-3 Matrix_1.2-17 fansi_0.4.0 Rcpp_1.0.1 munsell_0.5.0
[56] stringi_1.4.3 zlibbioc_1.29.0 plyr_1.8.4 grid_3.6.0 blob_1.1.1
[61] crayon_1.3.4 lattice_0.20-38 Biostrings_2.51.5 haven_2.1.0 splines_3.6.0
[66] annotate_1.62.0 hms_0.4.2 locfit_1.5-9.1 knitr_1.22 pillar_1.3.1
[71] geneplotter_1.62.0 XML_3.98-1.19 glue_1.3.1 latticeExtra_0.6-28 modelr_0.1.4
[76] data.table_1.12.2 cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 xfun_0.6
[81] xtable_1.8-4 broom_0.5.2 survival_2.44-1.1 GenomicAlignments_1.20.1 memoise_1.1.0
[86] cluster_2.0.8
Instead of generating the tx2gene file from GTF file or ensembldb, why not generate it from the reference transcriptome from which the index was generated. I encountered the same missing transcript problem while working with kallisto and this is a probable solution I found.
Great tip ag1805x. I added a modified version of this shell script to my koopa library, which outputs a tx2gene.csv file: https://github.com/acidgenomics/koopa/blob/master/bin/tx2gene-from-ensembl-fasta
Also, a bit unrelated but potentially relevant: bcbio-nextgen supports salmon and kallisto, and will output a correct tx2gene.csv mapping file automatically. Our bcbioRNASeq R package will parse this output automatically and load up at either transcript or gene-level in R, using tximport internally.
thank you very much for the feedback, I will run everything again using Gencode sets.