Question

tximport transcript missing

1

Entering edit mode

mr6518849 ▴ 10

@mr6518849-21434

Last seen 6.5 years ago

Hello, I have performed the quantification of the transcripts following salmon workflow, but I do not understand why I am missing some transcripts in the tximort() call.

Here is my workflow: Transcriprome used for building the index:

hsa_GRC38h.fa.gz ftp://ftp.ensembl.org/pub/release-96/fasta/homosapiens/cdna/Homosapiens.GRCh38.cdna.all.fa.gz

Built the index of the transcriptome using:

salmon index -t hsa_GRC38h.fa.gz -i hsas_GRCh38_index

Quantified the transcripts using:

salmon quant -i hsas_GRCh38_index -l A \
-1 ${samp}_L00{1,2,3,4}_R1_001.fastq.gz \
-2 ${samp}_L00{1,2,3,4}_R2_001.fastq.gz \
-p 8 --validateMappings --gcBias -o quants_gcBias/${i}_quant

then I moved to R version 3.6.0. The quant.sf files looks ok to me:

read_tsv(files[1])
# A tibble: 177,034 x 5
   Name              Length EffectiveLength   TPM NumReads
   <chr>              <dbl>           <dbl> <dbl>    <dbl>
 1 ENST00000631435.1     12              13     0        0
 2 ENST00000434970.2      9              10     0        0
 3 ENST00000448914.1     13              13     0        0
 4 ENST00000415118.1      8               9     0        0
 5 ENST00000604446.1     23              24     0        0
 6 ENST00000603693.1     19              19     0        0
 7 ENST00000603935.1     31              32     0        0
 8 ENST00000604102.1     31              32     0        0
 9 ENST00000604838.1     17              17     0        0
10 ENST00000439842.1     11              12     0        0
# ... with 177,024 more rows

I build the tx2gene file from the GTF file of the same version of the transcriptome Homo_sapiens.GRCh38.96.gtf ftp://ftp.ensembl.org/pub/release-96/gtf/homosapiens/Homosapiens.GRCh38.96.gtf.gz using the following:

Txdb <- makeTxDbFromGFF(file="../Homo_sapiens.GRCh38.96.gtf",
                        dataSource="ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz",
                        organism="Homo sapiens")
k <- keys(Txdb, keytype = "TXNAME")
tx2gene <- select(Txdb, k, "GENEID", "TXNAME")
head(tx2gene)
           TXNAME          GENEID
1 ENST00000456328 ENSG00000223972
2 ENST00000450305 ENSG00000223972
3 ENST00000473358 ENSG00000243485
4 ENST00000469289 ENSG00000243485
5 ENST00000607096 ENSG00000284332
6 ENST00000606857 ENSG00000268020

I then Import and I got the following:

txi <- tximport(files, type="salmon", tx2gene=tx2gene,ignoreTxVersion = T)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 
transcripts missing from tx2gene: 8508
summarizing abundance
summarizing counts
summarizing length

I am not sure why I am missign some transcripts.

also I have tried to characterize the identity of the missing transcript and I am worried I might lose some information

# load the input of salmon ranscript, remove the verison
input <- read_tsv(files[1])%>%
  .[["Name"]]%>%
  str_extract(pattern = "ENST\\d+")
# identify the transcript in the count but not in the lookup table
missing <- dplyr::setdiff(input,tx2gene$TXNAME)
length(missing)
[1] 8508
# get more info about the transcripts
missing_df <- getBM(attributes=c("ensembl_transcript_id","transcript_biotype","entrezgene_id","hgnc_symbol"),
      filters="ensembl_transcript_id",
      values=missing,
      mart=ensembl)
# summarise the info
missing_df%>%
  #group_by(gene_biotype)%>%
  group_by(transcript_biotype)%>%
  summarise(n = n())%>%
  mutate(prop = n/sum(n))%>%
  print(n=40)
# A tibble: 25 x 3
   transcript_biotype                     n     prop
   <chr>                              <int>    <dbl>
 1 IG_C_gene                             11 0.00122 
 2 IG_C_pseudogene                        2 0.000221
 3 IG_J_gene                              5 0.000553
 4 IG_J_pseudogene                        1 0.000111
 5 IG_V_gene                             52 0.00575 
 6 IG_V_pseudogene                       68 0.00752 
 7 lncRNA                              1268 0.140   
 8 non_stop_decay                         4 0.000443
 9 nonsense_mediated_decay              468 0.0518  
10 polymorphic_pseudogene                31 0.00343 
11 processed_pseudogene                 305 0.0338  
12 protein_coding                      4869 0.539   
13 pseudogene                            20 0.00221 
14 retained_intron                     1363 0.151   
15 rRNA_pseudogene                        4 0.000443
16 TR_C_gene                              2 0.000221
17 TR_D_gene                              1 0.000111
18 TR_J_gene                             14 0.00155 
19 TR_V_gene                             37 0.00409 
20 TR_V_pseudogene                       11 0.00122 
21 transcribed_processed_pseudogene      14 0.00155 
22 transcribed_unitary_pseudogene         2 0.000221
23 transcribed_unprocessed_pseudogene    96 0.0106  
24 unitary_pseudogene                     2 0.000221
25 unprocessed_pseudogene               387 0.0428

here is my session info

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.40.0              GenomicFeatures_1.36.4      AnnotationDbi_1.45.1        tximportData_1.12.0        
 [5] tximport_1.12.3             forcats_0.4.0               stringr_1.4.0               dplyr_0.8.0.1              
 [9] purrr_0.3.2                 readr_1.3.1                 tidyr_0.8.3                 tibble_2.1.1               
[13] ggplot2_3.1.1               tidyverse_1.2.1             DESeq2_1.24.0               SummarizedExperiment_1.14.0
[17] DelayedArray_0.10.0         BiocParallel_1.17.18        matrixStats_0.54.0          Biobase_2.43.1             
[21] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0         IRanges_2.17.6              S4Vectors_0.21.24          
[25] BiocGenerics_0.29.2        

loaded via a namespace (and not attached):
 [1] nlme_3.1-139             bitops_1.0-6             lubridate_1.7.4          bit64_0.9-7              RColorBrewer_1.1-2      
 [6] progress_1.2.0           httr_1.4.0               tools_3.6.0              backports_1.1.4          utf8_1.1.4              
[11] R6_2.4.0                 rpart_4.1-15             Hmisc_4.2-0              DBI_1.0.0                lazyeval_0.2.2          
[16] colorspace_1.4-1         nnet_7.3-12              withr_2.1.2              tidyselect_0.2.5         gridExtra_2.3           
[21] prettyunits_1.0.2        curl_3.3                 bit_1.1-14               compiler_3.6.0           cli_1.1.0               
[26] rvest_0.3.3              htmlTable_1.13.1         xml2_1.2.0               rtracklayer_1.44.1       scales_1.0.0            
[31] checkmate_1.9.1          genefilter_1.66.0        digest_0.6.18            Rsamtools_2.0.0          foreign_0.8-71          
[36] XVector_0.23.2           base64enc_0.1-3          pkgconfig_2.0.2          htmltools_0.3.6          readxl_1.3.1            
[41] htmlwidgets_1.3          rlang_0.3.4              rstudioapi_0.10          RSQLite_2.1.1            generics_0.0.2          
[46] jsonlite_1.6             acepack_1.4.1            RCurl_1.95-4.12          magrittr_1.5             GenomeInfoDbData_1.2.1  
[51] Formula_1.2-3            Matrix_1.2-17            fansi_0.4.0              Rcpp_1.0.1               munsell_0.5.0           
[56] stringi_1.4.3            zlibbioc_1.29.0          plyr_1.8.4               grid_3.6.0               blob_1.1.1              
[61] crayon_1.3.4             lattice_0.20-38          Biostrings_2.51.5        haven_2.1.0              splines_3.6.0           
[66] annotate_1.62.0          hms_0.4.2                locfit_1.5-9.1           knitr_1.22               pillar_1.3.1            
[71] geneplotter_1.62.0       XML_3.98-1.19            glue_1.3.1               latticeExtra_0.6-28      modelr_0.1.4            
[76] data.table_1.12.2        cellranger_1.1.0         gtable_0.3.0             assertthat_0.2.1         xfun_0.6                
[81] xtable_1.8-4             broom_0.5.2              survival_2.44-1.1        GenomicAlignments_1.20.1 memoise_1.1.0           
[86] cluster_2.0.8

deseq2 salmon tximport • 6.3k views

ADD COMMENT • link 6.5 years ago mr6518849 ▴ 10

score 2 · Answer 1 · 2019-07-23

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

Some of the transcripts in Ensembl's FASTA files are not present in the GTF. I've seen this as well. I don't have any particular recommendation. I tend to work with Gencode FASTA and GTF for this reason, and Gencode doesn't have duplicate IDs for genes that fall in regions that have multiple haplotypes.

If you want to keep these extra transcripts after gene summarization, you can put their ID's in as rows of tx2gene so they pass through from transcript to gene level.

ADD COMMENT • link 6.5 years ago Michael Love 43k

2

Entering edit mode

Instead of generating the tx2gene file from GTF file or ensembldb, why not generate it from the reference transcriptome from which the index was generated. I encountered the same missing transcript problem while working with kallisto and this is a probable solution I found.

ADD REPLY • link 6.4 years ago Arindam ▴ 80

0

Entering edit mode

Great tip ag1805x. I added a modified version of this shell script to my koopa library, which outputs a tx2gene.csv file: https://github.com/acidgenomics/koopa/blob/master/bin/tx2gene-from-ensembl-fasta

ADD REPLY • link 6.3 years ago Michael Steinbaugh ▴ 90

0

Entering edit mode

Also, a bit unrelated but potentially relevant: bcbio-nextgen supports salmon and kallisto, and will output a correct tx2gene.csv mapping file automatically. Our bcbioRNASeq R package will parse this output automatically and load up at either transcript or gene-level in R, using tximport internally.

ADD REPLY • link 6.3 years ago Michael Steinbaugh ▴ 90

0

Entering edit mode

thank you very much for the feedback, I will run everything again using Gencode sets.

ADD REPLY • link 6.5 years ago mr6518849 ▴ 10