Separate exogenous from endogenous transcripts using Salmon RNAseq DTU
Entering edit mode
ksong ▴ 10
Last seen 24 days ago
United States

Dear friends,

We are trying to use Salmon for DTU analysis. We want to separate exogenous from endogenous transcripts by following this post and this paper

We are focusing on a gene called ASCL1 (endo-ASCL1). We transduced cells with lentiviral vector containing ASCL1 ORF only (Lenti-ASCL1). There should not be any 5'3' UTR on this ASCL1. We added Lenti-ASCL1 to genecode reference.

However, in the result, we still see large amount of Lenti-ASCL1 and endo-ASCL1 transcripts in some untranduced cells, especially when endo-ASCL1 is high. Looks like high endo-ASCL1 can bring the Lenti-ASCL1 level high as well even though the sample is not transduced.

Have you seen this before? What would be a better way to deconvolute ASCL1 quantification using Salmon? to better separate endo-ASCL1 from Lenti-ASCL1? Are there any parameters we should tune to achieve this?

Thank you so much for your help!


Here is the code we ran.

#in linux shell
#get the reference geneome for salmon

#add lenti ASCL1 to this genome
cat ASCL1_transcript.fa >> gencode.v39.transcripts.fa 

#use salmon to index transcrpt reference
salmon index -t gencode.v39.transcripts.fa -i gencode.index

#run salmon
for fn in $files;
echo "Processing sample ${samp}"
salmon quant -i gencode.index -l IU \
         -1 ${sample_dir}/$fn"_R1_001.fastq.gz" \
         -2 ${sample_dir}/$fn"_R2_001.fastq.gz" \
         -p 20 --gcBias -o ${workdir}/quant/${fn}_quant

#in R

library(tidyverse, exclude = "select")
#DTU analysis
samples<-data.frame(sample_id = c("1_S1_quant",  "5_S5_quant", "6_S6_quant"),
                    condition = c('s1_no_lenti-ASCL1',

files <- file.path("dtu/quant",
                   samples$sample_id, "quant.sf")
names(files) <- samples$sample_id

#Use tximport to read in transcript counts
txi <- tximport(files, type="salmon", txOut=T,
cts <- txi$counts
cts <- cts[rowSums(cts) > 0,]
rownames(cts)<-sub("\\|.*", "", rownames(cts))

gtf <- "gencode.v39.annotation.gtf.gz"
txdb.filename <- "gencode.v39.annotation.sqlite"

  expr = {
  error = function(e){ 
    txdb <- makeTxDbFromGFF(gtf)
    saveDb(txdb, txdb.filename)

txdf <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")

ASCL1_transcript<-data.frame(GENEID = "ENSG00000139352.4", TXNAME = "FU-ASCL1-CGW")
txdf<-rbind(txdf, ASCL1_transcript)

tab <- table(txdf$GENEID)
txdf$ntx <- tab[match(txdf$GENEID, names(tab))]
txdf <- txdf[match(rownames(cts),txdf$TXNAME),]
all(rownames(cts) == txdf$TXNAME)

counts <- data.frame(gene_id=txdf$GENEID,

#get ascl1
ascl1_counts<-counts[counts$gene_id == "ENSG00000139352.4",] %>% 
  .[,-(1:2)] %>% 
  (function(x, add=1){x + add}) %>% 
  log2() %>%
  t %>% %>% 


       sample ENST00000266744.4 FU-ASCL1-CGW
1 X1_S1_quant          3.459432      0.00000
2 X5_S5_quant         12.715446      7.85213
3 X6_S6_quant         13.197477     10.85678

the ENST00000266744.4 should be endo-ASCL1 transcript FU-ASCL1-CGW is the lenti-ASCL1 transcript

X1_S1_quant is the cell line with low endo ASCL1. The lenti-ASCL1 has zero counts, which should be correct.

X5_S5_quant is the cell line with high endo ASCL1. The problem is we don't know why the lenti-ASCL1 has non zero value. And the value is not low.

X6_S6_quant is the cell line with both high endo-ASCL1 and high lenti-ASCL.

sessionInfo( )

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reshape2_1.4.4              GenomicFeatures_1.44.2      tximport_1.20.0             rnaseqDTU_1.12.0            devtools_2.4.3             
 [6] usethis_2.1.5               rafalib_1.0.0               edgeR_3.34.1                limma_3.48.3                stageR_1.14.0              
[11] DEXSeq_1.38.0               RColorBrewer_1.1-2          AnnotationDbi_1.54.1        DESeq2_1.32.0               SummarizedExperiment_1.22.0
[16] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4         IRanges_2.26.0              S4Vectors_0.30.2            MatrixGenerics_1.4.3       
[21] matrixStats_0.61.0          Biobase_2.52.0              BiocGenerics_0.38.0         BiocParallel_1.26.2         DRIMSeq_1.20.0             
[26] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.8                 purrr_0.3.4                 readr_2.1.2                
[31] tidyr_1.2.0                 tibble_3.1.6                ggplot2_3.3.5               tidyverse_1.3.1            

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3         rjson_0.2.21             hwriter_1.3.2            ellipsis_0.3.2           rprojroot_2.0.2          XVector_0.32.0          
  [7] fs_1.5.2                 rstudioapi_0.13          remotes_2.4.2            bit64_4.0.5              fansi_1.0.2              lubridate_1.8.0         
 [13] xml2_1.3.3               splines_4.1.2            cachem_1.0.6             geneplotter_1.70.0       pkgload_1.2.4            jsonlite_1.8.0          
 [19] Rsamtools_2.8.0          broom_0.7.12             annotate_1.70.0          dbplyr_2.1.1             png_0.1-7                compiler_4.1.2          
 [25] httr_1.4.2               backports_1.4.1          assertthat_0.2.1         Matrix_1.4-0             fastmap_1.1.0            cli_3.2.0               
 [31] prettyunits_1.1.1        tools_4.1.2              gtable_0.3.0             glue_1.6.2               GenomeInfoDbData_1.2.6   rappdirs_0.3.3          
 [37] Rcpp_1.0.8.3             cellranger_1.1.0         vctrs_0.3.8              Biostrings_2.60.2        rtracklayer_1.52.1       brio_1.1.3              
 [43] ps_1.6.0                 testthat_3.1.2           rvest_1.0.2              lifecycle_1.0.1          restfulr_0.0.13          statmod_1.4.36          
 [49] XML_3.99-0.9             zlibbioc_1.38.0          scales_1.1.1             vroom_1.5.7              hms_1.1.1                yaml_2.3.5              
 [55] curl_4.3.2               memoise_2.0.1            biomaRt_2.48.3           stringi_1.7.6            RSQLite_2.2.10           genefilter_1.74.1       
 [61] BiocIO_1.2.0             desc_1.4.1               filelock_1.0.2           pkgbuild_1.3.1           rlang_1.0.2              pkgconfig_2.0.3         
 [67] bitops_1.0-7             lattice_0.20-45          GenomicAlignments_1.28.0 bit_4.0.4                tidyselect_1.1.2         processx_3.5.2          
 [73] plyr_1.8.6               magrittr_2.0.2           R6_2.5.1                 generics_0.1.2           DelayedArray_0.18.0      DBI_1.1.2               
 [79] pillar_1.7.0             haven_2.4.3              withr_2.5.0              survival_3.3-1           KEGGREST_1.32.0          RCurl_1.98-1.6          
 [85] modelr_0.1.8             crayon_1.5.0             utf8_1.2.2               BiocFileCache_2.0.0      tzdb_0.2.0               progress_1.2.2          
 [91] locfit_1.5-9.5           grid_4.1.2               readxl_1.3.1             blob_1.2.2               callr_3.7.0              reprex_2.0.1            
 [97] digest_0.6.29            xtable_1.8-4             munsell_0.5.0            sessioninfo_1.2.2
rnaseqDTU • 340 views
Entering edit mode

What sequence are you using for the transgene transcript you are adding to the reference? Is it just the CDS? The transgene will have a 5' and 3' UTR of some sort, even if its just a Kozak sequence on the 5' end and a SV40 polyA site on the 3' end. Salmon might have a hard time distinguishing CDS from CDS+UTR, particularly if the UTR is not quite correctly annotated in the reference (which is not uncommon). For example if the normally used UTR is shorter than the annotated one (which is often the case), then salmon might have to split the difference between assigning the read to the isoform with the UTR and without. This might be particularly problematic if the UTR contains sequence that looks like sequence in other genome locations that reads could be assigned to. If you look at a read alignment for samples given high transgene expression that shouldn't have the transgene, do you see a consistent, expression level across the UTR, or do you see that coverage is higher across the CDS?

Entering edit mode

Hi Dr Sudbery, yes I found looking at the coverage in IGV helps a lot. I did find some differences between CDS and UTR and between samples. The conclusion is more clear when I normalized the data by sequencing depth using deepTools. Thank you so much.

Entering edit mode
Last seen 1 day ago
United States

Hi Kai,

These read counts are pretty low right? 7-10 read counts for the lenti transcript. These are below the range that is filtered by many pipelines, so it’s hard to draw conclusions about what is present or not.

I’d recommend running Salmon with 30 bootstraps or Gibbs replicates (options numBootstraps or numGibbsSamples in Salmon). Then import with tximeta(…). It will allow you to visualize the quantification uncertainty per transcript.

Entering edit mode

Hi Michael, Thank you for your reply. Sorry for the confusion. The counts have already been log2 transformed. That is why the counts look low. We can try numBootstraps or numGibbsSamples.

Entering edit mode

I see, yes, the inferential uncertainty may help explain. Also we have been working in our past couple of papers on showing that it’s quite important to use the inferential uncertainty for sub-gene-level analysis. The point estimates can mislead.


Login before adding your answer.

Traffic: 456 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6