Hi,
I have an issue using tximport with kallisto output. I never had this issue before, and I can't figure out why it is happening. When applying it on a large dataset (over 3000 transcriptome mapped cells from single-cell RNAseq data), some genes from a subset of the cells (1/4 of the cells) have NA values. When I checked the counts of these genes in the corresponding abundance.tsv files, some of them seem to be expressed at high levels (around 1000 counts). Many of these genes also have only 1 transcript.
The code I'm running is the following:
txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene, importer = read_tsv)
I also tried importing with rhdf5, and it didn't change in anything.
My sessionInfo() is the following:
sessionInfo() R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] readr_1.1.1 tximport_1.4.0 magrittr_1.5 loaded via a namespace (and not attached): [1] zlibbioc_1.22.0 compiler_3.4.1 plyr_1.8.4 R6_2.2.2 hms_0.3 tools_3.4.1 [7] rhdf5_2.20.0 reshape2_1.4.2 tibble_1.3.4 Rcpp_0.12.13 stringi_1.1.5 stringr_1.2.0 [13] rlang_0.1.4
Any help to fix this issue is highly appreciated!
Thanks for your help!
Best,
Leon

“I also tried importing with
rhdf5, and it didn't change in anything.”When you tried this, did you specify ‘importer’? If so, can you try not specifying importer, but using our built-in methods for importing kallisto h5 files?
I first didn't specify
importer. I had previously used an earlier release oftximport(which had thereaderargument). However, this time with the latest release, thereaderargument was giving me an error. Therefore, I tried to executetximportwith the default arguments, and a message popped telling me that I have to importrhdf5so thattximportfunctions in its default settings. So I did that and this is when I realised that I hadNAvalues. Suspecting that probably something wrong happened with theh5files, I tried usingread_tsvas an alternative approach and got the exact same output.Ok so it's probably not to do with importer, but if you want to import much faster you can install rhdf5:
biocLite("rhdf5")And then try without specifying 'importer'. You will still get NA's with this approach, and I'll comment again with more to diagnose this.
Indeed, and as I mentioned earlier, when I tried using
rhdf5without specifyingimporter, I always gotNAvalues. However, regarding the speed, I didn't feel that it was loading faster. But this was just my feeling because I didn't test for speed explicitly.I wonder if this could have to do with the automatic determination of the column types by read_tsv. By default I think it uses the first 1,000 values to determine the column type, and if these values are all 0 it may try to read the whole column as integers. This caused problems for me once (albeit not exactly the same problem as yours). Could you try to specify another importer function, e.g.
```
read_tsv2 <- function(...) readr::read_tsv(..., progress = FALSE,
col_types = list(
target_id = col_character(),
length = col_integer(),
eff_length = col_double(),
est_counts = col_double(),
tpm = col_double()
))
```
Thanks Charlotte, if this is the problem, using rhdf5 should avoid the NA. This is the importer that is used if the rhdf5 package is detected.
Another curious thing is that Leon reports no NA when txOut=TRUE, which wouldn’t happen i think if the importer was the issue.
True, but that was only for 50 cells, for which txOut=FALSE also worked, so I guess it is not yet clear whether there is a set of cells for which txOut=TRUE works but txOut=FALSE doesn't.
I'm wondering, if it's the importer's fault, why would more or less files make a difference?
Dear Michael and Charlotte,
Thank you for your much appreciated help. The solution Charlotte provided solved my issue. Also, regarding importing
h5files, I did a mistake when trying to load these files, since I have specifiedabundance.tsvfiles totximportinstead ofh5files. Directly loadingh5files using the default settings oftximportdoes not giveNAvalues. My complete script for this is the following:# Load magritrr if(!require(magrittr)) { install.packages("magrittr") library(magrittr) } # Load tximport if(!require(tximport)) { source("https://bioconductor.org/biocLite.R") biocLite("tximport") library(tximport) } # Load rhdf5 if(!require(rhdf5)) { source("https://bioconductor.org/biocLite.R") biocLite("rhdf5") library(rhdf5) } # Load readr if(!require(readr)) { install.packages("readr") library(readr) } # Set working directory setwd("~") # Load transcript to genename table tx2gene <- read.table("../hGRCh28_transcriptome/Homo_sapiens.GRCh38.cdna.all.idtable.txt", header = T) %>% subset(select = -GENEID) # List directory where kallisto files are located dir <- list.dirs(path = ".", recursive = F, full.names = F) %>% grep("kallisto_output", ., value = T) # Set working directory to kallisto folder setwd(dir) # List directories in kallisto folder accessions <- list.dirs(path = ".", recursive = F, full.names = F) #### OLD VERSION #### # List abundance files from accession directories abundance.files <- file.path(".", accessions, "abundance.tsv") # Name files by sample names names(abundance.files) <- sub(pattern = "./", replacement = "", x = abundance.files) %>% sub(pattern = "-kallisto_output/abundance.tsv", replacement = "", x = .) # Wrap kallisto counts in a tximport object using default settings txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene) # Wrap kallisto counts in a tximport object using read_tsv txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene, importer = read_tsv) #### NEW VERSION with rhdf5 #### # List abundance files from accession directories abundance.files <- file.path(".", accessions, "abundance.h5") # Name files by sample names names(abundance.files) <- sub(pattern = "./", replacement = "", x = abundance.files) %>% sub(pattern = "-kallisto_output/abundance.h5", replacement = "", x = .) # Wrap kallisto counts in a tximport object using default settings txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene) #### NEW VERSION with read_tsv2 #### # Define new read_tsv function # Adapted from Charlotte # https://support.bioconductor.org/p/103126/#103146 read_tsv2 <- function(...) readr::read_tsv(..., progress = FALSE, col_types = list( target_id = col_character(), length = col_integer(), eff_length = col_double(), est_counts = col_double(), tpm = col_double() )) # List abundance files from accession directories abundance.files <- file.path(".", accessions, "abundance.tsv") # Name files by sample names names(abundance.files) <- sub(pattern = "./", replacement = "", x = abundance.files) %>% sub(pattern = "-kallisto_output/abundance.tsv", replacement = "", x = .) # Wrap kallisto counts in a tximport object using the new read_tsv2 function txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene, importer = read_tsv2)Please let me know if I need to provide further information so that others that might face a similar issue be able to troubleshoot it quite straightforwardly.
Best,
Leon
Thank Leon for the useful feedback here.
We can either have a specific read_tsv importer for each method 'type' that explicitly sets the col_types, or I could set guess_max=Inf so that read_tsv uses the full file for setting col_types. The first approach should be faster, so I'll probably go with this.
At this point I can debug on my own now that we have the source of the NAs, and will put a fix into the devel branch.
The transcripts FASTA file was directly downloaded from Ensembl. The file can be found here. The file corresponds to Homo_sapiens.GRCh38.cdna.all.fa.gz. No
Rcode was used to generate thetx2genefile.What we did is to retrieve from the FASTA file (which is exactly the same file that was used to generate the kallisto index) all transcript IDs. These IDs were then used to retrieve gene IDs and gene names from Biomart.What we did is to retrieve from the FASTA file (which is exactly the same file that was used to generate the kallisto index) all transcript and gene IDs, as well as gene names. A custom madepythonscript was used to retrieve all these information. Thetx2genefile that was provided to the code includes only transcript IDs and gene names. The issue is not at all related to thetx2genefile, because all transcripts have a corresponding gene name. Moreover, the same genes that showNAvalues in some cells have actual count values in other cells.# Load magritrr if(!require(magrittr)) { install.packages("magrittr") library(magrittr) } # Load transcript to genename table and remove gene ID column tx2gene <- read.table("../hGRCh28_transcriptome/Homo_sapiens.GRCh38.cdna.all.idtable.txt", header = T) %>% subset(select = -GENEID) # Check tx2gene table head(tx2gene) TXNAME GENENAME 1 ENST00000434970.2 TRDD2 2 ENST00000448914.1 TRDD3 3 ENST00000415118.1 TRDD1 4 ENST00000631435.1 AC239618.6 5 ENST00000632684.1 AC245427.8 6 ENST00000454908.1 IGHD1-1 # Check NAs in gene names tx2gene$GENENAME %>% is.na() %>% any() [1] FALSEBest,
Leon
Any NA if you set txOut=TRUE?
When I set
txOut = TRUE, I get noNAvalues. However, I didn't run this code on all my cells, since this is time consuming. I first identified all cells havingNAvalues, then selected 50 of them and ran the code. Now if I settxOut = FALSEby using only this same subset of cells (i.e. the 50 selected cells), I still don't getNAvalues. Any hints on what could have happened when taking all 3000 cells together?Best,
Leon
My guess is that the
tx2genedoesn't line up with the transcripts in the FASTA file, and hence the quantifications. Biomart is convenient but using it makes it sometimes difficult to know which version you got. Can you try this instead:wget ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz
db <- ensDbFromGtf("Homo_sapiens.GRCh38.90.gtf.gz") txs <- transcripts(db, return.type="DataFrame")Then construct a tx2gene table from 'txs' above.
Dear Michael,
I still don't understand why you think this is related to
tx2gene. Many cells from the same dataset (3/4 of the cells) and also across multiple other datasets having fewer cells don't show this problem at all (using the sametx2genefile). Also, and as I mentioned it C: tximport returns NA counts, when usingtximportwith only a subset of the cells showingNAvalues, this problem doe not emerge at all (i.e. the same genes that previously hadNAvalues now show actual counts). Also, when I checked many of these genes individually by tracing back their transcript ID (which was unique in some cases), and then searched for this transcript in the correspondingabundance.tsvfiles, I actually found count values and notNA. The transcript ID matched perfectly the ID in theabundance.tsvfiles! In my view, everything I tried points to the fact that it is not a problem of concordance between the files, but an issue with the internal workings oftximport, that might not support very large datasets. Doestximporttry to correct the counts of cells (or samples more generally) having a high variance in gene expression? Also, do you think that this issue might be fixed any time soon?Thank you for your help!
Best,
Leon
I’d be happy to fix the issue but we need to identify the source of the NA. tximport is simply summing the counts from transcripts associated with a gene in tx2gene. Can you please email me three objects saved in an RData file: an example of some columns that contain NA values, an example of the output for the same samples when txOut=TRUE, and tx2gene? I’ll report back here if I find anything.
In version 1.7.3, I hard code the
col_typesto avoid this error:https://github.com/mikelove/tximport/commit/f80fcaac7411ae590688c237088072a313772668
Thanks for your bug report