Search
Question: tximport returns NA counts
0
6 months ago by
leonfodoulian0 wrote:

Hi,

I have an issue using tximport with kallisto output. I never had this issue before, and I can't figure out why it is happening. When applying it on a large dataset (over 3000 transcriptome mapped cells from single-cell RNAseq data), some genes from a subset of the cells (1/4 of the cells) have NA values. When I checked the counts of these genes in the corresponding abundance.tsv files, some of them seem to be expressed at high levels (around 1000 counts). Many of these genes also have only 1 transcript.

The code I'm running is the following:

txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene, importer = read_tsv)

I also tried importing with rhdf5, and it didn't change in anything.

My sessionInfo() is the following:

sessionInfo()

R version 3.4.1 (2017-06-30)

Platform: x86_64-apple-darwin15.6.0 (64-bit)

Running under: macOS Sierra 10.12.6

Matrix products: default

BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib

LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:

[1] C

attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):

[1] zlibbioc_1.22.0 compiler_3.4.1  plyr_1.8.4      R6_2.2.2        hms_0.3         tools_3.4.1

[7] rhdf5_2.20.0    reshape2_1.4.2  tibble_1.3.4    Rcpp_0.12.13    stringi_1.1.5   stringr_1.2.0

[13] rlang_0.1.4  

Any help to fix this issue is highly appreciated!

Best,

Leon

modified 6 months ago • written 6 months ago by leonfodoulian0
1

“I also tried importing with rhdf5, and it didn't change in anything.”

When you tried this, did you specify ‘importer’? If so, can you try not specifying importer, but using our built-in methods for importing kallisto h5 files?

I first didn't specify importer. I had previously used an earlier release of tximport (which had the reader argument). However, this time with the latest release, the reader argument was giving me an error. Therefore, I tried to execute tximport with the default arguments, and a message popped telling me that I have to import rhdf5 so that tximport functions in its default settings. So I did that and this is when I realised that I had NA values. Suspecting that probably something wrong happened with the h5 files, I tried using read_tsv as an alternative approach and got the exact same output.

Ok so it's probably not to do with importer, but if you want to import much faster you can install rhdf5:

biocLite("rhdf5")

And then try without specifying 'importer'. You will still get NA's with this approach, and I'll comment again with more to diagnose this.

Indeed, and as I mentioned earlier, when I tried using rhdf5 without specifying importer, I always got NA values. However, regarding the speed, I didn't feel that it was loading faster. But this was just my feeling because I didn't test for speed explicitly.

1

I wonder if this could have to do with the automatic determination of the column types by read_tsv. By default I think it uses the first 1,000 values to determine the column type, and if these values are all 0 it may try to read the whole column as integers. This caused problems for me once (albeit not exactly the same problem as yours). Could you try to specify another importer function, e.g.



col_types = list(
target_id = col_character(),
length = col_integer(),
eff_length = col_double(),
est_counts = col_double(),
tpm = col_double()
))



Thanks Charlotte, if this is the problem, using rhdf5 should avoid the NA. This is the importer that is used if the rhdf5 package is detected.

Another curious thing is that Leon reports no NA when txOut=TRUE, which wouldn’t happen i think if the importer was the issue.

True, but that was only for 50 cells, for which txOut=FALSE also worked, so I guess it is not yet clear whether there is a set of cells for which txOut=TRUE works but txOut=FALSE doesn't.

I'm wondering, if it's the importer's fault, why would more or less files make a difference?

Dear Michael and Charlotte,

Thank you for your much appreciated help. The solution Charlotte provided solved my issue. Also, regarding importing h5 files, I did a mistake when trying to load these files, since I have specified abundance.tsv files to tximport instead of h5 files. Directly loading h5 files using the default settings of tximport does not give NA values. My complete script for this is the following:

# Load magritrr
if(!require(magrittr)) {
install.packages("magrittr")
library(magrittr)
}

if(!require(tximport)) {
source("https://bioconductor.org/biocLite.R")
biocLite("tximport")
library(tximport)
}

if(!require(rhdf5)) {
source("https://bioconductor.org/biocLite.R")
biocLite("rhdf5")
library(rhdf5)
}

}

# Set working directory
setwd("~")

# Load transcript to genename table

# List directory where kallisto files are located
dir <- list.dirs(path = ".", recursive = F, full.names = F) %>% grep("kallisto_output", ., value = T)

# Set working directory to kallisto folder
setwd(dir)

# List directories in kallisto folder
accessions <- list.dirs(path = ".", recursive = F, full.names = F)

#### OLD VERSION ####

# List abundance files from accession directories
abundance.files <- file.path(".", accessions, "abundance.tsv")

# Name files by sample names
names(abundance.files) <- sub(pattern = "./", replacement = "", x = abundance.files) %>% sub(pattern = "-kallisto_output/abundance.tsv", replacement = "", x = .)

# Wrap kallisto counts in a tximport object using default settings
txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene)

# Wrap kallisto counts in a tximport object using read_tsv
txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene, importer = read_tsv)

#### NEW VERSION with rhdf5 ####

# List abundance files from accession directories
abundance.files <- file.path(".", accessions, "abundance.h5")

# Name files by sample names
names(abundance.files) <- sub(pattern = "./", replacement = "", x = abundance.files) %>% sub(pattern = "-kallisto_output/abundance.h5", replacement = "", x = .)

# Wrap kallisto counts in a tximport object using default settings
txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene)

#### NEW VERSION with read_tsv2 ####

# https://support.bioconductor.org/p/103126/#103146

col_types = list(
target_id = col_character(),
length = col_integer(),
eff_length = col_double(),
est_counts = col_double(),
tpm = col_double()
))

# List abundance files from accession directories
abundance.files <- file.path(".", accessions, "abundance.tsv")

# Name files by sample names
names(abundance.files) <- sub(pattern = "./", replacement = "", x = abundance.files) %>% sub(pattern = "-kallisto_output/abundance.tsv", replacement = "", x = .)

# Wrap kallisto counts in a tximport object using the new read_tsv2 function
txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene, importer = read_tsv2)

Please let me know if I need to provide further information so that others that might face a similar issue be able to troubleshoot it quite straightforwardly.

Best,

Leon

Thank Leon for the useful feedback here.

We can either have a specific read_tsv importer for each method 'type' that explicitly sets the col_types, or I could set guess_max=Inf so that read_tsv uses the full file for setting col_types. The first approach should be faster, so I'll probably go with this.

At this point I can debug on my own now that we have the source of the NAs, and will put a fix into the devel branch.

The transcripts FASTA file was directly downloaded from Ensembl. The file can be found here. The file corresponds to Homo_sapiens.GRCh38.cdna.all.fa.gz. No R code was used to generate the tx2gene file. What we did is to retrieve from the FASTA file (which is exactly the same file that was used to generate the kallisto index) all transcript IDs. These IDs were then used to retrieve gene IDs and gene names from Biomart. What we did is to retrieve from the FASTA file (which is exactly the same file that was used to generate the kallisto index) all transcript and gene IDs, as well as gene names. A custom made python script was used to retrieve all these information. The tx2gene file that was provided to the code includes only transcript IDs and gene names. The issue is not at all related to the tx2gene file, because all transcripts have a corresponding gene name. Moreover, the same genes that show NA values in some cells have actual count values in other cells.

# Load magritrr
if(!require(magrittr)) {
install.packages("magrittr")
library(magrittr)
}

# Load transcript to genename table and remove gene ID column

# Check tx2gene table

TXNAME   GENENAME
1 ENST00000434970.2      TRDD2
2 ENST00000448914.1      TRDD3
3 ENST00000415118.1      TRDD1
4 ENST00000631435.1 AC239618.6
5 ENST00000632684.1 AC245427.8
6 ENST00000454908.1    IGHD1-1

# Check NAs in gene names

tx2gene\$GENENAME %>% is.na() %>% any()
[1] FALSE

Best,

Leon

Any NA if you set txOut=TRUE?

When I set txOut = TRUE, I get no NA values. However, I didn't run this code on all my cells, since this is time consuming. I first identified all cells having NA values, then selected 50 of them and ran the code. Now if I set txOut = FALSE by using only this same subset of cells (i.e. the 50 selected cells), I still don't get NA values. Any hints on what could have happened when taking all 3000 cells together?

Best,

Leon

My guess is that the tx2gene doesn't line up with the transcripts in the FASTA file, and hence the quantifications. Biomart is convenient but using it makes it sometimes difficult to know which version you got. Can you try this instead:

db <- ensDbFromGtf("Homo_sapiens.GRCh38.90.gtf.gz")
txs <- transcripts(db, return.type="DataFrame")

Then construct a tx2gene table from 'txs' above.

Dear Michael,

I still don't understand why you think this is related to tx2gene. Many cells from the same dataset (3/4 of the cells) and also across multiple other datasets having fewer cells don't show this problem at all (using the same tx2gene file). Also, and as I mentioned it C: tximport returns NA counts, when using tximport with only a subset of the cells showing NA values, this problem doe not emerge at all (i.e. the same genes that previously had NA values now show actual counts). Also, when I checked many of these genes individually by tracing back their transcript ID (which was unique in some cases), and then searched for this transcript in the corresponding abundance.tsv files, I actually found count values and not NA. The transcript ID matched perfectly the ID in the abundance.tsv files! In my view, everything I tried points to the fact that it is not a problem of concordance between the files, but an issue with the internal workings of tximport, that might not support very large datasets. Does tximport try to correct the counts of cells (or samples more generally) having a high variance in gene expression? Also, do you think that this issue might be fixed any time soon?

Best,

Leon

I’d be happy to fix the issue but we need to identify the source of the NA. tximport is simply summing the counts from transcripts associated with a gene in tx2gene. Can you please email me three objects saved in an RData file: an example of some columns that contain NA values, an example of the output for the same samples when txOut=TRUE, and tx2gene? I’ll report back here if I find anything.

In version 1.7.3, I hard code the col_types to avoid this error:

https://github.com/mikelove/tximport/commit/f80fcaac7411ae590688c237088072a313772668

0
6 months ago by
Michael Love17k
United States
Michael Love17k wrote:

Can you give more information about where the transcripts FASTA file came from and provide the data source and all of the R code you used to make tx2gene?