Hi all,
I just wanted to point out a slight problem I ran into when trying to use tximport to read in data from Salmon, particularly using the faster read_tsv() function from the readr package. I happen to have a sample that only has integer values in NumReads column of the quant.sf output from Salmon in the first 1181 rows before the first decimal value, and when I tried reading in the data using tximport( , reader = read_tsv)
I got a warning about parsing failures and "no trailing characters", and the resulting counts had NA values for this sample for all transcripts with non-integer values. The other samples that read in fine had the first non-integer values within their first 200 rows and had their data read in just fine with read_tsv. I tried using the default reader of read.delim, and that was able to read in the problem sample, although it was 3-4 times slower (but still less than 2 seconds, so not an issue). The help for read_tsv says it guesses the data type using only the first 1000 rows, and you can manually specify using the col_types argument. Perhaps you could add in the ability in tximport to pass arguments to the reader used? The default reader works fine, so it's not really needed. I mainly just wanted to document this problem on the support site.
Cheers,
Jenny
# non-reproducible example... > setwd("D:/Statistics/Pandey/Aug2016/SalmonTest/") > > options(stringsAsFactors=F) > library(tximport) > library(readr) > > > targets <- data.frame(sample = dir(path = "test1b"), ID = paste0("s", 10:13)) > targets sample ID 1 10VTA_TCCGGAGA-GCCTCTAT_L00M_ s10 2 11VTA_TCCGGAGA-AGGATAGG_L00M_ s11 3 12VTA_TCCGGAGA-TCAGAGCC_L00M_ s12 4 13VTA_TCCGGAGA-CTTCGCCT_L00M_ s13 > > files.1b <- file.path("test1b", targets$sample, "quant.sf") > names(files.1b) <- paste0(targets$ID, ".1b") > files.1b s10.1b "test1b/10VTA_TCCGGAGA-GCCTCTAT_L00M_/quant.sf" s11.1b "test1b/11VTA_TCCGGAGA-AGGATAGG_L00M_/quant.sf" s12.1b "test1b/12VTA_TCCGGAGA-TCAGAGCC_L00M_/quant.sf" s13.1b "test1b/13VTA_TCCGGAGA-CTTCGCCT_L00M_/quant.sf" > > #Read in using read_tsv and note the time > system.time( + tx.readr <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = read_tsv) + ) reading in files 1 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_integer() ) Warning: 37398 parsing failures. row col expected actual 1122 NumReads no trailing characters .56607 1123 NumReads no trailing characters .43393 1125 NumReads no trailing characters .325 1132 NumReads no trailing characters .989 1133 NumReads no trailing characters .385 .... ........ ...................... ...... See problems(...) for more details. 2 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_double() ) 3 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_double() ) 4 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_double() ) user system elapsed 0.42 0.05 0.47 > > head(tx.readr$counts) s10.1b s11.1b s12.1b s13.1b NM_001000000.1 1 0.00000 0 0 NM_001000001.1 0 3.60026 11 31 NM_001000002.1 0 0.00000 0 0 NM_001000003.1 0 0.00000 0 0 NM_001000004.1 0 1.00000 0 0 NM_001000005.1 0 1.00000 0 0 > tx.readr$counts[1121:1130,] s10.1b s11.1b s12.1b s13.1b NM_001001426.1 1 1.0000 1.0000 1.00000 NM_001001427.1 NA 11.9058 25.6043 8.26917 NM_001001428.1 NA 29.6531 21.5647 3.59620 NM_001001429.1 0 2.0000 3.0000 0.00000 NM_001001504.2 NA 828.8000 732.6420 1636.88000 NM_001001505.1 0 0.0000 0.0000 0.00000 NM_001001506.1 0 0.0000 0.0000 0.00000 NM_001001507.1 0 1.0000 0.0000 6.00000 NM_001001508.2 1783 2780.0000 2930.0000 1800.00000 NM_001001509.1 2 5.0000 5.0000 5.00000 > > apply(tx.readr$counts, 2, function(x) sum(is.na(x))) s10.1b s11.1b s12.1b s13.1b 37398 0 0 0 > > > #Read in using default read.delim > system.time( + tx.delim <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = read.delim) + ) reading in files 1 2 3 4 user system elapsed 1.36 0.01 1.38 > tx.delim$counts[1121:1130,] s10.1b s11.1b s12.1b s13.1b NM_001001426.1 1.00000 1.0000 1.0000 1.00000 NM_001001427.1 9.56607 11.9058 25.6043 8.26917 NM_001001428.1 7.43393 29.6531 21.5647 3.59620 NM_001001429.1 0.00000 2.0000 3.0000 0.00000 NM_001001504.2 468.32500 828.8000 732.6420 1636.88000 NM_001001505.1 0.00000 0.0000 0.0000 0.00000 NM_001001506.1 0.00000 0.0000 0.0000 0.00000 NM_001001507.1 0.00000 1.0000 0.0000 6.00000 NM_001001508.2 1783.00000 2780.0000 2930.0000 1800.00000 NM_001001509.1 2.00000 5.0000 5.0000 5.00000 > apply(tx.delim$counts, 2, function(x) sum(is.na(x))) s10.1b s11.1b s12.1b s13.1b 0 0 0 0 > > > #Still have the problem if it's not the first sample read in? > > system.time( + tx.readr2 <- tximport(files.1b[c(2:4,1)], type = "salmon", txOut = TRUE, reader = read_tsv) + ) reading in files 1 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_double() ) 2 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_double() ) 3 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_double() ) 4 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_integer() ) Warning: 37398 parsing failures. row col expected actual 1122 NumReads no trailing characters .56607 1123 NumReads no trailing characters .43393 1125 NumReads no trailing characters .325 1132 NumReads no trailing characters .989 1133 NumReads no trailing characters .385 .... ........ ...................... ...... See problems(...) for more details. user system elapsed 0.30 0.02 0.31 > sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] readr_1.0.0 tximport_1.1.4 loaded via a namespace (and not attached): [1] assertthat_0.1 tools_3.3.1 tibble_1.2 Rcpp_0.12.7
Thanks, Mike - I didn't think of that! Then all I'd have to do is
tximport( , reader = my_read_tsv)
. I knew I'd learn something by posting this.And I spoke to soon. I tried it, but it doesn't work:
Looks like internally you are re-defining the reader function with specific options for kallisto/salmon/sailfish/rsem:
I did figure out how to pass on your additional arguments by adding
, ...
to my function:In version 1.7.3, I hard code the
col_types
to avoid this error:https://github.com/mikelove/tximport/commit/f80fcaac7411ae590688c237088072a313772668
Thanks - I'll try it out soon!