Question

Problem with tximport(reader = read_tsv) and workaround

3

Entering edit mode

Jenny Drnevich ★ 2.0k

@jenny-drnevich-2812

Last seen 6 months ago

United States

Hi all,

I just wanted to point out a slight problem I ran into when trying to use tximport to read in data from Salmon, particularly using the faster read_tsv() function from the readr package. I happen to have a sample that only has integer values in NumReads column of the quant.sf output from Salmon in the first 1181 rows before the first decimal value, and when I tried reading in the data using tximport( , reader = read_tsv) I got a warning about parsing failures and "no trailing characters", and the resulting counts had NA values for this sample for all transcripts with non-integer values. The other samples that read in fine had the first non-integer values within their first 200 rows and had their data read in just fine with read_tsv. I tried using the default reader of read.delim, and that was able to read in the problem sample, although it was 3-4 times slower (but still less than 2 seconds, so not an issue). The help for read_tsv says it guesses the data type using only the first 1000 rows, and you can manually specify using the col_types argument. Perhaps you could add in the ability in tximport to pass arguments to the reader used? The default reader works fine, so it's not really needed. I mainly just wanted to document this problem on the support site.

Cheers,

Jenny

# non-reproducible example...

> setwd("D:/Statistics/Pandey/Aug2016/SalmonTest/")
>
> options(stringsAsFactors=F)
> library(tximport)
> library(readr)
>
>
> targets <- data.frame(sample = dir(path = "test1b"), ID = paste0("s", 10:13))
> targets
                         sample  ID
1 10VTA_TCCGGAGA-GCCTCTAT_L00M_ s10
2 11VTA_TCCGGAGA-AGGATAGG_L00M_ s11
3 12VTA_TCCGGAGA-TCAGAGCC_L00M_ s12
4 13VTA_TCCGGAGA-CTTCGCCT_L00M_ s13
>
> files.1b <- file.path("test1b", targets$sample, "quant.sf")
> names(files.1b) <- paste0(targets$ID, ".1b")
> files.1b
                                         s10.1b
"test1b/10VTA_TCCGGAGA-GCCTCTAT_L00M_/quant.sf"
                                         s11.1b
"test1b/11VTA_TCCGGAGA-AGGATAGG_L00M_/quant.sf"
                                         s12.1b
"test1b/12VTA_TCCGGAGA-TCAGAGCC_L00M_/quant.sf"
                                         s13.1b
"test1b/13VTA_TCCGGAGA-CTTCGCCT_L00M_/quant.sf"
>
> #Read in using read_tsv and note the time
> system.time(
+     tx.readr <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = read_tsv)
+ )
reading in files
1 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_integer()
)
Warning: 37398 parsing failures.
row      col               expected actual
1122 NumReads no trailing characters .56607
1123 NumReads no trailing characters .43393
1125 NumReads no trailing characters .325
1132 NumReads no trailing characters .989
1133 NumReads no trailing characters .385
.... ........ ...................... ......
See problems(...) for more details.

2 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
3 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
4 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)

   user  system elapsed
   0.42    0.05    0.47
>
> head(tx.readr$counts)
               s10.1b  s11.1b s12.1b s13.1b
NM_001000000.1      1 0.00000      0      0
NM_001000001.1      0 3.60026     11     31
NM_001000002.1      0 0.00000      0      0
NM_001000003.1      0 0.00000      0      0
NM_001000004.1      0 1.00000      0      0
NM_001000005.1      0 1.00000      0      0
> tx.readr$counts[1121:1130,]
               s10.1b    s11.1b    s12.1b     s13.1b
NM_001001426.1      1    1.0000    1.0000    1.00000
NM_001001427.1     NA   11.9058   25.6043    8.26917
NM_001001428.1     NA   29.6531   21.5647    3.59620
NM_001001429.1      0    2.0000    3.0000    0.00000
NM_001001504.2     NA  828.8000  732.6420 1636.88000
NM_001001505.1      0    0.0000    0.0000    0.00000
NM_001001506.1      0    0.0000    0.0000    0.00000
NM_001001507.1      0    1.0000    0.0000    6.00000
NM_001001508.2   1783 2780.0000 2930.0000 1800.00000
NM_001001509.1      2    5.0000    5.0000    5.00000
>
> apply(tx.readr$counts, 2, function(x) sum(is.na(x)))
s10.1b s11.1b s12.1b s13.1b
37398      0      0      0
>
>
> #Read in using default read.delim
> system.time(
+     tx.delim <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = read.delim)
+ )
reading in files
1 2 3 4
   user  system elapsed
   1.36    0.01    1.38
> tx.delim$counts[1121:1130,]
                   s10.1b    s11.1b    s12.1b     s13.1b
NM_001001426.1    1.00000    1.0000    1.0000    1.00000
NM_001001427.1    9.56607   11.9058   25.6043    8.26917
NM_001001428.1    7.43393   29.6531   21.5647    3.59620
NM_001001429.1    0.00000    2.0000    3.0000    0.00000
NM_001001504.2  468.32500  828.8000  732.6420 1636.88000
NM_001001505.1    0.00000    0.0000    0.0000    0.00000
NM_001001506.1    0.00000    0.0000    0.0000    0.00000
NM_001001507.1    0.00000    1.0000    0.0000    6.00000
NM_001001508.2 1783.00000 2780.0000 2930.0000 1800.00000
NM_001001509.1    2.00000    5.0000    5.0000    5.00000
> apply(tx.delim$counts, 2, function(x) sum(is.na(x)))
s10.1b s11.1b s12.1b s13.1b
     0      0      0      0
>
>
> #Still have the problem if it's not the first sample read in?
>
> system.time(
+     tx.readr2 <- tximport(files.1b[c(2:4,1)], type = "salmon", txOut = TRUE, reader = read_tsv)
+ )
reading in files
1 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
2 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
3 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
4 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_integer()
)
Warning: 37398 parsing failures.
row      col               expected actual
1122 NumReads no trailing characters .56607
1123 NumReads no trailing characters .43393
1125 NumReads no trailing characters .325
1132 NumReads no trailing characters .989
1133 NumReads no trailing characters .385
.... ........ ...................... ......
See problems(...) for more details.

   user  system elapsed
   0.30    0.02    0.31
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252 
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                        
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_1.0.0    tximport_1.1.4

loaded via a namespace (and not attached):
[1] assertthat_0.1 tools_3.3.1    tibble_1.2     Rcpp_0.12.7

tximport readr read_tsv • 3.6k views

ADD COMMENT • link updated 8.8 years ago by Michael Love 43k • written 8.8 years ago by Jenny Drnevich ★ 2.0k

score 2 · Answer 1 · 2016-09-27

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 18 days ago

United States

Thanks Jenny for the report. Useful for others would come across the same problem.

Note that the easiest workaround is to skip readr::read_tsv:

txi <- tximport(files, type="salmon", ..., importer=read.delim)

Alternative you can write your own function based on read_tsv, first look up how many rows there are:

> nrow(read.delim(files[1]))
[1] 36736

Then:

my_read_tsv <- function(x) readr::read_tsv(x, guess_max=36736)
txi <- tximport(files, type="salmon", ..., importer=my_read_tsv)

ADD COMMENT • link 8.8 years ago • updated 8.2 years ago Michael Love 43k

0

Entering edit mode

Thanks, Mike - I didn't think of that! Then all I'd have to do is tximport( , reader = my_read_tsv) . I knew I'd learn something by posting this.

ADD REPLY • link 8.8 years ago Jenny Drnevich ★ 2.0k

0

Entering edit mode

And I spoke to soon. I tried it, but it doesn't work:

> my_read_tsv <- function(x) {
+     read_tsv(x, guess_max = 1200)
+ }

> tx.readr <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = my_read_tsv)

Error in reader(x, comment = "#") : unused argument (comment = "#")

Looks like internally you are re-defining the reader function with specific options for kallisto/salmon/sailfish/rsem:

if (type %in% c("salmon", "sailfish")) {
        #other lines...
        importer <- function(x) reader(x, comment = "#")
    }

I did figure out how to pass on your additional arguments by adding , ... to my function:

> my_read_tsv <- function(x, ...) {
+     read_tsv(x, guess_max = 1200, ...)
+ }

> tx.readr <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = my_read_tsv)

reading in files
1 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
2 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
3 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
4 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)

ADD REPLY • link 8.8 years ago Jenny Drnevich ★ 2.0k

0

Entering edit mode

In version 1.7.3, I hard code the col_types to avoid this error:

https://github.com/mikelove/tximport/commit/f80fcaac7411ae590688c237088072a313772668