Question: Problem with tximport(reader = read_tsv) and workaround
2
gravatar for Jenny Drnevich
2.8 years ago by
Jenny Drnevich1.9k
United States
Jenny Drnevich1.9k wrote:

Hi all,

I just wanted to point out a slight problem I ran into when trying to use tximport to read in data from Salmon, particularly using the faster read_tsv() function from the readr package. I happen to have a sample that only has integer values in NumReads column of the quant.sf output from Salmon in the first 1181 rows before the first decimal value, and when I tried reading in the data using tximport(  , reader = read_tsv) I got a warning about parsing failures and "no trailing characters", and the resulting counts had NA values for this sample for all transcripts with non-integer values. The other samples that read in fine had the first non-integer values within their first 200 rows and had their data read in just fine with read_tsv. I tried using the default reader of read.delim, and that was able to read in the problem sample, although it was 3-4 times slower (but still less than 2 seconds, so not an issue). The help for read_tsv says it guesses the data type using only the first 1000 rows, and you can manually specify using the col_types argument. Perhaps you could add in the ability in tximport to pass arguments to the reader used? The default reader works fine, so it's not really needed. I mainly just wanted to document this problem on the support site.

Cheers, 

Jenny

# non-reproducible example...

> setwd("D:/Statistics/Pandey/Aug2016/SalmonTest/")
>
> options(stringsAsFactors=F)
> library(tximport)
> library(readr)
>
>
> targets <- data.frame(sample = dir(path = "test1b"), ID = paste0("s", 10:13))
> targets
                         sample  ID
1 10VTA_TCCGGAGA-GCCTCTAT_L00M_ s10
2 11VTA_TCCGGAGA-AGGATAGG_L00M_ s11
3 12VTA_TCCGGAGA-TCAGAGCC_L00M_ s12
4 13VTA_TCCGGAGA-CTTCGCCT_L00M_ s13
>
> files.1b <- file.path("test1b", targets$sample, "quant.sf")
> names(files.1b) <- paste0(targets$ID, ".1b")
> files.1b
                                         s10.1b
"test1b/10VTA_TCCGGAGA-GCCTCTAT_L00M_/quant.sf"
                                         s11.1b
"test1b/11VTA_TCCGGAGA-AGGATAGG_L00M_/quant.sf"
                                         s12.1b
"test1b/12VTA_TCCGGAGA-TCAGAGCC_L00M_/quant.sf"
                                         s13.1b
"test1b/13VTA_TCCGGAGA-CTTCGCCT_L00M_/quant.sf"
>
> #Read in using read_tsv and note the time
> system.time(
+     tx.readr <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = read_tsv)
+ )
reading in files
1 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_integer()
)
Warning: 37398 parsing failures.
row      col               expected actual
1122 NumReads no trailing characters .56607
1123 NumReads no trailing characters .43393
1125 NumReads no trailing characters .325
1132 NumReads no trailing characters .989
1133 NumReads no trailing characters .385
.... ........ ...................... ......
See problems(...) for more details.

2 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
3 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
4 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)

   user  system elapsed
   0.42    0.05    0.47
>
> head(tx.readr$counts)
               s10.1b  s11.1b s12.1b s13.1b
NM_001000000.1      1 0.00000      0      0
NM_001000001.1      0 3.60026     11     31
NM_001000002.1      0 0.00000      0      0
NM_001000003.1      0 0.00000      0      0
NM_001000004.1      0 1.00000      0      0
NM_001000005.1      0 1.00000      0      0
> tx.readr$counts[1121:1130,]
               s10.1b    s11.1b    s12.1b     s13.1b
NM_001001426.1      1    1.0000    1.0000    1.00000
NM_001001427.1     NA   11.9058   25.6043    8.26917
NM_001001428.1     NA   29.6531   21.5647    3.59620
NM_001001429.1      0    2.0000    3.0000    0.00000
NM_001001504.2     NA  828.8000  732.6420 1636.88000
NM_001001505.1      0    0.0000    0.0000    0.00000
NM_001001506.1      0    0.0000    0.0000    0.00000
NM_001001507.1      0    1.0000    0.0000    6.00000
NM_001001508.2   1783 2780.0000 2930.0000 1800.00000
NM_001001509.1      2    5.0000    5.0000    5.00000
>
> apply(tx.readr$counts, 2, function(x) sum(is.na(x)))
s10.1b s11.1b s12.1b s13.1b
37398      0      0      0
>
>
> #Read in using default read.delim
> system.time(
+     tx.delim <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = read.delim)
+ )
reading in files
1 2 3 4
   user  system elapsed
   1.36    0.01    1.38
​> tx.delim$counts[1121:1130,]
                   s10.1b    s11.1b    s12.1b     s13.1b
NM_001001426.1    1.00000    1.0000    1.0000    1.00000
NM_001001427.1    9.56607   11.9058   25.6043    8.26917
NM_001001428.1    7.43393   29.6531   21.5647    3.59620
NM_001001429.1    0.00000    2.0000    3.0000    0.00000
NM_001001504.2  468.32500  828.8000  732.6420 1636.88000
NM_001001505.1    0.00000    0.0000    0.0000    0.00000
NM_001001506.1    0.00000    0.0000    0.0000    0.00000
NM_001001507.1    0.00000    1.0000    0.0000    6.00000
NM_001001508.2 1783.00000 2780.0000 2930.0000 1800.00000
NM_001001509.1    2.00000    5.0000    5.0000    5.00000
> apply(tx.delim$counts, 2, function(x) sum(is.na(x)))
s10.1b s11.1b s12.1b s13.1b
     0      0      0      0
>
>
> #Still have the problem if it's not the first sample read in?
>
> system.time(
+     tx.readr2 <- tximport(files.1b[c(2:4,1)], type = "salmon", txOut = TRUE, reader = read_tsv)
+ )
reading in files
1 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
2 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
3 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
4 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_integer()
)
Warning: 37398 parsing failures.
row      col               expected actual
1122 NumReads no trailing characters .56607
1123 NumReads no trailing characters .43393
1125 NumReads no trailing characters .325
1132 NumReads no trailing characters .989
1133 NumReads no trailing characters .385
.... ........ ...................... ......
See problems(...) for more details.

   user  system elapsed
   0.30    0.02    0.31
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252 
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                        
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_1.0.0    tximport_1.1.4

loaded via a namespace (and not attached):
[1] assertthat_0.1 tools_3.3.1    tibble_1.2     Rcpp_0.12.7 
tximport readr read_tsv • 1.2k views
ADD COMMENTlink modified 2.8 years ago by Michael Love24k • written 2.8 years ago by Jenny Drnevich1.9k
Answer: Problem with tximport(reader = read_tsv) and workaround
2
gravatar for Michael Love
2.8 years ago by
Michael Love24k
United States
Michael Love24k wrote:

Thanks Jenny for the report. Useful for others would come across the same problem. 

Note that the easiest workaround is to skip readr::read_tsv:

txi <- tximport(files, type="salmon", ..., importer=read.delim)

Alternative you can write your own function based on read_tsv, first look up how many rows there are:

> nrow(read.delim(files[1]))
[1] 36736

Then:

my_read_tsv <- function(x) readr::read_tsv(x, guess_max=36736)
txi <- tximport(files, type="salmon", ..., importer=my_read_tsv)

 

ADD COMMENTlink modified 2.2 years ago • written 2.8 years ago by Michael Love24k

Thanks, Mike - I didn't think of that! Then all I'd have to do is tximport( , reader = my_read_tsv) . I knew I'd learn something by posting this.

ADD REPLYlink written 2.8 years ago by Jenny Drnevich1.9k

And I spoke to soon. I tried it, but it doesn't work:

> my_read_tsv <- function(x) {
+     read_tsv(x, guess_max = 1200)
+ }

> tx.readr <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = my_read_tsv)

Error in reader(x, comment = "#") : unused argument (comment = "#")

Looks like internally you are re-defining the reader function with specific options for kallisto/salmon/sailfish/rsem:

if (type %in% c("salmon", "sailfish")) {
        #other lines...
        importer <- function(x) reader(x, comment = "#")
    }

I did figure out how to pass on your additional arguments by adding , ... to my function:

> my_read_tsv <- function(x, ...) {
+     read_tsv(x, guess_max = 1200, ...)
+ }

> tx.readr <- tximport(files.1b, type = "salmon", txOut = TRUE, reader = my_read_tsv)

reading in files
1 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
2 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
3 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
4 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Jenny Drnevich1.9k

In version 1.7.3, I hard code the col_types to avoid this error:

https://github.com/mikelove/tximport/commit/f80fcaac7411ae590688c237088072a313772668

ADD REPLYlink written 19 months ago by Michael Love24k

Thanks - I'll try it out soon!

ADD REPLYlink written 19 months ago by Jenny Drnevich1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 129 users visited in the last hour