support for reading direct RNA-Seq FASTQs
1
0
Entering edit mode
@kentriemondy-14219
Last seen 2 days ago
Denver, University of Colorado Anschutz…

I'm working with Nanopore direct RNA-seq data which generates FASTQs with U bases. Is there a Bioconductor package that supports reading these files? I've tried using methods from ShortRead, but get errors due to the U bases. Thanks in advance.

suppressPackageStartupMessages(library(ShortRead))

fq <- tempfile()
u_fq_txt <- paste(c("@readid", 
                     "UCGA",
                     "+",
                     "]]]]"), 
                   collapse = "\n")
writeLines(u_fq_txt, fq)

strm <- FastqStreamer(fq)
yield(strm)
#> Error in x$yield(...): _DNAencode(): invalid DNAString input character: 'U' (byte value 85)

readFastq(fq)
#> Error: Input/Output
#>   file(s):
#>     /var/folders/r9/g3c47jrj40gc14d8qsqx7src0000gn/T//RtmpwZ8EOi/file4dee4320c214
#>   message: invalid character '


t_fq_txt <- paste(c("@readid", 
                    "TCGA",
                    "+",
                    "]]]]"), 
                  collapse = "\n")
writeLines(t_fq_txt, fq)

strm <- FastqStreamer(fq)
yield(strm)
#> class: ShortReadQ
#> length: 1 reads; width: 4 cycles

readFastq(fq)
#> class: ShortReadQ
#> length: 1 reads; width: 4 cycles

unlink(fq)
ShortRead nanopore • 838 views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 5 minutes ago
United States

What's the use case? If you just need to read the sequences into R, you could use Biostrings, in particular readRNAStringSet

> fq <- tempfile()
> u_fq_txt <- paste(c("@readid", 
                     "UCGA",
                     "+",
                     "]]]]"), 
                   collapse = "\n")
> writeLines(u_fq_txt, fq)
> readRNAStringSet(fq, "fastq")
RNAStringSet object of length 1:
    width seq                                               names               
[1]     4 UCGA                                              readid
ADD COMMENT
0
Entering edit mode

Or alternatively

> z <- readRNAStringSet(fq, "fastq", with.qualities = TRUE)

> zz <- QualityScaledRNAStringSet(z, PhredQuality(mcols(z)$qualities))

> zz
  A QualityScaledRNAStringSet instance containing:

RNAStringSet object of length 1:
    width seq                                               names               
[1]     4 UCGA                                              readid

PhredQuality object of length 1:
    width seq
[1]     4 ]]]]
ADD REPLY
0
Entering edit mode

The original use case was to concatenate multiple fastqs into a single fastq, while also converting the U's to T's for compatibility with downstream tools. The streaming functionality of FastqStreamer seemed like a good approach to keep the memory usage low while converting each fastq. I could use unix tools ( e.g. cat and awk), but was curious to see how to do it using bioconductor tools.

Your response helped point me in the right direction. I used a BStringSet initially to allow for Us to be converted to Ts, and subsequently coerced to a DNAStringSet. I could then generate a ShortReadQ object and write records to disk with writeFastq. Probably not the most efficient but worked for my initial use case.

Here's my approach, that allows for limiting the # of lines read at a time, in case it is useful to anyone.

suppressPackageStartupMessages(library(ShortRead))

fq <- tempfile()
outfq <- tempfile()
u_fq_txt <- paste(c("@readid", 
                    "UCGA",
                    "+",
                    "]]]]"), 
                  collapse = "\n")
writeLines(u_fq_txt, fq)

convert_rna_fastq <- function(fn, outfn, n_per_chunk = 100000){
  if(file.exists(outfn)){
    unlink(outfn)
  }
  stopifnot(n_per_chunk %% 4 == 0)

  con <- file(description=fn, open="r")   
  data <- readLines(con, n = n_per_chunk)

  repeat {
    if (length(data) == 0)
      break

    if(length(data) %% 4 != 0){
      stop("something wrong with fastq file?")
    }

    n_lines <- length(data)
    id_lines <- seq(1, n_lines, by = 4)
    seq_lines <- seq(2, n_lines, by = 4)
    qual_lines <- seq(4, n_lines, by = 4)

    ids <- BStringSet(data[id_lines], start = 2)

    sread <- BStringSet(data[seq_lines])
    sread <- chartr("U", "T", sread)
    sread <- as(sread, "DNAStringSet")

    qual <- FastqQuality(data[qual_lines])

    fastq <- new("ShortReadQ", id = ids, sread = sread, quality = qual)

    writeFastq(fastq, outfn, mode='a')

    # last chunk was final chunk
    if (length(data) != n_per_chunk)  
      break
    # read next chunk
    data <- tryCatch(readLines(con, n = n_per_chunk), error=function(e) e)
  }

  close(con)
}

convert_rna_fastq(fq, outfq)

fq_records <- readFastq(outfq)
sread(fq_records)
#> DNAStringSet object of length 1:
#>     width seq
#> [1]     4 TCGA

Created on 2022-04-20 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)

ADD REPLY

Login before adding your answer.

Traffic: 846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6