problem code step

Question

translate errors in fastq file reading

Brett • 0

@41936eb9

Last seen 2.1 years ago

United States

We were running a script with another postdoc in 2021 that was working fine. Now we have new sequencing data and the FASTQ file isn't being read. When we open the old and new files in CLC or Sublime, I don't see a big formatting difference. The file is definitely there, and when we replace it with the old sequencing file in the same location, everything works, so we believe it is a file format issue. Please let me know what more I can show to help narrow this down. Does anyone have an idea of what might be throwing this error? Thanks for your patience! -Brett

problem code step

> writeXStringSet(translate(trimLRPatterns(Lpattern= Lpattern, subject = readDNAStringSet
+ ("\\Users\\VMIAdmin\\Desktop\\600_sequencing_fastq\\22-0391_S1_L001_R1_001.fastq",
+ format="fastq", nrec= -1L, skip=0L, seek.first.rec = TRUE, use.names = TRUE), max.Lmismatch = 10.0), if.fuzzy.codon = "solve", genetic.code = mygencodeTAGB),
+ "\\Users\\VMIAdmin\\Desktop\\dsDNAtest2",
+ append=FALSE, compress=FALSE, compression_level = NA, format = "fasta")

error

Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'translate': error in evaluating the argument 'subject' in selecting a method for function 'trimLRPatterns': reading FASTQ file \Users\VMIAdmin\Desktop\600_sequencing_fastq\22-0391_S1_L001_R1_001.fastq: no FASTQ record found

sessionInfo()

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets
[7] methods   base

other attached packages:
[1] Biostrings_2.64.1   GenomeInfoDb_1.32.4 XVector_0.36.0
[4] IRanges_2.30.1      S4Vectors_0.34.0    BiocGenerics_0.42.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.42.0        compiler_4.2.1
[3] tools_4.2.1            GenomeInfoDbData_1.2.8
[5] RCurl_1.98-1.8         crayon_1.5.1
[7] bitops_1.0-7

translate biostrings Biostrings • 1.6k views

ADD COMMENT • link 2.1 years ago Brett • 0

score 0 · Answer 1 · 2022-10-27

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen just now

United States

When asking a question, your best bet is to try to restrict what you are doing to the simplest possible code that still produces the error. And your code is not only super messy, but needlessly complex. I mean you have nested four function calls in one line! And your error is then also nested, but it appears that the main issue is at the end.

reading FASTQ file \Users\VMIAdmin\Desktop\600_sequencing_fastq\22-0391_S1_L001_R1_001.fastq: no FASTQ record found

Which seems to be a pretty clear error message?
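
To make that concrete, here is a minimal sketch of the same pipeline with each of the four calls on its own line, so whichever step fails is named directly in the error. It assumes Biostrings is installed. The function name `translate_trimmed_reads` and its argument names are hypothetical; `Lpattern` and `genetic_code` stand in for the `Lpattern` and `mygencodeTAGB` objects from the question, which are not shown in the thread. Arguments that the question set to their documented defaults are left out.

```r
# Hypothetical wrapper: the same four Biostrings calls as in the question,
# one per step instead of nested in a single expression.
translate_trimmed_reads <- function(fastq_path, out_path, Lpattern, genetic_code,
                                    max_Lmismatch = 10) {
  # Step 1: read the reads. A malformed FASTQ file fails here and nowhere else.
  reads <- Biostrings::readDNAStringSet(fastq_path, format = "fastq",
                                        seek.first.rec = TRUE)

  # Step 2: trim the left-hand pattern from each read.
  trimmed <- Biostrings::trimLRPatterns(Lpattern = Lpattern, subject = reads,
                                        max.Lmismatch = max_Lmismatch)

  # Step 3: translate with the custom genetic code, resolving fuzzy codons.
  protein <- Biostrings::translate(trimmed, genetic.code = genetic_code,
                                   if.fuzzy.codon = "solve")

  # Step 4: write the translated sequences as FASTA.
  Biostrings::writeXStringSet(protein, out_path, format = "fasta")

  invisible(protein)
}
```

Called with the two paths from the question plus `Lpattern` and `mygencodeTAGB`, this should behave like the original one-liner, and if the FASTQ file is the problem the error will come from step 1 alone.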

ADD COMMENT • link 2.1 years ago James W. MacDonald 67k

I get that the error is clear, but the solution is not. Again, it works on sequences generated last year but not on this year's. Did something change in Illumina FASTQ file generation in that time?

ADD REPLY • link 2.1 years ago Brett • 0

Not that I know of. But the error says there's no FASTQ record. Did you look at the FASTQ file? Are there FASTQ records? Are some missing? Is it actually FASTA?
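
Those questions can be checked from R itself rather than by eye in an editor. The following is a rough base R sketch, assuming the usual four-line record layout (header, sequence, `+` separator, qualities); the helper name `check_first_fastq_record` is made up for illustration and is not part of Biostrings.

```r
# Hypothetical helper: does the file begin with one well-formed 4-line FASTQ record?
check_first_fastq_record <- function(path) {
  rec <- readLines(path, n = 4L, warn = FALSE)
  if (length(rec) < 4L) {
    return(FALSE)  # fewer than four lines cannot hold a full record
  }
  seq_len_bytes  <- nchar(rec[2], type = "bytes")
  qual_len_bytes <- nchar(rec[4], type = "bytes")
  startsWith(rec[1], "@") &&          # header line starts with '@'
    seq_len_bytes > 0L &&             # sequence line is not empty
    startsWith(rec[3], "+") &&        # separator line starts with '+'
    seq_len_bytes == qual_len_bytes   # one quality character per base
}

# Usage on a small temporary file (a made-up read, for illustration only).
tmp <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT", "+", "IIII"), tmp)
check_first_fastq_record(tmp)
```

A `FALSE` from this check on the real file would point at the file itself rather than at the trimming or translation steps.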

ADD REPLY • link 2.1 years ago James W. MacDonald 67k

The new sequencing files are in the same format as the 2021 files when opened in Sublime.

ADD REPLY • link 2.1 years ago Brett • 0

How does Biostrings determine whether the file is a legitimate FASTQ file?

ADD REPLY • link 2.1 years ago Brett • 0

You are setting seek.first.rec = TRUE, which means you are doing

seek.first.rec: 'TRUE' or 'FALSE' (the default). If 'TRUE', then the
          reading function starts by setting the file position
          indicator at the beginning of the first line in the file that
          looks like the beginning of a FASTA (if 'format' is
          '"fasta"') or FASTQ (if 'format' is '"fastq"') record. More
          precisely this is the first line in the file that starts with
          a '>' (for FASTA) or a '@' (for FASTQ). An error is raised if
          no such line is found.

          Normal parsing then starts from there, and everything happens
          like if the file actually started there. In particular it
          will be an error if this first record is not a valid FASTA or
          FASTQ record.

          Using 'seek.first.rec=TRUE' is useful for example to parse
          GFF3 files with embedded FASTA data.

Which isn't the default. And you can see why you get an error, as apparently your FASTQ file is missing the @.
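
As an illustration of what that help text describes, here is a base R sketch that looks for the first line starting with the record marker and raises an error if there is none. It only mimics the documented behaviour and is not Biostrings' actual implementation; the helper name `first_record_line` is hypothetical, and it reads the whole file into memory, so it is only meant for small test files.

```r
# Hypothetical helper mimicking the documented seek.first.rec behaviour.
first_record_line <- function(path, format = c("fastq", "fasta")) {
  format <- match.arg(format)
  marker <- if (format == "fastq") "@" else ">"
  lines <- readLines(path, warn = FALSE)   # whole file: small inputs only
  hits <- which(startsWith(lines, marker))
  if (length(hits) == 0L) {
    stop("no ", toupper(format), " record found")
  }
  hits[1]                                  # line number where parsing would start
}

# Usage: two junk lines followed by one made-up FASTQ record.
tmp_seek <- tempfile(fileext = ".fastq")
writeLines(c("junk", "more junk", "@read1", "ACGT", "+", "IIII"), tmp_seek)
first_record_line(tmp_seek)
```

If no line in the file starts with `@`, the sketch stops with the same kind of "no FASTQ record found" message seen in the question.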

ADD REPLY • link 2.1 years ago James W. MacDonald 67k

First line of 2021 file:

@M02623:14:000000000-DBT7J:1:1101:16805:1416 1:N:0:TAAGGCGT TCTCACTCCTTTTTTCCTTTTCCTCTTCTTTCTTCACCTTTTTTTTTTTTCTTTTTTTTTTTTTTGTCTCTTTCTTTTTTCCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCGTTTCCTTTTTCTTTT + 1>>A1@DDB@DFFGCF1FGFF3BAADE1D13ADE22A11BFA1A//EA/>011111//>///>///B22222122111B<01>2212?1>?A?---<<:-:9;-9-9-9---;99@;==--;---;--9-----/--/-///9/://////

First line of 2022 file:

@M02623:21:000000000-KHNYN:1:1101:11377:1806 1:N:0:TATGGCGA TTTGTCCTGTGCAGCTTCTGGCTTCAACATTTCTTATCCCTCTCTGCGTCCGGCCCCCCGTCCCCCCCTTCACTGCCTTGCTTTTTTTTCTCCTTCTTCTCTTTCTTCTTTTTTTTCCTTTTTCGTCTTTCTCCGTTTCTCTCTTTTCGCTTTCACATCCAACAACACCTCCTACCTACATTTCTTCTGCTTAACTGCTTTGTTCTCTGCCGTCTTTTTTTTTATCTTTTCTTCTTTTTTTCTTATTCTTCTTTTTTTTCTTTTTCTTTTCTTTTTCTTCTGCTTTGTTCTACTGTGTTCATGGTACTCTTGTCACCGTCTCCTCGTTCTCGTGCGGTGGTTGCGGTTCCGTCGGTGCTTGCTCCCCCGTTTCCTGTTCCTCCGTTTTCCCTTTTCCCTTCTTCCTTTCTTTCTTTTTCCTTTTTTTTTTCTTTTTTTTTCCCTTTCCTTT + CCC@CFFE@F9E,@EFFFF9,CFFE,,,CFEEEF,<,,,;,,,;,6+8++++++7@+++8,+,,+677,,,,<,,,:9,,,<,5,8@@,B,9BA5?E<B,,<@5C,5B?,BA+8+,,,,<,,3+38,+,,,3,+:3,,:3,:,,,3+++++7,8,383,,,,4,6,565,762,6,,,,,,,5,,26>,,,,+4<,,,,,1,14,,,41232+29)2+++++02:0/:7.)19):9*2=):=49@)94(04,=6)982<9).-6:).4(6:)6))6)6))(.6)-.))))..))))))-)..))..)).(4,,(.61((43:40(((,(-(((((,((-,,(()((((((-)),)),((((((()))))-))))((,)((-))-.)))))))(.))))))))-))-))),())))),(,(((().)))(((,()))-))-)-)

ADD REPLY • link 2.1 years ago Brett • 0

So the solution was to open the .gz files in CLC and export them as FASTQ. Decompressing the Illumina fastq.gz files did not yield a file usable in Biostrings.
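
Since the fix came down to how the .gz files were unpacked, a quick look at the first few bytes can show whether a file named .fastq is really plain text. This is a rough base R sketch, not something that was run in this thread; the helper name `sniff_file_start` and its return labels are made up for illustration.

```r
# Hypothetical helper: classify a file by its leading bytes.
sniff_file_start <- function(path) {
  b <- readBin(path, what = "raw", n = 3L)   # raw bytes, no decompression
  starts_with <- function(sig) {
    length(b) >= length(sig) && identical(b[seq_along(sig)], as.raw(sig))
  }
  if (starts_with(c(0x1f, 0x8b))) {
    "gzip"            # gzip magic number: the file is still compressed
  } else if (starts_with(c(0xef, 0xbb, 0xbf))) {
    "utf8-bom"        # UTF-8 byte order mark before the first '@'
  } else if (starts_with(c(0xff, 0xfe)) || starts_with(c(0xfe, 0xff))) {
    "utf16-bom"       # file was re-saved as UTF-16
  } else if (starts_with(0x40)) {
    "starts-with-@"   # plain text beginning with '@', as FASTQ should
  } else {
    "other"
  }
}

# Usage: a gzip-compressed temporary file holding one made-up record.
tmp_gz <- tempfile(fileext = ".fastq.gz")
con <- gzfile(tmp_gz, "w")
writeLines(c("@read1", "ACGT", "+", "IIII"), con)
close(con)
sniff_file_start(tmp_gz)
```

The Biostrings documentation for `readDNAStringSet` also states that gzip-compressed files are supported, so pointing it at the original .fastq.gz may avoid the separate decompression step; that was not tried in this thread.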

ADD REPLY • link 2.1 years ago Brett • 0