We were running a script with another postdoc in 2021 that was working fine. Now we have new sequence data and the fasta file isn't being read. When we open it in CLC or Sublime, I don't see a big formatting difference. The file is definitely there, and when we replace it with the old sequencing file in the same location, everything works. So we believe it is a file format issue. Please let me know what more I can show to help narrow down this issue. Does anyone have an idea of what might be throwing this error? Thanks for your patience! -Brett
problem code step
> writeXStringSet(translate(trimLRPatterns(Lpattern= Lpattern, subject = readDNAStringSet
+ ("\\Users\\VMIAdmin\\Desktop\\600_sequencing_fastq\\22-0391_S1_L001_R1_001.fastq",
+ format="fastq", nrec= -1L, skip=0L, seek.first.rec = TRUE, use.names = TRUE), max.Lmismatch = 10.0), if.fuzzy.codon = "solve", genetic.code = mygencodeTAGB),
+ "\\Users\\VMIAdmin\\Desktop\\dsDNAtest2",
+ append=FALSE, compress=FALSE, compression_level = NA, format = "fasta")
error
Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'translate': error in evaluating the argument 'subject' in selecting a method for function 'trimLRPatterns': reading FASTQ file \Users\VMIAdmin\Desktop\600_sequencing_fastq\22-0391_S1_L001_R1_001.fastq: no FASTQ record found
sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] Biostrings_2.64.1 GenomeInfoDb_1.32.4 XVector_0.36.0
[4] IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.42.0 compiler_4.2.1
[3] tools_4.2.1 GenomeInfoDbData_1.2.8
[5] RCurl_1.98-1.8 crayon_1.5.1
[7] bitops_1.0-7
I get that the error is clear but not the solution. Again, it works on sequences generated last year but not this year. Did something change in Illumina FASTQ file generation in that time?
Not that I know of. But the error says there's no FASTQ record. Did you look at the FASTQ file? Are there FASTQ records? Are some missing? Is it actually FASTA?
The new sequencing files are in the same format as the 2021 files when opened in Sublime.
How does Biostrings determine whether the file is a legit FASTQ file?
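You are setting seek.first.rec = TRUE, which isn't the default. With that setting the reader starts by looking for the first line that begins with @ (the start of a FASTQ record) and raises an error if it finds none. So you can see why you get an error, as apparently your FASTQ file is missing the @.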
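One way to check this without a text editor is to look at the raw bytes at the start of the file. The following is a minimal base-R sketch, not anything Biostrings provides: the helper name `inspect_fastq_start` and the particular checks (gzip magic number, byte-order marks, NUL bytes, line endings) are assumptions about what could plausibly go wrong, chosen because an editor can display such files as if they were normal text.

```r
# Hypothetical helper (not part of Biostrings): report a few properties of the
# first bytes of a file that could explain a "no FASTQ record found" error.
inspect_fastq_start <- function(path, n_bytes = 4096L) {
  bytes <- readBin(path, what = "raw", n = n_bytes)

  # TRUE if the file begins with the given byte sequence
  starts_with <- function(magic) {
    length(bytes) >= length(magic) && all(bytes[seq_along(magic)] == magic)
  }

  # Positions where a line begins: byte 1, plus the byte after every LF
  line_starts <- c(1L, which(bytes == as.raw(0x0a)) + 1L)
  line_starts <- line_starts[line_starts <= length(bytes)]

  list(
    still_gzipped           = starts_with(as.raw(c(0x1f, 0x8b))),
    utf8_bom                = starts_with(as.raw(c(0xef, 0xbb, 0xbf))),
    utf16_bom               = starts_with(as.raw(c(0xff, 0xfe))) ||
                              starts_with(as.raw(c(0xfe, 0xff))),
    contains_nul            = any(bytes == as.raw(0x00)),
    cr_without_lf           = any(bytes == as.raw(0x0d)) &&
                              !any(bytes == as.raw(0x0a)),
    first_byte_is_at        = starts_with(charToRaw("@")),
    any_line_starts_with_at = any(bytes[line_starts] == as.raw(0x40))
  )
}

# Demo on a small, well-formed temporary FASTQ file
demo_path <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT", "+", "IIII"), demo_path)
str(inspect_fastq_start(demo_path))
```

Running the same helper on the 2021 and 2022 files and comparing the two results might show a difference that is invisible in Sublime or CLC, for example a byte-order mark or NUL bytes from a UTF-16 re-encoding.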
First line of 2021 file:
@M02623:14:000000000-DBT7J:1:1101:16805:1416 1:N:0:TAAGGCGT TCTCACTCCTTTTTTCCTTTTCCTCTTCTTTCTTCACCTTTTTTTTTTTTCTTTTTTTTTTTTTTGTCTCTTTCTTTTTTCCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCGTTTCCTTTTTCTTTT + 1>>A1@DDB@DFFGCF1FGFF3BAADE1D13ADE22A11BFA1A//EA/>011111//>///>///B22222122111B<01>2212?1>?A?---<<:-:9;-9-9-9---;99@;==--;---;--9-----/--/-///9/://////
First line of 2022 file:
@M02623:21:000000000-KHNYN:1:1101:11377:1806 1:N:0:TATGGCGA TTTGTCCTGTGCAGCTTCTGGCTTCAACATTTCTTATCCCTCTCTGCGTCCGGCCCCCCGTCCCCCCCTTCACTGCCTTGCTTTTTTTTCTCCTTCTTCTCTTTCTTCTTTTTTTTCCTTTTTCGTCTTTCTCCGTTTCTCTCTTTTCGCTTTCACATCCAACAACACCTCCTACCTACATTTCTTCTGCTTAACTGCTTTGTTCTCTGCCGTCTTTTTTTTTATCTTTTCTTCTTTTTTTCTTATTCTTCTTTTTTTTCTTTTTCTTTTCTTTTTCTTCTGCTTTGTTCTACTGTGTTCATGGTACTCTTGTCACCGTCTCCTCGTTCTCGTGCGGTGGTTGCGGTTCCGTCGGTGCTTGCTCCCCCGTTTCCTGTTCCTCCGTTTTCCCTTTTCCCTTCTTCCTTTCTTTCTTTTTCCTTTTTTTTTTCTTTTTTTTTCCCTTTCCTTT + CCC@CFFE@F9E,@EFFFF9,CFFE,,,CFEEEF,<,,,;,,,;,6+8++++++7@+++8,+,,+677,,,,<,,,:9,,,<,5,8@@,B,9BA5?E<B,,<@5C,5B?,BA+8+,,,,<,,3+38,+,,,3,+:3,,:3,:,,,3+++++7,8,383,,,,4,6,565,762,6,,,,,,,5,,26>,,,,+4<,,,,,1,14,,,41232+29)2+++++02:0/:7.)19):9*2=):=49@)94(04,=6)982<9).-6:).4(6:)6))6)6))(.6)-.))))..))))))-)..))..)).(4,,(.61((43:40(((,(-(((((,((-,,(()((((((-)),)),((((((()))))-))))((,)((-))-.)))))))(.))))))))-))-))),())))),(,(((().)))(((,()))-))-)-)
So the solution was to open the .gz files in CLC and export them as FASTQ. Decompressing the Illumina fastq.gz files directly did not yield a file usable in Biostrings.
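For anyone who lands here without CLC, the decompression can also be done from within R using base connections. The sketch below is only an illustration: the helper name `gunzip_text` is hypothetical, and it is not confirmed that this would have produced a usable file in the case above. Note also that the Biostrings documentation for readDNAStringSet describes reading gzip-compressed files directly, so passing the .fastq.gz path itself may be worth trying first.

```r
# Hypothetical helper: stream a gzip-compressed text file to an uncompressed
# copy, writing plain LF line endings regardless of platform.
gunzip_text <- function(gz_path, out_path, chunk_lines = 100000L) {
  input <- gzfile(gz_path, open = "rt")
  on.exit(close(input), add = TRUE)

  # Binary mode so that "\n" is not translated to "\r\n" on Windows
  output <- file(out_path, open = "wb")
  on.exit(close(output), add = TRUE)

  repeat {
    lines <- readLines(input, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    writeLines(lines, output, sep = "\n")
  }
  invisible(out_path)
}

# Demo with a small temporary file
gz_demo <- tempfile(fileext = ".fastq.gz")
con <- gzfile(gz_demo, open = "wt")
writeLines(c("@read1", "ACGT", "+", "IIII"), con)
close(con)

fastq_demo <- tempfile(fileext = ".fastq")
gunzip_text(gz_demo, fastq_demo)
readLines(fastq_demo)
```

Reading in chunks keeps memory use bounded for large sequencing files; the chunk size here is an arbitrary choice.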