How to deal with the FASTA "line is too long" error

wang peter ★ 2.0k

@wang-peter-4647

Last seen 9.6 years ago

Hi all,

Sorry to disturb you. I forgot how to deal with FASTA sequences that are too long. I remember someone told me to use the Linux command line?

Thank you in advance.

Shan Gao

updated 12.0 years ago by Marcus Davy ▴ 390 • written 12.0 years ago by wang peter ★ 2.0k

Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

Hi, Shan Gao.

Please follow the suggestions in the posting guide. You seem to be asking about an error that you received, but you do not provide the code, the error itself, the output of traceback() if applicable, or your sessionInfo(). Without that information, it is impossible to give you an answer.

Sean

written 12.0 years ago by Sean Davis 21k

Marcus Davy ▴ 390

@marcus-davy-5153

Last seen 6.1 years ago

Sounds like you have FASTA files whose sequences do not contain newlines. Use the Linux command 'fold' to fix this:

    fold [malformedFile] > [newFile]

From memory, read.DNAStringSet() will fail if a sequence is 20,000 characters or longer and contains no newlines.

Marcus

written 12.0 years ago by Marcus Davy ▴ 390
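
A minimal sketch of the fold approach above. By default fold wraps at 80 columns, and it wraps every line, header lines included, so a description line longer than the width would spill into the sequence. The awk variant below wraps only the sequence lines. The output names wrapped.fasta and folded.fasta are placeholders, and the 200-base toy input stands in for a real malformed file.

```shell
# Build a small malformed FASTA: one header line, then 200 bases on a single line.
printf '>test sequence\n' > test.fasta
awk 'BEGIN { for (i = 0; i < 50; i++) printf "ACGT"; printf "\n" }' >> test.fasta

# Wrap only the sequence lines at 60 columns. Header lines (starting with ">")
# pass through unchanged, so a long description is never split into the sequence.
# Empty lines are dropped because the loop body never runs for them.
awk -v width=60 '
  /^>/ { print; next }
  {
    for (i = 1; i <= length($0); i += width)
      print substr($0, i, width)
  }
' test.fasta > wrapped.fasta

# The simpler form from the thread; equivalent when no header exceeds the width.
fold -w 60 test.fasta > folded.fasta
```

For a file whose headers are all shorter than the chosen width, the two outputs are identical, so plain fold is enough.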

Quick example generating a FASTA file that does not contain newlines, to illustrate:

    library(Biostrings)

    set.seed(42)
    n <- 20000
    dna <- paste(sample(c("A","T","G","C"), n, replace=TRUE), collapse="")

    ## Create a fasta file that does not contain newlines
    write(">test sequence", "test.fasta")
    write(dna, "test.fasta", append=TRUE)

    ## n=20,000 bases or above will fail
    try(read.DNAStringSet("test.fasta"))
    Error in .Call2("read_fasta_in_XStringSet", efp_list, nrec, skip, use.names,  :
      reading FASTA file test.fasta: cannot read line 2, line is too long

    n <- 20000-1
    dna <- paste(sample(c("A","T","G","C"), n, replace=TRUE), collapse="")

    write(">test sequence", "test.fasta")
    write(dna, "test.fasta", append=TRUE)

    ## 19999 bases or less will load
    read.DNAStringSet("test.fasta")
      A DNAStringSet instance of length 1
        width seq                                               names
    [1] 19999 GCTCCTTGGACCGCTCACTGCTC...TTAGATTCACCTTGGCATGAAGT test sequence

Marcus

    sessionInfo()
    R version 2.14.1 (2011-12-22)
    Platform: i386-pc-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New Zealand.1252
    [3] LC_MONETARY=English_New Zealand.1252 LC_NUMERIC=C
    [5] LC_TIME=English_New Zealand.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
     [1] ShortRead_1.12.4    latticeExtra_0.6-19 RColorBrewer_1.0-5
     [4] Rsamtools_1.6.3     Biostrings_2.22.0   GenomicRanges_1.6.7
     [7] IRanges_1.12.6      nlme_3.1-103        NGS_0.9.4
    [10] lattice_0.20-6

    loaded via a namespace (and not attached):
     [1] Biobase_2.14.0     bitops_1.0-4.1     BSgenome_1.22.0    grid_2.14.1
     [5] hwriter_1.3        RCurl_1.91-1.1     rtracklayer_1.14.4 tools_2.14.1
     [9] XML_3.9-4.1        zlibbioc_1.0.1

written 12.0 years ago by Marcus Davy ▴ 390
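
The 20,000-character failure point shown above can be checked from the shell before a file is loaded into R. This is only a sketch: long_lines is a hypothetical helper name, and the default limit of 20000 is taken from the behaviour reported in this thread for Biostrings 2.22.0, so it may not hold for other versions.

```shell
# Hypothetical helper: report every line at or over a length limit as
# "<line number><TAB><length>". The default of 20000 mirrors the failure point
# observed in this thread and is an assumption for other Biostrings versions.
long_lines() {
  awk -v limit="${2:-20000}" \
    'length($0) >= limit { printf "%d\t%d\n", NR, length($0) }' "$1"
}

# Demo input: a header plus one 20,000-base line with no internal newlines.
printf '>test sequence\n' > test.fasta
awk 'BEGIN { for (i = 0; i < 5000; i++) printf "ACGT"; printf "\n" }' >> test.fasta

# Lists the offending line; no output means every line is under the limit.
long_lines test.fasta
```

A second argument overrides the limit, for example `long_lines test.fasta 80` to list every line that breaks the usual 80-column recommendation.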

It would be good to know the original use case; functions are written for different purposes, and for instance

    library(Rsamtools)
    fa = FaFile("test.fasta")
    indexFa(fa)
    (param = scanFaIndex(fa))

and finally

    > scanFa(fa, param=param)
      A DNAStringSet instance of length 1
        width seq                                               names
    [1] 20000 CCTCGGGAGGTGCTTCCATGCAC...ATTCTGTCTGGCATCACTAGGCC test

One might use scanFa to read (ranges of) long (e.g., genome-scale) FASTA files, whereas read.DNAStringSet or ShortRead::readFasta are better suited to large collections of shorter sequences.

Martin

written 12.0 years ago by Martin Morgan 25k

Thank you so much for your help. I found the Linux command that limits the FASTA sequence line length to 60:

    fold -w 60 inputfile > outputfile

Shan Gao

written 12.0 years ago by wang peter ★ 2.0k
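
After wrapping, it can be worth confirming that no line exceeds the chosen width and that the bases themselves are unchanged. The sketch below uses a 150-base toy file as a stand-in for inputfile; the variable names max_len, before and after are illustrative only.

```shell
# Stand-in input: a header plus 150 bases on a single line.
printf '>seq1\n' > inputfile
awk 'BEGIN { for (i = 0; i < 150; i++) printf "A"; printf "\n" }' >> inputfile

# The command from the post above.
fold -w 60 inputfile > outputfile

# Longest line in the wrapped file.
max_len=$(awk '{ if (length($0) > m) m = length($0) } END { print m + 0 }' outputfile)

# Sequence content with header lines and newlines removed, before and after.
before=$(grep -v '^>' inputfile | tr -d '\n')
after=$(grep -v '^>' outputfile | tr -d '\n')

echo "max line length: $max_len"
if [ "$before" = "$after" ]; then
  echo "sequence unchanged"
else
  echo "sequence differs"
fi
```

Holding the whole sequence in a shell variable is fine for a quick check on small files; for genome-scale input, comparing checksums of the two stripped streams would be a lighter alternative.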

In my case it was auxiliary information: fragments of two vectors were expected to be present at certain locations in assemblies derived from randomly tagged, barcoded BAC ends in a multiplexed Illumina NGS experiment. A FASTA file sent to me contained vector sequences, one of which was 27 kb long, and it did not load seamlessly with read.DNAStringSet().

I noticed that the sequence body was missing the newlines usually present every 70-80 bases, which are also useful for readability. A description of the FASTA format that I found says "It is recommended that all lines of text be shorter than 80 characters", so the file I was given did not follow the recommended format. My solution at the time was to reinsert the newlines by wrapping the lines with the Linux fold command, which solved the loading problem.

As long as there is a newline within every 20 kb of sequence in a FASTA file, read.DNAStringSet() will work; or, as mentioned, use FaFile() without needing to correct the FASTA format.

    n <- 20000-1
    dna <- c(paste(sample(c("A","T","G","C"), n, replace=TRUE), collapse=""),
             "\n",
             paste(sample(c("A","T","G","C"), n, replace=TRUE), collapse=""))
    write(">test sequence", "test.fasta")
    write(dna, "test.fasta", append=TRUE)
    read.DNAStringSet("test.fasta")
      A DNAStringSet instance of length 1
        width seq                                               names
    [1] 39998 TCAAGCGCATCGGGATCGAGGGT...GACATTGCGTCGTATCGATGTTT test sequence

Marcus

written 12.0 years ago by Marcus Davy ▴ 390
