obtain DNA sequence

Biddie, Simon NIH/NCI [F] ▴ 20

@biddie-simon-nihnci-f-3654

Last seen 9.6 years ago

Dear All,

I am trying to obtain DNA sequences (mouse) from chromosome coordinates. I am relatively new with R and Bioconductor and would appreciate any help.

I have the following style matrix:

      Chr       Start      Stop
    1 chr9   79466420  79466570
    2 chr6   50495860  50496010
    3 chr8   19687900  19688050
    4 chrX   90313740  90313890
    5 chr4  117732780 117732930
    6 chr11   4090400   4090550

I can use the following code to obtain a single sequence by typing in the chromosome number, start and stop manually:

    > library(BSgenome.Mmusculus.UCSC.mm9)
    > seq1 = subseq(Mmusculus$chr9,79466420,79466570)
    > as(seq1, "character")

How would I do this for all the rows in a matrix to be output as a single txt or csv file? ... without having to type each row (I have up to 15,000!) one at a time. Please find below the sessionInfo.

Thank you for any help,

Simon

    > sessionInfo()
    R version 2.8.1 (2008-12-22)
    i386-pc-mingw32

    locale:
    LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

    attached base packages:
    [1] stats graphics grDevices datasets utils methods base

    other attached packages:
    [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11 BSgenome_1.10.5
    [3] Biostrings_2.10.22 IRanges_1.0.16
    [5] R.utils_1.1.3 R.oo_1.4.6
    [7] R.methodsS3_1.0.3

    loaded via a namespace (and not attached):
    [1] grid_2.8.1 lattice_0.17-25 Matrix_0.999375-23

BSgenome • 1.0k views

updated 14.6 years ago by Hervé Pagès 16k • written 14.7 years ago by Biddie, Simon NIH/NCI [F] ▴ 20

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 9.6 years ago

United States

Simon,

Below is code that meets the needs of your explicit question:

    mymat <- <<the matrix you have>>
    uniqueChr <- unique(mymat[,"Chr"])
    extractedDNA <- character(nrow(mymat))
    for (chr in uniqueChr) {
      selected <- which(mymat[,"Chr"] == chr)
      extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]],
        mymat[selected,"Start"], mymat[selected,"End"]))
    }

My question for you: have you tried using the IRanges framework to represent your ranges? It would make this type of processing easier to perform. There are also write functions, such as write.XStringSet and write.XStringViews, that provide export functionality without requiring you to coerce the DNA sequences into character vectors.

Patrick

14.7 years ago Patrick Aboyoun ★ 1.6k
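The per-chromosome loop in the reply above can be tried without a genome package installed. What follows is a minimal sketch in base R only, not the code from the thread: `toy_genome` and `regions` are made-up stand-ins (the strings are not real mouse sequence), and `substring()` takes the place of `Views()` on a BSgenome chromosome. The point it illustrates is the grouping: each chromosome is fetched once, and the results are written back by row index so the output keeps the input order.

```r
# Toy stand-in for a genome: a named list of plain strings (NOT real sequence).
toy_genome <- list(
  chrA = "AAAACCCCGGGGTTTT",
  chrB = "TTGGCCAATTGGCCAA"
)

# Regions in the same Chr/Start/Stop layout as the question (1-based, inclusive).
regions <- data.frame(
  Chr   = c("chrA", "chrB", "chrA"),
  Start = c(1, 3, 9),
  Stop  = c(4, 6, 12),
  stringsAsFactors = FALSE
)

extracted <- character(nrow(regions))
for (chr in unique(regions$Chr)) {
  selected <- which(regions$Chr == chr)   # rows on this chromosome
  chrom <- toy_genome[[chr]]              # fetch the chromosome once
  extracted[selected] <- substring(chrom,
                                   regions$Start[selected],
                                   regions$Stop[selected])
}
extracted
```

Because results are assigned through `selected`, rows from the same chromosome can be scattered anywhere in the table and the output still lines up with the input rows.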

Hi Patrick,

Thanks for your response. I will look into IRanges and XString.
I also tried your code; however, it gives me the following error:

    > mymat
      Chr       Start      Stop
    1 chr9   79466420  79466570
    2 chr6   50495860  50496010
    3 chr8   19687900  19688050
    4 chrX   90313740  90313890
    5 chr4  117732780 117732930
    6 chr11   4090400   4090550
    > uniqueChr <- unique(mymat[,"Chr"])
    > extractedDNA <- character(nrow(mymat))
    > for (chr in uniqueChr) {
    +   selected <- which(mymat[,"Chr"] == chr)
    +   extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]],
    +     mymat[selected,"Start"], mymat[selected,"End"]))
    + }
    Error in newViews(subject, start = start, end = end, names = names, Class = "XStringViews") :
      'start' and 'end' must be numeric vectors
    In addition: Warning message:
    In Views(Mmusculus[[chr]], mymat[selected, "Start"], mymat[selected, :
      masks were dropped

Simon

14.7 years ago Biddie, Simon NIH/NCI [F] ▴ 20

Simon,

I had a typo in my code and should have used Stop for the column name rather than End. Try

    mymat <- <<the matrix you have>>
    uniqueChr <- unique(mymat[,"Chr"])
    extractedDNA <- character(nrow(mymat))
    for (chr in uniqueChr) {
      selected <- which(mymat[,"Chr"] == chr)
      extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]],
        mymat[selected,"Start"], mymat[selected,"Stop"]))
    }

Patrick

14.7 years ago Patrick Aboyoun ★ 1.6k
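The column-name slip above surfaced only as a complaint about non-numeric `start` and `end`, which says nothing about the actual cause. One way to catch this kind of mistake earlier is to check the column names before the loop runs. The sketch below is an illustration, not part of the thread: `check_columns()` is a hypothetical helper written here in base R and does not come from any Bioconductor package.

```r
# Hypothetical helper (not from any package): stop early if columns are missing.
check_columns <- function(x, required = c("Chr", "Start", "Stop")) {
  missing_cols <- setdiff(required, colnames(x))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Two rows in the layout used in the question.
mymat <- data.frame(
  Chr   = c("chr9", "chr6"),
  Start = c(79466420, 50495860),
  Stop  = c(79466570, 50496010),
  stringsAsFactors = FALSE
)

# Passes silently: all three expected columns are present.
check_columns(mymat)

# Asking for "End" fails with a message that names the missing column.
res <- try(check_columns(mymat, c("Chr", "Start", "End")), silent = TRUE)
inherits(res, "try-error")
```

Calling the check once at the top of a script costs almost nothing and turns a confusing downstream error into a message that points at the misspelled name.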

Hervé Pagès 16k

@herve-pages-1542

Last seen 9 hours ago

Seattle, WA, United States

Hi Simon,

The getSeq() function from the BSgenome package is provided for that purpose:

    myseqs <- data.frame(
      Chr=c("chr9", "chr6", "chr8", "chrX", "chr4", "chr11"),
      Start=c(79466420, 50495860, 19687900, 90313740, 117732780, 4090400),
      Stop=c(79466570, 50496010, 19688050, 90313890, 117732930, 4090550))

    > myseqs
      Chr       Start      Stop
    1 chr9   79466420  79466570
    2 chr6   50495860  50496010
    3 chr8   19687900  19688050
    4 chrX   90313740  90313890
    5 chr4  117732780 117732930
    6 chr11   4090400   4090550

    > getSeq(Mmusculus, myseqs$Chr, start=myseqs$Start, end=myseqs$Stop)
    [1] "CTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTC
    TGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCCAAGTGCTGGGATTAACGGTGTGCACCA
    CCACTGCCTGGC"
    [2] "TTACTGTCACCCTCAGAATCACGTGTTCAGATATCCAGCTTCCGGGTGACAAACCCACAAAATTCTCTT
    TTTTCTTTAACCTTACTCTCTCCAACACTTGACCTTTCTTTGTTTATTCCTTCTGGAGTGCCCAGGTCCT
    TATGCATTATGA"
    [3] "GGTAGGTAAGTAATGGTCACCTATTCTCTTTCTATCTGGTATGTCTGCAGGTTGACAGGCTGGTGCCTG
    CCCTTCAACCCAGGAAGCAGAGCTTGTGTTCAATCATTATTGCACATTAACAAGGAAAAAAATGCCTTGT
    TGGATTCTTAAA"
    [4] "TCAAAATGGCAAGAAAAACACTTAAGTTTTTATTACTCAGGGCTCACAGCAGCTAAAAGGTTTCAGCAA
    TATTATATGGCATACAAATTGCAGATTTAACTTGGTTGAGGAGCGTCCCCATGCAATCACCATAATATTT
    TATTGTAGAATA"
    [5] "TTCAAAACGTCCTCCTGCTTCCTCTGTGGTGACCAGCTATGACTCGGGGCATCCCTCCTCAAGGCCTTA
    GTGTTTTGGCTTTGCTCAGTTTCCATGAGGCCTGACCAACCCCTAGGAGTCTCCTCTTTCTGCCTCCTGC
    TACCTGGATGCA"
    [6] "AGCCTGCTCTGTAGGGAACCTTTAGTGGGCTTGAAGTGTTCCCTGACTGCTCTTGAGCACTGGCCAAAA
    GCAAGAAAGCAGCTAGCCCATGAATGGCCCTGTGGGTGGCACAGGCACAGGCAGTGAAACCCCAAGAAGA
    CCAGGTATAATG"

See ?getSeq for more information about this function.

Cheers,
H.

14.6 years ago Hervé Pagès 16k
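The original question also asked for the result as a single txt or csv file, which the replies do not show. A minimal sketch of that last step, using base R's `write.csv()`, might look like the following. It assumes the sequences are available as a character vector; wrapping the `getSeq()` result in `as.character()` should give that whether the function returns plain strings or a Biostrings object. The two short strings in `seqs` are placeholders so the example runs without the genome package, and the `Sequence` column name and the temporary file are choices made here, not anything from the thread.

```r
# First two coordinate rows from the question.
myseqs <- data.frame(
  Chr   = c("chr9", "chr6"),
  Start = c(79466420, 50495860),
  Stop  = c(79466570, 50496010),
  stringsAsFactors = FALSE
)

# In a real session this would be something like:
#   seqs <- as.character(getSeq(Mmusculus, myseqs$Chr,
#                               start = myseqs$Start, end = myseqs$Stop))
# Placeholder strings (NOT real sequence) keep the sketch self-contained.
seqs <- c("ACGT", "TTGA")

# Keep coordinates and sequence together, one row per region.
myseqs$Sequence <- seqs

# A temporary file is used here; swap in a real path such as "sequences.csv".
out_file <- tempfile(fileext = ".csv")
write.csv(myseqs, out_file, row.names = FALSE)

# Read the file back to confirm the round trip.
check <- read.csv(out_file, stringsAsFactors = FALSE)
```

Storing the coordinates next to each sequence keeps the file self-describing, which helps when there are thousands of rows.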
