obtain DNA sequence
@biddie-simon-nihnci-f-3654
Last seen 9.6 years ago
Dear All,

I am trying to obtain DNA sequences (mouse) from chromosome coordinates. I am relatively new with R and Bioconductor and would appreciate any help.

I have the following style matrix:

    Chr     Start      Stop
1  chr9  79466420  79466570
2  chr6  50495860  50496010
3  chr8  19687900  19688050
4  chrX  90313740  90313890
5  chr4 117732780 117732930
6 chr11   4090400   4090550

I can use the following code to obtain a single sequence by typing in the chromosome number, start and stop manually:

> library(BSgenome.Mmusculus.UCSC.mm9)
> seq1 = subseq(Mmusculus$chr9,79466420,79466570)
> as(seq1, "character")

How would I do this for all the rows in a matrix to be output as a single txt or csv file? ... without having to type each row (I have up to 15,000!) one at a time. Please find below the sessionInfo.

Thank you for any help,

Simon

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] BSgenome.Mmusculus.UCSC.mm9_1.3.11 BSgenome_1.10.5
[3] Biostrings_2.10.22                 IRanges_1.0.16
[5] R.utils_1.1.3                      R.oo_1.4.6
[7] R.methodsS3_1.0.3

loaded via a namespace (and not attached):
[1] grid_2.8.1         lattice_0.17-25    Matrix_0.999375-23
BSgenome • 1.0k views
Patrick Aboyoun ★ 1.6k
@patrick-aboyoun-6734
Last seen 9.6 years ago
United States
Simon,

Below is code that meets the needs of your explicit question:

mymat <- <<the matrix you have below>>
uniqueChr <- unique(mymat[,"Chr"])
extractedDNA <- character(nrow(mymat))
for (chr in uniqueChr) {
    selected <- which(mymat[,"Chr"] == chr)
    extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]],
        mymat[selected,"Start"], mymat[selected,"End"]))
}

The question I have for you is: have you tried using the IRanges framework to represent your ranges? It would make this type of processing easier to perform. There are also write functions such as write.XStringSet and write.XStringViews that provide export functionality without requiring you to coerce the DNA sequences into character vectors.

Patrick
Hi Patrick,

Thanks for your response. I will look into IRanges and Xstring. I also tried your code, however it gives me the following error:

> mymat
    Chr     Start      Stop
1  chr9  79466420  79466570
2  chr6  50495860  50496010
3  chr8  19687900  19688050
4  chrX  90313740  90313890
5  chr4 117732780 117732930
6 chr11   4090400   4090550
> uniqueChr <- unique(mymat[,"Chr"])
> extractedDNA <- character(nrow(mymat))
> for (chr in uniqueChr) {
+     selected <- which(mymat[,"Chr"] == chr)
+     extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]],
+         mymat[selected,"Start"], mymat[selected,"End"]))
+ }
Error in newViews(subject, start = start, end = end, names = names, Class = "XStringViews") :
  'start' and 'end' must be numeric vectors
In addition: Warning message:
In Views(Mmusculus[[chr]], mymat[selected, "Start"], mymat[selected,  :
  masks were dropped

Simon
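The error message says that the `start` and `end` values reaching `Views()` were not numeric vectors. One way to catch that kind of problem earlier is to check the coordinate table before using it. The sketch below is base R only; `check_coords` is a hypothetical helper, not a Bioconductor function, and the column names are simply those of the table in this thread. It also handles the case where the table is a character matrix, since an R matrix holds a single type and mixing chromosome names with numbers turns every cell into a string.

```r
# Hypothetical helper (not from any package): verify a coordinate table
# before passing its columns to Views() or getSeq().
check_coords <- function(x) {
  missing_cols <- setdiff(c("Chr", "Start", "Stop"), colnames(x))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (!is.data.frame(x)) {
    # A matrix stores one type only, so convert it to a data frame first.
    x <- as.data.frame(x, stringsAsFactors = FALSE)
  }
  x$Chr <- as.character(x$Chr)
  # as.numeric() on a factor returns level codes, so go through character.
  x$Start <- as.numeric(as.character(x$Start))
  x$Stop  <- as.numeric(as.character(x$Stop))
  if (anyNA(x$Start) || anyNA(x$Stop)) {
    stop("Start and Stop must be numeric")
  }
  if (any(x$Start > x$Stop)) {
    stop("Start must not exceed Stop")
  }
  x
}

# A character matrix like this is easy to create by accident.
m <- cbind(Chr   = c("chr9", "chr6"),
           Start = c("79466420", "50495860"),
           Stop  = c("79466570", "50496010"))
coords <- check_coords(m)
```

After the call, `coords$Start` and `coords$Stop` are numeric, and asking for a column that does not exist fails with a readable message instead of an error from deep inside the range code.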
Simon,

I had a typo in my code and should have used Stop for the column name rather than End. Try

mymat <- <<the matrix you have below>>
uniqueChr <- unique(mymat[,"Chr"])
extractedDNA <- character(nrow(mymat))
for (chr in uniqueChr) {
    selected <- which(mymat[,"Chr"] == chr)
    extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]],
        mymat[selected,"Start"], mymat[selected,"Stop"]))
}

Patrick
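To see the row-order bookkeeping in this loop without loading a genome, here is a small base R sketch. The `toy_genome` list, its sequences and the `coords` values are invented for illustration; `substring()` stands in for `Views()` on a real chromosome, and `write.csv()` is just one way to produce the single output file asked for in the question.

```r
# Toy stand-in for a genome: a named list of chromosome strings (invented data).
toy_genome <- list(
  chrA = "ACGTACGTACGTACGTACGT",
  chrB = "TTTTGGGGCCCCAAAATTTT"
)

# Coordinates in the same Chr/Start/Stop layout as the thread
# (1-based, both ends inclusive).
coords <- data.frame(
  Chr   = c("chrB", "chrA", "chrB"),
  Start = c(5, 1, 9),
  Stop  = c(8, 4, 12),
  stringsAsFactors = FALSE
)

extracted <- character(nrow(coords))
for (chr in unique(coords$Chr)) {
  # Rows that belong to this chromosome, at their original positions.
  selected <- which(coords$Chr == chr)
  # substring() is vectorised over start/stop, so all ranges on one
  # chromosome are pulled in a single call.
  extracted[selected] <- substring(toy_genome[[chr]],
                                   coords$Start[selected],
                                   coords$Stop[selected])
}

# Attach the sequences to the coordinates, one row per input range.
result <- coords
result$Sequence <- extracted

# Write everything to a single CSV file.
out_file <- file.path(tempdir(), "sequences.csv")
write.csv(result, out_file, row.names = FALSE)
```

Because results are assigned into `extracted[selected]`, the output stays in the same order as the input rows even though the loop visits one chromosome at a time.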
@herve-pages-1542
Last seen 9 hours ago
Seattle, WA, United States
Hi Simon,

The getSeq() function from the BSgenome package is provided for that purpose:

myseqs <- data.frame(
    Chr=c("chr9", "chr6", "chr8", "chrX", "chr4", "chr11"),
    Start=c(79466420, 50495860, 19687900, 90313740, 117732780, 4090400),
    Stop=c(79466570, 50496010, 19688050, 90313890, 117732930, 4090550))

> myseqs
    Chr     Start      Stop
1  chr9  79466420  79466570
2  chr6  50495860  50496010
3  chr8  19687900  19688050
4  chrX  90313740  90313890
5  chr4 117732780 117732930
6 chr11   4090400   4090550

> getSeq(Mmusculus, myseqs$Chr, start=myseqs$Start, end=myseqs$Stop)
[1] "CTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTC
TGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCCAAGTGCTGGGATTAACGGTGTGCACCA
CCACTGCCTGGC"
[2] "TTACTGTCACCCTCAGAATCACGTGTTCAGATATCCAGCTTCCGGGTGACAAACCCACAAAATTCTCTT
TTTTCTTTAACCTTACTCTCTCCAACACTTGACCTTTCTTTGTTTATTCCTTCTGGAGTGCCCAGGTCCT
TATGCATTATGA"
[3] "GGTAGGTAAGTAATGGTCACCTATTCTCTTTCTATCTGGTATGTCTGCAGGTTGACAGGCTGGTGCCTG
CCCTTCAACCCAGGAAGCAGAGCTTGTGTTCAATCATTATTGCACATTAACAAGGAAAAAAATGCCTTGT
TGGATTCTTAAA"
[4] "TCAAAATGGCAAGAAAAACACTTAAGTTTTTATTACTCAGGGCTCACAGCAGCTAAAAGGTTTCAGCAA
TATTATATGGCATACAAATTGCAGATTTAACTTGGTTGAGGAGCGTCCCCATGCAATCACCATAATATTT
TATTGTAGAATA"
[5] "TTCAAAACGTCCTCCTGCTTCCTCTGTGGTGACCAGCTATGACTCGGGGCATCCCTCCTCAAGGCCTTA
GTGTTTTGGCTTTGCTCAGTTTCCATGAGGCCTGACCAACCCCTAGGAGTCTCCTCTTTCTGCCTCCTGC
TACCTGGATGCA"
[6] "AGCCTGCTCTGTAGGGAACCTTTAGTGGGCTTGAAGTGTTCCCTGACTGCTCTTGAGCACTGGCCAAAA
GCAAGAAAGCAGCTAGCCCATGAATGGCCCTGTGGGTGGCACAGGCACAGGCAGTGAAACCCCAAGAAGA
CCAGGTATAATG"

See ?getSeq for more information about this function.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
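To get from the `getSeq()` call to the single csv file asked for in the question, something along the following lines may work. This is a sketch under assumptions: it presumes the BSgenome.Mmusculus.UCSC.mm9 package is installed (the lookup is skipped otherwise), the output file name is made up, and `as.character()` is applied because `getSeq()` may return either a character vector or a DNAStringSet depending on the BSgenome release.

```r
# Coordinates from the thread (first three rows); stringsAsFactors = FALSE
# keeps Chr as character, which is what the sequence lookup expects.
myseqs <- data.frame(
  Chr   = c("chr9", "chr6", "chr8"),
  Start = c(79466420, 50495860, 19687900),
  Stop  = c(79466570, 50496010, 19688050),
  stringsAsFactors = FALSE
)

# Assumption: the mm9 genome package is installed; skip the lookup if not.
if (requireNamespace("BSgenome.Mmusculus.UCSC.mm9", quietly = TRUE)) {
  library(BSgenome.Mmusculus.UCSC.mm9)
  seqs <- getSeq(Mmusculus, myseqs$Chr,
                 start = myseqs$Start, end = myseqs$Stop)
  # Coerce to plain strings so the result fits in a data frame column.
  myseqs$Sequence <- as.character(seqs)
  # Hypothetical output location: one row per range, coordinates plus sequence.
  write.csv(myseqs, file.path(tempdir(), "mm9_sequences.csv"),
            row.names = FALSE)
}
```

Since the coordinates are inclusive at both ends, each of these ranges covers Stop - Start + 1 = 151 bases, which matches the length of the sequences printed above.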