how to modify a Biostrings object

0

Entering edit mode

Paul Shannon ★ 1.1k

@paul-shannon-578

Last seen 9.6 years ago

I wish to introduce some mutations into a segment of Hg18 chromosome 9, as part of a simulation I wish to run. I am having difficulty figuring out how to modify the BSgenome-related data structures. I create my (wild-type) variables like this: chr9 <- Hsapiens [['chr9']] hg18.9q32 <<- subseq (chr9, 113900001, 116700000) How would one create a mutant variant of hg18.9q32? My wish is to incrementally -- one gene at a time -- introduce SNPs (and later, indels and translocations). Sometimes these changes are intentionally random, sometimes I want them at very specific places, to 'spike in' an interesting mutation. I see that one can injectSNPs into a genome: Hsapiens.snp <- injectSNPs (Hsapiens, "SNPlocs.Hsapiens.dbSNP. 20071016") I could find no documentation for the SNPlocs data structure. hg18.9q32 appears to be a MaskedDNAString, apparently derived from Biostring and XString, all of which seem to be immutable. Is there any way to do this? Let me apologize in advance if this is documented and I missed it. I have combed the vignettes and the help pages and found nothing. Many thanks, - Paul

SNPlocs SNPlocs • 2.5k views

ADD COMMENT • link updated 15.2 years ago by Hervé Pagès 16k • written 15.2 years ago by Paul Shannon ★ 1.1k

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 9.6 years ago

United States

Paul, Herve Pages is the author of this material and probably has a more clever response than mine, but the replaceLetterAt function (see help(replaceLetterAt) for details) in the Biostrings package has a method for DNAString instances that you can leverage. > library(BSgenome.Hsapiens.UCSC.hg18) > chr9 <- Hsapiens [['chr9']] > hg18.9q32 <- subseq (chr9, 113900001, 116700000) > unmasked(hg18.9q32) 2800000-letter "DNAString" instance seq: TGGCTGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT > replaceLetterAt(unmasked(hg18.9q32), c(1, 3, 5), "ACG") 2800000-letter "DNAString" instance seq: AGCCGGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT > new.hg18.9q32 <- replaceLetterAt(unmasked(hg18.9q32), c(1, 3, 5), "ACG") > masks(new.hg18.9q32) <- masks(hg18.9q32) > new.hg18.9q32 2800000-letter "MaskedDNAString" instance (# for masking) seq: AGCCGGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT masks: maskedwidth maskedratio active names desc 1 0 0.0000000 TRUE AGAPS assembly gaps 2 0 0.0000000 TRUE AMB intra-contig ambiguities (empty) 3 1407815 0.5027911 FALSE RM RepeatMasker 4 21329 0.0076175 FALSE TRF Tandem Repeats Finder [period<=12] all masks together: maskedwidth maskedratio 1408618 0.5030779 all active masks together: maskedwidth maskedratio 0 0 Patrick Quoting Paul Shannon <pshannon at="" systemsbiology.org="">: > I wish to introduce some mutations into a segment of Hg18 chromosome 9, > as part of a simulation I wish to run. I am having difficulty figuring > out how to modify the BSgenome-related data structures. > > I create my (wild-type) variables like this: > > chr9 <- Hsapiens [['chr9']] > hg18.9q32 <<- subseq (chr9, 113900001, 116700000) > > How would one create a mutant variant of hg18.9q32? My wish is to > incrementally -- one gene at a time -- introduce SNPs (and later, > indels and translocations). Sometimes these changes are intentionally > random, sometimes I want them at very specific places, to 'spike in' an > interesting mutation. > > I see that one can injectSNPs into a genome: > > Hsapiens.snp <- injectSNPs (Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016") > > I could find no documentation for the SNPlocs data structure. > > hg18.9q32 appears to be a MaskedDNAString, apparently derived from > Biostring and XString, all of which seem to be immutable. > > Is there any way to do this? > > Let me apologize in advance if this is documented and I missed it. I > have combed the vignettes and the help pages and found nothing. > > Many thanks, > > - Paul > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 15.2 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 9.6 years ago

United States

Paul, Herve Pages is the author of this material and probably has a more clever response than mine, but the replaceLetterAt function (see help(replaceLetterAt) for details) in the Biostrings package has a method for DNAString instances that you can leverage. > library(BSgenome.Hsapiens.UCSC.hg18) > chr9 <- Hsapiens [['chr9']] > hg18.9q32 <- subseq (chr9, 113900001, 116700000) > unmasked(hg18.9q32) 2800000-letter "DNAString" instance seq: TGGCTGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT > replaceLetterAt(unmasked(hg18.9q32), c(1, 3, 5), "ACG") 2800000-letter "DNAString" instance seq: AGCCGGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT > new.hg18.9q32 <- replaceLetterAt(unmasked(hg18.9q32), c(1, 3, 5), "ACG") > masks(new.hg18.9q32) <- masks(hg18.9q32) > new.hg18.9q32 2800000-letter "MaskedDNAString" instance (# for masking) seq: AGCCGGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT masks: maskedwidth maskedratio active names desc 1 0 0.0000000 TRUE AGAPS assembly gaps 2 0 0.0000000 TRUE AMB intra-contig ambiguities (empty) 3 1407815 0.5027911 FALSE RM RepeatMasker 4 21329 0.0076175 FALSE TRF Tandem Repeats Finder [period<=12] all masks together: maskedwidth maskedratio 1408618 0.5030779 all active masks together: maskedwidth maskedratio 0 0 Patrick Quoting Paul Shannon <pshannon at="" systemsbiology.org="">: > I wish to introduce some mutations into a segment of Hg18 chromosome 9, > as part of a simulation I wish to run. I am having difficulty figuring > out how to modify the BSgenome-related data structures. > > I create my (wild-type) variables like this: > > chr9 <- Hsapiens [['chr9']] > hg18.9q32 <<- subseq (chr9, 113900001, 116700000) > > How would one create a mutant variant of hg18.9q32? My wish is to > incrementally -- one gene at a time -- introduce SNPs (and later, > indels and translocations). Sometimes these changes are intentionally > random, sometimes I want them at very specific places, to 'spike in' an > interesting mutation. > > I see that one can injectSNPs into a genome: > > Hsapiens.snp <- injectSNPs (Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016") > > I could find no documentation for the SNPlocs data structure. > > hg18.9q32 appears to be a MaskedDNAString, apparently derived from > Biostring and XString, all of which seem to be immutable. > > Is there any way to do this? > > Let me apologize in advance if this is documented and I missed it. I > have combed the vignettes and the help pages and found nothing. > > Many thanks, > > - Paul > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 15.2 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 9.6 years ago

United States

Paul, Herve Pages is the author of this material and probably has a more clever response than mine, but the replaceLetterAt function (see help(replaceLetterAt) for details) in the Biostrings package has a method for DNAString instances that you can leverage. > library(BSgenome.Hsapiens.UCSC.hg18) > chr9 <- Hsapiens [['chr9']] > hg18.9q32 <- subseq (chr9, 113900001, 116700000) > unmasked(hg18.9q32) 2800000-letter "DNAString" instance seq: TGGCTGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT > replaceLetterAt(unmasked(hg18.9q32), c(1, 3, 5), "ACG") 2800000-letter "DNAString" instance seq: AGCCGGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT > new.hg18.9q32 <- replaceLetterAt(unmasked(hg18.9q32), c(1, 3, 5), "ACG") > masks(new.hg18.9q32) <- masks(hg18.9q32) > new.hg18.9q32 2800000-letter "MaskedDNAString" instance (# for masking) seq: AGCCGGCACCACAGCCATCCCACTAGGGGGAGCTGT...GTGACTGGAGCGTGGAATGTGGAGTTAGGCA AACTT masks: maskedwidth maskedratio active names desc 1 0 0.0000000 TRUE AGAPS assembly gaps 2 0 0.0000000 TRUE AMB intra-contig ambiguities (empty) 3 1407815 0.5027911 FALSE RM RepeatMasker 4 21329 0.0076175 FALSE TRF Tandem Repeats Finder [period<=12] all masks together: maskedwidth maskedratio 1408618 0.5030779 all active masks together: maskedwidth maskedratio 0 0 Patrick Quoting Paul Shannon <pshannon at="" systemsbiology.org="">: > I wish to introduce some mutations into a segment of Hg18 chromosome 9, > as part of a simulation I wish to run. I am having difficulty figuring > out how to modify the BSgenome-related data structures. > > I create my (wild-type) variables like this: > > chr9 <- Hsapiens [['chr9']] > hg18.9q32 <<- subseq (chr9, 113900001, 116700000) > > How would one create a mutant variant of hg18.9q32? My wish is to > incrementally -- one gene at a time -- introduce SNPs (and later, > indels and translocations). Sometimes these changes are intentionally > random, sometimes I want them at very specific places, to 'spike in' an > interesting mutation. > > I see that one can injectSNPs into a genome: > > Hsapiens.snp <- injectSNPs (Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016") > > I could find no documentation for the SNPlocs data structure. > > hg18.9q32 appears to be a MaskedDNAString, apparently derived from > Biostring and XString, all of which seem to be immutable. > > Is there any way to do this? > > Let me apologize in advance if this is documented and I missed it. I > have combed the vignettes and the help pages and found nothing. > > Many thanks, > > - Paul > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 15.2 years ago Patrick Aboyoun ★ 1.6k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Paul, There is very little support for modifying sequences (XString objects) in Biostrings so far. Patrick already mentioned replaceLetterAt() which is limited to replacing some letters in the original sequence by new letters. The letters to replace are specified by position, and each letter to replace is replaced by 1 letter so the length of the sequence is conserved. An "in place" version of replaceLetterAt() is in fact what is used behind the scene when you inject SNPs from a SNPlocs package with injectSNPs() (?replaceLetterAt gives some details about this). The SNPs contained in the SNPlocs.Hsapiens.dbSNP.20071016 package are accessible via 2 functions: getSNPcount() and getSNPlocs(), both defined in the SNPlocs pkg. The 1st function returns the nb of known SNPs per chromosome: > library(SNPlocs.Hsapiens.dbSNP.20071016) > getSNPcount() chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 694713 673384 563502 580624 512891 583597 480870 441049 377816 449366 451615 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 423277 330632 265729 242233 256253 224656 237958 190296 253012 118161 149667 chrX chrY 308827 6054 and the 2nd function returns a data frame describing the SNPs for a given chromosome: > chr9snps <- getSNPlocs("chr9") > dim(chr9snps) [1] 377816 3 > head(chr9snps) RefSNP_id alleles_as_ambig loc 1 9407282 R 42963 2 2260177 M 44415 3 10217501 S 44444 4 9299000 R 44551 5 10217133 K 44553 6 10217138 K 44692 It contains 1 SNP per row: the 1st column is the RefSNP id assigned by dbSNP to the SNP, the 2nd col is the IUPAC ambiguity letter describing the alleles for the SNP, and the 3rd col is its position in the chromosome. You could for example extract region 9q32 and then inject the SNPs that belong to this region with: > hg18.9q32 <- unmasked(subseq(Hsapiens$chr9, 113900001, 116700000)) > snp_is_in_range <- 113900001 <= chr9snps$loc & chr9snps$loc <= 116700000 > hg18.9q32.snps <- chr9snps[snp_is_in_range, ] > dim(hg18.9q32.snps) [1] 8994 3 > head(hg18.9q32.snps) RefSNP_id alleles_as_ambig loc 293622 12350562 M 113900013 293623 10124666 W 113900343 293624 3739698 K 113900698 293625 3739697 Y 113900837 293626 12377671 Y 113901284 293627 10119477 R 113901385 > hg18.9q32.withsnps <- replaceLetterAt(hg18.9q32, hg18.9q32.snps$loc - 113900001 + 1, hg18.9q32.snps$alleles_as_ambig, if.not.extending="merge") Note that you get the same thing if you inject the SNPs genome-wide first and then extract region 9q32: > hg18.withsnps <- injectSNPs(Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016") > hg18.9q32.withsnps2 <- unmasked(subseq(hg18.withsnps$chr9, 113900001, 116700000)) > hg18.9q32.withsnps2 == hg18.9q32.withsnps [1] TRUE Now if you want to inject SNPs incrementally, one region at a time, in chr9, you can do: injectSNPsInRegion <- function(x, start, end, snps) { ii <- start <= snps$loc & snps$loc <= end snps2 <- snps[ii, ] replaceLetterAt(x, snps2$loc - start + 1, snps2$alleles_as_ambig, if.not.extending="merge") } > chr9 <- unmasked(Hsapiens$chr9) > chr9withsnps <- injectSNPsInRegion(chr9, 55001, 95000, chr9snps) If you put this in a loop, e.g.: chr9withsnps <- chr9 for (...) { get the start/end of a gene in chr9 from somewhere ... chr9withsnps <- injectSNPsInRegion(chr9withsnps, start, end, chr9snps) ... do some analysis on chr9withsnps } don't try to keep all the intermediate versions of chr9withsnps otherwise you'll run into memory problems. In the above code the old 'chr9withsnps' is replaced by the new 'chr9withsnps' at each iteration so the memory used by the old 'chr9withsnps' can be garbage-collected. More facilities for modifying XString objects will need to be added to the Biostrings package if you want to be able to delete or insert regions in the chromosome. For example I could add a replacement method for subseq() that could be used like this: > subseq(chr9, 201, 300) <- "" # deletion of region 201-300 > subseq(chr9, 201, 300) <- "ACCACGTAATG" # replacement of region 201-300 # by a shorter sequence > subseq(chr9, 201, 205) <- "ACCACGTAATG" # replacement of region 201-205 # by a longer sequence > subseq(chr9, 201, 200) <- "ACCACGTAATG" # insertion of ACCACGTAATG between # position 200 and 201 The last form looks a little bit strange but is consistent with the general behaviour of subseq() and with the representation of empty ranges (see ?Ranges). Let me know if that would be useful. H. Paul Shannon wrote: > I wish to introduce some mutations into a segment of Hg18 chromosome 9, > as part of a simulation I wish to run. I am having difficulty figuring > out how to modify the BSgenome-related data structures. > > I create my (wild-type) variables like this: > > chr9 <- Hsapiens [['chr9']] > hg18.9q32 <<- subseq (chr9, 113900001, 116700000) > > How would one create a mutant variant of hg18.9q32? My wish is to > incrementally -- one gene at a time -- introduce SNPs (and later, indels > and translocations). Sometimes these changes are intentionally random, > sometimes I want them at very specific places, to 'spike in' an > interesting mutation. > > I see that one can injectSNPs into a genome: > > Hsapiens.snp <- injectSNPs (Hsapiens, "SNPlocs.Hsapiens.dbSNP.20071016") > > I could find no documentation for the SNPlocs data structure. > > hg18.9q32 appears to be a MaskedDNAString, apparently derived from > Biostring and XString, all of which seem to be immutable. > > Is there any way to do this? > > Let me apologize in advance if this is documented and I missed it. I > have combed the vignettes and the help pages and found nothing. > > Many thanks, > > - Paul > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 15.2 years ago Hervé Pagès 16k

Login before adding your answer.