insert Ns for repeat masked regions
3
0
Entering edit mode
rna seq ▴ 90
@rna-seq-4145
Last seen 8.2 years ago
Hello List, I am trying to retrieve a sequence of ~1000 nts using the getseq() function from the BSgenomes package I would like to replace the repeat masked regions with Ns using something similar to the inject snps function from the SNPlocs.Hsapiens.dbSNP package. So far I can grab sequence from the genome either masked: getSeq(Hsapiens, "chr21", 33665196, 33665435, as.character=FALSE) or unmasked: getSeq(hg19snp, "chr21", 33665196, 33665435, as.character=FALSE) The problem is that the masked function returns a gap: TCCCAGGATGTGACATTGTTTGCCAGTGCAGAGGC...GGAGCTTTGGAAGAAGAGAGAGTTGACTACGG AAA and I would like the gap to be filled with Ns? Thanks in advance, Pete [[alternative HTML version deleted]]
• 1.3k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 12 hours ago
Seattle, WA, United States
Hi Pete, The getSeq() function always drops the masks. See the man page: Note that the masks in ?x?, if any, are always ignored. In other words, masked regions in the genome are extracted in the same way as unmasked regions (this is achieved by dropping the masks before extraction). If you want to extract sequences with their masks on, you will need to extract them one at a time by: (1) loading the corresponding chromosome, (2) activate/deactivate the masks you want, (3) use subseq(): > library(BSgenome.Hsapiens.UCSC.hg19) > chr1 <- Hsapiens$chr1 # this loads the sequence in memory > active(masks(chr1)) AGAPS AMB RM TRF TRUE TRUE FALSE FALSE > active(masks(chr1))["RM"] <- TRUE # activate the RepeatMasker mask > active(masks(chr1)) AGAPS AMB RM TRF TRUE TRUE TRUE FALSE > myseq <- subseq(chr1, start=530000, end=530999) > myseq 1000-letter "MaskedDNAString" instance (# for masking) seq: ##############################CCGGTT...CCGGCCGTGTAGGCCCCTCTAGAG CCAAGACGCTGC masks: maskedwidth maskedratio active names desc 1 0 0.000 TRUE AGAPS assembly gaps 2 0 0.000 TRUE AMB intra-contig ambiguities (empty) 3 594 0.594 TRUE RM RepeatMasker 4 0 0.000 FALSE TRF Tandem Repeats Finder [period<=12] all masks together: maskedwidth maskedratio 594 0.594 all active masks together: maskedwidth maskedratio 594 0.594 > unmasked(myseq) 1000-letter "DNAString" instance seq: TAAAAGTACCCTGTCAAAGGGTGAGCATTTCCGGTT...CCGGCCGTGTAGGCCCCTCTAGAG CCAAGACGCTGC Then you can use injectHardMask() on 'myseq' to replace the masked regions with Ns: > injectHardMask(myseq, "N") 1000-letter "DNAString" instance seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCGGTT...CCGGCCGTGTAGGCCCCTCTAGAG CCAAGACGCTGC If you need to extract sequences from many different chromosomes you'll need to pay attention to memory management. Ideally you want to load one chromosome at a time (this is a costly operation), and, when you are done with it, you want to make it a candidate for garbage collection by removing explicitly any object holding a reference to it. In the above example, both 'chr1' and 'myseq' are holding a reference to the chr1 sequence that was loaded in memory (in the case of 'myseq' this is because subseq() doesn't copy the sequence data) so they both need to be removed with 'rm(chr1)' and 'rm(myseq)'. Keeping the result of injectHardMask() around is ok because injectHardMask() does copy the sequence data. See the PASS-BY-ADDRESS SEMANTIC, CACHING AND MEMORY USAGE subsection in the examples section of the man page for BSgenome objects for more information and tips about this. Cheers, H. ----- Original Message ----- From: "rna seq" <rna.seeker@gmail.com> To: "BioC List" <bioconductor at="" stat.math.ethz.ch=""> Sent: Tuesday, March 1, 2011 2:50:39 PM Subject: [BioC] insert Ns for repeat masked regions Hello List, I am trying to retrieve a sequence of ~1000 nts using the getseq() function from the BSgenomes package I would like to replace the repeat masked regions with Ns using something similar to the inject snps function from the SNPlocs.Hsapiens.dbSNP package. So far I can grab sequence from the genome either masked: getSeq(Hsapiens, "chr21", 33665196, 33665435, as.character=FALSE) or unmasked: getSeq(hg19snp, "chr21", 33665196, 33665435, as.character=FALSE) The problem is that the masked function returns a gap: TCCCAGGATGTGACATTGTTTGCCAGTGCAGAGGC...GGAGCTTTGGAAGAAGAGAGAGTTGACTACGG AAA and I would like the gap to be filled with Ns? Thanks in advance, Pete [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 12 hours ago
Seattle, WA, United States
Hi Pete, The getSeq() function always drops the masks. See the man page: Note that the masks in ?x?, if any, are always ignored. In other words, masked regions in the genome are extracted in the same way as unmasked regions (this is achieved by dropping the masks before extraction). If you want to extract sequences with their masks on, you will need to extract them one at a time by: (1) loading the corresponding chromosome, (2) activate/deactivate the masks you want, (3) use subseq(): > library(BSgenome.Hsapiens.UCSC.hg19) > chr1 <- Hsapiens$chr1 # this loads the sequence in memory > active(masks(chr1)) AGAPS AMB RM TRF TRUE TRUE FALSE FALSE > active(masks(chr1))["RM"] <- TRUE # activate the RepeatMasker mask > active(masks(chr1)) AGAPS AMB RM TRF TRUE TRUE TRUE FALSE > myseq <- subseq(chr1, start=530000, end=530999) > myseq 1000-letter "MaskedDNAString" instance (# for masking) seq: ##############################CCGGTT...CCGGCCGTGTAGGCCCCTCTAGAG CCAAGACGCTGC masks: maskedwidth maskedratio active names desc 1 0 0.000 TRUE AGAPS assembly gaps 2 0 0.000 TRUE AMB intra-contig ambiguities (empty) 3 594 0.594 TRUE RM RepeatMasker 4 0 0.000 FALSE TRF Tandem Repeats Finder [period<=12] all masks together: maskedwidth maskedratio 594 0.594 all active masks together: maskedwidth maskedratio 594 0.594 > unmasked(myseq) 1000-letter "DNAString" instance seq: TAAAAGTACCCTGTCAAAGGGTGAGCATTTCCGGTT...CCGGCCGTGTAGGCCCCTCTAGAG CCAAGACGCTGC Then you can use injectHardMask() on 'myseq' to replace the masked regions with Ns: > injectHardMask(myseq, "N") 1000-letter "DNAString" instance seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCGGTT...CCGGCCGTGTAGGCCCCTCTAGAG CCAAGACGCTGC If you need to extract sequences from many different chromosomes you'll need to pay attention to memory management. Ideally you want to load one chromosome at a time (this is a costly operation), and, when you are done with it, you want to make it a candidate for garbage collection by removing explicitly any object holding a reference to it. In the above example, both 'chr1' and 'myseq' are holding a reference to the chr1 sequence that was loaded in memory (in the case of 'myseq' this is because subseq() doesn't copy the sequence data) so they both need to be removed with 'rm(chr1)' and 'rm(myseq)'. Keeping the result of injectHardMask() around is ok because injectHardMask() does copy the sequence data. See the PASS-BY-ADDRESS SEMANTIC, CACHING AND MEMORY USAGE subsection in the examples section of the man page for BSgenome objects for more information and tips about this. Cheers, H. ----- Original Message ----- From: "rna seq" <rna.seeker@gmail.com> To: "BioC List" <bioconductor at="" stat.math.ethz.ch=""> Sent: Tuesday, March 1, 2011 2:50:39 PM Subject: [BioC] insert Ns for repeat masked regions Hello List, I am trying to retrieve a sequence of ~1000 nts using the getseq() function from the BSgenomes package I would like to replace the repeat masked regions with Ns using something similar to the inject snps function from the SNPlocs.Hsapiens.dbSNP package. So far I can grab sequence from the genome either masked: getSeq(Hsapiens, "chr21", 33665196, 33665435, as.character=FALSE) or unmasked: getSeq(hg19snp, "chr21", 33665196, 33665435, as.character=FALSE) The problem is that the masked function returns a gap: TCCCAGGATGTGACATTGTTTGCCAGTGCAGAGGC...GGAGCTTTGGAAGAAGAGAGAGTTGACTACGG AAA and I would like the gap to be filled with Ns? Thanks in advance, Pete [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 21 months ago
United States
Hi, On Tue, Mar 1, 2011 at 5:50 PM, rna seq <rna.seeker at="" gmail.com=""> wrote: > Hello List, > > I am trying to retrieve a sequence of ~1000 nts using the getseq() function > from the BSgenomes package > > I would like to replace the repeat masked regions with Ns > > using something similar to the inject snps function from the > SNPlocs.Hsapiens.dbSNP package. > > So far I can grab sequence from the genome either masked: getSeq(Hsapiens, > "chr21", ?33665196, 33665435, ?as.character=FALSE) > > or unmasked: getSeq(hg19snp, "chr21", ?33665196, 33665435, > as.character=FALSE) > > The problem is that the masked function returns a gap: > > TCCCAGGATGTGACATTGTTTGCCAGTGCAGAGGC...GGAGCTTTGGAAGAAGAGAGAGTTGACTAC GGAAA > > ?and I would like the gap to be filled with Ns? I'm not sure that there is a gap there, as the middle '...' is just a result of how XString objects "show" themselves in R. Is that what you're talking about? Look at the result you get when you set as.character=TRUE R> library(BSgenome.Hsapiens.UCSC.hg19) R> getSeq(Hsapiens, "chr21", 33665196, 33665435, as.character=TRUE) -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT

Login before adding your answer.

Traffic: 861 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6