matchPattern in BSgenome.Hsapiens.UCSC.hg19
2
0
Entering edit mode
Julie Zhu ★ 4.3k
@julie-zhu-3596
Last seen 7 months ago
United States

Hi,

I am trying to locate GGCGTGGTGGTGTGTGCCTGTAG at the minus strand of chr16 of hg19 using matchPattern function. However, when I try to view one of the hits  chr16:24167733-24167755  via UCSC genome browser hg19  , the reverse complement of the genomic sequence at chr16:24167733-24167755 is not GGCGTGGTGGTGTGTGCCTGTAG. Blat search also fails to return chr16:24167733-24167755

Any ideas on why I am seeing the discrepancy? FYI, I tried both unmasked and masked hg19 Bsgenome.

Here is the code snippet.

library(BSgenome.Hsapiens.UCSC.hg19)

m1 <- matchPattern("GGCGTGGTGGTGTGTGCCTGTAG", reverseComplement(Hsapiens$chr16), max.mismatch=0)

 length(Hsapiens$chr16) - end(m1)[7] + 1

#[1] 24167733

 length(Hsapiens$chr16) - start(m1)[7] + 1

#[1] 24167755

Many thanks!

Best regards,

Julie

sessionInfo()

R version 3.2.2 Patched (2015-09-13 r69389)

Platform: x86_64-apple-darwin10.8.0 (64-bit)

Running under: OS X 10.8.5 (Mountain Lion)

 

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

 

attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils    

[7] datasets  methods   base     

 

other attached packages:

 [1] BSgenome.Hsapiens.UCSC.hg19.masked_1.3.99

 [2] BSgenome.Hsapiens.UCSC.hg19_1.4.0        

 [3] BSgenome_1.36.3                          

 [4] rtracklayer_1.28.10                      

 [5] Biostrings_2.36.4                        

 [6] XVector_0.8.0                            

 [7] GenomicRanges_1.20.8                     

 [8] GenomeInfoDb_1.4.3                       

 [9] IRanges_2.2.9                            

[10] S4Vectors_0.6.6                          

[11] BiocGenerics_0.14.0                      

 

loaded via a namespace (and not attached):

 [1] XML_3.98-1.3            Rsamtools_1.20.5       

 [3] bitops_1.0-6            GenomicAlignments_1.4.2

 [5] futile.options_1.0.0    zlibbioc_1.14.0        

 [7] futile.logger_1.4.1     lambda.r_1.1.7         

 [9] BiocParallel_1.2.22     tools_3.2.2            

[11] RCurl_1.95-4.7     

 

BSgenome.Hsapiens.UCSC.hg19 • 1.8k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 6 hours ago
United States

It seems to me that it's OK. You can use matchPattern() on the negative strand of chr16, or just as easily use it on the reverseComplement() of your sequence:

> m1
  Views on a 90354753-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
       start      end width
[1] 22208414 22208436    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[2] 24264902 24264924    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[3] 33310309 33310331    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[4] 55533394 55533416    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[5] 56863250 56863272    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[6] 64513450 64513472    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[7] 66186999 66187021    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[8] 68122830 68122852    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[9] 69048579 69048601    23 [GGCGTGGTGGTGTGTGCCTGTAG]

> m2 <- matchPattern(as.character(reverseComplement(DNAStringSet("GGCGTGGTGGTGTGTGCCTGTAG"))), Hsapiens$chr16, max.mismatch=0)
> m2
  Views on a 90354753-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
       start      end width
[1] 21306153 21306175    23 [CTACAGGCACACACCACCACGCC]
[2] 22231902 22231924    23 [CTACAGGCACACACCACCACGCC]
[3] 24167733 24167755    23 [CTACAGGCACACACCACCACGCC]
[4] 25841282 25841304    23 [CTACAGGCACACACCACCACGCC]
[5] 33491482 33491504    23 [CTACAGGCACACACCACCACGCC]
[6] 34821338 34821360    23 [CTACAGGCACACACCACCACGCC]
[7] 57044423 57044445    23 [CTACAGGCACACACCACCACGCC]
[8] 66089830 66089852    23 [CTACAGGCACACACCACCACGCC]
[9] 68146318 68146340    23 [CTACAGGCACACACCACCACGCC]

So [3] here is the same as [7] above, right? And if we look at that position on UCSC, it matches the sequence for m2, CTACAGGCACACACCACCACGCC.

.

ADD COMMENT
0
Entering edit mode
Thanks, James! I must have done reverse complement wrong by eye. Do you know why running BLAT with GGCGTGGTGGTGTGTGCCTGTAG does not return chr16:24167733 � 24167755? Best, Julie From: "James W. MacDonald [bioc]" <noreply@bioconductor.org<mailto:noreply@bioconductor.org>> Reply-To: "reply+00d04468+code@bioconductor.org<mailto:reply+00d04468+code@bioconductor.org>" <reply+00d04468+code@bioconductor.org<mailto:reply+00d04468+code@bioconductor.org>> Date: Monday, November 2, 2015 5:12 PM To: Lihua Julie Zhu <julie.zhu@umassmed.edu<mailto:julie.zhu@umassmed.edu>> Subject: [bioc] A: matchPattern in BSgenome.Hsapiens.UCSC.hg19 Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User James W. MacDonald<https: support.bioconductor.org="" u="" 5106=""/> wrote Answer: matchPattern in BSgenome.Hsapiens.UCSC.hg19<https: support.bioconductor.org="" p="" 74041="" #74049="">: It seems to me that it's OK. You can use matchPattern() on the negative strand of chr16, or just as easily use it on the reverseComplement() of your sequence: > m1 Views on a 90354753-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN views: start end width [1] 22208414 22208436 23 [GGCGTGGTGGTGTGTGCCTGTAG] [2] 24264902 24264924 23 [GGCGTGGTGGTGTGTGCCTGTAG] [3] 33310309 33310331 23 [GGCGTGGTGGTGTGTGCCTGTAG] [4] 55533394 55533416 23 [GGCGTGGTGGTGTGTGCCTGTAG] [5] 56863250 56863272 23 [GGCGTGGTGGTGTGTGCCTGTAG] [6] 64513450 64513472 23 [GGCGTGGTGGTGTGTGCCTGTAG] [7] 66186999 66187021 23 [GGCGTGGTGGTGTGTGCCTGTAG] [8] 68122830 68122852 23 [GGCGTGGTGGTGTGTGCCTGTAG] [9] 69048579 69048601 23 [GGCGTGGTGGTGTGTGCCTGTAG] > m2 <- matchPattern(as.character(reverseComplement(DNAStringSet("GGCGTGGTGGTGTGTGCCTGTAG"))), Hsapiens$chr16, max.mismatch=0) > m2 Views on a 90354753-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN views: start end width [1] 21306153 21306175 23 [CTACAGGCACACACCACCACGCC] [2] 22231902 22231924 23 [CTACAGGCACACACCACCACGCC] [3] 24167733 24167755 23 [CTACAGGCACACACCACCACGCC] [4] 25841282 25841304 23 [CTACAGGCACACACCACCACGCC] [5] 33491482 33491504 23 [CTACAGGCACACACCACCACGCC] [6] 34821338 34821360 23 [CTACAGGCACACACCACCACGCC] [7] 57044423 57044445 23 [CTACAGGCACACACCACCACGCC] [8] 66089830 66089852 23 [CTACAGGCACACACCACCACGCC] [9] 68146318 68146340 23 [CTACAGGCACACACCACCACGCC] So [3] here is the same as [7] above, right? And if we look at that position on UCSC<http: genome.ucsc.edu="" cgi-bin="" hgtracks?db="hg19&amp;position=chr16%3A24167733-24167755&amp;hgsid=451594119_SmrMZKIGmtmnoWPEzNIdOEtbKgSY">, it matches the sequence for m2, CTACAGGCACACACCACCACGCC. . ________________________________ Post tags: BSgenome.Hsapiens.UCSC.hg19 You may reply via email or visit A: matchPattern in BSgenome.Hsapiens.UCSC.hg19
ADD REPLY
0
Entering edit mode

Hi Julie,

I'm not sure why blat doesn't return it. My best guess is that blat is restricted to returning the first 135 hits (I get 135 for both the forward and reverse sequences), and blat starts on chr1, so it returns the first 135 hits and then stops. I don't see anything about a limit on how many hits it returns, so I could be wrong, but that's my best guess.

ADD REPLY
0
Entering edit mode
Interesting observation! Thanks, James! Best, Julie On Nov 3, 2015, at 9:31 AM, James W. MacDonald [bioc] <noreply@bioconductor.org<mailto:noreply@bioconductor.org>> wrote: Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User James W. MacDonald<https: support.bioconductor.org="" u="" 5106=""/> wrote Comment: matchPattern in BSgenome.Hsapiens.UCSC.hg19<https: support.bioconductor.org="" p="" 74041="" #74070="">: Hi Julie, I'm not sure why blat doesn't return it. My best guess is that blat is restricted to returning the first 135 hits (I get 135 for both the forward and reverse sequences), and blat starts on chr1, so it returns the first 135 hits and then stops. I don't see anything about a limit on how many hits it returns, so I could be wrong, but that's my best guess. ________________________________ Post tags: BSgenome.Hsapiens.UCSC.hg19 You may reply via email or visit C: matchPattern in BSgenome.Hsapiens.UCSC.hg19
ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 9 hours ago
Seattle, WA, United States

Hi Julie,

Even though matchPattern(pattern, reverseComplement(subject)) and matchPattern(reverseComplement(pattern), subject) are both finding matches on the minus strand, the latter is the recommended way. One of the reasons the former is discouraged is because it doesn't follow the universal convention of reporting the matches with respect to the 1st nucleotide of the plus strand (which is how things should always be reported whatever the strand is). See  C. SEARCHING THE MINUS STRAND OF A CHROMOSOME in ?reverseComplement where this is discussed in more details.

Cheers,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 801 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6