Search
Question: matchPattern in BSgenome.Hsapiens.UCSC.hg19
0
gravatar for Julie Zhu
2.9 years ago by
Julie Zhu3.8k
United States
Julie Zhu3.8k wrote:

Hi,

I am trying to locate GGCGTGGTGGTGTGTGCCTGTAG at the minus strand of chr16 of hg19 using matchPattern function. However, when I try to view one of the hits  chr16:24167733-24167755  via UCSC genome browser hg19  , the reverse complement of the genomic sequence at chr16:24167733-24167755 is not GGCGTGGTGGTGTGTGCCTGTAG. Blat search also fails to return chr16:24167733-24167755

Any ideas on why I am seeing the discrepancy? FYI, I tried both unmasked and masked hg19 Bsgenome.

Here is the code snippet.

library(BSgenome.Hsapiens.UCSC.hg19)

m1 <- matchPattern("GGCGTGGTGGTGTGTGCCTGTAG", reverseComplement(Hsapiens$chr16), max.mismatch=0)

 length(Hsapiens$chr16) - end(m1)[7] + 1

#[1] 24167733

 length(Hsapiens$chr16) - start(m1)[7] + 1

#[1] 24167755

Many thanks!

Best regards,

Julie

sessionInfo()

R version 3.2.2 Patched (2015-09-13 r69389)

Platform: x86_64-apple-darwin10.8.0 (64-bit)

Running under: OS X 10.8.5 (Mountain Lion)

 

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

 

attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils    

[7] datasets  methods   base     

 

other attached packages:

 [1] BSgenome.Hsapiens.UCSC.hg19.masked_1.3.99

 [2] BSgenome.Hsapiens.UCSC.hg19_1.4.0        

 [3] BSgenome_1.36.3                          

 [4] rtracklayer_1.28.10                      

 [5] Biostrings_2.36.4                        

 [6] XVector_0.8.0                            

 [7] GenomicRanges_1.20.8                     

 [8] GenomeInfoDb_1.4.3                       

 [9] IRanges_2.2.9                            

[10] S4Vectors_0.6.6                          

[11] BiocGenerics_0.14.0                      

 

loaded via a namespace (and not attached):

 [1] XML_3.98-1.3            Rsamtools_1.20.5       

 [3] bitops_1.0-6            GenomicAlignments_1.4.2

 [5] futile.options_1.0.0    zlibbioc_1.14.0        

 [7] futile.logger_1.4.1     lambda.r_1.1.7         

 [9] BiocParallel_1.2.22     tools_3.2.2            

[11] RCurl_1.95-4.7     

 

ADD COMMENTlink modified 2.9 years ago by Hervé Pagès ♦♦ 13k • written 2.9 years ago by Julie Zhu3.8k
0
gravatar for James W. MacDonald
2.9 years ago by
United States
James W. MacDonald47k wrote:

It seems to me that it's OK. You can use matchPattern() on the negative strand of chr16, or just as easily use it on the reverseComplement() of your sequence:

> m1
  Views on a 90354753-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
       start      end width
[1] 22208414 22208436    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[2] 24264902 24264924    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[3] 33310309 33310331    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[4] 55533394 55533416    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[5] 56863250 56863272    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[6] 64513450 64513472    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[7] 66186999 66187021    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[8] 68122830 68122852    23 [GGCGTGGTGGTGTGTGCCTGTAG]
[9] 69048579 69048601    23 [GGCGTGGTGGTGTGTGCCTGTAG]

> m2 <- matchPattern(as.character(reverseComplement(DNAStringSet("GGCGTGGTGGTGTGTGCCTGTAG"))), Hsapiens$chr16, max.mismatch=0)
> m2
  Views on a 90354753-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
       start      end width
[1] 21306153 21306175    23 [CTACAGGCACACACCACCACGCC]
[2] 22231902 22231924    23 [CTACAGGCACACACCACCACGCC]
[3] 24167733 24167755    23 [CTACAGGCACACACCACCACGCC]
[4] 25841282 25841304    23 [CTACAGGCACACACCACCACGCC]
[5] 33491482 33491504    23 [CTACAGGCACACACCACCACGCC]
[6] 34821338 34821360    23 [CTACAGGCACACACCACCACGCC]
[7] 57044423 57044445    23 [CTACAGGCACACACCACCACGCC]
[8] 66089830 66089852    23 [CTACAGGCACACACCACCACGCC]
[9] 68146318 68146340    23 [CTACAGGCACACACCACCACGCC]

So [3] here is the same as [7] above, right? And if we look at that position on UCSC, it matches the sequence for m2, CTACAGGCACACACCACCACGCC.

.

ADD COMMENTlink written 2.9 years ago by James W. MacDonald47k
Thanks, James! I must have done reverse complement wrong by eye. Do you know why running BLAT with GGCGTGGTGGTGTGTGCCTGTAG does not return chr16:24167733 � 24167755? Best, Julie From: "James W. MacDonald [bioc]" <noreply@bioconductor.org<mailto:noreply@bioconductor.org>> Reply-To: "reply+00d04468+code@bioconductor.org<mailto:reply+00d04468+code@bioconductor.org>" <reply+00d04468+code@bioconductor.org<mailto:reply+00d04468+code@bioconductor.org>> Date: Monday, November 2, 2015 5:12 PM To: Lihua Julie Zhu <julie.zhu@umassmed.edu<mailto:julie.zhu@umassmed.edu>> Subject: [bioc] A: matchPattern in BSgenome.Hsapiens.UCSC.hg19 Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User James W. MacDonald<https: support.bioconductor.org="" u="" 5106=""/> wrote Answer: matchPattern in BSgenome.Hsapiens.UCSC.hg19<https: support.bioconductor.org="" p="" 74041="" #74049="">: It seems to me that it's OK. You can use matchPattern() on the negative strand of chr16, or just as easily use it on the reverseComplement() of your sequence: > m1 Views on a 90354753-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN views: start end width [1] 22208414 22208436 23 [GGCGTGGTGGTGTGTGCCTGTAG] [2] 24264902 24264924 23 [GGCGTGGTGGTGTGTGCCTGTAG] [3] 33310309 33310331 23 [GGCGTGGTGGTGTGTGCCTGTAG] [4] 55533394 55533416 23 [GGCGTGGTGGTGTGTGCCTGTAG] [5] 56863250 56863272 23 [GGCGTGGTGGTGTGTGCCTGTAG] [6] 64513450 64513472 23 [GGCGTGGTGGTGTGTGCCTGTAG] [7] 66186999 66187021 23 [GGCGTGGTGGTGTGTGCCTGTAG] [8] 68122830 68122852 23 [GGCGTGGTGGTGTGTGCCTGTAG] [9] 69048579 69048601 23 [GGCGTGGTGGTGTGTGCCTGTAG] > m2 <- matchPattern(as.character(reverseComplement(DNAStringSet("GGCGTGGTGGTGTGTGCCTGTAG"))), Hsapiens$chr16, max.mismatch=0) > m2 Views on a 90354753-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN views: start end width [1] 21306153 21306175 23 [CTACAGGCACACACCACCACGCC] [2] 22231902 22231924 23 [CTACAGGCACACACCACCACGCC] [3] 24167733 24167755 23 [CTACAGGCACACACCACCACGCC] [4] 25841282 25841304 23 [CTACAGGCACACACCACCACGCC] [5] 33491482 33491504 23 [CTACAGGCACACACCACCACGCC] [6] 34821338 34821360 23 [CTACAGGCACACACCACCACGCC] [7] 57044423 57044445 23 [CTACAGGCACACACCACCACGCC] [8] 66089830 66089852 23 [CTACAGGCACACACCACCACGCC] [9] 68146318 68146340 23 [CTACAGGCACACACCACCACGCC] So [3] here is the same as [7] above, right? And if we look at that position on UCSC<http: genome.ucsc.edu="" cgi-bin="" hgtracks?db="hg19&amp;position=chr16%3A24167733-24167755&amp;hgsid=451594119_SmrMZKIGmtmnoWPEzNIdOEtbKgSY">, it matches the sequence for m2, CTACAGGCACACACCACCACGCC. . ________________________________ Post tags: BSgenome.Hsapiens.UCSC.hg19 You may reply via email or visit A: matchPattern in BSgenome.Hsapiens.UCSC.hg19
ADD REPLYlink written 2.9 years ago by Julie Zhu3.8k

Hi Julie,

I'm not sure why blat doesn't return it. My best guess is that blat is restricted to returning the first 135 hits (I get 135 for both the forward and reverse sequences), and blat starts on chr1, so it returns the first 135 hits and then stops. I don't see anything about a limit on how many hits it returns, so I could be wrong, but that's my best guess.

ADD REPLYlink written 2.9 years ago by James W. MacDonald47k
Interesting observation! Thanks, James! Best, Julie On Nov 3, 2015, at 9:31 AM, James W. MacDonald [bioc] <noreply@bioconductor.org<mailto:noreply@bioconductor.org>> wrote: Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User James W. MacDonald<https: support.bioconductor.org="" u="" 5106=""/> wrote Comment: matchPattern in BSgenome.Hsapiens.UCSC.hg19<https: support.bioconductor.org="" p="" 74041="" #74070="">: Hi Julie, I'm not sure why blat doesn't return it. My best guess is that blat is restricted to returning the first 135 hits (I get 135 for both the forward and reverse sequences), and blat starts on chr1, so it returns the first 135 hits and then stops. I don't see anything about a limit on how many hits it returns, so I could be wrong, but that's my best guess. ________________________________ Post tags: BSgenome.Hsapiens.UCSC.hg19 You may reply via email or visit C: matchPattern in BSgenome.Hsapiens.UCSC.hg19
ADD REPLYlink written 2.9 years ago by Julie Zhu3.8k
0
gravatar for Hervé Pagès
2.9 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Julie,

Even though matchPattern(pattern, reverseComplement(subject)) and matchPattern(reverseComplement(pattern), subject) are both finding matches on the minus strand, the latter is the recommended way. One of the reasons the former is discouraged is because it doesn't follow the universal convention of reporting the matches with respect to the 1st nucleotide of the plus strand (which is how things should always be reported whatever the strand is). See  C. SEARCHING THE MINUS STRAND OF A CHROMOSOME in ?reverseComplement where this is discussed in more details.

Cheers,

H.

ADD COMMENTlink written 2.9 years ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 344 users visited in the last hour