Question: matchPattern in BSgenome.Hsapiens.UCSC.hg19
0
4.1 years ago by
Julie Zhu4.1k
United States
Julie Zhu4.1k wrote:

Hi,

I am trying to locate GGCGTGGTGGTGTGTGCCTGTAG at the minus strand of chr16 of hg19 using matchPattern function. However, when I try to view one of the hits  chr16:24167733-24167755  via UCSC genome browser hg19  , the reverse complement of the genomic sequence at chr16:24167733-24167755 is not GGCGTGGTGGTGTGTGCCTGTAG. Blat search also fails to return chr16:24167733-24167755

Any ideas on why I am seeing the discrepancy? FYI, I tried both unmasked and masked hg19 Bsgenome.

Here is the code snippet.

library(BSgenome.Hsapiens.UCSC.hg19)

m1 <- matchPattern("GGCGTGGTGGTGTGTGCCTGTAG", reverseComplement(Hsapiens$chr16), max.mismatch=0) length(Hsapiens$chr16) - end(m1)[7] + 1

#[1] 24167733

length(Hsapiens$chr16) - start(m1)[7] + 1 #[1] 24167755 Many thanks! Best regards, Julie sessionInfo() R version 3.2.2 Patched (2015-09-13 r69389) Platform: x86_64-apple-darwin10.8.0 (64-bit) Running under: OS X 10.8.5 (Mountain Lion) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils [7] datasets methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg19.masked_1.3.99 [2] BSgenome.Hsapiens.UCSC.hg19_1.4.0 [3] BSgenome_1.36.3 [4] rtracklayer_1.28.10 [5] Biostrings_2.36.4 [6] XVector_0.8.0 [7] GenomicRanges_1.20.8 [8] GenomeInfoDb_1.4.3 [9] IRanges_2.2.9 [10] S4Vectors_0.6.6 [11] BiocGenerics_0.14.0 loaded via a namespace (and not attached): [1] XML_3.98-1.3 Rsamtools_1.20.5 [3] bitops_1.0-6 GenomicAlignments_1.4.2 [5] futile.options_1.0.0 zlibbioc_1.14.0 [7] futile.logger_1.4.1 lambda.r_1.1.7 [9] BiocParallel_1.2.22 tools_3.2.2 [11] RCurl_1.95-4.7 ADD COMMENTlink modified 4.1 years ago by Hervé Pagès ♦♦ 14k • written 4.1 years ago by Julie Zhu4.1k Answer: matchPattern in BSgenome.Hsapiens.UCSC.hg19 0 4.1 years ago by United States James W. MacDonald52k wrote: It seems to me that it's OK. You can use matchPattern() on the negative strand of chr16, or just as easily use it on the reverseComplement() of your sequence: > m1 Views on a 90354753-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN views: start end width [1] 22208414 22208436 23 [GGCGTGGTGGTGTGTGCCTGTAG] [2] 24264902 24264924 23 [GGCGTGGTGGTGTGTGCCTGTAG] [3] 33310309 33310331 23 [GGCGTGGTGGTGTGTGCCTGTAG] [4] 55533394 55533416 23 [GGCGTGGTGGTGTGTGCCTGTAG] [5] 56863250 56863272 23 [GGCGTGGTGGTGTGTGCCTGTAG] [6] 64513450 64513472 23 [GGCGTGGTGGTGTGTGCCTGTAG] [7] 66186999 66187021 23 [GGCGTGGTGGTGTGTGCCTGTAG] [8] 68122830 68122852 23 [GGCGTGGTGGTGTGTGCCTGTAG] [9] 69048579 69048601 23 [GGCGTGGTGGTGTGTGCCTGTAG] > m2 <- matchPattern(as.character(reverseComplement(DNAStringSet("GGCGTGGTGGTGTGTGCCTGTAG"))), Hsapiens$chr16, max.mismatch=0)
> m2
Views on a 90354753-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
start      end width
[1] 21306153 21306175    23 [CTACAGGCACACACCACCACGCC]
[2] 22231902 22231924    23 [CTACAGGCACACACCACCACGCC]
[3] 24167733 24167755    23 [CTACAGGCACACACCACCACGCC]
[4] 25841282 25841304    23 [CTACAGGCACACACCACCACGCC]
[5] 33491482 33491504    23 [CTACAGGCACACACCACCACGCC]
[6] 34821338 34821360    23 [CTACAGGCACACACCACCACGCC]
[7] 57044423 57044445    23 [CTACAGGCACACACCACCACGCC]
[8] 66089830 66089852    23 [CTACAGGCACACACCACCACGCC]
[9] 68146318 68146340    23 [CTACAGGCACACACCACCACGCC]

So [3] here is the same as [7] above, right? And if we look at that position on UCSC, it matches the sequence for m2, CTACAGGCACACACCACCACGCC.

.

Thanks, James! I must have done reverse complement wrong by eye. Do you know why running BLAT with GGCGTGGTGGTGTGTGCCTGTAG does not return chr16:24167733 � 24167755? Best, Julie From: "James W. MacDonald [bioc]" <noreply@bioconductor.org<mailto:noreply@bioconductor.org>> Reply-To: "reply+00d04468+code@bioconductor.org<mailto:reply+00d04468+code@bioconductor.org>" <reply+00d04468+code@bioconductor.org<mailto:reply+00d04468+code@bioconductor.org>> Date: Monday, November 2, 2015 5:12 PM To: Lihua Julie Zhu <julie.zhu@umassmed.edu<mailto:julie.zhu@umassmed.edu>> Subject: [bioc] A: matchPattern in BSgenome.Hsapiens.UCSC.hg19 Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User James W. MacDonald<https: support.bioconductor.org="" u="" 5106=""/> wrote Answer: matchPattern in BSgenome.Hsapiens.UCSC.hg19<https: support.bioconductor.org="" p="" 74041="" #74049="">: It seems to me that it's OK. You can use matchPattern() on the negative strand of chr16, or just as easily use it on the reverseComplement() of your sequence: > m1 Views on a 90354753-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN views: start end width [1] 22208414 22208436 23 [GGCGTGGTGGTGTGTGCCTGTAG] [2] 24264902 24264924 23 [GGCGTGGTGGTGTGTGCCTGTAG] [3] 33310309 33310331 23 [GGCGTGGTGGTGTGTGCCTGTAG] [4] 55533394 55533416 23 [GGCGTGGTGGTGTGTGCCTGTAG] [5] 56863250 56863272 23 [GGCGTGGTGGTGTGTGCCTGTAG] [6] 64513450 64513472 23 [GGCGTGGTGGTGTGTGCCTGTAG] [7] 66186999 66187021 23 [GGCGTGGTGGTGTGTGCCTGTAG] [8] 68122830 68122852 23 [GGCGTGGTGGTGTGTGCCTGTAG] [9] 69048579 69048601 23 [GGCGTGGTGGTGTGTGCCTGTAG] > m2 <- matchPattern(as.character(reverseComplement(DNAStringSet("GGCGTGGTGGTGTGTGCCTGTAG"))), Hsapiens\$chr16, max.mismatch=0) > m2 Views on a 90354753-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN views: start end width [1] 21306153 21306175 23 [CTACAGGCACACACCACCACGCC] [2] 22231902 22231924 23 [CTACAGGCACACACCACCACGCC] [3] 24167733 24167755 23 [CTACAGGCACACACCACCACGCC] [4] 25841282 25841304 23 [CTACAGGCACACACCACCACGCC] [5] 33491482 33491504 23 [CTACAGGCACACACCACCACGCC] [6] 34821338 34821360 23 [CTACAGGCACACACCACCACGCC] [7] 57044423 57044445 23 [CTACAGGCACACACCACCACGCC] [8] 66089830 66089852 23 [CTACAGGCACACACCACCACGCC] [9] 68146318 68146340 23 [CTACAGGCACACACCACCACGCC] So [3] here is the same as [7] above, right? And if we look at that position on UCSC<http: genome.ucsc.edu="" cgi-bin="" hgtracks?db="hg19&amp;position=chr16%3A24167733-24167755&amp;hgsid=451594119_SmrMZKIGmtmnoWPEzNIdOEtbKgSY">, it matches the sequence for m2, CTACAGGCACACACCACCACGCC. . ________________________________ Post tags: BSgenome.Hsapiens.UCSC.hg19 You may reply via email or visit A: matchPattern in BSgenome.Hsapiens.UCSC.hg19

Hi Julie,

I'm not sure why blat doesn't return it. My best guess is that blat is restricted to returning the first 135 hits (I get 135 for both the forward and reverse sequences), and blat starts on chr1, so it returns the first 135 hits and then stops. I don't see anything about a limit on how many hits it returns, so I could be wrong, but that's my best guess.

Interesting observation! Thanks, James! Best, Julie On Nov 3, 2015, at 9:31 AM, James W. MacDonald [bioc] <noreply@bioconductor.org<mailto:noreply@bioconductor.org>> wrote: Activity on a post you are following on support.bioconductor.org<https: support.bioconductor.org=""> User James W. MacDonald<https: support.bioconductor.org="" u="" 5106=""/> wrote Comment: matchPattern in BSgenome.Hsapiens.UCSC.hg19<https: support.bioconductor.org="" p="" 74041="" #74070="">: Hi Julie, I'm not sure why blat doesn't return it. My best guess is that blat is restricted to returning the first 135 hits (I get 135 for both the forward and reverse sequences), and blat starts on chr1, so it returns the first 135 hits and then stops. I don't see anything about a limit on how many hits it returns, so I could be wrong, but that's my best guess. ________________________________ Post tags: BSgenome.Hsapiens.UCSC.hg19 You may reply via email or visit C: matchPattern in BSgenome.Hsapiens.UCSC.hg19
0
4.1 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Julie,

Even though matchPattern(pattern, reverseComplement(subject)) and matchPattern(reverseComplement(pattern), subject) are both finding matches on the minus strand, the latter is the recommended way. One of the reasons the former is discouraged is because it doesn't follow the universal convention of reporting the matches with respect to the 1st nucleotide of the plus strand (which is how things should always be reported whatever the strand is). See  C. SEARCHING THE MINUS STRAND OF A CHROMOSOME in ?reverseComplement where this is discussed in more details.

Cheers,

H.