Question

CpG Sites // CG dinucleotides

0

Entering edit mode

ibetfer3 • 0

@ibetfer3-13846

Last seen 6.6 years ago

Hello, I would like to know how to get the CpG sites (not islands) from a region of interest. Can anybody help me? Thank you very much!

CpG CpG sites CpG dinucleotides • 1.5k views

ADD COMMENT • link updated 6.6 years ago by Hervé Pagès 16k • written 6.6 years ago by ibetfer3 • 0

score 0 · Answer 1 · 2017-09-02

Hi,

If you have only one region of interest you can do something like this:

library(BSgenome.Dmelanogaster.UCSC.dm3)
my_region_of_interest <- GRanges("chr3LHet:2500000-2555491")
dna <- getSeq(BSgenome.Dmelanogaster.UCSC.dm3,
              my_region_of_interest)
CpG_sites <- shift(GRanges(seqnames(my_region_of_interest),
                           ranges(matchPattern("CG", dna[[1]]))),
                   start(my_region_of_interest) - 1)

Then:

CpG_sites
# GRanges object with 1846 ranges and 0 metadata columns:
#          seqnames             ranges strand
#             <Rle>          <IRanges>  <Rle>
#      [1] chr3LHet [2500021, 2500022]      *
#      [2] chr3LHet [2500058, 2500059]      *
#      [3] chr3LHet [2500060, 2500061]      *
#      [4] chr3LHet [2500075, 2500076]      *
#      [5] chr3LHet [2500080, 2500081]      *
#      ...      ...                ...    ...
#   [1842] chr3LHet [2555387, 2555388]      *
#   [1843] chr3LHet [2555427, 2555428]      *
#   [1844] chr3LHet [2555440, 2555441]      *
#   [1845] chr3LHet [2555464, 2555465]      *
#   [1846] chr3LHet [2555482, 2555483]      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

If you have a small number of regions (e.g. < 1000), you could loop over them and do the above in the loop. However, if you have thousands of regions or more, this might not be very efficient. In that case it might better to start by finding all the CpG sites for all chromosomes:

all_CpG_sites <- vmatchPattern("CG", BSgenome.Dmelanogaster.UCSC.dm3)
all_CpG_sites <- unstrand(all_CpG_sites[strand(all_CpG_sites) == "+"])
all_CpG_sites
# GRanges object with 6650479 ranges and 0 metadata columns:
#              seqnames               ranges strand
#                 <Rle>            <IRanges>  <Rle>
#         [1]     chr2L             [ 1,  2]      *
#         [2]     chr2L             [11, 12]      *
#         [3]     chr2L             [83, 84]      *
#         [4]     chr2L             [85, 86]      *
#         [5]     chr2L             [91, 92]      *
#         ...       ...                  ...    ...
#   [6650475] chrUextra [29003849, 29003850]      *
#   [6650476] chrUextra [29003856, 29003857]      *
#   [6650477] chrUextra [29004133, 29004134]      *
#   [6650478] chrUextra [29004383, 29004384]      *
#   [6650479] chrUextra [29004393, 29004394]      *
#   -------
#   seqinfo: 15 sequences from an unspecified genome

and then to subset all_CpG_sites by overlaps with the GRanges object containing your regions of interest:

subsetByOverlaps(all_CpG_sites, my_regions_of_interest)

Cheers,

H.