CpG Sites // CG dinucleotides
1
0
Entering edit mode
ibetfer3 • 0
@ibetfer3-13846
Last seen 7.2 years ago

Hello, I would like to know how to get the CpG sites (not islands) from a region of interest. Can anybody help me? Thank you very much!

CpG CpG sites CpG dinucleotides • 1.6k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 18 hours ago
Seattle, WA, United States

Hi,

If you have only one region of interest you can do something like this:

library(BSgenome.Dmelanogaster.UCSC.dm3)
my_region_of_interest <- GRanges("chr3LHet:2500000-2555491")
dna <- getSeq(BSgenome.Dmelanogaster.UCSC.dm3,
              my_region_of_interest)
CpG_sites <- shift(GRanges(seqnames(my_region_of_interest),
                           ranges(matchPattern("CG", dna[[1]]))),
                   start(my_region_of_interest) - 1)

Then:

CpG_sites
# GRanges object with 1846 ranges and 0 metadata columns:
#          seqnames             ranges strand
#             <Rle>          <IRanges>  <Rle>
#      [1] chr3LHet [2500021, 2500022]      *
#      [2] chr3LHet [2500058, 2500059]      *
#      [3] chr3LHet [2500060, 2500061]      *
#      [4] chr3LHet [2500075, 2500076]      *
#      [5] chr3LHet [2500080, 2500081]      *
#      ...      ...                ...    ...
#   [1842] chr3LHet [2555387, 2555388]      *
#   [1843] chr3LHet [2555427, 2555428]      *
#   [1844] chr3LHet [2555440, 2555441]      *
#   [1845] chr3LHet [2555464, 2555465]      *
#   [1846] chr3LHet [2555482, 2555483]      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

If you have a small number of regions (e.g. < 1000), you could loop over them and do the above in the loop. However, if you have thousands of regions or more, this might not be very efficient. In that case it might better to start by finding all the CpG sites for all chromosomes:

all_CpG_sites <- vmatchPattern("CG", BSgenome.Dmelanogaster.UCSC.dm3)
all_CpG_sites <- unstrand(all_CpG_sites[strand(all_CpG_sites) == "+"])
all_CpG_sites
# GRanges object with 6650479 ranges and 0 metadata columns:
#              seqnames               ranges strand
#                 <Rle>            <IRanges>  <Rle>
#         [1]     chr2L             [ 1,  2]      *
#         [2]     chr2L             [11, 12]      *
#         [3]     chr2L             [83, 84]      *
#         [4]     chr2L             [85, 86]      *
#         [5]     chr2L             [91, 92]      *
#         ...       ...                  ...    ...
#   [6650475] chrUextra [29003849, 29003850]      *
#   [6650476] chrUextra [29003856, 29003857]      *
#   [6650477] chrUextra [29004133, 29004134]      *
#   [6650478] chrUextra [29004383, 29004384]      *
#   [6650479] chrUextra [29004393, 29004394]      *
#   -------
#   seqinfo: 15 sequences from an unspecified genome

and then to subset all_CpG_sites by overlaps with the GRanges object containing your regions of interest:

subsetByOverlaps(all_CpG_sites, my_regions_of_interest)

Cheers,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 948 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6