How to get the all CpG sites genome postion for hg19
1
0
Entering edit mode
@ericotorrieri-12900
Last seen 7.7 years ago
I want to find all CpG sites coordinates of hg19, could you give me any suggestions?
CpG • 6.8k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

CpG sites, or CpG islands? You can get the islands from UCSC, most easily from AnnotationHub

> library(AnnotationHub)
> hub <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2016-10-11
> query(hub, c("cpg","hg19"))
AnnotationHub with 2 records
# snapshotDate(): 2016-10-11
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH5086"]]'

           title      
  AH5086 | CpG Islands
  AH5096 | Evo Cpg  

 
> cpgs <- hub[["AH5086"]]
downloading from 'https://annotationhub.bioconductor.org/fetch/5086'
retrieving 1 resource
  |======================================================================| 100%
> cpgs
GRanges object with 28691 ranges and 1 metadata column:
                       seqnames           ranges strand |        name
                          <Rle>        <IRanges>  <Rle> | <character>
      [1]                  chr1 [ 28736,  29810]      * |    CpG:_116
      [2]                  chr1 [135125, 135563]      * |     CpG:_30
      [3]                  chr1 [327791, 328229]      * |     CpG:_29
      [4]                  chr1 [437152, 438164]      * |     CpG:_84
      [5]                  chr1 [449274, 450544]      * |     CpG:_99
      ...                   ...              ...    ... .         ...
  [28687]  chr9_gl000201_random [ 15651,  15909]      * |     CpG:_30
  [28688]  chr9_gl000201_random [ 26397,  26873]      * |     CpG:_43
  [28689] chr11_gl000202_random [ 16284,  16540]      * |     CpG:_23
  [28690] chr17_gl000204_random [ 54686,  57368]      * |    CpG:_228
  [28691] chr17_gl000205_random [117501, 117801]      * |     CpG:_23
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

If you mean the actual positions of all CpG sites, then you could use a modification of code posted on this site about 6 years ago, by Aaron Statham How to get the all CpG sites genome postion for mm9?.

> library(BSgenome.Hsapiens.UCSC.hg19)  
> chrs <- names(Hsapiens)[1:24]
> cgs <- lapply(chrs, function(x) start(matchPattern("CG", Hsapiens[[x]])))

> cpgr <- do.call(c, lapply(1:24, function(x) GRanges(names(Hsapiens)[x], IRanges(cgs[[x]], width = 2))))
There were 23 warnings (use warnings() to see them)
> cpgr
GRanges object with 28217009 ranges and 0 metadata columns:
             seqnames               ranges strand
                <Rle>            <IRanges>  <Rle>
         [1]     chr1       [10469, 10470]      *
         [2]     chr1       [10471, 10472]      *
         [3]     chr1       [10484, 10485]      *
         [4]     chr1       [10489, 10490]      *
         [5]     chr1       [10493, 10494]      *
         ...      ...                  ...    ...
  [28217005]     chrY [59362433, 59362434]      *
  [28217006]     chrY [59362485, 59362486]      *
  [28217007]     chrY [59362488, 59362489]      *
  [28217008]     chrY [59362493, 59362494]      *
  [28217009]     chrY [59362591, 59362592]      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths
ADD COMMENT

Login before adding your answer.

Traffic: 576 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6