Assign genes to CpG islands
2
0
Entering edit mode
@jose-luis-lavin-5531
Last seen 6.1 years ago
Spain

Dear all,

I'm looking for a Bioconductor/R package that enables me to annotate a table of CpG island coordinates (downloaded fon the UCSC table browser) to the nearest gene.

My table has the following fields:

#bin    chrom    chromStart    chromEnd    CpGname    length    cpgNum    gcNum    perCpg    perGc    obsExp
 

After a first search on th internet I tried the following code to create a Granges object and then annotate it to the nearest gene and to the preceding gene.

library(TxDb.Mmusculus.UCSC.mm10.ensGene)

genes = genes(TxDb.Mmusculus.UCSC.mm10.ensGene)

cpg<-read.table("mm10_CpG_coords.txt" ,header=TRUE, stringsAsFactors=TRUE, sep ="\t"  )
gr = makeGRangesFromDataFrame(cpg)
gr = GRanges(cpg$chrom, IRanges(cpg$chromStart, cpg$chromEnd))
head(gr)

GRanges with 6 ranges and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]     chr1 [ 84934572,  84935054]      *
  [2]     chr1 [ 63176547,  63177427]      *
  [3]     chr1 [125435174, 125435976]      *
  [4]     chr1 [183368926, 183369826]      *
  [5]     chr1 [  3531624,   3531843]      *
  [6]     chr1 [  3670619,   3671074]      *
  ---
  seqlengths:
                   chr1                chr10 ...                 chrX                 chrY
                     NA                   NA ...                   NA                   NA


near<-genes[nearest(gr, genes)]

pre<-genes[precede(gr, genes)]

 

But I get the following error related to what I think is due to the Granges object I obtain, which doesn't seem to have any seqlenghts information for the different chromosomes...

Error in IRanges:::normalizeSingleBracketSubscript(i, x) :
  subscript contains NAs or out of bounds indices

Would you please help me with this issue.

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] XVector_0.4.0                           hiAnnotator_1.0.0                      
 [3] TxDb.Mmusculus.UCSC.mm10.ensGene_2.14.0 GenomicFeatures_1.16.3                 
 [5] AnnotationDbi_1.26.1                    Biobase_2.24.0                         
 [7] GenomicRanges_1.16.4                    GenomeInfoDb_1.0.2                     
 [9] IRanges_1.22.10                         BiocGenerics_0.10.0                    

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.6           BBmisc_1.9              BiocParallel_0.6.1     
 [5] biomaRt_2.20.0          Biostrings_2.32.1       bitops_1.0-6            brew_1.0-6             
 [9] BSgenome_1.32.0         codetools_0.2-11        colorspace_1.2-6        checkmate_1.6.0        
[13] DBI_0.3.1               digest_0.6.8            fail_1.2                foreach_1.4.2          
[17] GenomicAlignments_1.0.6 ggplot2_1.0.1           grid_3.1.2              gtable_0.1.2           
[21] iterators_1.0.7         magrittr_1.5            MASS_7.3-42             munsell_0.4.2          
[25] plyr_1.8.3              proto_0.3-10            Rcpp_0.11.6             RCurl_1.95-4.7         
[29] reshape2_1.4.1          Rsamtools_1.16.1        RSQLite_1.0.0           rtracklayer_1.24.2     
[33] scales_0.2.5            sendmailR_1.2-1         stats4_3.1.2            stringi_0.5-5          
[37] stringr_1.0.0           tools_3.1.2             XML_3.98-1.3            zlibbioc_1.10.0


Best wishes

JL

annotation genomicranges bsgenome • 2.6k views
ADD COMMENT
2
Entering edit mode
Ou, Jianhong ★ 1.3k
@ou-jianhong-4539
Last seen 20 hours ago
United States

I think you want have a try of ChIPpeakAnno.

## you can load the cpg islands like Michael said,
library(rtracklayer)
session <- browserSession()
genome(session) <- "mm10"
cpg <- session[["CpG Islands"]]
## Or in case you want to import the local file to save the metadata, 
## you can have a try toGRange
library(ChIPpeakAnno)
cpg <- toGRanges("mm10_CpG_coords.txt" ,header=TRUE, stringsAsFactors=TRUE, sep ="\t", colNames=c("bin", "sapce", "start", "end", "CpGname", "length", "cpgNum", "gcNum", "perCpg", "perGc", "obsExp")
## Then annotate the cpgs by annotatePeakInBatch
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
annoData <- annoGR(TxDb.Mmusculus.UCSC.mm10.ensGene)
cpg.anno <- annotatePeakInBatch(cpg, AnnotationData=annoData, output="nearestLocation")
head(cpg.anno)
## You can also have a try EnsDb.Mmusculus.v79
library(EnsDb.Mmusculus.v79)
ensData <- annoGR(EnsDb.Mmusculus.v79)
cpg.anno1 <- annotatePeakInBatch(cpg, AnnotationData=ensData, output="nearestLocation")

Let me know if you have any question.

 

ADD COMMENT
0
Entering edit mode

Thank you Ou, Jianhong, your solution works very well and is easy to implement!

Best wishes

JL

ADD REPLY
1
Entering edit mode
@michael-lawrence-3846
Last seen 2.3 years ago
United States

Might be easier to grab that table from UCSC like this:

library(rtracklayer)
session <- browserSession()
genome(session) <- "mm10"
cpg <- session[["CpG Islands"]]

Then, you have your seqlengths.

As the error message clearly states, you have NAs in your subscript, since there are CPG islands on contigs without any genes. Since those islands are unlikely to be of interest, just restrict the islands to the standard chromosomes:

cpg <- keepStandardChromosomes(cpg)
ADD COMMENT
0
Entering edit mode

Dear Michael,

Thanks for your solution, although there's a little problem. When I tried your code, R  fails to set session to genome mm10. Yielding the following error:

Error in `genome<-`(`*tmp*`, value = "mm10") :
  Failed to set session genome to 'mm10'

I think this issue may be related to that described in this post:  Failed to set session genome to ....

Is there a way fix this issue?

Best wishes

JL

> sessionInfo()   
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rtracklayer_1.26.3                     TxDb.Mmusculus.UCSC.mm10.ensGene_3.0.0 GenomicFeatures_1.18.7                
 [4] AnnotationDbi_1.28.2                   Biobase_2.26.0                         GenomicAlignments_1.2.2               
 [7] Rsamtools_1.18.3                       Biostrings_2.34.1                      BiocInstaller_1.16.5                  
[10] XVector_0.6.0                          GenomicRanges_1.18.4                   GenomeInfoDb_1.2.5                    
[13] IRanges_2.0.1                          S4Vectors_0.4.0                        BiocGenerics_0.12.1                   

loaded via a namespace (and not attached):
 [1] base64enc_0.1-3    BatchJobs_1.6      BBmisc_1.9         BiocParallel_1.0.3 biomaRt_2.22.0     bitops_1.0-6      
 [7] brew_1.0-6         codetools_0.2-14   checkmate_1.6.3    DBI_0.3.1          digest_0.6.8       fail_1.3          
[13] foreach_1.4.3      iterators_1.0.8    magrittr_1.5       RCurl_1.95-4.7     RSQLite_1.0.0      sendmailR_1.2-1   
[19] stringi_1.0-1      stringr_1.0.0      tools_3.1.2        XML_3.98-1.3       zlibbioc_1.12.0  
ADD REPLY
0
Entering edit mode

I updated that support post to indicate that the solution is to update your Bioconductor. You are two releases behind now.

ADD REPLY
0
Entering edit mode

Bioconductor updated!
Thanks for tje insight and the solution!
Best wishes

JL

ADD REPLY

Login before adding your answer.

Traffic: 696 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6