bumphunter::annotateNearest with non-human genome
3
1
Entering edit mode
@jessicahekman-6877
Last seen 9.6 years ago
United States

I'm attempting to use bumphunter::annotateNearest() with Canis familiaris 3.1 (see https://github.com/ririzarr/bumphunter/issues/1). annotateNearest() appears to accept an object with a set of transcripts and annotate against that, even if the transcripts are from a non-human genome, but there isn't specific documentation about how to do this. My attempt is below along with the issue I've bumped up against.

I built my transcripts object like so:

library(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
library(RMySQL)
library(bumphunter)
tx <- transcripts(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
names(tx) <- unlist(mcols(tx)[,2])

con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user="genome", dbname="canFam3")
refGene <- dbGetQuery(con, "select * from xenoRefGene;")
refGene <- refGene[match(names(tx), refGene$name),]
Nexons = sapply(strsplit(refGene$exonStarts, ","), length)
Exons <- IRangesList(start = lapply(strsplit(refGene$exonStarts, ","), function(x) x[x != "" & !is.na(x)]), end = lapply(strsplit(refGene$exonEnds, ","), function(x) x[x != "" & !is.na(x)]))
mcols(tx) <- DataFrame(CSS = refGene$cdsStart, CSE = refGene$cdsEnd, Tx = refGene$name, Gene = refGene$name2, Nexons = Nexons, Exons = Exons)

I annotated like so:

annotation <- annotateNearest(regions$regions, tx)

(where regions$regions is derfinder output -- I am happy to provide more info there if anyone is curious).

The results are somewhat annotated:

> head(annotation) dist matchIndex type amountOverlap insideDist size1 size2 1 0 15454 inside NA 59 125 5819 2 0 10018 inside NA -1811 141 15559 3 0 10017 inside NA -34060 118 68867 4 0 13154 inside NA -22676 93 169910 5 0 876 inside NA -30307 59 157801 6 0 13167 inside NA 2095 114 12769

 

I am surprised that the annotation does not include the gene name or RefSeq ID, as the docs suggest it will. My suspicion is that I did not hand enough information along in my tx object, but the object does contain seqnames, and those weren't passed back in the annotation object.

> tx GRanges object with 254017 ranges and 6 metadata columns: seqnames ranges strand | CSS CSE <Rle> <IRanges> <Rle> | <numeric> <numeric> NM_011767 chr1 [363561, 368894] + | 75038343 75122798 NM_001011177 chr1 [363601, 367239] + | 363606 367239 NM_001090507 chr1 [363601, 367239] + | 363606 367239 NM_199558 chr1 [363604, 367325] + | 363671 367325 NM_001044283 chr1 [440183, 441303] + | 65252835 65294349 ... ... ... ... ... ... ... NM_001090227 chrX [123325049, 123390403] - | 123325048 123390403 NM_001102830 chrX [123325049, 123390403] - | 123325048 123390403 NM_001135068 chrX [123325049, 123390403] - | 123325048 123390403 NM_001012575 chrX [123325049, 123408056] - | 123325048 123408056 NM_001184797 chrX [123326032, 123438981] - | 123326060 123408176 Tx Gene Nexons <character> <character> <integer> NM_011767 NM_011767 Zfr 30 NM_001011177 NM_001011177 zfr 16 NM_001090507 NM_001090507 zfr 17 NM_199558 NM_199558 zfr 15 NM_001044283 NM_001044283 Snx3 10 ... ... ... ... NM_001090227 NM_001090227 MGC68497 6 NM_001102830 NM_001102830 tmlhe 7 NM_001135068 NM_001135068 tmlhe 7 NM_001012575 NM_001012575 TMLHE 7 NM_001184797 NM_001184797 TMLHE 7 Exons <IRangesList> NM_011767 [75038283, 75038380] [75038682, 75038775] [75056588, 75056878] ... NM_001011177 [363600, 363660] [363682, 363706] [363729, 364308] ... NM_001090507 [363600, 363660] [363682, 363706] [363729, 364308] ... NM_199558 [363603, 363703] [363723, 364293] [364310, 364412] ... NM_001044283 [65252221, 65252345] [65252377, 65252423] [65252439, 65252532] ... ... ... ------- seqinfo: 3268 sequences (1 circular) from canFam3 genome

 

I would love advice about how to proceed -- is bumphunter behaving as expected (and it's my problem) or is this a bug?

session_info to follow.

Thanks,
Jessica

 
bumphunter • 2.5k views
ADD COMMENT
0
Entering edit mode
> devtools::session_info()
Session info-------------------------------------------------------------------
 setting  value                       
 version  R version 3.1.1 (2014-07-10)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Chicago             

Packages-----------------------------------------------------------------------
 package                                   * version   date      
 acepack                                     1.3.3.3   2013-05-03
 annotate                                    1.44.0    2014-10-15
 AnnotationDbi                             * 1.28.1    2014-10-30
 base64enc                                   0.1.2     2014-06-26
 BatchJobs                                   1.5       2014-10-30
 BBmisc                                      1.8       2014-10-30
 Biobase                                   * 2.26.0    2014-10-15
 BiocGenerics                              * 0.12.1    2014-11-19
 BiocParallel                              * 1.0.0     2014-11-11
 biomaRt                                     2.22.0    2014-10-15
 Biostrings                                  2.34.0    2014-10-15
 bitops                                      1.0.6     2013-08-17
 brew                                        1.0.6     2011-04-13
 bumphunter                                * 1.6.0     2014-10-15
 checkmate                                   1.5.0     2014-10-19
 cluster                                     1.15.3    2014-09-04
 codetools                                   0.2.9     2014-08-21
 colorspace                                  1.2.4     2013-09-30
 DBI                                       * 0.3.1     2014-09-24
 DESeq2                                      1.6.2     2014-11-20
 devtools                                    1.6.1     2014-10-07
 digest                                      0.6.4     2013-12-03
 doRNG                                       1.6       2014-03-07
 fail                                        1.2       2013-09-19
 foreach                                   * 1.4.2     2014-04-11
 foreign                                     0.8.61    2014-03-28
 Formula                                     1.1.2     2014-07-13
 genefilter                                  1.48.1    2014-10-30
 geneplotter                                 1.44.0    2014-10-15
 GenomeInfoDb                              * 1.2.3     2014-11-17
 GenomicAlignments                           1.2.1     2014-11-10
 GenomicFeatures                           * 1.18.2    2014-10-30
 GenomicRanges                             * 1.18.3    2014-11-19
 ggplot2                                     1.0.0     2014-05-21
 gtable                                      0.1.2     2012-12-05
 Hmisc                                       3.14.5    2014-09-12
 IRanges                                   * 2.0.0     2014-10-15
 iterators                                 * 1.0.7     2014-04-11
 lattice                                     0.20.29   2014-04-04
 latticeExtra                                0.6.26    2013-08-15
 locfit                                    * 1.5.9.1   2013-04-20
 MASS                                        7.3.35    2014-09-30
 matrixStats                                 0.10.3    2014-10-15
 munsell                                     0.4.2     2013-07-11
 nnet                                        7.3.8     2014-03-28
 pkgmaker                                    0.22      2014-05-14
 plyr                                        1.8.1     2014-02-26
 proto                                       0.3.10    2012-12-22
 R.methodsS3                                 1.6.1     2014-01-05
 RColorBrewer                                1.0.5     2011-06-17
 Rcpp                                        0.11.3    2014-09-29
 RcppArmadillo                               0.4.500.0 2014-10-30
 RCurl                                       1.95.4.3  2014-07-29
 registry                                    0.2       2012-01-24
 reshape2                                    1.4       2014-04-23
 RMySQL                                    * 0.9.3     2012-01-17
 rngtools                                    1.2.4     2014-03-06
 rpart                                       4.1.8     2014-03-28
 Rsamtools                                   1.18.2    2014-11-11
 RSQLite                                     1.0.0     2014-10-25
 rstudioapi                                  0.1       2014-03-27
 rtracklayer                                 1.26.2    2014-11-11
 S4Vectors                                 * 0.4.0     2014-10-15
 scales                                      0.2.4     2014-04-22
 sendmailR                                   1.2.1     2014-09-21
 stringr                                     0.6.2     2012-12-06
 survival                                    2.37.7    2014-01-22
 TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene * 3.10.0    2014-11-24
 XML                                         3.98.1.1  2013-06-20
 xtable                                      1.7.4     2014-09-12
 XVector                                     0.6.0     2014-10-15
 zlibbioc                                    1.12.0    2014-10-15

ADD REPLY
1
Entering edit mode
rafa ▴ 60
@rafa-7160
Last seen 6.4 years ago
United States

Jessica, I've made some major changes to bumphunter to accommodate this. If you install the latest devel version 1.7.3 (still being tested) the matchGenes and annotateNearest will no longer assume hg19. Here is an example:

library(bumphunter)
islands=makeGRangesFromDataFrame(read.delim("http://rafalab.jhsph.edu/CGI/model-based-cpg-islands-hg19.txt")[1:100,])

library("TxDb.Hsapiens.UCSC.hg19.knownGene")

genes <- annotateTranscripts(TxDb.Hsapiens.UCSC.hg19.knownGene, annotationPackage="org.Hs.eg.db")
tab<- matchGenes(islands,genes)

Note that if you don't specify the annotationPackage argument the function will try to figure out what library to load using the species method. If it can't find an annotation package it will continue and return an object with no gene annotation but all the other components. You should be able to substitute the human TxDb for TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene

Please try it out and report back if you run into trouble.

Note we are not making this back compatible. Apologies if this breaks code, but the previous way had to many problems to keep around. 

ps - You might need to install from gitbub like this devtools::install_github("ririzarr/bumphunter")

ADD COMMENT
1
Entering edit mode
@lcolladotor
Last seen 9 weeks ago
United States

Hi Jessica,

You can now use matchGenes() (similar to what annotateTranscripts used to do) with non-human genomes as Rafa already mentioned. Basically, use:

library('GenomicFeatures')
library('bumphunter')
can_txdb <- makeTranscriptDbFromUCSC("canFam3", "refGene")
can_genes <- annotateTranscripts(can_txdb, 'org.Cf.eg.db')
annotation <- matchGenes(x, can_genes)

where 'x' is a data.frame or GRanges. For example:

> matchGenes(head(can_genes), can_genes)
          name                                          annotation
1        KRT12                           NM_001082421 NP_001075890
2     CEACAM23                           NM_001097552 NP_001091021
3     CEACAM23                           NM_001097552 NP_001091021
4     CEACAM28 NM_001097546 NP_001091015 XM_005615767 XP_005615824
5     CEACAM24                           NM_001097554 NP_001091023
6 LOC100049001                           NM_001097555 NP_001091024
     description region distance      subregion insideDistance
1         covers covers        0 covers exon(s)             NA
2         covers covers        0 covers exon(s)             NA
3         covers covers        0 covers exon(s)             NA
4 covers exon(s) inside        5 covers exon(s)              0
5         covers covers        0 covers exon(s)             NA
6         covers covers        0 covers exon(s)             NA
  exonnumber nexons            UTR strand geneL codingL    Entrez
1         NA      8           <NA>      +  5581    5190 100034663
2         NA      9           <NA>      + 16264   16245 100048998
3         NA      9           <NA>      - 32617   16245 100048998
4          1      9 overlaps 3'UTR      + 22076   22029    484473
5         NA      6           <NA>      + 10598   10434 100049000
6         NA      5           <NA>      +  1424    1266 100049001
  subjectHits
1           1
2           2
3           3
4        1740
5           5
6           6
> 

Note that I proposed some fixes for some minor bugs at https://github.com/ririzarr/bumphunter/pull/3

Cheers,

Leo

 

> devtools::session_info()
Session info------------------------------------------------------------
 setting  value                                             
 version  R Under development (unstable) (2014-11-01 r66923)
 system   x86_64, darwin10.8.0                              
 ui       AQUA                                              
 language (EN)                                              
 collate  en_US.UTF-8                                       
 tz       America/New_York                                  

Packages----------------------------------------------------------------
 package           * version  date       source        
 AnnotationDbi     * 1.29.13  2015-01-14 Bioconductor  
 base64enc           0.1.2    2014-06-26 CRAN (R 3.2.0)
 BatchJobs           1.4      2014-09-24 CRAN (R 3.2.0)
 BBmisc              1.7      2014-06-21 CRAN (R 3.2.0)
 Biobase           * 2.27.1   2014-12-20 Bioconductor  
 BiocGenerics      * 0.13.4   2014-12-31 Bioconductor  
 BiocParallel        1.1.11   2015-01-10 Bioconductor  
 biomaRt             2.23.5   2014-11-22 Bioconductor  
 Biostrings          2.35.7   2014-12-13 Bioconductor  
 bitops              1.0.6    2013-08-17 CRAN (R 3.2.0)
 brew                1.0.6    2011-04-13 CRAN (R 3.2.0)
 bumphunter        * 1.7.4    2015-01-15 Bioconductor  
 checkmate           1.5.0    2014-10-19 CRAN (R 3.2.0)
 codetools           0.2.9    2014-08-21 CRAN (R 3.2.0)
 DBI               * 0.3.1    2014-09-24 CRAN (R 3.2.0)
 devtools            1.6.1    2014-10-07 CRAN (R 3.2.0)
 digest              0.6.4    2013-12-03 CRAN (R 3.2.0)
 doRNG               1.6      2014-03-07 CRAN (R 3.2.0)
 fail                1.2      2013-09-19 CRAN (R 3.2.0)
 foreach           * 1.4.2    2014-04-11 CRAN (R 3.2.0)
 GenomeInfoDb      * 1.3.12   2014-12-22 Bioconductor  
 GenomicAlignments   1.3.24   2015-01-14 Bioconductor  
 GenomicFeatures   * 1.19.10  2015-01-09 Bioconductor  
 GenomicRanges     * 1.19.33  2015-01-13 Bioconductor  
 IRanges           * 2.1.35   2015-01-07 Bioconductor  
 iterators         * 1.0.7    2014-04-11 CRAN (R 3.2.0)
 lattice             0.20.29  2014-04-04 CRAN (R 3.2.0)
 locfit            * 1.5.9.1  2013-04-20 CRAN (R 3.2.0)
 matrixStats         0.10.3   2014-10-15 CRAN (R 3.2.0)
 org.Cf.eg.db      * 3.0.0    2014-09-26 Bioconductor  
 pkgmaker            0.22     2014-05-14 CRAN (R 3.2.0)
 R.methodsS3         1.6.1    2014-01-05 CRAN (R 3.2.0)
 RCurl               1.95.4.3 2014-07-29 CRAN (R 3.2.0)
 registry            0.2      2012-01-24 CRAN (R 3.2.0)
 rngtools            1.2.4    2014-03-06 CRAN (R 3.2.0)
 Rsamtools           1.19.26  2015-01-13 Bioconductor  
 RSQLite           * 1.0.0    2014-10-25 CRAN (R 3.2.0)
 rstudioapi          0.1      2014-03-27 CRAN (R 3.2.0)
 rtracklayer         1.27.7   2015-01-14 Bioconductor  
 S4Vectors         * 0.5.16   2015-01-07 Bioconductor  
 sendmailR           1.2.1    2014-09-21 CRAN (R 3.2.0)
 stringr             0.6.2    2012-12-06 CRAN (R 3.2.0)
 XML                 3.98.1.1 2013-06-20 CRAN (R 3.2.0)
 xtable              1.7.4    2014-09-12 CRAN (R 3.2.0)
 XVector           * 0.7.3    2014-11-24 Bioconductor  
 zlibbioc            1.13.0   2014-10-14 Bioconductor 
ADD COMMENT
0
Entering edit mode

Amazing that it took me this long to get back to this! (In the meantime I went to Siberia, passed my prelims, presented my research twice. and helped submit an NSF proposal... All of which prevented me from getting to my own research.)

Anyways: It works! Thanks again to Leo and Rafa for your support.

Leo, I suggest that you modify the derfinder advanced vignette section on non-human data to add the suggestion to use code like this (this is what worked for me):

 

library('GenomicFeatures')
library('bumphunter')

can_txdb <- makeTranscriptDbFromUCSC("canFam3", "refGene")
can_genes <- annotateTranscripts(can_txdb, 'org.Cf.eg.db')

# Perhaps a note here that makeTranscriptDbFromUCSC is network-intensive and so it makes sense to run it once and then save the output and load it again in subsequent runs

results <- analyzeChr(..., runAnnotation=FALSE)

annotation <- matchGenes(results$regions$regions, can_genes)

 

By the way, I have been unable to figure out how to make an appropriate txdb object to hand to analyzeChr() so that this can all be done in one step. I'm happy with how things work for me now so I'm not intending to spend much time on it, but if you want to have a cleaner example in the advanced vignette, you could suggest some code for me to try and I'd be happy to make sure it works before you put it in there.

My hope is that I will now get to move forward with analyzing the implications of my DERs, in concert with some DEGs I got from DESeq2. If and when this is published I'll let you know (of course I would reference derfinder).

Thanks,

Jessica

ADD REPLY
0
Entering edit mode
@jessicahekman-6877
Last seen 9.6 years ago
United States

Thank you so much! I am in the middle of final exams at the moment but I will check it out in the next few days and report back.

I really appreciate the work.

Jessica

ADD COMMENT

Login before adding your answer.

Traffic: 691 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6