I'm attempting to use bumphunter::annotateNearest() with Canis familiaris 3.1 (see https://github.com/ririzarr/bumphunter/issues/1). annotateNearest() appears to accept an object with a set of transcripts and annotate against that, even if the transcripts are from a non-human genome, but there isn't specific documentation about how to do this. My attempt is below along with the issue I've bumped up against.
I built my transcripts object like so:
library(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
library(RMySQL)
library(bumphunter)
tx <- transcripts(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
names(tx) <- unlist(mcols(tx)[,2])
con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user="genome", dbname="canFam3")
refGene <- dbGetQuery(con, "select * from xenoRefGene;")
refGene <- refGene[match(names(tx), refGene$name),]
Nexons = sapply(strsplit(refGene$exonStarts, ","), length)
Exons <- IRangesList(start = lapply(strsplit(refGene$exonStarts, ","), function(x) x[x != "" & !is.na(x)]), end = lapply(strsplit(refGene$exonEnds, ","), function(x) x[x != "" & !is.na(x)]))
mcols(tx) <- DataFrame(CSS = refGene$cdsStart, CSE = refGene$cdsEnd, Tx = refGene$name, Gene = refGene$name2, Nexons = Nexons, Exons = Exons)I annotated like so:
annotation <- annotateNearest(regions$regions, tx)
(where regions$regions is derfinder output -- I am happy to provide more info there if anyone is curious).
The results are somewhat annotated:
> head(annotation)
dist matchIndex   type amountOverlap insideDist size1  size2
1    0      15454 inside            NA         59   125   5819
2    0      10018 inside            NA      -1811   141  15559
3    0      10017 inside            NA     -34060   118  68867
4    0      13154 inside            NA     -22676    93 169910
5    0        876 inside            NA     -30307    59 157801
6    0      13167 inside            NA       2095   114  12769
I am surprised that the annotation does not include the gene name or RefSeq ID, as the docs suggest it will. My suspicion is that I did not hand enough information along in my tx object, but the object does contain seqnames, and those weren't passed back in the annotation object.
> tx
GRanges object with 254017 ranges and 6 metadata columns:
               seqnames                 ranges strand   |       CSS       CSE
                  <Rle>              <IRanges>  <Rle>   | <numeric> <numeric>
     NM_011767     chr1       [363561, 368894]      +   |  75038343  75122798
  NM_001011177     chr1       [363601, 367239]      +   |    363606    367239
  NM_001090507     chr1       [363601, 367239]      +   |    363606    367239
     NM_199558     chr1       [363604, 367325]      +   |    363671    367325
  NM_001044283     chr1       [440183, 441303]      +   |  65252835  65294349
           ...      ...                    ...    ... ...       ...       ...
  NM_001090227     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001102830     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001135068     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001012575     chrX [123325049, 123408056]      -   | 123325048 123408056
  NM_001184797     chrX [123326032, 123438981]      -   | 123326060 123408176
                         Tx        Gene    Nexons
                <character> <character> <integer>
     NM_011767    NM_011767         Zfr        30
  NM_001011177 NM_001011177         zfr        16
  NM_001090507 NM_001090507         zfr        17
     NM_199558    NM_199558         zfr        15
  NM_001044283 NM_001044283        Snx3        10
           ...          ...         ...       ...
  NM_001090227 NM_001090227    MGC68497         6
  NM_001102830 NM_001102830       tmlhe         7
  NM_001135068 NM_001135068       tmlhe         7
  NM_001012575 NM_001012575       TMLHE         7
  NM_001184797 NM_001184797       TMLHE         7
                                                                                  Exons
                                                                          <IRangesList>
     NM_011767       [75038283, 75038380] [75038682, 75038775] [75056588, 75056878] ...
  NM_001011177                   [363600, 363660] [363682, 363706] [363729, 364308] ...
  NM_001090507                   [363600, 363660] [363682, 363706] [363729, 364308] ...
     NM_199558                   [363603, 363703] [363723, 364293] [364310, 364412] ...
  NM_001044283       [65252221, 65252345] [65252377, 65252423] [65252439, 65252532] ...
           ...                                                                      ...
  -------
  seqinfo: 3268 sequences (1 circular) from canFam3 genome
I would love advice about how to proceed -- is bumphunter behaving as expected (and it's my problem) or is this a bug?
session_info to follow.
Thanks,
Jessica
