Question

GenomeInfoDb::extractSeqlevelsByGroup() on 'Canis familiaris'

0

Entering edit mode

Robert Castelo ★ 3.3k

@rcastelo

Last seen 5 days ago

Barcelona/Universitat Pompeu Fabra

hi,

i'm trying to extract the UCSC sequence levels for Canis familiaris using the function extractSeqlevelsByGroup() from the GenomeInfoDb package, but I stumbled onto the following error:

library(GenomeInfoDb)
library(BSgenome.Cfamiliaris.UCSC.canFam3)

extractSeqlevelsByGroup(species=organism(Cfamiliaris), style="UCSC", group="auto")
Error in extractSeqlevelsByGroup(species = organism(Cfamiliaris), style = "UCSC",  :
  The style specified by 'UCSC' does not have a compatible entry for the species Canis lupus familiaris

while the analogous call works perfectly for Homo sapiens:

library(BSgenome.Hsapiens.UCSC.hg38)
extractSeqlevelsByGroup(species=organism(Hsapiens), style="UCSC", group="auto")
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"
 [8] "chr8"  "chr9"  "chr10" "chr11" "chr12" "chr13" "chr14"
[15] "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21"
[22] "chr22"

i guess something is wrong with the mapping from the out of organism to the GenomeInfoDb data file at GenomeInfoDb/inst/extdata/dataFiles, see:

organism(Cfamiliaris)
[1] "Canis lupus familiaris"
organism(Hsapiens)
[1] "Homo sapiens"

thanks for your help, session information below.

robert.
ps: sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)

locale:
 [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF8        LC_COLLATE=en_US.UTF8    
 [5] LC_MONETARY=en_US.UTF8    LC_MESSAGES=en_US.UTF8   
 [7] LC_PAPER=en_US.UTF8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C      

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices
[6] utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.1      
 [2] BSgenome.Cfamiliaris.UCSC.canFam3_1.4.0
 [3] BSgenome_1.38.0                        
 [4] rtracklayer_1.30.1                     
 [5] Biostrings_2.38.0                      
 [6] XVector_0.10.0                         
 [7] GenomicRanges_1.22.0                   
 [8] GenomeInfoDb_1.6.0                     
 [9] IRanges_2.4.1                          
[10] S4Vectors_0.8.0                        
[11] BiocGenerics_0.16.0                    
[12] vimcom_1.2-3                           
[13] setwidth_1.0-4                         
[14] colorout_1.1-0                         

loaded via a namespace (and not attached):
 [1] zlibbioc_1.16.0            GenomicAlignments_1.6.1   
 [3] BiocParallel_1.4.0         tools_3.2.2               
 [5] SummarizedExperiment_1.0.0 Biobase_2.30.0            
 [7] lambda.r_1.1.7             futile.logger_1.4.1       
 [9] futile.options_1.0.0       bitops_1.0-6              
[11] RCurl_1.95-4.7             Rsamtools_1.22.0          
[13] XML_3.98-1.3

genomeinfodb canis familiaris • 1.8k views

ADD COMMENT • link updated 8.7 years ago by Hervé Pagès 16k • written 8.7 years ago by Robert Castelo ★ 3.3k

0

Entering edit mode

I wonder if it is a clash between the organism for the canFam3 BSGenome package and the organism for GenomeInfoDb?

> genomeStyles("Canis lupus familiaris")
Error in .getDataInFile(species) :
  The species, Canis lupus familiaris is not supported by GenomeInfoDb
> genomeStyles("Canis familiaris")
   circular  auto   sex NCBI  UCSC
1     FALSE  TRUE FALSE    1  chr1
2     FALSE  TRUE FALSE    2  chr2
3     FALSE  TRUE FALSE    3  chr3
4     FALSE  TRUE FALSE    4  chr4
5     FALSE  TRUE FALSE    5  chr5
6     FALSE  TRUE FALSE    6  chr6
7     FALSE  TRUE FALSE    7  chr7
8     FALSE  TRUE FALSE    8  chr8
9     FALSE  TRUE FALSE    9  chr9
10    FALSE  TRUE FALSE   10 chr10
11    FALSE  TRUE FALSE   11 chr11
12    FALSE  TRUE FALSE   12 chr12
13    FALSE  TRUE FALSE   13 chr13
14    FALSE  TRUE FALSE   14 chr14
15    FALSE  TRUE FALSE   15 chr15
16    FALSE  TRUE FALSE   16 chr16
17    FALSE  TRUE FALSE   17 chr17
18    FALSE  TRUE FALSE   18 chr18
19    FALSE  TRUE FALSE   19 chr19
20    FALSE  TRUE FALSE   20 chr20
21    FALSE  TRUE FALSE   21 chr21
22    FALSE  TRUE FALSE   22 chr22
23    FALSE  TRUE FALSE   23 chr23
24    FALSE  TRUE FALSE   24 chr24
25    FALSE  TRUE FALSE   25 chr25
26    FALSE  TRUE FALSE   26 chr26
27    FALSE  TRUE FALSE   27 chr27
28    FALSE  TRUE FALSE   28 chr28
29    FALSE  TRUE FALSE   29 chr29
30    FALSE  TRUE FALSE   30 chr30
31    FALSE  TRUE FALSE   31 chr31
32    FALSE  TRUE FALSE   32 chr32
33    FALSE  TRUE FALSE   33 chr33
34    FALSE  TRUE FALSE   34 chr34
35    FALSE  TRUE FALSE   35 chr35
36    FALSE  TRUE FALSE   36 chr36
37    FALSE  TRUE FALSE   37 chr37
38    FALSE  TRUE FALSE   38 chr38
39    FALSE FALSE  TRUE    X  chrX
40    FALSE FALSE  TRUE    Y  chrY
41     TRUE FALSE FALSE   MT  chrM

ADD REPLY • link 8.7 years ago James W. MacDonald 66k

0

Entering edit mode

> extractSeqlevelsByGroup(species = "Canis familiaris", style = "UCSC", group = "auto")
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chr23" "chr24" "chr25" "chr26" "chr27"
[28] "chr28" "chr29" "chr30" "chr31" "chr32" "chr33" "chr34" "chr35" "chr36"
[37] "chr37" "chr38"

ADD REPLY • link 8.7 years ago James W. MacDonald 66k

0

Entering edit mode

yes, but i'd like to have it working in general, that is, if i have a BSgenome-class object 'x':

extractSeqlevelsByGroup(species=organism(x), style="UCSC", group="auto")

this is something i use within my package VariantFiltering. i thought this could be an error but if not i guess i should move the question to bioc-devel.

any hint will be appreciated.

ADD REPLY • link 8.7 years ago Robert Castelo ★ 3.3k

score 2 · Accepted Answer · 2015-11-02

Hi Robert,

Even though there is an entry for Canis_familiaris in the genome style db included in GenomeInfoDb:

names(genomeStyles())
#  [1] "Arabidopsis_thaliana"     "Caenorhabditis_elegans"  
#  [3] "Canis_familiaris"         "Cyanidioschyzon_merolae" 
#  [5] "Drosophila_melanogaster"  "Homo_sapiens"            
#  [7] "Mus_musculus"             "Oryza_sativa"            
#  [9] "Populus_trichocarpa"      "Rattus_norvegicus"       
# [11] "Saccharomyces_cerevisiae" "Zea_mays"

the failure you got with

extractSeqlevelsByGroup(organism(Cfamiliaris), style="UCSC", group="auto")

is because the supplied organism (organism(Cfamiliaris)) is "Canis lupus familiaris" and extractSeqlevelsByGroup() didn't know how to map it to Canis_familiaris. Thanks for catching this!

I just committed the following change to GenomeInfoDb, which addresses the problem:

hpages@latitude:~/svn/bioconductor/Rpacks_3_2/GenomeInfoDb$ svn log -r 110226
------------------------------------------------------------------------
r110226 | hpages@fhcrc.org | 2015-11-02 14:51:50 -0800 (Mon, 02 Nov 2015) | 12 lines

Mapping from user-supplied species/organism to officially supported organism
(as returned by 'names(genomeStyles())') now ignores the 2nd part (e.g. "lupus"
in "Canis_lupus_familiaris"). With this change, things like

  extractSeqlevelsByGroup("Canis lupus familiaris", style="UCSC", group="auto")

now work even though only "Canis_familiaris" is officially supported.
...

This is in GenomeInfoDb 1.6.1 (and GenomeInfoDb 1.7.3 in BioC devel) which should become available tomorrow via biocLite().

Cheers,

H.