GenomeInfoDb: where are the genome patches?
@mattchambers42-10186

I'm trying to implement a function to convert Ensembl chromosome names to UCSC names for many potential input species (i.e. the intersection of species supported by both sources). I saw the seqlevelsStyle() function in GenomeInfoDb, but it only maps the canonical chromosomes. Why is that? It's kind of funny, because the canonical ones can mostly be fixed with a sub() call. It's the patches that are really irregular and vexing.
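For illustration, the kind of sub() fix meant here might look like the sketch below. The helper name `ensembl_to_ucsc_canonical` and the regular expression are assumptions made for this example: they cover numbered chromosomes plus X, Y and MT, and would need adjusting for species with other naming schemes. Anything that does not look canonical is passed through unchanged, which is exactly where the patches fall through.

```r
# Hypothetical helper (not part of GenomeInfoDb): convert canonical
# Ensembl-style chromosome names to UCSC style with sub().
ensembl_to_ucsc_canonical <- function(x) {
  x <- as.character(x)
  # Add the "chr" prefix to numbered chromosomes, X and Y.
  out <- sub("^([0-9]+|X|Y)$", "chr\\1", x)
  # The mitochondrial genome is "MT" in Ensembl and "chrM" in UCSC.
  out <- sub("^MT$", "chrM", out)
  out
}

# "SOME_PATCH_NAME" is a placeholder, not a real sequence name.
converted <- ensembl_to_ucsc_canonical(c("1", "X", "MT", "SOME_PATCH_NAME"))
print(converted)
```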

seqnames genomeinfodb
@herve-pages-1542
Seattle, WA, United States

Hi Matt,

I guess supporting the seqlevel mappings for the canonical chromosomes only was the easy thing to do, mainly because the mappings for a given species don't depend on a particular assembly. So the approach taken in GenomeInfoDb was to simply hardcode these mappings in tabulated files (located in inst/extdata/dataFiles). It's a very straightforward approach but, unfortunately, it doesn't easily extend to mapping the patches or scaffolds of a given assembly, since those are assembly-specific.

FWIW note that fetchExtendedChromInfoFromUCSC() in GenomeInfoDb is one way to get the mapping between NCBI and UCSC seqlevels for all the sequences in a given assembly. It supports only a few assemblies (see ?fetchExtendedChromInfoFromUCSC for the list). It's a work in progress, and maybe seqlevelsStyle() should use something like this behind the scenes.
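To make the idea concrete, here is a minimal base R sketch of a table-driven conversion, assuming a per-assembly mapping table is already available as a data frame. The function name `map_seqlevels`, the column names `from` and `to`, and the patch row are all made up for this example; they are not the actual output format of fetchExtendedChromInfoFromUCSC().

```r
# Hypothetical per-assembly mapping table. The column names and the
# patch row are placeholders invented for this sketch.
mapping <- data.frame(
  from = c("1", "MT", "EXAMPLE_PATCH_1"),
  to   = c("chr1", "chrM", "chrEXAMPLE_patch1"),
  stringsAsFactors = FALSE
)

# Look up each input name in the table; names that are not found
# come back as NA rather than being guessed.
map_seqlevels <- function(x, mapping) {
  idx <- match(as.character(x), mapping$from)
  mapping$to[idx]
}

mapped <- map_seqlevels(c("MT", "EXAMPLE_PATCH_1", "not_in_table"), mapping)
print(mapped)
```

Returning NA for unknown names keeps unmapped sequences visible instead of silently passing them through under the wrong naming style.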

H.