I'm trying to implement a function to convert Ensembl chromosome names to UCSC names for many potential input species (i.e. the intersection of species supported by both sources). I saw the `seqlevelsStyle` function in GenomeInfoDb, but it only maps the canonical chromosomes. Why is that? It's kind of funny, because the canonical ones can mostly be fixed with a `sub()` call. It's the patches that are really irregular and vexing.
I guess supporting the seqlevel mappings for the canonical chromosomes only was the easy thing to do, mainly because the mappings for a given species don't depend on a particular assembly. So the approach taken in GenomeInfoDb was to simply hardcode these mappings in tabulated files (located in inst/extdata/dataFiles). It's a very straightforward approach but, unfortunately, it doesn't easily extend to the patches or scaffolds, whose names are specific to a given assembly.
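For illustration only, the kind of `sub()`-based renaming mentioned in the question might look like the sketch below. It uses base R only, the helper name `ensembl_to_ucsc_canonical` is made up for this example (it is not part of GenomeInfoDb), and it assumes human-style canonical names (1 or 2 digit autosomes, X, Y, MT), which won't hold for every species.

```r
# Hypothetical helper, not part of GenomeInfoDb: rename canonical
# Ensembl-style seqlevels to UCSC style and leave everything else untouched.
ensembl_to_ucsc_canonical <- function(x) {
  # Assumption: canonical names are 1-2 digit autosomes, X, Y or MT.
  canonical <- grepl("^([0-9]{1,2}|X|Y|MT)$", x)
  out <- x
  # UCSC calls the mitochondrial genome chrM, so map MT to M before prefixing.
  out[canonical] <- paste0("chr", sub("^MT$", "M", x[canonical]))
  out
}

ensembl_to_ucsc_canonical(c("1", "22", "X", "MT", "GL000220.1"))
# [1] "chr1"       "chr22"      "chrX"       "chrM"       "GL000220.1"
```

The scaffold name passes through unchanged, which is exactly the part a simple pattern can't handle: its UCSC counterpart depends on the assembly.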
FWIW, note that fetchExtendedChromInfoFromUCSC() in GenomeInfoDb is one way to get the mapping between NCBI and UCSC seqlevels for all the sequences in a given assembly. It supports only a few assemblies (see ?fetchExtendedChromInfoFromUCSC for the list). It's a work in progress, and maybe seqlevelsStyle() should use something like this behind the scenes.