Is mm10.p6 available as BSgenome?
Aditya ▴ 160
@aditya-7667
Germany

BSgenome.Mmusculus.UCSC.mm10 contains mm10 (the 2012 version). Is mm10 patch 6 (2017) also available as a BSgenome?

@james-w-macdonald-5106
United States

You could just make your own version.


Thank you James :-).

I could make BSgenome.Mmusculus.UCSC.mm10.p6 and submit it to BioC, but BioC seems to host only the first release of the current major version. Am I right?

I wonder why BioC doesn't upgrade the BSgenomes with each new BioC release. Freezing the release guarantees stability, but on the other hand, subsequent patches do not alter the genomic coordinates (only a new major version does), and 7 years is a long time... What would you say?


The main reason the BSgenomes don't get updated is lack of personnel to do so. There are maybe 3-4 people who do the bulk of the work for each release, and while some of that involves updating annotation data, probably more involves the logistics of ensuring that thousands of different packages (both analytical and experimental) are all ready to go upon release.

With limited personnel there has to be a hierarchy of necessity, and building BSgenome packages for each successive patch unfortunately comes way down on that hierarchy, which is why the infrastructure exists to allow people to build their own if they so desire.
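As a rough illustration of what "building your own" involves, a BSgenome data package is forged from a seed file in DCF format. The sketch below only writes and re-reads such a seed file with base R. The package name, the directory path, and the FILL-IN values are hypothetical placeholders, and the field names are an assumption based on the BSgenome forging documentation (older BSgenome versions used `provider_version` where newer ones use `genome`), so check them against the version you have installed.

```r
# Minimal sketch: write a BSgenome seed file (DCF format) with base R.
# Every value below is a hypothetical placeholder, not a tested configuration.
seed_fields <- list(
  Package           = "BSgenome.Mmusculus.Ensembl.GRCm38p6",  # hypothetical name
  Title             = "Genome sequences for Mus musculus (GRCm38.p6)",
  Description       = "Mus musculus sequences from Ensembl.",
  Version           = "0.0.1",
  organism          = "Mus musculus",
  common_name       = "Mouse",
  provider          = "Ensembl",
  genome            = "GRCm38.p6",
  release_date      = "FILL-IN",            # placeholder, look up before use
  source_url        = "FILL-IN",            # placeholder, look up before use
  organism_biocview = "Mus_musculus",
  BSgenomeObjname   = "Mmusculus",
  seqs_srcdir       = "/path/to/2bit/dir",  # hypothetical local directory
  seqfile_name      = "Mus_musculus.GRCm38.dna.primary_assembly.2bit"
)

seed_path <- file.path(tempdir(), "BSgenome.Mmusculus.Ensembl.GRCm38p6-seed")
write.dcf(as.data.frame(seed_fields, stringsAsFactors = FALSE), file = seed_path)

# Read the seed file back to confirm it parses as DCF.
seed <- read.dcf(seed_path)

# With a real 2bit file in seqs_srcdir, the package would then be forged with
# forgeBSgenomeDataPkg(seed_path), which lives in BSgenome or BSgenomeForge
# depending on the Bioconductor release (not run here).
```

The forged source directory would still need to be built and installed with `R CMD build` and `R CMD INSTALL` before it can be loaded like any other BSgenome package.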

That said, there are 819 different TwoBit files on the AnnotationHub for Mus musculus, most of which are Ensembl based. Anything from release 92-97, so far as I know, is p6, so you can always get the TwoBitFile from there, but you probably want the toplevel rather than the primary assembly, so you have to choose the strain:

> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("twobitfile", "musculus"))
AnnotationHub with 819 records
## urg. Do better
> query(hub, c("twobitfile", "musculus", "release-96"))
AnnotationHub with 65 records
# snapshotDate(): 2019-05-02 
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: TwoBitFile
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH70174"]]' 

            title                                              
  AH70174 | Mus_musculus.GRCm38.cdna.all.2bit                  
  AH70175 | Mus_musculus.GRCm38.dna.primary_assembly.2bit      
  AH70176 | Mus_musculus.GRCm38.dna_rm.primary_assembly.2bit   
  AH70177 | Mus_musculus.GRCm38.dna_sm.primary_assembly.2bit   
  AH70178 | Mus_musculus.GRCm38.ncrna.2bit                     
  ...       ...                                                
  AH70234 | Mus_musculus_pwkphj.PWK_PhJ_v1.ncrna.2bit          
  AH70235 | Mus_musculus_wsbeij.WSB_EiJ_v1.cdna.all.2bit       
  AH70236 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_rm.toplevel.2bit
  AH70237 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_sm.toplevel.2bit
  AH70238 | Mus_musculus_wsbeij.WSB_EiJ_v1.ncrna.2bit 

> tb <- hub[["AH70175"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache 
     AH70175 : 76921 
require("rtracklayer")
> tb
TwoBitFile object
resource: /home/jmacdon/.cache/AnnotationHub/5935639632667_76921 
> getSeq(tb, GRanges("1:34567-34599"))
  A DNAStringSet instance of length 1
    width seq
[1]    33 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

## huh, seems masked, because primary?
> tb2 <- hub[["AH70201"]] ## black6
> getSeq(tb2, GRanges("1:34567-34599"))
  A DNAStringSet instance of length 1
    width seq
[1]    33 TTTTTCTCCTTAAAATATTCGGGCAAGAAAGGA

I don't do much with BSgenome packages, so I don't know the fundamental differences, but to my eye, the TwoBitFile is pretty similar.
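One way to probe the run of Ns seen above is to measure the fraction of N bases over several windows: Ns confined to one stretch would suggest an assembly gap rather than masking, although the thread does not confirm either explanation. The helper below is a hypothetical sketch (the name `n_fraction` is not part of any package). It assumes Biostrings, GenomicRanges and rtracklayer are installed, and only defines the function, so nothing is downloaded until it is called.

```r
# Hypothetical helper: fraction of N bases in a region of a TwoBitFile.
# Packages are referenced with pkg:: so that defining the function does not
# require them to be attached.
n_fraction <- function(twobit, region) {
  # rtracklayer supplies the getSeq() method for TwoBitFile objects
  requireNamespace("rtracklayer", quietly = TRUE)
  seqs <- Biostrings::getSeq(twobit, GenomicRanges::GRanges(region))
  # letterFrequency() returns a matrix with one row per sequence
  Biostrings::letterFrequency(seqs, "N", as.prob = TRUE)[, "N"]
}

# Usage with the objects from the session above (not run here):
#   n_fraction(tb, "1:34567-34599")
#   n_fraction(tb, "1:5000001-5001000")   # arbitrary window further along
```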


I'll second James' observations, including a workflow using TwoBit (via AnnotationHub) or even FASTA files (managed using BiocFileCache) rather than BSgenome if these resources are sufficient for your research purposes.
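A minimal sketch of the FASTA route, assuming BiocFileCache and Biostrings are installed. The helper name `fetch_fasta` is hypothetical, and the URL in the usage comment is an assumption about the Ensembl FTP layout that should be verified before use.

```r
# Hypothetical helper: download a FASTA file once, then reuse the cached copy.
# The default cache is only created when the function is actually called.
fetch_fasta <- function(url, cache = BiocFileCache::BiocFileCache(ask = FALSE)) {
  # bfcrpath() returns the local path, adding the resource on first use
  path <- BiocFileCache::bfcrpath(cache, url)
  # readDNAStringSet() can read gzip-compressed FASTA files
  Biostrings::readDNAStringSet(path)
}

# Usage (not run here; the URL is an assumed example, check it first):
#   url <- paste0("ftp://ftp.ensembl.org/pub/release-96/fasta/mus_musculus/dna/",
#                 "Mus_musculus.GRCm38.dna.chromosome.19.fa.gz")
#   chr19 <- fetch_fasta(url)
#   Biostrings::subseq(chr19[[1]], start = 3000001, width = 50)
```

Because the cache keys on the URL, later calls read from the local copy instead of downloading again.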


Thank you Martin :-)


Thank you James for this extensive reply :-). I was not aware of the presence of these twobit files, so this is definitely good to know!

After looking into the Ensembl FASTA files, I realized the patches are provided in a separate alternate sequences file, leaving the primary assembly untouched, which makes the patch-level information difficult to use for many applications. With a new Mus musculus major release being planned in the not-so-distant future, I think I will actually work with the current primary assembly for now, and update to the new major release when available.
