Question: Is mm10.p6 available as BSgenome?
Aditya (120), Germany, wrote 3 months ago:

BSgenome.Mmusculus.UCSC.mm10 contains mm10 (the 2012 version). Is mm10 patch 6 (2017) also available as a BSgenome?

modified 3 months ago by James W. MacDonald (52k) • written 3 months ago by Aditya (120)
Answer: Is mm10.p6 available as BSgenome?
James W. MacDonald (52k), United States, wrote 3 months ago:

You could just make your own version.
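
As a rough sketch of what "making your own version" could look like: BSgenome data packages are forged from a seed file plus a directory of sequence files. The seed file name, the `seqs` directory and the `forged` output directory below are hypothetical placeholders, and the seed file itself still has to be written by hand following the BSgenome forge vignette. The lookup at the top is there because the forge functions lived in BSgenome when this thread was written and ship in BSgenomeForge in newer releases.

```r
# Pick whichever package provides the forge function in the installed release
forge_fun <- if (requireNamespace("BSgenomeForge", quietly = TRUE)) {
  BSgenomeForge::forgeBSgenomeDataPkg
} else {
  BSgenome::forgeBSgenomeDataPkg
}

# Hypothetical inputs: a hand-written seed file and the directory holding
# the sequence files that the seed file refers to
seed_file   <- "BSgenome.Mmusculus.Ensembl.GRCm38p6-seed"
seqs_srcdir <- "seqs"
destdir     <- "forged"

if (file.exists(seed_file) && dir.exists(seqs_srcdir)) {
  dir.create(destdir, showWarnings = FALSE)
  # Writes the source tree of the data package into destdir
  forge_fun(seed_file, seqs_srcdir = seqs_srcdir, destdir = destdir)
} else {
  message("Seed file or sequence directory not found; nothing forged.")
}
```

The resulting directory is an ordinary R source package, so it would then be built and installed with `R CMD build` and `R CMD INSTALL`.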

written 3 months ago by James W. MacDonald (52k)

Thank you James :-).

I could make BSgenome.Mmusculus.UCSC.mm10.p6 and submit it to BioC, but BioC seems to host only the initial release of the current major assembly version. Is that right?

I wonder why BioC doesn't upgrade the BSgenomes with each new BioC release. Freezing the release guarantees stability, but on the other hand the subsequent patches do not alter the genomic coordinates (only a new major version does), and 7 years is a long time... What would you say?

modified 3 months ago • written 3 months ago by Aditya (120)

The main reason the BSgenomes don't get updated is lack of personnel to do so. There are maybe 3-4 people who do the bulk of the work for each release, and while some of that involves updating annotation data, probably more involves the logistics of ensuring that thousands of different packages (both analytical and experimental) are all ready to go upon release.

With limited personnel there has to be a hierarchy of necessity, and building BSgenome packages for each successive patch unfortunately comes way down on that hierarchy. That is why the infrastructure exists to allow people to build their own if they so desire.

That said, there are 819 different TwoBit files on the AnnotationHub for Mus musculus, most of which are Ensembl based. Anything from release 92-97, so far as I know, is p6, so you can always get the TwoBitFile from there. You probably want the toplevel rather than the primary assembly, though, so you have to choose the strain:

> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("twobitfile", "musculus"))
AnnotationHub with 819 records
## urg, too many records. Narrow the query
> query(hub, c("twobitfile", "musculus", "release-96"))
AnnotationHub with 65 records
# snapshotDate(): 2019-05-02 
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: TwoBitFile
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH70174"]]' 

            title                                              
  AH70174 | Mus_musculus.GRCm38.cdna.all.2bit                  
  AH70175 | Mus_musculus.GRCm38.dna.primary_assembly.2bit      
  AH70176 | Mus_musculus.GRCm38.dna_rm.primary_assembly.2bit   
  AH70177 | Mus_musculus.GRCm38.dna_sm.primary_assembly.2bit   
  AH70178 | Mus_musculus.GRCm38.ncrna.2bit                     
  ...       ...                                                
  AH70234 | Mus_musculus_pwkphj.PWK_PhJ_v1.ncrna.2bit          
  AH70235 | Mus_musculus_wsbeij.WSB_EiJ_v1.cdna.all.2bit       
  AH70236 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_rm.toplevel.2bit
  AH70237 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_sm.toplevel.2bit
  AH70238 | Mus_musculus_wsbeij.WSB_EiJ_v1.ncrna.2bit 

> tb <- hub[["AH70175"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache 
     AH70175 : 76921 
require( rtracklayer )
> tb
TwoBitFile object
resource: /home/jmacdon/.cache/AnnotationHub/5935639632667_76921 
> getSeq(tb, GRanges("1:34567-34599"))
  A DNAStringSet instance of length 1
    width seq
[1]    33 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

## huh, seems masked, because primary?
> tb2 <- hub[["AH70201"]] ## black6
> getSeq(tb2, GRanges("1:34567-34599"))
  A DNAStringSet instance of length 1
    width seq
[1]    33 TTTTTCTCCTTAAAATATTCGGGCAAGAAAGGA
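
The console session above can be condensed into a script along these lines. It is a sketch rather than a tested recipe: it assumes network access to AnnotationHub and that the record ID AH70175 from the listing above is still served by the hub snapshot in use. The filter on `.dna.` in the title and the second query position are illustrative choices, not something from the thread. One possible explanation for the run of Ns, offered as an assumption, is that the queried position lies in the assembly gap at the start of chromosome 1 rather than in a masked region; querying further along the chromosome is a quick way to check.

```r
library(AnnotationHub)
library(rtracklayer)    # TwoBitFile class and its getSeq() method
library(GenomicRanges)

hub <- AnnotationHub()

# Narrow the search to Ensembl release 96 2bit files for mouse
hits <- query(hub, c("twobitfile", "musculus", "release-96"))

# Keep unmasked genomic DNA only (drops cdna, ncrna, dna_rm and dna_sm)
dna <- hits[grepl("\\.dna\\.", mcols(hits)$title)]
mcols(dna)[, "title", drop = FALSE]

# Retrieve the GRCm38 primary assembly by its hub ID
tb <- dna[["AH70175"]]

# Check sequence names and lengths before building ranges
seqinfo(tb)

# Same position as in the transcript, then one further along chromosome 1
getSeq(tb, GRanges("1:34567-34599"))
getSeq(tb, GRanges("1:3100001-3100033"))
```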

I don't do much with BSgenome packages, so I don't know the fundamental differences, but to my eye, the TwoBitFile is pretty similar.
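
To illustrate how similar the access pattern is, here is a small offline sketch that writes a toy 2bit file and queries it the same way one would query a BSgenome object. The sequences and the names `chrA` and `chrB` are made up for the example.

```r
library(Biostrings)
library(rtracklayer)
library(GenomicRanges)

# Toy "genome" with two named sequences
toy <- DNAStringSet(c(chrA = "ACGTACGTACGTACGTACGT",
                      chrB = "TTTTGGGGCCCCAAAA"))

# Write it out as a 2bit file in a temporary location
path <- tempfile(fileext = ".2bit")
export(toy, TwoBitFile(path))

# Open the file and look at names and lengths, as with seqinfo() on a BSgenome
tb <- TwoBitFile(path)
seqinfo(tb)

# Extract a range by coordinates, as with getSeq() on a BSgenome
res <- getSeq(tb, GRanges("chrB:5-8"))
res
```

With a BSgenome package the last call would take the genome object as its first argument instead of the TwoBitFile. The most visible practical difference is sequence naming: UCSC-based packages use names like `chr1`, whereas the Ensembl 2bit files use `1`.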

written 3 months ago by James W. MacDonald (52k)

I'll second James' observations, including a workflow using TwoBit (via AnnotationHub) or even FASTA files (managed using BiocFileCache) rather than BSgenome, if these resources are sufficient for your research purposes.
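
A minimal sketch of the FASTA route, assuming network access and that the Ensembl release 96 FTP path below is still available. The URL, the choice of chromosome 19 (to keep the download small) and the coordinates are illustrative assumptions, not taken from the thread.

```r
library(BiocFileCache)
library(Biostrings)

# Use the default per-user cache without prompting for its creation
bfc <- BiocFileCache(ask = FALSE)

# Assumed URL: GRCm38 chromosome 19 from Ensembl release 96
url <- paste0("ftp://ftp.ensembl.org/pub/release-96/fasta/mus_musculus/dna/",
              "Mus_musculus.GRCm38.dna.chromosome.19.fa.gz")

# Downloads on first use; later calls return the cached local path
path <- bfcrpath(bfc, url)

# Read the gzipped FASTA and pull out a short stretch of the sequence
chr19 <- readDNAStringSet(path)
names(chr19)
subseq(chr19[[1]], start = 10000001, end = 10000033)
```

Ensembl FASTA headers carry a description after the sequence name, so the names usually need trimming before they can be matched against range objects.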

written 3 months ago by Martin Morgan ♦♦ (24k)

Thank you Martin :-)

written 3 months ago by Aditya (120)

Thank you James for this extensive reply :-). I was not aware of the presence of these twobit files, so this is definitely good to know!

After looking into the Ensembl FASTA files, I realized that the patches are provided in a separate alternate-sequences file, leaving the primary assembly untouched, which makes the patch-level information difficult to use for many applications. With a new Mus musculus major release planned for the not-so-distant future, I think I will work with the current primary assembly for now and update to the new major release when it becomes available.

modified 3 months ago • written 3 months ago by Aditya (120)