Search
Question: What's the best way to retrieve the sizes (lengths) of chromosomes for a reference genome assembly?
0
gravatar for Nathan Sheffield
5 months ago by
University of Virginia
Nathan Sheffield10 wrote:

When I need to know the total sizes of chromosomes, I typically use the `BSGenome` package, which provides both metadata (like seqnames(BSgenome)) as well as raw sequence data.

I'm building a package now that needs this metadata information -- but for a package that never needs the raw sequence data and only the metadata, I'd rather not introduce a dependency on the huge BSgenome raw data packages nor on the BSgenome package itself, which has many dependencies; instead, is there a more streamlined way to easily get *just the metadata* for a package that needs only that?

It would have to come from something other than BSgenome. How do you approach this?

ADD COMMENTlink modified 5 months ago by Hervé Pagès ♦♦ 13k • written 5 months ago by Nathan Sheffield10
2
gravatar for Hervé Pagès
5 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Nathan,

Have you tried si <- Seqinfo(genome="canFam3"); seqlengths(si) ? It supports many of the most commonly used UCSC assemblies. See ?Seqinfo in the GenomeInfoDb package for more information (don't miss the examples). Let me know if your assembly is not supported and we'll add it.

Cheers,

H.

ADD COMMENTlink written 5 months ago by Hervé Pagès ♦♦ 13k

Thanks, that could be exactly what I need... out of curiosity, is there a way to get assembly gaps out of GenomeInfoDb the way you can from a BSgenome object? I didn't see that so far...

ADD REPLYlink written 5 months ago by Nathan Sheffield10

This is actually a new question so it should be asked as such. This way people can find it when they search our support site.

The GenomeInfoDb package provides no specific tools for retrieving the assembly gaps. If your assembly is supported by the UCSC Genome Browser, then just import the "gap" table as a GRanges object. You can use getTable() from the rtracklayer package for this:

library(rtracklayer)
session <- browserSession()
genome(session) <- "hg38"
query <- ucscTableQuery(session, table="gap")
gaps <- getTable(query)

or query the UCSC SQL server directly:

library(RMariaDB)
UCSC_dbselect <- function(dbname, from, columns=NULL, where=NULL,
                          server="genome-mysql.soe.ucsc.edu")
{
    columns <- if (is.null(columns)) "*"
               else paste0(columns, collapse=",")
    SQL <- sprintf("SELECT %s FROM %s", columns, from)
    if (!is.null(where)) {
        stopifnot(isSingleString(where))
        SQL <- paste(SQL, "WHERE", where)
    }
    dbconn <- dbConnect(RMariaDB::MariaDB(), dbname=dbname,
                                             username="genome",
                                             host=server,
                                             port=3306)
    on.exit(dbDisconnect(dbconn))
    dbGetQuery(dbconn, SQL)
}
gaps2 <- UCSC_dbselect("hg38", "gap")

The 2 gaps and gaps2 data.frames should contain the same data (possibly in different order). To turn the data.frame into a GRanges object:

library(GenomicRanges)
makeGRangesFromDataFrame(gaps, starts.in.df.are.0based=TRUE)

H.

ADD REPLYlink modified 5 months ago • written 5 months ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 325 users visited in the last hour