Search
Question: What's the best way to retrieve the sizes (lengths) of chromosomes for a reference genome assembly?
0
7 months ago by
University of Virginia
Nathan Sheffield10 wrote:

When I need to know the total sizes of chromosomes, I typically use the BSGenome package, which provides both metadata (like seqnames(BSgenome)) as well as raw sequence data.

I'm building a package now that needs this metadata information -- but for a package that never needs the raw sequence data and only the metadata, I'd rather not introduce a dependency on the huge BSgenome raw data packages nor on the BSgenome package itself, which has many dependencies; instead, is there a more streamlined way to easily get *just the metadata* for a package that needs only that?

It would have to come from something other than BSgenome. How do you approach this?

modified 7 months ago by Hervé Pagès ♦♦ 13k • written 7 months ago by Nathan Sheffield10
2
7 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Nathan,

Have you tried si <- Seqinfo(genome="canFam3"); seqlengths(si) ? It supports many of the most commonly used UCSC assemblies. See ?Seqinfo in the GenomeInfoDb package for more information (don't miss the examples). Let me know if your assembly is not supported and we'll add it.

Cheers,

H.

Thanks, that could be exactly what I need... out of curiosity, is there a way to get assembly gaps out of GenomeInfoDb the way you can from a BSgenome object? I didn't see that so far...

This is actually a new question so it should be asked as such. This way people can find it when they search our support site.

The GenomeInfoDb package provides no specific tools for retrieving the assembly gaps. If your assembly is supported by the UCSC Genome Browser, then just import the "gap" table as a GRanges object. You can use getTable() from the rtracklayer package for this:

library(rtracklayer)
session <- browserSession()
genome(session) <- "hg38"
query <- ucscTableQuery(session, table="gap")
gaps <- getTable(query)

or query the UCSC SQL server directly:

library(RMariaDB)
UCSC_dbselect <- function(dbname, from, columns=NULL, where=NULL,
server="genome-mysql.soe.ucsc.edu")
{
columns <- if (is.null(columns)) "*"
else paste0(columns, collapse=",")
SQL <- sprintf("SELECT %s FROM %s", columns, from)
if (!is.null(where)) {
stopifnot(isSingleString(where))
SQL <- paste(SQL, "WHERE", where)
}
host=server,
port=3306)
on.exit(dbDisconnect(dbconn))
dbGetQuery(dbconn, SQL)
}
gaps2 <- UCSC_dbselect("hg38", "gap")

The 2 gaps and gaps2 data.frames should contain the same data (possibly in different order). To turn the data.frame into a GRanges object:

library(GenomicRanges)
makeGRangesFromDataFrame(gaps, starts.in.df.are.0based=TRUE)

H.