What's the best way to retrieve the sizes (lengths) of chromosomes for a reference genome assembly?
1
0
Entering edit mode
@nathan-sheffield-7613
Last seen 15 months ago
University of Virginia

When I need to know the total sizes of chromosomes, I typically use the `BSGenome` package, which provides both metadata (like seqnames(BSgenome)) as well as raw sequence data.

I'm building a package now that needs this metadata information -- but for a package that never needs the raw sequence data and only the metadata, I'd rather not introduce a dependency on the huge BSgenome raw data packages nor on the BSgenome package itself, which has many dependencies; instead, is there a more streamlined way to easily get *just the metadata* for a package that needs only that?

It would have to come from something other than BSgenome. How do you approach this?

bsgenome chromosome reference genome • 2.3k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 6 days ago
Seattle, WA, United States

Hi Nathan,

Have you tried si <- Seqinfo(genome="canFam3"); seqlengths(si) ? It supports many of the most commonly used UCSC assemblies. See ?Seqinfo in the GenomeInfoDb package for more information (don't miss the examples). Let me know if your assembly is not supported and we'll add it.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Thanks, that could be exactly what I need... out of curiosity, is there a way to get assembly gaps out of GenomeInfoDb the way you can from a BSgenome object? I didn't see that so far...

ADD REPLY
0
Entering edit mode

This is actually a new question so it should be asked as such. This way people can find it when they search our support site.

The GenomeInfoDb package provides no specific tools for retrieving the assembly gaps. If your assembly is supported by the UCSC Genome Browser, then just import the "gap" table as a GRanges object. You can use getTable() from the rtracklayer package for this:

library(rtracklayer)
session <- browserSession()
genome(session) <- "hg38"
query <- ucscTableQuery(session, table="gap")
gaps <- getTable(query)

or query the UCSC SQL server directly:

library(RMariaDB)
UCSC_dbselect <- function(dbname, from, columns=NULL, where=NULL,
                          server="genome-mysql.soe.ucsc.edu")
{
    columns <- if (is.null(columns)) "*"
               else paste0(columns, collapse=",")
    SQL <- sprintf("SELECT %s FROM %s", columns, from)
    if (!is.null(where)) {
        stopifnot(isSingleString(where))
        SQL <- paste(SQL, "WHERE", where)
    }
    dbconn <- dbConnect(RMariaDB::MariaDB(), dbname=dbname,
                                             username="genome",
                                             host=server,
                                             port=3306)
    on.exit(dbDisconnect(dbconn))
    dbGetQuery(dbconn, SQL)
}
gaps2 <- UCSC_dbselect("hg38", "gap")

The 2 gaps and gaps2 data.frames should contain the same data (possibly in different order). To turn the data.frame into a GRanges object:

library(GenomicRanges)
makeGRangesFromDataFrame(gaps, starts.in.df.are.0based=TRUE)

H.

ADD REPLY

Login before adding your answer.

Traffic: 618 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6