GRanges: problem with multiple seqnames and seqlengths
3
0
Entering edit mode
@topherhamm-6775
Last seen 7.0 years ago
United States

I would like to take the results of a BLAST search and look at their location on a de novo genome. I'm running into an issue creating the GRanges object and I believe this is because many of the BLAST results hit to the same scaffold.

As I understand it, GRanges requires a seqname for each row of data (where each row corresponds to one line of tabular blastn output). I would like to add the length of each genome scaffold as metadata but it seems that seqlengths will only accept one interger for each seqname. The examples I have read in the 'help' and other on-line resources feed seqlengths something like seqlengths = c("chr1" = 346364242, "chr2" = 353453). This seems perfectly reasonable for small data sets, but could be an issue for larger data sets, like mine. Whether I give seqlengths a vector with one integer per row of data, or only the unique interger values for each scaffold length I get an error (see code below).

I've made a hacky work around but was wondering if there is a more elegant way of adding the seqlengths. I would really appreciate any help or guidance you can offer. Thanks for reading this. Code below.

Chris Hamm (University of Kansas)

GenomicRanges v1.16.4 ; IRanges v1.22.10

The steps are: 1) Read in the genome:

Manduca <- readDNAStringSet("~/Desktop/Projects/Genomes/M_sexta/Msex05162011.genome.fa", format = "fasta", use.names = TRUE)
ManSI <- seqinfo(Manduca)

2) read in my blast results and subset

tbx1 <- read.table("genome_blast/W_tblastx.1", sep = "\t")
colnames(tbx1) <- c("queryID", "subjectID", "pcID", "aln_len", "mismatches", "gap_count", "query_start", "query_end", "sub_start", "sub_end", "eVal", "bit_score")
tbx1 <- subset(tbx1, eVal <= 1.0e-4 & pcID >= 50)
tbx1 <- drop.levels(tbx1, reorder = FALSE)

3) create an IRanges object:

c18977 <- subset(tbx1, queryID == "MsW_c18977_593L")
c18977 <- drop.levels(c18977, reorder = FALSE)
c18977.IR <- IRanges(start = c18977$sub_start, width = abs(c18977$sub_start - c18977$sub_end), names = seq(1:length(c18977$sub_start)))

4) create the GRanges obejct:

c18977seq <- c18977$subjectID[c18977$subjectID %in% names(Manduca)]
gr18977 <- GRanges(seqnames = as.character(Manduca[c18977seq, ]@ranges@NAMES), ranges = c18977.IR)
gr18977@seqinfo@seqlengths <- unique(Manduca[c18977seq, ]@ranges@width)

This works, but seems a clunky way to add the sequence lengths. But, the following generates an error.

5) create GRanges object with a vector of lengths:

gr18977 <- GRanges(seqnames = as.character(Manduca[c18977seq, ]@ranges@NAMES), ranges = c18977.IR, seqlengths = Manduca[c18977seq, ]@ranges@width)
Error in .normargSeqlengths(seqlengths, seqnames) :
length of supplied 'seqlengths' must equal the number of sequences
GRanges seqnames seqlengths IRanges • 4.4k views
0
Entering edit mode
@michael-lawrence-3846
Last seen 11 days ago
United States

Bioconductor objects have accessors that hide the implementation of the data structures from user code. It is highly recommended to take advantage of those, because direct slot manipulation will (1) be extremely difficult to get right and (2) make your code difficult to read, write and maintain.

So you should have something like:

gr18977 <- GRanges(seqnames = names(Manduca[c18977seq]),
ranges = c18977.IR)
seqlengths(gr18977) <- unique(width(Manduca[c18977seq]))

The next step fails, because you failed to take the unique lengths:

gr18977 <- GRanges(seqnames = names(Manduca[c18977seq]),
ranges = c18977.IR,
seqlengths = unique(width(Manduca[c18977seq])))

That should avoid much of the clunkiness.

0
Entering edit mode

This seqlengths(gr18977) <- unique(width(Manduca[c18977seq])) fails because the right-hand side is an unnamed vector, so the software cannot associate seqlength with seqlevel; it needs to be a named vector.

0
Entering edit mode
@topherhamm-6775
Last seen 7.0 years ago
United States

Thanks for the response!

I actually tried your version of the code before I posted. I apologize if I was not clear enough in my above quesiton when I said "Whether I give seqlengths a vector with one integer per row of data, or only the unique interger values for each scaffold length I get an error (see code below)"

When I use code with "unique" I get the following error back:

> gr18977 <- GRanges(seqnames = names(Manduca[c18977seq]), ranges = c18977.IR, seqlengths = unique(width(Manduca[c18977seq])))
Error in .normargSeqlengths(seqlengths, seqnames) :
length of supplied 'seqlengths' must equal the number of sequences

Thanks again.

0
Entering edit mode

Here are two wrong and one correct way of doing things

> GRanges(c("A", "B"), IRanges(1, 1), seqlengths=c(123, 456))
Error in .normargSeqlengths(seqlengths, seqnames) :
length of supplied 'seqlengths' must equal the number of sequences
> GRanges(c("A", "B"), IRanges(1, 1), seqlengths=c(A=123))
Error in validObject(.Object) :
invalid class "GRanges" object: 'seqnames' contains missing values
> GRanges(c("A", "B"), IRanges(1, 1), seqlengths=c(A=123, B=456))
GRanges object with 2 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]        A    [1, 1]      *
[2]        B    [1, 1]      *
-------
seqinfo: 2 sequences from an unspecified genome

In your case, unique(width(Manduca[c18977seq]))) returns an unnamed vector, so is an example of wrong way 1.

One possible formulation is

> gr = GRanges(c("A", "B"), IRanges(1, 1))
> seqlengths(gr) = sapply(split(width(gr), seqnames(gr)), max)

0
Entering edit mode

Note that the error message returned by "wrong way 1" is inadequate. The length of supplied 'seqlengths' is actually equal to the number of sequences. The real problem is that the supplied 'seqlengths' is not named. Also we have an inconsistency between what the GRanges() constructor expects and what the seqlengths() setter expects. The latter actually accepts an unamed vector:

gr <- GRanges(c("A", "B"), IRanges(1, 1))
seqlengths(gr) <- c(123, 456)

Then:

> seqlengths(gr)
A   B
123 456 

So it seems we need to improve these things a little bit.

Finally, for inferring the seqlengths of a GRanges object from the ranges it contains, I recommend you use end() instead of width():

seqlengths(gr) <- max(split(end(gr), seqnames(gr)))

Of course, this will not produce the "real" seqlengths (unless gr contains for each chromosome at least one range that covers the end of the chromosome) but it will guarantee that no range in gr is located beyond the end of a chromosome.

H.

0
Entering edit mode
@topherhamm-6775
Last seen 7.0 years ago
United States

Thank you all for the answers and support, I really appreciate it.

CH