I would like to take the results of a BLAST search and look at their location on a de novo genome. I'm running into an issue creating the GRanges object and I believe this is because many of the BLAST results hit to the same scaffold.
As I understand it, GRanges requires a seqname for each row of data (where each row corresponds to one line of tabular blastn output). I would like to add the length of each genome scaffold as metadata but it seems that seqlengths will only accept one interger for each seqname. The examples I have read in the 'help' and other on-line resources feed seqlengths something like seqlengths = c("chr1" = 346364242, "chr2" = 353453).
This seems perfectly reasonable for small data sets, but could be an issue for larger data sets, like mine. Whether I give seqlengths a vector with one integer per row of data, or only the unique interger values for each scaffold length I get an error (see code below).
I've made a hacky work around but was wondering if there is a more elegant way of adding the seqlengths. I would really appreciate any help or guidance you can offer. Thanks for reading this. Code below.
Chris Hamm (University of Kansas)
GenomicRanges v1.16.4 ; IRanges v1.22.10
The steps are: 1) Read in the genome:
Manduca <- readDNAStringSet("~/Desktop/Projects/Genomes/M_sexta/Msex05162011.genome.fa", format = "fasta", use.names = TRUE) ManSI <- seqinfo(Manduca)
2) read in my blast results and subset
tbx1 <- read.table("genome_blast/W_tblastx.1", sep = "\t") colnames(tbx1) <- c("queryID", "subjectID", "pcID", "aln_len", "mismatches", "gap_count", "query_start", "query_end", "sub_start", "sub_end", "eVal", "bit_score") tbx1 <- subset(tbx1, eVal <= 1.0e-4 & pcID >= 50) tbx1 <- drop.levels(tbx1, reorder = FALSE)
3) create an IRanges object:
c18977 <- subset(tbx1, queryID == "MsW_c18977_593L") c18977 <- drop.levels(c18977, reorder = FALSE) c18977.IR <- IRanges(start = c18977$sub_start, width = abs(c18977$sub_start - c18977$sub_end), names = seq(1:length(c18977$sub_start)))
4) create the GRanges obejct:
c18977seq <- c18977$subjectID[c18977$subjectID %in% names(Manduca)] gr18977 <- GRanges(seqnames = as.character(Manduca[c18977seq, ]@ranges@NAMES), ranges = c18977.IR) gr18977@seqinfo@seqlengths <- unique(Manduca[c18977seq, ]@ranges@width)
This works, but seems a clunky way to add the sequence lengths. But, the following generates an error.
5) create GRanges object with a vector of lengths:
gr18977 <- GRanges(seqnames = as.character(Manduca[c18977seq, ]@ranges@NAMES), ranges = c18977.IR, seqlengths = Manduca[c18977seq, ]@ranges@width) Error in .normargSeqlengths(seqlengths, seqnames) : length of supplied 'seqlengths' must equal the number of sequences
This
seqlengths(gr18977) <- unique(width(Manduca[c18977seq]))
fails because the right-hand side is an unnamed vector, so the software cannot associate seqlength with seqlevel; it needs to be a named vector.