Here is one way:
> library(GenomicRanges)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from ‘package:stats’:
xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
rownames, sapply, setdiff, sort, table, tapply, union, unique,
unlist, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: IRanges
Loading required package: GenomeInfoDb
> segmented <- GRanges(c(1,1), IRanges(c(5930000, 16850000), c(11730000,18010000)))
> reference <- GRanges(rep(1,4), IRanges(c(5930500,6930500,16854500,17810032), c(6230500,7340500,16950000, 17910064)))
> mcols(reference)$Genes <- c("SPSB1","SPSB2","TAS1R1","ENO1")
> olap <- findOverlaps(segmented, reference)
> mcols(segmented)$Genes <- sapply(split(mcols(reference)$Genes, queryHits(olap)), paste, collapse = ",")
> segmented
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | Genes
<Rle> <IRanges> <Rle> | <character>
[1] 1 [ 5930000, 11730000] * | SPSB1,SPSB2
[2] 1 [16850000, 18010000] * | TAS1R1,ENO1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
@James..when i tried this command on my dataset, getting an error message
Warning message:
In split.default(mcols(segment)$Gens, queryHits(olap)) :
data length is not a multiple of split variable
When you tried what command? You need to show your work if you expect anybody to help.
@James..
library(GenomicRanges)
df1.Grange <- GRanges(seqnames= df1$Chromosome, IRanges(start=df1$Start, end =df1$End),ID= df1$Gene.symb) ref.Grange <- GRanges(seqnames= ref$Chromosome.Name, IRanges(start=ref$Gene.Start..bp., end = ref$Gene.End..bp.),ID= ref$Associated.Gene.Name.1)
olap <- findOverlaps(df1.Grange, ref.Grange)
mcols(df1.Grange)$M.Genes <- sapply(split(mcols(ref.Grange)$ID, queryHits(olap)), paste, collapse = ",")
Error message: 136 elements in value to replace 148 elements
In addition: Warning message:
In split.default(mcols(ref.Grange)$ID, queryHits(olap)) :
data length is not a multiple of split variable
Here is a slight modification of your original example:
So let's look a bit deeper to figure out what is going on. Let's run some of the code separately. The call to split() generates a list containing the genes that are within each range of segmented. Note here that there are only two list items, because there are no genes in the third range of segmented.
The naive code I gave you assumes that there will be one or more genes in each range of segmented, which fails because we collapse like this:
And there are only two things here that we are trying to fit into a DataFrame that has three rows. So you either have to subset the segmented GRanges object to just those ranges that have genes, or you have to take the list of gene names, and 'bump it out' to have empty character vectors for those ranges in segmented that have no genes, and then collapse to a vector which will be the right length.
These are both basic data manipulation steps in R, so I will leave it to you to try to figure out how to do so. The best way to learn R is to have something concrete (and relatively simple) that you want to accomplish, and then figure out how to do it. So there you go.
Hi Jim, Rajith,
Don't use
split()
. First turn Hits objectolap
into a list-like object parallel tosegmented
:Then use
extractList()
:Hope this helps,
H.
Hi Herve,
Yes indeed it was helpful. But the class of segmented$Gene in CharacterList, which makes it difficult to do the analysis. So i converted. Hope this is the correct way.
segmented$Gene <- unlist(lapply(segment$Gene , function(m) {paste(m,collapse = ",")}))
or use
unstrsplit()
:It's equivalent but much faster than
sapply(..., paste, collapse=",")
(the difference will be noticeable if the CharacterList object contains tens of thousands of character vectors).Cheers,
H.
Nice! Thanks Herve.