Search
Question: subsetByOverlaps with Multimapping
0
gravatar for Assa Yeroslaviz
10 weeks ago by
Assa Yeroslaviz1.3k
Munich, Germany
Assa Yeroslaviz1.3k wrote:

as a sequel to my subsetting GRanges object based on gene IDs, I was wondering how multiple mapping can be handled. I have two GRanges objects. One is with intensities per position, the other is for two genes, I would like to overlap with. 

gr1 <- GRanges("VI", IRanges(c(3320:3321,3330:3331,3341:3342), c(3320:3321,3330:3331,3341:3342) ))
gr2 <- GRanges("VI", IRanges(c(3322, 3030), c(3846, 3338), names=c("YFL064C", "YFL065C")), gene_id=c("YFL064C", "YFL065C"))

I know, I can handle multi-mapping using the CharacterList(split())-combination. When doing so this is what I get:

ranges <- subsetByOverlaps(gr1,gr2)
hits <- findOverlaps(gr1, gr2)
geneids <- CharacterList(split(gr2$gene_id[subjectHits(hits)], queryHits(hits)))
mcols(ranges) <- DataFrame(mcols(ranges), geneids)

 

and the results looks like that:

> ranges
GRanges object with 6 ranges and 1 metadata column:
      seqnames       ranges strand |         geneids
         <Rle>    <IRanges>  <Rle> | <CharacterList>
  [1]       VI [3320, 3320]      * |         YFL065C
  [2]       VI [3321, 3321]      * |         YFL065C
  [3]       VI [3330, 3330]      * | YFL065C,YFL064C
  [4]       VI [3331, 3331]      * | YFL065C,YFL064C
  [5]       VI [3341, 3341]      * |         YFL064C
  [6]       VI [3342, 3342]      * |         YFL064C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

This is what I expect to see, as the first and last two positions are unique to the the genes YFL065C and YFL064C respectively, while the two positions in between overlap both genes genes.

But I would like to know if it is possible to separate these multiple mapping into separate row, so that I would be able later on to split the GRanges into a list based on the gene_id column. 

I would like it to be like that

> ranges
GRanges object with 6 ranges and 1 metadata column:
      seqnames       ranges strand |         geneids
         <Rle>    <IRanges>  <Rle> | <CharacterList>
  [1]       VI [3320, 3320]      * |         YFL065C
  [2]       VI [3321, 3321]      * |         YFL065C
  [3]       VI [3330, 3330]      * |         YFL065C
  [4]       VI [3330, 3330]      * |         YFL064C
  [5]       VI [3331, 3331]      * |         YFL065C
  [6]       VI [3331, 3331]      * |         YFL064C
  [7]       VI [3341, 3341]      * |         YFL064C
  [8]       VI [3342, 3342]      * |         YFL064C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

so that the multi-mapped positions are separated into multiple rows, each with a different gene id. 

thanks Assa

 

 

 

 

 

 

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by Assa Yeroslaviz1.3k

I managed to do it in a very indirect way by exporting the GRanges object to a data.frame, converting the CharacterList column into character vectors using the unstrsplit() command from IRanges package. I than expanded the list based on the geneids column with my expand.delimited function (s. below for the script).

This was followed by merging the two newly created data frames into one and re-organizing the columns. The data was than coerced back into a GRanges object.

I am sure there is a better and more efficient way of doing it. I would appreciate your help.

 

this was the workflow:

ranges$geneids <- unstrsplit(ranges$geneids, sep=",") # converting the CharacterList into Character
df <- as.data.frame(ranges)

df.2 <- expand.delimited(x=df, col1=2, col2=6) # expanding the data frames

df.merged <- merge(x=df.2, y=df, by.x=1, by.y=2,all.y=TRUE) # merging
df.merged <- df.merged[,c("seqnames", "start", "end", "width", "strand", "geneids.x")] # reorgenizing

as(df.merged, "GRanges") #coercing back to GRanges

expand.delimited <- function(x, col1=1, col2=2, sep=",") {
  rnum <- 1
  expand_row <- function(y) {
    factr <- y[col1]
    strng <- toString(y[col2])
    expand <- strsplit(strng, sep)[[1]]
    num <- length(expand)
    factor <- rep(factr,num)
    return(as.data.frame(cbind(factor,expand),row.names=seq(rnum:(rnum+num)-1)))
    rnum <- (rnum+num)-1
  }
  expanded <- apply(x,1,expand_row)
  df <- do.call("rbind", expanded)
  names(df) <- c(names(x)[col1],names(x)[col2])
  return(df)
}

 

 

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Assa Yeroslaviz1.3k
2
gravatar for Hervé Pagès
10 weeks ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Assa,

I would just do:

hits <- findOverlaps(gr2, gr1)
setNames(extractList(gr1, as(hits, "List")), names(gr2))
# GRangesList object of length 2:
# $YFL064C 
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames       ranges strand
#          <Rle>    <IRanges>  <Rle>
#   [1]       VI [3330, 3330]      *
#   [2]       VI [3331, 3331]      *
#   [3]       VI [3341, 3341]      *
#   [4]       VI [3342, 3342]      *
#
# $YFL065C 
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]       VI [3320, 3320]      *
#   [2]       VI [3321, 3321]      *
#   [3]       VI [3330, 3330]      *
#   [4]       VI [3331, 3331]      *
#
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

This produces a GRangesList object parallel to gr2 i.e. with one list-element per gene. You can then unlist() it if you want a GRanges object, and add the gene_id metadata column (using the names for that):

gr1b <- unlist(setNames(extractList(gr1, as(hits, "List")), names(gr2)))
mcols(gr1b)$gene_id <- names(gr1b)
gr1b
# GRanges object with 8 ranges and 1 metadata column:
#           seqnames       ranges strand |     gene_id
#              <Rle>    <IRanges>  <Rle> | <character>
#   YFL064C       VI [3330, 3330]      * |     YFL064C
#   YFL064C       VI [3331, 3331]      * |     YFL064C
#   YFL064C       VI [3341, 3341]      * |     YFL064C
#   YFL064C       VI [3342, 3342]      * |     YFL064C
#   YFL065C       VI [3320, 3320]      * |     YFL065C
#   YFL065C       VI [3321, 3321]      * |     YFL065C
#   YFL065C       VI [3330, 3330]      * |     YFL065C
#   YFL065C       VI [3331, 3331]      * |     YFL065C
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Cheers,

H.

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by Hervé Pagès ♦♦ 13k

thanks a lot. I thought there might be a better way to do it.

ADD REPLYlink written 10 weeks ago by Assa Yeroslaviz1.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 198 users visited in the last hour