Question

Splitting lines of a GRanges object based on character list

stephen.williams ▴ 10

@stephenwilliams-15198

I have a GRanges object that was generated using some of the really nice info from this page (Mapping genome regions to gene symbols). I'm finding overlaps between my query GRanges and my subject GRanges (Homo.sapiens) and assigning gene symbols to each locus. However, when two genes overlap the same locus, you get something like this:

     seqnames                 ranges strand |     numBC    SYMBOL
         <Rle>              <IRanges>  <Rle> | <integer>    <CharacterList>
  [1]    chr12 [122692988, 122693157]      * |       174    DIABLO,VPS33A
  [2]    chr12 [122693161, 122693336]      * |       167    DIABLO,VPS33A
  [3]    chr12 [122694166, 122694413]      * |       133    DIABLO,VPS33A

This comes from the following script:

grange_test <- makeGRangesFromDataFrame(bc_test, keep.extra.columns=TRUE)
symInCnv_test <- splitColumnByOverlap(hs, grange_test, "SYMBOL")
grange_test$SYMBOL <- symInCnv_test

However, the function

splitColumnByOverlap <-
    function(query, subject, column="ENTREZID", ...)
{
    olaps <- findOverlaps(query, subject, ...)
    f1 <- factor(subjectHits(olaps),
                 levels=seq_len(subjectLength(olaps)))
    splitAsList(mcols(query)[[column]][queryHits(olaps)], f1)
}

creates a CharacterList for the gene symbols. For a variety of reasons, I actually need each gene on its own row, as shown below.

     seqnames                 ranges strand |     numBC    SYMBOL
         <Rle>              <IRanges>  <Rle> | <integer>    <character>
  [1]    chr12 [122692988, 122693157]      * |       174    DIABLO
  [2]    chr12 [122692988, 122693157]      * |       174    VPS33A
  [3]    chr12 [122693161, 122693336]      * |       167    DIABLO
  [4]    chr12 [122693161, 122693336]      * |       167    VPS33A
  [5]    chr12 [122694166, 122694413]      * |       133    DIABLO
  [6]    chr12 [122694166, 122694413]      * |       133    VPS33A

Can anyone think of a way to do this (GenomicRanges, fix splitColumnByOverlap(), tidy, or otherwise)?

I've tried converting the resulting GRanges to a data.frame and splitting it in a variety of ways, but nothing gets me where I need to be. Any help would be greatly appreciated.

Thanks.

ADD COMMENT • link updated 8.0 years ago by Michael Lawrence ★ 11k • written 8.0 years ago by stephen.williams ▴ 10

score 2 · Accepted Answer · 2018-03-08

Michael Lawrence ★ 11k

@michael-lawrence-3846

expand(grange_test, "SYMBOL")

Thanks for the reply but this does not work.

grange_test <- as.data.frame(grange_test) 
expand(grange_test, "SYMBOL")

Gives

# A tibble: 1 x 1
  `"SYMBOL"`
  <chr>     
1 SYMBOL

ADD REPLY • link 8.0 years ago stephen.williams ▴ 10

Why are you coercing to a data frame first?

ADD REPLY • link 8.0 years ago Michael Lawrence ★ 11k

expand does not seem to work with GRanges:

expand(grange_test, "SYMBOL")
Error in UseMethod("expand_") : 
  no applicable method for 'expand_' applied to an object of class "c('GRanges', 'GenomicRanges', 'GRanges_OR_NULL', 'GRangesOrIRanges', 'Vector', 'GenomicRanges_OR_missing', 'GenomicRanges_OR_GRangesList', 'GenomicRanges_OR_GenomicRangesList', 'Annotated')"

ADD REPLY • link 8.0 years ago stephen.williams ▴ 10

I've gotten fairly close using

grange_test <- 
as.data.frame(grange_test) %>% 
  mutate(SYMBOL = strsplit(as.character(SYMBOL), ",")) %>% 
  unnest(SYMBOL)

But the resulting "SYMBOL" column has a bunch of leftover characters that I'm having a hard time removing:

seqnames     start       end    numBC    SYMBOL
chr3     150601398    150601565   168    c("CLRN1-AS1"
chr3     150601398    150601565   168    "CLRN1")

ADD REPLY • link 8.0 years ago stephen.williams ▴ 10
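
The stray `c("` and `")` fragments are what you would expect if SYMBOL is a list column after as.data.frame(): as.character() on a list deparses each element into a single string such as `c("CLRN1-AS1", "CLRN1")`, and strsplit() then cuts that string at the comma. A minimal base R sketch of that behaviour and of one way around it, assuming a list column (the `symbols` and `df` objects below are made up for illustration and are not the poster's data):

```r
# Hypothetical stand-in for one row's SYMBOL entry stored as a list element.
symbols <- list(c("CLRN1-AS1", "CLRN1"))

# as.character() on a list deparses each element into one string,
# so splitting on "," leaves the c(" and ") debris in the pieces.
deparsed <- as.character(symbols)
pieces <- strsplit(deparsed, ",")[[1]]

# Working on the list itself avoids the debris: repeat each row once per
# symbol, then flatten the list into a plain character column.
df <- data.frame(seqnames = "chr3", start = 150601398L,
                 end = 150601565L, numBC = 168L)
long <- df[rep(seq_len(nrow(df)), lengths(symbols)), ]
long$SYMBOL <- unlist(symbols)
rownames(long) <- NULL
```

This is only an illustration of the data.frame route; it drops the GRanges class, so the result would need makeGRangesFromDataFrame() again if a GRanges is required.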

Success! Your method worked but you have to use

S4Vectors::expand

not

Matrix::expand

or

tidyr::expand
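
For anyone landing here later, a minimal sketch of the namespaced call on a toy GRanges modelled on the example in the question. The `gr` object is made up for illustration, and the sketch assumes GenomicRanges is installed (loading it also attaches S4Vectors and IRanges). Writing S4Vectors::expand explicitly sidesteps masking by tidyr or Matrix if those were loaded afterwards.

```r
library(GenomicRanges)  # also attaches S4Vectors and IRanges

# Toy GRanges with a CharacterList metadata column (illustrative data only).
gr <- GRanges(
    seqnames = "chr12",
    ranges = IRanges(start = c(122692988, 122693161),
                     end   = c(122693157, 122693336)),
    numBC = c(174L, 167L)
)
gr$SYMBOL <- CharacterList(c("DIABLO", "VPS33A"), c("DIABLO", "VPS33A"))

# Namespace-qualified call, so a masking expand() from another package
# is never picked up by mistake.
expanded <- S4Vectors::expand(gr, "SYMBOL")
expanded
```

Each range should then appear once per gene symbol, with SYMBOL as an ordinary character column.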