Question

How do I split GRange object (set of genomic ranges with 2 metadata column) by chromosome?

1

Entering edit mode

Jurat Shahidin ▴ 80

@jurat-shahidin-9488

Last seen 4.1 years ago

Chicago, IL, USA

Dear all:

When I am gonna see overlapped regions between two set of GRanges objects, and I want to split theses regions into sub regions by order of chromosome. Intuitively, take all genomic regions from each chromosome and iterate over. Any hint to do this in R ? Thanks a lot

granges split r • 7.3k views

ADD COMMENT • link updated 2.3 years ago by jon.klonowski ▴ 40 • written 8.2 years ago by Jurat Shahidin ▴ 80

0

Entering edit mode

Jurat Shahidin ▴ 80

@jurat-shahidin-9488

Last seen 4.1 years ago

Chicago, IL, USA

Hi,

there was one posting that bit of related about what I asked. They did like this:

# read bed files

bed_gr <- as(import.bed(bed.1), "GRanges")

#then , select specific chromosome (i.e., chr1 )

bed_gr$chr1 <- bed_gr[seqnames(bed_gr)=="chr1"]

but, in my case, I am gonna iterate this one by one. Do you know any easy way to do this? Instead I have to write function to handle this case. Thanks a lot Julian !!

ADD COMMENT • link 8.2 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

I'd suggest going with Julian's solution.

ADD REPLY • link 8.2 years ago Sean Davis 21k

Martin Morgan · Accepted Answer · 2016-01-30

5

Entering edit mode

Julian Gehring ★ 1.3k

@julian-gehring-5818

Last seen 4.9 years ago

This will split a GRanges object by chromosome, which gives you a GRangesList. Then you can use lapply, sapply and friends to process the data for each chromosome individually:

## some example data
gr0 <- GRanges(Rle(c("chr2", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(1:10, width=10:1))

## optional, if you want a genomic order of the chromosomes
gr0 = sortSeqlevels(gr0)

## split into a GRangesList
## where each element has all ranges for one chromosome
grl = split(gr0, seqnames(gr0))

## apply a function to the ranges of each chromosome
res = lapply(grl, your_function)

ADD COMMENT • link updated 8.2 years ago by Martin Morgan 25k • written 8.2 years ago by Julian Gehring ★ 1.3k

0

Entering edit mode

Hi,

Thanks a lot for your reply. I have tried your solution but it gave me error. such as:

split(gr0, seqnames(gr0)) GRangesList object of length 3: Error in nchar(nms) : could not find symbol "keepNA" in environment of the generic function

But, I have checked other related post and I figured out now. Thank you

ADD REPLY • link 8.2 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Out of curiosity, how did you manage to solve it in the end?

ADD REPLY • link 8.2 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

It would be good to understand why this throws an error for you. Does this fail just with your data, or also with the example data from above? Can you paste here how the object 'gr0' looks like before trying to split it?

ADD REPLY • link 8.2 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Hi,

this how it looks like when use your example:

gr0 <- GRanges(Rle(c("chr2", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(1:10, width=10:1))
> gr0
GRanges object with 10 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr2  [ 1, 10]      *
   [2]     chr2  [ 2, 10]      *
   [3]     chr2  [ 3, 10]      *
   [4]     chr2  [ 4, 10]      *
   [5]     chr1  [ 5, 10]      *
   [6]     chr1  [ 6, 10]      *
   [7]     chr3  [ 7, 10]      *
   [8]     chr3  [ 8, 10]      *
   [9]     chr3  [ 9, 10]      *
  [10]     chr3  [10, 10]      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> split(gr0, seqnames(gr0))
GRangesList object of length 3:
Error in nchar(nms) :
  could not find symbol "keepNA" in environment of the generic function

ADD REPLY • link 8.2 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Okay, that seems strange, because the same is working fine for me. Can you check if you are running an outdated version of the GenomicRanges package? My sessionInfo():

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3   IRanges_2.4.6       
[4] S4Vectors_0.8.11     BiocGenerics_0.16.1 

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 XVector_0.10.0

ADD REPLY • link 8.2 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Hi,

oops, may be I have to updated the packages. I will try.

FYI, I am gonna iterate this one by one. Do you know any easy way to do this? Instead I have to write function to handle this case. Thank you !!

ADD REPLY • link 8.2 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

I have updated my initial answer which now also explains how to apply a function to the entries of each chromosome.

You can also have a look at a recent discussion (looping through a GRangesList object) why calling a function for each entry in a GRanges object may be slow.

ADD REPLY • link 8.2 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

The split command does not work. It split the data but only took the first 4562 transcripts from each chromosome. I assume this is because the first chromosome had that many transcripts.

ADD REPLY • link 2.3 years ago jon.klonowski ▴ 40