GenomicRanges: identify overlapping regions, get intersection between each overlapping pair and add gene coordinates (start, end, strand and ids)
1
1
Entering edit mode
@caroline-grisbach-meissner-14798
Last seen 6.8 years ago
Brazil/Curitiba/UFPR

I have a GRange object containing a list of genes. The genomic coordinate might be unique but the gene's id may repeat from a few to several times.

I would like to get a file containing all the overlapping genes, their intersection and information (start,end,strand and id) for each matching pair with ignore.strand=TRUE.

I almost got what I want but something is going wrong when I try to save the result:

> ovlp[1:30,]
   chr  start    end strand           id
1    1  11873  14409      +      DDX11L1
2    1  14361  29961      -       WASH7P
3    1  34610  36081      -      FAM138F
4    1  69090  70008      +        OR4F5
5    1 134772 140566      -    LOC729737
6    1 321083 321115      +     DQ597235
7    1 321145 321207      +     DQ599768
8    1 322036 326938      + LOC100133331
9    1 327545 328439      +    LOC388312
10   1 323891 328581      + LOC100132062
11   1 367658 368597      +       OR4F29
12   1 420205 421839      +     BC036251
13   1 566092 566115      -     JA429830
14   1 566134 566155      -     JA429831
15   1 566239 566263      -     JB137814
16   1 568843 568913      +       M37726
17   1 621095 622034      -       OR4F29
18   1 661138 665731      - LOC100133331
19   1 668417 668479      -     DQ575786
20   1 668509 668541      -     DQ599872
21   1 661138 670994      - LOC100133331
22   1 671823 671885      -     DQ575786
23   1 671915 671947      -     DQ599872
24   1 661138 679736      - LOC100133331
25   1 674239 679736      -     AK310751
26   1 700244 714068      - LOC100288069
27   1 761585 762902      -    LINC00115
28   1 762970 794826      +    LOC643837
29   1 803450 812182      -       FAM41C
30   1 846814 850328      +     AK056486

ovlp <- ovlp[,c(1,6,7,3,4)]
colnames(ovlp) <- c('chr','start','end','strand','id')
ovlp$gene <- ovlp$id
ovlp$spos <- ovlp$start
ovlp$epos <- ovlp$end
ovlp$str <- ovlp$strand

gr <- makeGRangesFromDataFrame(ovlp, keep.extra.columns = TRUE)
d <- disjoin(gr,with.revmap=TRUE,ignore.strand=TRUE)
ovpairs <- findOverlapPairs(d,gr,ignore.strand=T)
pint <- pintersect(ovpairs,ignore.strand=T)

rvmp <- mcols(pint)$revmap
mcols(pint) <- DataFrame(mcols(pint),id=extractList(mcols(gr)$id,rvmp),
               sp=extractList(mcols(gr)$spos,rvmp),
               ep=extractList(mcols(gr)$epos,rvmp),
               str=extractList(mcols(gr)$str,rvmp))
upint <- unique(pint)                                

> upint
GRanges object with 41310 ranges and 6 metadata columns:
          seqnames         ranges strand |        revmap       hit
             <Rle>      <IRanges>  <Rle> | <IntegerList> <logical>
      [1]        1 [11873, 14360]      * |             1      TRUE
      [2]        1 [14361, 14409]      * |           1,2      TRUE
      [3]        1 [14410, 29961]      * |             2      TRUE
      [4]        1 [34610, 36081]      * |             3      TRUE
      [5]        1 [69090, 70008]      * |             4      TRUE
      ...      ...            ...    ... .           ...       ...
  [41306]       26 [14699, 14855]      * |         15331      TRUE
  [41307]       26 [14856, 15888]      * |   15331,15340      TRUE
  [41308]       26 [15959, 15997]      * |         15341      TRUE
  [41309]       26 [15998, 16024]      * |   15341,15342      TRUE
  [41310]       26 [16025, 16571]      * |         15342      TRUE
                             id            sp            ep          str
                   <FactorList> <IntegerList> <IntegerList> <FactorList>
      [1]               DDX11L1         11873         14409            +
      [2]        DDX11L1,WASH7P   11873,14361   14409,29961          +,-
      [3]                WASH7P         14361         29961            -
      [4]               FAM138F         34610         36081            -
      [5]                 OR4F5         69090         70008            +
      ...                   ...           ...           ...          ...
  [41306]              JA760602          7586         15888            -
  [41307] JA760602,cytochrome.b    7586,14856   15888,15888          -,+
  [41308]              tRNA.Pro         15959         16024            -
  [41309]     tRNA.Pro,AF079515   15959,15998   16024,16571          -,+
  [41310]              AF079515         15998         16571            +
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

df <- data.frame(upint,stringsAsFactors=F)

write.table(df,"hg19.ucsc.pintersect.final.txt",sep=";",quote=F,row.names=F)

So, my currently output file:

seqnames;start;end;width;strand;revmap;hit;id;sp;ep;str
1;11873;14360;2488;*;1;TRUE;6982;11873;14409;2
1;14361;14409;49;*;1:2;TRUE;c(6982, 27293);c(11873, 14361);c(14409, 29961);c(2, 1)
1;14410;29961;15552;*;2;TRUE;27293;14361;29961;1
1;34610;36081;1472;*;3;TRUE;10012;34610;36081;1
1;69090;70008;919;*;4;TRUE;19518;69090;70008;2
1;134772;140566;5795;*;5;TRUE;15617;134772;140566;1
1;321083;321115;33;*;6;TRUE;8970;321083;321115;2
1;321145;321207;63;*;7;TRUE;9071;321145;321207;2
1;322036;323890;1855;*;8;TRUE;14842;322036;326938;2
1;323891;326938;3048;*;c(8, 10);TRUE;c(14842, 14819);c(322036, 323891);c(326938, 328581);c(2, 2)
1;326939;327544;606;*;10;TRUE;14819;323891;328581;2
1;327545;328439;895;*;9:10;TRUE;c(15336, 14819);c(327545, 323891);c(328439, 328581);c(2, 2)

As you can see, the revmap, id, sp, ep and str columns contain c() notations. Besides, the ids were gone and so did the strand signals (+,-).

My desirable output is exactly the same information I am seeing in the upint (Grange Object).

Is there any other way to export or save a Grange Object?

Thank you!

Caroline

 

 

 

genomicranges overlap intersect findoverlaps • 5.7k views
0
Entering edit mode

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

Matrix products: default
BLAS: /home/cgrisbach/programme/R-3.4.2/lib64/R/lib/libRblas.so
LAPACK: /home/cgrisbach/programme/R-3.4.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] dplyr_0.7.4          GenomicRanges_1.30.1 GenomeInfoDb_1.14.0
[4] IRanges_2.12.0       S4Vectors_0.16.0     BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14           assertthat_0.2.0       bitops_1.0-6
 [4] R6_2.2.2               magrittr_1.5           pillar_1.0.1
 [7] rlang_0.1.6            zlibbioc_1.24.0        XVector_0.18.0
[10] bindrcpp_0.2           glue_1.2.0             RCurl_1.95-4.10
[13] compiler_3.4.2         pkgconfig_2.0.1        bindr_0.1
[16] tibble_1.4.1           GenomeInfoDbData_1.0.0

0
Entering edit mode

Hi Michael,

I would like to save this data into a csv document in order to use as a reference file for further manipulations outside the R environment. But it would be also interesting to have this data as a .RData file.

My most wanted final output would be like this:

seqnames start end width id.1 sp.1 ep.1 str.1 id.2 sp.2 ep.2 str.2 id.3 sp.3 ep.3 str.3 
1 14361  14409    49 DDX11L1      11873  14409  + WASH7P       14361  29961   -  
1 323891 326938 3048 LOC100133331 322036 326938 + LOC100132062 323891 328581  +  
1 327545 328439  895 LOC388312    327545 328439 + LOC100132062 323891 328581  +  
1 661138 665731 4594 LOC100133331 661138 665731 - LOC100133331 661138 670994  - LOC100133331 661138 679736 -
1 665732 668416 2685 LOC100133331 661138 670994 - LOC100133331 661138 679736  -
1 668417 668479   63 DQ575786     668417 668479 - LOC100133331 661138 670994  - LOC100133331 661138 679736  -
1 668480 668508  29  LOC100133331 661138 670994 - LOC100133331 661138 679736  -
1 668509 668541  33  DQ599872     668509 668541 - LOC100133331 661138 670994  - LOC100133331 661138 679736  - 
1 668542 670994 2453 LOC100133331 661138 670994 - LOC100133331 661138 679736  -  

Which is basically the data from my upint object with some convenient modifications:

> upint[c(2,10,12,21,22,23,24,25,26,28,30,32)]

> upint[c(2,10,12,21,22)] #first 5 elements
GRanges object with 5 ranges and 4 metadata columns:
      seqnames           ranges strand |                                     id
         <Rle>        <IRanges>  <Rle> |                           <FactorList>
  [1]        1 [ 14361,  14409]      * |                         DDX11L1,WASH7P
  [2]        1 [323891, 326938]      * |              LOC100133331,LOC100132062
  [3]        1 [327545, 328439]      * |                 LOC388312,LOC100132062
  [4]        1 [661138, 665731]      * | LOC100133331,LOC100133331,LOC100133331
  [5]        1 [665732, 668416]      * |              LOC100133331,LOC100133331
                        sp                   ep          str
             <IntegerList>        <IntegerList> <FactorList>
  [1]          11873,14361          14409,29961          +,-
  [2]        322036,323891        326938,328581          +,+
  [3]        327545,323891        328439,328581          +,+
  [4] 661138,661138,661138 665731,670994,679736        -,-,-
  [5]        661138,661138        670994,679736          -,-
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

How could I get this data?

Many thanks!

Caroline

0
Entering edit mode

Unfortunately, I got an error message:

ovlp <- read.csv("hg19.ucsc.final.txt",sep=",",header=T)
ovlp <- ovlp[,c(1,6,7,3,4)]
colnames(ovlp) <- c('chr','start','end','strand','id')
ovlp$gene <- ovlp$id
ovlp$spos <- ovlp$start 
ovlp$epos <- ovlp$end
ovlp$str <- ovlp$strand

gr <- makeGRangesFromDataFrame(ovlp, keep.extra.columns = TRUE)

d <- disjoin(gr,with.revmap=TRUE,ignore.strand=TRUE)
ovpairs <- findOverlapPairs(d,gr,ignore.strand=T)
pint <- pintersect(ovpairs,ignore.strand=T)

rvmp <- mcols(pint)$revmap
urvmp <- unlist(rvmp)
mcols(pint) <- DataFrame(mcols(pint)[togroup(PartitioningByEnd(rvmp))],
                          id=mcols(gr)$id[urvmp],
                          sp=mcols(gr)$spos[urvmp],
                          ep=mcols(gr)$epos[urvmp],
                          str=mcols(gr)$str[urvmp])

Error: subscript contains out-of-bounds indices
0
Entering edit mode

You'll need to provide a reproducible example.

ADD REPLY
0
Entering edit mode

I’ve just used the ovlp data frame I posted above (question). You may copy and paste its content to a text editor and save as “hg19.ucsc.final.txt”. My file was a csv, so I used read.csv. Let me know if this is enough  or tell me how could I provide a reproducible example.

0
Entering edit mode

I updated my answer. Hopefully it gets you closer.

ADD REPLY
1
Entering edit mode
@michael-lawrence-3846
Last seen 2.9 years ago
United States

This might get you closer, but it's still not clear what you want exactly:

ovlp <- read.table("ovlp.txt", header=TRUE)
## Note: no need for renaming nor reordering of columns
gr <- makeGRangesFromDataFrame(ovlp, keep.extra.columns = TRUE)
hits <- findOverlaps(gr, ignore.strand=TRUE,
                     drop.self=TRUE, drop.redundant=TRUE)
ovpairs <- Pairs(gr, gr, hits=hits)
pint <- pintersect(ovpairs, ignore.strand=TRUE)
mcols(pint) <- data.frame(first(ovpairs), second(ovpairs))

 

ADD COMMENT
0
Entering edit mode

This is exactly what I was looking for!

Thanks a lot!

 

Login before adding your answer.

Traffic: 910 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6