I have a GRange object containing a list of genes. The genomic coordinate might be unique but the gene's id may repeat from a few to several times.
I would like to get a file containing all the overlapping genes, their intersection and information (start,end,strand and id) for each matching pair with ignore.strand=TRUE.
I almost got what I want but something is going wrong when I try to save the result:
> ovlp[1:30,]
chr start end strand id
1 1 11873 14409 + DDX11L1
2 1 14361 29961 - WASH7P
3 1 34610 36081 - FAM138F
4 1 69090 70008 + OR4F5
5 1 134772 140566 - LOC729737
6 1 321083 321115 + DQ597235
7 1 321145 321207 + DQ599768
8 1 322036 326938 + LOC100133331
9 1 327545 328439 + LOC388312
10 1 323891 328581 + LOC100132062
11 1 367658 368597 + OR4F29
12 1 420205 421839 + BC036251
13 1 566092 566115 - JA429830
14 1 566134 566155 - JA429831
15 1 566239 566263 - JB137814
16 1 568843 568913 + M37726
17 1 621095 622034 - OR4F29
18 1 661138 665731 - LOC100133331
19 1 668417 668479 - DQ575786
20 1 668509 668541 - DQ599872
21 1 661138 670994 - LOC100133331
22 1 671823 671885 - DQ575786
23 1 671915 671947 - DQ599872
24 1 661138 679736 - LOC100133331
25 1 674239 679736 - AK310751
26 1 700244 714068 - LOC100288069
27 1 761585 762902 - LINC00115
28 1 762970 794826 + LOC643837
29 1 803450 812182 - FAM41C
30 1 846814 850328 + AK056486
ovlp <- ovlp[,c(1,6,7,3,4)]
colnames(ovlp) <- c('chr','start','end','strand','id')
ovlp$gene <- ovlp$id
ovlp$spos <- ovlp$start
ovlp$epos <- ovlp$end
ovlp$str <- ovlp$strand
gr <- makeGRangesFromDataFrame(ovlp, keep.extra.columns = TRUE)
d <- disjoin(gr,with.revmap=TRUE,ignore.strand=TRUE)
ovpairs <- findOverlapPairs(d,gr,ignore.strand=T)
pint <- pintersect(ovpairs,ignore.strand=T)
rvmp <- mcols(pint)$revmap
mcols(pint) <- DataFrame(mcols(pint),id=extractList(mcols(gr)$id,rvmp),
sp=extractList(mcols(gr)$spos,rvmp),
ep=extractList(mcols(gr)$epos,rvmp),
str=extractList(mcols(gr)$str,rvmp))
upint <- unique(pint)
> upint
GRanges object with 41310 ranges and 6 metadata columns:
seqnames ranges strand | revmap hit
<Rle> <IRanges> <Rle> | <IntegerList> <logical>
[1] 1 [11873, 14360] * | 1 TRUE
[2] 1 [14361, 14409] * | 1,2 TRUE
[3] 1 [14410, 29961] * | 2 TRUE
[4] 1 [34610, 36081] * | 3 TRUE
[5] 1 [69090, 70008] * | 4 TRUE
... ... ... ... . ... ...
[41306] 26 [14699, 14855] * | 15331 TRUE
[41307] 26 [14856, 15888] * | 15331,15340 TRUE
[41308] 26 [15959, 15997] * | 15341 TRUE
[41309] 26 [15998, 16024] * | 15341,15342 TRUE
[41310] 26 [16025, 16571] * | 15342 TRUE
id sp ep str
<FactorList> <IntegerList> <IntegerList> <FactorList>
[1] DDX11L1 11873 14409 +
[2] DDX11L1,WASH7P 11873,14361 14409,29961 +,-
[3] WASH7P 14361 29961 -
[4] FAM138F 34610 36081 -
[5] OR4F5 69090 70008 +
... ... ... ... ...
[41306] JA760602 7586 15888 -
[41307] JA760602,cytochrome.b 7586,14856 15888,15888 -,+
[41308] tRNA.Pro 15959 16024 -
[41309] tRNA.Pro,AF079515 15959,15998 16024,16571 -,+
[41310] AF079515 15998 16571 +
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths
df <- data.frame(upint,stringsAsFactors=F)
write.table(df,"hg19.ucsc.pintersect.final.txt",sep=";",quote=F,row.names=F)
So, my currently output file:
seqnames;start;end;width;strand;revmap;hit;id;sp;ep;str 1;11873;14360;2488;*;1;TRUE;6982;11873;14409;2 1;14361;14409;49;*;1:2;TRUE;c(6982, 27293);c(11873, 14361);c(14409, 29961);c(2, 1) 1;14410;29961;15552;*;2;TRUE;27293;14361;29961;1 1;34610;36081;1472;*;3;TRUE;10012;34610;36081;1 1;69090;70008;919;*;4;TRUE;19518;69090;70008;2 1;134772;140566;5795;*;5;TRUE;15617;134772;140566;1 1;321083;321115;33;*;6;TRUE;8970;321083;321115;2 1;321145;321207;63;*;7;TRUE;9071;321145;321207;2 1;322036;323890;1855;*;8;TRUE;14842;322036;326938;2 1;323891;326938;3048;*;c(8, 10);TRUE;c(14842, 14819);c(322036, 323891);c(326938, 328581);c(2, 2) 1;326939;327544;606;*;10;TRUE;14819;323891;328581;2 1;327545;328439;895;*;9:10;TRUE;c(15336, 14819);c(327545, 323891);c(328439, 328581);c(2, 2)
As you can see, the revmap, id, sp, ep and str columns contain c() notations. Besides, the ids were gone and so did the strand signals (+,-).
My desirable output is exactly the same information I am seeing in the upint (Grange Object).
Is there any other way to export or save a Grange Object?
Thank you!
Caroline

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)
Matrix products: default
BLAS: /home/cgrisbach/programme/R-3.4.2/lib64/R/lib/libRblas.so
LAPACK: /home/cgrisbach/programme/R-3.4.2/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] dplyr_0.7.4 GenomicRanges_1.30.1 GenomeInfoDb_1.14.0
[4] IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 assertthat_0.2.0 bitops_1.0-6
[4] R6_2.2.2 magrittr_1.5 pillar_1.0.1
[7] rlang_0.1.6 zlibbioc_1.24.0 XVector_0.18.0
[10] bindrcpp_0.2 glue_1.2.0 RCurl_1.95-4.10
[13] compiler_3.4.2 pkgconfig_2.0.1 bindr_0.1
[16] tibble_1.4.1 GenomeInfoDbData_1.0.0
Hi Michael,
I would like to save this data into a csv document in order to use as a reference file for further manipulations outside the R environment. But it would be also interesting to have this data as a .RData file.
My most wanted final output would be like this:
Which is basically the data from my upint object with some convenient modifications:
> upint[c(2,10,12,21,22,23,24,25,26,28,30,32)] > upint[c(2,10,12,21,22)] #first 5 elements GRanges object with 5 ranges and 4 metadata columns: seqnames ranges strand | id <Rle> <IRanges> <Rle> | <FactorList> [1] 1 [ 14361, 14409] * | DDX11L1,WASH7P [2] 1 [323891, 326938] * | LOC100133331,LOC100132062 [3] 1 [327545, 328439] * | LOC388312,LOC100132062 [4] 1 [661138, 665731] * | LOC100133331,LOC100133331,LOC100133331 [5] 1 [665732, 668416] * | LOC100133331,LOC100133331 sp ep str <IntegerList> <IntegerList> <FactorList> [1] 11873,14361 14409,29961 +,- [2] 322036,323891 326938,328581 +,+ [3] 327545,323891 328439,328581 +,+ [4] 661138,661138,661138 665731,670994,679736 -,-,- [5] 661138,661138 670994,679736 -,- ------- seqinfo: 25 sequences from an unspecified genome; no seqlengthsHow could I get this data?
Many thanks!
Caroline
Unfortunately, I got an error message:
ovlp <- read.csv("hg19.ucsc.final.txt",sep=",",header=T) ovlp <- ovlp[,c(1,6,7,3,4)] colnames(ovlp) <- c('chr','start','end','strand','id') ovlp$gene <- ovlp$id ovlp$spos <- ovlp$start ovlp$epos <- ovlp$end ovlp$str <- ovlp$strand gr <- makeGRangesFromDataFrame(ovlp, keep.extra.columns = TRUE) d <- disjoin(gr,with.revmap=TRUE,ignore.strand=TRUE) ovpairs <- findOverlapPairs(d,gr,ignore.strand=T) pint <- pintersect(ovpairs,ignore.strand=T) rvmp <- mcols(pint)$revmap urvmp <- unlist(rvmp) mcols(pint) <- DataFrame(mcols(pint)[togroup(PartitioningByEnd(rvmp))], id=mcols(gr)$id[urvmp], sp=mcols(gr)$spos[urvmp], ep=mcols(gr)$epos[urvmp], str=mcols(gr)$str[urvmp]) Error: subscript contains out-of-bounds indicesYou'll need to provide a reproducible example.
I’ve just used the ovlp data frame I posted above (question). You may copy and paste its content to a text editor and save as “hg19.ucsc.final.txt”. My file was a csv, so I used read.csv. Let me know if this is enough or tell me how could I provide a reproducible example.
I updated my answer. Hopefully it gets you closer.