I have a GRange object containing a list of genes. The genomic coordinate might be unique but the gene's id may repeat from a few to several times.
I would like to get a file containing all the overlapping genes, their intersection and information (start,end,strand and id) for each matching pair with ignore.strand=TRUE.
I almost got what I want but something is going wrong when I try to save the result:
> ovlp[1:30,] chr start end strand id 1 1 11873 14409 + DDX11L1 2 1 14361 29961 - WASH7P 3 1 34610 36081 - FAM138F 4 1 69090 70008 + OR4F5 5 1 134772 140566 - LOC729737 6 1 321083 321115 + DQ597235 7 1 321145 321207 + DQ599768 8 1 322036 326938 + LOC100133331 9 1 327545 328439 + LOC388312 10 1 323891 328581 + LOC100132062 11 1 367658 368597 + OR4F29 12 1 420205 421839 + BC036251 13 1 566092 566115 - JA429830 14 1 566134 566155 - JA429831 15 1 566239 566263 - JB137814 16 1 568843 568913 + M37726 17 1 621095 622034 - OR4F29 18 1 661138 665731 - LOC100133331 19 1 668417 668479 - DQ575786 20 1 668509 668541 - DQ599872 21 1 661138 670994 - LOC100133331 22 1 671823 671885 - DQ575786 23 1 671915 671947 - DQ599872 24 1 661138 679736 - LOC100133331 25 1 674239 679736 - AK310751 26 1 700244 714068 - LOC100288069 27 1 761585 762902 - LINC00115 28 1 762970 794826 + LOC643837 29 1 803450 812182 - FAM41C 30 1 846814 850328 + AK056486 ovlp <- ovlp[,c(1,6,7,3,4)] colnames(ovlp) <- c('chr','start','end','strand','id') ovlp$gene <- ovlp$id ovlp$spos <- ovlp$start ovlp$epos <- ovlp$end ovlp$str <- ovlp$strand gr <- makeGRangesFromDataFrame(ovlp, keep.extra.columns = TRUE) d <- disjoin(gr,with.revmap=TRUE,ignore.strand=TRUE) ovpairs <- findOverlapPairs(d,gr,ignore.strand=T) pint <- pintersect(ovpairs,ignore.strand=T) rvmp <- mcols(pint)$revmap mcols(pint) <- DataFrame(mcols(pint),id=extractList(mcols(gr)$id,rvmp), sp=extractList(mcols(gr)$spos,rvmp), ep=extractList(mcols(gr)$epos,rvmp), str=extractList(mcols(gr)$str,rvmp)) upint <- unique(pint) > upint GRanges object with 41310 ranges and 6 metadata columns: seqnames ranges strand | revmap hit <Rle> <IRanges> <Rle> | <IntegerList> <logical> [1] 1 [11873, 14360] * | 1 TRUE [2] 1 [14361, 14409] * | 1,2 TRUE [3] 1 [14410, 29961] * | 2 TRUE [4] 1 [34610, 36081] * | 3 TRUE [5] 1 [69090, 70008] * | 4 TRUE ... ... ... ... . ... ... [41306] 26 [14699, 14855] * | 15331 TRUE [41307] 26 [14856, 15888] * | 15331,15340 TRUE [41308] 26 [15959, 15997] * | 15341 TRUE [41309] 26 [15998, 16024] * | 15341,15342 TRUE [41310] 26 [16025, 16571] * | 15342 TRUE id sp ep str <FactorList> <IntegerList> <IntegerList> <FactorList> [1] DDX11L1 11873 14409 + [2] DDX11L1,WASH7P 11873,14361 14409,29961 +,- [3] WASH7P 14361 29961 - [4] FAM138F 34610 36081 - [5] OR4F5 69090 70008 + ... ... ... ... ... [41306] JA760602 7586 15888 - [41307] JA760602,cytochrome.b 7586,14856 15888,15888 -,+ [41308] tRNA.Pro 15959 16024 - [41309] tRNA.Pro,AF079515 15959,15998 16024,16571 -,+ [41310] AF079515 15998 16571 + ------- seqinfo: 25 sequences from an unspecified genome; no seqlengths df <- data.frame(upint,stringsAsFactors=F) write.table(df,"hg19.ucsc.pintersect.final.txt",sep=";",quote=F,row.names=F)
So, my currently output file:
seqnames;start;end;width;strand;revmap;hit;id;sp;ep;str 1;11873;14360;2488;*;1;TRUE;6982;11873;14409;2 1;14361;14409;49;*;1:2;TRUE;c(6982, 27293);c(11873, 14361);c(14409, 29961);c(2, 1) 1;14410;29961;15552;*;2;TRUE;27293;14361;29961;1 1;34610;36081;1472;*;3;TRUE;10012;34610;36081;1 1;69090;70008;919;*;4;TRUE;19518;69090;70008;2 1;134772;140566;5795;*;5;TRUE;15617;134772;140566;1 1;321083;321115;33;*;6;TRUE;8970;321083;321115;2 1;321145;321207;63;*;7;TRUE;9071;321145;321207;2 1;322036;323890;1855;*;8;TRUE;14842;322036;326938;2 1;323891;326938;3048;*;c(8, 10);TRUE;c(14842, 14819);c(322036, 323891);c(326938, 328581);c(2, 2) 1;326939;327544;606;*;10;TRUE;14819;323891;328581;2 1;327545;328439;895;*;9:10;TRUE;c(15336, 14819);c(327545, 323891);c(328439, 328581);c(2, 2)
As you can see, the revmap, id, sp, ep and str columns contain c() notations. Besides, the ids were gone and so did the strand signals (+,-).
My desirable output is exactly the same information I am seeing in the upint (Grange Object).
Is there any other way to export or save a Grange Object?
Thank you!
Caroline
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)
Matrix products: default
BLAS: /home/cgrisbach/programme/R-3.4.2/lib64/R/lib/libRblas.so
LAPACK: /home/cgrisbach/programme/R-3.4.2/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] dplyr_0.7.4 GenomicRanges_1.30.1 GenomeInfoDb_1.14.0
[4] IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 assertthat_0.2.0 bitops_1.0-6
[4] R6_2.2.2 magrittr_1.5 pillar_1.0.1
[7] rlang_0.1.6 zlibbioc_1.24.0 XVector_0.18.0
[10] bindrcpp_0.2 glue_1.2.0 RCurl_1.95-4.10
[13] compiler_3.4.2 pkgconfig_2.0.1 bindr_0.1
[16] tibble_1.4.1 GenomeInfoDbData_1.0.0
Hi Michael,
I would like to save this data into a csv document in order to use as a reference file for further manipulations outside the R environment. But it would be also interesting to have this data as a .RData file.
My most wanted final output would be like this:
Which is basically the data from my upint object with some convenient modifications:
How could I get this data?
Many thanks!
Caroline
Unfortunately, I got an error message:
ovlp <- read.csv("hg19.ucsc.final.txt",sep=",",header=T) ovlp <- ovlp[,c(1,6,7,3,4)] colnames(ovlp) <- c('chr','start','end','strand','id') ovlp$gene <- ovlp$id ovlp$spos <- ovlp$start ovlp$epos <- ovlp$end ovlp$str <- ovlp$strand gr <- makeGRangesFromDataFrame(ovlp, keep.extra.columns = TRUE) d <- disjoin(gr,with.revmap=TRUE,ignore.strand=TRUE) ovpairs <- findOverlapPairs(d,gr,ignore.strand=T) pint <- pintersect(ovpairs,ignore.strand=T) rvmp <- mcols(pint)$revmap urvmp <- unlist(rvmp) mcols(pint) <- DataFrame(mcols(pint)[togroup(PartitioningByEnd(rvmp))], id=mcols(gr)$id[urvmp], sp=mcols(gr)$spos[urvmp], ep=mcols(gr)$epos[urvmp], str=mcols(gr)$str[urvmp]) Error: subscript contains out-of-bounds indices
You'll need to provide a reproducible example.
I’ve just used the ovlp data frame I posted above (question). You may copy and paste its content to a text editor and save as “hg19.ucsc.final.txt”. My file was a csv, so I used read.csv. Let me know if this is enough or tell me how could I provide a reproducible example.
I updated my answer. Hopefully it gets you closer.