Intersecting two big dataframe
1
0
Entering edit mode
AZ ▴ 30
@fereshteh-15803
Last seen 11 months ago
United Kingdom

Hi

I have human exome coordinates like this

Exon    Choromosome     Exone_Start Exone_End
uc001aaa.3  chr1          11873   12227
uc001aaa.3  chr1          12612   12721
uc001aaa.3  chr1          13220    14409

I also have coordinate of copy number data like this

Chromosome  Start   End           Segment_mean  Num_prob
1   64613          5707515          0.981113452 
1   5712940       5732322           0.981113452 
1   5732399      1638368zz          0.981113452

I want to calculate how many exon locate in the segment position, potentially the number of exon would fill the Num_prob column

I really don't know how to do that

Thank you for any help in advance

genomicrange intersect • 540 views
ADD COMMENT
3
Entering edit mode
@james-w-macdonald-5106
Last seen 13 minutes ago
United States

There is probably a super-sophisticated way to do this, and if so either Herve or Michael will be along soon to tell you. For now, here is my dumb approach. Do note that you probably want GRanges for this, rather than whatever you have now.

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = c("EXONID", "TXNAME"))
## You would convert your copy number data to a GRanges, rather than making a fake one like this
> segmentGR <- GRanges(rep("chr1", 5), IRanges(c(11532,12222,12228, 12561, 12705), c(12345, 12934,12593, 13721, 14457)), Segment_mean = runif(5)) 
> segmentGR
GRanges object with 5 ranges and 1 metadata column:
      seqnames      ranges strand |       Segment_mean
         <Rle>   <IRanges>  <Rle> |          <numeric>
  [1]     chr1 11532-12345      * |  0.293932071421295
  [2]     chr1 12222-12934      * |  0.826928496127948
  [3]     chr1 12228-12593      * |  0.351696432800964
  [4]     chr1 12561-13721      * |  0.251908763078973
  [5]     chr1 12705-14457      * | 0.0250664639752358
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> fo <- findOverlaps(segmentGR, ex)
> fo
Hits object with 16 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         2           1
   [3]         2           2
   [4]         2           3
   [5]         2           4
   ...       ...         ...
  [12]         5           3
  [13]         5           5
  [14]         5           6
  [15]         5       13964
  [16]         5       13965
  -------
  queryLength: 5 / subjectLength: 289969

> np <- vector("numeric", length(segmentGR))
> np
[1] 0 0 0 0 0
> tab <- table(queryHits(fo))
> np[as.numeric(names(tab))] <- tab
> np
[1] 1 4 0 5 6
> segmentGR$Num_prob <- np
> segmentGR
GRanges object with 5 ranges and 2 metadata columns:
      seqnames      ranges strand |       Segment_mean  Num_prob
         <Rle>   <IRanges>  <Rle> |          <numeric> <numeric>
  [1]     chr1 11532-12345      * |  0.293932071421295         1
  [2]     chr1 12222-12934      * |  0.826928496127948         4
  [3]     chr1 12228-12593      * |  0.351696432800964         0
  [4]     chr1 12561-13721      * |  0.251908763078973         5
  [5]     chr1 12705-14457      * | 0.0250664639752358         6
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT
2
Entering edit mode

Not sure how well table will work for a really big vector, so a more performant option might be to use an Rle instead.

> rle <- Rle(queryHits(fo))
> np <- vector("numeric", length(segmentGR))
> np[runValue(rle)] <- runLength(rle)
> segmentGR$Num_prob <- np
> segmentGR
GRanges object with 5 ranges and 2 metadata columns:
      seqnames      ranges strand |       Segment_mean  Num_prob
         <Rle>   <IRanges>  <Rle> |          <numeric> <numeric>
  [1]     chr1 11532-12345      * |  0.224100045626983         1
  [2]     chr1 12222-12934      * |   0.12095244275406         4
  [3]     chr1 12228-12593      * |  0.546647080918774         0
  [4]     chr1 12561-13721      * |  0.526769673451781         5
  [5]     chr1 12705-14457      * | 0.0679673755075783         6
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD REPLY

Login before adding your answer.

Traffic: 794 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6