Question

Intersecting two big dataframe

0

Entering edit mode

AZ ▴ 30

@fereshteh-15803

Last seen 11 months ago

United Kingdom

Hi

I have human exome coordinates like this

Exon    Choromosome     Exone_Start Exone_End
uc001aaa.3  chr1          11873   12227
uc001aaa.3  chr1          12612   12721
uc001aaa.3  chr1          13220    14409

I also have coordinate of copy number data like this

Chromosome  Start   End           Segment_mean  Num_prob
1   64613          5707515          0.981113452 
1   5712940       5732322           0.981113452 
1   5732399      1638368zz          0.981113452

I want to calculate how many exon locate in the segment position, potentially the number of exon would fill the Num_prob column

I really don't know how to do that

Thank you for any help in advance

genomicrange intersect • 540 views

ADD COMMENT • link updated 4.0 years ago by James W. MacDonald 65k • written 4.0 years ago by AZ ▴ 30

score 3 · Accepted Answer · 2020-04-03

There is probably a super-sophisticated way to do this, and if so either Herve or Michael will be along soon to tell you. For now, here is my dumb approach. Do note that you probably want GRanges for this, rather than whatever you have now.

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = c("EXONID", "TXNAME"))
## You would convert your copy number data to a GRanges, rather than making a fake one like this
> segmentGR <- GRanges(rep("chr1", 5), IRanges(c(11532,12222,12228, 12561, 12705), c(12345, 12934,12593, 13721, 14457)), Segment_mean = runif(5)) 
> segmentGR
GRanges object with 5 ranges and 1 metadata column:
      seqnames      ranges strand |       Segment_mean
         <Rle>   <IRanges>  <Rle> |          <numeric>
  [1]     chr1 11532-12345      * |  0.293932071421295
  [2]     chr1 12222-12934      * |  0.826928496127948
  [3]     chr1 12228-12593      * |  0.351696432800964
  [4]     chr1 12561-13721      * |  0.251908763078973
  [5]     chr1 12705-14457      * | 0.0250664639752358
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> fo <- findOverlaps(segmentGR, ex)
> fo
Hits object with 16 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         2           1
   [3]         2           2
   [4]         2           3
   [5]         2           4
   ...       ...         ...
  [12]         5           3
  [13]         5           5
  [14]         5           6
  [15]         5       13964
  [16]         5       13965
  -------
  queryLength: 5 / subjectLength: 289969

> np <- vector("numeric", length(segmentGR))
> np
[1] 0 0 0 0 0
> tab <- table(queryHits(fo))
> np[as.numeric(names(tab))] <- tab
> np
[1] 1 4 0 5 6
> segmentGR$Num_prob <- np
> segmentGR
GRanges object with 5 ranges and 2 metadata columns:
      seqnames      ranges strand |       Segment_mean  Num_prob
         <Rle>   <IRanges>  <Rle> |          <numeric> <numeric>
  [1]     chr1 11532-12345      * |  0.293932071421295         1
  [2]     chr1 12222-12934      * |  0.826928496127948         4
  [3]     chr1 12228-12593      * |  0.351696432800964         0
  [4]     chr1 12561-13721      * |  0.251908763078973         5
  [5]     chr1 12705-14457      * | 0.0250664639752358         6
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths