Question

Finding genes with loss of Heterozygosity

0

Entering edit mode

AZ ▴ 20

@fereshteh-15803

Last seen 2.8 years ago

United Kingdom

Hello everyone

I hope you are safe and good

I have genomic ranges below in which minor allele = 0 means loss of Heterozygosity. (LOH)

Chr start         end
1   10583       863511
1   12841835    12854479
1   17231299    17232877
1   120380739   142540462
1   142540484   142684781
1   142685823   143462598
1   143466928   143533092
1   143534046   144064964
1   145079160   145110146
1   145121086   145400508

I have another data frame like below where I have position of mutations in each genes

> head(mut[,c(2:7)])
  Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2
1       chr1       89616151     89616151                -                 T
2       chr6       51909815     51909815                A                 -
3      chr16       20556547     20556547                -                 G
4      chr17       66992092     66992092                -                 A
5      chr20       13251339     13251339                -                 T
6      chr20       35929770     35929771               TC                 -
  Hugo_Symbol
1        GBP7
2       PKHD1
3      ACSM2B
4       ABCA9
5        ISM1
6      MANBAL
>

I want to know which genes also have los of Heterozygosity based on the first data frame

Thank you for any help

genomicrange • 1.6k views

ADD COMMENT • link updated 5.8 years ago by Kevin Blighe ★ 4.0k • written 5.8 years ago by AZ ▴ 20

score 1 · Answer 1 · 2020-04-19

1

Entering edit mode

Kevin Blighe ★ 4.0k

@kevin

Last seen 5 weeks ago

The Cave, 181 Longwood Avenue, Boston, …

Hello again Fereshteh,

This can be achieved via findOverlaps(), which is part of the GenomicRanges package. Here is a reproducible example:

1, load package

require(GenomicRanges)

2, create gr1 object

gr1 <- makeGRangesFromDataFrame(
    data.frame(
        chr = rep('chr1',3),
        start = c(1,100,500),
        end = c(50,150,1500),
        feature = c('feat1', 'feat2', 'feat3')),
        keep.extra.columns = TRUE)

gr1

GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |     feature
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1      1-50      * |       feat1
  [2]     chr1   100-150      * |       feat2
  [3]     chr1  500-1500      * |       feat3
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

3, create gr2 object

gr2 <- makeGRangesFromDataFrame(
    data.frame(
        chr = rep('chr1',3),
        start = c(100,500,1001),
        end = c(200,1000,1050),
        gene = c('ENSG2', 'ENSG3', 'ENSG4')),
        keep.extra.columns = TRUE)

gr2

GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |        gene
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1   100-200      * |       ENSG2
  [2]     chr1  500-1000      * |       ENSG3
  [3]     chr1 1001-1050      * |       ENSG4
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

4(a), find any features in gr1 that overlap any in gr2

gr1[queryHits(findOverlaps(gr1, gr2, type="any")),] 

GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |     feature
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1   100-150      * |       feat2
  [2]     chr1  500-1500      * |       feat3
  [3]     chr1  500-1500      * |       feat3
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

4(b), same as 4(a), but add on [to the output] those features in gr2 that are overlapped

data.frame(
  gr1[queryHits(findOverlaps(gr1, gr2, type="any")),],
  gr2[queryHits(findOverlaps(gr2, gr1, type="any")),])

  seqnames start  end width strand feature seqnames.1 start.1 end.1 width.1
1     chr1   100  150    51      *   feat2       chr1     100   200     101
2     chr1   500 1500  1001      *   feat3       chr1     500  1000     501
3     chr1   500 1500  1001      *   feat3       chr1    1001  1050      50
  strand.1  gene
1        * ENSG2
2        * ENSG3
3        * ENSG4

This should make this task very easy for you.

By the way, in your mut, object, I would change 'Chromosome', 'Start_Position', and 'End_Position' to 'chromosome', 'start' and 'end'; otherwise, GenomicRanges will not automatically recognise these columns.

Kevin

ADD COMMENT • link 5.8 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Thank you

Sorry one question: If gr1 comes from my first data frame (which containing range with minor allele =0) then what I should put in feature column of that?

For gr2 if I am not wrong, gene column contains gene symbols I am proving in the second data frame with mutations

I am wondering because when creating my gr objects I got this

> gr1[queryHits(findOverlaps(gr1, gr2, type="any")),] 
Error: vector memory exhausted (limit reached?)
>

ADD REPLY • link 5.8 years ago AZ ▴ 20

0

Entering edit mode

Sorry one question: If gr1 comes from my first data frame (which containing range with minor allele =0) then what I should put in feature column of that?

My code is just providing an example. For your data, you just need 'chr', 'start', 'end'.

For gr2 if I am not wrong, gene column contains gene symbols I am proving in the second data frame with mutations

Again, my code is just providing an example. You can have any amount of extra columns (other than 'chromosome', 'start', and 'end') that contain any information.

----------

The 'vector memory' error means that you do not have enough RAM. On which computer are you running the code and how much RAM does it have? Also, which version of R, and what ids the size of your gr1 and gr2?

ADD REPLY • link 5.8 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Thank you

2.7 GHz Intel Core i5

> version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          6.3                         
year           2020                        
month          02                          
day            29                          
svn rev        77875                       
language       R                           
version.string R version 3.6.3 (2020-02-29)
nickname       Holding the Windsock

My data is for 15 patients

ADD REPLY • link 5.8 years ago AZ ▴ 20

0

Entering edit mode

Thank you, but, how much RAM does your computer have?; how many regions are in your input files?; which commands are you running?

ADD REPLY • link 5.8 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

RAM is 8 GB 1867 MHz DDR3

Sorry I don't know how many regions I have

I have followed your commands

Sorry, even by googling I can not figure out why I have to overlap regions of genome with minor allele = 0 with regions of genomes with mutation

Actually in my understanding, LOH is losing one parental allele

But why we need to find a gene with both copy number loss and mutation

To be honest I can not understand the intuition and logic behind this story

ADD REPLY • link 5.8 years ago AZ ▴ 20

0

Entering edit mode

Hello again. It seems that your questions are now more general. This website is designed for questions about technical issues relating to Bioconductor packages. I wish you the best in your project.

ADD REPLY • link 5.8 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Sorry @Kevin Blighe

I have used your instruction like below

> head(cn)
             sample chr     start       end major_cn minor_cn total_cn normal_cn   ploidy
1 LP6008334-DNA_B02   1     10583   3457311        2        0        2         2 1.863911
2 LP6008334-DNA_B02   1   3458681 143542402        1        0        1         2 1.863911
3 LP6008334-DNA_B02   1 143542478 144839407        2        0        2         2 1.863911
4 LP6008334-DNA_B02   1 144839411 144951386        2        1        3         2 1.863911
5 LP6008334-DNA_B02   1 144951543 145380399        2        0        2         2 1.863911
6 LP6008334-DNA_B02   1 145380949 148192510        2        1        3         2 1.863911
  purity
1   0.23
2   0.23
3   0.23
4   0.23
5   0.23
6   0.23
> head(snv)
  chr   start     end REF ALT ref_counts var_counts total_counts N_REF_COUNT N_ALT_COUNT
1   1  782112  782112   G   A         45         21           66          46           0
2   1 1026918 1026918   C   T         30         20           50          21           0
3   1 1133283 1133283   C   T         47         28           75          41           0
4   1 1431511 1431511   G   A         40         20           60          33           0
5   1 1742395 1742395   C   T         40         26           66          32           0
6   1 1864994 1864994   G   A         43         28           71          51           0
        VAF N_VAF       CCF
1 0.3181818     0 0.6363636
2 0.4000000     0 0.8000000
3 0.3733333     0 0.7466667
4 0.3333333     0 0.6666667
5 0.3939394     0 0.7878788
6 0.3943662     0 0.7887324
>

gr1 <- makeGRangesFromDataFrame(as.data.frame(cn),keep.extra.columns = TRUE)

gr2 <- makeGRangesFromDataFrame(as.data.frame(snv),keep.extra.columns = TRUE)

> merged=data.frame(
+     gr1[queryHits(findOverlaps(gr1, gr2, type="any")),],
+     gr2[queryHits(findOverlaps(gr2, gr1, type="any")),])

> head(merged)
  seqnames start     end   width strand            sample major_cn minor_cn total_cn normal_cn
1        1 10583 3457311 3446729      * LP6008334-DNA_B02        2        0        2         2
2        1 10583 3457311 3446729      * LP6008334-DNA_B02        2        0        2         2
3        1 10583 3457311 3446729      * LP6008334-DNA_B02        2        0        2         2
4        1 10583 3457311 3446729      * LP6008334-DNA_B02        2        0        2         2
5        1 10583 3457311 3446729      * LP6008334-DNA_B02        2        0        2         2
6        1 10583 3457311 3446729      * LP6008334-DNA_B02        2        0        2         2
    ploidy purity seqnames.1 start.1  end.1 width.1 strand.1 REF ALT ref_counts var_counts
1 1.863911   0.23          1  782112 782112       1        *   G   A         45         21
2 1.863911   0.23          1  782112 782112       1        *   G   A         45         21
3 1.863911   0.23          1  782112 782112       1        *   G   A         45         21
4 1.863911   0.23          1  782112 782112       1        *   G   A         45         21
5 1.863911   0.23          1  782112 782112       1        *   G   A         45         21
6 1.863911   0.23          1  782112 782112       1        *   G   A         45         21
  total_counts N_REF_COUNT N_ALT_COUNT       VAF N_VAF       CCF
1           66          46           0 0.3181818     0 0.6363636
2           66          46           0 0.3181818     0 0.6363636
3           66          46           0 0.3181818     0 0.6363636
4           66          46           0 0.3181818     0 0.6363636
5           66          46           0 0.3181818     0 0.6363636
6           66          46           0 0.3181818     0 0.6363636
>

I need the output for clonality analysis but I am seeing likely I have redundancy and reputation after merging copy number and variant data

Is it normal or I am doing something wrong?

ADD REPLY • link 5.7 years ago AZ ▴ 20