Question

Mapping of two different regions to one gene symbol

NS ▴ 60

@ns-7498

Last seen 5.7 years ago

United States

I am analyzing CNV data downloaded from the TCGA database (level 3) and want to convert it to a gene-level matrix.

The files look like this:

Sample    Chromosome    Start    End    Num_Probes    Segment_Mean
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    3218610    16796721    7253    -0.0198
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    16796742    17763566    312    -0.3615
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    17764034    221905958    105172    -0.0073

To convert the CNV data to gene-level data, I map genomic regions to genes. In some cases, two different regions with different 'Segment_Mean' values map to the same gene. In such cases, is it correct to use the average of the 'Segment_Mean' values for that gene?

Any thoughts?

I should mention that the data were obtained using SNP Array 6.0.

Thanks.

TCGA CNV • 1.7k views

ADD COMMENT • link updated 9.5 years ago by Marc Carlson ★ 7.2k • written 9.5 years ago by NS ▴ 60

score 0 · Answer 1 · 2015-05-22

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 8.3 years ago

United States

Your question is a little bit opaque to me. But it is pretty normal for a couple of different ranges to overlap with a single gene. Can you please clarify what it is you are trying to do?

Marc

ADD COMMENT • link 9.5 years ago Marc Carlson ★ 7.2k

I aim to make a gene-level matrix from the CNV data, that is, a [genes x samples] matrix in which each entry is the Segment_Mean value of the corresponding gene in the corresponding sample. Does that make sense?
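
To make the situation concrete, here is a minimal Python sketch of the overlap step, not a tested pipeline. The gene names and gene coordinates are hypothetical, the sample ID is shortened to "S1", and coordinates are assumed to be 1-based and inclusive. It only collects the candidate Segment_Mean values per gene and sample; how to combine them is exactly the open question.

```python
from collections import defaultdict

# Hypothetical gene coordinates (chromosome, start, end); not taken from the thread.
genes = {
    "GENE_A": ("1", 16_000_000, 16_500_000),  # lies inside a single segment
    "GENE_B": ("1", 17_700_000, 17_800_000),  # spans a segment boundary
}

# Segments in the layout of the TCGA file: (sample, chromosome, start, end, Segment_Mean).
segments = [
    ("S1", "1", 3218610, 16796721, -0.0198),
    ("S1", "1", 16796742, 17763566, -0.3615),
    ("S1", "1", 17764034, 221905958, -0.0073),
]

def overlaps_by_gene(genes, segments):
    """Return {gene: {sample: [(overlap_in_bp, segment_mean), ...]}}."""
    hits = defaultdict(lambda: defaultdict(list))
    for sample, chrom, s_start, s_end, seg_mean in segments:
        for gene, (g_chrom, g_start, g_end) in genes.items():
            if chrom != g_chrom:
                continue
            # Width of the intersection, assuming inclusive coordinates.
            overlap = min(s_end, g_end) - max(s_start, g_start) + 1
            if overlap > 0:
                hits[gene][sample].append((overlap, seg_mean))
    return hits

hits = overlaps_by_gene(genes, segments)
for gene in genes:
    print(gene, dict(hits[gene]))
```

GENE_A gets one candidate value for S1, while GENE_B gets two, which is the case I am asking about.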

ADD REPLY • link 9.5 years ago NS ▴ 60

Hi,

It all depends on what Segment_mean represents exactly.

Let's try to reformulate your problem in a more generic way: You have a set of regions of interest and a numeric value associated with each of them. Now you want to assign a single value to the set of regions as a whole by combining the individual values in some way. What's the correct way to combine the values? Well, it depends. Sometimes you want to take the max (e.g. if the i-th value is the height of the highest peak in the i-th region), or the weighted mean where the weights are the sizes of the regions (e.g. if the i-th value is the GC content in % of the i-th region), or the sum (e.g. if the i-th value is an absolute count, like the number of hits of some sort or the number of CpG islands in the i-th region), or the product (e.g. if the i-th value is an absolute count to which a mathematical transformation like exp() was applied), etc.
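
For illustration only, the four strategies above could be sketched in Python like this. The function name `combine` and the example numbers are hypothetical; the widths stand for the sizes of the regions. Which strategy is appropriate still depends on what the values mean.

```python
import math

def combine(values, widths, how):
    """Combine per-region values into a single value.

    values: one number per region
    widths: size of each region (only used by the weighted mean)
    how:    "max", "sum", "product" or "weighted_mean"
    """
    if how == "max":
        return max(values)
    if how == "sum":
        return sum(values)
    if how == "product":
        return math.prod(values)
    if how == "weighted_mean":
        return sum(v * w for v, w in zip(values, widths)) / sum(widths)
    raise ValueError(f"unknown strategy: {how}")

# Hypothetical example: two regions of size 1 and 3 with values 1.0 and 3.0.
values = [1.0, 3.0]
widths = [1, 3]
for how in ("max", "sum", "product", "weighted_mean"):
    print(how, combine(values, widths, how))
```

Note that the weighted mean pulls the result toward the value of the larger region, whereas a plain average would treat both regions equally.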

So I guess we would need to know a little bit more about the nature of Segment_mean in order to be able to give some advice.

H.

ADD REPLY • link 9.5 years ago Hervé Pagès 16k