Question: Mapping of two different regions to one gene symbol
4.7 years ago by
NS60
United States
NS60 wrote:

I am analyzing CNV data downloaded from the TCGA database (level 3) and want to convert it to a gene-level matrix.

The files look like this:

Sample    Chromosome    Start    End    Num_Probes    Segment_Mean
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    3218610    16796721    7253    -0.0198
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    16796742    17763566    312    -0.3615
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    17764034    221905958    105172    -0.0073

To convert the CNV data to gene-level data, I map genomic regions to genes. In some cases, two different regions with different 'Segment_Mean' values map to the same gene. In that case, is it correct to use the average of the 'Segment_Mean' values for that gene?

Any thoughts?

I should mention that the data were obtained using the SNP Array 6.0.

Thanks.

tcga cnv • 942 views
modified 4.7 years ago by Marc Carlson7.2k • written 4.7 years ago by NS60
Answer: Mapping of two different regions to one gene symbol
4.7 years ago by
Marc Carlson (7.2k)
United States
Marc Carlson wrote:

Your question is a little opaque to me, but it is pretty normal for a couple of different ranges to overlap a single gene. Can you please clarify what you are trying to do?

Marc

I want to build a gene-level matrix from the CNV data, that is, a [genes x samples] matrix in which each entry holds the Segment_Mean value of the corresponding gene in the corresponding sample. Does that make sense?
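To make the target structure concrete, here is a minimal Python sketch of such a [genes x samples] table, built from the three segments shown in the question. The gene names and coordinates (GENE_A, GENE_B) are hypothetical, invented only for illustration, and the default rule for a gene that spans several segments (an overlap-weighted mean) is an assumption, not an established answer: the rule is passed in as a function so it can be swapped once the meaning of Segment_Mean is settled.

```python
from collections import defaultdict

SAMPLE = "BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774"

# (sample, chromosome, start, end, segment_mean), copied from the question.
segments = [
    (SAMPLE, "1", 3218610, 16796721, -0.0198),
    (SAMPLE, "1", 16796742, 17763566, -0.3615),
    (SAMPLE, "1", 17764034, 221905958, -0.0073),
]

# Hypothetical gene coordinates, NOT real annotations.
genes = {
    "GENE_A": ("1", 16796000, 16798000),  # straddles the first two segments
    "GENE_B": ("1", 20000000, 20010000),  # lies inside the third segment
}


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two closed intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def weighted_mean(pairs):
    """Combine (overlap_bp, value) pairs into an overlap-weighted mean."""
    total = sum(weight for weight, _ in pairs)
    return sum(weight * value for weight, value in pairs) / total


def gene_level_matrix(segments, genes, combine=weighted_mean):
    """Return {gene: {sample: combined value}} for genes hit by a segment."""
    hits = defaultdict(list)
    for sample, chrom, start, end, value in segments:
        for gene, (g_chrom, g_start, g_end) in genes.items():
            if chrom != g_chrom:
                continue
            weight = overlap_bp(start, end, g_start, g_end)
            if weight > 0:
                hits[(gene, sample)].append((weight, value))

    matrix = defaultdict(dict)
    for (gene, sample), pairs in hits.items():
        matrix[gene][sample] = combine(pairs)
    return {gene: dict(row) for gene, row in matrix.items()}


matrix = gene_level_matrix(segments, genes)
for gene, row in matrix.items():
    print(gene, row)
```

The nested loop is fine for a toy example; for whole-genome data an interval-overlap tool would be the more realistic choice.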

Hi,

It all depends on what Segment_mean represents exactly.

Let's try to reformulate your problem in a more generic way: You have a set of regions of interest, and a numeric value associated with each of them. Now you want to assign a single value to the set of regions as a whole by combining the individual values in some way. What's the correct way to combine the values? Well, it depends. Sometimes you want to take the max (e.g. if the i-th value is the height of the highest peak in the i-th region), or the weighted mean where the weights are the sizes of the regions (e.g. if the i-th value is the GC content in % of the i-th region), or the sum (e.g. if the i-th value is an absolute count, like the number of hits of some sort or the number of CpG islands in the i-th region), or the product (e.g. if the i-th value is an absolute count to which a mathematical transformation like exp() was applied), etc.
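The four combination rules above can be sketched in a few lines of Python. The function names and the toy numbers are made up for illustration; nothing here says which rule is right for Segment_mean.

```python
import math


def combine_max(values):
    """E.g. each value is the height of the highest peak in its region."""
    return max(values)


def combine_weighted_mean(values, sizes):
    """E.g. each value is a GC percentage; weight by region size."""
    return sum(v * s for v, s in zip(values, sizes)) / sum(sizes)


def combine_sum(values):
    """E.g. each value is an absolute count (hits, CpG islands, ...)."""
    return sum(values)


def combine_product(values):
    """E.g. each value is exp(count): multiplying adds the counts."""
    return math.prod(values)


peak = combine_max([3.0, 7.5])
gc = combine_weighted_mean([40.0, 60.0], sizes=[100, 300])
hits = combine_sum([2, 5])
exp_hits = combine_product([math.exp(2), math.exp(5)])

print(peak, gc, hits, math.log(exp_hits))
```

Note how the product of the exp()-transformed counts equals exp() of their sum, which is why the product is the natural rule in that case.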

So I guess we would need to know a little bit more about the nature of Segment_mean in order to be able to give some advice.

H.