Mapping of two different regions to one gene symbol
NS ▴ 60

I am analyzing CNV data downloaded from the TCGA database (level 3) and want to convert it to a gene-level matrix.

The files look like this:

Sample    Chromosome    Start    End    Num_Probes    Segment_Mean
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    3218610    16796721    7253    -0.0198
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    16796742    17763566    312    -0.3615
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    17764034    221905958    105172    -0.0073

To convert the CNV data to gene-level data, I map genomic regions to genes. In some cases, two different regions with different 'Segment_Mean' values map to the same gene. In that case, is it correct to use the average of the 'Segment_Mean' values for that gene?
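To make the situation concrete, here is a minimal Python sketch of a gene that straddles a segment boundary and therefore picks up two 'Segment_Mean' values. The segments are the three example rows above; the gene coordinates are hypothetical, chosen only so that the gene spans a breakpoint, and treating coordinates as 1-based and inclusive is an assumption.

```python
# Segments copied from the example rows above: (start, end, segment_mean), chromosome 1.
segments = [
    (3218610, 16796721, -0.0198),
    (16796742, 17763566, -0.3615),
    (17764034, 221905958, -0.0073),
]

# Hypothetical gene coordinates (not a real annotation), chosen to straddle a breakpoint.
GENE_START, GENE_END = 17_700_000, 17_800_000


def overlapping_segments(gene_start, gene_end, segs):
    """Return (segment_mean, overlap_bp) for every segment that overlaps the gene.

    Coordinates are treated as 1-based and inclusive (an assumption).
    """
    hits = []
    for start, end, seg_mean in segs:
        overlap = min(end, gene_end) - max(start, gene_start) + 1
        if overlap > 0:
            hits.append((seg_mean, overlap))
    return hits


hits = overlapping_segments(GENE_START, GENE_END, segments)
for seg_mean, overlap in hits:
    print(f"Segment_Mean={seg_mean:+.4f} overlaps the gene by {overlap} bp")
```

The overlap lengths are kept alongside the values because some ways of combining them (for example a length-weighted mean) need them, while a plain average ignores them.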

Any thoughts?

I should mention that the data were obtained using SNP Array 6.0.


Marc Carlson ★ 7.2k

Your question is a little opaque to me. It is quite normal for several different ranges to overlap a single gene. Can you please clarify what you are trying to do?




I aim to build a gene-level matrix from the CNV data, i.e. a [genes x samples] matrix in which each entry is the Segment_Mean value of the corresponding gene in the corresponding sample. Does that make sense?



It all depends on what Segment_Mean represents exactly.

Let's try to reformulate your problem in a more generic way: you have a set of regions of interest, with a numeric value associated with each of them. Now you want to assign a single value to the set of regions as a whole by combining the individual values in some way. What's the correct way to combine the values? Well, it depends. Sometimes you want to take the max (e.g. if the i-th value is the height of the highest peak in the i-th region), or the weighted mean where the weights are the sizes of the regions (e.g. if the i-th value is the GC content in % of the i-th region), or the sum (e.g. if the i-th value is an absolute count, like the number of hits of some sort or the number of CpG islands in the i-th region), or the product (e.g. if the i-th value is an absolute count to which a mathematical transformation like exp() was applied), etc.
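As a rough illustration of these options, here is a small Python sketch with one function per combination rule. The function names and the toy numbers are made up for the example, and nothing here says which rule is appropriate for Segment_Mean.

```python
import math


def combine_max(values):
    """E.g. when each value is the height of the highest peak in its region."""
    return max(values)


def combine_weighted_mean(values, widths):
    """E.g. when each value is a per-region average such as GC content;
    the weights are the region sizes."""
    total = sum(widths)
    return sum(v * w for v, w in zip(values, widths)) / total


def combine_sum(values):
    """E.g. when each value is an absolute count per region."""
    return sum(values)


def combine_product(values):
    """E.g. when each value is exp(count): the product is exp(sum of counts)."""
    return math.prod(values)


# Toy input (hypothetical): two regions of 100 bp and 300 bp.
values = [0.40, 0.60]
widths = [100, 300]

print("max:          ", combine_max(values))
print("weighted mean:", combine_weighted_mean(values, widths))
print("sum:          ", combine_sum(values))
print("product:      ", combine_product(values))
```

With equal widths the weighted mean reduces to the plain average, which is the rule asked about in the original question.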

So I guess we would need to know a little more about the nature of Segment_Mean in order to give some advice.


