Mapping of two different regions to one gene symbol
NS ▴ 60

I am analyzing CNV data downloaded from the TCGA database (level 3) and want to convert it to a gene-level matrix.

The files look like this:

Sample    Chromosome    Start    End    Num_Probes    Segment_Mean
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    3218610    16796721    7253    -0.0198
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    16796742    17763566    312    -0.3615
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774    1    17764034    221905958    105172    -0.0073

To convert the CNV data to gene-level data, I map genomic regions to genes. In some cases, two different regions with different 'Segment_Mean' values map to the same gene. In that case, is it correct to use the average of the 'Segment_Mean' values for that gene?

Any thoughts?

I should mention that the data were obtained using the SNP Array 6.0.

Thanks.

TCGA CNV • 1.7k views
Marc Carlson ★ 7.2k

Your question is a little bit opaque to me. But it is pretty normal for a couple of different ranges to overlap with a single gene. Can you please clarify what you are trying to do?

 

 Marc


I aim to make a gene-level matrix from the CNV data, i.e. a [genes x samples] matrix in which each entry is the Segment_Mean value of the corresponding gene in the corresponding sample. Does that make sense?


Hi,

It all depends on what Segment_Mean represents exactly.

Let's try to reformulate your problem in a more generic way: you have a set of regions of interest, and a numeric value associated with each of them. Now you want to assign a single value to the set of regions as a whole by combining the individual values in some way. What's the correct way to combine the values? Well, it depends. Sometimes you want to take the max (e.g. if the i-th value is the height of the highest peak in the i-th region), or the weighted mean where the weights are the sizes of the regions (e.g. if the i-th value is the GC content in % of the i-th region), or the sum (e.g. if the i-th value is an absolute count, like the number of hits of some sort or the number of CpG islands in the i-th region), or the product (e.g. if the i-th value is an absolute count to which a mathematical transformation like exp() was applied), etc.
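To make the options above concrete, here is a minimal Python sketch of the four combining rules. The function name `combine`, the `how` labels, and the region widths are made up for illustration; only the two Segment_Mean values come from the example file in the question.

```python
import math

def combine(values, widths, how):
    """Combine per-region values into a single value for the set of regions.

    `how` selects one of the rules discussed above; which one is appropriate
    depends on what the values represent.
    """
    if how == "max":
        return max(values)
    if how == "weighted_mean":
        # Weights are the region sizes.
        return sum(v * w for v, w in zip(values, widths)) / sum(widths)
    if how == "sum":
        return sum(values)
    if how == "product":
        return math.prod(values)
    raise ValueError(f"unknown rule: {how}")

# Two Segment_Mean values from the question, with hypothetical region widths.
values = [-0.0198, -0.3615]
widths = [300, 100]

for rule in ("max", "weighted_mean", "sum", "product"):
    print(rule, combine(values, widths, rule))
```

The same pair of values gives four different answers, which is why the meaning of the value has to be settled before picking a rule.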

So I guess we would need to know a little bit more about the nature of Segment_Mean in order to be able to give some advice.
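If Segment_Mean turns out to be an average-type quantity, one possible way to build the [genes x samples] matrix is an overlap-weighted mean, sketched below in plain Python. This is only an illustration under that assumption, not a recommended method: the gene names and coordinates are hypothetical, the helper `gene_value` is made up, and only the segment rows are taken from the question.

```python
import math

# Segment rows copied from the question: (sample, chrom, start, end, segment_mean).
SAMPLE = "BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774"
segments = [
    (SAMPLE, "1", 3218610, 16796721, -0.0198),
    (SAMPLE, "1", 16796742, 17763566, -0.3615),
    (SAMPLE, "1", 17764034, 221905958, -0.0073),
]

# Hypothetical genes: name -> (chrom, start, end). Not real annotations.
genes = {
    "GENE_A": ("1", 16796000, 16797500),  # spans the first two segments
    "GENE_B": ("1", 20000000, 20010000),  # inside the third segment
    "GENE_C": ("2", 1000, 2000),          # not covered by any segment
}

def gene_value(gene, sample):
    """Overlap-weighted mean of Segment_Mean over the segments hitting a gene."""
    chrom, g_start, g_end = genes[gene]
    total = 0.0
    covered = 0
    for s, c, start, end, seg_mean in segments:
        if s != sample or c != chrom:
            continue
        # Number of bases shared by segment and gene (inclusive coordinates).
        overlap = min(end, g_end) - max(start, g_start) + 1
        if overlap > 0:
            total += seg_mean * overlap
            covered += overlap
    # NaN marks genes that no segment overlaps in this sample.
    return total / covered if covered else math.nan

samples = sorted({seg[0] for seg in segments})
matrix = {g: {s: gene_value(g, s) for s in samples} for g in genes}

for g, row in matrix.items():
    print(g, row)
```

Swapping the weighted mean for another rule (max, sum, and so on) only changes the body of `gene_value`; the matrix-building step stays the same.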

H.
