I am analyzing CNV data downloaded from the TCGA database (level 3) and aim to convert it to a gene-level matrix.
The files look like this:
Sample                                              Chromosome  Start     End        Num_Probes  Segment_Mean
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774  1           3218610   16796721   7253        -0.0198
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774  1           16796742  17763566   312         -0.3615
BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_A02_808774  1           17764034  221905958  105172      -0.0073
To convert CNV data to gene-level data, I map genome regions to genes. In some cases, two different regions with different 'Segment_Mean' values map to the same gene. In that case, is it correct to use the average of the 'Segment_Mean' values for that gene?
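To make the situation concrete, here is a minimal sketch of the region-to-gene mapping in plain Python. The segments are the three example rows above; the gene names and coordinates (GENE_A, GENE_B) are invented for illustration and are not real annotations. It only finds which segments overlap each gene and by how many bases; it does not decide how the values should be combined.

```python
# Segments from the example rows: (chromosome, start, end, segment_mean).
segments = [
    ("1", 3218610, 16796721, -0.0198),
    ("1", 16796742, 17763566, -0.3615),
    ("1", 17764034, 221905958, -0.0073),
]

# Hypothetical gene coordinates, invented for illustration only.
genes = {
    "GENE_A": ("1", 5000000, 5100000),    # falls inside the first segment
    "GENE_B": ("1", 17700000, 17800000),  # straddles the second and third segments
}

def overlapping_segments(gene, segments):
    """Return (overlap_in_bp, segment_mean) for every segment overlapping the gene."""
    chrom, g_start, g_end = gene
    hits = []
    for s_chrom, s_start, s_end, seg_mean in segments:
        if s_chrom != chrom:
            continue
        # Coordinates are treated as closed intervals, hence the +1.
        overlap = min(g_end, s_end) - max(g_start, s_start) + 1
        if overlap > 0:
            hits.append((overlap, seg_mean))
    return hits

hits_by_gene = {name: overlapping_segments(coords, segments)
                for name, coords in genes.items()}

# Genes hit by more than one segment are the ambiguous cases asked about here.
multi_segment_genes = [name for name, hits in hits_by_gene.items() if len(hits) > 1]
```

With these made-up coordinates, GENE_B is the only gene covered by two segments with different Segment_Mean values.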
Any thoughts?
It should be mentioned that the data has been obtained using SNP Array 6.0.
Thanks.
I aim to make a gene-level matrix from the CNV data, that is, a [genes x samples] matrix in which each entry holds the Segment_Mean value of the corresponding gene in the corresponding sample. Does that make sense?
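As a rough sketch of the target structure, assuming one value per gene has already been obtained for each sample: the sample names, gene names, and numbers below are all invented, and a gene with no value in a sample is simply marked as None.

```python
# Hypothetical gene-level values per sample (all names and numbers are invented).
gene_values = {
    "SAMPLE_1": {"GENE_A": -0.0198, "GENE_B": -0.2334},
    "SAMPLE_2": {"GENE_A": 0.4120},  # GENE_B has no value in this sample
}

samples = sorted(gene_values)
gene_names = sorted({g for per_sample in gene_values.values() for g in per_sample})

# Rows are genes, columns are samples; None marks a missing entry.
matrix = [[gene_values[s].get(g) for s in samples] for g in gene_names]
```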
Hi,
It all depends on what Segment_Mean represents exactly.

Let's try to reformulate your problem in a more generic way: you have a set of regions of interest and a numeric value associated with each of them. Now you want to assign a single value to the set of regions as a whole by combining the individual values in some way. What's the correct way to combine the values? Well, it depends. Sometimes you want to take the max (e.g. if the i-th value is the height of the highest peak in the i-th region), or the weighted mean where the weights are the sizes of the regions (e.g. if the i-th value is the GC content in % of the i-th region), or the sum (e.g. if the i-th value is an absolute count, like the number of hits of some sort or the number of CpG islands in the i-th region), or the product (e.g. if the i-th value is an absolute count to which a mathematical transformation like exp() was applied), etc.
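The four combination rules mentioned above can be sketched in a few lines of Python. The region sizes and values are invented, and the function names are only illustrative; which rule is appropriate still depends on what the value represents.

```python
import math

# (region_size, value) pairs; the numbers are invented for illustration.
regions = [(100, 2.0), (300, 4.0)]

def combine_max(regions):
    """Largest value, e.g. when each value is a peak height."""
    return max(value for _, value in regions)

def combine_weighted_mean(regions):
    """Mean weighted by region size, e.g. when each value is a percentage."""
    total_size = sum(size for size, _ in regions)
    return sum(size * value for size, value in regions) / total_size

def combine_sum(regions):
    """Plain sum, e.g. when each value is an absolute count."""
    return sum(value for _, value in regions)

def combine_product(regions):
    """Product, e.g. when each value is an exp()-transformed count."""
    return math.prod(value for _, value in regions)
```

On this toy input the rules give 4.0, 3.5, 6.0 and 8.0 respectively, which shows how much the choice of rule can matter.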
So I guess we would need to know a little bit more about the nature of Segment_Mean in order to be able to give some advice.

H.