Hi,
Thanks for reading. Any help or advice on this issue would be much appreciated, including if my approach does not make sense. I have all my CNV calls/segments in .txt files (one per sample) with the following structure:
For SAMPLE 1 it would be:
Chromosome Start End Value
chr1 754192 151015495 -0.02005069889128208
chr1 151016790 151150857 -0.2238580733537674
chr1 151174772 243812552 0.02483091503381729
chr1 243818465 243918083 0.16757509112358093
chr1 243919773 249212878 0.06885097920894623
chr2 21494 243052331 -0.0025195078924298286
chr3 63411 69846904 -0.050300538539886475
chr3 69847460 70004208 -0.126520037651062
....
For SAMPLE 2 it would be:
Chromosome Start End Value
chr1 754192 186557453 0.0036580897867679596
chr1 186577925 186639485 -0.08182021975517273
chr1 186642429 189369841 -0.006529499311000109
chr1 189378725 189721806 -0.09558720141649246
chr1 189731338 197300995 0.02319585345685482
....
And so on. My question is: is it possible to merge all this information into a data frame in R, where every row is a sample and every column is a segment? I cannot figure out how to do it, because most samples have different segments, some segments overlap between samples, and the total number of segments varies between samples.
The aim of constructing this data frame is to perform PCA and clustering. I also have numerical and categorical variables for every sample. What is the best way of combining them with the CNV data? Any help will be much appreciated. Thanks
Kind regards
IOM
Hi,
I don't think the CNTools vignette is bad. I added 3 lines of code to the vignette example to do a PCA (untested code):
require("CNTools")
data(sampleData)
head(sampleData)
###################################################
### code chunk number 2: HowTo.Rnw:65-70
###################################################
cnseg <- CNSeg(sampleData[which(is.element(sampleData[, "ID"], sample(unique(sampleData[, "ID"]), 20))), ])
rdseg <- getRS(cnseg, by = "region", imput = FALSE, XY = FALSE, what = "mean")
data("geneInfo")
geneInfo <- geneInfo[sample(1:nrow(geneInfo), 2000), ]
rdByGene <- getRS(cnseg, by = "gene", imput = FALSE, XY = FALSE, geneMap = geneInfo, what = "median")
# remove gene information from the copy number data.frame
m <- rs(rdByGene)[,-(1:6)]
pca <- prcomp(t(m))
plot(pca$x[,1], pca$x[,2])
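To get from one .txt file per sample to a single table like sampleData, the files can be stacked with a sample ID column. Below is a minimal sketch, not a tested pipeline. It assumes whitespace-separated files with the header Chromosome Start End Value as in the question. The file names, the temporary directory and the helper read_one are hypothetical, and the column names (ID, chrom, loc.start, loc.end, seg.mean) are what I believe CNSeg() expects by default, so please check them against ?CNSeg.

```r
# Hypothetical input: two tiny per-sample files, written here only so the sketch runs.
tmp_dir <- tempfile("cnv_")
dir.create(tmp_dir)
writeLines(c("Chromosome Start End Value",
             "chr1 754192 151015495 -0.0200",
             "chr1 151016790 151150857 -0.2238"),
           file.path(tmp_dir, "sample1.txt"))
writeLines(c("Chromosome Start End Value",
             "chr1 754192 186557453 0.0036",
             "chr1 186577925 186639485 -0.0818",
             "chr1 186642429 189369841 -0.0065"),
           file.path(tmp_dir, "sample2.txt"))

files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# read_one is a hypothetical helper: read one file and rename its columns.
read_one <- function(f) {
  d <- read.table(f, header = TRUE, stringsAsFactors = FALSE)
  data.frame(ID        = sub("\\.txt$", "", basename(f)),  # sample ID from file name
             chrom     = sub("^chr", "", d$Chromosome),    # assumption: "1", not "chr1"
             loc.start = d$Start,
             loc.end   = d$End,
             seg.mean  = d$Value,
             stringsAsFactors = FALSE)
}

# One long table: one row per segment, all samples stacked.
segs <- do.call(rbind, lapply(files, read_one))
head(segs)

# With CNTools installed, this table could then replace sampleData above,
# e.g. cnseg <- CNSeg(segs), assuming the default column names match.
```

Stripping the "chr" prefix is only needed if the chromosome names have to match those in geneInfo; that is an assumption on my side.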
If you cluster raw log-ratios, tumor purity will probably confound your clustering. If you do not expect a lot of variance in purity, you can also categorize the log-ratios GISTIC-like (deep loss, loss, normal, gain, amplification), which might or might not improve clustering.
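A GISTIC-like categorization can be sketched with cut(). The cut-offs below are hypothetical values chosen only for illustration, not GISTIC's own thresholds, and the toy matrix lr and the helper categorize are made up; sensible cut-offs depend on your platform and on purity.

```r
# Hypothetical cut-offs on the log-ratio scale (assumptions, not GISTIC's values).
breaks <- c(-Inf, -1.0, -0.2, 0.2, 0.8, Inf)
labels <- c("deep loss", "loss", "normal", "gain", "amplification")

# categorize is a hypothetical helper: map log-ratios to five ordered classes.
# Intervals are left-closed, e.g. "normal" is [-0.2, 0.2).
categorize <- function(x) {
  cut(x, breaks = breaks, labels = labels, right = FALSE, ordered_result = TRUE)
}

# Toy log-ratio matrix: 2 samples (rows) by 3 regions (columns).
lr <- matrix(c(-1.3, -0.22, 0.02, 0.17, 0.9, -0.05), nrow = 2,
             dimnames = list(c("s1", "s2"), c("r1", "r2", "r3")))

# Integer codes from -2 (deep loss) to 2 (amplification), same shape as lr.
codes <- lr
codes[] <- as.integer(categorize(as.vector(lr))) - 3L
codes
```

The coded matrix can go into prcomp() or a distance-based clustering in the same way as the raw log-ratios, provided missing values are handled first.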
Good luck with your data,
Markus
Hi Markus,
Thanks very much for your reply. I am going to have a deeper look at it and will come back to you. Thanks
IOM