Hello everyone!
This is my first post here, and I thank you all for the time you will give to my problem.
I'm a trainee working on a project in personalized medicine. I have to use gene expression data from the GEO database. To preprocess the raw data, I'm using the SCAN.UPC package, maintained by Stephen Piccolo, which is a great tool and very effective for the next part of my work.
However, I have a problem with the number of samples left at the end of preprocessing. For example, for GSE26639 (REMAGUS02 trial), I get only 64 of the 226 patients. I followed the workflow proposed in the PDF available on Bioconductor, and mapped the probes to genes using the BrainArray package for the specific platform. I'm using the latest version of the BrainArray CDF, 23.0.0.
I have also tried tuning the convThreshold parameter, which changes the number of samples I get. For example:
- convThreshold = 0.50 => 64 samples
- convThreshold = 0.90 => 45 samples
- convThreshold = 0.01 => 24 samples
It would be good if I could get between 60 and 75% of the samples. Is there a way to achieve this?
This is the part of my code where I get the raw CEL files from GEO, get the corresponding Entrez BrainArray probe mapping, and process everything with SCAN.
# Get raw CEL files and the BrainArray CDF package name
library(SCAN.UPC)
raw_data <- get_CEL(gseName, PATH_RAW)
colnames(raw_data)[1]
celPath <- paste0(PATH_RAW, gseName, "/data")
# Install the BrainArray package matching the platform of the first CEL file
pkgName <- InstallBrainArrayPackage(file.path(celPath, colnames(raw_data)[1]), "23.0.0", "hs", "entrezg")
# celFilesPath <- file.path(celPath, "*.CEL*")
# SCAN UPC part
gc()
scan_norm <- SCAN(gseName, probeSummaryPackage = pkgName, convThreshold = 0.40)
# Samples x genes dimensions of the normalized matrix
dim(t(exprs(scan_norm)))
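To narrow down which samples are being dropped, the CEL file names can be compared with the sample names in the SCAN output. Here is a minimal sketch, assuming both sets of names start with the GSM accession. `find_missing_samples` is a hypothetical helper, not part of SCAN.UPC, and the file names below are made up for illustration:

```r
# Hypothetical helper (not part of SCAN.UPC): list the samples that have a
# CEL file but no matching column in the processed expression matrix.
find_missing_samples <- function(cel_files, processed_names) {
  # Reduce a file or sample name to its GSM accession, e.g. "GSM000001".
  # Names that do not start with "GSM" are left unchanged.
  to_gsm <- function(x) sub("^(GSM[0-9]+).*$", "\\1", basename(x))
  setdiff(to_gsm(cel_files), to_gsm(processed_names))
}

# Toy example with made-up names (assumption: real names follow this pattern)
cel_files <- c("data/GSM000001.CEL.gz",
               "data/GSM000002.CEL.gz",
               "data/GSM000003_x.CEL.gz")
processed <- c("GSM000001.CEL.gz", "GSM000003_x.CEL.gz")

missing <- find_missing_samples(cel_files, processed)
print(missing)
```

With the real data, `cel_files` could come from `list.files(celPath)` and `processed` from `sampleNames(scan_norm)`, assuming the CEL files are in `celPath` and the sample names in the result are derived from the file names.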
Thank you, and I hope this was clear.
Thanks for your question. I may be misunderstanding, but GSE26636 only has 13 samples (see https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26636). Is this the one you meant?
Thanks for the reply. Sorry, that was my mistake: it is GSE26639 for REMAGUS02. There are 226 patients in this dataset. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26639
I have edited it in my main post.