Question

SCAN.UPC: losing too many samples from a GSE (BrainArray + convThreshold)

s.goncalvesclaro ▴ 10

@sgoncalvesclaro-20405

Last seen 5.0 years ago

Hello everyone!

This is my first post here, and I thank you all for the time you will spend helping me solve my problem.

I'm a trainee working on a project in personalized medicine. I have to use gene expression data from the GEO database. To preprocess the raw data, I'm using the SCAN.UPC package, maintained by Stephen Piccolo, which is a great tool and works really well for the next part of my work.

But I have a problem with the number of samples left at the end of preprocessing. For example, for GSE26639 (REMAGUS02 trial), I get only 64 of the 226 patients. I followed the workflow proposed in the PDF vignette available on Bioconductor and mapped the probes to genes using the BrainArray package available for the specific platform. I'm using the latest version of the BrainArray CDF, 23.0.0.

I'm also tuning the convThreshold parameter, which makes me gain or lose some samples. For example:

convThreshold = 0.50 => 64 samples
convThreshold = 0.90 => 45 samples
convThreshold = 0.01 => 24 samples

It would be good if I could keep between 60 and 75% of the samples. Is there a way to achieve this?

This is the part of my code where I get the raw CEL files from GEO, get the corresponding Entrez BrainArray probe mapping, and process everything with SCAN.

# Get the raw CEL files and the name of the BrainArray CDF package
library(SCAN.UPC)
raw_data <- get_CEL(gseName, PATH_RAW)  # get_CEL() is not shown here
colnames(raw_data)[1]
celPath <- paste(PATH_RAW, gseName, "/data", sep = "")
pkgName <- InstallBrainArrayPackage(paste(celPath, colnames(raw_data)[1], sep = "/"), "23.0.0", "hs", "entrezg")
# celFilesPath <- file.path(celPath, "*.CEL*")

# SCAN.UPC part
gc()
scan_norm <- SCAN(gseName, probeSummaryPackage = pkgName, convThreshold = 0.40)
# Rows = samples, columns = genes after transposing
dim(t(exprs(scan_norm)))

Thank you, and I hope this was clear.

SCAN.UPC probe normalization GEOdata cancer • 1.2k views

ADD COMMENT • link 5.0 years ago s.goncalvesclaro ▴ 10

Thanks for your question. I may be misunderstanding, but GSE26636 only has 13 samples (see https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26636). Is this the one you meant?

ADD REPLY • link 5.0 years ago Stephen Piccolo ▴ 590

Thanks for the reply. Sorry, this was my mistake: it is GSE26639 for REMAGUS02. There are 226 patients in this dataset. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26639

I have edited it in my main post.

ADD REPLY • link 5.0 years ago s.goncalvesclaro ▴ 10

score 0 · Answer 1 · 2019-04-09

I solved my problem after further investigation. The cause was calling registerDoParallel() without any arguments, such as cores. After rerunning my workflow several times, I found that the problem was not due to SCAN but to R and its memory management.

As R had had some trouble before (aborted computations, freezes), I had left the registerDoParallel() call without arguments, hoping that R would automatically manage the parallel jobs based on my CPUs. But it is important to specify at least the cores argument, for example registerDoParallel(cores = 2), as described in the SCAN.UPC vignette.
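
For anyone who wants to check their own setup, here is a minimal sketch of registering the backend with an explicit number of cores and confirming what foreach actually sees. The value 2 is only what worked on my machine, and the n_cores and n_workers names are illustrative; adjust the count to your own CPU and memory.

```r
library(doParallel)  # also attaches foreach and parallel

# Assumption: 2 workers, as in my case; pick a value that fits your RAM.
n_cores <- 2
registerDoParallel(cores = n_cores)

# Confirm that a parallel backend is registered and how many workers it has.
backend_ok <- getDoParRegistered()
n_workers <- getDoParWorkers()
print(backend_ok)
print(n_workers)

# ... run the SCAN() preprocessing here ...

# Release any implicitly created worker processes once preprocessing is done.
stopImplicitCluster()
```

Keeping the worker count small limits how many CEL files are held in memory at the same time, which is what seemed to matter in my case.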

Now the convThreshold argument has an effect on the number of samples that I get at the end of my workflow. I hope my mistake will help prevent this error in other projects using SCAN.UPC.
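
To see which samples go missing rather than just how many, a small check like the one below can help. This is only a sketch: find_missing_samples() is a hypothetical helper (not part of SCAN.UPC), the file names are made up, and it assumes that the column names of the normalized matrix can be matched to the CEL file names once the extension is stripped.

```r
# Hypothetical helper (not part of SCAN.UPC): return the input files that
# have no matching column in the normalized expression matrix.
find_missing_samples <- function(cel_files, processed_names) {
  # Compare on the base file name without a trailing .CEL or .CEL.gz
  strip <- function(x) sub("\\.cel(\\.gz)?$", "", basename(x), ignore.case = TRUE)
  cel_files[!(strip(cel_files) %in% strip(processed_names))]
}

# Toy example with made-up file names
cel_files <- c("data/GSM000001.CEL.gz",
               "data/GSM000002.CEL.gz",
               "data/GSM000003.CEL.gz")
processed <- c("GSM000001.CEL.gz", "GSM000003.CEL.gz")

dropped <- find_missing_samples(cel_files, processed)
print(dropped)
```

With real data, the first argument would be the list of downloaded CEL files and the second something like colnames(exprs(scan_norm)), provided the naming assumption above holds for your files.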

Thank you, Stephen, for your support. I'm going to use SCAN.UPC for my workflow; it is really trustworthy.

If you want to add anything to my answer, feel free to share your expertise. Thanks a lot!