mirkocelii
Hello, I have CAGE data and I want to find differentially expressed genes with DESeq. I don't want any normalization, because DESeq requires integer values. When I run
quantilePositions(ce ,
clusters = "tagClusters",
qLow = 0.1,
qUp = 0.9, useMulticore = TRUE)
the first 12 samples load in a few minutes, but then it takes forever (the process died after 24 hours). The last 3 ctss files are 10x bigger than the others (the biggest is 378M, 21,200,855 lines). I allocated 150 GB of RAM and 30 nodes with 16 CPUs each.
What could be the problem?
This is the sequence of commands I ran:
ce <- CAGEexp( genomeName = "BSgenome.Hsapiens.UCSC.hg38" , inputFiles = samples , inputFilesType = "ctss" , sampleLabels = labels )
getCTSS(ce)
CTSStagCountSE(ce)
CTSScoordinatesGR(ce)
CTSStagCountDF(ce)
CTSScoordinatesGR(ce)
gff=import.gff("gencode.v34.annotation.gff3")
annotateCTSS(ce, gff)
normalizeTagCount(ce, method = "none")
clusterCTSS(object = ce,
threshold = 1,
thresholdIsTpm = TRUE,
nrPassThreshold = 1,
method = "distclu",
maxDist = 20,
removeSingletons = TRUE,
keepSingletonsAbove = 5)
cumulativeCTSSdistribution(ce, clusters = "tagClusters")
quantilePositions(ce ,
clusters = "tagClusters",
qLow = 0.1,
qUp = 0.9, useMulticore = TRUE)
aggregateTagClusters(ce,
tpmThreshold = 5,
qLow = 0.1,
qUp = 0.9,
maxDist = 100, useMulticore = TRUE)
To check whether it is a performance issue or a bug, can you subset the "big" samples and check whether there is a particular chromosome for which
quantilePositions()
does not finish within 24 hours?
Thanks Charles, I'll try! However, the big samples have just been sequenced deeper than the others.
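The per-chromosome check suggested above could be sketched roughly as follows. This is only an illustration in base R on a toy table: `ctss` is made-up data, and `slow_step()` is a hypothetical placeholder for whatever you actually want to time (for example, building a CAGEexp from the subset and calling quantilePositions() on it).

```r
# Toy CTSS table (made-up data); in practice this would be read from one of
# the large ctss files.
set.seed(1)
ctss <- data.frame(
  chr    = sample(c("chr1", "chr2", "chrM"), 1000, replace = TRUE),
  pos    = sample.int(1e6, 1000),
  strand = sample(c("+", "-"), 1000, replace = TRUE),
  count  = rpois(1000, 3) + 1L
)

# Hypothetical placeholder for the slow step. Replace the body with the real
# work you want to time on each per-chromosome subset.
slow_step <- function(df) {
  invisible(cumsum(df$count[order(df$pos)]))
}

# Split by chromosome and record the elapsed time of the step on each subset.
by_chr  <- split(ctss, ctss$chr)
timings <- vapply(
  by_chr,
  function(df) system.time(slow_step(df))[["elapsed"]],
  numeric(1)
)

# Chromosomes ordered from slowest to fastest.
timings <- sort(timings, decreasing = TRUE)
print(timings)
```

If one chromosome dominates the timings, that points to a specific region worth inspecting; if the time simply grows with the number of CTSS, it looks more like a general performance issue.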
Dear Charles, as I said above, 3 files out of 15 had 10 times more data than the others. I subsampled these 3 files with runif(), so I kept roughly the same proportion of CTSS per chromosome, making them as big as the other 12, and it worked: the whole script ran in 1 hour. Do you think it's just a matter of file size? What can be wrong with the quantile function?
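For readers wanting to reproduce this kind of subsampling, here is a minimal sketch in base R on a toy table. The exact code used in the post is not shown, so option (a) is only a guess at what subsampling with runif() looks like; the column names and the fraction `frac` are assumptions. Option (b) is a different technique, binomial thinning of the tag counts, included because it reduces depth while keeping integer counts rather than dropping whole positions.

```r
# Toy CTSS table (made-up data, assumed column layout: chr, pos, strand, count).
set.seed(42)
ctss <- data.frame(
  chr    = rep(c("chr1", "chr2"), times = c(6000, 4000)),
  pos    = seq_len(10000),
  strand = "+",
  count  = rpois(10000, 5) + 1L
)

frac <- 0.1  # assumed fraction to keep, roughly matching a 10x size difference

# (a) Row-wise subsampling with runif(): each CTSS line is kept with
#     probability `frac`, so chromosome proportions are roughly preserved.
keep      <- runif(nrow(ctss)) < frac
ctss_rows <- ctss[keep, ]

# (b) Alternative (not from the post): binomial thinning of the counts.
#     Each tag is kept with probability `frac`; positions left with zero
#     tags are dropped, and the remaining counts stay integer.
thinned         <- rbinom(nrow(ctss), size = ctss$count, prob = frac)
ctss_thin       <- ctss[thinned > 0, ]
ctss_thin$count <- thinned[thinned > 0]

# Compare chromosome proportions before and after row-wise subsampling.
print(prop.table(table(ctss$chr)))
print(prop.table(table(ctss_rows$chr)))
```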
It could be just a performance issue:
quantilePositions()
is one of the slowest functions in CAGEr: see its check report for instance.
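For intuition about what is being computed per cluster, here is a minimal base R sketch of a quantile position: the first position along a cluster where the cumulative share of tags reaches a given quantile. The helper `quantile_position()` and the toy data are hypothetical, and CAGEr's own implementation may differ in its details (for example, in how ties and strands are handled).

```r
# Hypothetical helper: position at which the cumulative fraction of tags in
# one cluster first reaches the quantile q.
quantile_position <- function(pos, count, q) {
  o        <- order(pos)
  pos      <- pos[o]
  count    <- count[o]
  cum_frac <- cumsum(count) / sum(count)
  pos[which(cum_frac >= q)[1]]
}

# Toy cluster (made-up data): CTSS positions and their tag counts.
pos   <- c(100, 101, 103, 104, 110)
count <- c(1, 2, 10, 4, 3)

q_low <- quantile_position(pos, count, 0.1)
q_up  <- quantile_position(pos, count, 0.9)
print(c(qLow = q_low, qUp = q_up))
```

This has to be repeated for every cluster in every sample, which is one plausible reason the run time grows with deeply sequenced samples.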