CAGEr quantilePositions() takes forever to run
mirkocelii ▴ 40

Hello, I have CAGE data and I want to find differentially expressed genes with DESeq. Since DESeq requires integer counts, I don't want any normalization. When I run

quantilePositions(ce,
                  clusters = "tagClusters",
                  qLow = 0.1,
                  qUp = 0.9,
                  useMulticore = TRUE)

it loads the first 12 samples in a few minutes, then it takes forever (the process died after 24 hours). The last 3 CTSS files are 10x bigger than the others (the biggest is 378 MB, 21,200,855 lines). I allocated 150 GB of RAM, 30 nodes, and 16 CPUs per node.

What could be the problem?

This is the sequence of commands I ran:

library(CAGEr)
library(rtracklayer)  # provides import.gff()

ce <- CAGEexp(genomeName     = "BSgenome.Hsapiens.UCSC.hg38",
              inputFiles     = samples,
              inputFilesType = "ctss",
              sampleLabels   = labels)
getCTSS(ce)
CTSStagCountSE(ce)
CTSScoordinatesGR(ce)
CTSStagCountDF(ce)
CTSScoordinatesGR(ce)

gff <- import.gff("gencode.v34.annotation.gff3")
annotateCTSS(ce, gff)

normalizeTagCount(ce, method = "none")
clusterCTSS(object = ce,
            threshold = 1,
            thresholdIsTpm = TRUE,
            nrPassThreshold = 1,
            method = "distclu",
            maxDist = 20,
            removeSingletons = TRUE,
            keepSingletonsAbove = 5)

cumulativeCTSSdistribution(ce, clusters = "tagClusters")
quantilePositions(ce,
                  clusters = "tagClusters",
                  qLow = 0.1,
                  qUp = 0.9,
                  useMulticore = TRUE)

aggregateTagClusters(ce,
                     tpmThreshold = 5,
                     qLow = 0.1,
                     qUp = 0.9,
                     maxDist = 100,
                     useMulticore = TRUE)

To check whether this is a performance issue or a bug, can you subset the "big" samples and check whether there is one chromosome in particular for which quantilePositions() does not finish within 24 hours?
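
One way to prepare such a per-chromosome test is to split a big CTSS file into one file per chromosome before loading it. The sketch below uses base R only and assumes the usual CTSS layout (tab-separated, no header, four columns: chromosome, position, strand, tag count). The helper name split_ctss_by_chr() and the toy data are hypothetical, made up for illustration.

```r
# Hypothetical helper (not part of CAGEr): split one CTSS file by chromosome.
# Assumes a header-less, tab-separated file with 4 columns:
# chromosome, position, strand, tag count.
split_ctss_by_chr <- function(ctss_file, out_dir) {
  ctss <- read.table(ctss_file, sep = "\t", header = FALSE,
                     quote = "", comment.char = "",
                     col.names = c("chr", "pos", "strand", "count"),
                     stringsAsFactors = FALSE)
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  parts <- split(ctss, ctss$chr)
  out_files <- character(0)
  for (chr in names(parts)) {
    out <- file.path(out_dir, paste0(chr, ".ctss"))
    write.table(parts[[chr]], out, sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = FALSE)
    out_files[chr] <- out
  }
  out_files  # named vector: chromosome -> file path
}

# Toy example with made-up data, written to a temporary file.
toy <- data.frame(chr    = c("chr1", "chr1", "chr2"),
                  pos    = c(100L, 105L, 200L),
                  strand = c("+", "+", "-"),
                  count  = c(3L, 1L, 7L))
toy_file <- tempfile(fileext = ".ctss")
write.table(toy, toy_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
per_chr <- split_ctss_by_chr(toy_file, file.path(tempdir(), "by_chr"))
```

Each resulting file can then be loaded on its own and the quantilePositions() call wrapped in system.time(), which should show whether one chromosome dominates the run time. For files with tens of millions of lines, read.table() itself will be slow, so this is only a starting point.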


Thanks Charles, I'll try! However, the big samples have just been sequenced deeper than the others.


Dear Charles, as I said above, 3 files out of 15 had 10 times more data than the others. I subsampled these 3 files with runif(), so I kept roughly the same proportion of CTSS per chromosome, making them as big as the other 12, and it worked. The whole script ran in 1 hour. Do you think it's just a matter of file size? What can be wrong with the quantile function?
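
For reference, the kind of runif() subsampling described above could look roughly like the sketch below. The exact commands used are not given in the thread, so the helper names, the 10% fraction and the toy data are assumptions. The second helper is a different technique (binomial thinning of the tag counts with rbinom()), shown only as an alternative that keeps the counts as integers.

```r
# Hypothetical helper: keep each CTSS row with probability `fraction`.
# Rows are dropped independently, so every chromosome keeps roughly
# the same share of positions as in the original table.
subsample_ctss_rows <- function(ctss, fraction) {
  stopifnot(fraction > 0, fraction <= 1)
  ctss[runif(nrow(ctss)) < fraction, , drop = FALSE]
}

# Alternative (not what was done above): thin the tag counts instead of
# the rows. Each tag is kept with probability `fraction`; positions whose
# count drops to zero are removed. Counts stay integers.
thin_ctss_counts <- function(ctss, fraction) {
  stopifnot(fraction > 0, fraction <= 1)
  ctss$count <- rbinom(nrow(ctss), size = ctss$count, prob = fraction)
  ctss[ctss$count > 0, , drop = FALSE]
}

# Toy example with made-up data.
set.seed(1)
toy <- data.frame(chr    = rep(c("chr1", "chr2"), each = 5000),
                  pos    = rep(seq_len(5000), times = 2),
                  strand = "+",
                  count  = rpois(10000, lambda = 20) + 1L,
                  stringsAsFactors = FALSE)

small_rows   <- subsample_ctss_rows(toy, fraction = 0.1)
small_counts <- thin_ctss_counts(toy, fraction = 0.1)

# Share of remaining positions per chromosome after row subsampling.
table(small_rows$chr) / nrow(small_rows)
```

Note that the two approaches are not equivalent: dropping rows removes whole CTSS positions but leaves the surviving counts untouched, while thinning keeps most positions and lowers their counts, which is closer to what shallower sequencing would have produced.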


It could be just a performance issue: quantilePositions() is one of the slowest functions in CAGEr: see its check report for instance.
