Question

CAGEr quantilePositions() takes forever to run

mirkocelii ▴ 40

Hello, I have CAGE data and want to find differentially expressed genes with DESeq. Since DESeq requires integer counts, I don't want any normalization. When I run

quantilePositions(ce,
                  clusters = "tagClusters",
                  qLow = 0.1,
                  qUp = 0.9,
                  useMulticore = TRUE)

it processes the first 12 samples in a few minutes, then takes forever (the process died after 24 hours). The last 3 CTSS files are 10x bigger than the others (the biggest is 378M, 21,200,855 lines). I allocated 150 GB of RAM and 30 nodes with 16 CPUs each.

What could be the problem?

This is the sequence of commands I ran:

ce <- CAGEexp(genomeName     = "BSgenome.Hsapiens.UCSC.hg38",
              inputFiles     = samples,
              inputFilesType = "ctss",
              sampleLabels   = labels)
getCTSS(ce)
CTSStagCountSE(ce)
CTSScoordinatesGR(ce)
CTSStagCountDF(ce)
CTSScoordinatesGR(ce)

gff <- import.gff("gencode.v34.annotation.gff3")  # import.gff() comes from rtracklayer
annotateCTSS(ce, gff)

normalizeTagCount(ce, method = "none")
clusterCTSS(object = ce,
            threshold = 1,
            thresholdIsTpm = TRUE,
            nrPassThreshold = 1,
            method = "distclu",
            maxDist = 20,
            removeSingletons = TRUE,
            keepSingletonsAbove = 5)

cumulativeCTSSdistribution(ce, clusters = "tagClusters")
quantilePositions(ce,
                  clusters = "tagClusters",
                  qLow = 0.1,
                  qUp = 0.9,
                  useMulticore = TRUE)

aggregateTagClusters(ce,
                     tpmThreshold = 5,
                     qLow = 0.1,
                     qUp = 0.9,
                     maxDist = 100,
                     useMulticore = TRUE)

5.5 years ago mirkocelii ▴ 40

To check whether this is a performance issue or a bug, can you subset the "big" samples and see whether there is a particular chromosome for which quantilePositions() does not finish within 24 hours?
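
One way to set up such a per-chromosome check is sketched below in base R. It assumes the CTSS input is the headerless, tab-separated, four-column layout (chromosome, position, strand, tag count); the toy table, the file names, and the run_one() helper are hypothetical, and run_one() is only a stand-in for the CAGEr steps from the question.

```r
# Toy CTSS table standing in for one of the "big" samples (assumed layout:
# chromosome, 1-based position, strand, tag count; no header).
set.seed(42)
n <- 3000
ctss <- data.frame(
  chr    = sample(c("chr1", "chr2", "chrM"), n, replace = TRUE),
  pos    = sample.int(1e6, n),
  strand = sample(c("+", "-"), n, replace = TRUE),
  count  = rpois(n, 3) + 1L,
  stringsAsFactors = FALSE
)

# One table per chromosome
by_chr <- split(ctss, ctss$chr)

# Write each chromosome to its own file (hypothetical file names)
out_dir <- tempfile("ctss_by_chr_")
dir.create(out_dir)
paths <- vapply(names(by_chr), function(chr) {
  path <- file.path(out_dir, paste0("big_sample_", chr, ".ctss"))
  write.table(by_chr[[chr]], path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  path
}, character(1))

# Hypothetical placeholder: replace the body with the CAGEr pipeline
# (CAGEexp() ... quantilePositions()) run on a single per-chromosome file.
run_one <- function(path) {
  nrow(read.table(path, sep = "\t", stringsAsFactors = FALSE))
}

# Elapsed seconds per chromosome, slowest first
timings <- vapply(paths, function(p) system.time(run_one(p))[["elapsed"]],
                  numeric(1))
sort(timings, decreasing = TRUE)
```

If one chromosome dominates the timings or never finishes, that points towards a bug triggered by specific data rather than a general slowdown with input size.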

5.5 years ago Charles Plessy ▴ 180

Thanks Charles, I'll try! However, the big samples were simply sequenced more deeply than the others.

5.5 years ago mirkocelii ▴ 40

Dear Charles, as I said above, 3 of the 15 files had 10 times more data than the others. I subsampled these 3 files with runif(), which kept roughly the same proportion of CTSS per chromosome and made them about as big as the other 12. It worked: the whole script ran in 1 hour. Do you think it's just a matter of file size? What could be wrong with the quantile function?
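
A minimal sketch of this kind of runif()-based subsampling, in base R. It assumes the headerless, tab-separated, four-column CTSS layout (chromosome, position, strand, tag count); the toy table and the keep_fraction value are hypothetical, not the actual data or settings used here. Note that it drops whole CTSS lines rather than downsampling tag counts.

```r
# Toy CTSS table standing in for one deeply sequenced sample
set.seed(1)
n <- 10000
ctss <- data.frame(
  chr    = sample(c("chr1", "chr2", "chrX"), n, replace = TRUE),
  pos    = sample.int(1e6, n),
  strand = sample(c("+", "-"), n, replace = TRUE),
  count  = rpois(n, 3) + 1L,
  stringsAsFactors = FALSE
)

keep_fraction <- 0.1                       # hypothetical: shrink roughly 10x
keep <- runif(nrow(ctss)) < keep_fraction  # independent draw for every line
ctss_small <- ctss[keep, ]

# Each line is kept with the same probability, so the per-chromosome
# proportions stay roughly the same before and after subsampling.
round(prop.table(table(ctss$chr)), 3)
round(prop.table(table(ctss_small$chr)), 3)

# Write the reduced table back out in the same four-column layout
out <- tempfile(fileext = ".ctss")
write.table(ctss_small, out, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

Because lines are dropped at random, the low-count positions of the deep samples are thinned along with everything else, which is worth keeping in mind before feeding the counts to a differential expression test.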

5.5 years ago mirkocelii ▴ 40

It could simply be a performance issue: quantilePositions() is one of the slowest functions in CAGEr; see its check report, for instance.

5.5 years ago Charles Plessy ▴ 180