Question

Behaviour of getCTSS on a CAGEset - extra reads?

1

sarac ▴ 20

@sarac-21308

Last seen 4.8 years ago

When counting reads from a bamfile, I manually extracted the reads aligned to a particular location and found that CAGEr had clustered some reads at the location that did not exist in the bamfile.

The commands I am using are:

My_CAGEset <- new("CAGEset", genomeName = "BSgenome.Hsapiens.UCSC.hg38",
                  inputFiles = <input_path>, inputFilesType = "bam",
                  sampleLabels = c(<sample_names>))
ctss <- getCTSS(My_CAGEset)

If I write ctss out to a table/file and compare the number of tags assigned to a position in that file with a manual count of the sequences aligned to that position in the bam/sam file, the numbers do not match.

Is there some aspect of CAGEr I am misunderstanding that explains this behaviour?

cager cage ctss • 1.3k views

ADD COMMENT • link updated 4.0 years ago by Charles Plessy ▴ 180 • written 4.8 years ago by sarac ▴ 20

0

Hello,

the getCTSS function applies several filters to the data, and when using BAM files as input it will also attempt to correct the "G bias" of CAGE, as explained in the function's documentation. If this does not answer your question, can you give me a reproducible test case illustrating your problem?

Have a nice day,

-- Charles

ADD REPLY • link 4.8 years ago Charles Plessy ▴ 180

0

Hi Charles, Many thanks for getting back to me.

When I output the ctss object above as a table I get, for example, the following:

chr5    150412750   -   2
chr5    150412751   +   9932
chr5    150412751   -   48
chr5    150412752   +   514
chr5    150412752   -   235
chr5    150412753   +   261
chr5    150412753   -   1920
chr5    150412754   +   607
chr5    150412754   -   66
chr5    150412755   -   47
chr5    150412756   -   1
chr5    150412757   +   71
chr5    150412757   -   16
chr5    150412758   +   3026

This is for a small region on chromosome 5, positions 150412750-150412758.

If I use genomecov from Bedtools, which displays the data differently (piled-up reads rather than the start points of reads), I get the following:

Minus
chr5    150412750   -   16881
chr5    150412751   -   16792
chr5    150412752   -   16660
chr5    150412753   -   6574
chr5    150412754   -   4190
chr5    150412755   -   3867
chr5    150412756   -   3213
chr5    150412757   -   3150
chr5    150412758   -   119
Plus
chr5    150406961   +   1
chr5    150449786   +   1
chr5    150449787   +   1
chr5    150449788   +   1
chr5    150449789   +   1
chr5    150449790   +   1
chr5    150449791   +   1
chr5    150449792   +   1
chr5    150449793   +   1

The output is, of course, very different, but what is striking is the lack of reads in the + direction in the genomecov version compared to that from the ctss CAGEr output. I should note that the bamfiles used for the genomecov function were sorted and filtered for quality prior to use, but with a reasonably lax filtration step.

Finally, if I manually search the bam/samfile for positive sequences in this region I do not get any hits. This is the case if I expand the range to cover any 'pile up' sequences - although they should not be contributing.
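For a like-for-like comparison with the ctss table, read 5' ends can be tabulated per strand instead of coverage. The following is only a minimal sketch in base R: it assumes the alignments have already been exported to a data frame with hypothetical columns chr, start, end (1-based, inclusive) and strand, and the function name count_five_prime_ends is made up for illustration. It ignores soft-clipping and any G correction applied by CAGEr.

```r
# Count read 5' ends (candidate CTSSs) per position and strand.
# `aln` is a hypothetical data frame of alignments with columns
# chr, start, end (1-based, inclusive) and strand ("+" or "-").
count_five_prime_ends <- function(aln) {
  # The 5' end is the leftmost base for "+" reads and the rightmost for "-" reads.
  pos <- ifelse(aln$strand == "+", aln$start, aln$end)
  counts <- as.data.frame(
    table(chr = aln$chr, pos = pos, strand = aln$strand),
    responseName = "count",
    stringsAsFactors = FALSE
  )
  # table() enumerates every combination; keep only those actually observed.
  counts <- counts[counts$count > 0, ]
  counts$pos <- as.integer(counts$pos)
  counts <- counts[order(counts$chr, counts$pos, counts$strand), ]
  rownames(counts) <- NULL
  counts
}

# Toy alignments (made-up coordinates, not taken from the BAM file above).
aln <- data.frame(
  chr    = "chr5",
  start  = c(100L, 100L, 90L, 95L),
  end    = c(130L, 125L, 120L, 120L),
  strand = c("+", "+", "-", "-")
)
count_five_prime_ends(aln)
```

With this toy input the two plus-strand reads are counted at their start position and the two minus-strand reads at their end position, which is why a coverage pile-up and a table of 5' ends can look very different for the same reads.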

If necessary, I can email you the bamfiles themselves?

ADD REPLY • link 4.8 years ago sarac ▴ 20

0

Yes, please send me a minimal BAM file to my maintainer address.

ADD REPLY • link 4.8 years ago Charles Plessy ▴ 180

score 0 · Answer 1 · 2019-07-22

0

Charles Plessy ▴ 180

@charles-plessy-7857

Last seen 7 months ago

Japan

Hi Sara, you are right: the correctSystematicG option seems to be badly broken. I do not have time to fix this in the short term, so if your results do not critically depend on it, may I suggest setting it to FALSE?
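A minimal sketch of that workaround, assuming the My_CAGEset object from the question and that the installed CAGEr version exposes correctSystematicG as an argument of getCTSS (please check ?getCTSS for your version):

```r
library(CAGEr)

# Re-read the CTSSs from the BAM files with the systematic G correction
# switched off. My_CAGEset is the object created in the question; all other
# arguments are left at their defaults. The assignment mirrors the call used
# in the question.
ctss <- getCTSS(My_CAGEset, correctSystematicG = FALSE)
```

The other filters applied by getCTSS remain active, so the counts may still differ from a raw count of reads in the BAM file.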

ADD COMMENT • link 4.8 years ago Charles Plessy ▴ 180

0

Thank you for your reply and for the advice on the workaround.

ADD REPLY • link 4.8 years ago sarac ▴ 20

score 0 · Answer 2 · 2020-05-13

0

Charles Plessy ▴ 180

@charles-plessy-7857

Last seen 7 months ago

Japan

Dear Sarac,

We fixed a bug in G correction for CAGEset experiments at the end of last year:

https://github.com/charles-plessy/CAGEr/pull/26

Would you mind trying again?

Have a nice day,

-- Charles

ADD COMMENT • link 4.0 years ago Charles Plessy ▴ 180