Question: Behaviour of getCTSS on a CAGEset - extra reads?
1
3 months ago by
sarac20
sarac20 wrote:

When counting reads from a bamfile, I manually extracted the reads aligned to a particular location and found that CAGEr had clustered some reads at the location that did not exist in the bamfile.

The commands I am using are:

> My_CAGEset <- new("CAGEset", genomeName = "BSgenome.Hsapiens.UCSC.hg38",
inputFiles = <input_path>, inputFilesType = "bam",
sampleLabels = c(<sample_names>))
> ctss <-  getCTSS(My_CAGEset)


If I write ctss to out to a table/file and count the number of tags assigned to a position in this file vs manually counting the number of sequences that are aligned to that position in the bam/sam the numbers do not correspond to each other.

Is there some aspect of CAGEr I am misunderstanding that explains this behaviour?

cager cage ctss • 149 views
modified 3 months ago by Charles Plessy110 • written 3 months ago by sarac20

Hello,

the getCTSS function applies several filters on the data, and when using BAM files as input it will also attempt to correct the "G bias" of CAGE, as explained in the function's documentation. If this does not answer your question, can you give me a reproductible test case illustrating your problem ?

Have a nice day,

-- Charles

Hi Charles, Many thanks for getting back to me.

When I output the ctss object above as a table I get, for example, the following:

chr5    150412750   -   2
chr5    150412751   +   9932
chr5    150412751   -   48
chr5    150412752   +   514
chr5    150412752   -   235
chr5    150412753   +   261
chr5    150412753   -   1920
chr5    150412754   +   607
chr5    150412754   -   66
chr5    150412755   -   47
chr5    150412756   -   1
chr5    150412757   +   71
chr5    150412757   -   16
chr5    150412758   +   3026


For a small region in chromosome 5 between 150412750-150412758

If I use genomecov as part of Bedtools, which displays different the data differently (piled up reads, not the start point of reads) I get the following:

Minus
chr5    150412750   -   16881
chr5    150412751   -   16792
chr5    150412752   -   16660
chr5    150412753   -   6574
chr5    150412754   -   4190
chr5    150412755   -   3867
chr5    150412756   -   3213
chr5    150412757   -   3150
chr5    150412758   -   119
Plus
chr5    150406961   +   1
chr5    150449786   +   1
chr5    150449787   +   1
chr5    150449788   +   1
chr5    150449789   +   1
chr5    150449790   +   1
chr5    150449791   +   1
chr5    150449792   +   1
chr5    150449793   +   1


The output is, of course, very different, but what is striking is the lack of reads in the + direction in the genomecov version compared to that from the ctss CAGEr output. I should note that the bamfiles used for the genomecov function were sorted and filtered for quality prior to use, but with a reasonably lax filtration step.

Finally, if I manually search the bam/samfile for positive sequences in this region I do not get any hits. This is the case if I expand the range to cover any 'pile up' sequences - although they should not be contributing.

If necessary, I can email you the bamfiles themselves?

Yes, please send me a minimal BAM file to my maintainer address.

0
3 months ago by
Japan
Charles Plessy110 wrote:

Hi Sara, you are right, it seems that the correctSystematicG option is badly broken. I do not have time to fix this in the short term, so if your results do not critically depend on it, may I suggest to turn it to FALSE ?