Question: mosdepth output into cn.mops?
0
gravatar for Philipp Bayer
6 weeks ago by
Australia/Perth/UWA
Philipp Bayer0 wrote:

Hi

I have some very large bam-files (>200TB in total) and cn.mops is taking a while, so I've been thinking of using mosdepth instead - is my approach like this valid/am I breaking any assumptions in cn.mops?

1) Run mosdepth with a 25000 bp window, identical to the ranges in the cn.mops manual (mosdepth -b 25000 –n output_prefix input_bam)

2) Load that into R –

library(GenomicRanges)
df <- read.table('sample1.bed')
colnames(df) <- c('chr', 'start', 'end', 'sample1')
#make GenomicRanges object
gr <- GRanges(seqnames=df$chr, ranges=IRanges(df$start+1, df$end), strand='*', sample1=df$sample1)

# make one big genomicranges object for all individuals
df <- read.table(‘sample2.bed’)
colnames(df) <- c('chr', 'start', 'end', 'sample2’)
gr2 <- GRanges(seqnames=df$chr, ranges=IRanges(df$start+1, df$end), strand='*', sample2=df$sample2)
gr$sample2 <- gr2$sample2 # mosdepth guarantees that all windows/IRanges are the same for all individuals

# and so on for all individuals in a loop

resCNMOPS <- cn.mops(gr)
# follow manual from here

gr looks like this:

> gr
GRanges object with 2 ranges and 2 metadata columns:
      seqnames      ranges strand |   sample1   sample2
         <rle>   <iranges>  <rle> | <integer> <integer>
  [1]     chr1     1-10000      * |         5        10
  [2]     chr1 10002-20000      * |        10      1000

Would this work?

cn.mops • 98 views
ADD COMMENTlink modified 6 weeks ago by Günter Klambauer540 • written 6 weeks ago by Philipp Bayer0
Answer: mosdepth output into cn.mops?
2
gravatar for Günter Klambauer
6 weeks ago by
Austria
Günter Klambauer540 wrote:

Dear Philipp,

Yes it is very well possible and also reasonable to use the output of mosdepth as input for cn.mops.

There is only one thing to check: does mosdepth return "depth of coverage" or "readcounts"? From the GRanges object it seems that these are integer numbers and therefore read counts. cn.mops assumes that inputs are read counts. If it is "depth of coverage" that is returned by mosdepth, then please convert them to read counts using the simple relation

DOC = number of reads * avg read length / window length

Thus,

number of reads = DOC * window length / avg read length

Good luck with the further analysis!

Regards, Günter

ADD COMMENTlink written 6 weeks ago by Günter Klambauer540

Thank you for your fast response - you're right, looking at mosdepth's regions.bed.gz it gives out the DOC:

ABC80423        0       703     91.83

So I will use your formula to get an integer for that, and then it's done - thank you!

ADD REPLYlink written 6 weeks ago by Philipp Bayer0

Hi

that worked perfectly - here's a crappy Python script that takes all mosdepth output and writes one table:

from glob import glob
import gzip
from collections import OrderedDict

individuals = []

bin_dict = OrderedDict()
files = sorted(glob('*25000_mosdepth.regions.bed.gz')) # ASSUMPTION: MOSDEPTH FILES END IN 25000_mosdepth.regions.bed.gz
first = files[0]
rest_files = files[1:]

firstname = first.split('_')[0] #  ASSUMPTION: FILENAME IS LIKE SRR1234_1.fastq_sorted_by_25000_mosdepth.regions.bed.gz
individuals.append(firstname)
with gzip.open(first, 'rb') as fh:
    for line in fh:
        ll = line.split()
        # ['NC_000001.11', '0', '25000', '27.94']
        doc = float(ll[-1])
        bin_name = '%sXXX%sXXX%s'%(ll[0], ll[1], ll[2])
        start, end = ll[1], ll[2]
        window_length = abs(int(end)-int(start))
        # number of reads = DOC * window length / avg read length
        number_of_reads = str(int(doc * window_length / 100)) # ASSUMPTION: READS ARE 100BP LONG
        bin_dict[bin_name] = [number_of_reads]

for f in rest_files:
    thisname = f.split('_')[0]
    individuals.append(thisname)
    with gzip.open(f, 'rb') as fh:
        for line in fh:
            ll = line.split()
            doc = float(ll[-1])
            bin_name = '%sXXX%sXXX%s'%(ll[0], ll[1], ll[2])
            start, end = ll[1], ll[2]
            window_length = abs(int(end)-int(start))
            # number of reads = DOC * window length / avg read length
            number_of_reads = str(int(doc * window_length / 100)) # ASSUMPTION: READS ARE 100BP LONG
            bin_dict[bin_name].append(number_of_reads)

header = ['chr','start','end'] + individuals
print('\t'.join(header))
for b in bin_dict:
    thischr, start, end = b.split('XXX')
    ll = [thischr, start, end] + bin_dict[b]
    print('\t'.join(ll))

this prints to STDOUT so redirect using >

Load into R using

library(regioneR)
library(cn.mops)
df <- read.table('./CNMOPS_TABLE.xt', head=T)
X <- toGRanges(df)
resCNMOPS <- cn.mops(X)
resCNMOPS <- calcIntegerCopyNumbers(resCNMOPS)

Edit: Replaced the matrix input by bed input so that the resulting GRanges input has nice chromosome and basepair positions, instead of undef, and fixed a bug with the range calculation when a contig is smaller than 25000

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Philipp Bayer0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 83 users visited in the last hour