Question

mosdepth output into cn.mops?

0

Entering edit mode

Philipp Bayer • 0

@philipp-bayer-14945

Last seen 6.7 years ago

Australia/Perth/UWA

Hi

I have some very large bam-files (>200TB in total) and cn.mops is taking a while, so I've been thinking of using mosdepth instead - is my approach like this valid/am I breaking any assumptions in cn.mops?

1) Run mosdepth with a 25000 bp window, identical to the ranges in the cn.mops manual (mosdepth -b 25000 –n output_prefix input_bam)

2) Load that into R –

library(GenomicRanges)
df <- read.table('sample1.bed')
colnames(df) <- c('chr', 'start', 'end', 'sample1')
#make GenomicRanges object
gr <- GRanges(seqnames=df$chr, ranges=IRanges(df$start+1, df$end), strand='*', sample1=df$sample1)

# make one big genomicranges object for all individuals
df <- read.table(‘sample2.bed’)
colnames(df) <- c('chr', 'start', 'end', 'sample2’)
gr2 <- GRanges(seqnames=df$chr, ranges=IRanges(df$start+1, df$end), strand='*', sample2=df$sample2)
gr$sample2 <- gr2$sample2 # mosdepth guarantees that all windows/IRanges are the same for all individuals

# and so on for all individuals in a loop

resCNMOPS <- cn.mops(gr)
# follow manual from here

gr looks like this:

> gr
GRanges object with 2 ranges and 2 metadata columns:
      seqnames      ranges strand |   sample1   sample2
         <rle>   <iranges>  <rle> | <integer> <integer>
  [1]     chr1     1-10000      * |         5        10
  [2]     chr1 10002-20000      * |        10      1000

Would this work?

cn.mops • 2.0k views

ADD COMMENT • link updated 6.7 years ago by Günter Klambauer ▴ 540 • written 6.7 years ago by Philipp Bayer • 0

score 2 · Accepted Answer · 2019-04-02

2

Entering edit mode

Günter Klambauer ▴ 540

@gunter-klambauer-5426

Last seen 4.9 years ago

Austria

Dear Philipp,

Yes it is very well possible and also reasonable to use the output of mosdepth as input for cn.mops.

There is only one thing to check: does mosdepth return "depth of coverage" or "readcounts"? From the GRanges object it seems that these are integer numbers and therefore read counts. cn.mops assumes that inputs are read counts. If it is "depth of coverage" that is returned by mosdepth, then please convert them to read counts using the simple relation

DOC = number of reads * avg read length / window length

Thus,

number of reads = DOC * window length / avg read length

Good luck with the further analysis!

Regards, Günter

ADD COMMENT • link 6.7 years ago Günter Klambauer ▴ 540

0

Entering edit mode

Thank you for your fast response - you're right, looking at mosdepth's regions.bed.gz it gives out the DOC:

ABC80423        0       703     91.83

So I will use your formula to get an integer for that, and then it's done - thank you!

ADD REPLY • link 6.7 years ago Philipp Bayer • 0

0

Entering edit mode

Hi

that worked perfectly - here's a crappy Python script that takes all mosdepth output and writes one table:

from glob import glob
import gzip
from collections import OrderedDict

individuals = []

bin_dict = OrderedDict()
files = sorted(glob('*25000_mosdepth.regions.bed.gz')) # ASSUMPTION: MOSDEPTH FILES END IN 25000_mosdepth.regions.bed.gz
first = files[0]
rest_files = files[1:]

firstname = first.split('_')[0] #  ASSUMPTION: FILENAME IS LIKE SRR1234_1.fastq_sorted_by_25000_mosdepth.regions.bed.gz
individuals.append(firstname)
with gzip.open(first, 'rb') as fh:
    for line in fh:
        ll = line.split()
        # ['NC_000001.11', '0', '25000', '27.94']
        doc = float(ll[-1])
        bin_name = '%sXXX%sXXX%s'%(ll[0], ll[1], ll[2])
        start, end = ll[1], ll[2]
        window_length = abs(int(end)-int(start))
        # number of reads = DOC * window length / avg read length
        number_of_reads = str(int(doc * window_length / 100)) # ASSUMPTION: READS ARE 100BP LONG
        bin_dict[bin_name] = [number_of_reads]

for f in rest_files:
    thisname = f.split('_')[0]
    individuals.append(thisname)
    with gzip.open(f, 'rb') as fh:
        for line in fh:
            ll = line.split()
            doc = float(ll[-1])
            bin_name = '%sXXX%sXXX%s'%(ll[0], ll[1], ll[2])
            start, end = ll[1], ll[2]
            window_length = abs(int(end)-int(start))
            # number of reads = DOC * window length / avg read length
            number_of_reads = str(int(doc * window_length / 100)) # ASSUMPTION: READS ARE 100BP LONG
            bin_dict[bin_name].append(number_of_reads)

header = ['chr','start','end'] + individuals
print('\t'.join(header))
for b in bin_dict:
    thischr, start, end = b.split('XXX')
    ll = [thischr, start, end] + bin_dict[b]
    print('\t'.join(ll))

this prints to STDOUT so redirect using >

Load into R using

library(regioneR)
library(cn.mops)
df <- read.table('./CNMOPS_TABLE.xt', head=T)
X <- toGRanges(df)
resCNMOPS <- cn.mops(X)
resCNMOPS <- calcIntegerCopyNumbers(resCNMOPS)

Edit: Replaced the matrix input by bed input so that the resulting GRanges input has nice chromosome and basepair positions, instead of undef, and fixed a bug with the range calculation when a contig is smaller than 25000

ADD REPLY • link 6.7 years ago Philipp Bayer • 0