Hi
I have some very large bam-files (>200TB in total) and cn.mops is taking a while, so I've been thinking of using mosdepth instead - is my approach like this valid/am I breaking any assumptions in cn.mops?
1) Run mosdepth with a 25000 bp window, identical to the ranges in the cn.mops manual (mosdepth -b 25000 –n output_prefix input_bam)
2) Load that into R –
library(GenomicRanges)
df <- read.table('sample1.bed')
colnames(df) <- c('chr', 'start', 'end', 'sample1')
#make GenomicRanges object
gr <- GRanges(seqnames=df$chr, ranges=IRanges(df$start+1, df$end), strand='*', sample1=df$sample1)
# make one big genomicranges object for all individuals
df <- read.table(‘sample2.bed’)
colnames(df) <- c('chr', 'start', 'end', 'sample2’)
gr2 <- GRanges(seqnames=df$chr, ranges=IRanges(df$start+1, df$end), strand='*', sample2=df$sample2)
gr$sample2 <- gr2$sample2 # mosdepth guarantees that all windows/IRanges are the same for all individuals
# and so on for all individuals in a loop
resCNMOPS <- cn.mops(gr)
# follow manual from here
gr looks like this:
> gr
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | sample1 sample2
<rle> <iranges> <rle> | <integer> <integer>
[1] chr1 1-10000 * | 5 10
[2] chr1 10002-20000 * | 10 1000
Would this work?

Thank you for your fast response - you're right, looking at mosdepth's regions.bed.gz it gives out the DOC:
So I will use your formula to get an integer for that, and then it's done - thank you!
Hi
that worked perfectly - here's a crappy Python script that takes all mosdepth output and writes one table:
from glob import glob import gzip from collections import OrderedDict individuals = [] bin_dict = OrderedDict() files = sorted(glob('*25000_mosdepth.regions.bed.gz')) # ASSUMPTION: MOSDEPTH FILES END IN 25000_mosdepth.regions.bed.gz first = files[0] rest_files = files[1:] firstname = first.split('_')[0] # ASSUMPTION: FILENAME IS LIKE SRR1234_1.fastq_sorted_by_25000_mosdepth.regions.bed.gz individuals.append(firstname) with gzip.open(first, 'rb') as fh: for line in fh: ll = line.split() # ['NC_000001.11', '0', '25000', '27.94'] doc = float(ll[-1]) bin_name = '%sXXX%sXXX%s'%(ll[0], ll[1], ll[2]) start, end = ll[1], ll[2] window_length = abs(int(end)-int(start)) # number of reads = DOC * window length / avg read length number_of_reads = str(int(doc * window_length / 100)) # ASSUMPTION: READS ARE 100BP LONG bin_dict[bin_name] = [number_of_reads] for f in rest_files: thisname = f.split('_')[0] individuals.append(thisname) with gzip.open(f, 'rb') as fh: for line in fh: ll = line.split() doc = float(ll[-1]) bin_name = '%sXXX%sXXX%s'%(ll[0], ll[1], ll[2]) start, end = ll[1], ll[2] window_length = abs(int(end)-int(start)) # number of reads = DOC * window length / avg read length number_of_reads = str(int(doc * window_length / 100)) # ASSUMPTION: READS ARE 100BP LONG bin_dict[bin_name].append(number_of_reads) header = ['chr','start','end'] + individuals print('\t'.join(header)) for b in bin_dict: thischr, start, end = b.split('XXX') ll = [thischr, start, end] + bin_dict[b] print('\t'.join(ll))this prints to STDOUT so redirect using >
Load into R using
library(regioneR) library(cn.mops) df <- read.table('./CNMOPS_TABLE.xt', head=T) X <- toGRanges(df) resCNMOPS <- cn.mops(X) resCNMOPS <- calcIntegerCopyNumbers(resCNMOPS)Edit: Replaced the matrix input by bed input so that the resulting GRanges input has nice chromosome and basepair positions, instead of
undef, and fixed a bug with the range calculation when a contig is smaller than 25000