Question

could not duplicate the fragment length in "Reads to Regions" csaw tutorial

0

Entering edit mode

richardallenfriedmanbrooklyn ▴ 20

@richardallenfriedmanbrooklyn-24118

Last seen 8 months ago

United States

Dear List,

I am trying to duplicate the analysis using csaw in the paper, "From reads to regions", https://pubmed.ncbi.nlm.nih.gov/26834993/ I received different values than the paper for the number of mapped and marked reads, the fragment length, and cross-correlation plot. I am sticking as closely as I can to the protocol. I am asking the discrepancy is due to a change in the software, an error on my part, and most importantly if it compromises the analysis in the rest of the protocol. I am pasting in the script and then giving the problematic output preceded by a '."

sra.numbers <- c("SRR499718", "SRR499719", "SRR499720", "SRR499721",
    "SRR499734", "SRR499735", "SRR499736", "SRR499737", "SRR499738")
grouping <- c("proB-8113", "proB-8113", "proB-8108", "proB-8108",
    "matureB-8059", "matureB-8059", "matureB-8059", "matureB-8059", "matureB-8086")

all.sra <- sra.numbers
data.frame(SRA=all.sra, Condition=grouping)

for (sra in all.sra) {
    code <- system(paste("/Applications/sratoolkit.2.10.9-mac64/bin/fastq-dump", sra))
     stopifnot(code==0L)
}
all.fastq <- paste0(sra.numbers, ".fastq")

by.group <- split(all.fastq, grouping)
for (group in names(by.group)) {
    code <- system(paste(c("cat", by.group[[group]], ">",
          paste0(group, ".fastq")), collapse=" "))
     stopifnot(code==0L)
}
group.fastq <- paste0(names(by.group), ".fastq")

library(Rsubread)
bam.files <- paste0(names(by.group), ".bam")
align(index="mm10", readfile1=group.fastq, TH1=2, type=1,
    input_format="FASTQ", output_file=bam.files)


library(Rsamtools)

for (bam in bam.files) {
       out <- suppressWarnings(sortBam(bam, "h3k9ac_temp"))
    file.rename(out, bam)
}


# Mark duplicates

temp.bam <- "h3k9ac_temp.bam"
temp.file <- "h3k9ac_metric.txt"
temp.dir <- "h3k9ac_working"
dir.create(temp.dir)
for (bam in bam.files) {
    code <- system(sprintf(
       "java -jar /Applications/picard.jar MarkDuplicates I=%s O=%s M=%s \\
      TMP_DIR=%s AS=true REMOVE_DUPLICATES=false \\
      VALIDATION_STRINGENCY=SILENT", bam, temp.bam,
      temp.file, temp.dir))
    stopifnot(code==0L)
     file.rename(temp.bam, bam)
}

# count reads and generate diagnostics

diagnostics <- list()
for (bam in bam.files) {
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
     flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
     flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[bam]] <- c(Total=total, Mapped=mapped, Marked=marked)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats
> diag.stats
                    Total   Mapped  Marked Prop.mapped Prop.marked
matureB-8059.bam 16675372 12814838 2315308    76.84889   18.067400
matureB-8086.bam  6347683  5749367  392871    90.57426    6.833291
proB-8108.bam    10413135  9755975  611246    93.68912    6.265350
proB-8113.bam    10724526 10088710  730440    94.07138    7.240172

Results for Mapped and Marked differ from the published paper.

indexBam(bam.files)


library(rtracklayer)
ch <- import.chain("mm9ToMm10.over.chain")
original <- import("mm9-blacklist.bed")
blacklist <- liftOver(x=original, chain=ch)
blacklist <- unlist(blacklist)
saveRDS(file="mm10-blacklist.rds", blacklist)


celltype <- sub("-.*",  "", bam.files)
data.frame(BAM=bam.files,  CellType=celltype)
targets <- data.frame(BAM=bam.files,  CellType=celltype)

library(csaw)
standard.chr <- paste0("chr", c(1:19, "X", "Y"))
param <- readParam(minq=50, discard=blacklist,restrict=standard.chr)

x <- correlateReads(bam.files, param=reform(param, dedup=TRUE))
frag.len <- which.max(x) - 1
> frag.len
[1] 0

since I got a fragment length of 0, I used a slightly different "correlateReads" command.

 x <- correlateReads(bam.files,param=readParam(dedup=TRUE))
> frag.len <- which.max(x) - 1
> frag.len
[1] 20

This number differs from the frag.len of 148 in the paper. I then plotted the cross-correlation function:

plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
abline(v=frag.len, col="red")
text(x=frag.len, y=min(x), paste(frag.len, "bp"), pos=4, col="red")

cross-correlation plot It is seen that there is a spike at 20 bp and another max at about 180. Is there a way to resolve this discrepancy? If not, will this discrepancy compromise the csaw analysis?

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] csaw_1.24.3                 SummarizedExperiment_1.20.0 Biobase_2.50.0              MatrixGenerics_1.2.1       
 [5] matrixStats_0.58.0          rtracklayer_1.50.0          Rsamtools_2.6.0             Biostrings_2.58.0          
 [9] XVector_0.30.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.7         IRanges_2.24.1             
[13] S4Vectors_0.28.1            BiocGenerics_0.36.1         Rsubread_2.4.3             

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.4           Rcpp_1.0.6               lattice_0.20-44          prettyunits_1.1.1        assertthat_0.2.1        
 [6] utf8_1.2.1               BiocFileCache_1.14.0     R6_2.5.0                 RSQLite_2.2.7            httr_1.4.2              
[11] pillar_1.6.0             zlibbioc_1.36.0          rlang_0.4.11             GenomicFeatures_1.42.3   progress_1.2.2          
[16] curl_4.3.1               blob_1.2.1               Matrix_1.3-3             BiocParallel_1.24.1      stringr_1.4.0           
[21] RCurl_1.98-1.3           bit_4.0.4                biomaRt_2.46.3           DelayedArray_0.16.3      compiler_4.0.5          
[26] pkgconfig_2.0.3          askpass_1.1              openssl_1.4.4            tidyselect_1.1.1         tibble_3.1.1            
[31] GenomeInfoDbData_1.2.4   edgeR_3.32.1             XML_3.99-0.6             fansi_0.4.2              crayon_1.4.1            
[36] dplyr_1.0.6              dbplyr_2.1.1             rappdirs_0.3.3           GenomicAlignments_1.26.0 bitops_1.0-7            
[41] grid_4.0.5               lifecycle_1.0.0          DBI_1.1.1                magrittr_2.0.1           stringi_1.6.1           
[46] cachem_1.0.4             limma_3.46.0             xml2_1.3.2               ellipsis_0.3.2           vctrs_0.3.8             
[51] generics_0.1.0           tools_4.0.5              bit64_4.0.5              glue_1.4.2               purrr_0.3.4             
[56] hms_1.0.0                fastmap_1.1.0            AnnotationDbi_1.52.0     BiocManager_1.30.15      memoise_2.0.0           
>

Thanks and best wishes,

Rich

Richard Friedman,

Columbia University Cancer Center

chipseq csaw; crosscorrelation • 1.2k views

ADD COMMENT • link 3.0 years ago • updated 2.9 years ago richardallenfriedmanbrooklyn ▴ 20

score 1 · Answer 1 · 2021-05-14

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

Changes in the mapping proportions aren't surprising, the aligner has probably changed a fair bit in the last 6 years since the paper was published. (The relevant BAM files are probably even older, because I remember using them for a while during my PhD.) In fact, I remember regenerating the files 3 years ago for the chipseqDBData package, and the resulting mapping statistics were again quite different from those published in the paper.

I decided not to worry about it, given that the rest of the paper seemed fine. In your case, I suspect the phantom peak at 20 bp represents the read length; try using maximizeCcf() to get the actual fragment length. I also wouldn't be surprised if the minq=50 is too high for the new Rsubread output; you'll have to check what the "reasonable" range of MAPQ values are now.

ADD COMMENT • link 2.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron.

I redid the fragment length calculation with the protocol in your online book and with the bam files provided by chipseqDBData and I got exactly the fragment length and graph that was in the online book.

I redid the fragment length calculation with the bam files I had generated, and I got a reasonable fragment length (149) and the attached ccf graph.

ccf plot with my alignment

Does the spike at low distances present a problem insofar as further analyses are concerned.

Also I am still bothered by the discrepancy between number of mapped reads between the published "reads to regions paper", my own analysis, and the online book. Here is an example:

What I got: Total Mapped
matureB-8059.bam 16675372 12814838

online book. Total Mapped h3k9ac-matureB-8059 16675372 4670364

Reads to regions paper. Total Mapped
bmatureB-8059.bam 16675372 7752077

The number of mapped reads differ by almost a factor of 3!

I cannot help but suspect that this discreapncy is not simply due to chnages in Rsubread but also differences in the input parameters. was

align(index="index/mm10", readfile1=group.fastq, TH1=2, type=1, input_format="FASTQ", output_file=bam.files)

use to perform the alignment?

May I suggest that an up-to-date protocol from the sra files to the bam files be posted, so that users of csaw know exactly what to do?

Thanks and best wishes,

Rich

ADD REPLY • link 2.9 years ago richardallenfriedmanbrooklyn ▴ 20

1

Entering edit mode

Does the spike at low distances present a problem insofar as further analyses are concerned.

No. This is just a phantom peak, see the ENCODE papers for a discussion of why this occurs.

May I suggest that an up-to-date protocol from the sra files to the bam files be posted, so that users of csaw know exactly what to do?

Much easier said than done. As it is, the book already takes some effort to maintain and that doesn't even deal with the vagaries of the preprocessing software. I would not be willing to regenerate the BAM files from scratch at every release cycle and check that the results are consistent - this is simply not reasonable.

The most I can offer are the scripts that were used to generate the files in the chipseqDBData package. Version numbers of the software are embedded in the BAM files themselves; for example, for the H3K4me3 dataset, I see Rsubread 1.33.9 and MarkDuplicates 2.17.11.