How about our ATAC-Seq data quality analyzed by ATACseqQC
1
0
Entering edit mode
Gary ▴ 20
@gary-7967
Last seen 3.6 years ago

Hi,

We produced a test ATAC-Seq data with 26.8 million 87bp paired-end reads using Chinese bulbul's feathers. I used Bowtie2 for the alignment and ATACseqQC for the quality analysis. The below are (1) fragment size distribution and (2) library complexity produced by ATACseqQC. I think that we have to optimize the ratio of cell number and Tn5 enzyme concentration. However, I don't know whether I have to increase or decrease Tn5 enzyme concentration. Also, does the library complexity result means that we have to increase the sequencing depth? Any suggestions are very welcome.

Best,

Gary

1
Entering edit mode

Hello Gary,

I am one of the co-developer of the ATACseqQC package. Your library fragment size plot is not optimal. It lacks fragments less than 100bp, which should be from OPEN chromatin regions. On the other hand your plots did not show a apparent ladder pattern of mono-,bi-, tri-,...nucleosomes, which suggests overtransposition occurred in your experiment. Two important questions are that 1) did you remove the duplicates? 2) did you have some size selection intentionally or unintentionally, such as Ampure magnet beads or DNA purification columns, which remove fragments smaller than 100bp? Personally, I don't suggest to do more sequencing with your current libraries. It will not give you information of location of completely open chromatin regions.

Haibo Liu

0
Entering edit mode

Dear Haibo,

Thank you for your very useful information. I have not removed duplicated reads in the fragment size distribution and library complexity figures.

We didn't make a size selection but used a Qiagen MinElute PCR Purification Kit for the purification as suggested by Buenrostro, et al. (2015). Do you think that the Qiagen kit removes our fragments smaller than 100bp? We only used two feather follicles for this pilot experiment. The number of cells could be very limited. Next time, we may use five feather follicles. Do you think that we should follow a low-cell number protocol without the Qiagen MinElute purification below (Buenrostro, et al. 2013)? Many thanks.

Low–cell number protocol. To prepare the 500- and 5,000-cell reactions, we used the same protocol with some notable exceptions: the transposition reaction was done in a 5µL instead of 50µL reaction. Also, we eliminated the Qiagen MinElute purification before PCR and instead took the 5µL reaction immediately after transposition directly into the 50µL PCR.

Reference

Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 2013;10(12):1213-8. Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current protocols in molecular biology / edited by Frederick M Ausubel [et al] 2015;109:21.9.1-9.

Best,

Gary

1
Entering edit mode
Julie Zhu ★ 4.3k
@julie-zhu-3596
Last seen 23 days ago
United States

Gary,

The library complexity plot indicates that the library is not complex enough. Therefore, increasing cell numbers should help.

Did you remove reads aligned to mitochondria? To find out whether there is too much/little Tn5 enzyme, could you please remove mitochondria reads and make additional plots as presented in https://bmcgenomics.biomedcentral.com/track/pdf/10.1186/s12864-018-4559-3? Thanks!

Best regards, Julie

0
Entering edit mode

Hi Julie,

Thank you so much. Usually, I use the command below to remove reads aligned to mitochondria.

samtools idxstats CTNNB1_G8_13.bam | cut -f 1 | grep -v chrM | xargs samtools view -b CTNNB1_G8_13.bam > CTNNB1_G8_13_woChrM.bam


However, there are only 32 scaffolds, and no one is annotated as the chrM in the draft bulbul genome built by Illumina short reads provided by our cooperator. Therefore, I cannot remove any reads mapped to mitochondria, because I don't know whether there are mitochondrial scaffolds or contigs in this draft bird genome. Unfortunately, there is no bulbul genome in the BSgenome. Consequently, I cannot run many functions of ATACseqQC.

This sample is our pilot testing sequenced by an Illumina MiSeq. If it is successful, we will sequence more samples using an Illumina HiSeq 2500. The concentration and volume of this ATAC-Seq library are 30.8 ng/µl and 6 µl, respectively. To get more cells is possible but very difficult because the tissue we dissected is very small. ATACseqQC is very helpful to evaluate our ATAC-Seq data quality. Below is its Bioanalyzer result. Could you provide more suggestion to improve our future ATAC-Seq experiments? Thanks again.

Best,

Gary

1
Entering edit mode

Dear Gary,

Is it possible to use the mitochondria genome from https://www.ncbi.nlm.nih.gov/pubmed/25103433 to map your sequenced reads? Regarding the concentration of transposase to use, please refer the paper https://www.nature.com/articles/nmeth.2688. Please read the section “Low–cell number protocol” if it is difficult for you to obtain more cells.

We work out our cell number and corresponding transposase by performing a titration experiment, i.e., using various amount of transposase with a selected cell number. As you may already realize, high relative concentration of transposase to cell number leads to over digestion and too many reads with small fragment size, and vice versa. Both Bioanalyzer plot and fragment size distribution should show periodicity and decreasing trend from small fragment size towards the larger fragment size.

According to our subsampling analysis results at https://bmcgenomics.biomedcentral.com/track/pdf/10.1186/s12864-018-4559-3, 3 million reads are good enough for diagnosis purpose. Therefore, you can sequence multiple libraries from your titration experiment with sample barcodes in one MiSeq run for QC purpose.

Hope this helps.

Best regards,

Julie

0
Entering edit mode

Dear Julie,

I appreciate your professional help very much. This time, I don't trim any sequence reads and use Bowtie2 with the very-sensitive-local parameter for the alignment. The percentage aligned reads are (1) 71.73% when I use Chinese bulbul genome only, (2) 10.92% when I use black-capped bulbul mitochondrial genome only, and (3) 81.63% when I use a chimeric genome including both Chinese bulbul genome and black-capped bulbul mitochondrial genome. Using bamQC in ATACseqQC, the $mitochondriaRate = 0.1325265 based on the chimeric genome. I draw a figure based on the$idxstats produced by bamQC below and believe that there is no mitochondrial DNA in the Chinese bulbul genome. Do you agree with me?

My colleague told me that she usually collects five feather follicles, but only samples two feather follicles for this ATAC-Seq testing experiment. Therefore, increasing cell numbers is OK with us.

Also, the duplicateRate = 0.3664853 produced by bamQC in ATACseqQC based on the chimeric genome. My colleague told me that the total PCR cycles is 15. The new fragment size distribution and library complexity based on the chimeric genome below for your reference. Based on the further information, do you know whether there is too much/little Tn5 enzyme in our experiment? I don't understand why you suggest that our library is not complex enough based on the library complexity plot. There is no saturation (i.e., no plateau) before 26.8 million (the real number of reads we have). Doesn't it mean that our library is complex enough and we have to sequence more reads to reach a plateau? Please teach me and many thanks. By the way, I used commands below to produce the library complexity plot. > bamfile <- "Bulbul.bam" > bamfile.label <- sub(".bam","",basename(bamfile)) > estimateLibComplexity(readsDupFreq(bamfile))  However, I don't know how to produce plots like fig4a and fig4b in Ou, et al. 2018. ATACseqQC: a Bioconductor package for post-alignment quality assessment of ATAC-seq data. BMC Genomics. Could you teach me? Many thanks. Best, Gary ADD REPLY 1 Entering edit mode Gary, Ideally, % mitochondria read is < 5% since the library should be depleted of mitochondria DNA. The fragment size distribution is not idea, which might be partially due to large % of mitochondria sequences or the ratio of Transposase to cell number. It is great that you can increase the cell number. Please note that the enzyme concentration needs to be increased accordingly. As for figure 4a and 4b, peaks need to be called for the whole dataset, and for the subsampled dataset at different depth. As for the complexity plot, please look at figure 4c. The sample with curve reaches close to 100 is the best library in terms of complexity. Hope this clarifies things a bit. Best regards, Julie  ADD REPLY 1 Entering edit mode Please also look at the code in the supplementary section of the paper. ADD REPLY 0 Entering edit mode Julie, Thank you so much. Below are the number of peaks and the total peak width of our ATAC-Seq sample (based on FDR < 10^-5). I think that we have to sequence more reads. Do you also think so? In addition, we only sequenced 26 million reads. May I know how do ATACseqQC extrapolate distinct fragments after 26 million reads? More importantly, how do I know 100 (Distinct fragments X10^3) is good, and our ~35 is not enough? Many thanks. Best, Gary ADD REPLY 1 Entering edit mode Gary, Yes, I think so too. You are going to increase cell numbers with proportional increase in transposase too, correct? No plateau observed in the saturation curve, implies that additional reads will likely lead to more peaks. The absolute number here is not as meaningful. BTW, I assume you followed our paper as the following before calling peaks. Remove reads meeting the following criteria: (1) reads aligning to the mitochondrial genome; (2) reads from PCR/optical duplicates; (3) reads with mapping quality less than or equal to 20; (4) read pairs aligned discordantly; and (5) read pairs with mapping template shorter than 38 bp or greater than 2000 bp. Best, Julie ADD REPLY 0 Entering edit mode Dear Julie, Thank you very much for your help. We will try the fixed cell number with a different amount of Tn5 enzyme. For sequence depth analysis (i.e., the number of peaks and total peak width), I used samtools and MACS2 commands below to remove (1) reads aligning to the black-capped bulbul mitochondrial genome, and (2) reads from PCR duplicates. samtools idxstats Bulbul.bam | cut -f 1 | grep -v chrM | xargs samtools view -b Bulbul.bam > Bulbul_woChrM.bam macs2 callpeak -t Bulbul_woChrM.bam --format BAMPE --gsize 1000000000 --keep-dup auto --outdir Bulbul_woChrM --name Bulbul_woChrM --verbose 3 --qvalue 0.00001 --cutoff-analysis  Unfortunately, I don't know how to remove (1) reads with mapping quality less than or equal to 20, (2) read pairs aligned discordantly, and (3) read pairs with mapping template shorter than 38 or greater than 2000 bp. I re-read your paper and believe that bamQC can do it. However, I don't find how to do it in the manual. Could you teach me how to generate a filtered BAM file using bamQC? Many thanks. Best, Gary ADD REPLY 2 Entering edit mode Hello Gary, You can find the scripts for filtering the unwanted reads from the supplementary file 1. Here are some script excerpts: #!/bin/bash #BSUB -n 8 # minimal numbers of processors required for a parallel job #BSUB -R rusage[mem=16000] # ask for memory 16G #BSUB -W 12:00 #limit the job to be finished in 12 hours #BSUB -J "bwamem[1-25]" #BSUB -q long # which queue we want to run in #BSUB -o logs/out.%J.%I.txt # log #BSUB -e logs/err.%J.%I.txt # error #BSUB -R "span[hosts=1]" # All hosts on the same chassis" module load samtools/1.4.1 i=((LSB_JOBINDEX - 1)) ## "nuclear.chr.txt" is a file containing a list of all human nuclear chromosomes and unplaced ## scaffolds (no mitochondrial genome included) grep '^@SQ' SRR3295022.sorted.sam |cut -f2 | perl -n -e 's/SN://; print if !/MT/' > nuclear.chr.txt chromosomes=(cat nuclear.chr.txt) sams=(ls *.sam ) names=(ls *.sam| perl -p -e 's/.sam//' ) ## converting alignment output from SAM to BAM samtools view -b -h -o{names[${i}]}.bam -@ 8 -1${sams[${i}]} ## sorting the BAM file samtools sort -l 9 -m 8G -o${names[${i}]}.sorted.bam -O BAM -@ 8${names[${i}]}.bam samtools index -@ 1${names[{i}]}.sorted.bam ## obtaining statistics of read alignments samtools flagstat -@ 8{names[${i}]}.sorted.bam >${names[${i}]}.prefilter.stat ## extracting mitochondrial reads and summarize mapping statistics samtools view -h${bam[${i}]} 'MT' >${names[${i}]}.MT.bam samtools flagstat -@ 8${names[${i}]}.MT.bam >${names[{i}]}.MT.bam.stat ## filtering BAM files to remove mitochondrial reads and other alignments of no interest samtools view -h -O SAM{bam[${i}]}${chromosomes[@]} | awk  'BEGIN{FS=OFS="\t"} \
function abs(v) {return v < 0 ? -v : v}; /^@/ || ($7 == "=" \ && ($2 == 81 || $2 == 161||$2 == 97 || $2 == 145 ||$2 ==99 || \
$2 == 147 ||$2 == 83 || $2 ==163) && abs($9) <= 2000 && abs($9) >= 38 &&$5 >=20 ) {print}' | \
samtools view  -h  -b -o ${names[${i}]}.chr.filtered.bam  -

## sorting and indexing the BAM file

samtools sort -l 9 -m 8G  -o  ${names[${i}]}.chr.filtered.sorted.bam  -O BAM -@ 8  ${names[${i}]}.chr.filtered.bam
samtools index  -@ 1  ${names[${i}]}.chr.filtered.sorted.bam


Haibo

0
Entering edit mode

Dear Haibo,

After removing mitochondrial reads, duplicated reads, and other alignments of no interest (the partial bamQC results below), I reproduce the fragment size distribution and library complexity plots below. The library complexity profile is very different. Based on them, could you help to interpret our ATAC-Seq data quality? Many thanks.

> bamQC(bamfile, doubleCheckDup = TRUE, mitochondria = "chrM", outPath=NULL)
$totalQNAMEs [1] 8237291$duplicateRate
[1] 8.255141e-06
$mitochondriaRate [1] 0$properPairRate
[1] 1
$unmappedRate [1] 0$hasUnmappedMateRate
[1] 0
$notPassingQualityControlsRate [1] 0$nonRedundantFraction
[1] 0.9986667
$PCRbottleneckCoefficient_1 [1] 0.9995323$PCRbottleneckCoefficient_2
[1] 3154.259


This time, I used MarkDuplicates (Picard) and Filter (Bamtools) functions on Galaxy to filter our bam file (the detail below). Personally, I use a MacBook Pro, and have samtools and awk below. However, your command line is too difficult for me to understand how to run it. Could you teach me how to use your command line to filter our bulbul.bam file? Thanks again.

gary > uname
Darwin
gary > sw_vers
ProductName:    Mac OS X
ProductVersion: 10.12.6
BuildVersion:   16G1408
gary > samtools --version
samtools 1.5
Using htslib 1.5
Copyright (C) 2017 Genome Research Ltd.
gary > awk --version
awk version 20070501
gary > pwd
/Users/gary/Documents/Project/20181219_Bulbul/20181219_BulbulATACseqLibraryQualityAnalysis/bam/Bowtie2_VerySenstiveLocal_X5000
gary > ll
total 33775080
-rw-r-----  1 gary  staff   2.3G Feb 25 21:32 bulbul.bam
-rw-r-----  1 gary  staff   1.0M Feb 25 20:35 bulbul.bam.bai
-rw-r--r--  1 gary  staff    14G Feb 26 14:55 bulbul.sam
gary > samtools view  -h -O SAM  ${bam[${i}]}   ${chromosomes[@]} | awk 'BEGIN{FS=OFS="\t"} function abs(v) {return v < 0 ? -v : v}; /^@/ || ($7 == "=" && ($2 == 81 ||$2 == 161|| $2 == 97 ||$2 == 145 || $2 ==99 ||$2 == 147 || $2 == 83 ||$2 ==163) && abs($9) <= 2000 && abs($9) >= 38 && $5 >=20 ) {print}' | samtools view -h -b -o${names[{i}]}.chr.filtered.bam gary >  Best, Gary ADD REPLY 1 Entering edit mode Hello Gary, You new plots based on filtered BAM looks better: many spiky peaks due to duplicates are gone. But you still missing the 38-100bp fragments, which was permanently removed by previous wet lab steps. A good tutorial for analyzing ATAC-seq data is here: https://informatics.fas.harvard.edu/atac-seq-guidelines.html; and a improved protocol for reduced mitochondrial reads is here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5623106/. You should look into literature about how to reduce losing small fragments (<100bp). I do have hands-on experience in generating ATAC-seq data. But I did analyzed dozens 0f ATAC-seq data. Based on your BAM QC output. You only have 8 M reads (pairs of reads?), which is definite a small number. In my opinion 30-50 M pairs of reads after removing the MT reads, duplicates and other inappropriately aligned reads are needed for a good analysis. But this is also depends on your biological question, if you are only interested in highly open chromatin regions (open in majority cells), 20 M pairs might be fine. Your library complexity plot suggest you definitely need more sequencing. my command is not that difficult to understand if you have some understanding of the Samtools and awk.  Haibo Haibo ADD REPLY 0 Entering edit mode Dear Haibo, I really appreciate your help in resolving our ATAC-Seq problem. The information you provided is very helpful. I totally agree with you that I have to learn more about the Samtools, awk, and Unix commands. Hopefully, I can use your command correctly in the near future. Thank you once again for your kind support. Best regards, Gary ADD REPLY 0 Entering edit mode Dear Haibo, Now I can run your command helped by my local friend (the detail below). However, we still don't understand the sub-command: awk 'BEGIN{FS=OFS="\t"} function abs(v) {return v < 0 ? -v : v}; /^@/. Does it mean that it removes duplicate reads in a BAM file? Many thanks for your kindly help. > samtools view -h -O SAM bulbul.bam chr{1..32} | awk 'BEGIN{FS=OFS="\t"} function abs(v) {return v < 0 ? -v : v}; /^@/ || (7 == "=" && ($2 == 81 ||$2 == 161|| $2 == 97 ||$2 == 145 || $2 ==99 ||$2 == 147 || $2 == 83 ||$2 ==163) && abs($9) <= 2000 && abs($9) >= 38 && $5 >=20 ) {print}' | samtools view -h -b -o bulbul_filtered.bam  Best, Gary ADD REPLY 0 Entering edit mode Hi Gary, I now filter bam before ATACseqQC and saw your question. I think the sub-command: awk 'BEGIN{FS=OFS="\t"} function abs(v) {return v < 0 ? -v : v}; /^@/. doesn't mean removes duplicate,it just function an absolute value function for the following abs($9) <= 2000 && abs($9) >= 38 and the removes duplicate steps is following ## removing duplicates from the filtered BAM file samtools rmdup${names[${i}]}.chr.filtered.sorted.bam${names[{i}]}.chr.filtered.sorted.rmdup.bam https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-018-4559-3/MediaObjects/1286420184559MOESM1ESM.txt in Additional file 1.enter link description here ADD REPLY 1 Entering edit mode Gary, Your command did not remove mitochondrial read alignment; it only output mitochondrial read alignments. One thin you can do is to do blast by using the black-capped bulbul mitochondrial genome as query to find out whether there is mitochondrial genome assembly in the draft genome assembly. If you have a draft genome, you can create a BSgenome by using the guide "https://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf". Haibo ADD REPLY 0 Entering edit mode Dear Haibo, Many thanks for your suggestion and information. I have tried blast on the web, but it doesn't work (the figures below). Maybe the file size is too large, and I have to try a standalone blast. But it needs more time for me to learn how to install and run a standalone blast. By the way, I use Bowtie2 to align ATAC-Seq reads on a chimeric genome, including (1) Chinese bulbul chr1 - chr32, and (2) black-capped bulbul mitochondrial genome. After that, I draw a figure based on theidxstats produced by bamQC below and believe that there is no or very few mitochondrial DNA sequence in the Chinese bulbul genome. Because chr1 - chr32 follow a linear regression, and only black-capped bulbul chrM is an outlier. Do you agree with me? Thanks for your help again.

Best,

Gary

1
Entering edit mode

Hello Gary,

 Yes, I agree with you that there is no or very few mitochondrial DNA sequence in the Chinese bulbul genome assembly.


Haibo

0
Entering edit mode

Dear Julie and Haibo,

Thank you for your very useful help. I am not sure whether you saw my responses. I integrate my additional questions below. Could you help me again? Many thanks.

After removing mitochondrial reads, duplicated reads, and other alignments of no interest (the partial bamQC results below), I reproduce the fragment size distribution and library complexity plots below. The library complexity profile is very different. Based on them, could you help to interpret our ATAC-Seq data quality? May I say that our library complexity is good enough? Many thanks.

> bamQC(bamfile, doubleCheckDup = TRUE, mitochondria = "chrM", outPath=NULL)
$totalQNAMEs [1] 8237291$duplicateRate
[1] 8.255141e-06
$mitochondriaRate [1] 0$properPairRate
[1] 1
$unmappedRate [1] 0$hasUnmappedMateRate
[1] 0
$notPassingQualityControlsRate [1] 0$nonRedundantFraction
[1] 0.9986667
$PCRbottleneckCoefficient_1 [1] 0.9995323$PCRbottleneckCoefficient_2
[1] 3154.259


This time, I used MarkDuplicates (Picard), and Filter (Bamtools) functions on Galaxy to filter our bam file (the detail below). I use a MacBook Pro and have samtools and awk below. However, your command line is too complicated for me to understand how to run it. Could you teach me how to use your command line to filter our bulbul.bam file? Thanks again.

gary > uname
Darwin
gary > sw_vers
ProductName:    Mac OS X
ProductVersion: 10.12.6
BuildVersion:   16G1408
gary > samtools --version
samtools 1.5
Using htslib 1.5
Copyright (C) 2017 Genome Research Ltd.
gary > awk --version
awk version 20070501
gary > pwd
/Users/gary/Documents/Project/20181219_Bulbul/20181219_BulbulATACseqLibraryQualityAnalysis/bam/Bowtie2_VerySenstiveLocal_X5000
gary > ll
total 33775080
-rw-r-----  1 gary  staff   2.3G Feb 25 21:32 bulbul.bam
-rw-r-----  1 gary  staff   1.0M Feb 25 20:35 bulbul.bam.bai
-rw-r--r--  1 gary  staff    14G Feb 26 14:55 bulbul.sam
gary > samtools view  -h -O SAM  ${bam[${i}]}   ${chromosomes[@]} | awk 'BEGIN{FS=OFS="\t"} function abs(v) {return v < 0 ? -v : v}; /^@/ || ($7 == "=" && ($2 == 81 ||$2 == 161|| $2 == 97 ||$2 == 145 || $2 ==99 ||$2 == 147 || $2 == 83 ||$2 ==163) && abs($9) <= 2000 && abs($9) >= 38 && $5 >=20 ) {print}' | samtools view -h -b -o${names[\${i}]}.chr.filtered.bam
gary >


We didn't make a size selection but used a Qiagen MinElute PCR Purification Kit for the purification as suggested by Buenrostro, et al. (2015). Do you think that the Qiagen kit removes our fragments smaller than 100bp? We only used two feather follicles for this pilot experiment. The number of cells could be very limited. Next time, we may use five feather follicles. Do you think that we should follow a low-cell number protocol without the Qiagen MinElute purification below (Buenrostro, et al. 2013)?

Low–cell number protocol. To prepare the 500- and 5,000-cell reactions, we used the same protocol with some notable exceptions: the transposition reaction was done in a 5µL instead of 50µL reaction. Also, we eliminated the Qiagen MinElute purification before PCR and instead took the 5µL reaction immediately after transposition directly into the 50µL PCR.

Reference Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 2013;10(12):1213-8. Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current protocols in molecular biology / edited by Frederick M Ausubel [et al] 2015;109:21.9.1-9.

Best regards,

Gary