The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: How about our ATAC-Seq data quality analyzed by ATACseqQC
0
13 days ago by
Gary0
Gary0 wrote:

Hi,

We produced a test ATAC-Seq data with 26.8 million 87bp paired-end reads using Chinese bulbul's feathers. I used Bowtie2 for the alignment and ATACseqQC for the quality analysis. The below are (1) fragment size distribution and (2) library complexity produced by ATACseqQC. I think that we have to optimize the ratio of cell number and Tn5 enzyme concentration. However, I don't know whether I have to increase or decrease Tn5 enzyme concentration. Also, does the library complexity result means that we have to increase the sequencing depth? Any suggestions are very welcome.

Best,

Gary

modified 13 days ago by Julie Zhu4.0k • written 13 days ago by Gary0
1

Hello Gary,

I am one of the co-developer of the ATACseqQC package. Your library fragment size plot is not optimal. It lacks fragments less than 100bp, which should be from OPEN chromatin regions. On the other hand your plots did not show a apparent ladder pattern of mono-,bi-, tri-,...nucleosomes, which suggests overtransposition occurred in your experiment. Two important questions are that 1) did you remove the duplicates? 2) did you have some size selection intentionally or unintentionally, such as Ampure magnet beads or DNA purification columns, which remove fragments smaller than 100bp? Personally, I don't suggest to do more sequencing with your current libraries. It will not give you information of location of completely open chromatin regions.

Haibo Liu

1
13 days ago by
Julie Zhu4.0k
United States
Julie Zhu4.0k wrote:

Gary,

The library complexity plot indicates that the library is not complex enough. Therefore, increasing cell numbers should help.

Did you remove reads aligned to mitochondria? To find out whether there is too much/little Tn5 enzyme, could you please remove mitochondria reads and make additional plots as presented in https://bmcgenomics.biomedcentral.com/track/pdf/10.1186/s12864-018-4559-3? Thanks!

Best regards, Julie

Hi Julie,

Thank you so much. Usually, I use the command below to remove reads aligned to mitochondria.

samtools idxstats CTNNB1_G8_13.bam | cut -f 1 | grep -v chrM | xargs samtools view -b CTNNB1_G8_13.bam > CTNNB1_G8_13_woChrM.bam


However, there are only 32 scaffolds, and no one is annotated as the chrM in the draft bulbul genome built by Illumina short reads provided by our cooperator. Therefore, I cannot remove any reads mapped to mitochondria, because I don't know whether there are mitochondrial scaffolds or contigs in this draft bird genome. Unfortunately, there is no bulbul genome in the BSgenome. Consequently, I cannot run many functions of ATACseqQC.

This sample is our pilot testing sequenced by an Illumina MiSeq. If it is successful, we will sequence more samples using an Illumina HiSeq 2500. The concentration and volume of this ATAC-Seq library are 30.8 ng/µl and 6 µl, respectively. To get more cells is possible but very difficult because the tissue we dissected is very small. ATACseqQC is very helpful to evaluate our ATAC-Seq data quality. Below is its Bioanalyzer result. Could you provide more suggestion to improve our future ATAC-Seq experiments? Thanks again.

Best,

Gary

1

Dear Gary,

Is it possible to use the mitochondria genome from https://www.ncbi.nlm.nih.gov/pubmed/25103433 to map your sequenced reads? Regarding the concentration of transposase to use, please refer the paper https://www.nature.com/articles/nmeth.2688. Please read the section “Low–cell number protocol” if it is difficult for you to obtain more cells.

We work out our cell number and corresponding transposase by performing a titration experiment, i.e., using various amount of transposase with a selected cell number. As you may already realize, high relative concentration of transposase to cell number leads to over digestion and too many reads with small fragment size, and vice versa. Both Bioanalyzer plot and fragment size distribution should show periodicity and decreasing trend from small fragment size towards the larger fragment size.

According to our subsampling analysis results at https://bmcgenomics.biomedcentral.com/track/pdf/10.1186/s12864-018-4559-3, 3 million reads are good enough for diagnosis purpose. Therefore, you can sequence multiple libraries from your titration experiment with sample barcodes in one MiSeq run for QC purpose.

Hope this helps.

Best regards,

Julie

Dear Julie,

I appreciate your professional help very much. This time, I don't trim any sequence reads and use Bowtie2 with the very-sensitive-local parameter for the alignment. The percentage aligned reads are (1) 71.73% when I use Chinese bulbul genome only, (2) 10.92% when I use black-capped bulbul mitochondrial genome only, and (3) 81.63% when I use a chimeric genome including both Chinese bulbul genome and black-capped bulbul mitochondrial genome. Using bamQC in ATACseqQC, the $mitochondriaRate = 0.1325265 based on the chimeric genome. I draw a figure based on the$idxstats produced by bamQC below and believe that there is no mitochondrial DNA in the Chinese bulbul genome. Do you agree with me?

My colleague told me that she usually collects five feather follicles, but only samples two feather follicles for this ATAC-Seq testing experiment. Therefore, increasing cell numbers is OK with us.

Also, the \$duplicateRate = 0.3664853 produced by bamQC in ATACseqQC based on the chimeric genome. My colleague told me that the total PCR cycles is 15. The new fragment size distribution and library complexity based on the chimeric genome below for your reference. Based on the further information, do you know whether there is too much/little Tn5 enzyme in our experiment? I don't understand why you suggest that our library is not complex enough based on the library complexity plot. There is no saturation (i.e., no plateau) before 26.8 million (the real number of reads we have). Doesn't it mean that our library is complex enough and we have to sequence more reads to reach a plateau? Please teach me and many thanks.

By the way, I used commands below to produce the library complexity plot.

> bamfile <- "Bulbul.bam"
> bamfile.label <- sub(".bam","",basename(bamfile))


However, I don't know how to produce plots like fig4a and fig4b in Ou, et al. 2018. ATACseqQC: a Bioconductor package for post-alignment quality assessment of ATAC-seq data. BMC Genomics. Could you teach me? Many thanks.

Best,

Gary

1
Gary,
Ideally, % mitochondria read is < 5% since the library should be depleted of mitochondria DNA.
The fragment size distribution is not idea, which might be partially due to large % of mitochondria sequences or the ratio of Transposase to cell number.
It is great that you can increase the cell number. Please note that the enzyme concentration needs to be increased accordingly.
As for figure 4a and 4b, peaks need to be called for the whole dataset, and for the subsampled dataset at different depth.
As for the complexity plot, please look at figure 4c. The sample with curve reaches close to 100 is the best library in terms of complexity.
Hope this clarifies things a bit.
Best regards,
Julie

1

Please also look at the code in the supplementary section of the paper.

Julie, Thank you so much. Below are the number of peaks and the total peak width of our ATAC-Seq sample (based on FDR < 10^-5). I think that we have to sequence more reads. Do you also think so?

In addition, we only sequenced 26 million reads. May I know how do ATACseqQC extrapolate distinct fragments after 26 million reads? More importantly, how do I know 100 (Distinct fragments X10^3) is good, and our ~35 is not enough? Many thanks. Best, Gary

1

Gary,

Yes, I think so too. You are going to increase cell numbers with proportional increase in transposase too, correct?

No plateau observed in the saturation curve, implies that additional reads will likely lead to more peaks. The absolute number here is not as meaningful.

BTW, I assume you followed our paper as the following before calling peaks. Remove reads meeting the following criteria: (1) reads aligning to the mitochondrial genome; (2) reads from PCR/optical duplicates; (3) reads with mapping quality less than or equal to 20; (4) read pairs aligned discordantly; and (5) read pairs with mapping template shorter than 38 bp or greater than 2000 bp.

Best,

Julie

1

Gary, Your command did not remove mitochondrial read alignment; it only output mitochondrial read alignments. One thin you can do is to do blast by using the black-capped bulbul mitochondrial genome as query to find out whether there is mitochondrial genome assembly in the draft genome assembly.

If you have a draft genome, you can create a BSgenome by using the guide "https://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf".

Haibo