Hello everyone, I am planning to use spike-in nuclei from Drosophila (Active Motif) in my mouse sperm samples to create ATAC-seq libraries. The manual suggests using 10,000 Drosophila nuclei for every 100,000 diploid cells. So my first question is: 1) if I use 100,000 sperm, since they are haploid, should I use 50,000 Drosophila nuclei?
The protocol below gives normalization suggestions that are too vague for my level of knowledge, and for some steps it is not clear to me how to carry out the analysis, e.g. which code to use. Can someone please help me with the questions below (in bold)?
Here are the manual's steps:
Map the ATAC-Seq data to the test reference genome (e.g. human, mouse, or other).
Map ATAC-Seq data to the Drosophila reference genome.
Count uniquely aligning Drosophila sequence tags and identify the sample containing the least number of tags. **--> When they say tags, do they mean the total number of reads after bowtie2 and before filtering? Or after filtering, i.e. the reads remaining after bowtie2 and after removal of duplicates, unpaired reads, low-mapping-quality reads and chrM/chrY?**
Divide the aligned Drosophila tag value from the sample with the lowest Drosophila tag count by the Drosophila tag count value from all other samples to generate a normalization factor for each sample. (Sample 1 with lowest tag count / Sample 2) = Normalization factor. The sample with the lowest Drosophila tag count will have a normalization factor of 1.
Generate the normalization factors for all samples using the strategy from step 6.
Use the normalization factors to down-sample the read counts for each sample. **--> How can I down-sample the read counts for each sample?**
After obtaining normalized mouse read counts, use a standard ATAC-Seq pipeline starting with the downsampled tag counts for each sample for peak calling and generation of bigWigs.
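To make the question concrete, here is my current understanding of the factor calculation and down-sampling steps as a minimal Python sketch. The sample names, the Drosophila tag counts, and the helper functions `normalization_factors` and `downsample` are all made up for illustration and do not come from the manual. The sketch also assumes that down-sampling means keeping each read (or read pair) at random with probability equal to the sample's normalization factor, which is my interpretation and may not be what the manual intends.

```python
import random


def normalization_factors(spikein_counts):
    """Divide the lowest Drosophila tag count by each sample's count.

    The sample with the lowest count gets a factor of 1.0.
    """
    lowest = min(spikein_counts.values())
    return {sample: lowest / count for sample, count in spikein_counts.items()}


def downsample(read_names, factor, seed=0):
    """Keep each read (or read-pair name) with probability `factor`.

    Assumption: random subsampling is what "down-sample" means here.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [name for name in read_names if rng.random() < factor]


# Hypothetical Drosophila tag counts per sample (illustration only).
spikein_counts = {"sample1": 200_000, "sample2": 400_000, "sample3": 250_000}

factors = normalization_factors(spikein_counts)
for sample, factor in factors.items():
    print(sample, round(factor, 3))

# Toy stand-in for the mouse reads of sample2.
reads = [f"read{i}" for i in range(10_000)]
kept = downsample(reads, factors["sample2"])
print(len(kept), "of", len(reads), "reads kept for sample2")
```

On real BAM files, I believe the same fraction could be passed to `samtools view -s`, where the integer part of the value is the random seed and the decimal part is the fraction of read pairs to keep (e.g. `-s 42.5` for seed 42 and fraction 0.5), but I would appreciate confirmation that this is the right approach.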
I would be deeply grateful if you could suggest code for some of these parts. Has anyone done something similar? Thank you very much for your time,
Katerina