Thanks for the feedback! All the GUIDE-seq data we analyzed has the sample barcode and UMI sequence appended at the end of the header of each read in the following order and orientation.
P7 index barcode (reversed) + P5 index barcode + UMI
The scripts binReads.sh and getBarcode.pl parse the read header and assign reads to samples using the P7 and P5 index information. They assume the read header contains the information described above.
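To make the expected header layout concrete, here is a minimal Python sketch of how the trailing tag could be split into its three parts. It is only an illustration and not the logic of getBarcode.pl. The field lengths (8 + 8 + 8 bases), the assumption that the tag is the tail of the last whitespace-delimited header field, and the names parse_header and HeaderTags are all hypothetical; adjust them to match your own data.

```python
from typing import NamedTuple


class HeaderTags(NamedTuple):
    p7_reversed: str  # P7 index as stored in the header (reversed)
    p5: str           # P5 index
    umi: str          # unique molecular identifier


def parse_header(header: str, p7_len: int = 8, p5_len: int = 8,
                 umi_len: int = 8) -> HeaderTags:
    """Split the tail of a FASTQ header into P7 (reversed), P5 and UMI.

    Assumption: the three sequences are concatenated, in that order, at the
    very end of the last whitespace-delimited field. The default lengths are
    placeholders, not values taken from the GUIDE-seq scripts.
    """
    tail = header.strip().split()[-1]
    total = p7_len + p5_len + umi_len
    if len(tail) < total:
        raise ValueError("header is too short to hold P7 + P5 + UMI")
    tag = tail[-total:]
    p7 = tag[:p7_len]
    p5 = tag[p7_len:p7_len + p5_len]
    umi = tag[p7_len + p5_len:]
    return HeaderTags(p7, p5, umi)


if __name__ == "__main__":
    # Made-up header, used only to show the slicing.
    example = "@M01:1:000:1:1101:1000:2000 1:N:0:AAAACCCCGGGGTTTTACGTACGT"
    print(parse_header(example))
```

If your headers do not end with this kind of tag, the slicing above will return meaningless fragments, which is why the exact format of your sequencing output matters.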
Please let me know the format of your sequencing output, especially the files containing the P7 index and P5 index barcode, and I will see what can be modified to bin your reads. Thanks!
Please note that simply sorting reads based on perfect matches to the barcode indices may leave a large number of reads unassigned because of sequencing errors within each index, especially if the barcode is long (e.g., 16 bases for GUIDE-seq). Many additional reads can be properly assigned if one or two mismatches are allowed in the index reads. To capture mutated indices, build a bowtie index of all the barcodes, then map the sequenced barcode portion of each read to that index, allowing one mismatch. Below is the command to separate samples according to 16-base barcodes using bowtie1 with 8 threads, allowing 1 mismatch.
./binReads.sh fastqFolder barcodes 1 8 16 p7.index p5.index usedBarcodes
where fastqFolder contains the fastq files, barcodes is the barcode index (barcode.bowtie1.index.tar.gz) that can be downloaded at http://mccb.umassmed.edu/GUIDE-seq/, and p7.index and p5.index are text files containing the GUIDE-seq sample barcodes. If you use barcodes other than those in the downloaded p7.index and p5.index files, modify these files accordingly, then generate a custom bowtie index by running the function createBarcodeFasta in the GUIDEseq package, followed by bowtie-build barcodes.fa barcodes with bowtie1. Please download getBarcode.pl and getUsedBarcodes.R (also available in the GUIDEseq package), which are called in binReads.sh, to the current working directory.
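To illustrate why allowing a mismatch recovers reads that exact matching would drop, here is a small Python sketch. It uses a direct Hamming-distance comparison instead of the bowtie1 alignment that binReads.sh relies on, so it shows the idea only and is not a replacement for the pipeline. The sample names, barcode sequences, and the function assign_sample are made up for the example.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(observed: str, barcodes: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the single sample whose barcode is within max_mismatches.

    Returns None when no barcode is close enough, or when more than one
    barcode matches (an ambiguous read is safer left unassigned).
    """
    hits = [
        sample
        for sample, barcode in barcodes.items()
        if len(barcode) == len(observed)
        and hamming(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    # Hypothetical 16-base barcodes (P7 + P5 concatenated).
    barcodes = {
        "sampleA": "AAAACCCCGGGGTTTT",
        "sampleB": "TTTTGGGGCCCCAAAA",
    }
    for read_barcode in ("AAAACCCCGGGGTTTT",   # perfect match
                         "AAAACCCCGGGGTTTA",   # one sequencing error
                         "AAAACCCCGGGGTTAA"):  # two sequencing errors
        print(read_barcode, assign_sample(read_barcode, barcodes))
```

With max_mismatches=0 the second barcode would be left unassigned, which is the loss described above. Raising the threshold is only safe when the barcodes in your set differ from one another at enough positions that a mutated read cannot match two of them.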
Please note that you need bowtie1 and R installed for this step, and bowtie2 installed for mapping to the genome. If you are running a batch job on Platform LSF, you do not need to modify the script. Otherwise, change the "module load" command in binReads.sh so that bowtie1 and R are in your search path.
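Before launching the script outside of an LSF environment, a quick check like the following may help confirm that the required tools are visible on your search path. It is a generic sketch, not part of the GUIDE-seq scripts. It assumes the usual executable names (bowtie for bowtie1, bowtie2, and Rscript for R); your installation may name them differently.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Assumed executable names; adjust if your system uses different ones.
    required = ["bowtie", "bowtie2", "Rscript"]
    for name, path in find_tools(required).items():
        status = path if path else "NOT FOUND on PATH"
        print(f"{name}: {status}")
```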
A more detailed description of how to generate input files for the GUIDEseq Bioconductor package can be found at https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-017-3746-y/MediaObjects/1286420173746MOESM1ESM.pdf.
modified 4 weeks ago
4 weeks ago by
Julie Zhu • 4.0k