Tom,
Thanks for the feedback! All the GUIDE-seq data we analyzed has the sample barcode and UMI sequence appended at the end of the header of each read in the following order and orientation.
P7 index barcode (reversed) + P5 index barcode + UMI
The scripts binReads.sh and getBarode.pl are for parsing the read header and assigning reads to different samples using the P7 index and P5 index information, assuming the read header contains the above information.
Please let me know the format of your sequencing output, especially the files containing the P7 index and P5 index barcode, and I will see what can be modified to bin your reads. Thanks!
Please note that simply sorting reads based on perfect matches to the barcode indices may leave large number of reads unassigned due to sequencing errors within each index, especially if the barcode is long (e.g., 16 bases for GUIDE-seq). Many additional reads can be properly assigned if one or two mismatches are allowed in the index reads. To capture mutated indices, build a bowtie index of all the barcodes, then map the sequenced barcode portion to the barcode index allowing one mismatch. Below is the command to separate samples according to 16 base barcodes using bowtie1 with 8 threads allowing 1 mismatch.
./binReads.sh fastqFolder barcodes 1 8 16 p7.index p5.index usedBarcodes
where fastqFolder contains the fastq files and barcodes is the barcode index (barcode.bowtie1.index.tar.gz) that can be downloaded at http://mccb.umassmed.edu/GUIDE-seq/, index.p7 and index.p5 are text files containing the GUIDE-seq sample barcodes. If different barcodes are being utilized than present in the downloaded index.p7 and index.p5 files, then these files will need to be modified, and then a custom bowtie index will need to be generated by running the function createBarcodeFasta in GUIDEseq package followed by bowtie-build barcodes.fa barcodes with bowtie1. Please download getBarcode.pl and getUsedBarcodes.R (also available in GUIDEseq package) called in binReads.sh to the current working directory.
Please note that you need to have bowtie1 and R installed for this step, and bowtie2 [2] installed for mapping to the genome. If you are running a batch job for Platform LSF, you do not need to modify the script. Otherwise, change the "module load" command in binBarcode.sh to include bowtie1 and R in your search path.
More detailed description on generating input files for the GUIDEseq Bioconductor package can be found at https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-017-3746-y/MediaObjects/1286420173746MOESM1ESM.pdf.
Thanks!
Best regards,
Julie
Hi Julie,
Thanks for your reply.
As far as I know, GUIDEseq samples are typically sequenced using this configuration: 150/8/16/150 R1/Index1/Index2/R2 where Index 1 and 2 are the i7 and i5 sides, respectively. It appears that binReads.sh, getBarcode.pl and getUMI.pl do no read standard sequencing output (see below); getUMI.pl, for instance, looks for the UMI in a read header.
Standard output of R2 from undemultiplexed fastq: (truncated; notice the header has no index) @MN01027:30:000H2TM5J:1:11101:24286:1047 1:N:0:0 CATGGGGCACCTCGAGTGC + =FF6F6F66/AA/AF/FF/AFFF#F
It seems that even SRR1695666 from Tsai et al Nat biotech 2014 has a structure incompatible with these perl scripts: @SRR1695666.2 2 length=150 CAGTACGAATACAGACCGTGAA +SRR1695666.2 2 length=150 A?1>AFAAAAAFF1EG1AAF0FBB
From what I can see, Zhu et al BMC Genomics does not the address in the main text or Additional File 1 the question of bcl -> fastq generation and therefore the fastq structure required for input. Perhaps it requires manifest generation similar to https://github.com/aryeelab/guideseq#writing-a-manifest-file ? If you know how to generate compatible raw fastq files, I would very much like to understand how it works.
It would be helpful to discuss the generation of the fastq files used at the very-first step of this analysis (unbinnned, undemultiplexed). Also, is the raw sequencing data from Zhu et al available on SRA or another repository?
To address these issues of preprocessing, I developed the following workarounds: 1. Generation of a tab-separated UMI table. These UMIs can be found as the reverse-compliment of the first 8 bases in the I2 read: @MN01027:30:000H2TM5J:1:11101:24286:1047 CTAGGTA @MN01027:30:000H2TM5J:1:11101:20499:1048 CCTNNTT @MN01027:30:000H2TM5J:1:11101:10617:1048 CGACAAC ... 2. Standard dumultiplexing of sample (masking UMI) using bcl2fastq > sam > bam > merged.bam
Should these inputs work? Also, how will UMI mismatches (CCTNNTT) be interpreted by GUIDEseqAnalysis given these inputs?
Thanks again!
Tomás
Tom,
The sequence read header information were lost after being submitted to SRA and I had to email the authors to get the original GUIDE-seq fastq files for SRR1695666. Here is what a sequence read of GUIDE-seq look like, i.e., the end of the header line contains 8bp P7 index (reversed), 8bp P5 index, and 8bp UMI.
@HWI-M02096:147:000000000-AFU9U:1:1101:15747:1825 1:N:0:CGTACTAGCTCTCTATGGTTAAGT CTACAGCAGAGAGTGTTTAGGAAGCACCTTTTCGTTAGGCAGATTTTTATCAAAGTTAGTCATCCTTTCGATGAAGGACTGGGCAGAGGCCCCCTTATCCACGACTTCCTCGAAGTTCCAGGGAGTGATGGTCTCTTCTGATTTGCGAGT + 3>AAAFFFFC?FGFGGGGGGGGHHHHHHHHHHHGHHGGHHBEFHHHGHHHHHHHHHHGHHHHHHHHHHH?FHHHEHFFFHHHHHHGEHEGGGFGGGGHHHHHG/EGGHGHHC/?/FHHFFEHF/EEHEHEF2FHHGHHHHFBGGHHCFGG
The two-step workarounds you developed using bcl2fastq should work.
Regarding UMI mismatches, e.g., CCTNNTT, GUIDEseqAnalysis in the GUIDEseq package will collapse all reads with the same exact UMI, e.g., CCTNNTT within the same sample. N is not treated as wild card here, meaning that CCTNNTT and CCTAATT are considered as two different UMIs.
Best regards,
Julie