I'm working on a small workflow to look at random CRISPR guide sequences. Essentially, I'm generating all of the putative CRISPR sites on a particular chromosome and would like to be able to manipulate them a little. Ultimately, my goal is to find CRISPR guides that cut at multiple locations in the genome. I know these may not occur frequently, but they certainly occur in repetitive regions of the genome such as rDNA. I've managed to get R to give me the locations and sequences of all the sites on chromosome 1, but that's where I'm stuck.
1) How can I force the output "table" to be written in a tab-delimited format that could theoretically go into Excel if it weren't so huge? I've toyed around with the data.frame and write.table commands, but haven't had much success. These are a bit confusing for a beginner.
2) Can I take the output and force R to find those sequences that are duplicated (i.e. the far right column)? Can it bin them into groups depending on the number of times a particular pattern is repeated?
3) Since this should be a more manageable list, how do I send the output of these duplicated sequences to a tab-delimited file? In other words, can I essentially create a setup where I have a list of CRISPR guide sequences that are repeated 2 or more times on this particular chromosome?
4) Can I expand this to work on the whole genome? (I tried to simplify to start.)
The Script:
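# 21 N's followed by GG (a 23-letter guide-plus-PAM pattern)
p1 = "nnnnnnnnnnnnnnnnnnnnngg"
library(BSgenome.Hsapiens.UCSC.hg38)
chr1 <- Hsapiens[["chr1"]]
# remove any masks so the whole sequence is searched
masks(chr1) <- NULL
# fixed="subject" lets the N's in the pattern act as wildcards
allsites <- matchPattern(p1, chr1, fixed="subject")
allsites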
The output:
Views on a 248956422-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
                 start       end width
       [1]       10451     10473    23 [AACCCTAACCCTAACCCTCGCGG]
       [2]       10464     10486    23 [ACCCTCGCGGTACCCTCAGCCGG]
       [3]       10477     10499    23 [CCTCAGCCGGCCCGCCCGCCCGG]
       [4]       10478     10500    23 [CTCAGCCGGCCCGCCCGCCCGGG]
       [5]       10490     10512    23 [GCCCGCCCGGGTCTGACCTGAGG]
       ...         ...       ...   ... ...
[12491308]   248946388 248946410    23 [AGGGTTAGGGTTAGGGTTAAGGG]
[12491309]   248946393 248946415    23 [TAGGGTTAGGGTTAAGGGTTAGG]
[12491310]   248946394 248946416    23 [AGGGTTAGGGTTAAGGGTTAGGG]
[12491311]   248946399 248946421    23 [TAGGGTTAAGGGTTAGGGTTAGG]
[12491312]   248946400 248946422    23 [AGGGTTAAGGGTTAGGGTTAGGG]
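To make questions 1 to 3 concrete, here is a rough sketch in base R of the kind of result I'm after. It runs on a small made-up table rather than the real matches: the data frame `sites`, its coordinates, and the file names are all hypothetical stand-ins, and only the sequences are borrowed from the output above. I'm assuming the real table could be built from the Views object with something like start(allsites), end(allsites), and as.character(allsites).

```r
# Toy stand-in for the matchPattern() result. The coordinates and the
# repeats are invented for illustration; they are NOT real chr1 hits.
sites <- data.frame(
  start = c(101L, 250L, 400L, 575L, 900L, 1200L),
  end   = c(123L, 272L, 422L, 597L, 922L, 1222L),
  seq   = c("AACCCTAACCCTAACCCTCGCGG",
            "ACCCTCGCGGTACCCTCAGCCGG",
            "AACCCTAACCCTAACCCTCGCGG",
            "CCTCAGCCGGCCCGCCCGCCCGG",
            "AACCCTAACCCTAACCCTCGCGG",
            "ACCCTCGCGGTACCCTCAGCCGG"),
  stringsAsFactors = FALSE
)

# 1) Write the full table as tab-delimited text (hypothetical file name).
all_file <- file.path(tempdir(), "all_sites.tsv")
write.table(sites, file = all_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# 2) Count how many times each sequence occurs, then keep the
#    sequences that occur two or more times, most frequent first.
counts <- table(sites$seq)
repeated <- data.frame(
  seq = names(counts),
  n   = as.integer(counts),
  stringsAsFactors = FALSE
)
repeated <- repeated[repeated$n >= 2, ]
repeated <- repeated[order(-repeated$n), ]

# Bin the repeated sequences by repeat count:
# how many distinct sequences occur 2 times, 3 times, and so on.
bins <- table(repeated$n)

# 3) Write only the repeated sequences to a second tab-delimited file.
rep_file <- file.path(tempdir(), "repeated_sites.tsv")
write.table(repeated, file = rep_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

Counting with table() gives both the duplicates and their bin sizes in one pass, whereas duplicated() alone would only flag the repeats. I don't know whether this approach stays practical for the roughly 12.5 million sites on chr1, or for the whole genome.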
Hi Julie,
GUIDEseq successfully installed. However, the 1st step requires .bed & .bam files as input, while all we have are .fastq raw data files from HiSeq.
How should I proceed?
Thanks!
-- Mo