Hi,
I am using the GenomicAlignments package to analyze deep sequencing HIV data (bam input format). My overarching goal is to evaluate the frequency and relative position of a given nucleotide (cytosine) and to extract the 5 nucleotide on both sides of each C for all reads (along with the relative position).
Thanks to your precious help, I was able to scan the bam file ("scanBam") and generate the nucleotide freq (including C's) with "alphabetFrequencyFromBam".
I did the following so far for HIV sequences covering 300bp of the genome:
<which <- GRanges("HIV:1-300")
p1 <- ScanBamParam(which=which, what=scanBamWhat())
res1 <- scanBam(bamfile1, param=p1)
alphabetFrequencyFromBam(bamfile1, index=bamfile1, param=p1)>
The output looks like that:
A C G T M R W S Y K V H D B N - + .
[1,] 0 0 27 372 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 0 0 399 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 1 0 0 399 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 412 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 0 1 0 489 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(...)
Next, I want to extract the ±5nucleotides on both sides of each cytosine (rare) on all the sequence reads and keep track (if possible) of the relative position.
As suggested by one of your great contributor, I tried the following:
which <- GRanges(“HIV", resize(IRanges(genome == "C", width=1L), 5L,
fix="center"))
But it raises the following error:
Error in genome == "C" :
comparison (1) is possible only for atomic and list types
The problem has probably to do with that "genome" is not considered as a character vector of my HIV sequences input bam file. Does that make sense? would you have any suggestion?
Please let me know if you need more info.
Thanks in advance and looking forward to reading you and all great insights!
a
BTW, what would be the easiest ways (if any):
#1 - to exclude primer sequences (short strings of 25-30 nt) from my input fasta file within R
e.g. if I want to exclude all the match 'CAAACTCAAATCTAATCTAACCAAAAAAAC'
#2 - to filter out the short reads (< a 100 bp)?
#3 - to exclude reverse oriented sequences?
I am using outside R tools but it would be great to have all running in R...
thanks and have a great week!
a
This should probably be posted as a separate question. For #1, you could use
vmatchPattern()
and remove the reads that match. For #2, just remove those with a value ofwidth()
less than 100. For #3, you'll probably need to look intoreadGAlignments()
and read the BAM file directly instead of the fasta. It returns a GAlignments object that provides thestrand()
.