GenomicAlignments - extracting sequence signatures from bam file
12
0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

Hi,

I am using the GenomicAlignments package to analyze deep sequencing HIV data (bam input format). My overarching goal is to evaluate the frequency and relative position of a given nucleotide (cytosine) and to extract the 5 nucleotide on both sides of each C for all reads (along with the relative position).

Thanks to your precious help, I was able to scan the bam file ("scanBam") and generate the nucleotide freq (including C's) with "alphabetFrequencyFromBam".

I did the following so far for HIV sequences covering 300bp of the genome:

<which <- GRanges("HIV:1-300")
p1 <- ScanBamParam(which=which, what=scanBamWhat())
res1 <- scanBam(bamfile1, param=p1)

alphabetFrequencyFromBam(bamfile1, index=bamfile1, param=p1)>

The output looks like that:

A   C    G    T M R W S Y K V H D B N - + .
[1,]    0   0   27  372 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,]    1   0    0  399 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,]    1   0    0  399 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,]    0   0    0  412 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,]    0   1    0  489 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(...)

Next, I want to extract the ±5nucleotides on both sides of each cytosine (rare) on all the sequence reads and keep track (if possible) of the relative position.

As suggested by one of your great contributor, I tried the following:

which <- GRanges(“HIV", resize(IRanges(genome == "C", width=1L), 5L,
fix="center"))

But it raises the following error:

Error in genome == "C" :
comparison (1) is possible only for atomic and list types

The problem has probably to do with that "genome" is not considered as a character vector of my HIV sequences input bam file. Does that make sense? would you have any suggestion?

Thanks in advance and looking forward to reading you and all great insights!

a

bam file HIV cytosine • 2.2k views
0
Entering edit mode

BTW, what would be the easiest ways (if any):

#1 - to exclude primer sequences (short strings of 25-30 nt) from my input fasta file within R

e.g. if I want to exclude all the match 'CAAACTCAAATCTAATCTAACCAAAAAAAC'

#2 - to filter out the short reads (< a 100 bp)?

#3 - to exclude reverse oriented sequences?

I am using outside R tools but it would be great to have all running in R...

thanks and have a great week!

a

0
Entering edit mode

This should probably be posted as a separate question. For #1, you could use vmatchPattern() and remove the reads that match. For #2, just remove those with a value of width() less than 100. For #3, you'll probably need to look into readGAlignments() and read the BAM file directly instead of the fasta. It returns a GAlignments object that provides the strand().

0
Entering edit mode
@michael-lawrence-3846
Last seen 10 weeks ago
United States

If you have a FASTA file of your HIV genome, read it with something like:

genome <- read.DNAStringSet("hiv.fasta")[[1L]]

This assumes the genome is a single sequence.

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

Thanks Michael,

You assume that 'hiv.fasta' is my reference genome (to which I mapped the input bam file) or the input bam file converted into fasta?

If I consider hiv.fa as my reference genome, I got the following :
> genome <- readDNAStringSet("HIV.fa")[[1L]]
> summary(genome)
Length     Class      Mode
352 DNAString        S4
> which <- GRanges("HIV", resize(IRanges(genome == "C", width=1L), 5L,
fix="center"))
Error in (function (classes, fdef, mtable)  :
unable to find an inherited method for function ‘pcompare’ for signature ‘"DNAString", "character"’

Does that help? Sorry for not being very helpful but I am sure we are pretty close...

a

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

Some more info if it helps about the hiv reference to which I mapped my input bam file:

genome <- readDNAStringSet("HIV.fasta")[[1L]]
summary(genome)
genome
303-letter "DNAString" instance
seq: TTGATTTAGATAAGGTAGAAGAGGTTAATGAAGGAGAGAATAATTGTTTGTTATATTTTATAAGTTAGTATGGG...GTGGTTTGGGTGGGATTGGGGAGTGGTGAGTTTTTAGATGTTGTATATAAGTAGTTGTTTTTTGTTTGTATTGG
>

I tried to read genome as character but no more success:

> which <- GRanges("TD3_converted_amplicon", resize(IRanges(as.character(genome) == "C", width=1L), 5L,
+                                fix="center"))
Error in IRanges(as.character(genome) == "C", width = 1L) :
'end' and 'width' must be NULLs when 'start' is a logical vector or logical Rle<font face="sans-serif, Arial, Verdana, Trebuchet MS">
</font>

I also tried using the fasta format like this but I got the same results...

genome <- readDNAStringSet("bamfileconvertedtofasta.fa")[[1L]]

0
Entering edit mode

The call to IRanges() should be:

IRanges(which(genome == DNAString("C")), width=1L)
0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

I feel like we are so close...

> genome <- readDNAStringSet("/myfile.fasta")[[1L]]
> IRanges(which(genome == DNAString("C")), width=1L)
IRanges object with 0 ranges and 0 metadata columns:
start       end     width
<integer> <integer> <integer>
> which <- GRanges("HIV", resize(IRanges(which(genome == DNAString("C")), width=1L), 5L,
+                                fix="center"))
Error in validObject(.Object) :
invalid class “GRanges” object: 'mcols(x)' is not parallel to 'x'

I am sure I am missing something obvious - I also tried the following but I got the same results:

> Cpattern= matchPattern("C", genome)
> Cpattern
Views on a 303-letter DNAString subject
subject: NGGATTTAGATAAGGTAGAAGAGGTTAATGAAGGAGAGAATAATTGTTTGTTATATTTTATAAGTTAGTATG...GGTTTGGGTGGGATTGGGGAGTGTTGAGTTTTTAGATGTTGTAT----------------------------
views:
start end width
[1]   180 180     1 [C]
> which <- GRanges("HIV", resize(IRanges(which(genome == Cpattern), width=1L), 5L,
+                                              fix="center"))
Error in validObject(.Object) :
invalid class “GRanges” object: 'mcols(x)' is not parallel to 'x'

0
Entering edit mode

Try:

which <- GRanges("HIV", resize(ranges(Cpattern), 5L, fix="center"))

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

I think it works! I just modified the width to 11

The final step would be to output the index position  along with all strings of 11 bases... The stackStringsFromBam did output all strings but not their position...

thanks a million+++

>which <- GRanges("HIV", resize(ranges(Cpattern), 11L, fix="center"))

>p <- ScanBamParam(which=which, what=scanBamWhat())

>test = stackStringsFromBam("myfile.bam",
index="myfile.bam",

> test
A DNAStringSet instance of length 1235885
width seq
[1]    11 GTTGACATTGA
[2]    11 GTTGATATTGA
[3]    11 GTTGATATTGA
[4]    11 GTTGATATTGA
[5]    11 GTTGATATTGA
...   ... ...
[1235881]    11 GTTGATATTGA
[1235882]    11 GTTGATATTGA
[1235883]    11 GTTGATATTGA
[1235884]    11 NTTGATATTGA
[1235885]    11 GTTGATATTGA

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

Oups... It looks like I was too optimistic. It only extracts the strings of 11 nucleotides that correspond to the position of the C from the first/reference sequences ... What I would need is to look for "C" in all reads/sequences of my input files and extract those elements of 11 nucleotides

I tried with vmatchPatterns and the following but I am not sure if this is the good strategy...

>genome2 <- readDNAStringSet("/myfile.fasta")

>Cpattern2= vmatchPattern("C", genome2)

> which <- GRanges("TD3_converted_amplicon", resize(ranges(Cpattern2), 11L, fix="center"))
Error in (function (classes, fdef, mtable)  :
unable to find an inherited method for function ‘ranges’ for signature ‘"ByPos_MIndex"’

0
Entering edit mode

Hi,

If your FASTA file contains a single sequence (which should be the case if it's HIV reference genome) then everything is going to be much easier if you load it as Michael suggested i.e. with genome <- read.DNAStringSet()[[1]]. Then genome will be a DNAString object instead of a DNAStringSet object, and you'll be able to use matchPattern() on it instead of vmatchPattern(). This will return an XStringViews object instead of an MIndex object. And finally ranges() will work (it works on XStringViews objects but not on MIndex objects). Generally speaking it's always a good idea to try to use the simplest objects/functions.

Also you said you wanted to look for "C" in all reads/sequences of your input files (I guess you mean BAM files?). Note that what you are trying to do with your FASTA file is going to give you a GRanges object that contains 11-nucleotide regions centered on the "C" positions in the reference genome. That's not the same.

H.

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

My goal is to look for 'strings of 11 nucleotides centered by a C'  in all reads of my bam file (or converted into fasta) and not only in the reference seq... Is there a way to do that (that's why I tried  vmatchPattern above but it raised an error...)

thanks in advance for all advices and suggestions!

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

Is there a looping option that can work with matchPattern()?

I tried a more trivial approach where 'myfile' is a large alignment of thousands of mapped reads and I converted it into a large matrix... It is not working but I wonder this is something one can consider (my guess is that it will be too computationally intensive for R when dealing with thousands of reads). Would you mind taking a look at this first draft of my loop below? What is wrong with it? can we loop through all reads with matchPattern and GRanges?

Is there a way to store the 'CytosinePattern' efficiently in the loop below?

thanks (again)!

a

myfasta<- readAAStringSet("myfile.fasta")

myfasta2 = as.matrix(myfasta)

seqlength=length(myfasta2[1,])

seqnumber=length(myfasta2[,1])

for(i in 1:seqnumber){
for(j in 6:(seqlength-6)){
if (myfasta2[i,j]=="C"){
CytosinePattern=myfasta2[i,(j-5):(j+5)]
stringname= paste("CytosinePattern",j, sep="-")
}

stored = cbind(CytosinePattern, Cstring)
}
0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

Here's what I've finally done... I am not proud of it and if anyone has a suggestion to make it more efficient, I am interested ;-)

MyOutPut  <- NULL
for(i in 1:seqnumber){
for(j in 6:(seqlength-6)){
if (myfasta2[i,j]=="C"){
CytosinePattern=myfasta2[i,(j-5):(j+5)]
stringname= paste("CytosinePattern",j, sep="-")
Position <- stringname
Pattern <-paste(CytosinePattern,collapse="")
MyOutPut <- cbind(MyOutPut, Position,Pattern)
}
}
}
MyOutPut
0
Entering edit mode

This is a revised answer that switches from the complicated object returned from vmatchPattern() to a (conceptually) 'tidy' DataFrame before additional, straight-forward, computation.

Here's some DNA sequence that we can all work with

set.seed(123)
library(Biostrings)
library(hgu95av2probe)
x2 <- sample(hgu95av2probe$sequence, 100, replace=TRUE) dna <- DNAStringSet(x2) Find all the C nucleotides and represent this as a 'tidy' DataFrame (use names(dna) instead of seq_along(dna) if the names on dna are unique) matches <- vmatchPattern("C", dna) df <- DataFrame( Seqnames = rep(seq_along(dna), lengths(matches)), Position = unlist(start(matches)) ) Update the data frame to contain the start, end and DNA sequence surrounding the C nucleotide df$Start <- df$Position - 6L df$End <- df$Position + 6L df$Pattern <- DNAStringSet( substr(dna[df$Seqnames], df$Start, df$End) ) Optionally, remove matches that were too close to the start or end of the sequence df <- subset(df, width(Pattern) == 13L) or convert to GRanges ​GenomicRanges::GRanges(df) ADD REPLY 1 Entering edit mode Because I am not good at the offset math, I would have done something like this: si <- Seqinfo(names(dna), width(dna)) gr <- GRanges(vmatchPattern("C", dna), seqinfo=si) window.size <- 11L windows <- subset(trim(resize(gr, window.size, fix="center")), width == window.size) gr$pattern <- dna[windows]

I think it expresses the intent: find the C's, form windows, but exclude those that would extend off the end of the sequence. This was harder than I thought. There should be an easier way to exclude out of bounds windows. Perhaps also MIndex objects should know their Seqinfo.

Another nice thing is that it returns the data as a GRanges, which is a useful object.

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

That is just PERFECT and so much efficient than my loop attempt...

thanks to all of you for these precious advices!

a

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

Hello

I jus wanted to thank you all for your very precious help on this topic.

Solution suggested by Martin works well for me - I also like the alternative one proposed by Michael but it was more computationally intensive given the size of my input files (up to several 100,000 of reads)

Best,

a

0
Entering edit mode

Interesting. Which part is slow, the last extraction?

0
Entering edit mode
achaillon • 0
@achaillon-14117
Last seen 3 months ago

actually, it occurs at the subsetting step (run for hour(s))

> windows <- subset(trim(resize(gr, window.size, fix="center"),
+                        width < window.size))
0
Entering edit mode

Maybe you could break that down into separate calls? So we can tell whether it is resize, trim or subset? Also, in the future, please just add comments, don't add a new answer.

0
Entering edit mode

The brackets are not correct; see the edited version -- width < window.size is an argument to subset, not to trim.

0
Entering edit mode

... and it's width == window.size

0
Entering edit mode

thanks! - I will give it a try and keep you informed (I posted another question for the other steps)