Search
Question: GenomicAlignments - extracting sequence signatures from bam file
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

Hi,

I am using the GenomicAlignments package to analyze deep sequencing HIV data (bam input format). My overarching goal is to evaluate the frequency and relative position of a given nucleotide (cytosine) and to extract the 5 nucleotide on both sides of each C for all reads (along with the relative position). 

Thanks to your precious help, I was able to scan the bam file ("scanBam") and generate the nucleotide freq (including C's) with "alphabetFrequencyFromBam".

I did the following so far for HIV sequences covering 300bp of the genome:

<which <- GRanges("HIV:1-300")
p1 <- ScanBamParam(which=which, what=scanBamWhat())
res1 <- scanBam(bamfile1, param=p1)

alphabetFrequencyFromBam(bamfile1, index=bamfile1, param=p1)>

 

The output looks like that:

          A   C    G    T M R W S Y K V H D B N - + .
  [1,]    0   0   27  372 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [2,]    1   0    0  399 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [3,]    1   0    0  399 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [4,]    0   0    0  412 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [5,]    0   1    0  489 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(...)

Next, I want to extract the ±5nucleotides on both sides of each cytosine (rare) on all the sequence reads and keep track (if possible) of the relative position.

As suggested by one of your great contributor, I tried the following:

which <- GRanges(“HIV", resize(IRanges(genome == "C", width=1L), 5L,
fix="center"))

But it raises the following error:

Error in genome == "C" :
comparison (1) is possible only for atomic and list types

The problem has probably to do with that "genome" is not considered as a character vector of my HIV sequences input bam file. Does that make sense? would you have any suggestion? 

Please let me know if you need more info. 

Thanks in advance and looking forward to reading you and all great insights!

a

 

 

 

 

 

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by achaillon0

BTW, what would be the easiest ways (if any): 

#1 - to exclude primer sequences (short strings of 25-30 nt) from my input fasta file within R

e.g. if I want to exclude all the match 'CAAACTCAAATCTAATCTAACCAAAAAAAC'

#2 - to filter out the short reads (< a 100 bp)?

#3 - to exclude reverse oriented sequences?

I am using outside R tools but it would be great to have all running in R...

thanks and have a great week!

a

 

 

ADD REPLYlink written 6 weeks ago by achaillon0

This should probably be posted as a separate question. For #1, you could use vmatchPattern() and remove the reads that match. For #2, just remove those with a value of width() less than 100. For #3, you'll probably need to look into readGAlignments() and read the BAM file directly instead of the fasta. It returns a GAlignments object that provides the strand().

ADD REPLYlink written 6 weeks ago by Michael Lawrence9.8k
0
gravatar for Michael Lawrence
6 weeks ago by
United States
Michael Lawrence9.8k wrote:

If you have a FASTA file of your HIV genome, read it with something like:

genome <- read.DNAStringSet("hiv.fasta")[[1L]]

This assumes the genome is a single sequence.

ADD COMMENTlink written 6 weeks ago by Michael Lawrence9.8k
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

Thanks Michael,

You assume that 'hiv.fasta' is my reference genome (to which I mapped the input bam file) or the input bam file converted into fasta?

If I consider hiv.fa as my reference genome, I got the following :
> genome <- readDNAStringSet("HIV.fa")[[1L]]
> summary(genome)
   Length     Class      Mode 
      352 DNAString        S4 
> which <- GRanges("HIV", resize(IRanges(genome == "C", width=1L), 5L,
                                fix="center"))
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘pcompare’ for signature ‘"DNAString", "character"’

Does that help? Sorry for not being very helpful but I am sure we are pretty close...

a

ADD COMMENTlink written 6 weeks ago by achaillon0
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

Some more info if it helps about the hiv reference to which I mapped my input bam file:

genome <- readDNAStringSet("HIV.fasta")[[1L]]
summary(genome)
 genome
  303-letter "DNAString" instance
seq: TTGATTTAGATAAGGTAGAAGAGGTTAATGAAGGAGAGAATAATTGTTTGTTATATTTTATAAGTTAGTATGGG...GTGGTTTGGGTGGGATTGGGGAGTGGTGAGTTTTTAGATGTTGTATATAAGTAGTTGTTTTTTGTTTGTATTGG
>

I tried to read genome as character but no more success:

> which <- GRanges("TD3_converted_amplicon", resize(IRanges(as.character(genome) == "C", width=1L), 5L,
+                                fix="center"))
Error in IRanges(as.character(genome) == "C", width = 1L) : 
  'end' and 'width' must be NULLs when 'start' is a logical vector or logical Rle<font face="sans-serif, Arial, Verdana, Trebuchet MS">
</font>

 

I also tried using the fasta format like this but I got the same results...

genome <- readDNAStringSet("bamfileconvertedtofasta.fa")[[1L]]

 

 

ADD COMMENTlink written 6 weeks ago by achaillon0

The call to IRanges() should be:

IRanges(which(genome == DNAString("C")), width=1L)
ADD REPLYlink written 6 weeks ago by Michael Lawrence9.8k
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

I feel like we are so close...

> genome <- readDNAStringSet("/myfile.fasta")[[1L]]
> IRanges(which(genome == DNAString("C")), width=1L)
IRanges object with 0 ranges and 0 metadata columns:
       start       end     width
   <integer> <integer> <integer>
> which <- GRanges("HIV", resize(IRanges(which(genome == DNAString("C")), width=1L), 5L,
+                                fix="center"))
Error in validObject(.Object) : 
  invalid class “GRanges” object: 'mcols(x)' is not parallel to 'x'

 

I am sure I am missing something obvious - I also tried the following but I got the same results:

> Cpattern= matchPattern("C", genome)
> Cpattern
  Views on a 303-letter DNAString subject
subject: NGGATTTAGATAAGGTAGAAGAGGTTAATGAAGGAGAGAATAATTGTTTGTTATATTTTATAAGTTAGTATG...GGTTTGGGTGGGATTGGGGAGTGTTGAGTTTTTAGATGTTGTAT----------------------------
views:
    start end width
[1]   180 180     1 [C]
> which <- GRanges("HIV", resize(IRanges(which(genome == Cpattern), width=1L), 5L,
+                                              fix="center"))
Error in validObject(.Object) : 
  invalid class “GRanges” object: 'mcols(x)' is not parallel to 'x'

 

 

ADD COMMENTlink written 6 weeks ago by achaillon0

Try:

which <- GRanges("HIV", resize(ranges(Cpattern), 5L, fix="center"))

 

ADD REPLYlink written 6 weeks ago by Michael Lawrence9.8k
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

 

I think it works! I just modified the width to 11

The final step would be to output the index position  along with all strings of 11 bases... The stackStringsFromBam did output all strings but not their position...

thanks a million+++

>which <- GRanges("HIV", resize(ranges(Cpattern), 11L, fix="center"))

>p <- ScanBamParam(which=which, what=scanBamWhat())

>test = stackStringsFromBam("myfile.bam", 
                                 index="myfile.bam",  
                                 param=p, use.names=FALSE, D.letter="-", N.letter=".", Lpadding.letter="+", Rpadding.letter="+")

> test
  A DNAStringSet instance of length 1235885
          width seq
      [1]    11 GTTGACATTGA
      [2]    11 GTTGATATTGA
      [3]    11 GTTGATATTGA
      [4]    11 GTTGATATTGA
      [5]    11 GTTGATATTGA
      ...   ... ...
[1235881]    11 GTTGATATTGA
[1235882]    11 GTTGATATTGA
[1235883]    11 GTTGATATTGA
[1235884]    11 NTTGATATTGA
[1235885]    11 GTTGATATTGA

 

 

ADD COMMENTlink written 6 weeks ago by achaillon0
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

Oups... It looks like I was too optimistic. It only extracts the strings of 11 nucleotides that correspond to the position of the C from the first/reference sequences ... What I would need is to look for "C" in all reads/sequences of my input files and extract those elements of 11 nucleotides

 

I tried with vmatchPatterns and the following but I am not sure if this is the good strategy...

>genome2 <- readDNAStringSet("/myfile.fasta")


>Cpattern2= vmatchPattern("C", genome2)

> which <- GRanges("TD3_converted_amplicon", resize(ranges(Cpattern2), 11L, fix="center"))
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘ranges’ for signature ‘"ByPos_MIndex"’
ADD COMMENTlink written 6 weeks ago by achaillon0

Hi,

If your FASTA file contains a single sequence (which should be the case if it's HIV reference genome) then everything is going to be much easier if you load it as Michael suggested i.e. with genome <- read.DNAStringSet()[[1]]. Then genome will be a DNAString object instead of a DNAStringSet object, and you'll be able to use matchPattern() on it instead of vmatchPattern(). This will return an XStringViews object instead of an MIndex object. And finally ranges() will work (it works on XStringViews objects but not on MIndex objects). Generally speaking it's always a good idea to try to use the simplest objects/functions.

Also you said you wanted to look for "C" in all reads/sequences of your input files (I guess you mean BAM files?). Note that what you are trying to do with your FASTA file is going to give you a GRanges object that contains 11-nucleotide regions centered on the "C" positions in the reference genome. That's not the same.

H.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Hervé Pagès ♦♦ 13k
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

My goal is to look for 'strings of 11 nucleotides centered by a C'  in all reads of my bam file (or converted into fasta) and not only in the reference seq... Is there a way to do that (that's why I tried  vmatchPattern above but it raised an error...)

thanks in advance for all advices and suggestions!

ADD COMMENTlink written 6 weeks ago by achaillon0
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

Is there a looping option that can work with matchPattern()?

I tried a more trivial approach where 'myfile' is a large alignment of thousands of mapped reads and I converted it into a large matrix... It is not working but I wonder this is something one can consider (my guess is that it will be too computationally intensive for R when dealing with thousands of reads). Would you mind taking a look at this first draft of my loop below? What is wrong with it? can we loop through all reads with matchPattern and GRanges?

Is there a way to store the 'CytosinePattern' efficiently in the loop below?

thanks (again)!

a

 

myfasta<- readAAStringSet("myfile.fasta")

myfasta2 = as.matrix(myfasta)

seqlength=length(myfasta2[1,])

seqnumber=length(myfasta2[,1])

for(i in 1:seqnumber){
        for(j in 6:(seqlength-6)){
                if (myfasta2[i,j]=="C"){
                        CytosinePattern=myfasta2[i,(j-5):(j+5)]
                        stringname= paste("CytosinePattern",j, sep="-")
                }

stored = cbind(CytosinePattern, Cstring)
}
ADD COMMENTlink written 6 weeks ago by achaillon0
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

Here's what I've finally done... I am not proud of it and if anyone has a suggestion to make it more efficient, I am interested ;-)

 

MyOutPut  <- NULL
for(i in 1:seqnumber){
        for(j in 6:(seqlength-6)){
                if (myfasta2[i,j]=="C"){
                        CytosinePattern=myfasta2[i,(j-5):(j+5)]
                        stringname= paste("CytosinePattern",j, sep="-")
                        Position <- stringname
                        Pattern <-paste(CytosinePattern,collapse="")
                        MyOutPut <- cbind(MyOutPut, Position,Pattern)
                }
}
}
MyOutPut
ADD COMMENTlink written 6 weeks ago by achaillon0

This is a revised answer that switches from the complicated object returned from vmatchPattern() to a (conceptually) 'tidy' DataFrame before additional, straight-forward, computation.

Here's some DNA sequence that we can all work with

set.seed(123)
library(Biostrings)
library(hgu95av2probe)
x2 <- sample(hgu95av2probe$sequence, 100, replace=TRUE)
dna <- DNAStringSet(x2)

Find all the C nucleotides and represent this as a 'tidy' DataFrame (use names(dna) instead of seq_along(dna) if the names on dna are unique)

matches <- vmatchPattern("C", dna)
df <- DataFrame(
    Seqnames = rep(seq_along(dna), lengths(matches)),
    Position = unlist(start(matches))
)

Update the data frame to contain the start, end and DNA sequence surrounding the C nucleotide

df$Start <- df$Position - 6L
df$End <- df$Position + 6L
df$Pattern <- DNAStringSet( substr(dna[df$Seqnames], df$Start, df$End) )

Optionally, remove matches that were too close to the start or end of the sequence

df <- subset(df, width(Pattern) == 13L)

or convert to GRanges

​GenomicRanges::GRanges(df)
ADD REPLYlink modified 5 weeks ago • written 6 weeks ago by Martin Morgan ♦♦ 20k
1

Because I am not good at the offset math, I would have done something like this:

si <- Seqinfo(names(dna), width(dna))
gr <- GRanges(vmatchPattern("C", dna), seqinfo=si)
window.size <- 11L
windows <- subset(trim(resize(gr, window.size, fix="center")),
                  width == window.size)
gr$pattern <- dna[windows]

I think it expresses the intent: find the C's, form windows, but exclude those that would extend off the end of the sequence. This was harder than I thought. There should be an easier way to exclude out of bounds windows. Perhaps also MIndex objects should know their Seqinfo.

Another nice thing is that it returns the data as a GRanges, which is a useful object.

ADD REPLYlink modified 6 weeks ago by Martin Morgan ♦♦ 20k • written 6 weeks ago by Michael Lawrence9.8k
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

That is just PERFECT and so much efficient than my loop attempt...

thanks to all of you for these precious advices!

a

 

ADD COMMENTlink written 6 weeks ago by achaillon0
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

Hello

I jus wanted to thank you all for your very precious help on this topic.

Solution suggested by Martin works well for me - I also like the alternative one proposed by Michael but it was more computationally intensive given the size of my input files (up to several 100,000 of reads)

Best,

a

ADD COMMENTlink written 6 weeks ago by achaillon0

Interesting. Which part is slow, the last extraction?

ADD REPLYlink written 6 weeks ago by Michael Lawrence9.8k
0
gravatar for achaillon
6 weeks ago by
achaillon0
achaillon0 wrote:

actually, it occurs at the subsetting step (run for hour(s))

> windows <- subset(trim(resize(gr, window.size, fix="center"),
+                        width < window.size))
ADD COMMENTlink written 6 weeks ago by achaillon0

Maybe you could break that down into separate calls? So we can tell whether it is resize, trim or subset? Also, in the future, please just add comments, don't add a new answer.

ADD REPLYlink written 6 weeks ago by Michael Lawrence9.8k

The brackets are not correct; see the edited version -- width < window.size is an argument to subset, not to trim.

ADD REPLYlink written 6 weeks ago by Martin Morgan ♦♦ 20k

... and it's width == window.size

ADD REPLYlink written 6 weeks ago by Martin Morgan ♦♦ 20k

thanks! - I will give it a try and keep you informed (I posted another question for the other steps)

 

ADD REPLYlink written 6 weeks ago by achaillon0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 237 users visited in the last hour