Question

GenomicAlignments - extracting sequence signatures from bam file

0

Entering edit mode

achaillon • 0

@achaillon-14117

Last seen 3.3 years ago

Hi,

I am using the GenomicAlignments package to analyze deep sequencing HIV data (bam input format). My overarching goal is to evaluate the frequency and relative position of a given nucleotide (cytosine) and to extract the 5 nucleotide on both sides of each C for all reads (along with the relative position).

Thanks to your precious help, I was able to scan the bam file ("scanBam") and generate the nucleotide freq (including C's) with "alphabetFrequencyFromBam".

I did the following so far for HIV sequences covering 300bp of the genome:

<which <- GRanges("HIV:1-300")
p1 <- ScanBamParam(which=which, what=scanBamWhat())
res1 <- scanBam(bamfile1, param=p1)

alphabetFrequencyFromBam(bamfile1, index=bamfile1, param=p1)>

The output looks like that:

A C G T M R W S Y K V H D B N - + .
[1,] 0 0 27 372 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 0 0 399 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 1 0 0 399 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 412 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 0 1 0 489 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(...)

Next, I want to extract the ±5nucleotides on both sides of each cytosine (rare) on all the sequence reads and keep track (if possible) of the relative position.

As suggested by one of your great contributor, I tried the following:

which <- GRanges(“HIV", resize(IRanges(genome == "C", width=1L), 5L,
fix="center"))

But it raises the following error:

Error in genome == "C" :
comparison (1) is possible only for atomic and list types

The problem has probably to do with that "genome" is not considered as a character vector of my HIV sequences input bam file. Does that make sense? would you have any suggestion?

Please let me know if you need more info.

Thanks in advance and looking forward to reading you and all great insights!

a

bam file HIV cytosine • 4.8k views

ADD COMMENT • link 6.6 years ago achaillon • 0

0

Entering edit mode

BTW, what would be the easiest ways (if any):

#1 - to exclude primer sequences (short strings of 25-30 nt) from my input fasta file within R

e.g. if I want to exclude all the match 'CAAACTCAAATCTAATCTAACCAAAAAAAC'

#2 - to filter out the short reads (< a 100 bp)?

#3 - to exclude reverse oriented sequences?

I am using outside R tools but it would be great to have all running in R...

thanks and have a great week!

a

ADD REPLY • link 6.6 years ago achaillon • 0

0

Entering edit mode

This should probably be posted as a separate question. For #1, you could use vmatchPattern() and remove the reads that match. For #2, just remove those with a value of width() less than 100. For #3, you'll probably need to look into readGAlignments() and read the BAM file directly instead of the fasta. It returns a GAlignments object that provides the strand().

ADD REPLY • link 6.6 years ago Michael Lawrence ★ 11k

score 0 · Answer 1 · 2017-10-05

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

If you have a FASTA file of your HIV genome, read it with something like:

genome <- read.DNAStringSet("hiv.fasta")[[1L]]

This assumes the genome is a single sequence.

ADD COMMENT • link 6.6 years ago Michael Lawrence ★ 11k

score 0 · Answer 2 · 2017-10-06

Thanks Michael,

You assume that 'hiv.fasta' is my reference genome (to which I mapped the input bam file) or the input bam file converted into fasta?

If I consider hiv.fa as my reference genome, I got the following :
> genome <- readDNAStringSet("HIV.fa")[[1L]]
> summary(genome)
   Length     Class      Mode 
      352 DNAString        S4 
> which <- GRanges("HIV", resize(IRanges(genome == "C", width=1L), 5L,
                                fix="center"))
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘pcompare’ for signature ‘"DNAString", "character"’

Does that help? Sorry for not being very helpful but I am sure we are pretty close...

a

score 0 · Answer 3 · 2017-10-06

Some more info if it helps about the hiv reference to which I mapped my input bam file:

genome <- readDNAStringSet("HIV.fasta")[[1L]]
summary(genome)
 genome
  303-letter "DNAString" instance
seq: TTGATTTAGATAAGGTAGAAGAGGTTAATGAAGGAGAGAATAATTGTTTGTTATATTTTATAAGTTAGTATGGG...GTGGTTTGGGTGGGATTGGGGAGTGGTGAGTTTTTAGATGTTGTATATAAGTAGTTGTTTTTTGTTTGTATTGG
>

I tried to read genome as character but no more success:

> which <- GRanges("TD3_converted_amplicon", resize(IRanges(as.character(genome) == "C", width=1L), 5L,
+                                fix="center"))
Error in IRanges(as.character(genome) == "C", width = 1L) : 
  'end' and 'width' must be NULLs when 'start' is a logical vector or logical Rle<font face="sans-serif, Arial, Verdana, Trebuchet MS">
</font>

I also tried using the fasta format like this but I got the same results...

genome <- readDNAStringSet("bamfileconvertedtofasta.fa")[[1L]]

score 0 · Answer 4 · 2017-10-06

I feel like we are so close...

> genome <- readDNAStringSet("/myfile.fasta")[[1L]]
> IRanges(which(genome == DNAString("C")), width=1L)
IRanges object with 0 ranges and 0 metadata columns:
       start       end     width
   <integer> <integer> <integer>
> which <- GRanges("HIV", resize(IRanges(which(genome == DNAString("C")), width=1L), 5L,
+                                fix="center"))
Error in validObject(.Object) : 
  invalid class “GRanges” object: 'mcols(x)' is not parallel to 'x'

I am sure I am missing something obvious - I also tried the following but I got the same results:

> Cpattern= matchPattern("C", genome)
> Cpattern
  Views on a 303-letter DNAString subject
subject: NGGATTTAGATAAGGTAGAAGAGGTTAATGAAGGAGAGAATAATTGTTTGTTATATTTTATAAGTTAGTATG...GGTTTGGGTGGGATTGGGGAGTGTTGAGTTTTTAGATGTTGTAT----------------------------
views:
    start end width
[1]   180 180     1 [C]
> which <- GRanges("HIV", resize(IRanges(which(genome == Cpattern), width=1L), 5L,
+                                              fix="center"))
Error in validObject(.Object) : 
  invalid class “GRanges” object: 'mcols(x)' is not parallel to 'x'

score 0 · Answer 5 · 2017-10-06

I think it works! I just modified the width to 11

The final step would be to output the index position along with all strings of 11 bases... The stackStringsFromBam did output all strings but not their position...

thanks a million+++

>which <- GRanges("HIV", resize(ranges(Cpattern), 11L, fix="center"))

>p <- ScanBamParam(which=which, what=scanBamWhat())

>test = stackStringsFromBam("myfile.bam", 
                                 index="myfile.bam",  
                                 param=p, use.names=FALSE, D.letter="-", N.letter=".", Lpadding.letter="+", Rpadding.letter="+")

> test
  A DNAStringSet instance of length 1235885
          width seq
      [1]    11 GTTGACATTGA
      [2]    11 GTTGATATTGA
      [3]    11 GTTGATATTGA
      [4]    11 GTTGATATTGA
      [5]    11 GTTGATATTGA
      ...   ... ...
[1235881]    11 GTTGATATTGA
[1235882]    11 GTTGATATTGA
[1235883]    11 GTTGATATTGA
[1235884]    11 NTTGATATTGA
[1235885]    11 GTTGATATTGA

score 0 · Answer 6 · 2017-10-06

0

Entering edit mode

achaillon • 0

@achaillon-14117

Last seen 3.3 years ago

Oups... It looks like I was too optimistic. It only extracts the strings of 11 nucleotides that correspond to the position of the C from the first/reference sequences ... What I would need is to look for "C" in all reads/sequences of my input files and extract those elements of 11 nucleotides

I tried with vmatchPatterns and the following but I am not sure if this is the good strategy...

>genome2 <- readDNAStringSet("/myfile.fasta")


>Cpattern2= vmatchPattern("C", genome2)

> which <- GRanges("TD3_converted_amplicon", resize(ranges(Cpattern2), 11L, fix="center"))
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘ranges’ for signature ‘"ByPos_MIndex"’

ADD COMMENT • link 6.6 years ago achaillon • 0

0

Entering edit mode

Hi,

If your FASTA file contains a single sequence (which should be the case if it's HIV reference genome) then everything is going to be much easier if you load it as Michael suggested i.e. with genome <- read.DNAStringSet()[[1]]. Then genome will be a DNAString object instead of a DNAStringSet object, and you'll be able to use matchPattern() on it instead of vmatchPattern(). This will return an XStringViews object instead of an MIndex object. And finally ranges() will work (it works on XStringViews objects but not on MIndex objects). Generally speaking it's always a good idea to try to use the simplest objects/functions.

Also you said you wanted to look for "C" in all reads/sequences of your input files (I guess you mean BAM files?). Note that what you are trying to do with your FASTA file is going to give you a GRanges object that contains 11-nucleotide regions centered on the "C" positions in the reference genome. That's not the same.

H.

ADD REPLY • link 6.6 years ago Hervé Pagès 16k

score 0 · Answer 7 · 2017-10-06

0

Entering edit mode

achaillon • 0

@achaillon-14117

Last seen 3.3 years ago

My goal is to look for 'strings of 11 nucleotides centered by a C' in all reads of my bam file (or converted into fasta) and not only in the reference seq... Is there a way to do that (that's why I tried vmatchPattern above but it raised an error...)

thanks in advance for all advices and suggestions!

a

ADD COMMENT • link 6.6 years ago achaillon • 0

score 0 · Answer 8 · 2017-10-07

Is there a looping option that can work with matchPattern()?

I tried a more trivial approach where 'myfile' is a large alignment of thousands of mapped reads and I converted it into a large matrix... It is not working but I wonder this is something one can consider (my guess is that it will be too computationally intensive for R when dealing with thousands of reads). Would you mind taking a look at this first draft of my loop below? What is wrong with it? can we loop through all reads with matchPattern and GRanges?

Is there a way to store the 'CytosinePattern' efficiently in the loop below?

thanks (again)!

a

myfasta<- readAAStringSet("myfile.fasta")

myfasta2 = as.matrix(myfasta)

seqlength=length(myfasta2[1,])

seqnumber=length(myfasta2[,1])

for(i in 1:seqnumber){
        for(j in 6:(seqlength-6)){
                if (myfasta2[i,j]=="C"){
                        CytosinePattern=myfasta2[i,(j-5):(j+5)]
                        stringname= paste("CytosinePattern",j, sep="-")
                }

stored = cbind(CytosinePattern, Cstring)
}

Martin Morgan · Answer 9 · 2017-10-07

0

Entering edit mode

achaillon • 0

@achaillon-14117

Last seen 3.3 years ago

Here's what I've finally done... I am not proud of it and if anyone has a suggestion to make it more efficient, I am interested ;-)

MyOutPut  <- NULL
for(i in 1:seqnumber){
        for(j in 6:(seqlength-6)){
                if (myfasta2[i,j]=="C"){
                        CytosinePattern=myfasta2[i,(j-5):(j+5)]
                        stringname= paste("CytosinePattern",j, sep="-")
                        Position <- stringname
                        Pattern <-paste(CytosinePattern,collapse="")
                        MyOutPut <- cbind(MyOutPut, Position,Pattern)
                }
}
}
MyOutPut

ADD COMMENT • link 6.6 years ago achaillon • 0

0

Entering edit mode

This is a revised answer that switches from the complicated object returned from vmatchPattern() to a (conceptually) 'tidy' DataFrame before additional, straight-forward, computation.

Here's some DNA sequence that we can all work with

set.seed(123)
library(Biostrings)
library(hgu95av2probe)
x2 <- sample(hgu95av2probe$sequence, 100, replace=TRUE)
dna <- DNAStringSet(x2)

Find all the C nucleotides and represent this as a 'tidy' DataFrame (use names(dna) instead of seq_along(dna) if the names on dna are unique)

matches <- vmatchPattern("C", dna)
df <- DataFrame(
    Seqnames = rep(seq_along(dna), lengths(matches)),
    Position = unlist(start(matches))
)

Update the data frame to contain the start, end and DNA sequence surrounding the C nucleotide

df$Start <- df$Position - 6L
df$End <- df$Position + 6L
df$Pattern <- DNAStringSet( substr(dna[df$Seqnames], df$Start, df$End) )

Optionally, remove matches that were too close to the start or end of the sequence

df <- subset(df, width(Pattern) == 13L)

or convert to GRanges

GenomicRanges::GRanges(df)

ADD REPLY • link 6.6 years ago Martin Morgan 25k

1

Entering edit mode

Because I am not good at the offset math, I would have done something like this:

si <- Seqinfo(names(dna), width(dna))
gr <- GRanges(vmatchPattern("C", dna), seqinfo=si)
window.size <- 11L
windows <- subset(trim(resize(gr, window.size, fix="center")),
                  width == window.size)
gr$pattern <- dna[windows]

I think it expresses the intent: find the C's, form windows, but exclude those that would extend off the end of the sequence. This was harder than I thought. There should be an easier way to exclude out of bounds windows. Perhaps also MIndex objects should know their Seqinfo.

Another nice thing is that it returns the data as a GRanges, which is a useful object.

ADD REPLY • link updated 6.6 years ago by Martin Morgan 25k • written 6.6 years ago by Michael Lawrence ★ 11k

score 0 · Answer 10 · 2017-10-07

0

Entering edit mode

achaillon • 0

@achaillon-14117

Last seen 3.3 years ago

That is just PERFECT and so much efficient than my loop attempt...

thanks to all of you for these precious advices!

a

ADD COMMENT • link 6.6 years ago achaillon • 0

score 0 · Answer 11 · 2017-10-09

0

Entering edit mode

achaillon • 0

@achaillon-14117

Last seen 3.3 years ago

Hello

I jus wanted to thank you all for your very precious help on this topic.

Solution suggested by Martin works well for me - I also like the alternative one proposed by Michael but it was more computationally intensive given the size of my input files (up to several 100,000 of reads)

Best,

a

ADD COMMENT • link 6.6 years ago achaillon • 0

0

Entering edit mode

Interesting. Which part is slow, the last extraction?

ADD REPLY • link 6.6 years ago Michael Lawrence ★ 11k

score 0 · Answer 12 · 2017-10-09

0

Entering edit mode

achaillon • 0

@achaillon-14117

Last seen 3.3 years ago

actually, it occurs at the subsetting step (run for hour(s))

> windows <- subset(trim(resize(gr, window.size, fix="center"),
+                        width < window.size))

ADD COMMENT • link 6.6 years ago achaillon • 0

0

Entering edit mode

Maybe you could break that down into separate calls? So we can tell whether it is resize, trim or subset? Also, in the future, please just add comments, don't add a new answer.

ADD REPLY • link 6.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

The brackets are not correct; see the edited version -- width < window.size is an argument to subset, not to trim.

ADD REPLY • link 6.6 years ago Martin Morgan 25k

0

Entering edit mode

... and it's width == window.size

ADD REPLY • link 6.6 years ago Martin Morgan 25k

0

Entering edit mode

thanks! - I will give it a try and keep you informed (I posted another question for the other steps)

ADD REPLY • link 6.6 years ago achaillon • 0