Question: stackStringsFromBam equivalent for GAlignments with seq
0
11 weeks ago by
Janet Young730
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Janet Young730 wrote:

hi there,

It's been a while since I posted - hi all! I miss you bioc people here at the Hutch.

I am wondering whether a function exists that works like stackStringsFromBam, but operates on a GAlignments object (that contains a "seq" column) rather than a bam file. I'm asking because I want to do this series of operations (a) read in a bam file as GAlignments (has thousands of reads), e.g. aln <- readGAlignments(bamfile, use.names=TRUE, param=ScanBamParam(what="seq")) (b) do some filtering on those alignments (now have fewer thousands of reads) aln_filt <- aln[myFilters] (c) get an alignment of reads overlapping a region of interest (~10 reads in each region)

If I want to skip the filtering step, I can get the alignment like this:
myAln <- stackStringsFromBam(bamfile, param=myRegionAsGRanges, use.names=TRUE) but that operates on a bam rather than GAlignments object.

It'd be really handy for me there was an analagous function like stackStringsFromGAlignments - would that be easy to implement? (or does it already exist?)

I think I can see a path towards getting what I need using the sequenceLayer function but that path seems more complicated than it should be. I can see other potential paths too that will extract sequence chunks given a reference sequence, but they lose the read names, which I want to use in later processing steps.

Am I missing something? It seems like there's probably a function out there that'll do what I want but I can't figure out what it is.

thanks very much!

Janet

ADD COMMENTlink
modified 11 weeks ago • written 11 weeks ago by Janet Young730
Answer: stackStringsFromBam equivalent for GAlignments with seq
1
11 weeks ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Janet,

Long time no see. Welcome back!

I added stackStringsFromGAlignments() to GenomicAlignments 1.21.4 (devel). See commit here.

GenomicAlignments 1.21.4 should propagate to BioC 3.10 (devel) in the next 24-48h.

Best,

H.

ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by Hervé Pagès ♦♦ 14k
Answer: stackStringsFromBam equivalent for GAlignments with seq
0
11 weeks ago by
Janet Young730
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Janet Young730 wrote:

Wow - so quick! thanks, Herve - really appreciate it.

It's been a while since I've used devel (things are so stable these days), so I'll have to remind myself how to run that, but I'll try this out ASAP.

Janet

ADD COMMENTlink written 11 weeks ago by Janet Young730

(deleted comment - meant to submit it as a reply)

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Janet Young730
Answer: stackStringsFromBam equivalent for GAlignments with seq
0
11 weeks ago by
Janet Young730
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Janet Young730 wrote:

Tried stackStringsFromGAlignments - looks great! Thanks again, Herve.

On a quick test it works really well. I did a sanity check with test alignment/region, and I noticed something that I'm sure I can ignore, but wondered if you might care about it for consistency.

I noticed on a test region that the objects returned by stackStringsFromBam and stackStringsFromGAlignments are not fully identical. The parts I care about (seq and names) are identical, which is great, but there's an mcols element that's not. I don't really understand what that mcols object is, but I don't think I need to - seems like an internal thing for DNAStringSets that won't affect me (?).

Anyway, a code example will be the easiest to explain. First, I get an alignment either from bam file or from GAlignments object:

bamfile <- BamFile("myfile.sorted.bam")
aln <- readGAlignments(bamfile, use.names=TRUE, param=ScanBamParam(what="seq"))
myAln <- stackStringsFromBam(bamfile, param=temp[35], use.names=TRUE)
myAln2 <- stackStringsFromGAlignments(aln, region=temp[35])


I'm happy that sequence and names are identical:

   identical(seq(myAln),seq(myAln2)) # TRUE
identical(names(myAln),names(myAln2)) # TRUE


but the overall DNAStringSet object is not

identical(myAln,myAln2)   # FALSE


and that's because of the mcols:

mcols(myAln)
DataFrame with 12 rows and 0 columns

mcols(myAln2)
DataFrame with 12 rows and 1 column
seq
<DNAStringSet>
10/8454466//0_2661  AAAGCTTTTC...TCGATGAGCG


etc

other than mcols, the two alignments are identical:

mcols(myAln) <- mcols(myAln2)
identical(myAln,myAln2) # TRUE


thanks again!

Janet

ADD COMMENTlink written 11 weeks ago by Janet Young730

Right. This is expected. For myAln you din't explicitly request the seq metadata column. If you explicitly request it, e.g. with:

param <- ScanBamParam(which=temp[35], what="seq")
myAln <- stackStringsFromBam(bamfile, param=param, use.names=TRUE)


then myAln and myAln2 should have the same metadata columns.

An additional note: please keep in mind that identical() is not reliable on objects that use external pointers internally, like XStringSet objects. It can produce some embarrassing false positives:

x <- DNAStringSet(c("AA", "CCC"))
y <- DNAStringSet(c("GG", "TTT"))
identical(x, y)
# [1] TRUE


A more reliable way to check that 2 XStringSet objects contain the same sequences is to do identical(as.character(x), as.character(y)). However note that this does not compare the class or the metadata columns of the 2 objects.

H.

ADD REPLYlink written 11 weeks ago by Hervé Pagès ♦♦ 14k

Also I'm not so sure that your test identical(seq(myAln),seq(myAln2)) is doing what you have in mind. seq() is a base function that only works by chance on XStringSet objects but doesn't do anything interesting on them. It seems to be doing something like seq_along() on them.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Hervé Pagès ♦♦ 14k

And I realize now that the show method for XStringSet objects is probably misleading here, by suggesting the existence of a seq() getter for these objects. Hmm... not good. I wonder how many users got misled by this. I need to do something about this.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Hervé Pagès ♦♦ 14k

thanks again. that makes sense (the mcols thing).

I'd better steer clear of identical! that's unpredictable and a little scary.

and yes, I'd made a wrong assumption about 'seq' - not sure where that came into my head from that it might be a function to get the seqs. It could have been the 'show' output, or perhaps some other object type where there's a seq getter (?? I can't think what though). Would 'seqs' be a better column header in the show method, to reduce the chance of misleading users?

ADD REPLYlink written 11 weeks ago by Janet Young730

Not sure what's the best way to deal with this. I just opened an issue on GitHub about this so we keep track of the problem.

H.

ADD REPLYlink written 11 weeks ago by Hervé Pagès ♦♦ 14k
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 190 users visited in the last hour