Question

GAlignments and the query sequence (SEQ) field of BAM files

0

Entering edit mode

Robert K. Bradley ▴ 40

@robert-k-bradley-5997

Last seen 22 months ago

United States

Hello,

I'm trying to figure out how to use Bioconductor for the following use case. I want to parse a subset of aligned reads from a BAM file, optionally trim the alignments, and then access the (possibly trimmed) query sequences. The first two operations--reading from the BAM file and trimming the alignments--can be accomplished easily with GAlignments. However, my understanding is that query sequences are discarded when the GAlignments are populated from the BAM file.

My first question is whether you have contemplated adding an optional slot to hold the query sequences within a GAlignments object. I understand that storing these sequences is expensive, but being able to access them directly via GAlignments would be extremely helpful for many studies, including those of genetic variation and allele-specific expression.

If that's not on the horizon, then would you be willing to please suggest a possible workflow to accomplish my goal? The best that I can come up with is to load GAligments including the query names, identify the set of reads of interest, load the query sequences via Rsamtools, and then match the GAlignments to the corresponding query sequence by using the query names. This is cumbersome and probably slow, but seems like it might work.

Thank you,

Rob

[galignments] bam • 3.1k views

ADD COMMENT • link 10.3 years ago Robert K. Bradley ▴ 40

0

Entering edit mode

Hi Hervé,

Thank you for the quick and very helpful response. I apologize for overlooking this feature of readGAlignments; it's exactly what I was looking for.

I noticed that functions like qnarrow() don't affect the sequence when loaded as metadata (for understandable reasons). Is there a recommended programmatic way to modify metadata that corresponds to strings of the same length as the reads (e.g., query sequence, quality scores, etc.), or is it reasonable to just do it by hand by taking substrings and then setting the metadata accordingly?

Thank you,

Rob

ADD REPLY • link 10.3 years ago Robert K. Bradley ▴ 40

0

Entering edit mode

(To clarify, I am currently using narrow() on the DNAStringSet, which works fine; I'm just wondering if there is a more appropriate method that I should use to manipulate the metadata of a GAlignments object when the metadata is a string of length equal to the width of the read.)

ADD REPLY • link 10.3 years ago Robert K. Bradley ▴ 40

0

Entering edit mode

Yes qnarrow() propagates the metadata columns as-is. More generally speaking, vectorized transformations (i.e. transformations that return an object parallel to the input) either propagate the metadata columns as-is to the result, or drop them, or sometimes add new metadata columns. But they generally don't alter the metadata columns that are propagated to the result. So yes, you need to adjust the sequences in the metadata columns of your trimmed GAlignments object by hand. I confirm that qnarrow'ing the alignments translates into narrow'ing the query sequences (SEQ field) or the quality strings (QUAL field). So in your case:

param <- ScanBamParam(what=c("seq", "qual"))
gal <- readGAlignments(file, use.names=TRUE, param=param)
gal2 <- qnarrow(gal, ...)
mcols(gal2)$seq <- narrow(mcols(gal)$seq, ...)
mcols(gal2)$qual <- narrow(mcols(gal)$qual, ...)

I'll add an example like that to ?`qnarrow,GAlignments-method`.

Hope this helps,

H.

ADD REPLY • link 10.3 years ago Hervé Pagès 16k

0

Entering edit mode

Thank you. This is perfect. On 10/21/15 11:39 AM, Hervé Pagès [bioc] wrote: > Activity on a post you are following on support.bioconductor.org > <https: support.bioconductor.org=""> > > User Hervé Pagès <https: support.bioconductor.org="" u="" 1542=""/> wrote > Comment: GAlignments and the query sequence (SEQ) field of BAM files > <https: support.bioconductor.org="" p="" 73618="" #73649="">: > > Yes |qnarrow()| propagates the metadata columns as-is. More generally > speaking, vectorized transformations (i.e. transformations that return > an object /parallel/ to the input) either propagate the metadata columns > as-is to the result, or drop them, or sometimes add new metadata > columns. But they generally don't alter the metadata columns that are > propagated to the result. So yes, you need to adjust the sequences in > the metadata columns of your trimmed GAlignments object by hand. I > confirm that |qnarrow|'ing the alignments translates into |narrow|'ing > the query sequences (SEQ field) or the quality strings (QUAL field). So > in your case: > > param <- ScanBamParam(what=c("seq", "qual")) > gal <- readGAlignments(file, use.names=TRUE, param=param) > gal2 <- qnarrow(gal, ...) > mcols(gal2)$seq <- narrow(mcols(gal)$seq, ...) > mcols(gal2)$qual <- narrow(mcols(gal)$qual, ...) > > I'll add an example like that to |?`qnarrow,GAlignments-method`|. > > Hope this helps, > > H. > > ------------------------------------------------------------------------ > > Post tags: [galignments], bam > > You may reply via email or visit > C: GAlignments and the query sequence (SEQ) field of BAM files >

ADD REPLY • link 10.3 years ago Robert K. Bradley ▴ 40

score 0 · Answer 1 · 2015-10-21

Hi Rob,

Like with scanBam(), you can load any BAM field and/or tag with readGAlignments(). They will end up as metadata columns on the returned GAlignments object. For example to load the query sequences:

library(GenomicAlignments)
file <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what="seq")
gal <- readGAlignments(file, use.names=TRUE, param=param)
gal
# GAlignments object with 3271 alignments and 1 metadata column:
#                          seqnames strand       cigar    qwidth
#                             <Rle>  <Rle> <character> <integer>
#      B7_591:4:96:693:509     seq1      +         36M        36
#   EAS54_65:7:152:368:113     seq1      +         35M        35
#      EAS51_64:8:5:734:57     seq1      +         35M        35
#     B7_591:1:289:587:906     seq1      +         36M        36
#    EAS56_59:8:38:671:758     seq1      +         35M        35
#                      ...      ...    ...         ...       ...
#     EAS56_63:4:141:9:811     seq2      +         35M        35
#  EAS114_30:6:277:397:932     seq2      +         35M        35
# EAS139_11:7:50:1229:1313     seq2      -         35M        35
#    EAS54_65:3:320:20:250     seq2      -         35M        35
#    EAS114_26:7:37:79:581     seq2      -         35M        35
#                             start       end     width     njunc
#                         <integer> <integer> <integer> <integer>
#      B7_591:4:96:693:509        1        36        36         0
#   EAS54_65:7:152:368:113        3        37        35         0
#      EAS51_64:8:5:734:57        5        39        35         0
#     B7_591:1:289:587:906        6        41        36         0
#    EAS56_59:8:38:671:758        9        43        35         0
#                      ...      ...       ...       ...       ...
#     EAS56_63:4:141:9:811     1524      1558        35         0
#  EAS114_30:6:277:397:932     1524      1558        35         0
# EAS139_11:7:50:1229:1313     1528      1562        35         0
#    EAS54_65:3:320:20:250     1532      1566        35         0
#    EAS114_26:7:37:79:581     1533      1567        35         0
#                           |                                  seq
#                           |                       <DNAStringSet>
#      B7_591:4:96:693:509  | CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
#   EAS54_65:7:152:368:113  |  CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
#      EAS51_64:8:5:734:57  |  AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
#     B7_591:1:289:587:906  | GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
#    EAS56_59:8:38:671:758  |  GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
#                      ... ..                                  ...
#     EAS56_63:4:141:9:811  |  TTTCTTTTCTCCTTTTTTTTTTTTTTTTTCTACAT
#  EAS114_30:6:277:397:932  |  TTTCTTTTCACTTTTTTTTTTTTTTTTTTTTACTT
# EAS139_11:7:50:1229:1313  |  TTTTTTCTTTTTTTTTTTTTTTTTTTTGCATGCCA
#    EAS54_65:3:320:20:250  |  TTTTTTTTTTTTTTTTTTTTTTTGCATGCCAGAAA
#    EAS114_26:7:37:79:581  |  TTTTTTTTTTTTTTTTTTTTTTTCATGCCAGAAAA
#  -------
#  seqinfo: 2 sequences from an unspecified genome

Then to access the query sequences:

mcols(gal)$seq
#  A DNAStringSet instance of length 3271
#        width seq
#    [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
#    [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
#    [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
#    [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
#    [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
#    ...   ... ...
# [3267]    35 TTTCTTTTCTCCTTTTTTTTTTTTTTTTTCTACAT
# [3268]    35 TTTCTTTTCACTTTTTTTTTTTTTTTTTTTTACTT
# [3269]    35 TTTTTTCTTTTTTTTTTTTTTTTTTTTGCATGCCA
# [3270]    35 TTTTTTTTTTTTTTTTTTTTTTTGCATGCCAGAAA
# [3271]    35 TTTTTTTTTTTTTTTTTTTTTTTCATGCCAGAAAA

See ?readGAlignments for more information and examples.

H.