get trimmed sequences based on cigar string in genomic alignments
1
0
Entering edit mode
Jake ▴ 90
@jake-7236
Last seen 20 months ago
United States

Hi,

I have rna seq data that I have mapped with STAR which allows soft clipping of reads. I have imported the bam file with the following code:

param <- ScanBamParam(flag=scanBamFlag(isSecondaryAlignment = F), tag=c('NH'), what='seq')
reads <- readGAlignments('mapped.bam', use.names=T, param = param)

I know that the cigar string shows how many nucleotides have been soft clipped from either the 5' and 3' end of a read with the #S#M or #M#S. I'd like to extract just the nucleotides that have been clipped from the 5' and the 3' end for each read. I saw that there are some functions to operate on cigar strings and genomicAlignments, but didn't see any that looked like they would easily do this. Is there an easy way to extract this information? Thanks

 

genomicalignments • 2.7k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 23 hours ago
Seattle, WA, United States

Hi Jake,

Yep the cigar utilities in GenomicAlignments are pretty low-level and not particularly easy to use but they do the job:

library(GenomicAlignments)
cigar <- c("3S15M3I24S5H", "25M300N15M", "30M10S")
cigarRangesAlongQuerySpace(cigar, ops="S")
# IRangesList of length 3
# [[1]]
# IRanges of length 2
#     start end width
# [1]     1   3     3
# [2]    22  45    24
# 
# [[2]]
# IRanges of length 0
# 
# [[3]]
# IRanges of length 1
#     start end width
# [1]    31  40    10

Then use this IRangesList object to extract the soft-clipped sequences from the query sequences with extractAt() from the Biostrings package.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Great this is exactly what I was looking for. Is there anyway to separate out 5' soft clip vs 3' soft clip?

ADD REPLY
0
Entering edit mode

Yes. For alignments located on the + strand, 5' soft clip is on the left and 3' soft clip is on the right. For alignments on the - strand it's the other way around. Load the BAM file as a GAlignments object with gal <- readGAlignments(..., param=ScanBamParam(what="seq")). Then get the strand with strand(gal), the CIGAR strings with cigar(gal), and the query sequences with mcols(gal)$seq. Compute the soft clip ranges with Sranges <- cigarRangesAlongQuerySpace(cigar(gal), ops="S"). Sranges is an IRangesList object parallel to gal. Ranges in Sranges that start at 1 are on the left, and ranges that end at width(mcols(gal)$seq) (which is the same as qwidth(gal)) are on the right.

Hope this helps,

H.

ADD REPLY

Login before adding your answer.

Traffic: 838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6