1
5 weeks ago by
Jiping Wang70
Jiping Wang70 wrote:

I encountered problems when dealing with paired-end read using readGAlignmentPairs function. I tested on the data the package comes with (i.e. extdata), no problem, this function returns the paired end data as expected. However when I applied to my own paired-end data (aligned from tophat), I got no records for some reason. I tested it using testPairedEndBam function, it returned TRUE. I also manually examined the SAM file, all reads appeared in pairs. Anyone has any hints on this?

> library(Rsamtools)
> library("GenomicAlignments")
> setwd("~/degnorm")

> bamfile="SRR873822.bam"
> ##testing whether single or paired-end
> testPairedEndBam(bamfile)
[1] TRUE
>
> ##subsetting the bam file for test
> which <- GRanges(names(t), IRanges(1, unname(t)))
>
> ##only look at chrI
> param <- ScanBamParam(which=which[1], what=character(), flag=scanBamFlag(isProperPair=TRUE,
+                                                                          isDuplicate=FALSE,
+                                                                          isSecondaryAlignment=FALSE))
> galp2 <- readGAlignmentPairs(bamfile, use.names=TRUE, param=param)
> galp2
GAlignmentPairs object with 0 pairs, strandMode=1, and 0 metadata columns:
seqnames strand :    ranges --    ranges
<Rle>  <Rle> : <IRanges> -- <IRanges>
-------
seqinfo: 25 sequences from an unspecified genome

software error • 175 views
modified 4 weeks ago by Martin Morgan ♦♦ 23k • written 5 weeks ago by Jiping Wang70

I would start by getting the relevant GRanges with

bf = BamFile("SRR873822.bam")
gr = as(seqinfo(bf), "GRanges")
chr1 = gr[1] ## or maybe gr["chr1"]


It might also be instructive to look at the number of mapped reads on each seqname

idxstatsBam(bf)


I would then suggest working through your filtering commands, starting with no filters

param = ScanBamParam(which = gr["chr1"])


and adding filters / etc until you identify the problem. It's not clear how the BAM file was created, so it's hard to provide specific advice. You could make a subset of it (e.g., using filterBam() and share that via dropbox or similar...

1

Hi Martin,

Thanks so much for help. My data is paired-end RNA-seq data, aligned using tophat. I followed your recommendation, here is what got:

> bf = BamFile("SRR873822.bam")
> gr = as(seqinfo(bf), "GRanges")
> chr1 = gr[1] ## or maybe gr["chr1"]
> idxstatsBam(bf)
seqnames seqlength  mapped unmapped
1      chr1 249250621 6020257        0
2      chr2 243199373 3338466        0
3      chr3 198022430 3204069        0
4      chr4 191154276 1509773        0
5      chr5 180915260 2832690        0
6      chr6 171115067 3186513        0
7      chr7 159138663 2996980        0
8      chr8 146364022 1475509        0
9      chr9 141213431 1913632        0
10    chr10 135534747 2361487        0
11    chr11 135006516 2559868        0
12    chr12 133851895 3399573        0
13    chr13 115169878  695887        0
14    chr14 107349540 1308601        0
15    chr15 102531392 2320640        0
16    chr16  90354753 1777503        0
17    chr17  81195210 3904006        0
18    chr18  78077248  583429        0
19    chr19  59128983 3552096        0
20    chr20  63025520 1465676        0
21    chr21  48129895  426795        0
22    chr22  51304566  991203        0
23     chrX 155270560 1783619        0
24     chrY  59373566   81558        0
25     chrM     16571 1952807        0
> param = ScanBamParam(which = gr["chr1"])
GAlignmentPairs object with 0 pairs, strandMode=1, and 0 metadata columns:
seqnames strand :    ranges --    ranges
<Rle>  <Rle> : <IRanges> -- <IRanges>
-------
seqinfo: 25 sequences from an unspecified genome


A subset of the bam file (only chr1) can be downloaded from here: https://www.dropbox.com/s/qgjh74lmucp13r6/SRR873822_chr1.bam?dl=0

A next step might be

Rsamtools::quickBamFlagSummary(bf, param = param)


and interpret the output to ensure that your alignment is actually paired end.

If so, you might try readGAlignments(bf, param = param) and readGAlignmentsList(bf, param = param).

1

I do not quite understand the output of quckBamFlagSummary. What does it tell about the paired-end or single-end?

> Rsamtools::quickBamFlagSummary(bf, param = param)
group |    nb of |    nb of | mean / max
of |  records |   unique | records per
records | in group |   QNAMEs | unique QNAME
All records........................ A |  6020257 |  5809496 | 1.04 / 20
o template has single segment.... S |        0 |        0 |   NA / NA
o template has multiple segments. M |  6020257 |  5809496 | 1.04 / 20
- first segment.............. F |  3069014 |  2961423 | 1.04 / 20
- last segment............... L |  2951243 |  2848073 | 1.04 / 20
- other segment.............. O |        0 |        0 |   NA / NA

Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.

Details for group M:
o record is mapped.............. M1 |  6020257 |  5809496 | 1.04 / 20
- primary alignment......... M2 |  5728998 |  5728998 |    1 / 1
- secondary alignment....... M3 |   291259 |   140970 | 2.07 / 19
o record is unmapped............ M4 |        0 |        0 |   NA / NA

Details for group F:
o record is mapped.............. F1 |  3069014 |  2961423 | 1.04 / 20
- primary alignment......... F2 |  2919904 |  2919904 |    1 / 1
- secondary alignment....... F3 |   149110 |    72292 | 2.06 / 19
o record is unmapped............ F4 |        0 |        0 |   NA / NA

Details for group L:
o record is mapped.............. L1 |  2951243 |  2848073 | 1.04 / 20
- primary alignment......... L2 |  2809094 |  2809094 |    1 / 1
- secondary alignment....... L3 |   142149 |    68678 | 2.07 / 19
o record is unmapped............ L4 |        0 |        0 |   NA / NA


It says that most reads are paired end, and most are 'primary' alignments. This all suggests that they should be input by readGAlignmentPairs().

It looks like in my code I should have suggested

bf = BamFile("SRR873822.bam", asMates = TRUE)
param = ScanBamParam(which = gr["chr1"])


Does it 'work' to

param = ScanBamParam(which = gr["chr1"], what = "mapq")
galn = readGAlignments(bf, param = param)


and then to summarize map quality

table(mcols(galn)$mapq)  ADD REPLYlink written 5 weeks ago by Martin Morgan ♦♦ 23k 1 Thanks again! Still I cannot get the paired end counts. > bf = BamFile("SRR873822.bam", asMates = TRUE) > param = ScanBamParam(which = gr["chr1"]) > readGAlignmentPairs(bf, param = param) GAlignmentPairs object with 0 pairs, strandMode=1, and 0 metadata columns: seqnames strand : ranges -- ranges <Rle> <Rle> : <IRanges> -- <IRanges> ------- seqinfo: 25 sequences from an unspecified genome  > param = ScanBamParam(which = gr["chr1"], what = "mapq") > galn = readGAlignments(bf, param = param) > table(mcols(galn)$mapq)

0       1       3      50
160227   56229  206403 5597398


Thanks for your patience; this shows that there are actually reads there, and that they have high map quality. They are simply not being paired. It could be that the flags are somehow excluding proper pairing, so

param = ScanBamParam(which = gr["chr1"], what = c("qname", "flag"))
galn = readGAlignments(bf, param = param)
colSums(bamFlagAsBitMatrix(mcols(galn)$flag))  might be informative. Another possibility is that the read names have been mangled, e.g., by adding suffix .1 / .2; the following table(table(mcols(galn)$qname))


will I think generate a short table where most of the entries should be '2'.

1

Thanks, Martin! This is very helpful. I think we are close to the answer why I couldn't get the paired-end count. Indeed the qname of the two mates have the extension .1 and .2. For example, the following are two paired ends. I thought this is the standard for naming the two mates and the R package should automatically detect it since this is alignment output from tophat, which is still widely used. Following your instruction, here is the output. What would you recommend to do next? thanks.

SRR873822.22653443.1    355     chr1    11930   3       101M    =       12005   176     CTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGGTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTG   ???DBD;BC;D;?EEIID+@:<CEECCEE<EDDADD>D@<//(8;BAC.=AC57@CDC=;A).;;?;.66@;>A;>A<>A=<A::>A:>>;+:4>A><>AA   AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:61C39      YT:Z:UU XS:A:+  NH:i:2  CC:Z:chr15      CP:i:102519140  HI:i:0
SRR873822.22653443.2    403     chr1    12005   3       101M    =       11930   -176    TGCAGGTGTCTGACTTCCAGCAACTGCTGGCGTGTGCCAGGG

> param = ScanBamParam(which = gr["chr1"], what = c("qname", "flag"))
> galn = readGAlignments(bf, param = param)
> colSums(bamFlagAsBitMatrix(mcols(galn)$flag)) isPaired isProperPair 6020257 4758548 isUnmappedQuery hasUnmappedMate 0 349905 isMinusStrand isMateMinusStrand 3009041 2835259 isFirstMateRead isSecondMateRead 3069014 2951243 isSecondaryAlignment isNotPassingQualityControls 291259 0 isDuplicate isSupplementaryAlignment 0 0 > table(table(mcols(galn)$qname))

1       2       3       4       5       6       7       8       9
5742483   44302    6968    4162    1223    1583     653     625     682
10      11      12      13      14      15      16      17      18
647     278     252      60     131     246     272     279     199
19      20
257    4194

0
4 weeks ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

OK I think we have enough information for an answer!

The SRA appended .1 and .2 to the qname. This propagates through Tophat. I believe that one can tell BamFile() to ignore the trailing .1 and .2 (there are two arguments qnamePrefixEnd= / qnameSuffixStart=; I would have guessed that one wanted qnamePrefixEnd=, but reading the help page ?BamFile suggests qnameSuffixStart=...)

bf = BamFile("SRR873822.bam", asMates = TRUE, qnameSuffixStart = ".")
gr = as(seqinfo(bf), "GRanges")
param = ScanBamParam(which = gr["chr1"])

1

I missed that these data are from the SRA. Presumably the OP used fastq-dump to convert the SRR file to fastq, in which case it is usually helpful to use the --split-files argument and omit the -I argument, as that is what appends the offending .1 and .2 to the spot IDs.

Yes, it's from SRA. Thanks for help.

1

Hi Martin,

Thanks so much for such detailed guidance and help! It seems to be working now. I will further try whether the output from the paired alignment file will lead to right coverage score calculation on transcript/genes.

> bf = BamFile("SRR873822.bam", asMates = TRUE, qnameSuffixStart = ".")
> gr = as(seqinfo(bf), "GRanges")
> param = ScanBamParam(which = gr["chr1"])
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
120213 alignments with ambiguous pairing were dumped.
Use 'getDumpedAlignments()' to retrieve them from the dump environment.
> length(gal2)
[1] 2830882