readGAlignmentPairs fail to read paired end data
1
1
Entering edit mode
Jiping Wang ▴ 90
@jiping-wang-6687
Last seen 19 months ago
United States

I encountered problems when dealing with paired-end read using readGAlignmentPairs function. I tested on the data the package comes with (i.e. extdata), no problem, this function returns the paired end data as expected. However when I applied to my own paired-end data (aligned from tophat), I got no records for some reason. I tested it using testPairedEndBam function, it returned TRUE. I also manually examined the SAM file, all reads appeared in pairs. Anyone has any hints on this?

> library(Rsamtools)
> library("GenomicAlignments")
> setwd("~/degnorm")

#read in paired-end bam file

> bamfile="SRR873822.bam"
> ##testing whether single or paired-end
> testPairedEndBam(bamfile)
[1] TRUE
> 
> ##subsetting the bam file for test
> t <- scanBamHeader(bamfile)[[1]][["targets"]]
> which <- GRanges(names(t), IRanges(1, unname(t)))
> 
> ##only look at chrI
> param <- ScanBamParam(which=which[1], what=character(), flag=scanBamFlag(isProperPair=TRUE,
+                                                                          isDuplicate=FALSE,
+                                                                          isSecondaryAlignment=FALSE))
> galp2 <- readGAlignmentPairs(bamfile, use.names=TRUE, param=param)
> galp2
GAlignmentPairs object with 0 pairs, strandMode=1, and 0 metadata columns:
   seqnames strand :    ranges --    ranges
      <Rle>  <Rle> : <IRanges> -- <IRanges>
  -------
  seqinfo: 25 sequences from an unspecified genome
software error • 1.8k views
ADD COMMENT
0
Entering edit mode

I would start by getting the relevant GRanges with

bf = BamFile("SRR873822.bam")
gr = as(seqinfo(bf), "GRanges")
chr1 = gr[1] ## or maybe gr["chr1"]

It might also be instructive to look at the number of mapped reads on each seqname

idxstatsBam(bf)

I would then suggest working through your filtering commands, starting with no filters

param = ScanBamParam(which = gr["chr1"])
readGAlignmentPairs(bf, param = param)

and adding filters / etc until you identify the problem. It's not clear how the BAM file was created, so it's hard to provide specific advice. You could make a subset of it (e.g., using filterBam() and share that via dropbox or similar...

ADD REPLY
1
Entering edit mode

Hi Martin,

Thanks so much for help. My data is paired-end RNA-seq data, aligned using tophat. I followed your recommendation, here is what got:

> bf = BamFile("SRR873822.bam")
> gr = as(seqinfo(bf), "GRanges")
> chr1 = gr[1] ## or maybe gr["chr1"]
> idxstatsBam(bf)
   seqnames seqlength  mapped unmapped
1      chr1 249250621 6020257        0
2      chr2 243199373 3338466        0
3      chr3 198022430 3204069        0
4      chr4 191154276 1509773        0
5      chr5 180915260 2832690        0
6      chr6 171115067 3186513        0
7      chr7 159138663 2996980        0
8      chr8 146364022 1475509        0
9      chr9 141213431 1913632        0
10    chr10 135534747 2361487        0
11    chr11 135006516 2559868        0
12    chr12 133851895 3399573        0
13    chr13 115169878  695887        0
14    chr14 107349540 1308601        0
15    chr15 102531392 2320640        0
16    chr16  90354753 1777503        0
17    chr17  81195210 3904006        0
18    chr18  78077248  583429        0
19    chr19  59128983 3552096        0
20    chr20  63025520 1465676        0
21    chr21  48129895  426795        0
22    chr22  51304566  991203        0
23     chrX 155270560 1783619        0
24     chrY  59373566   81558        0
25     chrM     16571 1952807        0
> param = ScanBamParam(which = gr["chr1"])
> readGAlignmentPairs(bf, param = param)
GAlignmentPairs object with 0 pairs, strandMode=1, and 0 metadata columns:
   seqnames strand :    ranges --    ranges
      <Rle>  <Rle> : <IRanges> -- <IRanges>
  -------
  seqinfo: 25 sequences from an unspecified genome

A subset of the bam file (only chr1) can be downloaded from here: https://www.dropbox.com/s/qgjh74lmucp13r6/SRR873822_chr1.bam?dl=0

ADD REPLY
0
Entering edit mode

A next step might be

Rsamtools::quickBamFlagSummary(bf, param = param)

and interpret the output to ensure that your alignment is actually paired end.

If so, you might try readGAlignments(bf, param = param) and readGAlignmentsList(bf, param = param).

ADD REPLY
1
Entering edit mode

I do not quite understand the output of quckBamFlagSummary. What does it tell about the paired-end or single-end?

> Rsamtools::quickBamFlagSummary(bf, param = param)
                                group |    nb of |    nb of | mean / max
                                   of |  records |   unique | records per
                              records | in group |   QNAMEs | unique QNAME
All records........................ A |  6020257 |  5809496 | 1.04 / 20
  o template has single segment.... S |        0 |        0 |   NA / NA
  o template has multiple segments. M |  6020257 |  5809496 | 1.04 / 20
      - first segment.............. F |  3069014 |  2961423 | 1.04 / 20
      - last segment............... L |  2951243 |  2848073 | 1.04 / 20
      - other segment.............. O |        0 |        0 |   NA / NA

Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.

Details for group M:
  o record is mapped.............. M1 |  6020257 |  5809496 | 1.04 / 20
      - primary alignment......... M2 |  5728998 |  5728998 |    1 / 1
      - secondary alignment....... M3 |   291259 |   140970 | 2.07 / 19
  o record is unmapped............ M4 |        0 |        0 |   NA / NA

Details for group F:
  o record is mapped.............. F1 |  3069014 |  2961423 | 1.04 / 20
      - primary alignment......... F2 |  2919904 |  2919904 |    1 / 1
      - secondary alignment....... F3 |   149110 |    72292 | 2.06 / 19
  o record is unmapped............ F4 |        0 |        0 |   NA / NA

Details for group L:
  o record is mapped.............. L1 |  2951243 |  2848073 | 1.04 / 20
      - primary alignment......... L2 |  2809094 |  2809094 |    1 / 1
      - secondary alignment....... L3 |   142149 |    68678 | 2.07 / 19
  o record is unmapped............ L4 |        0 |        0 |   NA / NA
ADD REPLY
0
Entering edit mode

It says that most reads are paired end, and most are 'primary' alignments. This all suggests that they should be input by readGAlignmentPairs().

It looks like in my code I should have suggested

bf = BamFile("SRR873822.bam", asMates = TRUE)
param = ScanBamParam(which = gr["chr1"])
readGAlignmentPairs(bf, param = param)

Does it 'work' to

param = ScanBamParam(which = gr["chr1"], what = "mapq")
galn = readGAlignments(bf, param = param)

and then to summarize map quality

table(mcols(galn)$mapq)
ADD REPLY
1
Entering edit mode

Thanks again! Still I cannot get the paired end counts.

> bf = BamFile("SRR873822.bam", asMates = TRUE)
> param = ScanBamParam(which = gr["chr1"])
> readGAlignmentPairs(bf, param = param)
GAlignmentPairs object with 0 pairs, strandMode=1, and 0 metadata columns:
   seqnames strand :    ranges --    ranges
      <Rle>  <Rle> : <IRanges> -- <IRanges>
  -------
  seqinfo: 25 sequences from an unspecified genome
> param = ScanBamParam(which = gr["chr1"], what = "mapq")
> galn = readGAlignments(bf, param = param)
> table(mcols(galn)$mapq)

      0       1       3      50 
 160227   56229  206403 5597398 
ADD REPLY
0
Entering edit mode

Thanks for your patience; this shows that there are actually reads there, and that they have high map quality. They are simply not being paired. It could be that the flags are somehow excluding proper pairing, so

param = ScanBamParam(which = gr["chr1"], what = c("qname", "flag"))
galn = readGAlignments(bf, param = param)
colSums(bamFlagAsBitMatrix(mcols(galn)$flag))

might be informative. Another possibility is that the read names have been mangled, e.g., by adding suffix .1 / .2; the following

table(table(mcols(galn)$qname))

will I think generate a short table where most of the entries should be '2'.

ADD REPLY
1
Entering edit mode

Thanks, Martin! This is very helpful. I think we are close to the answer why I couldn't get the paired-end count. Indeed the qname of the two mates have the extension .1 and .2. For example, the following are two paired ends. I thought this is the standard for naming the two mates and the R package should automatically detect it since this is alignment output from tophat, which is still widely used. Following your instruction, here is the output. What would you recommend to do next? thanks.

SRR873822.22653443.1    355     chr1    11930   3       101M    =       12005   176     CTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGGTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTG   ???DBD;BC;D;?EEIID+@:<CEECCEE<EDDADD>D@<//(8;BAC.=AC57@CDC=;A).;;?;.66@;>A;>A<>A=<A::>A:>>;+:4>A><>AA   AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:61C39      YT:Z:UU XS:A:+  NH:i:2  CC:Z:chr15      CP:i:102519140  HI:i:0
SRR873822.22653443.2    403     chr1    12005   3       101M    =       11930   -176    TGCAGGTGTCTGACTTCCAGCAACTGCTGGCGTGTGCCAGGG
> param = ScanBamParam(which = gr["chr1"], what = c("qname", "flag"))
> galn = readGAlignments(bf, param = param)
> colSums(bamFlagAsBitMatrix(mcols(galn)$flag))
                   isPaired                isProperPair 
                    6020257                     4758548 
            isUnmappedQuery             hasUnmappedMate 
                          0                      349905 
              isMinusStrand           isMateMinusStrand 
                    3009041                     2835259 
            isFirstMateRead            isSecondMateRead 
                    3069014                     2951243 
       isSecondaryAlignment isNotPassingQualityControls 
                     291259                           0 
                isDuplicate    isSupplementaryAlignment 
                          0                           0 
> table(table(mcols(galn)$qname))

      1       2       3       4       5       6       7       8       9 
5742483   44302    6968    4162    1223    1583     653     625     682 
     10      11      12      13      14      15      16      17      18 
    647     278     252      60     131     246     272     279     199 
     19      20 
    257    4194 
ADD REPLY
0
Entering edit mode
@martin-morgan-1513
Last seen 4 weeks ago
United States

OK I think we have enough information for an answer!

The SRA appended .1 and .2 to the qname. This propagates through Tophat. I believe that one can tell BamFile() to ignore the trailing .1 and .2 (there are two arguments qnamePrefixEnd= / qnameSuffixStart=; I would have guessed that one wanted qnamePrefixEnd=, but reading the help page ?BamFile suggests qnameSuffixStart=...)

bf = BamFile("SRR873822.bam", asMates = TRUE, qnameSuffixStart = ".")
gr = as(seqinfo(bf), "GRanges")
param = ScanBamParam(which = gr["chr1"])
readGAlignmentPairs(bf, param = param)
ADD COMMENT
1
Entering edit mode

I missed that these data are from the SRA. Presumably the OP used fastq-dump to convert the SRR file to fastq, in which case it is usually helpful to use the --split-files argument and omit the -I argument, as that is what appends the offending .1 and .2 to the spot IDs.

ADD REPLY
0
Entering edit mode

Yes, it's from SRA. Thanks for help.

ADD REPLY
1
Entering edit mode

Hi Martin,

Thanks so much for such detailed guidance and help! It seems to be working now. I will further try whether the output from the paired alignment file will lead to right coverage score calculation on transcript/genes.

> bf = BamFile("SRR873822.bam", asMates = TRUE, qnameSuffixStart = ".")
> gr = as(seqinfo(bf), "GRanges")
> param = ScanBamParam(which = gr["chr1"])
> gal2=readGAlignmentPairs(bf, param = param)
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
    120213 alignments with ambiguous pairing were dumped.
    Use 'getDumpedAlignments()' to retrieve them from the dump environment.
> length(gal2)
[1] 2830882
ADD REPLY

Login before adding your answer.

Traffic: 747 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6