0
23 months ago by
tangming2005140
United States
tangming2005140 wrote:

Hi,

I was running QC for my whole exome data (paired-end sequencing) and

got this error:

> readpairs <- reads2pairs(reads)

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] TEQC_3.10.0          hwriter_1.3.2        Rsamtools_1.22.0
[4] Biostrings_2.38.4    XVector_0.10.0       GenomicRanges_1.22.4
[7] GenomeInfoDb_1.6.3   IRanges_2.4.8        S4Vectors_0.8.11
[10] BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] bitops_1.0-6         futile.options_1.0.0 zlibbioc_1.16.0
[4] futile.logger_1.4.3  BiocParallel_1.4.3   lambda.r_1.1.9
[7] tools_3.2.3          Biobase_2.30.0      

Any idea why is that? for paired-end bam, each ID should replicate 2 times for the mates? right?

Thanks,

TOmmy

teqc • 971 views
modified 22 months ago • written 23 months ago by tangming2005140
0
22 months ago by
m.hummel10
Germany
m.hummel10 wrote:

Hi Tommy,

yes, each ID should be there twice (or only once, if the mate could not be mapped). The error occurs if there are IDs that appear more than 2 times, e.g. when there multi-mapping reads.

When reading a bam file with 'get.reads', automatically only primary alignments should be kept. But maybe your 'reads' object was derived in a different way, e.g. from a bed file or by some other function?

Best

Manuela

I actually used bwa mem to map to get the bam files, so bwa mem is keeping the multi-mapped reads I guess.

You mentioned that get.reads only read in the primary alignments? but why it is still there?

Maybe I need to remove those reads in bam before feeding into TEQC?

Thank you!

Tommy

That only primary alignments are kept in 'get.reads' was introduced in TEQC version 3.13.1.

If the error still appears for recent TEQC versions and/or after removing secondary alignments from the bam, I should have a closer look.

Manuela

I was using 3.1.0. A side question is how TEQC determine the not uniquely mapped reads for BWA-MEM aligned bam files.

BWA-mem added XA tag to a read if it is mapped to multiple places. some people also use

samtools view -q 1

but q=0 is only a convention for multi-mapped reads by BWA-mem not by other aligners.

Tommy

TEQC reads bam files with the 'scanBam' function in the 'Rsamtools' package. This function uses 'scanBamFlag' to determine which reads to keep based on the FLAG field in the bam file (second column).

Secondary alignments have flag = 256, and filtering those out with 'scanBam' should be equivalent to:

samtools view -F 256 file.bam

(Additionally TEQC specifies isUnmappedQuery=FALSE in order to remove reads without match to the reference.)

I don't know how this refers to the specifications that BWA-mem adds to other fields in the bam...

samtools view -F 256 will remove some reads, but not all not uniquely mapped reads. Have you tried TEQC on any BWA-MEM aligned bam files?

from Heng Li:

https://www.biostars.org/p/59281/#59303

Eland is probably the first short read aligner. It reports a tag, which can be '[UR][0-2]|NM', for each read to indicate if the read is mapped uniquely, repetitively or unmapped. Since then, we are used to talking about `mapping uniqueness' and extend the concept to generic alignments without asking ourselves what is the exact definiation of uniqueness and if such a concept is useful in practice at all.

For Eland, mapping uniqness is clearly defined. A read is said to be aligned uniquely if the best two alignments have identical number of mismatches. Eland requires the full-length read to be aligned and it does not do gapped alignment. Such a definition is useful to downstream analyses.

However, when we consider base quality, the usefulness of uniqueness becomes less obvious. Suppose a read has no perfect match and two 1-mismatch hits. The first hit has a Q5 mismatch and the second has a Q30 mismatch. If the quality is accurate, the first hit is clearly better than the second. Why couldn't we trust the first hit?

In addition, as is pointed out by one of the anonymous reviewers of my BWA paper, an aligner may not be able to find the best hits if heuristics are in use, and in this case, the aligner is only able to find 'unique' reads by its own definition.

Furthermore, once we allow gaps, mapping uniqueness becomes even less useful. Firstly, we need to redefine uniqueness as we have gaps. One possible way is to define a read as being uniquely mapped if the best two alignments have identical number of differences (mismatches plus gaps). The definition is clear, but not useful. We know on Illumina reads indel errors occur rarely. A hit containing one mismatch is definitely preferred over a hit with one gap.

Things get even worse when we clip reads as what we do for capillary reads. We can only define a read being uniquely mapped when the top two alignments have identical alignment score. However, this is almost practically useless at all. For long reads, frequently we get alignments with similar scores, but we seldom get two with identical scores.

Uniqueness was initially introduced to measure the reliability of ungapped short read alignment with a read aligned in full length. It is not a proper concept for generic alignments. For generic alignments, what is much more useful is mapping quality, first introduced in my maq paper. Mapping quality is phred-scaled probability of the alignment being wrong. It unambiguously measures the alignment reliability in a universal way. Calculating mapping quality is related to a proper statistical alignment/error model, and this is the right thing to do. I would strongly recommend all aligners to report mapping quality. Mapping uniqueness was not widely used two years ago and will not be widely used two years later. It is just a temporary concept, reflecting our lack of knowledge on measuring the reliability of an alignment.