Search
Question: reads2pairs from TEQC error:Error in reads2pairs(reads) : read pair IDs do not seem to be unique
0
gravatar for tangming2005
4 months ago by
tangming200590
United States
tangming200590 wrote:

Hi,

I was running QC for my whole exome data (paired-end sequencing) and

got this error:

> readpairs <- reads2pairs(reads)
Error in reads2pairs(reads) : read pair IDs do not seem to be unique

 sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] TEQC_3.10.0          hwriter_1.3.2        Rsamtools_1.22.0    
 [4] Biostrings_2.38.4    XVector_0.10.0       GenomicRanges_1.22.4
 [7] GenomeInfoDb_1.6.3   IRanges_2.4.8        S4Vectors_0.8.11    
[10] BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] bitops_1.0-6         futile.options_1.0.0 zlibbioc_1.16.0     
[4] futile.logger_1.4.3  BiocParallel_1.4.3   lambda.r_1.1.9      
[7] tools_3.2.3          Biobase_2.30.0      

 

Any idea why is that? for paired-end bam, each ID should replicate 2 times for the mates? right?

Thanks,

TOmmy

 

 

 

ADD COMMENTlink modified 4 months ago • written 4 months ago by tangming200590
0
gravatar for m.hummel
4 months ago by
m.hummel10
Germany
m.hummel10 wrote:

Hi Tommy,

yes, each ID should be there twice (or only once, if the mate could not be mapped). The error occurs if there are IDs that appear more than 2 times, e.g. when there multi-mapping reads.

When reading a bam file with 'get.reads', automatically only primary alignments should be kept. But maybe your 'reads' object was derived in a different way, e.g. from a bed file or by some other function?

Best

Manuela

ADD COMMENTlink written 4 months ago by m.hummel10

I actually used bwa mem to map to get the bam files, so bwa mem is keeping the multi-mapped reads I guess. 

You mentioned that `get.reads` only read in the primary alignments? but why it is still there? 

Maybe I need to remove those reads in bam before feeding into TEQC?

Thank you!

Tommy

ADD REPLYlink written 4 months ago by tangming200590

That only primary alignments are kept in 'get.reads' was introduced in TEQC version 3.13.1.

If the error still appears for recent TEQC versions and/or after removing secondary alignments from the bam, I should have a closer look.

Manuela

ADD REPLYlink written 4 months ago by m.hummel10

I was using 3.1.0. A side question is how TEQC determine the not uniquely mapped reads for BWA-MEM aligned bam files.

I read here:

https://www.biostars.org/p/76893/  
https://www.biostars.org/p/167892/  

BWA-mem added XA tag to a read if it is mapped to multiple places. some people also use 

samtools view -q 1

but q=0 is only a convention for multi-mapped reads by BWA-mem not by other aligners.

Tommy

ADD REPLYlink written 4 months ago by tangming200590

TEQC reads bam files with the 'scanBam' function in the 'Rsamtools' package. This function uses 'scanBamFlag' to determine which reads to keep based on the FLAG field in the bam file (second column).

Secondary alignments have flag = 256, and filtering those out with 'scanBam' should be equivalent to:

samtools view -F 256 file.bam

(Additionally TEQC specifies isUnmappedQuery=FALSE in order to remove reads without match to the reference.)

I don't know how this refers to the specifications that BWA-mem adds to other fields in the bam...

ADD REPLYlink written 4 months ago by m.hummel10

samtools view -F 256 will remove some reads, but not all not uniquely mapped reads. Have you tried TEQC on any BWA-MEM aligned bam files?

from Heng Li:

https://www.biostars.org/p/59281/#59303

Eland is probably the first short read aligner. It reports a tag, which can be '[UR][0-2]|NM', for each read to indicate if the read is mapped uniquely, repetitively or unmapped. Since then, we are used to talking about `mapping uniqueness' and extend the concept to generic alignments without asking ourselves what is the exact definiation of uniqueness and if such a concept is useful in practice at all.

For Eland, mapping uniqness is clearly defined. A read is said to be aligned uniquely if the best two alignments have identical number of mismatches. Eland requires the full-length read to be aligned and it does not do gapped alignment. Such a definition is useful to downstream analyses.

However, when we consider base quality, the usefulness of uniqueness becomes less obvious. Suppose a read has no perfect match and two 1-mismatch hits. The first hit has a Q5 mismatch and the second has a Q30 mismatch. If the quality is accurate, the first hit is clearly better than the second. Why couldn't we trust the first hit?

In addition, as is pointed out by one of the anonymous reviewers of my BWA paper, an aligner may not be able to find the best hits if heuristics are in use, and in this case, the aligner is only able to find 'unique' reads by its own definition.

Furthermore, once we allow gaps, mapping uniqueness becomes even less useful. Firstly, we need to redefine uniqueness as we have gaps. One possible way is to define a read as being uniquely mapped if the best two alignments have identical number of differences (mismatches plus gaps). The definition is clear, but not useful. We know on Illumina reads indel errors occur rarely. A hit containing one mismatch is definitely preferred over a hit with one gap.

Things get even worse when we clip reads as what we do for capillary reads. We can only define a read being uniquely mapped when the top two alignments have identical alignment score. However, this is almost practically useless at all. For long reads, frequently we get alignments with similar scores, but we seldom get two with identical scores.

Uniqueness was initially introduced to measure the reliability of ungapped short read alignment with a read aligned in full length. It is not a proper concept for generic alignments. For generic alignments, what is much more useful is mapping quality, first introduced in my maq paper. Mapping quality is phred-scaled probability of the alignment being wrong. It unambiguously measures the alignment reliability in a universal way. Calculating mapping quality is related to a proper statistical alignment/error model, and this is the right thing to do. I would strongly recommend all aligners to report mapping quality. Mapping uniqueness was not widely used two years ago and will not be widely used two years later. It is just a temporary concept, reflecting our lack of knowledge on measuring the reliability of an alignment.

 

 

ADD REPLYlink written 4 months ago by tangming200590
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 119 users visited in the last hour