duplicate reads in mRNA-Seq
jason0701 ▴ 190
@jason0701-3921
Last seen 5.0 years ago
Hi, Some of you may have answers for this. It seems that duplicate reads are very common in mRNA-seq data. Duplicate reads are those mapped to exactly the same chromosome location and on the same strand (maybe from PCR amplification). I would like to know what the general practice is for dealing with them. I suspect some of them may contribute to the large overdispersion in the final count data. Thanks, Jason
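To make the definition in the question concrete, here is a minimal Python sketch that treats two reads as duplicates when they share chromosome, start position, and strand, and reports the fraction of reads that are extra copies. The function name `duplicate_fraction` and the toy `reads` list are hypothetical, chosen only for illustration; real data would come from an alignment file, and paired-end data would normally also use the mate position.

```python
from collections import Counter

def duplicate_fraction(alignments):
    """Fraction of reads that are extra copies of an already seen alignment.

    alignments: iterable of (chrom, start, strand) tuples.
    Assumption: single-end reads, so the key is just these three fields.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first at a given key counts as a duplicate.
    extras = total - len(counts)
    return extras / total

# Toy input (made up): three reads share chr1:100 on the plus strand.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "-"),  # same position, opposite strand: not a duplicate
    ("chr2", 55, "+"),
    ("chr1", 100, "+"),
]
print(duplicate_fraction(reads))
```

Note that this only counts duplicates; whether to remove them is the actual question of the thread.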
Simon Anders ★ 3.8k
@simon-anders-3855
Last seen 4.3 years ago
Zentrum für Molekularbiologie, Universi…
Hi Jason

> It seems that duplicate reads are very common in mRNA-seq data.
> Duplicate reads are those mapped to exactly the same chromosome
> location and on the same strand (maybe from PCR amplification). I
> would like to know what the general practice is for dealing with them.
> I suspect some of them may contribute to the large overdispersion in
> the final count data.

I know it is sometimes recommended to remove them, but I'd advise against this.

One of the advantages of RNA-Seq over expression microarrays is the large gain in dynamic range. On arrays, lowly expressed genes drown in background fluorescence and highly expressed genes saturate the hybridisation, giving you a dynamic range of typically little more than 25 dB (i.e., ratios of at most about 1:300).

In RNA-Seq, very weak genes give rise to fewer than 10 counts while the strongest genes may give well above 100,000 counts, i.e., the usable dynamic range now easily exceeds 45 dB or 50 dB (1:100,000).

Now, imagine you counted reads mapping to the same position at most once. Then a transcript of, say, 1 kb could accumulate at most 1,000 counts, even if it were one of those strongly expressed ones with a 5-figure raw count. Hence, you would dramatically squash your dynamic range and lose all hope of linearity (i.e., you could no longer expect the count rate to be even roughly proportional to the concentration).

Of course, if there are PCR artifacts, they destroy the linearity as well. So, if you have an exon to which only very few reads map, except for one specific position that shows a pile of hundreds of reads, all with precisely the same coordinates, then that is reason for concern. I have seen such "towers" only rarely in RNA-Seq. (Actually, I haven't seen them at all recently, but I think they were a common concern two years ago. I wonder where they went. Did they maybe improve the PCR steps of the library preparation protocols?)

Simon
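The arithmetic in this answer can be checked with a small Python sketch. It converts ratios to dB, estimates how many reads survive when duplicates are collapsed on a 1 kb transcript, and measures how much of an exon's signal sits in a single pile. The saturation formula assumes reads start uniformly at random along the transcript on one strand, which is a simplification not made in the post; all function names and the toy exon are hypothetical.

```python
import math

def decibels(ratio):
    """Express a ratio in dB as 10 * log10(ratio), the convention used above."""
    return 10.0 * math.log10(ratio)

def expected_distinct_starts(n_reads, n_positions):
    """Expected number of distinct start positions hit by n_reads.

    Assumption (not from the post): starts are uniform and independent
    over n_positions, single strand, single-end reads.
    """
    return n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)

def tower_share(start_counts):
    """Share of an exon's reads at its single most crowded start position.

    start_counts: dict mapping start position -> number of reads.
    """
    total = sum(start_counts.values())
    return max(start_counts.values()) / total if total else 0.0

print(f"1:300 is {decibels(300):.1f} dB, 1:100,000 is {decibels(100_000):.1f} dB")

transcript_length = 1000  # the 1 kb transcript from the example above
for true_count in (10, 100, 1_000, 10_000, 100_000):
    kept = expected_distinct_starts(true_count, transcript_length)
    print(f"{true_count:>7} reads -> about {kept:6.1f} after collapsing duplicates")

# Made-up exon: a few scattered reads plus one pile of 300 at position 240.
exon = {120: 2, 180: 1, 240: 300, 310: 3}
print(f"largest pile holds {tower_share(exon):.0%} of the exon's reads")
```

Under these assumptions the collapsed count flattens out near the transcript length, which is the loss of linearity described above. A high `tower_share` on an otherwise sparsely covered exon corresponds to the "tower" pattern; no particular cutoff is implied.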
Thanks Simon for the insightful comments. I think you are right on this. From an empirical comparison I just did between the RNA-Seq and quantitative-PCR data, the unfiltered data seem to give better concordance with the PCR data (based on fold change). Thanks again, Jason
Hi Dr. Anders and Dr. Jason, May I ask what frequency of duplicates you have had in your data? I have had ~0.6 duplicates in my final aligned and filtered (unique match and number of mismatches) dataset. So far I have run my analysis without them. Thanks, Fernando