Search
Question: featureCounts - paired-end data
0
gravatar for samuel collombet
3.4 years ago by
France
samuel collombet120 wrote:

Hi,

In single-end mode, does feature counts use the the template length (col9 in sam file) to check for an overlap with the feature, or the read length?

I have paired-end data with no unmated reads, I allow multimatchs, so I would have assume that feature count use the template length, my count in in single-end mode should be exactly twice than my count in paired-end, but it does not seems to be the case...

I would like to use the single end mode because my bam is quite big, and already sorted by coordinates, and I takes much more time to ask featureCounts to re-order the reads to use paire end mode (and I have a lot of bam to count, onto various features, so it is taking too long, and I cannot keep the bam sorted by names on my disk because of storage problems...).

ADD COMMENTlink modified 3.4 years ago by Gordon Smyth35k • written 3.4 years ago by samuel collombet120

Have you read the featureCounts webpage or paper? They both seem to answer your questions.

ADD REPLYlink written 3.4 years ago by James W. MacDonald47k

Hi James,

I did read it but was not sure to understand ; I re-read it after your message, does:

"A read is said to overlap a feature if at least one read base is found to overlap the feature. For paired-end data, a fragment (or template) is said to overlap a feature if any of the two reads from that fragment is found to overlap the feature" (webpage)

and

"featureCounts preforms precise read assignment by comparing mapping location of every base in the read or fragment with the genomic region spanned by each feature" (paper)

mean that it take into account the read himself, and not the fragment length? (now that you tell me the answer is there I guess it is so, but I was not sure before...sorry)

ADD REPLYlink written 3.4 years ago by samuel collombet120
1

Hi Samuel,

It's just a logic problem, and I think you are over thinking it. Let's say you have paired end data, and it looks like this:

XXXXXXXXXXXXXX              XXXXXXXXXXXX  <- these are your pairs

          *******************                           <- this is an exon

That read pair overlaps the exon of that gene, so if you are using them as paired ends, you get a single count. However, if you say they are single ends, you still get one count. Now consider

XXXXXXXXXXXXXXX             XXXXXXXXXXXXXXXXXX <- paired reads

   ***********************                     *****************  <- exons from same gene

In this case you get one count from paired-end mode, and two counts from single end mode.

XXXXXXXXXXXXXXXX              XXXXXXXXXXXXXX  <- paired reads

*************                      +++++++++++++    <- exons from two different genes

In this situation, counting in paired end mode gives you zero counts, but in single end mode gives you TWO counts, one for each gene.

XXXXXXXXXXXX                          XXXXXXXXXXXXXX <- paired reads

                          *************                                   <- exon

What do we get here? Zero counts! Regardless of how you count, because the reads don't overlap the exon. You could argue that this should count as one count for this gene (and it would if you aligned against the entire extent of each gene, from 5' UTR to 3' UTR, rather than the exonic regions).

You would be surprised to see how many more counts you get if you summarize based on the overlap within the gene body, rather than just the exons - there is apparently lots of nascent mRNA floating about, that hasn't had the introns excised yet. The same is true for the Affy arrays that have intronic controls that are supposed to measure background - these controls have surprisingly high binding. But I'm getting off on a tangent here...

ADD REPLYlink written 3.4 years ago by James W. MacDonald47k

Hi James,

I do see your point and I do see why my question does not make sense for you :) I should have mention that I am not working with RNAseq data (ie counting reads on exons of genes) but with ChIPseq data, counting fragments of IP on enriched regions. That's why I am more interested in the fragment as one entity and not as two read mates paired together. Maybe I should rather transform my reads data as single fragments data and count those, that would solve the problem...

ADD REPLYlink written 3.4 years ago by samuel collombet120
1

If you are doing ChIP-Seq, then I doubt you want to be using featureCounts anyway. Instead you should probably be using the csaw package.

ADD REPLYlink written 3.4 years ago by James W. MacDonald47k

Ho Thanks a lot, I did not know about this package!! I was using DESeq with counts on peaks (with an home made variant of the pipeline developed in the DiffBind package), but this look actually much better than my tinkering!

ADD REPLYlink written 3.4 years ago by samuel collombet120
0
gravatar for Wei Shi
3.4 years ago by
Wei Shi2.9k
Australia
Wei Shi2.9k wrote:

For paired end reads, you should count read pairs (fragments) rather than reads because counting fragments will give you more accurate counts.

There are several reasons why you cannot get the fragment counts by simply dividing the counts you got from counting reads by two. One reason is that a fragment with two mapped reads will give you two counts when you count reads, but a fragment with only one mapped read will only contribute one count (this fragment should get 1 count in fragment counting but it ended up with 0.5 count when you count reads instead of fragments).  Another reason is that some reads may be found to be overlapping with more than one gene and therefore were not counted, but the corresponding fragments may be counted because the ambiguity was solved by longer fragment length.

 

 

ADD COMMENTlink modified 3.4 years ago by Gordon Smyth35k • written 3.4 years ago by Wei Shi2.9k

Thanks Wei Shi, However I still would like to know what is the region considered to be covered by a read/fragment :)

Concerning you answer, as I mention I do not have un-mated pairs (I use STAR for mapping, which consider the pair as one sequence to map, so he does not give un-mated reads from paired end).
For your second point, although I agree that taking into account the pair will change the assignment of the reads, I do not see how your example is possible: a single mate reads is not assign because overlapping 2 features, but the pair is there is no more ambiguity? Is that possible? I see how the opposite is possible (single reads is unambiguously assign to one feature, but the pair is not because the 2 mates overlap 2 different features), but not this scenario.
Anyway if the program consider the fragment length in the SAM 9th column, paired end or not does not change anything as I have only mated pairs, and I am not taking the strand into account, so there are always 2 fragment overlaping the exact same region :)

ADD REPLYlink written 3.4 years ago by samuel collombet120

Firstly, it sounds strange to me that you used STAR to map your chip-seq data since STAR is designed for mapping RNA-seq data.

Un-mated pairs are those pairs which have one or both ends unmapped or have both ends mapped but mapped to a distance much greater than the nominal fragment length (or mapped to different chromosomes). Are you sure you did not have any un-mated read pairs in your mapping results? I have never seen such mapping results.

A problem with using fragment length to assign fragments to features is that the reported fragment lengths in the mapping results could be very big, meaning that they may span more than one feature. As a result of that, the fragments will be incorrectly assigned to all of their overlapping features or not assigned at all (depending on the setting).

For each read pair, featureCounts checks if each of the two reads in the pair can be unambiguously assigned. As long as one read is found to be able to be unambiguously assigned, featureCounts will take the whole read pair as a unambiguously assigned fragment. This is the reason why read pairs are more likely to be assigned unambiguously than reads in featureCounts.

What you really want to have might be to extend the reads to their full fragment length before assigning them to features. This is a better way to take into account fragment length in your read assignment. The options "readExtension5" and "readExtension3" in featureCounts can help you to do this. But the csaw package is certainly a good alternative.

 

ADD REPLYlink written 3.4 years ago by Wei Shi2.9k

For the use of STAR for ChIPseq (or genomic DNA in general), I think I am not the only one who tried (see https://groups.google.com/forum/#!topic/rna-star/E_mKqm9jDm0). I compared STAR mapping to bowtie2 and got quite similar results (in therm of mapping efficiency, signal  correlation and enriched regions). So i use it as it is much faster. But if you have some examples of bad results, it would be great to have your feedback!

for the readExtension5/3, you have to give a fixed length, right? My problem is that I would like to use the fragment length, wich is different for each read... 
At the end in this case, it is just easier (and probably better) to use csaw, but thanks for all the clarifications :)

 

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by samuel collombet120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 316 users visited in the last hour