Question

strange results with featureCounts

inah ▴ 10

@inah-13176

Hi,

I have human total RNA-seq data (paired-end, NextSeq) from a pilot study with two samples, and I am getting results from featureCounts that do not make sense to me. I process the data as follows: I perform adapter trimming with ea-utils (fastq-mcf) and mild quality trimming with btrim, then align to the genome with STAR. STAR reports 99,018,190 input reads (these are read pairs) for the first sample and 126,164,150 for the 2nd sample. For the first sample, 86,571,963 reads were aligned (70,579,007 uniquely), and for the 2nd sample, 112,525,928 reads were aligned (89,730,382 uniquely). When I run featureCounts on these data, it prints the following:

First sample: Total fragments: 123128857, Successfully assigned fragments : 63971619 (52.0%)

2nd sample: Total fragments: 168328570, Successfully assigned fragments : 83223103 (49.4%)

and this warning:

WARNING: reads from the same pair were found not adjacent to each other in the input (due to read sorting by location or reporting of multi-mapping read pairs).

It seems that the total fragments from featureCounts do not match up at all with the read counts from STAR.

Thanks, Ina

RNA-seq featureCounts total RNA • 8.0k views

ADD COMMENT • link updated 3.1 years ago by snijesh ▴ 200 • written 7.5 years ago by inah ▴ 10

Is your data strand-specific paired-end, and what value did you pass to the -s parameter?

ADD REPLY • link 3.1 years ago snijesh ▴ 200

score 3 · Answer 1 · 2018-05-15

Wei Shi ★ 3.6k

@wei-shi-2183

Australia/Melbourne

featureCounts reports the number of alignments, whereas STAR reports the number of reads (read pairs in this case, since the data are paired-end). A mapped read may be reported as one or more alignments: a uniquely mapped read yields a single alignment, but a multi-mapping read yields more than one. So the difference you observed is caused by the reporting and counting of multi-mapping reads.

If you instruct STAR to output uniquely mapped reads only, then featureCounts will report the same total count. When STAR is allowed to output multi-mapping reads, the total count from featureCounts is always higher because it reports the number of alignments rather than number of reads.
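As a rough check of this explanation against the numbers in the question, the arithmetic can be sketched as below. This is only a sketch: it assumes the BAM files contain mapped alignments only (no unmapped records), so that the featureCounts total equals unique alignments plus multi-mapper alignments. The function name `multimapper_summary` is made up for illustration.

```python
# Numbers copied from the question. STAR counts are read pairs;
# featureCounts totals are alignments (fragments).
samples = {
    "sample1": {"star_mapped": 86_571_963, "star_unique": 70_579_007, "fc_total": 123_128_857},
    "sample2": {"star_mapped": 112_525_928, "star_unique": 89_730_382, "fc_total": 168_328_570},
}


def multimapper_summary(star_mapped, star_unique, fc_total):
    """Split the featureCounts total into unique and multi-mapper parts.

    Assumption: the BAM holds mapped alignments only, so every uniquely
    mapped pair contributes exactly one alignment to fc_total.
    """
    multi_pairs = star_mapped - star_unique      # pairs mapped to >1 location
    multi_alignments = fc_total - star_unique    # alignments those pairs produced
    return multi_pairs, multi_alignments, multi_alignments / multi_pairs


for name, s in samples.items():
    pairs, alns, mean = multimapper_summary(**s)
    print(f"{name}: {pairs:,} multi-mapping pairs -> {alns:,} alignments "
          f"({mean:.2f} per pair)")
```

Under that assumption, both samples come out at a little over three alignments per multi-mapping pair, which is consistent with the extra fragments being multi-mapper alignments.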

ADD COMMENT • link 7.5 years ago Wei Shi ★ 3.6k

Thank you very much for the quick response, Wei. I have one other question: the percentage of successfully assigned fragments is 52.0% and 49.4% in these two total RNA samples. Is this unusually low?

Thanks again, Ina

ADD REPLY • link 7.5 years ago Ina Hoeschele ▴ 620

The assignment percentage is typically around 50 - 70 percent. So your percentages are a bit low but not unusual. The percentage tends to be lower when multi-mapping reads are included.

ADD REPLY • link 7.5 years ago Wei Shi ★ 3.6k

Something must still be going wrong with my analysis. I have compared the numbers of protein-coding genes present (here meaning present in two samples with counts of at least 5) between mRNA-seq and total RNA-seq data. The mRNA-seq data had library sizes of around 24 million, while the total RNA-seq data has library sizes of around 100 million. I get fewer protein-coding genes for the total RNA data than for the mRNA data (about 13K versus 15K). This is not possible.

There is one thing with featureCounts that I would still like to check. featureCounts tells me that 63,971,619 fragments were successfully assigned for sample 1 and 83,223,103 for sample 2. When I take the column sums of the count matrix, the library sizes of the two samples are 72,946,895 and 94,373,791. How do these two sets of numbers relate to each other?

Thanks, Ina

ADD REPLY • link 7.5 years ago inah ▴ 10

If multi-overlapping alignments (alignments overlapping more than one gene) are included in the counting, then the column sums of the count matrix can be greater than the total number of alignments assigned by featureCounts, because a multi-overlapping alignment gives rise to more than one count in the count matrix. What is your featureCounts command?
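A toy illustration of this effect, not featureCounts itself: the alignments and gene names below are made up, and the counting rule is a simplified stand-in for what happens when multi-overlapping alignments are counted (for example with the featureCounts `-O` option).

```python
from collections import Counter

# Hypothetical toy data: each entry is the set of genes one alignment overlaps.
alignments = [
    {"geneA"},
    {"geneA", "geneB"},   # multi-overlapping alignment
    {"geneB"},
    set(),                # overlaps no gene, so it is not assigned
    {"geneB", "geneC"},   # multi-overlapping alignment
]


def count_with_multi_overlap(alignments):
    """Count every overlapped gene once per alignment (simplified model)."""
    counts = Counter()
    assigned = 0
    for genes in alignments:
        if not genes:
            continue          # unassigned: no overlapping feature
        assigned += 1         # the alignment is assigned exactly once ...
        counts.update(genes)  # ... but adds one count to each gene it overlaps
    return assigned, counts


assigned, counts = count_with_multi_overlap(alignments)
print("assigned alignments:", assigned)
print("sum of gene counts: ", sum(counts.values()))
print(dict(counts))
```

Here four alignments are assigned but the gene counts sum to six, because each of the two multi-overlapping alignments is counted for two genes. The same mechanism would explain column sums that exceed the "successfully assigned" total.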

ADD REPLY • link 7.5 years ago Wei Shi ★ 3.6k