Question

GenomicAlignments: counting transcripts versus counting genes

brt381 • 0

I used TopHat to map RNA-seq reads from four samples against a reference genome (Arabidopsis), and then used summarizeOverlaps to make a table of raw counts. I tried to generate these counts for both genes and transcripts. To extract the gene annotations from my GFF file, I used the following line of code:

exbygene = exonsBy(TxDb_obj, "gene")

and for transcripts:

exbytranscript = exonsBy(TxDb_obj, "tx", use.names=TRUE)

Then I ran:

se_gene = summarizeOverlaps(exbygene, files, mode="IntersectionNotEmpty", singleEnd=FALSE, ignore.strand=TRUE)

se_transcript = summarizeOverlaps(exbytranscript, files, mode="IntersectionNotEmpty", singleEnd=FALSE, ignore.strand=TRUE)

write.table(assay(se_gene), "bygene.txt")

write.table(assay(se_transcript), "bytranscript.txt")

As an example of my problem, in bygene.txt I get the following counts for gene ID AT1G01040 (each number is the count from one of my four samples):

954 723 895 614

In bytranscript.txt, I get the following counts for AT1G01040.1 and AT1G01040.2:

AT1G01040.1: 38 19 25 14

AT1G01040.2: 3 2 3 6

My question is, why is there such a discrepancy? Shouldn't the counts for the two transcripts sum to the counts for the gene?

Thanks in advance for any help!

genomicalignments • 2.6k views

ADD COMMENT • link updated 10.2 years ago by Thomas Girke ★ 1.7k • written 10.2 years ago by brt381 • 0

score 0 · Answer 1 · 2015-12-08

James W. MacDonald 68k

Almost certainly not. See page 3 of the read counting vignette. Any read that overlaps two or more features (here, transcripts that share exons) will be ignored. If you want to do things at the transcript level, you will probably be better off using kallisto or salmon. There is a readKallisto() function in the devel version of SummarizedExperiment if you want to be all cutting edge and stuff. Or you can use sleuth or Rob Patro's fork (for salmon) to analyze your data, which I guess is cutting edge as well.
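A toy illustration of why the transcript counts need not sum to the gene count. Everything here is made up for illustration (the GENE1 names, the coordinates, and the four reads), and it passes the reads as a plain GRanges instead of BAM files, so treat it as a sketch of the behaviour rather than a reproduction of the poster's data.

```r
library(GenomicAlignments)

# Hypothetical annotation: two transcripts of one gene sharing their first exon.
ex_by_tx <- GRangesList(
  GENE1.1 = GRanges("chr1", IRanges(start = c(100, 300), end = c(200, 400))),
  GENE1.2 = GRanges("chr1", IRanges(start = c(100, 500), end = c(200, 600)))
)

# Gene-level feature: the merged exons of both transcripts.
ex_by_gene <- GRangesList(GENE1 = reduce(unlist(ex_by_tx, use.names = FALSE)))

# Hypothetical reads: two inside the shared exon, one in each
# transcript-specific exon.
reads <- GRanges("chr1", IRanges(start = c(120, 150, 320, 520), width = 50))

# Same mode as in the question; inter.feature is left at its default (TRUE).
se_gene <- summarizeOverlaps(ex_by_gene, reads, mode = "IntersectionNotEmpty",
                             ignore.strand = TRUE)
se_tx <- summarizeOverlaps(ex_by_tx, reads, mode = "IntersectionNotEmpty",
                           ignore.strand = TRUE)

gene_counts <- assay(se_gene)[, 1]
tx_counts <- assay(se_tx)[, 1]
print(gene_counts)
print(tx_counts)
```

Under these assumptions all four reads are assigned to GENE1, but the two reads in the shared exon cannot be attributed to a single transcript, so only the two reads in transcript-specific exons show up in the transcript-level table.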

ADD COMMENT • link 10.2 years ago James W. MacDonald 68k

Mike Love is also working on this package, which is relevant: https://github.com/mikelove/tximport

ADD REPLY • link 10.2 years ago Steve Lianoglou ★ 13k

The findCompatibleOverlaps function in GenomicAlignments might be helpful in this context.

ADD REPLY • link 10.2 years ago Michael Lawrence ★ 11k

score 0 · Answer 2 · 2015-12-08

If you want to obtain valid exon-level counts for transcripts with summarizeOverlaps(), then you most likely want to set inter.feature=FALSE. The default is TRUE, meaning that reads mapping to exon ranges shared by more than one transcript are ignored in the read counts. Shared ranges are far more common in your exbytranscript range set than in exbygene.
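A minimal sketch of the effect of inter.feature, using made-up transcripts (GENE1.1 and GENE1.2) and made-up reads passed as a GRanges. It uses mode="Union" rather than the mode from the question, which is an assumption chosen to keep the comparison simple.

```r
library(GenomicAlignments)

# Hypothetical annotation: two transcripts sharing their first exon.
ex_by_tx <- GRangesList(
  GENE1.1 = GRanges("chr1", IRanges(start = c(100, 300), end = c(200, 400))),
  GENE1.2 = GRanges("chr1", IRanges(start = c(100, 500), end = c(200, 600)))
)

# Hypothetical reads: two inside the shared exon, one in each
# transcript-specific exon.
reads <- GRanges("chr1", IRanges(start = c(120, 150, 320, 520), width = 50))

# Default behaviour: reads hitting more than one transcript are dropped.
se_default <- summarizeOverlaps(ex_by_tx, reads, mode = "Union",
                                ignore.strand = TRUE, inter.feature = TRUE)

# inter.feature = FALSE: such reads are counted once for every transcript
# they overlap.
se_shared <- summarizeOverlaps(ex_by_tx, reads, mode = "Union",
                               ignore.strand = TRUE, inter.feature = FALSE)

counts_default <- assay(se_default)[, 1]
counts_shared <- assay(se_shared)[, 1]
print(counts_default)
print(counts_shared)
```

With inter.feature=FALSE a read in a shared exon is counted for each transcript containing that exon, so the transcript counts can add up to more than the gene count. As far as the summarizeOverlaps documentation describes it, mode="IntersectionNotEmpty" still removes shared regions of the annotation before counting even when inter.feature=FALSE, so the choice of mode matters here as well.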