I used Tophat to map some RNA-seq reads from four different samples against a reference genome (Arabidopsis), and then used SummarizeOverlaps to make a table of raw counts. I tried to generate these counts for both genes and transcripts; to extract the gene annotations from my GFF file, I used the following line of code:
exbygene = exonsBy(TxDb_obj, "gene")
and for transcripts:
exbytranscript = exonsBy(TxDb_obj, "tx", use.names=TRUE)
Then I run:
se_gene = summarizeOverlaps(exbygene, files, mode="IntersectionNotEmpty", singleEnd=FALSE, ignore.strand=TRUE)
se_transcript = summarizeOverlaps(exbytranscript, files, mode="IntersectionNotEmpty", singleEnd=FALSE, ignore.strand=TRUE)
write.table(assays(se_gene), "bygene.txt")
write.table(assays(se_transcript), "bytranscript.txt")
As an example of my problem/question, in bygene.txt, I get the following counts for gene ID AT1G01040 (each number represents the count from one of my four samples)
954 723 895 614
In bytranscript.txt, I get the following counts for AT1G01040.1 and AT1G01040.2
AT1G01040.1: 38 19 25 14
AT1G01040.2: 3 2 3 6
My question is, why is there such a discrepancy? Shouldn't the counts for the two transcripts sum to the counts for the gene?
Thanks in advance for any help!
Mike Love is also working on this package, which is relevant: https://github.com/mikelove/tximport
The
findCompatibleOverlaps
function in GenomicAlignments might be helpful in this context.