Search
Question: summarizeOverlaps counting exons and 5UTR reads
0
gravatar for Walter F. Baumann
8 months ago by
Walter F. Baumann 10 wrote:

Hey,

I try to compare counts from 2 RNA seq data. I am interested in the 5'UTR and want to compare this number with the total number of reads per transcript. Here my code:

     txdb is the gtf file from Gencode (v19) "Comprehensive gene annotation"

     txdb_5utr_transcript <- fiveUTRsByTranscript(txdb, use.names=T)
     txdb_exons_transcripts <- exonsBy(txdb, by="tx", use.name=T)

# Count reads per 5UTR in transcript for
## RNAseq exp 1
     x1 <- summarizeOverlaps(features= txdb_5utr_transcript, reads = myrnaseq1,    mode=IntersectionNotEmpty, inter.feature=T) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
     x1 <- assay(x1)
     x1 <- data.frame(Tx=as.character(rownames(x1)), count_rna15UTRt=as.numeric(x1[,1]), row.names = NULL)
## RNAseq exp 2
     x2 <- summarizeOverlaps(features= txdb_5utr_transcript, reads = myrnaseq2,  mode=IntersectionNotEmpty, inter.feature=T) # uses same counting algorithm as in HTSeq
     x2 <- assay(x2)
     x2 <- data.frame(Tx=as.character(rownames(x2)), count_rna25UTRt=as.numeric(x2[,1]), row.names = NULL)


# Count reads per exons in transcript for
## RNAseq exp 1
     y1 <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq1, mode=IntersectionNotEmpty, inter.feature=T) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
     y1 <- assay(y1)
     y1 <- data.frame(Tx=as.character(rownames(y1)), count_rna1t=as.numeric(y1[,1]), row.names = NULL)
## RNAseq exp 2
     y2 <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq2, mode=IntersectionNotEmpty, inter.feature=T) # uses same counting algorithm as in HTSeq
     y2 <- assay(y2)
     y2 <- data.frame(Tx=as.character(rownames(y2)), count_rna2t=as.numeric(y2[,1]), row.names = NULL)

# Merge tables
     z <- merge(x1,x2, by="Tx")
     z <- merge(z,y1, by="Tx")
     z <- merge(z,y2, by="Tx")

# Sum up counts of both reads and calculate difference (this is due to my research question)
     z[,"sum1"] <- apply(z[,2:3], 1, function(i) i[1]+i[2])
     z[,"sum2"] <- apply(z[,4:5], 1, function(i) i[1]+i[2])
     z[,"diff"] <- apply(z[,6:7], 1, function(i) i[2]-i[1])


However, there are multiple cases in which the difference is negative, which means that there are more reads in the 5'UTR than in the whole transcript, which does not make sense.

I double-checked my txdb_5utr_transcript and my txdb_exons_transcripts object: txdb_5utr_transcript is a subset of txdb_exons_transcripts, so this is not the problem. Any ideas?

ADD COMMENTlink modified 8 months ago by Hervé Pagès ♦♦ 13k • written 8 months ago by Walter F. Baumann 10
1
gravatar for Hervé Pagès
8 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

It looks like it's possible:

## 2 transcripts tx1 (2 exons) and tx2 (1 exon):
tx <- GRangesList(tx1=GRanges(c("chr1:1-10", "chr1:21-30")),
                  tx2=GRanges("chr1:1-30"))

## The 5' UTRs for transcripts tx1 and tx2:
utr <- GRangesList(tx1=GRanges("chr1:1-8"),
                   tx2=GRanges("chr1:1-4"))

## A single read:
reads <- GRanges("chr1:6-15")

## Count reads per transcript:
assay(summarizeOverlaps(features=tx, reads=reads,
                        mode=IntersectionNotEmpty,
                        inter.feature=T))
#     reads
# tx1     0
# tx2     1

## Count reads per 5' UTR:
assay(summarizeOverlaps(features=utr, reads=reads,
                        mode=IntersectionNotEmpty,
                        inter.feature=T))
#     reads
# tx1     1
# tx2     0

The exact details of how the IntersectionNotEmpty mode works are explained in ?summarizeOverlaps. The key is that "regions that are shared between features are discarded". This means that when counting reads per transcript, the chr1:1-10 and chr1:21-30 regions are removed from tx1 and tx2 so tx1 is replaced by an empty region and tx2 by region chr1:11-20. This explains the counts: 0 on tx1 and 1 on tx2. Now when counting reads per 5' UTR, the chr1:1-4 region is removed from tx1 and tx2 so tx1 is replaced by region chr1:5-8 and tx2 by an empty region. This explains the counts: 1 on tx1 and 0 on tx2.

Hope this helps,

H.

 

ADD COMMENTlink modified 8 months ago • written 8 months ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 215 users visited in the last hour