Question

Rsubread, featureCounts misses some exons

0

Entering edit mode

tonja.r ▴ 80

@tonjar-7565

Last seen 8.4 years ago

United Kingdom

I extracted exon information from mm9 and then used it for featureCount. Next I used an in-built mm9 annotation of featureCount and I got less exons annotated somehow.

Extraction of exon information:

mm9 = TxDb.Mmusculus.UCSC.mm9.knownGene
exon = exons(mm9)
head(exon)
length(exon)
exon_ranges = ranges(exon)
gene_id_exons = select(mm9, keys=as.character(exon$exon_id), columns = c("GENEID","TXNAME"), keytype = "EXONID")
colnames(gene_id_exons) = c("EXONID","ENTREZID","TXNAME")
symbol <- select(org.Mm.eg.db, keys=as.character(unique(gene_id_exons$ENTREZID)), keytype="ENTREZID",

                 columns="SYMBOL")
gene_id_exons = merge(gene_id_exons,symbol,all.x=T)
exon_info =  data.frame(START = start(exon_ranges), END = end(exon_ranges), CHR = seqnames(exon), STRAND = strand(exon),EXONID = exon$exon_id)
exon_info = merge(exon_info,gene_id_exons,all.x=T)
exon_info_red =  data.frame(GeneID =exon_info$ENTREZID,  Chr = exon_info$CHR, Start = exon_info$START, End = exon_info$END, Strand = exon_info$STRAND)

It shows me 7 exons corresponding to two different transcript of one gene 497097:

> subset(exon_info, ENTREZID == 497097)
      EXONID   START     END  CHR STRAND ENTREZID     TXNAME SYMBOL
14642   7584 3195985 3197398 chr1      -   497097 uc007aet.1   Xkr4
14643   7585 3203520 3205713 chr1      -   497097 uc007aet.1   Xkr4
14644   7586 3204563 3207049 chr1      -   497097 uc007aeu.1   Xkr4
14645   7587 3411783 3411982 chr1      -   497097 uc007aeu.1   Xkr4
14646   7588 3638392 3640590 chr1      -   497097 uc007aev.1   Xkr4
14647   7589 3648928 3648985 chr1      -   497097 uc007aev.1   Xkr4
14648   7590 3660633 3661579 chr1      -   497097 uc007aeu.1   Xkr4

FeatureCount with the manual exon annotation:

manual=featureCounts(c(bam.CTCF,bam.H3K27ac),annot.ext=exon_info_red,isGTFAnnotationFile=FALSE,

                GTF.featureType="exon",GTF.attrType="gene_id",useMetaFeatures=FALSE,
                allowMultiOverlap=TRUE,isPairedEnd=FALSE,requireBothEndsMapped=FALSE,
                checkFragLength=FALSE,minFragLength=50,maxFragLength=600,
                nthreads=1,strandSpecific=0,minMQS=0,
                readExtension5=0,readExtension3=0,read2pos=NULL,
                minReadOverlap=1,countSplitAlignmentsOnly=FALSE,
                countMultiMappingReads=FALSE,countPrimaryAlignmentsOnly=FALSE,
                countChimericFragments=TRUE,ignoreDup=FALSE,chrAliases=NULL,reportReads=FALSE)
manual_counts=manual$counts

FeatureCount with in-built mm9 version:

inbuilt=featureCounts(c(bam.CTCF,bam.H3K27ac),annot.inbuilt="mm9",annot.ext=NULL,isGTFAnnotationFile=FALSE,

                GTF.featureType="exon",GTF.attrType="tx_id",useMetaFeatures=FALSE,
                allowMultiOverlap=TRUE,isPairedEnd=FALSE,requireBothEndsMapped=FALSE,
                checkFragLength=FALSE,minFragLength=50,maxFragLength=600,
                nthreads=1,strandSpecific=0,minMQS=0,
                readExtension5=0,readExtension3=0,read2pos=NULL,
                minReadOverlap=1,countSplitAlignmentsOnly=FALSE,
                countMultiMappingReads=FALSE,countPrimaryAlignmentsOnly=FALSE,
                countChimericFragments=TRUE,ignoreDup=FALSE,chrAliases=NULL,reportReads=FALSE)
inbuilt_counts=inbuilt$counts

When I compare the results I find out that the number of exons annotated in inbuilt version equals 3 and if I use my own annotation, all exons are annotated. What is the reason for such a behavior?

Results

> manual_counts[which(row.names(manual_counts)=="497097"),1:3]
       CTCF_WNN1_828 CTCF_WNN3_8278 CTCF_WEN1_8282
497097             0              0              8
497097             0              3              5
497097            18             21             42
497097             5              0              4
497097            20              0              6
497097             1              0              0
497097           110            340            134

> inbuilt_counts[which(row.names(inbuilt_counts)=="497097"),1:3]
       CTCF_WNN1_828 CTCF_WNN3_8278 CTCF_WEN1_8282
497097            18             21             42
497097             5              0              4
497097           110            340            134

PS:
I have found out that only exons from a transcript uc007aeu.1 are annotated in the in-built version. Why other exons are skipped? I also posted a question regarding those three transcripts in dataset TxDb.Mmusculus.UCSC.mm9.knownGene shows other gene_id then genome browser.

annotation featurecounts rsubread • 2.2k views

ADD COMMENT • link updated 9.6 years ago by Gordon Smyth 52k • written 9.6 years ago by tonja.r ▴ 80

score 0 · Answer 1 · 2015-09-14

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

The in-built Rsubread annotation is NCBI RefSeq annotation. According to RefSeq, the Xkr4 gene has only one transcript with only three exons.

The Bioconductor transcript package seems have included some speculative genes and exons from UCSC. If you visit the UCSC browser, however, you will only see one transcript for Xkr4 with the same three exons as given byRsubread.

ADD COMMENT • link 9.6 years ago Gordon Smyth 52k

0

Entering edit mode

How can I get rid of those strange speculative genes and exons from UCSC? Because otherwise they affect massively my downstream analysis. Is there Bioconductor package for RefSeq annotation with promoters, exons and genes?

ADD REPLY • link 9.6 years ago tonja.r ▴ 80

0

Entering edit mode

Hi,

No Bioconductor package for the RefSeq track but it's easy to make your own TxDb object for that:

library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("mm9", "refGene")

See ?makeTxDbFromUCSC for more information.

H.

ADD REPLY • link 9.6 years ago Hervé Pagès 16k

0

Entering edit mode

The in-built annotation included in Rsubread is built based on the RefSeq annotation. Overlapping exons from the same gene are merged into one exon in the in-built annotation. It looks like you are doing exon-level analysis since you summarized read to exons. The in-built annotation will give you more exons with larger counts to work with, as compared to the counts you get from using the original RefSeq annotation.

If you work on gene-level counts (counts for genes not for exons), the in-built annotation give you exactly the same counts as you will get from the original RefSeq annotation.

ADD REPLY • link 9.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Just to clarify, transcripts in the knownGene table are linked to their Entrez Gene ids via the knownToLocusLink table. This is what this table contains for gene 497097 (Xkr4):

knownToLocusLink_url <- "http://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/knownToLocusLink.txt.gz"
knownToLocusLink_local_file <- tempfile()
download.file(knownToLocusLink_url, destfile=knownToLocusLink_local_file)
knownToLocusLink <- read.table(knownToLocusLink_local_file, col.names=c("tx_name", "gene_id"), stringsAsFactors=FALSE)
subset(knownToLocusLink, gene_id == "497097")
#          tx_name gene_id
# 46846 uc007aet.1  497097
# 46847 uc007aeu.1  497097
# 46848 uc007aev.1  497097

This explains why you see 3 transcripts instead of 1 for Xkr4 in TxDb.Mmusculus.UCSC.mm9.knownGene. I'm not sure why the UCSC Genome Browser only displays 1 transcript though...

H.

ADD REPLY • link 9.6 years ago Hervé Pagès 16k