Entering edit mode
Mark Robinson
▴
880
@mark-robinson-4908
Last seen 6.1 years ago
Hi all,
Sorry for cross-posting, but I figured this might be of interest to
both developers and users.
My student (Xiaobei Zhou) and I have been playing around with the
"simple" task of counting RNA-seq reads in Bioconductor and we've
collected a few observations. My apologies in advance if I've made
errors in calling the software or if I've omitted other methods to
try. Please correct us if so.
Here is a (hopefully) concise summary of what we've found:
1. There are now various options in Bioconductor to go from (mapped)
RNA-seq reads to counts at the gene, transcript, exon level. The main
players, as we understand it, are: easyRNASeq, summarizeOverlaps() in
GenomicRanges and featureCounts() in Rsubread. Outside of
Bioconductor, we have used htseq-count from Simon Anders and treat
this as the standard, since it's fairly mature. For reference, we've
collected code segments for all methods below, in case anyone wants to
repeat this, or needs this for their data (of course, as a resource,
the mailing list is not the best place to capture this).
2. For single-end reads, the methods are quite similar, although
differences exist. I attribute this to differences in how reads are
matched to features (e.g. overhanging reads) and/or filtering.
Nailing down the exact details are a bit hard to figure out, but I
expect these to be minor overall. Here is an example:
http://imlspenticton.uzh.ch/se.png
3. For paired-end reads, there is (or at least can be) a distinct
divergence in read counts: http://imlspenticton.uzh.ch/pe.png ... at
present, easyRNASeq and featureCounts()/Rsubread "overcount" reads and
summarizeOverlap() "undercounts", relative to htseq-count.
Wei Shi (Rsubread author) has recently mentioned that this is the way
he likes to count:
https://stat.ethz.ch/pipermail/bioconductor/2012-June/046440.html
I've spoken with Nico Delhomme (easyRNASeq author) offline and
accurately treating mate reads is on his TODO list.
The undercounting in summarizeOverlaps(), as far as we can tell, is
due to filtering singleton reads (one read of the mate is mapped, the
other is not) and not counting them.
4. I have some results on memory usage and CPU time. The packages
available have different strategies to manage the tradeoff, so they
vary widely.
A couple comments / suggestions:
1. From a user perspective, it would be optimal to have all options
available: count singletons or not, overhang or no overhang, counting
a pair as a single fragment, etc.
2. Another level of flexibility that would be useful is counting on
junctions. I think the infrastructure for this is now available:
https://stat.ethz.ch/pipermail/bioconductor/2012-May/045506.html
Is there anyone working on a wrapper that just takes the reads (or
pointer to BAM) and features and adds junction-level counts to the
exon-level ones?
Please discuss.
These are observations, not criticisms. I sincerely thank the
contributors for their excellent packages; it's very nice to have
multiple options available.
Best regards,
Mark
# Do not expect this to run. Populate a TXT file
# listing all the filenames (BAM/SAM)
# Given: metadata with file names
# starting from GTF files and BAM/SAM, derive tables of counts
sample.inform <- read.table("samples_bam_sam_etc.txt", header = TRUE,
stringsAsFactors = FALSE)
## -------------------
## easyRNASeq
## -------------------
library(easyRNASeq)
library(BSgenome.Dmelanogaster.UCSC.dm3)
gtf <- "Drosophila_melanogaster.BDGP5.67.gtf"
bamFlsR <- sample.inform[, "bamfile_resort"]
c_ers <- easyRNASeq(organism = "Dmelanogaster",
chr.sizes = as.list(seqlengths(Dmelanogaster)),
annotationMethod = "gtf", annotationFile= gtf,
format = "bam",gapped = TRUE, count = "genes",
summarization = "geneModels",
filenames = bamFlsR, filesDirectory = dir)
## -------------------
## countOverlaps
## -------------------
library(GenomicFeatures)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
gnModel <- exonsBy(txdb, "gene")
bamFls <- paste(sample.inform[, "bamfile_s"], sep = "")
UCSC2FlybaseGRanges<- function (GRanges) {
### Rename the chromosomes in<granges> from UCSC conventions ('chr1',
### etc) to comport with Flybase conventions ('1', etc) by stripping
### leading 'chr' and translating 'M' as 'dmel_mitochondrion_genome'.
seqlevels(GRanges)<- sub('chr','',seqlevels(GRanges))
seqlevels(GRanges)<-
sub('M','dmel_mitochondrion_genome',seqlevels(GRanges))
GRanges
}
gnModelT <- UCSC2FlybaseGRanges(gnModel)
counter <- function(fl, gnModel)
{
aln <- readGappedAlignments(fl)
strand(aln) <- "*" # for strand-blind sample prep protocol
hits <- countOverlaps(aln, gnModel)
counts <- countOverlaps(gnModel, aln[hits==1])
names(counts) <- names(gnModel)
counts
}
c_conOver <- sapply(bamFls, counter, gnModel = gnModelT)
## -------------------
## summarizeOverlaps using UCSC annotation
## -------------------
library(GenomicRanges)
library(Rsamtools)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
library(GenomicFeatures)
bamFlsR <- sample.inform[, "bamfile_resort"]
# separate paired and unpaired
bamFlsR_pe <- bamFlsR[sample.inform[, "libtype"] == "PE"]
bamFlsR_se <- bamFlsR[sample.inform[, "libtype"] == "SE"]
bamFlsR_peL <- BamFileList(bamFlsR_pe)
bamFlsR_seL <- BamFileList(bamFlsR_se)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
gnModel <- exonsBy(txdb, "gene")
UCSC2FlybaseGRanges<- function (GRanges) {
### Rename the chromosomes in<granges> from UCSC conventions ('chr1',
### etc) to comport with Flybase conventions ('1', etc) by stripping
### leading 'chr' and translating 'M' as 'dmel_mitochondrion_genome'.
seqlevels(GRanges)<- sub('chr','',seqlevels(GRanges))
seqlevels(GRanges)<-
sub('M','dmel_mitochondrion_genome',seqlevels(GRanges))
GRanges
}
gnModelT <- UCSC2FlybaseGRanges(gnModel)
con_pe <- summarizeOverlaps(gnModelT, bamFlsR_peL,
mode = "Union", singleEnd=FALSE,
ignore.strand = TRUE)
con_se <- summarizeOverlaps(gnModelT, bamFlsR_seL,
mode = "Union", singleEnd=TRUE,
ignore.strand = TRUE)
c_summOver <- cbind(assays(con_pe)$counts, assays(con_se)$counts)
colnames(count_summOver) <- c(basename(bamFlsR_pe),
basename(bamFlsR_se))
## -------------------
## Rsubread
## -------------------
library(Rsubread)
samFls <- sample.inform[, "samfile"]
gtf <- "Drosophila_melanogaster.BDGP5.67.gtf"
z <- read.table(gtf, sep="\t", stringsAsFactors=FALSE)
z <- z[z$V3=="exon",]
ids <- gsub(" gene_id ","",sapply(strsplit(z$V9,";"),.subset,1))
anno <- data.frame(entrezid=as.integer(factor(ids)),
chromosome=z$V1, chr_start=z$V4, chr_stop=z$V5)
ID <- sort(unique(anno$entrezid))
genID <- ids[match(ID, anno$entrezid)]
c_rsub <- featureCounts(SAMfiles=samFls, annot=anno, type =
"gene")$counts
rownames(count_rsub) <-genID
colnames(count_rsub) <-basename(samFls)
----------
Prof. Dr. Mark Robinson
Bioinformatics
Institute of Molecular Life Sciences
University of Zurich
Winterthurerstrasse 190
8057 Zurich
Switzerland
v: +41 44 635 4848
f: +41 44 635 6898
e: mark.robinson at imls.uzh.ch
o: Y11-J-16
w: http://tiny.cc/mrobin
----------
http://www.fgcz.ch/Bioconductor2012
http://www.eccb12.org/t5