On Dec 3, 2012, at 10:03 PM, Richard Friedman wrote:
> Dear Bioconductor List,
> I am working through the easyRNAseq use case to learn how
> to obtain counts in RNASeq experiments for further analysis.
> The example starts with BAM files and converts BAM files to Exons,
> transcripts and genes using geneModels.
> Does using geneModels only assemble previously annotated transcripts
> genes OR can it find new ones if present?
It can only assemble gene models from the exons defined in the
annotation file you provide.
> If it can find new ones how well does it do this in comparison to
No it can't. There's a "now old" functionality - not exposed - that
identify islands of coverage, i.e. in an approach similar to a very
basic approach at finding large ChIP-Seq regions (e.g. from histone
marks). But that would compare very poorly with cufflinks as it does
not take into account any splice-junction nor paired-read information.
> If it cannot find new ones - is there a way to get counts (as
distinct from fpkm
> values) for genes and transcripts from cufflinks and
> relate them to existing annotation where they correspond and present
> as non-previosuly annotated ones when they don't correspond?
Yes there is - I'm doing something very similar - but not in R. More
> Can easyRNASeq be used for this purpose?
Not directly no.
> Can anyone recommend a tool that can be used for this purpose?
> My goal is to get a set of counts that can be input into CQN and
> I wish to use TopHat/Cufflinks to get the Exons, transcripts, and
> novel spliced variants but I am persuaded CQN is a better way to
> FPKM and edgeR is a better way to analyze differential expression
I agree with you there. What you might want to consider in addition is
to use an aligner that is multi-mapping aware. RSEM is an example,
tophat/cufflinks does this in a similar way by default (or so I
believe, double-check that this option is really enabled by default).
Here it's useful to use bowtie2 rather than bowtie. Finally, I like
the idea of using CQN, I'll consider adding this to the easyRNASeq
As I said, I have a very similar setup, but completely de-novo. I've
been (still am) testing several approaches:
1) running TopHat/Cufflinks/Cuffmerge (cuffmerge to get the exon/gene
GFF) and from that I go back to the original alignments by tophat and
use these as input together with the GFF for easyRNASeq. I then get my
DESeq/edgeR output and proceed in R.
2) In parallel, I'm doing the same using trinity (assembling the
transcriptome de novo) and re-aligning the read libraries using RSEM.
From the RSEM results, I get directly the count tables that I extract
as a matrix and directly input into edgeR/DESeq (bypassing
3) Same as 2, but I create a GFF annotation by performing a re-
alignments of the obtained contig to the genome assembly using GMAP.
Then using the RSEM bam files, I use easyRNASeq to create the count
I'm still evaluating the outcome of these different approaches.
As it seems that you already have some annotation for your
genome/transcriptome, that should ease some of the steps, e.g. instead
of running cuffmerge, you could run cuffcompare to refine your GFF
(see my approach #1). Since this discussion is not really related to
Bioconductor, we could continue it offline if you like.
> I would appreciate any advice.
> Thanks and best wishes,
> Richard A. Friedman, PhD
> Associate Research Scientist,
> Biomedical Informatics Shared Resource
> Herbert Irving Comprehensive Cancer Center (HICCC)
> Department of Biomedical Informatics (DBMI)
> Educational Coordinator,
> Center for Computational Biology and Bioinformatics (C2B2)/
> National Center for Multiscale Analysis of Genomic Networks
> Columbia Initiative in Systems Biology
> Room 824
> Irving Cancer Research Center
> Columbia University
> 1130 St. Nicholas Ave
> New York, NY 10032
> (212)851-4765 (voice)
> friedman at cancercenter.columbia.edu
> In memoriam, Ray Bradbury
> [[alternative HTML version deleted]]
> Bioconductor mailing list
> Bioconductor at r-project.org
> Search the archives: