I'm looking for some help when analysing RNA-Seq data for differential exon usage/splicing. I've previously used edgeR for DE analysis and was happy to see diffSpliceDGE as part of that package. What I don't know is:
- how to obtain exon level counts
- whether there is a normalisation step involved like there is in a DGE pipeline (i.e. TMM)
For the first point, I have BAM files for the samples I'd like to analyse that were mapped using STAR to the GRCh38 reference. Is there a way to extract exon-level counts from these files? In case it matters I'm analysing data from an exon skipping experiment and so will be looking at differential splicing for our gene of interest as well as any off-target effects.
For the second point, I've been through the example on the documents page (https://rdrr.io/bioc/edgeR/man/diffSpliceDGE.html) and while that made perfect sense I noticed there was no normalisation step used. How, if at all, should the exon level count data be normalised prior to analysis?
Many thanks for your help and time.
Thank you for your reply. I had an old copy of the edgeR manual and hadn't seen the new case studies. Excellent! I've been through the Case Study on the Pasilla knockdown, which looks exactly what I am after. Although I have a couple of questions.
Here's my code and the resulting error:
My counts data looks like this:
annotation looks like this:
As these are exon-level counts, I'd expect the same ID appearing on multiple rows. Any ideas why there's an error?
I see how the use of
Batch <- factor(c(1,3,4,1,3,4))is trying to account for both with sequencing type being
c('SE','PE','PE','SE','PE','PE')but the date is more complex. It looks to me that N1 is run over two dates, N3 is run over two dates and N4 is run on one date, it's the D samples where I'm confused, D1 is run over 4 different days but D3 and D4 are both run on the same day. Given that D3 and D4 are also both PE, should it be
Batch <- factor(c(1,3,4,1,3,3))or something similar?
Many thanks again for your help.
The error is caused by having duplicate
row.names, which is not allowed. To get rid of this error, you can simply drop the
row.namesof the count matrix before creating the
The exon information is still preserved in
I don't see anything wrong with the design. There are 6 RNA-seq libraries from 3 individual samples under 2 conditions. It is a very classic paired two group comparison. The design matrix accounts for the batch effect between different samples. It doesn't take into account the sequencing type/date.
Thank you again. Regarding the design matrix, I guess my confusion comes from the manual saying
So I couldn't work out how the
Batch <- factor(c(1,3,4,1,3,4))accounted for the largest source of variation in this data, the sequencing type. To clarify, is that accounted for simply by the fact that the N* and D* samples are paired and also paired in terms of their sequencing type and so the batch effect seen from sequencing type can be accounted for by simply providing a paired design? Would it be redundant to add another level in there specifically for sequencing type e.g.
SeqType <- c(0,1,1,0,1,1)and then use
design <- model.matrix(~ Batch + SeqType + Pasilla)?
Apologies if this is obvious, I'm just trying to learn and get it right, so thank you for you help :) .
Batchis nested within your
SeqType. It would be redundant to add
SeqTypein the design matrix when
Batchis already included.
So to be clear on the nesting, the
1group in the
SeqTypeis further split into
Batchand so the use of
SeqTypeis redundant as it's not adding any more information than just including
Batch. If that's correct then thank you, you have been very helpful indeed and I'll stop pestering you!
Yes, that is correct.