Search
Question: Measuring shifts in intronic expression independently of exons
0
4.6 years ago by
Spain
James Perkins60 wrote:
Hi Alejandro, When I say "shrunken FCs" I mean the logFC(case/control) returned by the estimate2FoldChanges function When I say "raw fold changes" I'm talking about manually working out the fold change from the raw data. Is what you describe not essentially calculating a shrunken fold change value given that I have effectively 2 exons per gene? So - to calculate what I call the "shrunken fold change" I can do the following: res.nofilt[res.nofilt$geneID == g & res.nofilt$exonID == "E001", "log2fold(case/control)"] # For exons res.nofilt[res.nofilt$geneID == g & res.nofilt$exonID == "E002", "log2fold(case/control)"] # For introns Then, to calculate the "raw fold change" I do this: log2(mean(counts(bare.nofilt)[paste(g,"E001",sep=":"), 4:6]) / mean(counts(bare.nofilt)[paste(g,"E001",sep=":"), 1:3])) # For exons log2(mean(counts(bare.nofilt)[paste(g,"E002",sep=":"), 4:6]) / mean(counts(bare.nofilt)[paste(g,"E002",sep=":"), 1:3])) # For introns For these examples, bare.nofilt is the exonCountSet, r es.nofilt is derived from: bare.nofilt <- estimatelog2FoldChanges(bare.nofilt, averageOutExpression=FALSE, .denominator="control") res.nofilt <- DEUresultTable(bare.nofilt) Then if I plot them, it appears at though the estimatelog2FoldChanges is in effect, indirectly producing a "shrunken" fold change? Or something similar to that? At least that is how I understand it, I may be way off! plot here: http://postimg.org/image/vig3slnht/ Thanks, Jim On 30 September 2013 17:29, Alejandro Reyes <alejandro.reyes@embl.de> wrote: > Hi James, > > Before saying anything... one question, what do you mean with "shrunken > FCs" and with "raw fold changes"? > > It is a bit confusing, since DEXSeq does not do any shrinkage in the fold > changes (so far). > The fold change from the "estimatelog2FoldChanges" function calculates the > interaction > coefficient between your condition of interest and the exons (using a glm > fit) to estimate > the "fitted splicing", which already should be independent from overall > gene expression > effects. This is what you see plotted in the plotDEXSeq function! After a > variance stabilizing > transformation and a log2 transformation are used and this is what > estimatelog2FoldChanges > adds to the fData. > > Alejandro > > > > > > Hi Wolfgang, >> >> Thanks a lot for all your help on this. One final thing I noticed, which >> may be of value to anyone following this problem: >> >> I noticed that when I use the raw FCs instead of the shrunken FCs to make >> the volcano plot, the pattern is much clearer: >> http://postimg.org/image/**4ri7pgdht/<http: postimg.org="" image="" 4ri7="" pgdht=""/> >> >> Please ignore the x axis label. >> >> I also notice that the negative clump remains, which is suggestive that >> the >> majority of "changing" genes show generally an increased expression in >> intronic regions compared to exonic regions. >> >> Wrt your reply, we am now going through individual examples, there are >> some >> potentially interesting patterns of intersting patterns of expression. I >> am >> also using quite large cut offs e.g. at least 200 reads align to intronic >> regions for at least n/2 samples (where n is the total number of samples). >> This is quite important given the length of intronic regions and seems to >> lead to more interesting results when going through the list of top genes. >> >> Cheers again for your help and guidance with thtis, >> >> Jim >> >> >> On 29 September 2013 10:30, Wolfgang Huber <whuber@embl.de> wrote: >> >> Dear James >>> >>> thanks. What you do below makes sense, as far as I can see. What DEXSeq >>> considers is, applied to your situation: >>> >>> number of reads mapping to E001 >>> ------------------------------**- >>> number of reads mapping to both >>> >>> (and similarly for E002), and it tries to find significant changes of >>> that >>> ratio along with your treatment. Maybe making the Volcano plot with the >>> difference of the logarithm of these ratios across condition as 'effect >>> size' would also be informative, but I am not sure it has additional >>> info. >>> The plots you sent below look OK to me, I'd indeed now go down the list >>> of >>> genes and introns and see whether you can independently verify it (e.g. >>> through a biological hypothesis). >>> >>> Best wishes >>> Wolfgang >>> >>> >>> >>> >>> >>> Il giorno Sep 26, 2013, alle ore 5:00 pm, James Perkins < >>> jimrperkins@gmail.com> ha scritto: >>> >>> Hi again, >>>> >>>> Maybe I should rephrase the last part better. >>>> >>>> We have a fold change for the E001 "exon" i.e. the summed exonic >>>> >>> expression >>> >>>> we have a fold change for the E002 "exon" i.e. the summed intronic >>>> >>> expression >>> >>>> However what we are really interested in is the discrepancy between the >>>> >>> two, i.e. introns go one way, exons the other, or exons don't change but >>> introns do, or whatever. >>> >>>> I've made a "sort of" volcano plot - I've subtracted the log2 fold >>>> >>> change calculated using intron only counting from the log2FC calculated >>> using exon only counting. >>> >>>> http://postimg.org/image/**o9hi6o4pj/<http: postimg.org="" image="" o9="" hi6o4pj=""/> >>>> >>>> I've plotted this against -log10 p val. You see there is some "volcano" >>>> >>> activity although it's quite 1 sides - as I understand it, this is >>> suggesting that more genes are increasing in intronic expression in case >>> samples compared to exonic expression, since the cluster of negative >>> values >>> with low p values suggests that the FC is higher for introns than exons. >>> >>>> However the "effect size" as I say is pretty small. This is probably in >>>> >>> part due to the way FCs are shrunk by DEXSeq, should I use the FC >>> shrinking >>> in this case? >>> >>>> BTW There are a couple of outliers in the volcano plot, the real plot is >>>> >>> like this: >>> >>>> http://postimg.org/image/**r4c7uhf5t/<http: postimg.org="" image="" r4="" c7uhf5t=""/> >>>> >>>> BTW I've not used the plotDEXSeq function because this is exon focussed >>>> >>> if I understand correctly, instead I've used the ggbio package. >>> >>>> Still unsure what criteria to use to say whether a gene is really >>>> >>> showing increased intronic expression. Probably I can't I can just go >>> down >>> the list of p-values looking at the patterns of expression and try to say >>> which are interesting! >>> >>>> Cheers, >>>> >>>> Jim >>>> >>>> >>>> On 24 September 2013 11:17, James Perkins <jimrperkins@gmail.com> >>>> wrote: >>>> Dear Wolfgang, Martin, list >>>> >>>> Sorry for the delay >>>> >>>> Thanks for the explanation re: counting per exons vs. counting per gene >>>> >>> wrt multiple transcripts. This doesn't effect my "real data" example >>> where >>> I counted reads mapping to all exons and reads mapping to all introns and >>> classed them as two different "exons" per gene for the sake of runnign >>> DEXSeq. Martin - my approach is very simple so I would say you idea is >>> not >>> quite relevant in this case. However this method could be possibly >>> extended >>> to look for individual introns showing an increase in expression, rather >>> than detecting genes who show a general relative increase in intronic >>> expression summed accross exons/introns. >>> >>>> Wolfgang - I agree with your points re: plotting the data - this find >>>> >>> some nice instances of changes in 'intronic expression' in the case vs. >>> control samples. >>> >>>> Re: this point: >>>> >>>> >>>> - an effect size (fold-change) cutoff might help focus the results on >>>> >>> the most interesting instances. How does the volcano plot look like? >>> >>>> >>>> I like the idea of the effect size cutoff, however what do you mean by >>>> >>> fold change (for plotting the volcano) do you mean the change in ratios >>> of >>> exon to intronic, or what should I plot? and vs. what? p-value I assume >>> but >>> just to be clear. >>> >>>> Thanks a lot again for your help. >>>> >>>> Jim >>>> >>>> >>>> >>>> On 30 August 2013 15:35, Wolfgang Huber <whuber@embl.de> wrote: >>>> Dear Jim, Martin & list >>>> >>>> Martin's proposal of using splice graphs is definitely the way to go, it >>>> >>> is conceptually appealing, would have more power, and results are >>> potentially more readily interpretable. That said, the current state of >>> what we have in DEXSeq does the testing exon-by-exon, and lack of power >>> does not seem to be James' problem. Regarding James' questions: >>> >>>> 1. It is fine, and to be expected, that the sum of per-exon counts is >>>> >>> larger than the per-gene count. This is OK because a read that covers >>> multiple exons is counted multiple times, for each of the overlaps. This >>> makes sense for the testing setup of DEXSeq. It tests, marginally, exon >>> by >>> exon, whether its relative usage among all transcripts from the locus is >>> condition-dependent. Since the tests are marginal, correlations between >>> counts of neighbouring exons are not a problem. Each overlap of a read >>> is a >>> postiive piece of evidence for it being expressed. >>> >>>> 2. Regarding your high number of results, here are some considerations: >>>> - the fact that swapping sample labels leads to greatly reduced sets of >>>> >>> hits is reassuring >>> >>>> - definitely do go ahead with visualising the results to see what is >>>> >>> going on and whether this large number of calls is justified by the data >>> (please keep us in the loop) >>> >>>> - did you see the "plotDEXSeq" function >>>> - an effect size (fold-change) cutoff might help focus the results on >>>> >>> the most interesting instances. How does the volcano plot look like? >>> >>>> Hope this helps >>>> Wolfgang >>>> >>>> >>>> On 29 Aug 2013, at 21:27, James Perkins <jimrperkins@gmail.com> wrote: >>>> >>>> Hi Wolfgang, list >>>>> >>>>> Thank you very much for the helpful tip. >>>>> >>>>> I did exactly as you suggested, the code is below. I wanted to do it >>>>> >>>> using the pasilla package in order to create an instructive example, >>> and so >>> that you could potentially recreate the analysis, but strangely sometimes >>> the gene read counts from pasillaGenes are lower than the summed exon >>> counts from the pasillaExon object. Or at least that's the case according >>> to my code! >>> >>>> If you (or anyone else who is reading this) can think of a good public >>>>> >>>> data set for this please let me know, ideally one that is already in >>> bioconductor and fairly easy to manipulate. Otherwise I am happy to >>> submit >>> my current dataset to bioc or make available some other way once its >>> corresponding paper has been published >>> >>>> So, code below. Any comments or suggestions would be appreciated, if >>>>> >>>> you think there are ways I could tune DEXSeq for this purpose: >>> >>>> To start with I loaded in all the necessary files and made a counts >>>>> >>>> object with Gend ID:E001 for intronic and GeneID:E002 for exonic. >>> >>>> # c50tx - table of counts using reads mapping to anywhere in gene >>>>> >>>> c50ex - reads mapping to exons only >>> >>>> intMat <- as.data.frame(c50tx) - as.data.frame(c50ex) >>>>> exMat <- as.data.frame(c50ex) >>>>> >>>>> # Remove genes for which intron count is below 100 >>>>> intMat.2 <- intMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] >>>>> exMat <- exMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] >>>>> intMat <- intMat.2 >>>>> >>>>> row.names(intMat) <- paste(row.names(intMat), ":E001", sep="") >>>>> row.names(exMat) <- paste(row.names(exMat), ":E002", sep="") >>>>> # Now I combine the different tables in order to get a combined table >>>>> >>>> as so: >>> >>>> both <- rbind(intMat, exMat) >>>>> both <- both[order(row.names(both)), ] >>>>> head(both) >>>>> >>>>> # Now I turn this into an exon count set >>>>> gIDs <- matrix(unlist(strsplit(row.**names(both), ":")), ncol=2, >>>>> >>>> byrow=TRUE)[,1] >>> >>>> names(gIDs) <- row.names(both) >>>>> gIDs <- as.factor(gIDs) >>>>> eIDs <- matrix(unlist(strsplit(row.**names(both), ":")), ncol=2, >>>>> >>>> byrow=TRUE)[,2] >>> >>>> names(eIDs) <- row.names(both) >>>>> des <- data.frame(condition=c(rep("**case",3), rep("control", 3))) >>>>> row.names(des) <- colnames(both) >>>>> ecS <- newExonCountSet( >>>>> countData = both, >>>>> design = des, >>>>> geneIDs=gIDs, >>>>> exonIDs=eIDs) >>>>> >>>>> Once I had this I ran normalisation etc. and looked for genes with >>>>> >>>> 'differential exon/intron usage': >>> >>>> ecS <- estimateSizeFactors(ecS) >>>>> ecS <- estimateDispersions(ecS) >>>>> ecS <- fitDispersionFunction(ecS) >>>>> head(fData(ecS)$**dispBeforeSharing) >>>>> >>>>> ecS <- testForDEU(ecS) >>>>> head(fData(ecS)[, c("pvalue", "padjust")]) >>>>> ecS <- estimatelog2FoldChanges(ecS, averageOutExpression=FALSE) >>>>> res2 <- DEUresultTable(ecS) >>>>> sum(ecS$pad >>>>> >>>>> >>>>> So - out of 29516 genes 'tested', ~7000 had > 100 intronic counts for >>>>> >>>> at least 3 conditions and ~4000 of these were DEU (FDR<0.1). This seems >>> quite a high number of genes, so I wonder if the method is being >>> conservative enough. >>> >>>> That said, tried swapping the sample labels a few times, which lead to >>>>> >>>> 0 genes being called as DEU, which is a good sign I think. >>> >>>> Interestingly, comparing intron:exon for case and controls for the >>>>> >>>> significant DEU genes shows a 4:1 ratio of genes that with an increased >>> intronic expression in case samples, suggesting the experimental >>> intervention is leading to increased intronic expression. >>> >>>> So, does this approach seem valid? My main question is whether it is a >>>>> >>>> little unconservative. I'm currently using ggbio to plot the read >>> alignments for the significant genes and looking for interesting looking >>> patterns of intronic expression that are consistent across replicates of >>> the same class (and that differ between classes). I've found a few things >>> that look interesting for follow up. >>> >>>> Thanks a lot for your help, >>>>> >>>>> Jim >>>>> >>>>> On 9 August 2013 16:02, Wolfgang Huber <whuber@embl.de> wrote: >>>>> Dear James >>>>> have you already checked the DEXSeq package and the paper >>>>> >>>> http://genome.cshlp.org/**content/22/10/2008.long<http: genome.c="" shlp.org="" content="" 22="" 10="" 2008.long=""> >>> >>>> This does not apply 1:1 to what you are asking for, but afaIcs the >>>>> >>>> main modification would be for you to define "counting bins" (on which >>> the >>> input to DEXSeq is computed by read overlap counting) that represent (i) >>> exons and (ii) introns and then check for changes in relative intro usage >>> (the 'ratios' you mention below). >>> >>>> Let me know how it goes >>>>> Wolfgang. >>>>> >>>>> On 9 Aug 2013, at 10:02, James Perkins <j.perkins@ucl.ac.uk> wrote: >>>>> >>>>> Dear list, >>>>>> >>>>>> I would like to know if an experimental treatment leads to a >>>>>> >>>>> significant >>> >>>> shift in intronic expression for some genes. >>>>>> >>>>>> Imagine I have an experiment with 6 biological replicates of a given >>>>>> tissue. I believe that the treatment might lead to an increased >>>>>> >>>>> intronic >>> >>>> expression for some genes, unrelated to exonic expression. >>>>>> >>>>>> 3 of these receive no treament, they are used as control. The other 3 >>>>>> receive experimental treatment. >>>>>> >>>>>> I then sequence the mRNA from these samples (Illumina, single end >>>>>> >>>>> reads, >>> >>>> ~40 million reads per sample), to obtain 6 fastq files, I align >>>>>> >>>>> these to a >>> >>>> refernce genome and get bam files. >>>>>> >>>>>> I was thinking that a fairly easy way to see if some genes show a >>>>>> consistent increased intronic expression following treatment would >>>>>> >>>>> be to >>> >>>> count intronically aligning reads for each gene (e.g. using >>>>>> >>>>> GenomicRanges) >>> >>>> and use something like DESeq to look for genes showing a significant >>>>>> >>>>> change >>> >>>> in intronic "expression". >>>>>> >>>>>> However, the problem is that this might be due to exonic expression, >>>>>> >>>>> due to >>> >>>> premature mRNAs etc., so I might end up finding genes that are >>>>>> differentially expressed at the exon level, and as a result the >>>>>> >>>>> increased >>> >>>> exon expression has caused increased intronic expression as a by >>>>>> >>>>> product. >>> >>>> Obviously I am not so interested in these genes wrt this method, I >>>>>> >>>>> can find >>> >>>> these using "traditional" DE methods. >>>>>> >>>>>> In addition, when I tried profiling intronic regions using reads >>>>>> >>>>> mapping to >>> >>>> introns, (using DESeq) it led to dodgy MA plots, where the 0 FC line >>>>>> >>>>> was >>> >>>> quite far above the minimum mean expression point, i.e. it didn't go >>>>>> through the middle of the clump of data points (if that makes >>>>>> >>>>> sense). I >>> >>>> wonder if this is due to the size factor calculation being based on >>>>>> intronic "expression" (well, reads mapping to introns), which is >>>>>> >>>>> generally >>> >>>> much lower than exon expression and therefore being somewhat >>>>>> >>>>> unreliable. >>> >>>> So I would like to take this exon expression out of the equation, >>>>>> >>>>> so to >>> >>>> speak. >>>>>> >>>>>> I thought that one way might be to compare the ratios of exonic to >>>>>> >>>>> intronic >>> >>>> reads between samples, for each gene. >>>>>> >>>>>> For example one gene might have 30, 35 and 33 exonically mapping >>>>>> >>>>> reads and >>> >>>> 10,11 and 12 intronically mapping reads for control samples >>>>>> >>>>>> For case samples it might have 33, 32 and 34 exonically mapping >>>>>> >>>>> reads; >>> >>>> 20,21 and 19 intronically mapping reads. >>>>>> >>>>>> So we could compare 10/30, 11/35 and 12/33 for control to 20/33, >>>>>> >>>>> 21/32 and >>> >>>> 19/34 for case. >>>>>> >>>>>> Does this methodology sound reasonable? It is necessarily based on >>>>>> >>>>> the >>> >>>> assumption that intronic "expression" due to unspliced RNAs is >>>>>> >>>>> correlated >>> >>>> with exon expression. >>>>>> >>>>>> If it sounds reasonable, is there a test that is recommended to >>>>>> >>>>> compare the >>> >>>> ratios in such a way, that takes into account the biological >>>>>> >>>>> replication of >>> >>>> samples? I could do a simple test (chi squared) to compare the >>>>>> >>>>> relative >>> >>>> frequencies, but this wouldn't take into account the replicates. >>>>>> >>>>>> I realise this isn't really a specific bioconductor question, but >>>>>> >>>>> hopefully >>> >>>> it might be of interest to some of the list subscribers. >>>>>> >>>>>> Many thanks, >>>>>> >>>>>> Jim >>>>>> >>>>>> -- >>>>>> James R Perkins, PhD >>>>>> >>>>>> [[alternative HTML version deleted]] >>>>>> >>>>>> ______________________________**_________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor@r-project.org >>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: st="" at.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>>> Search the archives: >>>>>> >>>>> http://news.gmane.org/gmane.**science.biology.informatics.**cond uctor<http: news.gmane.org="" gmane.science.biology.informatics.conducto="" r=""> >>> >>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> James R Perkins PhD >>>>> >>>> >>>> >>>> >>>> -- >>>> James R Perkins PhD >>>> >>>> >>>> >>>> -- >>>> James R Perkins PhD >>>> >>> >>> >> > -- James R Perkins PhD [[alternative HTML version deleted]]