Question: Measuring shifts in intronic expression independently of exons
0
5.5 years ago by
Spain
James Perkins60 wrote:
Dear Wolfgang, Martin, list Sorry for the delay Thanks for the explanation re: counting per exons vs. counting per gene wrt multiple transcripts. This doesn't effect my "real data" example where I counted reads mapping to all exons and reads mapping to all introns and classed them as two different "exons" per gene for the sake of runnign DEXSeq. Martin - my approach is very simple so I would say you idea is not quite relevant in this case. However this method could be possibly extended to look for individual introns showing an increase in expression, rather than detecting genes who show a general relative increase in intronic expression summed accross exons/introns. Wolfgang - I agree with your points re: plotting the data - this find some nice instances of changes in 'intronic expression' in the case vs. control samples. Re: this point: * - an effect size (fold-change) cutoff might help focus the results on the most interesting instances. How does the volcano plot look like? * I like the idea of the effect size cutoff, however what do you mean by fold change (for plotting the volcano) do you mean the change in ratios of exon to intronic, or what should I plot? and vs. what? p-value I assume but just to be clear. Thanks a lot again for your help. Jim On 30 August 2013 15:35, Wolfgang Huber <whuber@embl.de> wrote: > Dear Jim, Martin & list > > Martin's proposal of using splice graphs is definitely the way to go, it > is conceptually appealing, would have more power, and results are > potentially more readily interpretable. That said, the current state of > what we have in DEXSeq does the testing exon-by-exon, and lack of power > does not seem to be James' problem. Regarding James' questions: > > 1. It is fine, and to be expected, that the sum of per-exon counts is > larger than the per-gene count. This is OK because a read that covers > multiple exons is counted multiple times, for each of the overlaps. This > makes sense for the testing setup of DEXSeq. It tests, marginally, exon by > exon, whether its relative usage among all transcripts from the locus is > condition-dependent. Since the tests are marginal, correlations between > counts of neighbouring exons are not a problem. Each overlap of a read is a > postiive piece of evidence for it being expressed. > > 2. Regarding your high number of results, here are some considerations: > - the fact that swapping sample labels leads to greatly reduced sets of > hits is reassuring > - definitely do go ahead with visualising the results to see what is going > on and whether this large number of calls is justified by the data (please > keep us in the loop) > - did you see the "plotDEXSeq" function > - an effect size (fold-change) cutoff might help focus the results on the > most interesting instances. How does the volcano plot look like? > > Hope this helps > Wolfgang > > > On 29 Aug 2013, at 21:27, James Perkins <jimrperkins@gmail.com> wrote: > > > Hi Wolfgang, list > > > > Thank you very much for the helpful tip. > > > > I did exactly as you suggested, the code is below. I wanted to do it > using the pasilla package in order to create an instructive example, and so > that you could potentially recreate the analysis, but strangely sometimes > the gene read counts from pasillaGenes are lower than the summed exon > counts from the pasillaExon object. Or at least that's the case according > to my code! > > > > If you (or anyone else who is reading this) can think of a good public > data set for this please let me know, ideally one that is already in > bioconductor and fairly easy to manipulate. Otherwise I am happy to submit > my current dataset to bioc or make available some other way once its > corresponding paper has been published > > > > So, code below. Any comments or suggestions would be appreciated, if you > think there are ways I could tune DEXSeq for this purpose: > > > > To start with I loaded in all the necessary files and made a counts > object with Gend ID:E001 for intronic and GeneID:E002 for exonic. > > > > # c50tx - table of counts using reads mapping to anywhere in gene c50ex > - reads mapping to exons only > > intMat <- as.data.frame(c50tx) - as.data.frame(c50ex) > > exMat <- as.data.frame(c50ex) > > > > # Remove genes for which intron count is below 100 > > intMat.2 <- intMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] > > exMat <- exMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] > > intMat <- intMat.2 > > > > row.names(intMat) <- paste(row.names(intMat), ":E001", sep="") > > row.names(exMat) <- paste(row.names(exMat), ":E002", sep="") > > # Now I combine the different tables in order to get a combined table as > so: > > both <- rbind(intMat, exMat) > > both <- both[order(row.names(both)), ] > > head(both) > > > > # Now I turn this into an exon count set > > gIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, > byrow=TRUE)[,1] > > names(gIDs) <- row.names(both) > > gIDs <- as.factor(gIDs) > > eIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, > byrow=TRUE)[,2] > > names(eIDs) <- row.names(both) > > des <- data.frame(condition=c(rep("case",3), rep("control", 3))) > > row.names(des) <- colnames(both) > > ecS <- newExonCountSet( > > countData = both, > > design = des, > > geneIDs=gIDs, > > exonIDs=eIDs) > > > > Once I had this I ran normalisation etc. and looked for genes with > 'differential exon/intron usage': > > > > ecS <- estimateSizeFactors(ecS) > > ecS <- estimateDispersions(ecS) > > ecS <- fitDispersionFunction(ecS) > > head(fData(ecS)$dispBeforeSharing) > > > > ecS <- testForDEU(ecS) > > head(fData(ecS)[, c("pvalue", "padjust")]) > > ecS <- estimatelog2FoldChanges(ecS, averageOutExpression=FALSE) > > res2 <- DEUresultTable(ecS) > > sum(ecS$pad > > > > > > So - out of 29516 genes 'tested', ~7000 had > 100 intronic counts for at > least 3 conditions and ~4000 of these were DEU (FDR<0.1). This seems quite > a high number of genes, so I wonder if the method is being conservative > enough. > > > > That said, tried swapping the sample labels a few times, which lead to 0 > genes being called as DEU, which is a good sign I think. > > > > Interestingly, comparing intron:exon for case and controls for the > significant DEU genes shows a 4:1 ratio of genes that with an increased > intronic expression in case samples, suggesting the experimental > intervention is leading to increased intronic expression. > > > > So, does this approach seem valid? My main question is whether it is a > little unconservative. I'm currently using ggbio to plot the read > alignments for the significant genes and looking for interesting looking > patterns of intronic expression that are consistent across replicates of > the same class (and that differ between classes). I've found a few things > that look interesting for follow up. > > > > Thanks a lot for your help, > > > > Jim > > > > On 9 August 2013 16:02, Wolfgang Huber <whuber@embl.de> wrote: > > Dear James > > have you already checked the DEXSeq package and the paper > http://genome.cshlp.org/content/22/10/2008.long > > This does not apply 1:1 to what you are asking for, but afaIcs the main > modification would be for you to define "counting bins" (on which the input > to DEXSeq is computed by read overlap counting) that represent (i) exons > and (ii) introns and then check for changes in relative intro usage (the > 'ratios' you mention below). > > Let me know how it goes > > Wolfgang. > > > > On 9 Aug 2013, at 10:02, James Perkins <j.perkins@ucl.ac.uk> wrote: > > > > > Dear list, > > > > > > I would like to know if an experimental treatment leads to a > significant > > > shift in intronic expression for some genes. > > > > > > Imagine I have an experiment with 6 biological replicates of a given > > > tissue. I believe that the treatment might lead to an increased > intronic > > > expression for some genes, unrelated to exonic expression. > > > > > > 3 of these receive no treament, they are used as control. The other 3 > > > receive experimental treatment. > > > > > > I then sequence the mRNA from these samples (Illumina, single end > reads, > > > ~40 million reads per sample), to obtain 6 fastq files, I align these > to a > > > refernce genome and get bam files. > > > > > > I was thinking that a fairly easy way to see if some genes show a > > > consistent increased intronic expression following treatment would be > to > > > count intronically aligning reads for each gene (e.g. using > GenomicRanges) > > > and use something like DESeq to look for genes showing a significant > change > > > in intronic "expression". > > > > > > However, the problem is that this might be due to exonic expression, > due to > > > premature mRNAs etc., so I might end up finding genes that are > > > differentially expressed at the exon level, and as a result the > increased > > > exon expression has caused increased intronic expression as a by > product. > > > Obviously I am not so interested in these genes wrt this method, I can > find > > > these using "traditional" DE methods. > > > > > > In addition, when I tried profiling intronic regions using reads > mapping to > > > introns, (using DESeq) it led to dodgy MA plots, where the 0 FC line > was > > > quite far above the minimum mean expression point, i.e. it didn't go > > > through the middle of the clump of data points (if that makes sense). I > > > wonder if this is due to the size factor calculation being based on > > > intronic "expression" (well, reads mapping to introns), which is > generally > > > much lower than exon expression and therefore being somewhat > unreliable. > > > > > > So I would like to take this exon expression out of the equation, so > to > > > speak. > > > > > > I thought that one way might be to compare the ratios of exonic to > intronic > > > reads between samples, for each gene. > > > > > > For example one gene might have 30, 35 and 33 exonically mapping reads > and > > > 10,11 and 12 intronically mapping reads for control samples > > > > > > For case samples it might have 33, 32 and 34 exonically mapping reads; > > > 20,21 and 19 intronically mapping reads. > > > > > > So we could compare 10/30, 11/35 and 12/33 for control to 20/33, 21/32 > and > > > 19/34 for case. > > > > > > Does this methodology sound reasonable? It is necessarily based on the > > > assumption that intronic "expression" due to unspliced RNAs is > correlated > > > with exon expression. > > > > > > If it sounds reasonable, is there a test that is recommended to > compare the > > > ratios in such a way, that takes into account the biological > replication of > > > samples? I could do a simple test (chi squared) to compare the relative > > > frequencies, but this wouldn't take into account the replicates. > > > > > > I realise this isn't really a specific bioconductor question, but > hopefully > > > it might be of interest to some of the list subscribers. > > > > > > Many thanks, > > > > > > Jim > > > > > > -- > > > James R Perkins, PhD > > > > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > -- > > James R Perkins PhD > > -- James R Perkins PhD [[alternative HTML version deleted]]
go deseq dexseq ggbio • 718 views
modified 5.5 years ago • written 5.5 years ago by James Perkins60
Answer: Measuring shifts in intronic expression independently of exons
0
5.5 years ago by
Spain
James Perkins60 wrote:
Hi again, Maybe I should rephrase the last part better. We have a fold change for the E001 "exon" i.e. the summed exonic expression we have a fold change for the E002 "exon" i.e. the summed intronic expression However what we are really interested in is the discrepancy between the two, i.e. introns go one way, exons the other, or exons don't change but introns do, or whatever. I've made a "sort of" volcano plot - I've subtracted the log2 fold change calculated using intron only counting from the log2FC calculated using exon only counting. http://postimg.org/image/o9hi6o4pj/ I've plotted this against -log10 p val. You see there is some "volcano" activity although it's quite 1 sides - as I understand it, this is suggesting that more genes are increasing in intronic expression in case samples compared to exonic expression, since the cluster of negative values with low p values suggests that the FC is higher for introns than exons. However the "effect size" as I say is pretty small. This is probably in part due to the way FCs are shrunk by DEXSeq, should I use the FC shrinking in this case? BTW There are a couple of outliers in the volcano plot, the real plot is like this: http://postimg.org/image/r4c7uhf5t/ BTW I've not used the plotDEXSeq function because this is exon focussed if I understand correctly, instead I've used the ggbio package. Still unsure what criteria to use to say whether a gene is really showing increased intronic expression. Probably I can't I can just go down the list of p-values looking at the patterns of expression and try to say which are interesting! Cheers, Jim On 24 September 2013 11:17, James Perkins <jimrperkins@gmail.com> wrote: > Dear Wolfgang, Martin, list > > Sorry for the delay > > Thanks for the explanation re: counting per exons vs. counting per gene > wrt multiple transcripts. This doesn't effect my "real data" example where > I counted reads mapping to all exons and reads mapping to all introns and > classed them as two different "exons" per gene for the sake of runnign > DEXSeq. Martin - my approach is very simple so I would say you idea is not > quite relevant in this case. However this method could be possibly extended > to look for individual introns showing an increase in expression, rather > than detecting genes who show a general relative increase in intronic > expression summed accross exons/introns. > > Wolfgang - I agree with your points re: plotting the data - this find some > nice instances of changes in 'intronic expression' in the case vs. control > samples. > > Re: this point: > > * > - an effect size (fold-change) cutoff might help focus the results on the > most interesting instances. How does the volcano plot look like? > * > > I like the idea of the effect size cutoff, however what do you mean by > fold change (for plotting the volcano) do you mean the change in ratios of > exon to intronic, or what should I plot? and vs. what? p-value I assume but > just to be clear. > > Thanks a lot again for your help. > > Jim > > > > On 30 August 2013 15:35, Wolfgang Huber <whuber@embl.de> wrote: > >> Dear Jim, Martin & list >> >> Martin's proposal of using splice graphs is definitely the way to go, it >> is conceptually appealing, would have more power, and results are >> potentially more readily interpretable. That said, the current state of >> what we have in DEXSeq does the testing exon-by-exon, and lack of power >> does not seem to be James' problem. Regarding James' questions: >> >> 1. It is fine, and to be expected, that the sum of per-exon counts is >> larger than the per-gene count. This is OK because a read that covers >> multiple exons is counted multiple times, for each of the overlaps. This >> makes sense for the testing setup of DEXSeq. It tests, marginally, exon by >> exon, whether its relative usage among all transcripts from the locus is >> condition-dependent. Since the tests are marginal, correlations between >> counts of neighbouring exons are not a problem. Each overlap of a read is a >> postiive piece of evidence for it being expressed. >> >> 2. Regarding your high number of results, here are some considerations: >> - the fact that swapping sample labels leads to greatly reduced sets of >> hits is reassuring >> - definitely do go ahead with visualising the results to see what is >> going on and whether this large number of calls is justified by the data >> (please keep us in the loop) >> - did you see the "plotDEXSeq" function >> - an effect size (fold-change) cutoff might help focus the results on the >> most interesting instances. How does the volcano plot look like? >> >> Hope this helps >> Wolfgang >> >> >> On 29 Aug 2013, at 21:27, James Perkins <jimrperkins@gmail.com> wrote: >> >> > Hi Wolfgang, list >> > >> > Thank you very much for the helpful tip. >> > >> > I did exactly as you suggested, the code is below. I wanted to do it >> using the pasilla package in order to create an instructive example, and so >> that you could potentially recreate the analysis, but strangely sometimes >> the gene read counts from pasillaGenes are lower than the summed exon >> counts from the pasillaExon object. Or at least that's the case according >> to my code! >> > >> > If you (or anyone else who is reading this) can think of a good public >> data set for this please let me know, ideally one that is already in >> bioconductor and fairly easy to manipulate. Otherwise I am happy to submit >> my current dataset to bioc or make available some other way once its >> corresponding paper has been published >> > >> > So, code below. Any comments or suggestions would be appreciated, if >> you think there are ways I could tune DEXSeq for this purpose: >> > >> > To start with I loaded in all the necessary files and made a counts >> object with Gend ID:E001 for intronic and GeneID:E002 for exonic. >> > >> > # c50tx - table of counts using reads mapping to anywhere in gene c50ex >> - reads mapping to exons only >> > intMat <- as.data.frame(c50tx) - as.data.frame(c50ex) >> > exMat <- as.data.frame(c50ex) >> > >> > # Remove genes for which intron count is below 100 >> > intMat.2 <- intMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] >> > exMat <- exMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] >> > intMat <- intMat.2 >> > >> > row.names(intMat) <- paste(row.names(intMat), ":E001", sep="") >> > row.names(exMat) <- paste(row.names(exMat), ":E002", sep="") >> > # Now I combine the different tables in order to get a combined table >> as so: >> > both <- rbind(intMat, exMat) >> > both <- both[order(row.names(both)), ] >> > head(both) >> > >> > # Now I turn this into an exon count set >> > gIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, >> byrow=TRUE)[,1] >> > names(gIDs) <- row.names(both) >> > gIDs <- as.factor(gIDs) >> > eIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, >> byrow=TRUE)[,2] >> > names(eIDs) <- row.names(both) >> > des <- data.frame(condition=c(rep("case",3), rep("control", 3))) >> > row.names(des) <- colnames(both) >> > ecS <- newExonCountSet( >> > countData = both, >> > design = des, >> > geneIDs=gIDs, >> > exonIDs=eIDs) >> > >> > Once I had this I ran normalisation etc. and looked for genes with >> 'differential exon/intron usage': >> > >> > ecS <- estimateSizeFactors(ecS) >> > ecS <- estimateDispersions(ecS) >> > ecS <- fitDispersionFunction(ecS) >> > head(fData(ecS)$dispBeforeSharing) >> > >> > ecS <- testForDEU(ecS) >> > head(fData(ecS)[, c("pvalue", "padjust")]) >> > ecS <- estimatelog2FoldChanges(ecS, averageOutExpression=FALSE) >> > res2 <- DEUresultTable(ecS) >> > sum(ecS$pad >> > >> > >> > So - out of 29516 genes 'tested', ~7000 had > 100 intronic counts for >> at least 3 conditions and ~4000 of these were DEU (FDR<0.1). This seems >> quite a high number of genes, so I wonder if the method is being >> conservative enough. >> > >> > That said, tried swapping the sample labels a few times, which lead to >> 0 genes being called as DEU, which is a good sign I think. >> > >> > Interestingly, comparing intron:exon for case and controls for the >> significant DEU genes shows a 4:1 ratio of genes that with an increased >> intronic expression in case samples, suggesting the experimental >> intervention is leading to increased intronic expression. >> > >> > So, does this approach seem valid? My main question is whether it is a >> little unconservative. I'm currently using ggbio to plot the read >> alignments for the significant genes and looking for interesting looking >> patterns of intronic expression that are consistent across replicates of >> the same class (and that differ between classes). I've found a few things >> that look interesting for follow up. >> > >> > Thanks a lot for your help, >> > >> > Jim >> > >> > On 9 August 2013 16:02, Wolfgang Huber <whuber@embl.de> wrote: >> > Dear James >> > have you already checked the DEXSeq package and the paper >> http://genome.cshlp.org/content/22/10/2008.long >> > This does not apply 1:1 to what you are asking for, but afaIcs the main >> modification would be for you to define "counting bins" (on which the input >> to DEXSeq is computed by read overlap counting) that represent (i) exons >> and (ii) introns and then check for changes in relative intro usage (the >> 'ratios' you mention below). >> > Let me know how it goes >> > Wolfgang. >> > >> > On 9 Aug 2013, at 10:02, James Perkins <j.perkins@ucl.ac.uk> wrote: >> > >> > > Dear list, >> > > >> > > I would like to know if an experimental treatment leads to a >> significant >> > > shift in intronic expression for some genes. >> > > >> > > Imagine I have an experiment with 6 biological replicates of a given >> > > tissue. I believe that the treatment might lead to an increased >> intronic >> > > expression for some genes, unrelated to exonic expression. >> > > >> > > 3 of these receive no treament, they are used as control. The other 3 >> > > receive experimental treatment. >> > > >> > > I then sequence the mRNA from these samples (Illumina, single end >> reads, >> > > ~40 million reads per sample), to obtain 6 fastq files, I align these >> to a >> > > refernce genome and get bam files. >> > > >> > > I was thinking that a fairly easy way to see if some genes show a >> > > consistent increased intronic expression following treatment would be >> to >> > > count intronically aligning reads for each gene (e.g. using >> GenomicRanges) >> > > and use something like DESeq to look for genes showing a significant >> change >> > > in intronic "expression". >> > > >> > > However, the problem is that this might be due to exonic expression, >> due to >> > > premature mRNAs etc., so I might end up finding genes that are >> > > differentially expressed at the exon level, and as a result the >> increased >> > > exon expression has caused increased intronic expression as a by >> product. >> > > Obviously I am not so interested in these genes wrt this method, I >> can find >> > > these using "traditional" DE methods. >> > > >> > > In addition, when I tried profiling intronic regions using reads >> mapping to >> > > introns, (using DESeq) it led to dodgy MA plots, where the 0 FC line >> was >> > > quite far above the minimum mean expression point, i.e. it didn't go >> > > through the middle of the clump of data points (if that makes sense). >> I >> > > wonder if this is due to the size factor calculation being based on >> > > intronic "expression" (well, reads mapping to introns), which is >> generally >> > > much lower than exon expression and therefore being somewhat >> unreliable. >> > > >> > > So I would like to take this exon expression out of the equation, so >> to >> > > speak. >> > > >> > > I thought that one way might be to compare the ratios of exonic to >> intronic >> > > reads between samples, for each gene. >> > > >> > > For example one gene might have 30, 35 and 33 exonically mapping >> reads and >> > > 10,11 and 12 intronically mapping reads for control samples >> > > >> > > For case samples it might have 33, 32 and 34 exonically mapping reads; >> > > 20,21 and 19 intronically mapping reads. >> > > >> > > So we could compare 10/30, 11/35 and 12/33 for control to 20/33, >> 21/32 and >> > > 19/34 for case. >> > > >> > > Does this methodology sound reasonable? It is necessarily based on the >> > > assumption that intronic "expression" due to unspliced RNAs is >> correlated >> > > with exon expression. >> > > >> > > If it sounds reasonable, is there a test that is recommended to >> compare the >> > > ratios in such a way, that takes into account the biological >> replication of >> > > samples? I could do a simple test (chi squared) to compare the >> relative >> > > frequencies, but this wouldn't take into account the replicates. >> > > >> > > I realise this isn't really a specific bioconductor question, but >> hopefully >> > > it might be of interest to some of the list subscribers. >> > > >> > > Many thanks, >> > > >> > > Jim >> > > >> > > -- >> > > James R Perkins, PhD >> > > >> > > [[alternative HTML version deleted]] >> > > >> > > _______________________________________________ >> > > Bioconductor mailing list >> > > Bioconductor@r-project.org >> > > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > >> > >> > >> > >> > >> > -- >> > James R Perkins PhD >> >> > > > -- > James R Perkins PhD > -- James R Perkins PhD [[alternative HTML version deleted]]
Dear James thanks. What you do below makes sense, as far as I can see. What DEXSeq considers is, applied to your situation: number of reads mapping to E001 ------------------------------- number of reads mapping to both (and similarly for E002), and it tries to find significant changes of that ratio along with your treatment. Maybe making the Volcano plot with the difference of the logarithm of these ratios across condition as 'effect size' would also be informative, but I am not sure it has additional info. The plots you sent below look OK to me, I'd indeed now go down the list of genes and introns and see whether you can independently verify it (e.g. through a biological hypothesis). Best wishes Wolfgang Il giorno Sep 26, 2013, alle ore 5:00 pm, James Perkins <jimrperkins at="" gmail.com=""> ha scritto: > Hi again, > > Maybe I should rephrase the last part better. > > We have a fold change for the E001 "exon" i.e. the summed exonic expression > we have a fold change for the E002 "exon" i.e. the summed intronic expression > > However what we are really interested in is the discrepancy between the two, i.e. introns go one way, exons the other, or exons don't change but introns do, or whatever. > > I've made a "sort of" volcano plot - I've subtracted the log2 fold change calculated using intron only counting from the log2FC calculated using exon only counting. > > http://postimg.org/image/o9hi6o4pj/ > > I've plotted this against -log10 p val. You see there is some "volcano" activity although it's quite 1 sides - as I understand it, this is suggesting that more genes are increasing in intronic expression in case samples compared to exonic expression, since the cluster of negative values with low p values suggests that the FC is higher for introns than exons. > > However the "effect size" as I say is pretty small. This is probably in part due to the way FCs are shrunk by DEXSeq, should I use the FC shrinking in this case? > > BTW There are a couple of outliers in the volcano plot, the real plot is like this: > > http://postimg.org/image/r4c7uhf5t/ > > BTW I've not used the plotDEXSeq function because this is exon focussed if I understand correctly, instead I've used the ggbio package. > > Still unsure what criteria to use to say whether a gene is really showing increased intronic expression. Probably I can't I can just go down the list of p-values looking at the patterns of expression and try to say which are interesting! > > Cheers, > > Jim > > > On 24 September 2013 11:17, James Perkins <jimrperkins at="" gmail.com=""> wrote: > Dear Wolfgang, Martin, list > > Sorry for the delay > > Thanks for the explanation re: counting per exons vs. counting per gene wrt multiple transcripts. This doesn't effect my "real data" example where I counted reads mapping to all exons and reads mapping to all introns and classed them as two different "exons" per gene for the sake of runnign DEXSeq. Martin - my approach is very simple so I would say you idea is not quite relevant in this case. However this method could be possibly extended to look for individual introns showing an increase in expression, rather than detecting genes who show a general relative increase in intronic expression summed accross exons/introns. > > Wolfgang - I agree with your points re: plotting the data - this find some nice instances of changes in 'intronic expression' in the case vs. control samples. > > Re: this point: > > > - an effect size (fold-change) cutoff might help focus the results on the most interesting instances. How does the volcano plot look like? > > > I like the idea of the effect size cutoff, however what do you mean by fold change (for plotting the volcano) do you mean the change in ratios of exon to intronic, or what should I plot? and vs. what? p-value I assume but just to be clear. > > Thanks a lot again for your help. > > Jim > > > > On 30 August 2013 15:35, Wolfgang Huber <whuber at="" embl.de=""> wrote: > Dear Jim, Martin & list > > Martin's proposal of using splice graphs is definitely the way to go, it is conceptually appealing, would have more power, and results are potentially more readily interpretable. That said, the current state of what we have in DEXSeq does the testing exon-by-exon, and lack of power does not seem to be James' problem. Regarding James' questions: > > 1. It is fine, and to be expected, that the sum of per-exon counts is larger than the per-gene count. This is OK because a read that covers multiple exons is counted multiple times, for each of the overlaps. This makes sense for the testing setup of DEXSeq. It tests, marginally, exon by exon, whether its relative usage among all transcripts from the locus is condition-dependent. Since the tests are marginal, correlations between counts of neighbouring exons are not a problem. Each overlap of a read is a postiive piece of evidence for it being expressed. > > 2. Regarding your high number of results, here are some considerations: > - the fact that swapping sample labels leads to greatly reduced sets of hits is reassuring > - definitely do go ahead with visualising the results to see what is going on and whether this large number of calls is justified by the data (please keep us in the loop) > - did you see the "plotDEXSeq" function > - an effect size (fold-change) cutoff might help focus the results on the most interesting instances. How does the volcano plot look like? > > Hope this helps > Wolfgang > > > On 29 Aug 2013, at 21:27, James Perkins <jimrperkins at="" gmail.com=""> wrote: > > > Hi Wolfgang, list > > > > Thank you very much for the helpful tip. > > > > I did exactly as you suggested, the code is below. I wanted to do it using the pasilla package in order to create an instructive example, and so that you could potentially recreate the analysis, but strangely sometimes the gene read counts from pasillaGenes are lower than the summed exon counts from the pasillaExon object. Or at least that's the case according to my code! > > > > If you (or anyone else who is reading this) can think of a good public data set for this please let me know, ideally one that is already in bioconductor and fairly easy to manipulate. Otherwise I am happy to submit my current dataset to bioc or make available some other way once its corresponding paper has been published > > > > So, code below. Any comments or suggestions would be appreciated, if you think there are ways I could tune DEXSeq for this purpose: > > > > To start with I loaded in all the necessary files and made a counts object with Gend ID:E001 for intronic and GeneID:E002 for exonic. > > > > # c50tx - table of counts using reads mapping to anywhere in gene c50ex - reads mapping to exons only > > intMat <- as.data.frame(c50tx) - as.data.frame(c50ex) > > exMat <- as.data.frame(c50ex) > > > > # Remove genes for which intron count is below 100 > > intMat.2 <- intMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] > > exMat <- exMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] > > intMat <- intMat.2 > > > > row.names(intMat) <- paste(row.names(intMat), ":E001", sep="") > > row.names(exMat) <- paste(row.names(exMat), ":E002", sep="") > > # Now I combine the different tables in order to get a combined table as so: > > both <- rbind(intMat, exMat) > > both <- both[order(row.names(both)), ] > > head(both) > > > > # Now I turn this into an exon count set > > gIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, byrow=TRUE)[,1] > > names(gIDs) <- row.names(both) > > gIDs <- as.factor(gIDs) > > eIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, byrow=TRUE)[,2] > > names(eIDs) <- row.names(both) > > des <- data.frame(condition=c(rep("case",3), rep("control", 3))) > > row.names(des) <- colnames(both) > > ecS <- newExonCountSet( > > countData = both, > > design = des, > > geneIDs=gIDs, > > exonIDs=eIDs) > > > > Once I had this I ran normalisation etc. and looked for genes with 'differential exon/intron usage': > > > > ecS <- estimateSizeFactors(ecS) > > ecS <- estimateDispersions(ecS) > > ecS <- fitDispersionFunction(ecS) > > head(fData(ecS)$dispBeforeSharing) > > > > ecS <- testForDEU(ecS) > > head(fData(ecS)[, c("pvalue", "padjust")]) > > ecS <- estimatelog2FoldChanges(ecS, averageOutExpression=FALSE) > > res2 <- DEUresultTable(ecS) > > sum(ecS$pad > > > > > > So - out of 29516 genes 'tested', ~7000 had > 100 intronic counts for at least 3 conditions and ~4000 of these were DEU (FDR<0.1). This seems quite a high number of genes, so I wonder if the method is being conservative enough. > > > > That said, tried swapping the sample labels a few times, which lead to 0 genes being called as DEU, which is a good sign I think. > > > > Interestingly, comparing intron:exon for case and controls for the significant DEU genes shows a 4:1 ratio of genes that with an increased intronic expression in case samples, suggesting the experimental intervention is leading to increased intronic expression. > > > > So, does this approach seem valid? My main question is whether it is a little unconservative. I'm currently using ggbio to plot the read alignments for the significant genes and looking for interesting looking patterns of intronic expression that are consistent across replicates of the same class (and that differ between classes). I've found a few things that look interesting for follow up. > > > > Thanks a lot for your help, > > > > Jim > > > > On 9 August 2013 16:02, Wolfgang Huber <whuber at="" embl.de=""> wrote: > > Dear James > > have you already checked the DEXSeq package and the paper http://genome.cshlp.org/content/22/10/2008.long > > This does not apply 1:1 to what you are asking for, but afaIcs the main modification would be for you to define "counting bins" (on which the input to DEXSeq is computed by read overlap counting) that represent (i) exons and (ii) introns and then check for changes in relative intro usage (the 'ratios' you mention below). > > Let me know how it goes > > Wolfgang. > > > > On 9 Aug 2013, at 10:02, James Perkins <j.perkins at="" ucl.ac.uk=""> wrote: > > > > > Dear list, > > > > > > I would like to know if an experimental treatment leads to a significant > > > shift in intronic expression for some genes. > > > > > > Imagine I have an experiment with 6 biological replicates of a given > > > tissue. I believe that the treatment might lead to an increased intronic > > > expression for some genes, unrelated to exonic expression. > > > > > > 3 of these receive no treament, they are used as control. The other 3 > > > receive experimental treatment. > > > > > > I then sequence the mRNA from these samples (Illumina, single end reads, > > > ~40 million reads per sample), to obtain 6 fastq files, I align these to a > > > refernce genome and get bam files. > > > > > > I was thinking that a fairly easy way to see if some genes show a > > > consistent increased intronic expression following treatment would be to > > > count intronically aligning reads for each gene (e.g. using GenomicRanges) > > > and use something like DESeq to look for genes showing a significant change > > > in intronic "expression". > > > > > > However, the problem is that this might be due to exonic expression, due to > > > premature mRNAs etc., so I might end up finding genes that are > > > differentially expressed at the exon level, and as a result the increased > > > exon expression has caused increased intronic expression as a by product. > > > Obviously I am not so interested in these genes wrt this method, I can find > > > these using "traditional" DE methods. > > > > > > In addition, when I tried profiling intronic regions using reads mapping to > > > introns, (using DESeq) it led to dodgy MA plots, where the 0 FC line was > > > quite far above the minimum mean expression point, i.e. it didn't go > > > through the middle of the clump of data points (if that makes sense). I > > > wonder if this is due to the size factor calculation being based on > > > intronic "expression" (well, reads mapping to introns), which is generally > > > much lower than exon expression and therefore being somewhat unreliable. > > > > > > So I would like to take this exon expression out of the equation, so to > > > speak. > > > > > > I thought that one way might be to compare the ratios of exonic to intronic > > > reads between samples, for each gene. > > > > > > For example one gene might have 30, 35 and 33 exonically mapping reads and > > > 10,11 and 12 intronically mapping reads for control samples > > > > > > For case samples it might have 33, 32 and 34 exonically mapping reads; > > > 20,21 and 19 intronically mapping reads. > > > > > > So we could compare 10/30, 11/35 and 12/33 for control to 20/33, 21/32 and > > > 19/34 for case. > > > > > > Does this methodology sound reasonable? It is necessarily based on the > > > assumption that intronic "expression" due to unspliced RNAs is correlated > > > with exon expression. > > > > > > If it sounds reasonable, is there a test that is recommended to compare the > > > ratios in such a way, that takes into account the biological replication of > > > samples? I could do a simple test (chi squared) to compare the relative > > > frequencies, but this wouldn't take into account the replicates. > > > > > > I realise this isn't really a specific bioconductor question, but hopefully > > > it might be of interest to some of the list subscribers. > > > > > > Many thanks, > > > > > > Jim > > > > > > -- > > > James R Perkins, PhD > > > > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > -- > > James R Perkins PhD > > > > > -- > James R Perkins PhD > > > > -- > James R Perkins PhD
Hi Wolfgang, Thanks a lot for all your help on this. One final thing I noticed, which may be of value to anyone following this problem: I noticed that when I use the raw FCs instead of the shrunken FCs to make the volcano plot, the pattern is much clearer: http://postimg.org/image/4ri7pgdht/ Please ignore the x axis label. I also notice that the negative clump remains, which is suggestive that the majority of "changing" genes show generally an increased expression in intronic regions compared to exonic regions. Wrt your reply, we am now going through individual examples, there are some potentially interesting patterns of intersting patterns of expression. I am also using quite large cut offs e.g. at least 200 reads align to intronic regions for at least n/2 samples (where n is the total number of samples). This is quite important given the length of intronic regions and seems to lead to more interesting results when going through the list of top genes. Cheers again for your help and guidance with thtis, Jim On 29 September 2013 10:30, Wolfgang Huber <whuber@embl.de> wrote: > Dear James > > thanks. What you do below makes sense, as far as I can see. What DEXSeq > considers is, applied to your situation: > > number of reads mapping to E001 > ------------------------------- > number of reads mapping to both > > (and similarly for E002), and it tries to find significant changes of that > ratio along with your treatment. Maybe making the Volcano plot with the > difference of the logarithm of these ratios across condition as 'effect > size' would also be informative, but I am not sure it has additional info. > The plots you sent below look OK to me, I'd indeed now go down the list of > genes and introns and see whether you can independently verify it (e.g. > through a biological hypothesis). > > Best wishes > Wolfgang > > > > > > Il giorno Sep 26, 2013, alle ore 5:00 pm, James Perkins < > jimrperkins@gmail.com> ha scritto: > > > Hi again, > > > > Maybe I should rephrase the last part better. > > > > We have a fold change for the E001 "exon" i.e. the summed exonic > expression > > we have a fold change for the E002 "exon" i.e. the summed intronic > expression > > > > However what we are really interested in is the discrepancy between the > two, i.e. introns go one way, exons the other, or exons don't change but > introns do, or whatever. > > > > I've made a "sort of" volcano plot - I've subtracted the log2 fold > change calculated using intron only counting from the log2FC calculated > using exon only counting. > > > > http://postimg.org/image/o9hi6o4pj/ > > > > I've plotted this against -log10 p val. You see there is some "volcano" > activity although it's quite 1 sides - as I understand it, this is > suggesting that more genes are increasing in intronic expression in case > samples compared to exonic expression, since the cluster of negative values > with low p values suggests that the FC is higher for introns than exons. > > > > However the "effect size" as I say is pretty small. This is probably in > part due to the way FCs are shrunk by DEXSeq, should I use the FC shrinking > in this case? > > > > BTW There are a couple of outliers in the volcano plot, the real plot is > like this: > > > > http://postimg.org/image/r4c7uhf5t/ > > > > BTW I've not used the plotDEXSeq function because this is exon focussed > if I understand correctly, instead I've used the ggbio package. > > > > Still unsure what criteria to use to say whether a gene is really > showing increased intronic expression. Probably I can't I can just go down > the list of p-values looking at the patterns of expression and try to say > which are interesting! > > > > Cheers, > > > > Jim > > > > > > On 24 September 2013 11:17, James Perkins <jimrperkins@gmail.com> wrote: > > Dear Wolfgang, Martin, list > > > > Sorry for the delay > > > > Thanks for the explanation re: counting per exons vs. counting per gene > wrt multiple transcripts. This doesn't effect my "real data" example where > I counted reads mapping to all exons and reads mapping to all introns and > classed them as two different "exons" per gene for the sake of runnign > DEXSeq. Martin - my approach is very simple so I would say you idea is not > quite relevant in this case. However this method could be possibly extended > to look for individual introns showing an increase in expression, rather > than detecting genes who show a general relative increase in intronic > expression summed accross exons/introns. > > > > Wolfgang - I agree with your points re: plotting the data - this find > some nice instances of changes in 'intronic expression' in the case vs. > control samples. > > > > Re: this point: > > > > > > - an effect size (fold-change) cutoff might help focus the results on > the most interesting instances. How does the volcano plot look like? > > > > > > I like the idea of the effect size cutoff, however what do you mean by > fold change (for plotting the volcano) do you mean the change in ratios of > exon to intronic, or what should I plot? and vs. what? p-value I assume but > just to be clear. > > > > Thanks a lot again for your help. > > > > Jim > > > > > > > > On 30 August 2013 15:35, Wolfgang Huber <whuber@embl.de> wrote: > > Dear Jim, Martin & list > > > > Martin's proposal of using splice graphs is definitely the way to go, it > is conceptually appealing, would have more power, and results are > potentially more readily interpretable. That said, the current state of > what we have in DEXSeq does the testing exon-by-exon, and lack of power > does not seem to be James' problem. Regarding James' questions: > > > > 1. It is fine, and to be expected, that the sum of per-exon counts is > larger than the per-gene count. This is OK because a read that covers > multiple exons is counted multiple times, for each of the overlaps. This > makes sense for the testing setup of DEXSeq. It tests, marginally, exon by > exon, whether its relative usage among all transcripts from the locus is > condition-dependent. Since the tests are marginal, correlations between > counts of neighbouring exons are not a problem. Each overlap of a read is a > postiive piece of evidence for it being expressed. > > > > 2. Regarding your high number of results, here are some considerations: > > - the fact that swapping sample labels leads to greatly reduced sets of > hits is reassuring > > - definitely do go ahead with visualising the results to see what is > going on and whether this large number of calls is justified by the data > (please keep us in the loop) > > - did you see the "plotDEXSeq" function > > - an effect size (fold-change) cutoff might help focus the results on > the most interesting instances. How does the volcano plot look like? > > > > Hope this helps > > Wolfgang > > > > > > On 29 Aug 2013, at 21:27, James Perkins <jimrperkins@gmail.com> wrote: > > > > > Hi Wolfgang, list > > > > > > Thank you very much for the helpful tip. > > > > > > I did exactly as you suggested, the code is below. I wanted to do it > using the pasilla package in order to create an instructive example, and so > that you could potentially recreate the analysis, but strangely sometimes > the gene read counts from pasillaGenes are lower than the summed exon > counts from the pasillaExon object. Or at least that's the case according > to my code! > > > > > > If you (or anyone else who is reading this) can think of a good public > data set for this please let me know, ideally one that is already in > bioconductor and fairly easy to manipulate. Otherwise I am happy to submit > my current dataset to bioc or make available some other way once its > corresponding paper has been published > > > > > > So, code below. Any comments or suggestions would be appreciated, if > you think there are ways I could tune DEXSeq for this purpose: > > > > > > To start with I loaded in all the necessary files and made a counts > object with Gend ID:E001 for intronic and GeneID:E002 for exonic. > > > > > > # c50tx - table of counts using reads mapping to anywhere in gene > c50ex - reads mapping to exons only > > > intMat <- as.data.frame(c50tx) - as.data.frame(c50ex) > > > exMat <- as.data.frame(c50ex) > > > > > > # Remove genes for which intron count is below 100 > > > intMat.2 <- intMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] > > > exMat <- exMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] > > > intMat <- intMat.2 > > > > > > row.names(intMat) <- paste(row.names(intMat), ":E001", sep="") > > > row.names(exMat) <- paste(row.names(exMat), ":E002", sep="") > > > # Now I combine the different tables in order to get a combined table > as so: > > > both <- rbind(intMat, exMat) > > > both <- both[order(row.names(both)), ] > > > head(both) > > > > > > # Now I turn this into an exon count set > > > gIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, > byrow=TRUE)[,1] > > > names(gIDs) <- row.names(both) > > > gIDs <- as.factor(gIDs) > > > eIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, > byrow=TRUE)[,2] > > > names(eIDs) <- row.names(both) > > > des <- data.frame(condition=c(rep("case",3), rep("control", 3))) > > > row.names(des) <- colnames(both) > > > ecS <- newExonCountSet( > > > countData = both, > > > design = des, > > > geneIDs=gIDs, > > > exonIDs=eIDs) > > > > > > Once I had this I ran normalisation etc. and looked for genes with > 'differential exon/intron usage': > > > > > > ecS <- estimateSizeFactors(ecS) > > > ecS <- estimateDispersions(ecS) > > > ecS <- fitDispersionFunction(ecS) > > > head(fData(ecS)$dispBeforeSharing) > > > > > > ecS <- testForDEU(ecS) > > > head(fData(ecS)[, c("pvalue", "padjust")]) > > > ecS <- estimatelog2FoldChanges(ecS, averageOutExpression=FALSE) > > > res2 <- DEUresultTable(ecS) > > > sum(ecS$pad > > > > > > > > > So - out of 29516 genes 'tested', ~7000 had > 100 intronic counts for > at least 3 conditions and ~4000 of these were DEU (FDR<0.1). This seems > quite a high number of genes, so I wonder if the method is being > conservative enough. > > > > > > That said, tried swapping the sample labels a few times, which lead to > 0 genes being called as DEU, which is a good sign I think. > > > > > > Interestingly, comparing intron:exon for case and controls for the > significant DEU genes shows a 4:1 ratio of genes that with an increased > intronic expression in case samples, suggesting the experimental > intervention is leading to increased intronic expression. > > > > > > So, does this approach seem valid? My main question is whether it is a > little unconservative. I'm currently using ggbio to plot the read > alignments for the significant genes and looking for interesting looking > patterns of intronic expression that are consistent across replicates of > the same class (and that differ between classes). I've found a few things > that look interesting for follow up. > > > > > > Thanks a lot for your help, > > > > > > Jim > > > > > > On 9 August 2013 16:02, Wolfgang Huber <whuber@embl.de> wrote: > > > Dear James > > > have you already checked the DEXSeq package and the paper > http://genome.cshlp.org/content/22/10/2008.long > > > This does not apply 1:1 to what you are asking for, but afaIcs the > main modification would be for you to define "counting bins" (on which the > input to DEXSeq is computed by read overlap counting) that represent (i) > exons and (ii) introns and then check for changes in relative intro usage > (the 'ratios' you mention below). > > > Let me know how it goes > > > Wolfgang. > > > > > > On 9 Aug 2013, at 10:02, James Perkins <j.perkins@ucl.ac.uk> wrote: > > > > > > > Dear list, > > > > > > > > I would like to know if an experimental treatment leads to a > significant > > > > shift in intronic expression for some genes. > > > > > > > > Imagine I have an experiment with 6 biological replicates of a given > > > > tissue. I believe that the treatment might lead to an increased > intronic > > > > expression for some genes, unrelated to exonic expression. > > > > > > > > 3 of these receive no treament, they are used as control. The other 3 > > > > receive experimental treatment. > > > > > > > > I then sequence the mRNA from these samples (Illumina, single end > reads, > > > > ~40 million reads per sample), to obtain 6 fastq files, I align > these to a > > > > refernce genome and get bam files. > > > > > > > > I was thinking that a fairly easy way to see if some genes show a > > > > consistent increased intronic expression following treatment would > be to > > > > count intronically aligning reads for each gene (e.g. using > GenomicRanges) > > > > and use something like DESeq to look for genes showing a significant > change > > > > in intronic "expression". > > > > > > > > However, the problem is that this might be due to exonic expression, > due to > > > > premature mRNAs etc., so I might end up finding genes that are > > > > differentially expressed at the exon level, and as a result the > increased > > > > exon expression has caused increased intronic expression as a by > product. > > > > Obviously I am not so interested in these genes wrt this method, I > can find > > > > these using "traditional" DE methods. > > > > > > > > In addition, when I tried profiling intronic regions using reads > mapping to > > > > introns, (using DESeq) it led to dodgy MA plots, where the 0 FC line > was > > > > quite far above the minimum mean expression point, i.e. it didn't go > > > > through the middle of the clump of data points (if that makes > sense). I > > > > wonder if this is due to the size factor calculation being based on > > > > intronic "expression" (well, reads mapping to introns), which is > generally > > > > much lower than exon expression and therefore being somewhat > unreliable. > > > > > > > > So I would like to take this exon expression out of the equation, > so to > > > > speak. > > > > > > > > I thought that one way might be to compare the ratios of exonic to > intronic > > > > reads between samples, for each gene. > > > > > > > > For example one gene might have 30, 35 and 33 exonically mapping > reads and > > > > 10,11 and 12 intronically mapping reads for control samples > > > > > > > > For case samples it might have 33, 32 and 34 exonically mapping > reads; > > > > 20,21 and 19 intronically mapping reads. > > > > > > > > So we could compare 10/30, 11/35 and 12/33 for control to 20/33, > 21/32 and > > > > 19/34 for case. > > > > > > > > Does this methodology sound reasonable? It is necessarily based on > the > > > > assumption that intronic "expression" due to unspliced RNAs is > correlated > > > > with exon expression. > > > > > > > > If it sounds reasonable, is there a test that is recommended to > compare the > > > > ratios in such a way, that takes into account the biological > replication of > > > > samples? I could do a simple test (chi squared) to compare the > relative > > > > frequencies, but this wouldn't take into account the replicates. > > > > > > > > I realise this isn't really a specific bioconductor question, but > hopefully > > > > it might be of interest to some of the list subscribers. > > > > > > > > Many thanks, > > > > > > > > Jim > > > > > > > > -- > > > > James R Perkins, PhD > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > _______________________________________________ > > > > Bioconductor mailing list > > > > Bioconductor@r-project.org > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > > > > > > > > -- > > > James R Perkins PhD > > > > > > > > > > -- > > James R Perkins PhD > > > > > > > > -- > > James R Perkins PhD > > -- James R Perkins PhD [[alternative HTML version deleted]]
Hi James, Before saying anything... one question, what do you mean with "shrunken FCs" and with "raw fold changes"? It is a bit confusing, since DEXSeq does not do any shrinkage in the fold changes (so far). The fold change from the "estimatelog2FoldChanges" function calculates the interaction coefficient between your condition of interest and the exons (using a glm fit) to estimate the "fitted splicing", which already should be independent from overall gene expression effects. This is what you see plotted in the plotDEXSeq function! After a variance stabilizing transformation and a log2 transformation are used and this is what estimatelog2FoldChanges adds to the fData. Alejandro > Hi Wolfgang, > > Thanks a lot for all your help on this. One final thing I noticed, which > may be of value to anyone following this problem: > > I noticed that when I use the raw FCs instead of the shrunken FCs to make > the volcano plot, the pattern is much clearer: > http://postimg.org/image/4ri7pgdht/ > > Please ignore the x axis label. > > I also notice that the negative clump remains, which is suggestive that the > majority of "changing" genes show generally an increased expression in > intronic regions compared to exonic regions. > > Wrt your reply, we am now going through individual examples, there are some > potentially interesting patterns of intersting patterns of expression. I am > also using quite large cut offs e.g. at least 200 reads align to intronic > regions for at least n/2 samples (where n is the total number of samples). > This is quite important given the length of intronic regions and seems to > lead to more interesting results when going through the list of top genes. > > Cheers again for your help and guidance with thtis, > > Jim > > > On 29 September 2013 10:30, Wolfgang Huber <whuber at="" embl.de=""> wrote: > >> Dear James >> >> thanks. What you do below makes sense, as far as I can see. What DEXSeq >> considers is, applied to your situation: >> >> number of reads mapping to E001 >> ------------------------------- >> number of reads mapping to both >> >> (and similarly for E002), and it tries to find significant changes of that >> ratio along with your treatment. Maybe making the Volcano plot with the >> difference of the logarithm of these ratios across condition as 'effect >> size' would also be informative, but I am not sure it has additional info. >> The plots you sent below look OK to me, I'd indeed now go down the list of >> genes and introns and see whether you can independently verify it (e.g. >> through a biological hypothesis). >> >> Best wishes >> Wolfgang >> >> >> >> >> >> Il giorno Sep 26, 2013, alle ore 5:00 pm, James Perkins < >> jimrperkins at gmail.com> ha scritto: >> >>> Hi again, >>> >>> Maybe I should rephrase the last part better. >>> >>> We have a fold change for the E001 "exon" i.e. the summed exonic >> expression >>> we have a fold change for the E002 "exon" i.e. the summed intronic >> expression >>> However what we are really interested in is the discrepancy between the >> two, i.e. introns go one way, exons the other, or exons don't change but >> introns do, or whatever. >>> I've made a "sort of" volcano plot - I've subtracted the log2 fold >> change calculated using intron only counting from the log2FC calculated >> using exon only counting. >>> http://postimg.org/image/o9hi6o4pj/ >>> >>> I've plotted this against -log10 p val. You see there is some "volcano" >> activity although it's quite 1 sides - as I understand it, this is >> suggesting that more genes are increasing in intronic expression in case >> samples compared to exonic expression, since the cluster of negative values >> with low p values suggests that the FC is higher for introns than exons. >>> However the "effect size" as I say is pretty small. This is probably in >> part due to the way FCs are shrunk by DEXSeq, should I use the FC shrinking >> in this case? >>> BTW There are a couple of outliers in the volcano plot, the real plot is >> like this: >>> http://postimg.org/image/r4c7uhf5t/ >>> >>> BTW I've not used the plotDEXSeq function because this is exon focussed >> if I understand correctly, instead I've used the ggbio package. >>> Still unsure what criteria to use to say whether a gene is really >> showing increased intronic expression. Probably I can't I can just go down >> the list of p-values looking at the patterns of expression and try to say >> which are interesting! >>> Cheers, >>> >>> Jim >>> >>> >>> On 24 September 2013 11:17, James Perkins <jimrperkins at="" gmail.com=""> wrote: >>> Dear Wolfgang, Martin, list >>> >>> Sorry for the delay >>> >>> Thanks for the explanation re: counting per exons vs. counting per gene >> wrt multiple transcripts. This doesn't effect my "real data" example where >> I counted reads mapping to all exons and reads mapping to all introns and >> classed them as two different "exons" per gene for the sake of runnign >> DEXSeq. Martin - my approach is very simple so I would say you idea is not >> quite relevant in this case. However this method could be possibly extended >> to look for individual introns showing an increase in expression, rather >> than detecting genes who show a general relative increase in intronic >> expression summed accross exons/introns. >>> Wolfgang - I agree with your points re: plotting the data - this find >> some nice instances of changes in 'intronic expression' in the case vs. >> control samples. >>> Re: this point: >>> >>> >>> - an effect size (fold-change) cutoff might help focus the results on >> the most interesting instances. How does the volcano plot look like? >>> >>> I like the idea of the effect size cutoff, however what do you mean by >> fold change (for plotting the volcano) do you mean the change in ratios of >> exon to intronic, or what should I plot? and vs. what? p-value I assume but >> just to be clear. >>> Thanks a lot again for your help. >>> >>> Jim >>> >>> >>> >>> On 30 August 2013 15:35, Wolfgang Huber <whuber at="" embl.de=""> wrote: >>> Dear Jim, Martin & list >>> >>> Martin's proposal of using splice graphs is definitely the way to go, it >> is conceptually appealing, would have more power, and results are >> potentially more readily interpretable. That said, the current state of >> what we have in DEXSeq does the testing exon-by-exon, and lack of power >> does not seem to be James' problem. Regarding James' questions: >>> 1. It is fine, and to be expected, that the sum of per-exon counts is >> larger than the per-gene count. This is OK because a read that covers >> multiple exons is counted multiple times, for each of the overlaps. This >> makes sense for the testing setup of DEXSeq. It tests, marginally, exon by >> exon, whether its relative usage among all transcripts from the locus is >> condition-dependent. Since the tests are marginal, correlations between >> counts of neighbouring exons are not a problem. Each overlap of a read is a >> postiive piece of evidence for it being expressed. >>> 2. Regarding your high number of results, here are some considerations: >>> - the fact that swapping sample labels leads to greatly reduced sets of >> hits is reassuring >>> - definitely do go ahead with visualising the results to see what is >> going on and whether this large number of calls is justified by the data >> (please keep us in the loop) >>> - did you see the "plotDEXSeq" function >>> - an effect size (fold-change) cutoff might help focus the results on >> the most interesting instances. How does the volcano plot look like? >>> Hope this helps >>> Wolfgang >>> >>> >>> On 29 Aug 2013, at 21:27, James Perkins <jimrperkins at="" gmail.com=""> wrote: >>> >>>> Hi Wolfgang, list >>>> >>>> Thank you very much for the helpful tip. >>>> >>>> I did exactly as you suggested, the code is below. I wanted to do it >> using the pasilla package in order to create an instructive example, and so >> that you could potentially recreate the analysis, but strangely sometimes >> the gene read counts from pasillaGenes are lower than the summed exon >> counts from the pasillaExon object. Or at least that's the case according >> to my code! >>>> If you (or anyone else who is reading this) can think of a good public >> data set for this please let me know, ideally one that is already in >> bioconductor and fairly easy to manipulate. Otherwise I am happy to submit >> my current dataset to bioc or make available some other way once its >> corresponding paper has been published >>>> So, code below. Any comments or suggestions would be appreciated, if >> you think there are ways I could tune DEXSeq for this purpose: >>>> To start with I loaded in all the necessary files and made a counts >> object with Gend ID:E001 for intronic and GeneID:E002 for exonic. >>>> # c50tx - table of counts using reads mapping to anywhere in gene >> c50ex - reads mapping to exons only >>>> intMat <- as.data.frame(c50tx) - as.data.frame(c50ex) >>>> exMat <- as.data.frame(c50ex) >>>> >>>> # Remove genes for which intron count is below 100 >>>> intMat.2 <- intMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] >>>> exMat <- exMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3, ] >>>> intMat <- intMat.2 >>>> >>>> row.names(intMat) <- paste(row.names(intMat), ":E001", sep="") >>>> row.names(exMat) <- paste(row.names(exMat), ":E002", sep="") >>>> # Now I combine the different tables in order to get a combined table >> as so: >>>> both <- rbind(intMat, exMat) >>>> both <- both[order(row.names(both)), ] >>>> head(both) >>>> >>>> # Now I turn this into an exon count set >>>> gIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, >> byrow=TRUE)[,1] >>>> names(gIDs) <- row.names(both) >>>> gIDs <- as.factor(gIDs) >>>> eIDs <- matrix(unlist(strsplit(row.names(both), ":")), ncol=2, >> byrow=TRUE)[,2] >>>> names(eIDs) <- row.names(both) >>>> des <- data.frame(condition=c(rep("case",3), rep("control", 3))) >>>> row.names(des) <- colnames(both) >>>> ecS <- newExonCountSet( >>>> countData = both, >>>> design = des, >>>> geneIDs=gIDs, >>>> exonIDs=eIDs) >>>> >>>> Once I had this I ran normalisation etc. and looked for genes with >> 'differential exon/intron usage': >>>> ecS <- estimateSizeFactors(ecS) >>>> ecS <- estimateDispersions(ecS) >>>> ecS <- fitDispersionFunction(ecS) >>>> head(fData(ecS)$dispBeforeSharing) >>>> >>>> ecS <- testForDEU(ecS) >>>> head(fData(ecS)[, c("pvalue", "padjust")]) >>>> ecS <- estimatelog2FoldChanges(ecS, averageOutExpression=FALSE) >>>> res2 <- DEUresultTable(ecS) >>>> sum(ecS$pad >>>> >>>> >>>> So - out of 29516 genes 'tested', ~7000 had > 100 intronic counts for >> at least 3 conditions and ~4000 of these were DEU (FDR<0.1). This seems >> quite a high number of genes, so I wonder if the method is being >> conservative enough. >>>> That said, tried swapping the sample labels a few times, which lead to >> 0 genes being called as DEU, which is a good sign I think. >>>> Interestingly, comparing intron:exon for case and controls for the >> significant DEU genes shows a 4:1 ratio of genes that with an increased >> intronic expression in case samples, suggesting the experimental >> intervention is leading to increased intronic expression. >>>> So, does this approach seem valid? My main question is whether it is a >> little unconservative. I'm currently using ggbio to plot the read >> alignments for the significant genes and looking for interesting looking >> patterns of intronic expression that are consistent across replicates of >> the same class (and that differ between classes). I've found a few things >> that look interesting for follow up. >>>> Thanks a lot for your help, >>>> >>>> Jim >>>> >>>> On 9 August 2013 16:02, Wolfgang Huber <whuber at="" embl.de=""> wrote: >>>> Dear James >>>> have you already checked the DEXSeq package and the paper >> http://genome.cshlp.org/content/22/10/2008.long >>>> This does not apply 1:1 to what you are asking for, but afaIcs the >> main modification would be for you to define "counting bins" (on which the >> input to DEXSeq is computed by read overlap counting) that represent (i) >> exons and (ii) introns and then check for changes in relative intro usage >> (the 'ratios' you mention below). >>>> Let me know how it goes >>>> Wolfgang. >>>> >>>> On 9 Aug 2013, at 10:02, James Perkins <j.perkins at="" ucl.ac.uk=""> wrote: >>>> >>>>> Dear list, >>>>> >>>>> I would like to know if an experimental treatment leads to a >> significant >>>>> shift in intronic expression for some genes. >>>>> >>>>> Imagine I have an experiment with 6 biological replicates of a given >>>>> tissue. I believe that the treatment might lead to an increased >> intronic >>>>> expression for some genes, unrelated to exonic expression. >>>>> >>>>> 3 of these receive no treament, they are used as control. The other 3 >>>>> receive experimental treatment. >>>>> >>>>> I then sequence the mRNA from these samples (Illumina, single end >> reads, >>>>> ~40 million reads per sample), to obtain 6 fastq files, I align >> these to a >>>>> refernce genome and get bam files. >>>>> >>>>> I was thinking that a fairly easy way to see if some genes show a >>>>> consistent increased intronic expression following treatment would >> be to >>>>> count intronically aligning reads for each gene (e.g. using >> GenomicRanges) >>>>> and use something like DESeq to look for genes showing a significant >> change >>>>> in intronic "expression". >>>>> >>>>> However, the problem is that this might be due to exonic expression, >> due to >>>>> premature mRNAs etc., so I might end up finding genes that are >>>>> differentially expressed at the exon level, and as a result the >> increased >>>>> exon expression has caused increased intronic expression as a by >> product. >>>>> Obviously I am not so interested in these genes wrt this method, I >> can find >>>>> these using "traditional" DE methods. >>>>> >>>>> In addition, when I tried profiling intronic regions using reads >> mapping to >>>>> introns, (using DESeq) it led to dodgy MA plots, where the 0 FC line >> was >>>>> quite far above the minimum mean expression point, i.e. it didn't go >>>>> through the middle of the clump of data points (if that makes >> sense). I >>>>> wonder if this is due to the size factor calculation being based on >>>>> intronic "expression" (well, reads mapping to introns), which is >> generally >>>>> much lower than exon expression and therefore being somewhat >> unreliable. >>>>> So I would like to take this exon expression out of the equation, >> so to >>>>> speak. >>>>> >>>>> I thought that one way might be to compare the ratios of exonic to >> intronic >>>>> reads between samples, for each gene. >>>>> >>>>> For example one gene might have 30, 35 and 33 exonically mapping >> reads and >>>>> 10,11 and 12 intronically mapping reads for control samples >>>>> >>>>> For case samples it might have 33, 32 and 34 exonically mapping >> reads; >>>>> 20,21 and 19 intronically mapping reads. >>>>> >>>>> So we could compare 10/30, 11/35 and 12/33 for control to 20/33, >> 21/32 and >>>>> 19/34 for case. >>>>> >>>>> Does this methodology sound reasonable? It is necessarily based on >> the >>>>> assumption that intronic "expression" due to unspliced RNAs is >> correlated >>>>> with exon expression. >>>>> >>>>> If it sounds reasonable, is there a test that is recommended to >> compare the >>>>> ratios in such a way, that takes into account the biological >> replication of >>>>> samples? I could do a simple test (chi squared) to compare the >> relative >>>>> frequencies, but this wouldn't take into account the replicates. >>>>> >>>>> I realise this isn't really a specific bioconductor question, but >> hopefully >>>>> it might be of interest to some of the list subscribers. >>>>> >>>>> Many thanks, >>>>> >>>>> Jim >>>>> >>>>> -- >>>>> James R Perkins, PhD >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> James R Perkins PhD >>> >>> >>> >>> -- >>> James R Perkins PhD >>> >>> >>> >>> -- >>> James R Perkins PhD >> >