Question

Differential expression analysis of CAGE data

0

Entering edit mode

Makis Motakis ▴ 30

@makis-motakis-5507

Last seen 8.7 years ago

Japan

Hello list, My name is Makis, I am a bioinformatics post-doc fellow and, recently, I started working on a CAGE project. I am new on the field and trying to figure out some things, so I would much appreciate any comments on my problem. I am sorry if my question sounds a bit confusing... I have a Treatment vs Control (independent samples) experimental design with 3 replicates in each condition. Usually what people do is: 1. Use samtools to keep the reads that exhibit a high mapping quality score (here "-q 20") and convert the .bam files (containing the quality data) into bed format: samtools view -q 20 -F 768 -u input.bam | ~/bin/bamToBed -i stdin | gzip -c > output.bed.gz 2. summarizing all count data (generate transcription start sites clusters (CTSS)) 3. use intersectBed to extract only promoter/gene overlapping entries from a known (e.g. Refseq) or a custom annotation. The counts of a set of ctss_ids (step 2) belonging to a specific gene region are summarized (for each sample). This is the (gene) expression to be analyzed. 4. Do differential expression analysis using the total counts for genes from step 3. This can be done by edgeR, DESeq etc. I would like to ask (i) is it (biologically or mathematically) meaningful to perform differential expression analysis for the ctss ids (data of step 2) by an appropriate method and then finding genes that contain many differentially expressed ctss_ids (with an enrichment test)? (ii) Does anyone have any idea how different my results would be from the ones obtained by the above (steps 1-4) pipeline (perhaps FDR adj p-values would be very different due to higher dimensionality...)? (iii) If my strategy is meaningful (DE on ctss_ids), should I use a version of cufflinks DE analysis (since I do not have total counts any more)? I am a bit confused because for CAGE data the gene length (isoform length) does not matter. The data I see are just transcription start sites. Thank you in advance for your help. Best Regards, Makis [[alternative HTML version deleted]]

Transcription Annotation convert edgeR DESeq Transcription Annotation convert edgeR DESeq • 1.5k views

ADD COMMENT • link updated 11.6 years ago by Steve Lianoglou ★ 13k • written 11.6 years ago by Makis Motakis ▴ 30

score 0 · Answer 1 · 2012-09-20

Hi, On Thu, Sep 20, 2012 at 10:17 PM, Makis Motakis <emotakis at="" hotmail.com=""> wrote: > Hello list, > My name is Makis, I am a bioinformatics post-doc fellow and, recently, I started working on a CAGE project. I am new on the field and trying to figure out some things, so I would much appreciate any comments on my problem. I am sorry if my question sounds a bit confusing... > > I have a Treatment vs Control (independent samples) experimental design with 3 replicates in each condition. Usually what people do is: > > 1. Use samtools to keep the reads that exhibit a high mapping quality score (here "-q 20") and convert the .bam files (containing the quality data) into bed format: > > samtools view -q 20 -F 768 -u input.bam | ~/bin/bamToBed -i stdin | gzip -c > output.bed.gz > > 2. summarizing all count data (generate transcription start sites clusters (CTSS)) > > 3. use intersectBed to extract only promoter/gene overlapping entries from a known (e.g. Refseq) or a custom annotation. The counts of a set of ctss_ids (step 2) belonging to a specific gene region are summarized (for each sample). This is the (gene) expression to be analyzed. > > 4. Do differential expression analysis using the total counts for genes from step 3. This can be done by edgeR, DESeq etc. > > > I would like to ask (i) is it (biologically or mathematically) meaningful to perform differential expression analysis for the ctss ids (data of step 2) by an appropriate method and then finding genes that contain many differentially expressed ctss_ids (with an enrichment test)? (ii) Does anyone have any idea how different my results would be from the ones obtained by the above (steps 1-4) pipeline (perhaps FDR adj p-values would be very different due to higher dimensionality...)? (iii) If my strategy is meaningful (DE on ctss_ids), should I use a version of cufflinks DE analysis (since I do not have total counts any more)? I am a bit confused because for CAGE data the gene length (isoform length) does not matter. The data I see are just transcription start sites. I'm not sure what the standard CAGE analysis is, but if I get you correctly, the difference between your analysis and what you describe as the 'current' pipeline is that "currently" all transcript start site clusters are "clustered even further" (due to proximity of annotated TSS of a gene) to just measure expression of the downstream gene, but you are rather interested in checking differential expression of the cluster itself? If that's the case: (i) Sure, I think it can be biologically reasonable. I can imagine setting up a supervised learning problem to try to identify why certain TSS clusters are differentially expressed between conditions. (ii) Impossible to say how different. For instance, tt could be that some tss clusters are differentially expressed while the transcript "in total" isn't, so ... (iii) I don't see why you'd use cufflinks here, though. The TSS clusters are still "regions of interest" and they should fit just fine into the edgeR/DESeq mojo (some QC would of course be required) -- in fact, I can imagine a DEXSeq-like approach can be appropriate here, even. HTH (and that I haven't misunderstood your goal), -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact