Question: RNA-Seq analysis edgeR
gravatar for eb0906
2.9 years ago by
United States
eb09060 wrote:

We have a time course infection experiment with 12 samples (2 technical replicates and 2 biological replicates per sample). The samples were sequenced on a HiSeq over 2 lanes, 1 x 50 bp reads.

  1. 24hr_infection1_lane3
  2. 24hr_infection1_ lane4
  3. 24hr_infection2_lane3
  4. 24hr_infection2_lane4
  5. 24hr_control1_lane3
  6. 24hr_control1_lane4
  7. 24hr_control2_lane3
  8. 24hr_control2_lane4
  9. 48hr_infection1_lane3
  10. 48hr_infection1_lane4
  11. 48hr_infection2_lane3
  12. 48hr_infection2_lane4

I aligned the samples using Tophat and then converted the .bam files to .sam files and created a counts file using HTSeq for each sample. My questions now are: 

  1. Should the technical replicates be collapsed (add gene counts for technical replicates)? 
  2. How should the design matrix be setup in edgeR?
ADD COMMENTlink modified 2.9 years ago by James W. MacDonald45k • written 2.9 years ago by eb09060
gravatar for James W. MacDonald
2.9 years ago by
United States
James W. MacDonald45k wrote:

Note that Tophat is intended for finding splice junctions, usually with the goal that you would then infer transcript abundances and presumably then infer differential transcript splicing. This is different from what one normally does with edgeR, which is to infer differential gene expression (which by default ignores differences in the form of the transcripts).

The former is really hard to do, and requires both high read depth and lots of biological replication. The latter is much less difficult, as it is a simpler problem. My point being that Tophat spends a lot of time figuring out things that you are planning to ignore, so you are likely better off using something like bowtie2 or subread or bwa, etc to do the aligning. subread in particular is very fast for this sort of thing, and there is a Bioconductor Rsubread package that makes it simple to use.

Anyway, the answers to your question are:

1.) Yes

2.) It depends on what you are trying to detect. I don't see a 48 hour control, so presumably you are planning to compare both time points back to the 24 hour control. In which case you would want to do something like

design <- model.matrix(~factor(rep(c("inf_24","control","inf_48"), each = 2)))

which parameterizes the second and third coefficient as 24h infection - control, and 48h infection - control, respectively.

ADD COMMENTlink written 2.9 years ago by James W. MacDonald45k

Thanks very much for your input. 

Based on your response, we are thinking that we should redo the alignment using bowtie2 and use the .sam output for creating a counts file using HTSeq for each sample and then collapse technical replicates. Yes, you are correct in assuming that we will compare both time points to the 24 hr control since we do not have an additional control at 48 hrs. 

To be clear, if one is only interested in differential gene expression between treatment groups, then the Tuxedo suite tools is not a good choice? Ideally, one would use a different aligner (BWA, Bowtie, STAR, etc) and then use either edgeR or DESeq2?

ADD REPLYlink written 2.9 years ago by eb09060

I am saying that Tophat is probably not the tool you want for the task at hand. It uses bowtie to do the alignment, and then goes back and tries to find junction splicing reads. If you don't care about the junction splicing reads (which are in a different file than your aligned reads), then you are doing an extra step, which takes that much more time, and then not using the data that you took the extra time to generate. I don't think that the accepted-hits.bam file contains the junction-spanning reads, so you may be losing those.

Using bowtie2 for the aligner is helpful because it is a gapped aligner, so you shouldn't lose reads that span exon-exon junctions, which you might lose with bowtie1. IIRC, bwa and star are gapped aligners as well. Subread just soft-clips the junction spanning reads, but that is a distinction without a difference. You will still get the same number of reads per gene (because subread says 'the read aligns to the end of this exon', and the gapped aligners say 'the read aligns to the ends of these two exons', but both result in that read being counted as overlapping a gene).

So to answer your question, yes, ideally you would just use one of the gapped aligners that you mention and then use either edgeR or DESeq2 to test for differential expression.

ADD REPLYlink written 2.9 years ago by James W. MacDonald45k

And I forgot to mention that you can just use bowtie2 (or pretty much any aligner) to align the technical replicates together, so you don't have to collapse after the alignment. You just do something like

bowtie2-align -U techrep1.fastq,techrep2.fastq | samtools view -bS - > output.bam

or if you have paired end

bowtie2-align -1 techrep1-1.fastq,techrep2-1.fastq -2 techrep1-2.fastq,techrep2-2.fastq | samtools view -bS - > output.bam
ADD REPLYlink written 2.9 years ago by James W. MacDonald45k

This is really helpful -- thank you!

ADD REPLYlink written 2.9 years ago by eb09060
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 298 users visited in the last hour