Question: Time course RNA-Seq using edgeR
0
3.9 years ago by
dsperley0
United States
dsperley0 wrote:

Hi,

I’m analyzing some RNA-Seq data for another investigator.  The experimental design is an infection time course with 10 samples (2 biological replicates for each timepoint except 2), 3 timepoints (0h, 12h, 36h) and two strains of mice (WT, and ATG knockout (KO)):

1. WT.0h.a
2. WT.0h.b
3. WT.12h.a
4. WT.12h.b
5. WT.36h.a
6. WT.36h.b
7. KO.0h
8. KO.12h
9. KO.36h.a
10. KO.36h.b

The trimmed fastq files were aligned using bowtie2 and count files were generated using HTSeq. The following R code was used:

Condition<-c("WT.0h","WT.0h","WT.12h","WT.12h","WT.36h","WT.36h","KO.0h","KO.12h","KO.36h","KO.36h")

RNA_data = DGEList(counts=Count_Table[,c(2,3,4:7,10:13)], group=Condition,genes = Count_Table$Gene_ID) # filter low expressing genes #Keeping Genes with greater than 1 count per million (cpm) in at least 2 samples # 1 cpm corresponds to a read count of aproximately 6 keep<-rowSums(cpm(RNA_data)>1)>=1 RNA_data_filtered<-RNA_data[keep,,keep.lib.sizes=F] dim(RNA_data_filtered) # normalization RNA_data_filtered<- calcNormFactors(RNA_data_filtered) #setting up design matrix Condition<-factor(Condition) design<-model.matrix(~0+Condition,data=RNA_data_filtered$samples)
colnames(design)<-levels(Condition)
# estimate dispersion

RNA_data_filtered<-estimateGLMCommonDisp(RNA_data_filtered,design,verbose=T)

#Disp = 0.03235 , BCV = 0.1799

RNA_data_filtered<-estimateGLMTrendedDisp(RNA_data_filtered,design)

RNA_data_filtered<-estimateGLMTagwiseDisp(RNA_data_filtered,design)

plotBCV(RNA_data_filtered)

plotMDS(RNA_data_filtered)

I have 3 questions/concerns:

1. Can I perform contrasts with the treatment groups that have no replicates and still get meaningful results?
2. The BCV distribution seems a bit odd (I thought that as the logCPM increases the BCV values should remain the same, not increase as I see here, but I could be wrong).
3. The MDS plot shows that the two samples in the WT 0h group are very dissimilar, and the K0 12h infected sample is more similar to the 0h uninfected samples.

Am I right in thinking that the odd BCV distribution is due to a possible mix-up between the WT.0h.b sample and the KO.12h sample?

edger • 3.0k views
modified 3.9 years ago by Gordon Smyth36k • written 3.9 years ago by dsperley0

Quick note: perhaps this was a copy/paste/transcription error, but your gene filtering criterion is broken:

# Keeping Genes with greater than 1 count per million (cpm) in at least 2 samples
# 1 cpm corresponds to a read count of aproximately 6
keep <- rowSums(cpm(RNA_data)>1) >= 1


This is actually keeping genes that have a minimum expression in only one sample.

Thanks for noticing, you're right that this was a copy/paste error in the comment - I intended to keep genes with a minimum expression in only one sample- since some treatment groups consist of only one sample.

Answer: A: Time course RNA-Seq using edgeR
2
3.9 years ago by
Gordon Smyth36k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth36k wrote:

One possible explanation that is suggested by the MDS plot is that there is a strong batch effect associated with the three samples WT.0h.a, KO.0h and KO.12.

What we do when we see data like this (which we often do) is to go back to the records kept by the original experimenter and to ask whether anything was different about these samples during the collection process. For example, were they collected at a different time? Were they male mice when all the others were female? Etc etc.

Another warning sign in your experiment is that there are a few genes with very large dispersion values. Another thing we do is to run

d <- estimateDisp(RNA_data_filtered, design, robust=TRUE)
plotBCV(d)


and to examine which genes have large dispersion estimates and small values for prior.df. These are probably the genes associated with the batch effect. This might show for example ribosomal genes, indicating ribosomal RNA retention is some samples, or sex linked genes showing that the samples are not sex matched.

I have taken your advice and asked the experimenter if there was anything different about the WT.0h.a, KO.0h, and K0.12h samples;  however, the experimenter has no additional information about the sex or collection of the samples. I have run

d <- estimateDisp(RNA_data_filtered, design, robust=TRUE)
plotBCV(d)

to examine genes with high dispersion. The tags with high dispersion are not sex linked, or rRNAs, instead they are involved in keratination, cellular differentiation, muscular differentiation and signaling. They also appear to be driving the separation of the KO.0h, KO.12h, and WT.0h.12a from the other samples; in the KO.0h, KO.12h, and WT.0h.12a samples they have between 18-1483 cpm, and very low (less than 1 cpm) for all the other samples.

Given the data that I have, what is would be an appropriate way to analyse it?

2

Your results seem to have confirmed the suspected batch effect, even though the experimenter doesn't have an explanation for what the cause might be.

My best suggestion for such data would be to either (i) add the batch effect to the design matrix or (ii) use robust empirical Bayes, which will effectively take the hyper-variable genes out of the analysis. You have already started the robust empirical Bayes process by using estimateDisp() with robust=TRUE. These robust dispersion estimates will then be propagated to glmFit() and glmLRT(). You could also consider using glmQLFit() with robust=TRUE.

However some datasets have problems for which there is no satisfactory fix at the analysis stage.

Thank-you for your advice on this dataset, I plan to use robust empirical Bayes with glmFit() and glmLRT()

So, it doesn't seem like there's an obvious set of genes that should be removed. Let's try a different approach; if you look at the pairwise MA/smear plots, especially that between the two WT.0h samples, is there some kind of trend? This may be driving the increase in the dispersions at high abundances. If so, we can normalize for this to get a nicer mean-dispersion trend. Of course, it won't help the MDS plot, but at least we can proceed with the rest of the analysis.

I have looked at the pairwise MA/smear plots, and I see no trend in the plot between the WT samples, or other pairs of samples. So I'm planning to take Gordon's advice, and using robust empirical Bayes. Although, for future reference/curiosity, how would one normalize for a trend in LogFC at higher abundances?

Non-linear methods are the key here. One implementation is provided in the csaw package, and can be run with:

require(csaw)
offset <- normalizeCounts(y$counts, lib.sizes=y$samples$lib.size, type="loess") y$offset <- offset

where y is your DGEList. You can use y in the rest of the analysis, e.g., dispersion estimation, GLM fitting.

Answer: A: Time course RNA-Seq using edgeR
3
3.9 years ago by
Aaron Lun22k
Cambridge, United Kingdom
Aaron Lun22k wrote:

1. Yes. For each gene, one dispersion is used for modelling the counts in all libraries. The value of the dispersion is estimated using the residual degrees of freedom in the entire model. This means that you can still get a valid estimate, even if several groups do not have replicates.
2. We generally expect a monotonic decreasing BCV trend in most types of sequencing data. In that respect, your trend looks okay up to about an average abundance of 7. After that, it gets a bit stranger, as it increases sharply instead of decreasing to some plateau value.
3. Yes, this is a concern. The first dimension represents the dominant differences between WT and KO samples, and you can see that your WT.0h "replicates" are also separated on this dimension. This may be driving some inflation of the dispersion estimates, given that replicates in all other groups are fairly tight.

For the sake of diagnosing the problem, I'd suggest dropping the WT.0h group from the analysis, and repeating the dispersion estimation. This might give a more well-behaved BCV trend. If so, then you can start to investigate why the WT.0h replicates are so different. A MA plot might be the first port of call, to see if there are any outliers that might drive separation on a MDS plot and/or inflation of the dispersion estimates. With any luck, there might be a distinct set of outliers (e.g., ribosomal genes, Ig components) that can be easily removed from the analysis.

I'd be cautious about concluding that there's a mix-up between samples. For all we know, either WT.0h.a or WT.0h.b might be the "correct" sample, so we wouldn't know whether we should throw out one or the other.