Hello everybody,
I'm currently analyzing RNAseq data of plants of the same species but different phenotypes (A and B).
In case of phenotype A, we sampled two different tissues (T1a and T2) before altering growth conditions, sampled T1 again (T1b) during this alteration and T1 one last time after returning back to normal (T1c).
In case of phenotype B, we just sampled T1 and T2 without changing growth conditions.
Due to the high degree of heterozygosity of the plants used for the experiment, we thought it best to pool the same tissue under the same conditions from multiple plants (30 - 100) to reduce noise and capture only those genes that show large differences in expression.
We were aware of the challenges that this design would pose for statistical analysis and therefore provided 3 biological replicates for sequencing for each tissue and growth condition (at lower depths than for the pools). We would then use the replicate data to estimate the dispersion and use it to analyze the data from the pools.
My question now is how to do this properly.
We already had a look at the manual from edgeR (Chapter 2.12) and thought we could use the common dispersion estimated from the replicates for the pools. This would mean that the same dispersion value would have to be used for all genes, and we are not sure if this would be the best way.
library(edgeR)
counts <- read.delim(data, header=T, row.names="ID")
group <- factor(c("B_T1", "B_T1", "B_T1", "B_T2", "B_T2", "B_T2", "A_T1a", "A_T1a", "A_T1a", "A_T2", "A_T2", "A_T2", "A_T1b", "A_T1b", "A_T1b", "A_T1c", "A_T1c", "A_T1c"))
y <- DGEList(counts=counts, group=group)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~0+group)
colnames(design) <- levels(y$samples$group)
y <- estimateDisp(y, design, robust=T)
c.disp <- y$common.dispersion
# Continue with analysis of pools
counts <- read.delim(data, header=T, row.names="ID")
group <- factor(c("B_T1", "B_T2", "A_T1a", "A_T2", "A_T1b", "A_T1c")
y <- DGEList(counts=counts, group=group)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~0+group)
colnames(design) <- levels(y$samples$group)
y$common.dispersion <- c.disp
# Continue with glmFit approach
...
An alternative approach would be to use the tagwise dispersion of each gene from the replicates for the genes detected in the pools, but this would result in dropping several genes which could not be detected in the replicates but in the pools because we would have no dispersion value for them.
Is one of this approaches the "best" way or are there other suggestions?
Thank you, Kai
Sorry, I don't understand your experiment.
You are reading the edgeR Section on on what to do without replicates, but your experiment appears to have three biological replicates for each tissue and condition.
You say that samples are extracted from 30-100 plants. Are all the samples extracted from this number of multiple plants or only some? If only some, then which samples are pools and which are not?
If some samples are not pools, then what are they? Single plants?
Is each sample extracted from a different set of plants? If not, then which samples are extracted from the same plants?
I'm sorry, I know it is kinda confusing...
We have 6 pools in total: Phenotype A: T1a, T2, T1b, T1c Phenotype B: T1, T2
Each pool consists of 30 - 100 of the respective tissue samples, i.e. for phenotype A and tissue T1a, material from ~100 plants was pooled together in equal amounts and RNA was subsequently extracted from it. T2 of the same phenotype has "only" ~40 tissue samples pooled together and so on.
But at the same time, for each pool, we have three biological replicates from the same population of plants with the respective phenotype and time point.
I hope it's more clear now.
Am I understanding correctly that you have 24 RNA-seq libraries in total, 6 from the pooled samples and 18 from regular samples with biological replicates?
Also, are the biological replicate samples are from the same plants as used for the pools or from different plants? In other words, are they biologically independent from the pools? I assume that they must be from different plants rather than just being aliquots of the pool RNA.
Yes you are right, we have 24 RNA-seq libraries in total.
The replicates come from the same population of plants, but independent from those which were used for the pools.
How many plants are used for each of the biological replicate samples? One plant or a few?
For tissue T1 and each time point, each replicate corresponds to a single plant.
For tissue T2, we actually had to sample 5 plants per replicate to obtain enough material because the tissue we are interested in covers only a small area.