I have scoured the forums and the manual, but the specific answer to my question still eludes me. Apologies if this is a repeated question.
What I have: RNA-Seq count data from various tissues from three biological replicates in two species. Certain tissue types in one of the species are expected to exhibit particular expression differences. The two groups are denoted by the variable name 'tissue state' (see table below).
What I want: A list of genes that are specifically differentially expressed between the two 'tissue state' groups, while controlling for both variation within each sample 'group', and between the 'group's that are nested within each 'tissue state'.
I have tried to work through some examples and pull together a script that addresses, but I'm still not sure I'm specifying the right combination of "design" and "block". I am using the "double-voom" strategy recommended in the old mailing list forums.
Any advice would be appreciated. (my script is posted below)
Thanks,
Art V.
The metadata file is:
sample group tissuestate individual species A1 A 1 1 1 A2 A 1 2 1 A3 A 1 3 1 B1 B 1 1 1 B2 B 1 2 1 B3 B 1 3 1 C1 C 1 1 1 C2 C 1 2 1 C3 C 1 3 1 D1 D 0 1 1 D2 D 0 2 1 D3 D 0 3 1 E1 E 0 4 0 E2 E 0 5 0 E3 E 0 6 0 F1 F 0 4 0 F2 F 0 5 0 F3 F 0 6 0
Here is my current working script:
library("limma")
library("statmod")
library("edgeR")
args <- commandArgs()
y <- read.table(CountFile, header=T, sep="\t", stringsAsFactors=F)
rownames(y) <- y[,1]
y <- y[,-1]
metadata <- read.table(MetadataFile, header=T, sep="\t")
y <- DGEList(counts=y, group=metadata$group)
keep <- rowSums(cpm(y)>1) >= 3
y <- y[keep,]
y <- calcNormFactors(y, method="TMM")
design <- model.matrix(~metadata$tissuestate, data=metadata)
v.data <- voom(y, design)
v.data$genes = rownames(y)
block=metadata$group
corfit <- duplicateCorrelation(v.data,design,block=block)
v.data <- voom(y,design,block=block,correlation=corfit$consensus)
v.fitlimma = lmFit(v.data, design,block=block,correlation=corfit$consensus)
v.fitbayes = eBayes(v.fitlimma)
v.pvalues = v.fitbayes$p.value[,2]
v.adjpvalues = p.adjust(v.pvalues, method="BH")
write.table(toptable(v.fitbayes, coef=2, number=length(v.fitbayes), sort.by="p"), file=OutputFile, sep="\t")
Edit: the metadata table was confusing, added some information and renamed columns
Your column names in
metadata
are a little confusing; it's easy to getgroup
andcondition
mixed up during a more general discussion of groups and conditions. I assumecondition
refers to the two species of interest, whereasgroup
refers to the various tissue types.Thanks for the feedback, I have updated the table to be hopefully less confusing. The particular "tissue state" (a.k.a. condition) we are interested in occurs in only SOME of the tissue types in only ONE of the species.
That's better. Still, what does the
group
factor actually represent? For example, is it some kind of batch variable?"group" are biological replicates of the same tissue type taken from three separate individuals. I basically want to make sure that we are controlling for the fact that three individuals in a common environment with tissues collected at the same approximate stage SHOULD have correlated expression. Without some kind of nested model or blocking all the individual samples on each side of "tissuestate" would be treated as completely independent.
Among the products mentioned on the list, which one is the best? https://knowallthe.com/difference-between-hiking-and-work-boots/