Question: Incorporate "batch" information for paired-design ?
0
3.0 years ago by
g.atla0
g.atla0 wrote:

I am working with large panel of human tissue cells that are collected from different centres. Its a paired design i.e same sample has treated and untreated rna-seq data.

Here is how my colData looks like

> colData
purity subjects treat  Center   Year  BMI  Gender  Age
sample1_treat      95    1      treat    Miami  2011  <NA>   M   <NA>
sample1_untreat    95    1      untreat  Miami  2011  <NA>   M   <NA>
sample2_treat      90    2      treat    Milano 2011  <NA>   F   <NA>
sample2_untreat    90    2      untreat  Milano 2011  <NA>   F   <NA>
sample3_treat      75    3      treat    Geneva 2011   20    F    50
sample3_untreat    75    3      untreat  Geneva 2011   20    F    50
. . .
. . .

As I have been believing that the paired design takes any batch effects in to account, I used the following design for DE analysis to look for differential genes between treated vs untreated :

design(dds) <- formula(~ subjects + treat)

I am also intended to do exploratory data analysis, for that I am using a matrix of log Fold change values generated by:

norm <- assay(normTransform(dds))
i <- seq.int(1L,72,by = 2L) # 72 is the total number of samples.
norm.fc <- norm[,i]-norm[,i+1]​
write.table(norm.fc, file="normalised_FCMatrix.txt", sep="\t")

And I am using the norm.fc for doing clustering and PCA analysis.

I would like to know if my design for DE analysis is correct. Do I need to incorporate any other information to account for tissue collection / sequencing centre etc.

deseq2 rna-seq batch effect • 796 views
modified 3.0 years ago • written 3.0 years ago by g.atla0
Answer: Incorporate "batch" information for paired-design ?
0
3.0 years ago by
United States
James W. MacDonald51k wrote:

Your analysis takes any batch effects into effect, contingent upon those batch effects being consistent, where by 'consistent' I mean 'on average the same within batch'. This is likely to be true of things like tissue collection centers, where the average gene expression value may be different between centers, but is expected to be relatively consistent within each center.

The pairing won't account for any interaction term for things like sex or age. In other words, there may be a whole set of genes that are affected differently by treatment in males as compared to females. Or the treatment may have a different effect on cells from older people as compared to younger people. So while pairing would be expected to account for any batch effects associated with center, you may still need to account for sex or age. Or not, depending on what the treatment is.

So if I want to look if the treatment has different effect on males vs females, I need to create the appropriate model.matrix ?

So I should ask the program to give me if there is any difference in the gene expression in males vs females w.r.t treatment ?  It looks complicated to ask "If there is any difference between treated vs untreated samples in males and females"

If you fit a model like

design(dds) <- formula(~ subjects + treat + Gender)

rslt <- DESeq(dds)

Then you will be testing to see if there is a gender-specific effect for any genes, which in the end could be interpreted as an interaction between sex and treatment. If there aren't too many genes that have a gender-specific effect (for some interpretation of 'too many', and certainly not including genes from either chrX or chrY), then you might conclude that gender isn't a big driver of any differences, and then ignore it. Or it there are 'enough' genes that are significant, you can just add Gender as a nuisance variable. The same is true of age, although it looks like you don't have complete information for that.