problem in designing matrix and batch effect
2
0
Entering edit mode
@alihakimzadeh73-20840
Last seen 3.7 years ago

Hi,

I am using EdgeR for differential expression. The experiment includes five individuals who receive the vaccine in 4 different time points (pre, 2hour post, 24hour post, 14-day post). I want to find DE genes between pre and post samples, but unfortunately, i couldn't find a way to design my matrix in a way that i get results. i did PlotMD as you saw in the image sample 5-8 show separate batch in contrast to others. I intend to make the identification of DE genes using a log2 fold change and likelihood ratios (LR) Test in edgeR and significantly expressed genes had an FDR adjusted P-value of < 5%. Here is the script i use for my analysis:

library(edgeR)
setwd("~/Desktop/counts/low")
x_low<-read.csv("counts.txt")
time<-factor(c("pre","2hr-post","24hr-post","14d-post",
           "pre","2hr-post","24hr-post","14d-post",
           "pre","2hr-post","24hr-post","14d-post",
           "pre","2hr-post","24hr-post","14d-post",
           "pre","2hr-post","24hr-post","14d-post"))
time <- relevel(time, ref="pre")
y_low<-DGEList(counts=x_low[,2:21], genes = x_low[,1], group = time) #read counts.csv and make table  
keep <- filterByExpr(y_low)
y_low <-y_low[keep, keep.lib.sizes=FALSE] 
y_low <- calcNormFactors(y_low)
plotMDS(y_low)
data.frame(sample=colnames(y_low),time) #data frame
      sample      time
1   Sample_1       pre
2   Sample_2  2hr-post
3   Sample_3 24hr-post
4   Sample_4  14d-post
5   Sample_5       pre
6   Sample_6  2hr-post
7   Sample_7 24hr-post
8   Sample_8  14d-post
9   Sample_9       pre
10 Sample_10  2hr-post
11 Sample_11 24hr-post
12 Sample_12  14d-post
13 Sample_13       pre
14 Sample_14  2hr-post
15 Sample_15 24hr-post
16 Sample_16  14d-post
17 Sample_17       pre
18 Sample_18  2hr-post
19 Sample_19 24hr-post
20 Sample_20  14d-post
design<-model.matrix(~time) 
y_low<-estimateDisp(y_low,design)
fit<-glmFit(y_low,design)
lrt<-glmLRT(fit)
deg = topTags(lrt,p.value = 0.05)$table

Thanks in advance

edger rna-seq matrix • 1.3k views
ADD COMMENT
0
Entering edit mode

I answered this question weeks ago on Biostars: https://www.biostars.org/p/410007

I already advised OP to treat each group of four samples as a batch, which one would do by

design <- model.matrix(~time + individual)
ADD REPLY
0
Entering edit mode

Dear Gordon, i know you already reply to my question on Biostars, but it wasn't adequately clear for me. That's why i try to seek more help here. I have nobody to help me, and i am on my own, so that's why i ask again here. Sincerely thanks

ADD REPLY
2
Entering edit mode
@james-w-macdonald-5106
Last seen 9 minutes ago
United States

You have repeated measures on five different subjects, so you should block on subject. See 3.4.1 in the edgeR User's Guide.

Also, you should provide the design matrix to filterByExpr, otherwise you will filter as if you have only one group.

ADD COMMENT
0
Entering edit mode

Thanks, James, i follow the pathway of section 3.4.1, and i changed my code, but i got new errors when i want to estimate the dispersion which shows :

 Error in .compressOffsets(y, lib.size = lib.size, offset = offset) : 
  offsets must be finite values
In addition: Warning message:
In Ops.factor(lib.size, nf) : ‘*’ not meaningful for factors

Here is my new script :

x_low<-read.csv("counts.txt")
time<-factor(c("pre","2hr-post","24hr-post","14d-post",
             "pre","2hr-post","24hr-post","14d-post",
             "pre","2hr-post","24hr-post","14d-post",
             "pre","2hr-post","24hr-post","14d-post",
             "pre","2hr-post","24hr-post","14d-post"))
individual<-factor(c("1","1","1","1",
                     "2","2","2","2",
                     "3","3","3","3",
                     "4","4","4","4",
                     "5","5","5","5"))
time <- relevel(time, ref="pre")
y_low<-DGEList(counts=x_low[,2:21], genes = x_low[,1]) #read counts.csv and make table
targets <- data.frame(sample=colnames(y_low),time,individual)
y_low <- calcNormFactors(y_low)
design<-model.matrix(~time+individual)
design<-filterByExpr(design)
y_low<-estimateGLMCommonDisp(y_low,design)
y_low <- estimateGLMTrendedDisp(y_low,design)
y_low <- estimateGLMTagwiseDisp(y_low,design)
fit<-glmFit(y_low,design)
lrt <- glmLRT(fit)
ADD REPLY
2
Entering edit mode
@james-w-macdonald-5106
Last seen 9 minutes ago
United States

When you use a function in R (or most software, so far as I know), you can use named arguments (like counts = x_low[,2:21]) or you can use positional arguments, where R infers what a given argument is, by its position in the function call. You did this:

y_low<-DGEList(counts=x_low[,2:21], genes = x_low[,1],time)

Where, by mistake or intention you have two named arguments and one positional argument (the third argument, time). An interesting question in this instance is what will happen with the positional argument, since it follows two named arguments? I didn't know what would happen, so I made a stupid test:

> z <- DGEList(counts = matrix(rnbinom(100, 3, 0.5), 10), genes = data.frame(genes = LETTERS[1:10]), letters[1:10])
> z
An object of class "DGEList"
$counts
   Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
1        5       0       1       2       2       6       0       0       2
2        4       7       6       2       5       7       2       3       3
3        2       3       4       2      10       3       4       7       4
4        0       6       1       1       3       2       2       3       6
5        4       2       1       2       2       6       3       5       1
6        8       4       7       5       2       3       4       2       7
7        2       6       5       4       7       5       0       9       4
8        1       0       1       2       5       4       0       0       5
9        6       2       8       3       9       3       2       4       2
10       8       4       2       1       1       6       3       3       2
   Sample10
1         4
2         2
3         1
4         1
5         2
6         1
7         3
8         3
9         3
10        7

$samples
         group lib.size norm.factors
Sample1      1        a            1
Sample2      1        b            1
Sample3      1        c            1
Sample4      1        d            1
Sample5      1        e            1
Sample6      1        f            1
Sample7      1        g            1
Sample8      1        h            1
Sample9      1        i            1
Sample10     1        j            1

$genes
   genes
1      A
2      B
3      C
4      D
5      E
6      F
7      G
8      H
9      I
10     J

With a mix of positional and named arguments, it appears that R will infer the position of the positional argument based on the position of the named arguments (the arguments for DGEList are counts, lib.size, norm.factors, samples, group, genes, remove.zeros). Since you used names for the first and sixth arguments, the positional argument is then inferred to be the lib.size. In other words, you already 'took' the first and sixth position, so the first available 'empty' position is the second one, which is lib.size.

In other words, you told DGEList that your library sizes were your time object, which is a factor! And it is unsurprising that when the offsets are computed and the underlying function expects a numeric vector but gets a factor, you get an error:

> estimateDisp(z, model.matrix(~gl(2,5)))
Error in .compressOffsets(y, lib.size = lib.size, offset = offset) : 
  offsets must be finite values
ADD COMMENT
0
Entering edit mode

Thanks a lot, James, for your extensive and comprehensive explanation; I do appreciate that. There is also another problem that i couldn't figure out which is after applying filterByExpr to my matrix and it shows me FALSE values :

     V1
  1 FALSE
  2 FALSE
  3 FALSE
  4 FALSE
  5 FALSE
  6 FALSE
  7 FALSE
  8 FALSE
  9 FALSE
 10 FALSE
 11 FALSE
 12 FALSE
 13 FALSE
 14 FALSE
 15 FALSE
 16 FALSE
 17 FALSE
 18 FALSE
 19 FALSE
 20 FALSE
  

further when i check by colnames(design) the result is NULL! I don't know what is wrong about my matrix since i include everything correctly in my mind. Here is the dataframe structure which as i want.

         sample         time      individual
     1  Sample_1       pre          1
     2  Sample_2   2hr-post         1
     3  Sample_3  24hr-post         1
     4  Sample_4   14d-post         1
     5  Sample_5       pre          2
     6  Sample_6   hr-post          2
     7  Sample_7  4hr-post          2
     8  Sample_8   4d-post          2
     9  Sample_9       pre          3
    10 Sample_10  2hr-post          3
    11 Sample_11 24hr-post          3
    12 Sample_12  14d-post          3
    13 Sample_13       pre          4
    14 Sample_14  2hr-post          4
    15 Sample_15 24hr-post          4
    16 Sample_16  14d-post          4
    17 Sample_17       pre          5
    18 Sample_18  2hr-post          5
    19 Sample_19 24hr-post          5
    20 Sample_20  14d-post          5

and my 'design' is lalso looks like reasonable:

(Intercept) Treat2hr-post Treat24hr-post Treat14d-post individual2 individual3 individual4 individual5
  1 1   0   0   0   0   0   0   0
  2 1   1   0   0   0   0   0   0
  3 1   0   1   0   0   0   0   0
  4 1   0   0   1   0   0   0   0
  5 1   0   0   0   1   0   0   0
  6 1   1   0   0   1   0   0   0
  7 1   0   1   0   1   0   0   0
  8 1   0   0   1   1   0   0   0
  9 1   0   0   0   0   1   0   0
 10 1   1   0   0   0   1   0   0
 11 1   0   1   0   0   1   0   0
 12 1   0   0   1   0   1   0   0
 13 1   0   0   0   0   0   1   0
 14 1   1   0   0   0   0   1   0
 15 1   0   1   0   0   0   1   0
 16 1   0   0   1   0   0   1   0
 17 1   0   0   0   0   0   0   1
 18 1   1   0   0   0   0   0   1
 19 1   0   1   0   0   0   0   1
 20 1   0   0   1   0   0   0   1
ADD REPLY
0
Entering edit mode

You are going to have to start paying attention to what you are doing. Analyzing data is not trivial and you cannot just type up a bunch of code that makes no sense and expect it to work. For example, what do you think this is supposed to do?

design<-model.matrix(~time+individual)
design<-filterByExpr(design)

Have you read the help page for filterByExpr? What are the arguments supposed to be? What is the output supposed to be? You need to be sure that you understand what each line of code is supposed to do, and why you are doing it.

If you don't know those things, then you either have to read and learn enough about what you are doing, or you need to find someone local who has the required expertise to help you.

While Open Source software is free to download and use, it is most certainly not free. You have to spend the time and effort to know A) how to use the software and B) whether or not what you are doing makes sense from both a Biological and Statistical standpoint. Neither of these things is trivial! Most people don't fix their own car, or do their own surgery, or plumb their own house - they find people who do that for a living and pay them, because it's cheaper in the end to do so. I would argue the same for analyzing data; unless it's your job, and you need to know this stuff, you are probably better off finding a local statistician to do that part.

ADD REPLY
0
Entering edit mode

The most significant concern is in our team; We don't have an expert at R language or someone familiar with this software for analysis. We are all python programmer. I will put more effort on my own, but you did help me so much at finding what i have to do, i will try my best to solve on my own. Thanks again, James, and I am sorry if i bother you by asking simple questions.

ADD REPLY

Login before adding your answer.

Traffic: 874 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6