include your problematic code here with any corresponding output

Question

Differential expression analysis - Clariom S array -workflow

0

Entering edit mode

ala • 0

@1c2265e5

Last seen 2.5 years ago

Malaysia

Hi, I am trying to determine the differentially expressed genes between two groups, the raw data in form of CEL files generated by the Clariom S array, The two groups of comparison, Case=6 arrays, Control= 7 arrays. (total 13 arrays).

My question is

How I can filter the genes with low variance and QC probes.
considering my experiment design, how to design a matrix?

Code should be placed in three backticks as shown below

library(oligo)
library(affycoretools)
library(limma)
library(clariomshumantranscriptcluster.db)
library(pd.clariom.s.human)
list.celfiles()
rawdata <- read.celfiles(list.celfiles())
probeset.eset=rma(rawdata, background=TRUE, normalize=TRUE, subset=NULL)
probeset.eset <- annotateEset(probeset.eset,annotation(probeset.eset))

#  I Stopped here cause I don't know how to formulate the design matrix

include your problematic code here with any corresponding output

please also include the results of running the following in an R session

sessionInfo( )

```

design-matrix limma • 2.4k views

ADD COMMENT • link 2.5 years ago ala • 0

score 1 · Answer 1 · 2022-04-05

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 8 hours ago

United States

Let's assume you have a data.frame that lists the different sample types, like this:

> samples <- data.frame(Status = factor(rep(c("Case","Control"), c(6,7))))
> samples
    Status
1     Case
2     Case
3     Case
4     Case
5     Case
6     Case
7  Control
8  Control
9  Control
10 Control
11 Control
12 Control
13 Control

Now it's a simple matter of specifying the design matrix and making the comparisons.

> library(limma)
> design <- model.matrix(~ 0 + Status, samples)
> colnames(design) <- gsub("Status", "", colnames(design))
> cont <- makeContrasts(Case - Control, levels = design)
> design
   Case Control
1     1       0
2     1       0
3     1       0
4     1       0
5     1       0
6     1       0
7     0       1
8     0       1
9     0       1
10    0       1
11    0       1
12    0       1
13    0       1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$Status
[1] "contr.treatment"

> cont
         Contrasts
Levels    Case - Control
  Case                 1
  Control             -1

> fit <- lmFit(probeset.eset, design)
> fit2 <- contrasts.fit(fit, cont)
> fit2 <- eBayes(fit2)
> topTable(fit2,1)

Also see section 9 in the limma User's Guide, which starts on p41.

ADD COMMENT • link 2.5 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you so much for your quick response.

regarding filtering step, how to filter the genes with low variance in order to increase the number of differentially expressed genes.

enter image description here

ADD REPLY • link 2.5 years ago ala • 0

0

Entering edit mode

It's a bad idea to filter genes based on variance. The eBayes step uses an estimate of the variance across all genes measured as a prior, and if you filter out the low variance genes you will probably bias towards the null (e.g., it will be worse than what you are getting already). You can filter on other things though, like requiring the expression level of at least 6 samples to be larger than some value or getting rid of probesets that don't have any known transcript (and all the controls). I would in general wait to filter the fit2 object though, rather than prior to that.

If you used annotateEset on your probeset.eset object, the topTable results should have things like the gene name and symbol. You could do something like

fit2 <- fit2[!is.na(fit2$genes$SYMBOL),]

To get an MArrayLM object that just has the 'real' genes (for some definition of 'real'). As an example, here's some random data I got from GEO:

> dat <- read.celfiles(list.celfiles(listGzipped = TRUE))
Platform design info loaded.
Reading in : GSM5994355_01_M38_L_WT_Clariom_S_Mouse_.CEL.gz
Reading in : GSM5994356_02_M40_L_WT_Clariom_S_Mouse_.CEL.gz
Reading in : GSM5994357_03_M31_L_WT_Clariom_S_Mouse_.CEL.gz
Reading in : GSM5994358_04_M39_L_KO_Clariom_S_Mouse_.CEL.gz
Reading in : GSM5994359_05_M41_L_KO_Clariom_S_Mouse_.CEL.gz
Reading in : GSM5994360_06_M32_L_KO_Clariom_S_Mouse_.CEL.gz

> eset <- rma(dat)
Background correcting
Normalizing
Calculating Expression
> library(clariomsmousetranscriptcluster.db)
> eset <- annotateEset(eset, clariomsmousetranscriptcluster.db)
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> design <- model.matrix(~0+ gl(2,3))
> colnames(design) <- c("WT","KO")
> cont <- makeContrasts(KO - WT, levels = design)
> fit2 <- eBayes(contrasts.fit(lmFit(eset, design), cont))
> topTable(fit2)
                                  PROBEID ENTREZID SYMBOL
MG-1-neg-10340613_st MG-1-neg-10340613_st     <NA>   <NA>
TC1900000529.mm.2       TC1900000529.mm.2    16551  Kif11
MG-1-neg-10342962_st MG-1-neg-10342962_st     <NA>   <NA>
MG-1-neg-10339690_st MG-1-neg-10339690_st     <NA>   <NA>
MG-1-neg-10340578_st MG-1-neg-10340578_st     <NA>   <NA>
TC0800000629.mm.2       TC0800000629.mm.2    97165  Hmgb2
MG-1-neg-10344413_st MG-1-neg-10344413_st     <NA>   <NA>
MG-1-neg-10338594_st MG-1-neg-10338594_st     <NA>   <NA>
MG-1-neg-10340922_st MG-1-neg-10340922_st     <NA>   <NA>
MG-1-neg-10343337_st MG-1-neg-10343337_st     <NA>   <NA>
                                      GENENAME      logFC  AveExpr         t
MG-1-neg-10340613_st                      <NA>  1.6654789 3.835293 11.627987
TC1900000529.mm.2     kinesin family member 11 -1.2753855 3.116560 -9.227560
MG-1-neg-10342962_st                      <NA> -1.4670381 2.298939 -8.504792
MG-1-neg-10339690_st                      <NA>  1.1554747 3.613898  8.187279
MG-1-neg-10340578_st                      <NA>  1.0112744 2.497108  8.112280
TC0800000629.mm.2    high mobility group box 2 -1.3608453 4.484379 -7.432833
MG-1-neg-10344413_st                      <NA> -1.4715614 2.282866 -6.489185
MG-1-neg-10338594_st                      <NA> -0.9325457 4.518174 -6.487135
MG-1-neg-10340922_st                      <NA>  0.7789265 4.421446  6.462993
MG-1-neg-10343337_st                      <NA> -0.9719129 2.868469 -6.194258
                          P.Value adj.P.Val         B
MG-1-neg-10340613_st 1.000045e-05 0.2913031 -3.455081
TC1900000529.mm.2    4.412922e-05 0.5772712 -3.497761
MG-1-neg-10342962_st 7.378051e-05 0.5772712 -3.517216
MG-1-neg-10339690_st 9.357378e-05 0.5772712 -3.527199
MG-1-neg-10340578_st 9.908874e-05 0.5772712 -3.529704
TC0800000629.mm.2    1.700041e-04 0.8253414 -3.555369
MG-1-neg-10344413_st 3.858223e-04 0.9989105 -3.602326
MG-1-neg-10338594_st 3.865476e-04 0.9989105 -3.602446
MG-1-neg-10340922_st 3.952060e-04 0.9989105 -3.603866
MG-1-neg-10343337_st 5.078991e-04 0.9989105 -3.620542

Filter to just genes with symbols

> fit2 <- fit2[!is.na(fit2$genes$SYMBOL),]
> topTable(fit2)
                            PROBEID  ENTREZID  SYMBOL
TC1900000529.mm.2 TC1900000529.mm.2     16551   Kif11
TC0800000629.mm.2 TC0800000629.mm.2     97165   Hmgb2
TC1300001926.mm.2 TC1300001926.mm.2    105193  Nhlrc1
TC0700003439.mm.2 TC0700003439.mm.2    117589    Asb7
TC1700001407.mm.2 TC1700001407.mm.2    381059 Gm1604b
TC0700001721.mm.2 TC0700001721.mm.2    233806 Tmem159
TC0400004202.mm.2 TC0400004202.mm.2 100041433  Zfp981
TC1700002087.mm.2 TC1700002087.mm.2    258324 Olfr129
TC1900001479.mm.2 TC1900001479.mm.2     13099 Cyp2c40
TC0900001845.mm.2 TC0900001845.mm.2    244723   Olfm2
                                                                GENENAME
TC1900000529.mm.2                               kinesin family member 11
TC0800000629.mm.2                              high mobility group box 2
TC1300001926.mm.2                                NHL repeat containing 1
TC0700003439.mm.2               ankyrin repeat and SOCS box-containing 7
TC1700001407.mm.2                                   predicted gene 1604b
TC0700001721.mm.2                              transmembrane protein 159
TC0400004202.mm.2                                zinc finger protein 981
TC1700002087.mm.2                                 olfactory receptor 129
TC1900001479.mm.2 cytochrome P450, family 2, subfamily c, polypeptide 40
TC0900001845.mm.2                                         olfactomedin 2
                       logFC  AveExpr         t      P.Value adj.P.Val
TC1900000529.mm.2 -1.2753855 3.116560 -9.227560 4.412922e-05 0.9265812
TC0800000629.mm.2 -1.3608453 4.484379 -7.432833 1.700041e-04 0.9995449
TC1300001926.mm.2  0.6125437 4.701354  5.304300 1.238365e-03 0.9995449
TC0700003439.mm.2  0.7175511 4.188782  5.157766 1.447949e-03 0.9995449
TC1700001407.mm.2 -0.5834391 3.688742 -5.095065 1.549502e-03 0.9995449
TC0700001721.mm.2 -0.6385654 4.450923 -5.069707 1.592808e-03 0.9995449
TC0400004202.mm.2  0.6481561 3.408812  5.064616 1.601665e-03 0.9995449
TC1700002087.mm.2  0.7947954 3.422708  5.003970 1.711499e-03 0.9995449
TC1900001479.mm.2 -1.1836657 7.789961 -4.953239 1.809845e-03 0.9995449
TC0900001845.mm.2 -0.8662049 3.467330 -4.951292 1.813744e-03 0.9995449
                          B
TC1900000529.mm.2 -3.497761
TC0800000629.mm.2 -3.555369
TC1300001926.mm.2 -3.689537
TC0700003439.mm.2 -3.703369
TC1700001407.mm.2 -3.709539
TC0700001721.mm.2 -3.712078
TC0400004202.mm.2 -3.712591
TC1700002087.mm.2 -3.718783
TC1900001479.mm.2 -3.724080
TC0900001845.mm.2 -3.724286

Which still didn't give me any significant genes...

ADD REPLY • link 2.5 years ago James W. MacDonald 67k

0

Entering edit mode

You could also hypothetically filter on the negative controls, but they don't seem that useful:

> summary(rowMeans(exprs(eset)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.048   3.499   4.930   5.249   6.769  13.451 
> summary(rowMeans(exprs(eset)[grep("^MG-1-neg", row.names(eset)),]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.048   2.629   3.643   3.780   4.638  13.034

This is because the GC content for the controls varies from 0 - 100%, and once you get past about 40% GC content, the probes will bind to anything. You could filter on GC content of the controls, but that's more work than I have the time to do right now.

ADD REPLY • link 2.5 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you for your explanation, really appreciated it.

what about filtering genes based on intensity, is it going to improve the result, as you can see no gene shows differential expression.

Before using the R software I used the TAC, and the result showed that 963 genes were differentially expressed after correction (using FDR p value) the list diminished to one gene only (downregulated with -13 FC). However, now no gene is differentially expressed and that gene appears to be upregulated with FC 2. which result should I consider?

ADD REPLY • link 2.5 years ago ala • 0

0

Entering edit mode

please guide me on how to do what you have suggested here, (You can filter on other things though, like requiring the expression level of at least 6 samples to be larger than some value or getting rid of probesets that don't have any known transcript (and all the controls). I would in general wait to filter the fit2 object though, rather than prior to that). if it will improve the result. thank you so much.

ADD REPLY • link 2.5 years ago ala • 0