Questions Trying to Re-create MicroArray Data (GSE97562) (Fig. 2)
1
0
Entering edit mode
rodj5201 • 0
@1efd0df9
Last seen 16 months ago
United States

Hi, I want to pre-face that this might be a long read.

I would like to re-create data from a dataset to learn how to perform differential analysis with limma. My main goal is to re-create Fig. 2 from the cited study (https://onlinelibrary.wiley.com/doi/10.1002/cam4.1187). I'm extremely new to this but have compiled code from tutorials, etc. that will read the files and perform rma normalization, then use limma to look for differential expression. At-least that's what I understand to be happening.

(btw I used GSM2572161 through GSM2572170 for .cel files (HSC vs CML))

I will put the code below:


library(oligo)
library(dplyr)
setwd("C:fake-example/GSE/GSE97562")
celFiles <- list.celfiles(full.names = TRUE)
affyExpressionFS <- read.celfiles(celFiles)



#RMA
eset <- oligo::rma(affyExpressionFS)
data.matrix = exprs(eset)



library(limma)

#Make design
design = model.matrix(~0+factor(c(1,1,1,1,1,2,2,2,2,2)))
design[,2] = c("cmlS","cmlS","cmlS","cmlS","cmlS","cmlP","cmlP","cmlP","cmlP","cmlP")
colnames(design)[2]="source"
design = as.data.frame(design)


groups = design$source

#Factor
f = factor(groups,levels=c("cmlS","cmlP"))


design2 = model.matrix(~ 0 + f)
colnames(design2) = c("cmlS","cmlP")

#Fit to model
data.fit = lmFit(data.matrix,design2)


contrast.matrix = makeContrasts(cmlP- cmlS,levels=design2)
data.fit.con = contrasts.fit(data.fit,contrast.matrix)
data.fit.eb = eBayes(data.fit.con)


#Make table with expression data
options(digits=2)
tab = topTable(data.fit.eb,coef=1,number=Inf,adjust.method="BH")

First, want to apologize for the ugly code. From what I understand this is a very basic run-through of differential analysis. The output of tab is 33,927 genes/probes (since I never annotated). The data I get back after keeping "genes" that follow the study's filters (fold change greater than 2 and p-value less than .05) is 50 genes (they get 584). What I believe has not been done is removal of mappings to the same gene ID. I also realize that I have not cutoff "low counts". I'm not sure what else could account for such differences. I have attempted annotating these probes but have had very little luck with that. I realize that it's much to ask for a tutorial and I feel that I might not be completely understanding what I'm doing, but was hoping someone could answer these questions:

Is it possible to convert .cel files immediately to .csv or .txt? (Something I'm able to look at and visualize would be extremely helpful) (Such as in RNA-seq data)

How would you go about removing "low counts"? Meaning, how would you find the sample cutoff and then set that as the threshold for the "counts".

When should annotation be done? (Many questions I found seem to do it after rma normalization)

Could you point me to a tutorial for annotating the probeID's? ( The affyExpressionFS outputs this : Annotation: pd.hugene.1.0.st.v1 )

-I tried using annotateEset which seemed to work, genuinely don't know how to check the mappings after this function though.

How do I remove the NA values after annotating the probes?

I appreciate any help/advice! Thank you!

MicroRNAArrayData GSE97562 pd.hugene.1.0.st.v1 Annotation • 968 views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

The paper you reference used Affy's TAC software, so you won't be able to recreate 100%. But you can probably get close.

> library(GEOquery)
> library(oligo)
> library(affycoretools)
> library(limma)
> library(pd.hugene.1.0.st.v1)
> library(hugene10sttranscriptcluster.db)
> getGEOSuppFiles("GSE97562")
> setwd("GSE97562")
> untar("GSE97562_RAW.tar")
> dat <- read.celfiles(list.celfiles(listGzipped = TRUE))
Loading required package: RSQLite
Loading required package: DBI
Platform design info loaded.
Reading in : GSM2572161_LMTR1414T0.CEL.gz
Reading in : GSM2572162_LMTR222T0.CEL.gz
Reading in : GSM2572163_LMTR35T0.CEL.gz
Reading in : GSM2572164_LMTR47T0.CEL.gz
Reading in : GSM2572165_LMTR914T0.CEL.gz
Reading in : GSM2572166_MOTR36T0.CEL.gz
Reading in : GSM2572167_MO914TRT0.CEL.gz
Reading in : GSM2572168_MO1514TRT0.CEL.gz
Reading in : GSM2572169_MO114TRT0.CEL.gz
Reading in : GSM2572170_MO07TRT0.CEL.gz
Reading in : GSM2572171_LMPR1414T0.CEL.gz
Reading in : GSM2572172_LMPR222T0.CEL.gz
Reading in : GSM2572173_LMPR35T0.CEL.gz
Reading in : GSM2572174_LMPR47T0.CEL.gz
Reading in : GSM2572175_LMPR914T0.CEL.gz
Reading in : GSM2572176_MO07PRT0.CEL.gz
Reading in : GSM2572177_MO114PRT0.CEL.gz
Reading in : GSM2572178_MO1514PRT0.CEL.gz
Reading in : GSM2572179_MO914PRT0.CEL.gz
Reading in : GSM2572180_MOPR36T0.CEL.gz
Reading in : GSM2572181_LMPR1414CT.CEL.gz
Reading in : GSM2572182_LMPR222CT.CEL.gz
Reading in : GSM2572183_LMPR35CT.CEL.gz
Reading in : GSM2572184_LMPR47CT.CEL.gz
Reading in : GSM2572185_LMPR914CT.CEL.gz
Reading in : GSM2572186_MO07PRCT.CEL.gz
Reading in : GSM2572187_MO114RCT.CEL.gz
Reading in : GSM2572188_MO1514RCT.CEL.gz
Reading in : GSM2572189_MO36PRCT.CEL.gz
Reading in : GSM2572190_MO914RCT.CEL.gz
Reading in : GSM2572191_LMPR1414IM.CEL.gz
Reading in : GSM2572192_LMPR222IM.CEL.gz
Reading in : GSM2572193_LMPR35IM.CEL.gz
Reading in : GSM2572194_LMPR47IM.CEL.gz
Reading in : GSM2572195_LMPR914IM.CEL.gz
Reading in : GSM2572196_MO07PRIM.CEL.gz
Reading in : GSM2572197_MO114PRIM.CEL.gz
Reading in : GSM2572198_MO1514PRIM.CEL.gz
Reading in : GSM2572199_MO36PRIM.CEL.gz
Reading in : GSM2572200_MO914PRIM.CEL.gz
> eset <- rma(dat)
Background correcting
Normalizing
Calculating Expression
> eset <- annotateEset(eset, hugene10sttranscriptcluster.db)
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> design <- model.matrix(~0+fact)
> colnames(design) <- gsub("fact","", colnames(design))
> contrast <- makeContrasts(CML_stem - NBM_stem, levels = design)
> fit <- eBayes(contrasts.fit(lmFit(eset, design), contrast))
> topTable(fit,1)
        PROBEID ENTREZID  SYMBOL                                     GENENAME
7938989 7938989     2620    GAS2                     growth arrest specific 2
8098060 8098060    59350   RXFP1            relaxin family peptide receptor 1
8155754 8155754   256691  MAMDC2                      MAM domain containing 2
8150881 8150881     5324   PLAG1                            PLAG1 zinc finger
8089145 8089145    25890  ABI3BP          ABI family member 3 binding protein
7916432 7916432     1718  DHCR24              24-dehydrocholesterol reductase
7906757 7906757     2212  FCGR2A                        Fc gamma receptor IIa
7971922 7971922     5101   PCDH9                              protocadherin 9
7945058 7945058    79607 FAM118B family with sequence similarity 118 member B
8160040 8160040     5789   PTPRD protein tyrosine phosphatase receptor type D
            logFC   AveExpr         t      P.Value    adj.P.Val        B
7938989  4.521992  6.020772  21.87851 2.738989e-22 9.120011e-18 38.58232
8098060  4.157618  7.306033  21.12634 8.765271e-22 1.459286e-17 37.62152
8155754  3.460398  5.787928  20.35201 3.014163e-21 3.345420e-17 36.58845
8150881 -3.144974  5.658739 -17.01883 1.004835e-18 8.364499e-15 31.56899
8089145 -3.730470  6.054714 -16.01465 6.912714e-18 4.603453e-14 29.85084
7916432  2.182689 10.122409  13.09589 3.301244e-15 1.832025e-11 24.21457
7906757  2.912017  8.074198  12.87955 5.410723e-15 2.377795e-11 23.75511
7971922 -2.461452  5.156261 -12.80534 6.418011e-15 2.377795e-11 23.59610
7945058  1.672019  8.639475  12.80473 6.427051e-15 2.377795e-11 23.59479
8160040  2.452418  6.457668  12.54838 1.165042e-14 3.879241e-11 23.03984
> nrow(topTable(fit, 1, Inf, p.value = 0.05))
[1] 4626

## they used TAC with a > 2-fold difference (which is not how to do things, btw), so let's try to get close

> nrow(subset(topTable(fit, 1, Inf, p.value = 0.05), abs(logFC) > 1))
[1] 671

## post hoc filtering like that invalidates the FDR as well as the p-values, so let's do it correctly.

> fit2 <- treat(contrasts.fit(lmFit(eset, design), contrast), 1.3)
> nrow(topTreat(fit2,1, n = Inf, p.value = 0.05))
[1] 570

## note that it's all annotated and stuff
> topTreat(fit2,1)
        PROBEID ENTREZID SYMBOL                                     GENENAME
7938989 7938989     2620   GAS2                     growth arrest specific 2
8098060 8098060    59350  RXFP1            relaxin family peptide receptor 1
8155754 8155754   256691 MAMDC2                      MAM domain containing 2
8150881 8150881     5324  PLAG1                            PLAG1 zinc finger
8089145 8089145    25890 ABI3BP          ABI family member 3 binding protein
7906757 7906757     2212 FCGR2A                        Fc gamma receptor IIa
7971922 7971922     5101  PCDH9                              protocadherin 9
7916432 7916432     1718 DHCR24              24-dehydrocholesterol reductase
8098195 8098195     6307  MSMO1                 methylsterol monooxygenase 1
8160040 8160040     5789  PTPRD protein tyrosine phosphatase receptor type D
            logFC   AveExpr         t      P.Value    adj.P.Val
7938989  4.521992  6.020772  20.04717 2.487016e-21 8.281016e-17
8098060  4.157618  7.306033  19.20299 1.018004e-20 1.694824e-16
8155754  3.460398  5.787928  18.12582 6.637187e-20 7.366614e-16
8150881 -3.144974  5.658739 -14.97054 2.839369e-17 2.363561e-13
8089145 -3.730470  6.054714 -14.38973 9.614538e-17 6.402705e-13
7906757  2.912017  8.074198  11.20543 1.498387e-13 8.315298e-10
7971922 -2.461452  5.156261 -10.83619 3.805798e-13 1.630317e-09
7916432  2.182689 10.122409  10.82486 3.917030e-13 1.630317e-09
8098195  2.840028  9.503743  10.65642 6.032698e-13 2.231897e-09
8160040  2.452418  6.457668  10.61163 6.768107e-13 2.253577e-09

You can now use volcanoplot from limma, or the EnhancedVolcano package. And you can use ComplexHeatmap to make a heatmap. I leave it up to you to read and learn how those things work.

ADD COMMENT
0
Entering edit mode

I forgot this step

fact <- factor(rep(c("CML_stem","NBM_stem", "CML_prog_t0","NBM_prog_t0","CML_prog_t48","NBM_prog_t48","CML_prog_t48_im","NBM_prog_t48_im" ),each = 5))

Which allows you to also make other comparisons if you care to, by adding them to the call to makeContrasts.

ADD REPLY
0
Entering edit mode

Thank you! This helps so much I really appreciate it. One quick question I had about the data though.

When I run your analysis with cutting out completely the files from samples I don't want, (so only reading in GSM2572161 through GSM2572170) I return with 95 DEG's. This might sound silly, but do I need to be normalizing with full datasets, then reducing my sample size to the ones I want? Not sure why, but I thought this would skew my results.

Here is the only part I change about the code besides limiting the files to samples I want. I only change the design matrix to fit the design of my 10 samples

design = model.matrix(~0+factor(c(1,1,1,1,1,2,2,2,2,2)))
design[,2] = c("cml","cml","cml","cml","cml","normal","normal","normal","normal","normal")
colnames(design)[2]="source"
design = as.data.frame(design)



groups = design$source


f = factor(groups,levels=c("normal","cml"))

design2 = model.matrix(~ 0 + f)
colnames(design2) = c("normal","cml")



contrast = makeContrasts(cml- normal,levels=design2)

Thank you for the help!

ADD REPLY
1
Entering edit mode

It's not the normalization that produces the difference, it's the degrees of freedom. By fitting a model with all the data, I have 32 degrees of freedom whereas your model only has 8. This increases power because the standard error will be smaller. As a fake example:

> set.seed(0xabeef)
> z <- data.frame(fact = factor(rep(c("CML_stem","NBM_stem", "CML_prog_t0","NBM_prog_t0","CML_prog_t48","NBM_prog_t48","CML_prog_t48_im","NBM_prog_t48_im" ),each = 5),
                  levels = c("CML_stem","NBM_stem", "CML_prog_t0","NBM_prog_t0","CML_prog_t48","NBM_prog_t48","CML_prog_t48_im","NBM_prog_t48_im" ) ),
                  val = rnorm(40))

## Full model
> summary(lm(val~fact, z))$coef
                      Estimate Std. Error    t value   Pr(>|t|)
(Intercept)          0.8467442  0.3599428  2.3524407 0.02496549
factCML_stem        -0.6509131  0.5090361 -1.2787170 0.21019300
factCML_prog_t0     -0.8002637  0.5090361 -1.5721159 0.12576019
factNBM_prog_t0     -0.5870559  0.5090361 -1.1532698 0.25734146
factCML_prog_t48    -0.4411581  0.5090361 -0.8666538 0.39258381
factNBM_prog_t48    -0.9818250  0.5090361 -1.9287927 0.06266685
factCML_prog_t48_im -0.8014596  0.5090361 -1.5744653 0.12521618
factNBM_prog_t48_im -0.7761569  0.5090361 -1.5247582 0.13714282
 ## Reduced model

> summary(lm(val~fact, z, subset = fact %in% c("CML_stem","NBM_stem")))$coef
               Estimate Std. Error    t value  Pr(>|t|)
(Intercept)   0.8467442  0.4811340  1.7598928 0.1164658
factCML_stem -0.6509131  0.6804262 -0.9566256 0.3667677

The first model uses all 40 observations, and the second model uses only the first 10.

Note that the first two rows have identical estimates (the second row being CML_stem - NBM_stem, which is the comparison you are interested in), but the standard error is smaller for the larger model, so the t-statistic is larger. And in addition, the p-value is based on 32 df for the full model and only 8 for the reduced model.

You will also gain power because of the eBayes step, which will estimate a more accurate variance prior with all 40 samples instead of 10.

ADD REPLY

Login before adding your answer.

Traffic: 381 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6